From: Moooon Unit on 4 May 2010 01:21

This is a weird one. I was browsing a newsgroup in TB 2.0.0.24 and:

1. I clicked a post header.
2. TB displayed the post and marked it read.
3. I read the post.
4. I clicked a second post header in the same thread.
5. TB displayed the post and marked it read.
6. I read the post.
7. I clicked a third post header in the same thread.
8. TB displayed the post and marked it read.
9. I read the post.
10. I clicked a fourth post header in the same thread.
11. TB hung with the throbber in the lower right in the status area pulsing continuously and nothing else happening. It did mark the post read but it continued to display the previous post and not the one I'd selected. This remained the case for several seconds; the UI remained responsive but the networking back-end was not working correctly.

Note: between items 1 and 11 I did not change anything, only a few seconds passed, and at this point I checked the health of my network connection by glancing at the tray icon for it and at my modem lights; everything looked normal. There was no logical reason for TB to treat the fourth post-header-click qualitatively differently from the first three, yet it did.

12. I hit Stop.
13. The throbber stopped.
14. I clicked the previous post.
15. I clicked the post from item 10 again.
16. Same symptoms as before: throbber stayed active, no actual new post appearing, UI responsive but networking back-end apparently lying down on the job.
17. I hit Stop again.
18. This time the throbber kept going.

Note: between items 12 and 18 I did not change anything and only a few seconds passed. There was no logical reason for TB to abort its network activity the first time I clicked Stop but not the second time.

19. I marked the post unread.
20. I closed TB using the X button in the sole open TB window's upper right corner.
21. I clicked Start, etc., etc., Thunderbird.
22. I waited roughly 30 seconds before concluding that TB had failed to restart.
23. I clicked Start, etc., etc., Thunderbird.
24. I waited roughly 30 seconds and then consulted Task Manager.
25. TM showed that a TB process had launched and was using roughly 105MB(!) of memory, though no TB UI was being displayed.
26. I watched and for roughly 30 seconds the TB process used low amounts of CPU and its memory usage fluctuated a bit, more or less consistent with a normal instance of a running (somewhat bloated) application.
27. Then it spontaneously exited: its process size shrank rapidly over the space of a few seconds and vanished from the task list. I presume that the first attempt to relaunch TB did the same thing: process started up, bloated up, hung around for roughly a minute, and then self-destructed without ever presenting a UI, though I only saw this in TM the second time.
28. I clicked Start, etc., etc., Thunderbird.
29. A TB process reappeared in TM.
30. This time, it presented UI almost immediately.
31. I browsed back to the same newsgroup and immediately clicked the header for the offending post (from items 10 and 15).
32. This time, TB displayed the post promptly, as in items 2, 5, and 8.

Note: I did not do anything differently the third time I tried to relaunch TB, and I did not do anything differently the third time I tried to view the post mentioned in items 10, 15, and 31.

Obviously an error of some kind occurred. Equally obviously, the error was not mine, given that I used TB's UI a) normally and b) in the same way each time I tried to view the misbehavior-associated news post and that I also tried to relaunch TB in the same way all three times, yet TB behaved very differently from one try to the next in both cases. The error logically must therefore lie either with my network service provider, the news server, or somewhere else in the network, or else in TB.

However, though a problem somewhere in the network could conceivably cause TB to fail to retrieve a post, such a problem could not, were TB functioning correctly, cause either of the following:

1. A failure of the Stop button in TB to function as advertised.
2. A newly-started TB process to fail to display UI and then, after a short time, self-terminate with no error message.

Therefore, a bug must exist in TB, and having already postulated a bug in TB it is not necessary to separately postulate a simultaneous bug in the network, particularly not one that appears virtually instantly with no warning or apparent cause and then vanishes just as abruptly within three minutes. Application of Occam's razor leaves one hypothesis: a bug in TB, probably in the networking code, that can cause failure of the networking code in TB to receive or correctly process incoming data, failure of the networking code in TB to respond to the Stop button, and spontaneous abort of the process during start-up without error message.

Some constraints on the precise character of this bug can be determined from its symptomology:

1. In all likelihood it lies in the networking code, where it could reasonably be expected to be able to cause all three symptoms.
2. It is nondeterministically triggered for all practical purposes, since qualitatively identical actions by the user could trigger it on some occasions and fail to on others and since attempting to retrieve a single specific news post even triggered it once and failed to at a later time. This points to a memory initialization, pointer, or array index error, or else a concurrency error such as a race condition.
3. The aborted startups add further weight to the specific hypothesis of a memory error, since that is the type of bug that can most reliably be expected to abort a process. The lack of an error message such as SIGSEGV could be a further symptom of the bug or a programmer oversight.
4. The Stop button failure adds still more weight: data to do with the networking state gets corrupted when the bug is triggered (at item 10 above); the second attempt to view the same post trips over the corrupted networking state and locks or silently aborts the networking thread. Now there's nothing to either download the post and pass it to the UI thread or respond to the Stop button, ack that it was stopped, and thus halt the throbber.
5. Unfortunately, the memory corruption also results in corruption of temporary files on disk of some sort, which then causes spontaneous aborts on startup. Perhaps two such files are corrupted, and the first failed startup gets rid of one and the second gets rid of the other, so the third attempt succeeds.

It can still also be a concurrency bug: the attempt to retrieve the post at item 10 randomly (for all practical purposes) triggers a deadlock that freezes one networking thread and some other thread, perhaps a housekeeping thread of some sort. One connection to the server is now dead but TB had opened two. The Stop button sends a message to both networking threads via a message-passing thread, which the deadlocked thread never acts on but the other does. The message-passing thread perhaps stops the throbber after signaling the unfrozen thread, then tries to signal the frozen thread and locks up waiting for a monitor that will never be released. Meanwhile the user tries to download the article again, and the networking thread tries to work with the (frozen) housekeeping thread and wedges, or the (frozen) message-passing thread is supposed to talk to the networking thread. The UI thread hands this message off asynchronously (so, does not freeze) and starts the throbber. Subsequent attempts to use the Stop button fail because the message-passing thread is frozen. Furthermore, some of the threads froze in the middle of writing files to disk.
These files are now corrupt, or else get truncated when TB is closed but (some of) the frozen threads can't write some internal buffers to disk and properly close the files. This then causes the startup problems, hypothetically, in the same manner as listed above as item number 5.

I'd look for race conditions, deadlocks, and pointer/index abuse in the codebase, starting with the networking code and working outward from there. Particularly I'd consider a) any code paths that lead to a quiet abort during startup and b) the code path triggered by a Stop button click, to find out what could cause a quiet abort during startup and what could cause the Stop button to neither work nor cause the UI to wedge, but instead simply have no effect whatsoever. That should yield some clues; for example if any deadlock capable of preventing posts from downloading also would make clicking the Stop button lock up the GUI then it's not a deadlock, and if startup has no-error-message "can't happen" aborts if files A, B, and C are corrupted (and also deletes or repairs the offending file) then places that write to files A, B, and C are implicated in the bug.

Two fixes can be suggested immediately, though:

a) Do not, under any circumstances, silently abort. Always present an error dialog on the way out, unless the exit is from the user saying to quit.

b) If corrupt temp files cause aborts, change this so that they cause automatic restarts instead. So if there's code that tries to load file A, detects failure but internal data structures will have been corrupted, deletes file A, and then exits, change it so after deleting file A it fires off a "start thunderbird.exe" to the command interpreter and *then* exits. (The *nix version can execve the TB binary to restart; the Windows version should probably not create a child process but instead trigger Windows to relaunch it and then exit the current instance. Causing a child command interpreter to execute a "start thunderbird.exe" will do this; the command interpreter can die with the parent process as the Windows "start" command apparently uses IPC to pass the command to be launched to another part of Windows, which then launches it as if it had been user-launched. Or something like that.)
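
To make fixes a) and b) a bit more concrete, here is a rough sketch in Python. It is only an illustration of the idea, not anything taken from TB's actual code: the names `CorruptFileError`, `relaunch_command`, `relaunch`, and `load_or_recover` are all made up, the loader and the error reporter are passed in as placeholders, and it uses `os.execv` (which keeps the current environment) where the text above says execve.

```python
import os
import subprocess
import sys


class CorruptFileError(Exception):
    """Raised by a loader when a file cannot be parsed (hypothetical)."""


def relaunch_command(executable, platform=sys.platform):
    """Return the argv that would start a fresh, independent instance."""
    if platform.startswith("win"):
        # "start" is a cmd.exe builtin; the empty string fills its window
        # title slot so a quoted executable path is not taken as the title.
        return ["cmd", "/c", "start", "", executable]
    return [executable]


def relaunch(executable):
    """Start a new instance and end the current one."""
    argv = relaunch_command(executable)
    if sys.platform.startswith("win"):
        subprocess.Popen(argv)  # hand the launch to the command interpreter
        sys.exit(0)             # ...and *then* exit the current instance
    os.execv(executable, argv)  # *nix: replace this process image in place


def load_or_recover(path, load, restart, report):
    """Load `path`; on corruption, report it, delete the file, and restart.

    `load`, `restart`, and `report` are caller-supplied callables, so the
    recovery policy can be exercised without really restarting anything.
    """
    try:
        return load(path)
    except CorruptFileError as err:
        # Fix a): never go down silently; say what happened first.
        report(f"{path} is corrupt ({err}); deleting it and restarting")
        # Fix b): remove the offending file, then restart instead of aborting.
        os.remove(path)
        restart()
```

In a real program `restart` would be something like `lambda: relaunch(sys.executable)` and `report` would put up the error dialog; passing them in as parameters is just a convenience so the sketch can be tried out with harmless stand-ins.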