Mysterious TB 2.0.0.24 hang and failure to start up [Mozilla]

From: Moooon Unit on 4 May 2010 01:21

This is a weird one. I was browsing a newsgroup in TB 2.0.0.24 and:

1. I clicked a post header.
2. TB displayed the post and marked it read.
3. I read the post.
4. I clicked a second post header in the same thread.
5. TB displayed the post and marked it read.
6. I read the post.
7. I clicked a third post header in the same thread.
8. TB displayed the post and marked it read.
9. I read the post.
10. I clicked a fourth post header in the same thread.
11. TB stalled: the throbber in the lower right of the status area
pulsed continuously and nothing else happened. It did mark the
post read, but it kept displaying the previous post rather than
the one I'd selected. This went on for several seconds; the UI
stayed responsive, but the networking back-end evidently was not
working correctly.

Note: between items 1 and 11 I did not change anything, and only a few
seconds passed. At this point I checked the health of my network
connection by glancing at its tray icon and at my modem lights;
everything looked normal. There was no logical reason for TB to treat
the fourth post-header click qualitatively differently from the first
three, yet it did.

12. I hit Stop.
13. The throbber stopped.
14. I clicked the previous post.
15. I clicked the post from item 10 again.
16. Same symptoms as before: throbber stayed active, no actual
new post appearing, UI responsive but networking back-end
apparently lying down on the job.
17. I hit Stop again.
18. This time the throbber kept going.

Note: between items 12 and 18 I did not change anything and only a few
seconds passed. There was no logical reason for TB to abort its network
activity the first time I clicked Stop but not the second time.

19. I marked the post unread.
20. I closed TB using the X button in the sole open TB window's upper
right corner.
21. I clicked Start, etc., etc., Thunderbird.
22. I waited roughly 30 seconds before concluding that TB had failed to
restart.
23. I clicked Start, etc., etc., Thunderbird.
24. I waited roughly 30 seconds and then consulted Task Manager.
25. TM showed that a TB process had launched and was using roughly
105MB(!) of memory, though no TB UI was being displayed.
26. I watched and for roughly 30 seconds the TB process used low
amounts of CPU and its memory usage fluctuated a bit, more or
less consistent with a normal instance of a running (somewhat
bloated) application.
27. Then it spontaneously exited: its process size shrank rapidly over
the space of a few seconds and vanished from the task list.

I presume that the first attempt to relaunch TB did the same thing:
process started up, bloated up, hung around for roughly a minute, and
then self-destructed without ever presenting a UI, though I only saw
this in TM the second time.

28. I clicked Start, etc., etc., Thunderbird.
29. A TB process reappeared in TM.
30. This time, it presented UI almost immediately.
31. I browsed back to the same newsgroup and immediately clicked the
header for the offending post (from items 10 and 15).
32. This time, TB displayed the post promptly, as in items 2, 5, and 8.

Note: I did not do anything differently the third time I tried to
relaunch TB, and I did not do anything differently the third time I
tried to view the post mentioned in items 10, 15, and 31.

Obviously an error of some kind occurred. Equally obviously, the error
was not mine, given that I used TB's UI a) normally and b) in the same
way each time I tried to view the misbehavior-associated news post and
that I also tried to relaunch TB in the same way all three times, yet TB
behaved very differently from one try to the next in both cases.

The error must therefore lie either somewhere in the network (my
network service provider, the news server, or something in between) or
in TB.

However, though a problem somewhere in the network could conceivably
cause TB to fail to retrieve a post, such a problem could not, were TB
functioning correctly, cause either of the following:

1. A failure of the Stop button in TB to function as advertised.
2. A newly-started TB process to fail to display UI and then, after a
short time, self-terminate with no error message.

Therefore, a bug must exist in TB, and having already postulated a bug
in TB it is not necessary to separately postulate a simultaneous bug in
the network, particularly not one that appears virtually instantly with
no warning or apparent cause and then vanishes just as abruptly within
three minutes. Application of Occam's razor leaves one hypothesis: a bug
in TB, probably in the networking code, that can cause failure of the
networking code in TB to receive or correctly process incoming data,
failure of the networking code in TB to respond to the Stop button, and
spontaneous abort of the process during start-up without error message.

Some constraints on the precise character of this bug can be determined
from its symptomology:

1. In all likelihood it lies in the networking code, where it could
reasonably be expected to be able to cause all three symptoms.
2. It is nondeterministically triggered for all practical purposes,
since qualitatively identical actions by the user could trigger it on
some occasions and fail to on others and since attempting to retrieve
a single specific news post even triggered it once and failed to at
a later time.
This points to a memory initialization, pointer, or array index
error, or else a concurrency error such as a race condition.
3. The aborted startups add further weight to the specific hypothesis of
a memory error, since that is the type of bug that can most reliably
be expected to abort a process. The lack of an error message (an
access violation report on Windows, the counterpart of SIGSEGV on
*nix) could be a further symptom of the bug or a programmer
oversight.
4. The Stop button failure adds still more weight: data to do with the
networking state gets corrupted when the bug is triggered (at
item 10 above); the second attempt to view the same post trips over
the corrupted networking state and locks or silently aborts the
networking thread. Now there's nothing left to either download the
post and pass it to the UI thread or respond to the Stop button, ack
that it was stopped, and thus halt the throbber.
5. Unfortunately, the in-memory corruption also spills into temporary
files of some sort on disk, which then cause the spontaneous aborts
on startup. Perhaps two such files are corrupted, and the first
failed startup gets rid of one and the second gets rid of the other,
so the third attempt succeeds.
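
To make item 4 concrete, here is a minimal sketch in Python (not TB's
actual code or language) of a Stop button that depends on an ack from
the networking thread. The names network_worker, click_stop, and the
"corrupt" command are all hypothetical; the silent thread exit merely
stands in for whatever the real bug does.

```python
import queue
import threading


def network_worker(commands, acks):
    """Hypothetical networking thread: acks 'stop', dies silently on 'corrupt'."""
    while True:
        command = commands.get()
        if command == "stop":
            acks.put("stopped")   # the ack the UI needs to halt the throbber
        elif command == "corrupt":
            return                # stands in for a silent abort of the thread


def click_stop(commands, acks, wait=0.5):
    """Post a stop request; return True if the throbber is still running."""
    commands.put("stop")
    try:
        acks.get(timeout=wait)
        return False              # ack received, throbber halts
    except queue.Empty:
        return True               # nobody answered, throbber keeps going


def demo():
    commands, acks = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=network_worker, args=(commands, acks), daemon=True
    )
    worker.start()
    first = click_stop(commands, acks)    # worker alive: Stop works
    commands.put("corrupt")               # simulate the thread dying quietly
    worker.join()
    second = click_stop(commands, acks)   # worker gone: Stop does nothing
    return first, second
```

Under these assumptions the first Stop click halts the throbber and the
second one does not, while the caller (the "UI") never blocks for longer
than the timeout, which is the pattern seen in items 13 and 18.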

It could still be a concurrency bug instead: the attempt to retrieve the post
at item 10 randomly (for all practical purposes) triggers a deadlock
that freezes one networking thread and some other thread, perhaps a
housekeeping thread of some sort. One connection to the server is now
dead but TB had opened two. The Stop button sends a message to both
threads via a message-passing thread, which the deadlocked thread never
acts on but the other does. The message-passing thread perhaps stops the
throbber after signaling the unfrozen thread, then tries to signal the
frozen thread and locks up waiting for a monitor that will never be
released. Meanwhile the user tries to download the article again, and
the networking thread tries to work with the (frozen) housekeeping
thread and wedges, or the (frozen) message-passing thread is supposed to
talk to the networking thread. The UI thread hands this message off
asynchronously (so, does not freeze) and starts the throbber. Subsequent
attempts to use the Stop button fail because the message-passing thread
is frozen. Furthermore, some of the threads froze in the middle of
writing files to disk. These files are now corrupt, or else get
truncated when TB is closed but (some of) the frozen threads can't write
some internal buffers to disk and properly close the files. This then
causes the startup problems, hypothetically, in the same manner as
listed above as item number 5.
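
The simplest form of such a deadlock is two threads taking the same two
locks in opposite order. The sketch below is Python with made-up names
(lock_a, lock_b, "network", "housekeeping"), not anything from TB's
source, and it uses a timeout on the second acquire purely so that the
demonstration terminates; a real deadlock has no timeout and both
threads would wait forever.

```python
import threading


def demo_lock_order_deadlock(timeout=0.2):
    """Two threads take two locks in opposite order; neither gets the second."""
    lock_a = threading.Lock()   # hypothetical: guards connection state
    lock_b = threading.Lock()   # hypothetical: guards housekeeping state
    holding_first = threading.Barrier(2)
    gave_up = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            # Wait until both threads hold their first lock.
            holding_first.wait()
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            if acquired:
                second.release()
            # Keep the first lock held until the other thread has also
            # finished its attempt, so the outcome does not depend on timing.
            gave_up.wait()

    threads = [
        threading.Thread(target=worker, args=("network", lock_a, lock_b)),
        threading.Thread(target=worker, args=("housekeeping", lock_b, lock_a)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return got_second
```

Each thread reports that it never obtained its second lock. Anything
else that later needs either lock (a message-passing thread, say) would
wedge in the same way, while a UI thread that only hands work off
asynchronously would carry on as if nothing were wrong.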

I'd look for race conditions, deadlocks, and pointer/index abuse in the
codebase, starting with the networking code and working outward from
there. In particular I'd examine a) any code paths that lead to a quiet
abort during startup and b) the code path triggered by a Stop button
click, to find out what could make the Stop button neither work nor
wedge the UI, but instead simply have no effect whatsoever. That should
yield some clues. For example, if any deadlock capable of preventing
posts from downloading would also make clicking the Stop button lock up
the GUI, then it's not a deadlock. And if startup performs "can't
happen" aborts with no error message when files A, B, and C are
corrupted (and also deletes or repairs the offending file), then the
places that write to files A, B, and C are implicated in the bug.

Two fixes can be suggested immediately, though:
a) Do not, under any circumstances, silently abort. Always present an
error dialog on the way out, unless the exit is from the user saying
to quit.
b) If corrupt temp files cause aborts, change this so that they cause
automatic restarts instead. So if there's code that tries to load
file A, detects failure but internal data structures will have been
corrupted, deletes file A, and then exits, change it so after
deleting file A it fires off a "start thunderbird.exe" to the command
interpreter and *then* exits. (The *nix version can execve the TB
binary to restart; the Windows version should probably not create a
child process but instead trigger Windows to relaunch it and then
exit the current instance. Causing a child command interpreter to
execute a "start thunderbird.exe" will do this; the command
interpreter can die with the parent process as the Windows "start"
command apparently uses IPC to pass the command to be launched to
another part of Windows, which then launches it as if it had been
user-launched. Or something like that.)
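
A rough sketch of both fixes together, in Python rather than TB's own
language. The names load_state, relaunch_self, parse, and report are
hypothetical. On *nix it re-executes the program with os.execv; on
Windows it spawns a detached process with subprocess.Popen, which is a
substitution for the "start thunderbird.exe" route described above, not
the same mechanism.

```python
import os
import subprocess
import sys


def relaunch_self():
    """Start a fresh copy of this program (hypothetical helper)."""
    argv = [sys.executable] + sys.argv
    if os.name == "nt":
        # Windows: spawn a detached process that outlives the current one.
        subprocess.Popen(
            argv, creationflags=subprocess.DETACHED_PROCESS, close_fds=True
        )
    else:
        # *nix: replace the current process image; this call does not return.
        os.execv(sys.executable, argv)


def load_state(path, parse, report=print, relaunch=relaunch_self):
    """Load a state file; on corruption, say so, delete it, and restart."""
    try:
        with open(path, "rb") as handle:
            return parse(handle.read())
    except ValueError as exc:
        # Fix a): never exit silently, always tell the user why.
        report(f"{path} is corrupt ({exc}); deleting it and restarting")
        os.remove(path)
        # Fix b): restart instead of simply aborting.
        relaunch()
        raise SystemExit(1)
```

The report and relaunch callables are parameters so that a GUI program
could pass in an error dialog and its own restart logic; this assumes
the parser signals corruption by raising ValueError.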
