Prev: Data Entry Jobs ,Part time Jobs, Online Jobs
Next: avoid bouncing of rejected messages [secondary MX, spam filters on primaries]
From: John Marshall on 20 Feb 2010 07:12 Environment: Sendmail 8.14.4 on FreeBSD 8.0 I posted here earlier this month asking for troubleshooting tips for this problem. The problem is VERY intermittent and has only just recurred (on a different system). This time I was armed with clues on how to get gdb to attach to the stuck process and produce a backtrace. I got a backtrace from the stuck queue runner processes and thought I'd get a backtrace from a happy queue runner as well. They were identical! When I exited gdb I noticed that the 'stuck' queue runner had woken up and flushed its queues. I'm thinking that this smells like the operating system. PROCESS STATES BEFORE DEBUG (stuck runners are in 'I' state) PID TT STAT TIME COMMAND 80298 ?? Ss 0:17.68 sendmail: accepting connections (sendmail) 80299 ?? I 0:46.62 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail) 80300 ?? I 0:08.83 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail) 80301 ?? S 0:31.58 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail) 80302 ?? S 0:30.71 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail) 80303 ?? S 0:33.29 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail) 80304 ?? S 0:30.55 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail) BACKTRACE OF STUCK PROCESS 80299 (gdb) bt #0 0x28346547 in sigsuspend () from /lib/libc.so.7 #1 0x28344e98 in sigpause () from /lib/libc.so.7 #2 0x2833be3e in pause () from /lib/libc.so.7 #3 0x080cc7c8 in sleep () #4 0x08099c51 in run_work_group () #5 0x08099ebf in runqueue () #6 0x0805538d in main () BACKTRACE OF HAPPY PROCESS 80301 (gdb) bt #0 0x28346547 in sigsuspend () from /lib/libc.so.7 #1 0x28344e98 in sigpause () from /lib/libc.so.7 #2 0x2833be3e in pause () from /lib/libc.so.7 #3 0x080cc7c8 in sleep () #4 0x08099c51 in run_work_group () #5 0x08099ebf in runqueue () #6 0x0805538d in main () PROCESS STATES AFTER DEBUG PID TT STAT TIME COMMAND 80298 ?? Ss 0:17.69 sendmail: accepting connections (sendmail) 80299 ?? S 0:46.66 sendmail: running queue: /var/spool/mqueue/qd1/df (sendmail) 80300 ?? S 0:08.85 sendmail: running queue: /var/spool/mqueue/mby/df (sendmail) 80301 ?? S 0:31.60 sendmail: running queue: /var/spool/mqueue/oz/df (sendmail) 80302 ?? S 0:30.73 sendmail: running queue: /var/spool/mqueue/rw2/df (sendmail) 80303 ?? S 0:33.32 sendmail: running queue: /var/spool/mqueue/hold/df (sendmail) 80304 ?? S 0:30.58 sendmail: running queue: /var/spool/mqueue/pgp/df (sendmail) Obviously the debugger woke up the soundly-sleeping processes. -- John Marshall
From: Claus Aßmann on 21 Feb 2010 19:45 John Marshall wrote: > I got a backtrace from the stuck queue runner processes and thought I'd > get a backtrace from a happy queue runner as well. They were identical! > When I exited gdb I noticed that the 'stuck' queue runner had woken up > and flushed its queues. I'm thinking that this smells like the > operating system. > BACKTRACE OF STUCK PROCESS 80299 > #0 0x28346547 in sigsuspend () from /lib/libc.so.7 > #1 0x28344e98 in sigpause () from /lib/libc.so.7 > #2 0x2833be3e in pause () from /lib/libc.so.7 > #3 0x080cc7c8 in sleep () > #4 0x08099c51 in run_work_group () sleep() should be one of these calls: if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME) sleep(MIN_SLEEP_TIME); else if (WorkGrp[wgrp].wg_lowqintvl <= 0) sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME); else sleep(WorkGrp[wgrp].wg_lowqintvl); Unless you have a really large value for one of these, the process should continue after a while. Note: on some systems sleep() has a problem when used in a multi-threaded program. sendmail should not be compiled with pthreads (only libmilter does). You could try the workaround from libmilter/listener.c: look for MI_SLEEP().
From: John Marshall on 22 Feb 2010 21:41 On 22 Feb 2010 00:45:22 GMT, Claus A?mann wrote: > John Marshall wrote: >> I got a backtrace from the stuck queue runner processes and thought I'd >> get a backtrace from a happy queue runner as well. They were identical! >> When I exited gdb I noticed that the 'stuck' queue runner had woken up >> and flushed its queues. I'm thinking that this smells like the >> operating system. > > sleep() should be one of these calls: > > if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME) > sleep(MIN_SLEEP_TIME); > else if (WorkGrp[wgrp].wg_lowqintvl <= 0) > sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME); > else > sleep(WorkGrp[wgrp].wg_lowqintvl); > > Unless you have a really large value for one of these, the process > should continue after a while. I'm not overriding any of the sendmail defaults, so this should always be 5s (from MIN_SLEEP_TIME) or 1s (from QueueIntvl). I've written this up and sent it to the FreeBSD-STABLE mailing list: <http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055368.html> -- John Marshall
From: John Marshall on 25 Feb 2010 05:04 On Tue, 23 Feb 2010 02:41:07 +0000 (UTC), John Marshall wrote: > On 22 Feb 2010 00:45:22 GMT, Claus A?mann wrote: >> sleep() should be one of these calls: -- snip -- >> Unless you have a really large value for one of these, the process >> should continue after a while. > > I'm not overriding any of the sendmail defaults, so this should always > be 5s (from MIN_SLEEP_TIME) or 1s (from QueueIntvl). SLEEPING (HAPPY) QUEUE RUNNER - 8.14.4 Build on FreeBSD 8.0 (gdb) bt #0 0x2833f6eb in sigsuspend () from /lib/libc.so.7 #1 0x2833e050 in sigpause () from /lib/libc.so.7 #2 0x28334fae in pause () from /lib/libc.so.7 #3 0x080cc975 in sleep () #4 0x08099d96 in run_work_group () #5 0x0809a016 in runqueue () #6 0x08055379 in main () SLEEPING (HAPPY) QUEUE RUNNER - 8.14.4 Hacked Build on FreeBSD 8.0 (gdb) bt #0 0x283c43f7 in nanosleep () from /lib/libc.so.7 #1 0x080cbf7a in sleep () #2 0x08099c81 in run_work_group () #3 0x08099eef in runqueue () #4 0x080553bd in main () Is there any reason why we shouldn't be using FreeBSD's nanosleep(2) in sendmail's sleep() in libsm/clock.c? Sendmail's sleep() will use nanosleep if HAVE_NANOSLEEP is defined. The only place I found that defined is in include/sm/conf.h, and only for SOLARIS. I defined HAVE_NANOSLEEP for FreeBSD in include/sm/conf.h and re-built to produce the above. Is there any reason systems other than SOLARIS shouldn't use their nanosleep? -- John Marshall
From: Claus Aßmann on 25 Feb 2010 23:40
John Marshall wrote: > Is there any reason why we shouldn't be using FreeBSD's nanosleep(2) in > sendmail's sleep() in libsm/clock.c? I don't know any. It's probably just because nobody ever told sendmail.org about it (unfortunately sendmail 8 uses a "hand coded" feature definition file and the developers have lost access to most OSs and get very little feedback -- except for the Sun version where someone at Sun always provides us with the latest updates.) If you can tell us which FreeBSD versions have a reliable nanosleep() implementation, then we can add it to conf.h (patches welcome :-) |