From: Chris Ridd on 27 Mar 2006 01:52 On 26/3/06 11:05, in article 44271029.6010007(a)southminster-branch-line.org.uk, "Dave (from the UK)" <see-my-signature(a)southminster-branch-line.org.uk> wrote: > solaris10 % truss -v pollsys -p 6205 > /3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0 > /3: fd=23 ev=POLLRDNORM rev=0 > /3: timeout: 0.000001000 sec Do you know what fd 23 is associated with? I've a feeling lsof might tell you this, or maybe dtrace can, or failing all that you'll have to truss the process starting up and watch for open() calls. Cheers, Chris
From: Dan Foster on 27 Mar 2006 03:56 In article <C04D4A54.1890E9%chrisridd(a)mac.com>, Chris Ridd <chrisridd(a)mac.com> wrote: > On 26/3/06 11:05, in article > 44271029.6010007(a)southminster-branch-line.org.uk, "Dave (from the UK)" ><see-my-signature(a)southminster-branch-line.org.uk> wrote: > >> solaris10 % truss -v pollsys -p 6205 >> /3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0 >> /3: fd=23 ev=POLLRDNORM rev=0 >> /3: timeout: 0.000001000 sec > > Do you know what fd 23 is associated with? I've a feeling lsof might tell > you this, or maybe dtrace can, or failing all that you'll have to truss the > process starting up and watch for open() calls. % lsof -p 12171 | awk '$4 ~ /^23/ { print $1 " " $NF }' should do the trick. Info on where to get lsof from its author: http://people.freebsd.org/~abe/ I think SunFreeware has a binary lsof, though building from source is most preferred as it will better match your running kernel's API interface. lsof is a small package and builds quickly. Another option: % /usr/proc/bin/pfiles 12171 (This will require SUNWesu be installed, and /proc mounted.) -Dan
From: Casper H.S. Dik on 27 Mar 2006 04:13 "Dave (from the UK)" <see-my-signature(a)southminster-branch-line.org.uk> writes: >the difference between the two systems seems to be that on the Solaris 9 >box, thread 3 is calling poll(), but on the Solaris 10 box, >thread 3 calls pollsys(). I guess it would be interesting to >know what happens to thread 3 on the Blade 1000 running Solaris 10. Ok, so in S9 it calls poll with a 1ms wait time; this causes a delay of around one clock tick (really 10-20ms); so the below trace looks somewhat strange if the lwp_park tmeout is the actual time slept (perhaps you can truss with one of the time options) >662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0 >Using the -v option on truss. >solaris9 % truss -v lwp_park -v poll -p 662 >662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME >662/1: timeout: 0.019998000 sec >662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0 >662/3: fd=23 ev=POLLRDNORM rev=0 >662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0 >662/3: fd=23 ev=POLLRDNORM rev=0 >662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0 >662/3: fd=23 ev=POLLRDNORM rev=0 >662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0 >662/3: fd=23 ev=POLLRDNORM rev=0 >I see a timeout of 20ms on lwp_park(), but nothing for poll() - I am >rather lost as to what is happening here. That's because truss doesn't show timeouts (it's the 3rd argument and it's in ms) >Whereas on the Solaris 10 box the CPU usage of MathKernel is excessive >(25%, which is basically one CPU flat out) >solaris10 % truss -v pollsys -p 6205 >/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0 >/3: fd=23 ev=POLLRDNORM rev=0 >/3: timeout: 0.000001000 sec Hm, this timeout value looks suspicious. They might be calling select() with the smallest possible timeout (1us) which would map to 1ms poll in S9 but 1us pollsys in S10. Inspecting the code, it seems that pollsys() uses a cv_timedwait() which uses the real time to compute when it needs to wakeup. Because the *slow* systems can't manage poll setup in under 1us, they will hit cv_timedwait when we're past the 1us mark and the function will return immediately. Fast systems, OTOH, will be able to make the 1us deadline and cv_timedwait will sleep until the next clocktick. If this hypothesis is correct, then perhaps the following preload will work (preload w/ LD_PRELOAD=...); different options for 64 bit executable. /* * Select roundup preload. (casper.dik(a)you.know.where) * cc -G -Kpic select_preload.c -o select_preload.so * */ #include <dlfcn.h> #include <sys/time.h> #define FUN_PROTO(type,internal,parms) \ type internal parms #define DECLARE(type,name, parms) static FUN_PROTO(type,(*name), parms) #define CAST(type, parms) (FUN_PROTO(type,(*), parms)) DECLARE(int,next_select,(int, fd_set *, fd_set *, fd_set *, struct timeval *)); #ifdef __GNUC__ void loadit(void) __attribute__ ((constructor)); #else #pragma init(loadit) #endif void loadit(void) { extern char **environ; char **env; int offset; next_select = CAST(int, (int, fd_set *, fd_set *, fd_set *, struct timeval * ))dlsym(RTLD_NEXT, "select"); } int select(int nfds, fd_set *restrict readfds, fd_set *restrict writefds, fd_set *restrict errorfds, struct timeval *restrict timeout) { if (timeout != NULL && timeout->tv_sec == 0 && timeout->tv_usec > 0 && timeout->tv_usec < 1000) timeout->tv_usec = 1000; return (next_select(nfds, readfds, writefds, errorfds, timeout)); } Casper
From: Dave (from the UK) on 27 Mar 2006 04:42 Chris Ridd wrote: > On 26/3/06 11:05, in article > 44271029.6010007(a)southminster-branch-line.org.uk, "Dave (from the UK)" > <see-my-signature(a)southminster-branch-line.org.uk> wrote: > > >>solaris10 % truss -v pollsys -p 6205 >>/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0 >>/3: fd=23 ev=POLLRDNORM rev=0 >>/3: timeout: 0.000001000 sec > > > Do you know what fd 23 is associated with? I certainly did not, but I might have found out what it is - see below. > I've a feeling lsof might tell First, I should explain something. The program causing the problem, which is Mathematica. http://www.wolfram.com/products/mathematica/index.html consists of 4 important parts. 1) A *graphical* front end which is called by the user as "mathematica" but that is in fact just a shell script which calls a 32-bit binary "Mathematica". I'm told this is 32-bit, as there is no advantage in having a 64-bit front end. 2) A command line driven front end. 3) MathKernel, which is a 64-bit binary. This does all the calculations. Since Mathematica has rather a liking for memory, the kernel is 64-bit. 4) A link called 'mathlink' http://www.wolfram.com/solutions/mathlink/ whcih allows a front end to communicate with the kernel. Normally, I start the GUI front end and let it communicate with the computational kernel over 'mathlink'. But one can write ones own front end if one wishes. What is interesting is that if the GUI is used, the MathKernel uses loads of CPU time. But if the command line interface is used, the MathKernal does not use excessive CPU time. So changing the front end causes the kenel to either behave itself or not as the case may be. The front end and kernel do not have to be run on the same machine, but I have tried running them on differnet machines, but the problem still exists. Here is the output from prstat an hour or so after mathemtica was asked to compute 1+1. As you see, it has used over an hour of CPU time and is CPU bound on the quad processor machine. sparrow /export/home/drkirkby % uname -a SunOS sparrow 5.10 Generic_118833-02 sun4u sparc SUNW,Ultra-80 PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 8637 drkirkby 63M 17M cpu3 10 0 1:05:39 25% MathKernel/3 6705 drkirkby 236M 168M sleep 49 0 0:38:25 1.5% mozilla-bin/3 6453 drkirkby 180M 78M cpu0 52 0 0:16:31 0.8% Xsun/1 6611 drkirkby 80M 25M sleep 59 0 0:03:38 0.4% metacity/1 8305 drkirkby 205M 105M sleep 49 0 0:04:13 0.4% acroread/1 6688 drkirkby 233M 65M sleep 49 0 0:08:24 0.3% java/21 6758 drkirkby 99M 33M cpu3 49 0 0:01:47 0.2% gnome-terminal/2 13499 drkirkby 4824K 4456K cpu2 57 0 0:00:00 0.1% prstat/1 6630 drkirkby 73M 16M sleep 49 0 0:01:10 0.1% wnck-applet/1 6634 drkirkby 71M 13M sleep 49 0 0:01:04 0.0% gnome-netstatus/1 6609 drkirkby 10M 5096K sleep 59 0 0:00:10 0.0% gnome-smproxy/1 6636 drkirkby 71M 12M sleep 49 0 0:00:20 0.0% mixer_applet2/1 6615 drkirkby 75M 17M sleep 59 0 0:00:14 0.0% gnome-panel/1 190 root 2352K 1160K sleep 100 - 0:03:28 0.0% xntpd/1 13457 drkirkby 3064K 2512K sleep 57 0 0:00:00 0.0% tcsh/1 8607 drkirkby 15M 12M sleep 49 0 0:00:06 0.0% Mathematica/2 Just to check, we again see fd=23 in use. sparrow /export/home/drkirkby % truss -v pollsys -p 8637 /3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0 /3: fd=23 ev=POLLRDNORM rev=0 /3: timeout: 0.000001000 sec /1: lwp_park(0xFFFFFFFF7FFFE020, 0) Err#62 ETIME /3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0 /3: fd=23 ev=POLLRDNORM rev=0 /3: timeout: 0.000001000 sec /3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0 /3: fd=23 ev=POLLRDNORM rev=0 /3: timeout: 0.000001000 sec Looking with lsof (is this the sort of thing you were thinking of?) sparrow /export/home/drkirkby % lsof -d 23 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME inetd 433 root 23u VCHR 105,110 55050244 /devices/pseudo/tl@0:ticots->timod->tl fmd 570 root 23r VREG 32,30 13829 23832 /usr/lib/fm/dict/SUN4U.dict smbd 776 root 23u FIFO 0x60002699e60 0t0 406 (fifofs) PIPE->0x60002699ef0 Xsun 6453 drkirkby 23u unix 105,8584 0t2438596 55050244 /devices/pseudo/tl@0:ticots->/tmp/.X11-unix/X0 (0x60002db6678) gnome-ses 6576 drkirkby 23u unix 105,8581 0t11380 55050244 /devices/pseudo/tl@0:ticots->/tmp/.X11-unix/X0 (0x600032b3840->0x60009b8b9c0) gconfd-2 6585 drkirkby 23u unix 105,8606 0t49294 55050244 /devices/pseudo/tl@0:ticots->/var/tmp/orbit-drkirkby/linc-19b9-0-44276f3ccec37 (0x60002db64e0) bonobo-ac 6591 drkirkby 23u unix 105,8552 0t0 55050244 /devices/pseudo/tl@0:ticots->/var/tmp/orbit-drkirkby/linc-19b0-0-44270d1feae4c (0x60002db7998->0x600066e2ac0) gnome-set 6594 drkirkby 23u FIFO 0x60003126aa0 0t0 54056 (fifofs) PIPE->0x60003126b30 gnome-pan 6615 drkirkby 23u unix 105,8697 0t27416 55050244 /devices/pseudo/tl@0:ticots->/var/tmp/orbit-drkirkby/linc-19e6-0-44277c1c4e44 (0x6000314f368->0x60008c16f40) nautilus 6617 drkirkby 23u FIFO 0x60003127c20 0t0 54104 (fifofs) PIPE->0x60003127cb0 gnome-vfs 6622 drkirkby 23u FIFO 0x60003127a30 0t0 54106 (fifofs) PIPE->0x600031279a0 clock-app 6628 drkirkby 23u FIFO 0x60003127ae0 0t0 54117 (fifofs) PIPE->0x60003127b70 wnck-appl 6630 drkirkby 23u FIFO 0x60003da47a0 0t0 54122 (fifofs) PIPE->0x60003da4830 notificat 6632 drkirkby 23u FIFO 0x60003126320 0t0 54127 (fifofs) PIPE->0x600031263b0 gnome-net 6634 drkirkby 23u FIFO 0x60002f5b200 0t0 54132 (fifofs) PIPE->0x60002f5b290 mixer_app 6636 drkirkby 23u FIFO 0x60003da52e0 0t0 54137 (fifofs) PIPE->0x60003da5370 mozilla-b 6705 drkirkby 23wW VREG 32,7 0 460098 /export/home (/dev/dsk/c0t1d0s7) gnome-ter 6758 drkirkby 23u FIFO 0x600026984b0 0t0 54281 (fifofs) PIPE->0x60002698420 acroread 8305 drkirkby 23r VREG 32,30 644551 7092 /./usr/bin/../lib/AdobeReader/Reader/sparcsolaris/plug_ins/SendMail.api Mathemati 8607 drkirkby 23u FIFO 0x600026991e0 0t2 55844 (fifofs) PIPE->0x60002699270 MathKerne 8637 drkirkby 23u FIFO 0x60003da5060 0t2 55843 (fifofs) PIPE->0x60003da50f0 I assume there are multiple programs all with fd=23 in use. Most of them are irrelevant I guess. The last two lines of lsof output refer to processes Mathematica and MathKernal and what seems to be a pipe. This makes sense for the reasons I explained about the mathlink protocol connecting them. I'm a bit surprised there is nothing in the lsof output to indicate that these are connected together. The DEVICE and NAME of the last two lines does not agree. Should they? Perhaps there have lost communication, although that seems unlikely, as the front end can still get calculations done. This is making me think. Could the problem be that the GUI and front end have a timeout which is too short on the Ultra 80, but is sufficiently long on the faster Blade 1000? That could explain the fact it works on the faster machine, but not on the slower one. -- Dave K MCSE. MCSE = Minefield Consultant and Solitaire Expert. Please note my email address changes periodically to avoid spam. It is always of the form: month-year(a)domain. Hitting reply will work for a couple of months only. Later set it manually.
From: Dave (from the UK) on 27 Mar 2006 07:33 Casper H.S. Dik wrote: >>solaris10 % truss -v pollsys -p 6205 >>/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0 >>/3: fd=23 ev=POLLRDNORM rev=0 >>/3: timeout: 0.000001000 sec > > > Hm, this timeout value looks suspicious. > They might be calling select() with the smallest possible timeout (1us) > which would map to 1ms poll in S9 but 1us pollsys in S10. Ah, that makes some sense. > Inspecting the code, it seems that pollsys() uses a cv_timedwait() which > uses the real time to compute when it needs to wakeup. Cheers. > Because the *slow* systems can't manage poll setup in under 1us, > they will hit cv_timedwait when we're past the 1us mark and the > function will return immediately. Fast systems, OTOH, will be able > to make the 1us deadline and cv_timedwait will sleep until the next > clocktick. > > If this hypothesis is correct, then perhaps the following preload > will work (preload w/ LD_PRELOAD=...); different options for 64 bit > executable. > > /* > * Select roundup preload. (casper.dik(a)you.know.where) > * cc -G -Kpic select_preload.c -o select_preload.so > * > */ <snip> Thanks for that. I have compiled it 64 bit, as the Mathematica kernel is 64-bit. % cc -xtarget=ultra -xarch=v9 -G -Kpic select_preload.c -o select_preload.so su # mv select_preload.so /usr/local/lib/ The 64-bit kernal gets started by the front end. The front end calls a script which starts the kernel. The script ends in the line: exec "${MathKernel}" "$@" I'm not sure how I should alter that line, but I tried: LD_PRELOAD=/usr/local/lib/select_preload.so exec "${MathKernel}" "$@" That did not effect the CPU usage. I guess there might be some point in starting the kernal with truss and logging the data? -- Dave K MCSE. MCSE = Minefield Consultant and Solitaire Expert. Please note my email address changes periodically to avoid spam. It is always of the form: month-year(a)domain. Hitting reply will work for a couple of months only. Later set it manually.
First
|
Prev
|
Next
|
Last
Pages: 1 2 3 4 5 Prev: Can't roll the log for .... Next: usb stick volume label |