From: Casper H.S. Dik on
"Dave (from the UK)" <see-my-signature(a)southminster-branch-line.org.uk> writes:

>Casper H.S. Dik wrote:
>> "dave (from the uk)" <sorry_no_known_address(a)hotmail.com> writes:
>>
>>
>>>I thought some may like to know I have raised this with Wolfram Research
>>>who are investigating it. I've not received confirmation that they have
>>>verified it. I'd hope they contact Sun about
>>>this, as some recent Solaris changes (either Solaris 10 and/or patches)
>>>have caused this to appear, although that does not mean it is a Solaris
>>>bug - it might well be a Mathematica bug.
>>
>>
>> You don't have this problem on older versions of S10?
>>
>> Have you tried this with Solaris 10 (unpatched) vs Solaris 10 (patched)?
>>
>> Casper

>Casper,

>I have not tried this with an unpatched Solaris 10 system, and I don't
>know anyone that has. All cases I know of are on patched Solaris 10
>(3/05 or 1/06) SPARC systems.

>I had resisted the temptation to set up a system for testing this, but
>your post has inspired me (sad person I am!!), so I will do a fresh
>install and test it without any patches.

>But I noticed this bug before 19th Jan 06. Given I have no Sun contract,
>this probably means there were no patches on my system when I first
>found the problem. Or would there have been publicly available patches
>for Solaris 10 update 1 on the 19th Jan 2006??

I'm interested in what happens on S10 03/05 (01/06 is a "patchy"
release; it has tons of patches applied in the factory).

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
From: Casper H.S. Dik on
"Dave (from the UK)" <see-my-signature(a)southminster-branch-line.org.uk> writes:

>1) Ultra 60, fresh install of Solaris 9 (first release, whenever that
>was). No problems - Mathematica 5.2 works fine.

>2) Ultra 60, fresh install of Solaris 10 3/05, with no patches. Problem
>present.

Ok, so this is an issue introduced in S10.

>I don't personally have access to a more modern machine with Solaris 10
>on it. The machines either have Solaris 10 and are pretty old, or are
>new and have Solaris 9.

I have no idea how this could be a CPU specific issue.

>So it seems you need all three of these to have a problem.

>1) Mathematica 5.1 or 5.2. Both 4.0 and 4.2 are unaffected. I can't say
>for Mathematica 5, as I don't have it installed.

>2) Solaris 10 (either release, with or without patches)

>3) UltraSPARC II or IIe CPU. Given there is only one Solaris 10 machine
>that does not suffer the problem, it may be unwise to say only
>UltraSPARC II and IIe are affected.

>Any thoughts on how one might be able to get a workaround? Using truss
>on the CPU bound process (MathKernel), I found this a few weeks ago.

>http://groups.google.co.uk/group/comp.unix.solaris/browse_frm/thread/3f751c580fffde23/35233d67d579039f?lnk=raot&hl=en#35233d67d579039f

It seems to be looping in poll? Does prstat confirm that /3 is the
looping thread?

Casper
From: Dave (from the UK) on
Casper H.S. Dik wrote:

>>I don't personally have access to a more modern machine with Solaris 10
>>on it. The machines either have Solaris 10 and are pretty old, or are
>>new and have Solaris 9.

I thought it a bit odd.

Could Solaris be making use of instructions that are present in the
Blade 1000's processor for maximum speed, but emulating them when the
CPU does not have them?

Perhaps bits of Solaris are compiled with -xarch=v8plusb to generate
code for the UltraSPARC III processor?

>>Any thoughts on how one might be able to get a workaround? Using truss
>>on the CPU bound process (MathKernel), I found this a few weeks ago.
>
>
>>http://groups.google.co.uk/group/comp.unix.solaris/browse_frm/thread/3f751c580fffde23/35233d67d579039f?lnk=raot&hl=en#35233d67d579039f
>
>
> It seems to be looping in poll? Does prstat confirm that /3 is the
> looping thread?
>
> Casper

prstat does indeed indicate that. Here it is on my Ultra 80 after
computing 1+1=2. This has 4 CPUs, so the 25% basically means it's CPU
bound.

sparrow /export/home/drkirkby % prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3743 drkirkby   63M   18M cpu3    10    0   0:02:45  25% MathKernel/3
  2418 drkirkby  169M  102M sleep   14    0   0:09:20 2.8% mozilla-bin/3
  2276 drkirkby   72M   17M sleep   41    0   0:00:55 1.2% metacity/1
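
The 25% figure follows from simple arithmetic, assuming prstat reports
CPU as a share of all processors combined: one thread that never sleeps
on a 4-CPU machine can use at most a quarter of the total. The variable
names below are only illustrative.

```python
# Assumption: the CPU column is a percentage of all CPUs combined.
ncpus = 4          # the Ultra 80 in this post has 4 CPUs
busy_threads = 1   # only thread /3 of MathKernel is spinning

# Share of total CPU time consumed by the spinning thread(s).
share = 100.0 * busy_threads / ncpus
print("expected CPU column: %.0f%%" % share)
```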


--
Dave K MCSE.

MCSE = Minefield Consultant and Solitaire Expert.

Please note my email address changes periodically to avoid spam.
It is always of the form: month-year(a)domain. Hitting reply will work
for a couple of months only. Later set it manually.
From: Casper H.S. Dik on
"Dave (from the UK)" <see-my-signature(a)southminster-branch-line.org.uk> writes:

>Could Solaris be making use of instructions that are present in the
>Blade 1000's processor for maximum speed, but emulating them when the
>CPU does not have them?

No, it generally refuses to execute such binaries (with the exception
of V8 instructions on V7 hardware and VIS instructions on some
hardware).

>Perhaps bits of Solaris are compiled with -xarch=v8plusb to generate
>code for the UltraSPARC III processor?

No, except bits which are only meant to run on US-III CPUs.

>prstat does indeed indicate that. Here it is on my Ultra 80 after
>computing 1+1=2. This has 4 CPUs, so the 25% basically means it's CPU
>bound.

>sparrow /export/home/drkirkby % prstat
>   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
>  3743 drkirkby   63M   18M cpu3    10    0   0:02:45  25% MathKernel/3
>  2418 drkirkby  169M  102M sleep   14    0   0:09:20 2.8% mozilla-bin/3
>  2276 drkirkby   72M   17M sleep   41    0   0:00:55 1.2% metacity/1

What does this thread do when there is no issue?

And does "truss -v" show anything interesting about the arguments?

E.g., is the timeout different on the two platforms?

Casper
From: Dave (from the UK) on
Casper H.S. Dik wrote:

>>sparrow /export/home/drkirkby % prstat
>>   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
>>  3743 drkirkby   63M   18M cpu3    10    0   0:02:45  25% MathKernel/3
>>  2418 drkirkby  169M  102M sleep   14    0   0:09:20 2.8% mozilla-bin/3
>>  2276 drkirkby   72M   17M sleep   41    0   0:00:55 1.2% metacity/1
>
>
> What does this thread do when there is no issue?

Casper,

I'm a bit "out of my depth" here, but I'll try.

First, I don't have access to the Blade 1000, so I can't do any testing
on the Solaris 10 machine which works fine. I assume some truss data
from that would be useful. Perhaps Rainer Beushausen (bs(a)nwowhv.de) will
post some if he is able to. I've copied this to him, in case he is not
following the thread.

Comparing the truss output from the two machines I have access to:

1) 360 MHz Ultra 60 running Solaris 9 (Mathematica 5.2 OK)
2) 4 x 450 MHz Ultra 80 running Solaris 10. (Mathematica 5.2 has problems)

the difference between the two systems seems to be that on the Solaris 9
box, thread 3 is calling poll(), but on the Solaris 10 box,
thread 3 calls pollsys(). I guess it would be interesting to
know what happens to thread 3 on the Blade 1000 running Solaris 10.

On a Solaris 9 box, the CPU usage is fine, with MathKernel (pid=662)
using 0.0% of the CPU time.

solaris9 % prstat

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   713 drkirkby 4256K 3952K cpu0    39    0   0:00:00 0.3% prstat/1
   693 drkirkby 6952K 5080K sleep   59    0   0:00:00 0.1% dtterm/1
   620 drkirkby 9816K 8056K sleep   49    0   0:00:00 0.1% dtfile/1
   695 drkirkby 2768K 2176K sleep   49    0   0:00:00 0.1% tcsh/1
   662 drkirkby   62M   18M sleep   59    0   0:00:01 0.0% MathKernel/3

Truss shows a lot of calls to poll().

solaris9 % truss -p 662

662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0


Using the -v option on truss.

solaris9 % truss -v lwp_park -v poll -p 662
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/1: timeout: 0.019998000 sec
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/1: lwp_park(0xFFFFFFFF7FFFE240, 0) Err#62 ETIME
662/1: timeout: 0.019998000 sec
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0
662/3: poll(0xFFFFFFFF7A175460, 1, 1) = 0
662/3: fd=23 ev=POLLRDNORM rev=0

I see a timeout of 20ms on lwp_park(), but nothing for poll() - I am
rather lost as to what is happening here.
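
A note on the missing timeout: poll() takes its timeout as the third
argument, in milliseconds, so the trailing 1 in each poll(0x..., 1, 1)
line above is a 1 ms timeout, and truss -v has nothing extra to print
for it. A minimal sketch of the same call shape in Python, assuming an
idle pipe as a stand-in for fd 23 (the pipe and the variable names are
made up for illustration, and POLLIN is used in place of POLLRDNORM):

```python
import os
import select
import time

# Hypothetical stand-in for fd 23 in the trace: a pipe nobody writes to.
read_fd, write_fd = os.pipe()

poller = select.poll()
poller.register(read_fd, select.POLLIN)

timeout_ms = 1  # same value as the third argument in poll(0x..., 1, 1)

start = time.monotonic()
ready = poller.poll(timeout_ms)  # waits up to timeout_ms milliseconds
elapsed = time.monotonic() - start

# An empty list corresponds to the "= 0" return value seen in truss.
print("ready descriptors:", ready)
print("timeout in seconds:", timeout_ms / 1000)
print("elapsed seconds: %.6f" % elapsed)

os.close(read_fd)
os.close(write_fd)
```

So each call on the Solaris 9 box should block for roughly a millisecond
before returning 0, which would be consistent with the 0.0% CPU usage.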

Whereas on the Solaris 10 box the CPU usage of MathKernel is excessive
(25%, which is basically one CPU flat out)

solaris10 % prstat

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  6205 drkirkby   63M   17M cpu2    40    0   0:01:48  25% MathKernel/3
  6124 drkirkby  136M   68M sleep   59    0   0:00:58 2.5% mozilla-bin/3
  5848 drkirkby  139M   22M sleep   59    0   0:00:23 0.7% Xsun/1
  6050 drkirkby  233M   78M sleep   49    0   0:00:15 0.5% java/21
  6151 drkirkby   84M   20M sleep   49    0   0:00:06 0.3% gnome-terminal/2

and truss shows repeated calls to pollsys() on thread 3.

solaris10 % truss -p 6205

/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/1: lwp_park(0xFFFFFFFF7FFFE020, 0) Err#62 ETIME
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/1: lwp_park(0xFFFFFFFF7FFFE020, 0) Err#62 ETIME
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/2: pollsys(0xFFFFFFFF79EF9CD0, 1, 0xFFFFFFFF79EF9E40, 0x00000000) (sleeping...)
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0


> And does "truss -v" show anything interesting about the arguments?
>

I can't seem to see any way of getting a timeout on the Solaris 9 box,
but on the Solaris 10 box there is a timeout of 1 us as you can see below,
which is clearly a lot shorter than the 20ms shown on the Solaris 9
machine. But these are not the same functions, so I am not sure the two
figures are comparable.

solaris10 % truss -v pollsys -p 6205
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
/3: pollsys(0xFFFFFFFF79A75450, 1, 0xFFFFFFFF79AF5510, 0x00000000) = 0
/3: fd=23 ev=POLLRDNORM rev=0
/3: timeout: 0.000001000 sec
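
Reading the two traces side by side: a poll() timeout argument of 1 is
in milliseconds, i.e. 0.001 sec, whereas truss reports 0.000001 sec for
pollsys(), a factor of 1000 shorter. Whether that difference comes from
Mathematica or from Solaris is not established in this thread. The
sketch below only illustrates why a loop with such a short timeout would
look CPU bound. It uses Python's select.select() as a stand-in, because
it accepts fractional seconds; the idle pipe, the function name
count_wakeups and the 0.2 s window are assumptions for illustration, not
anything taken from MathKernel.

```python
import os
import select
import time

def count_wakeups(timeout_sec, window_sec=0.2):
    """Count how many timed-out waits fit into window_sec."""
    read_fd, write_fd = os.pipe()  # idle pipe: nothing is ever written
    calls = 0
    deadline = time.monotonic() + window_sec
    try:
        while time.monotonic() < deadline:
            # Returns with nothing ready once timeout_sec has expired.
            select.select([read_fd], [], [], timeout_sec)
            calls += 1
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return calls

slow = count_wakeups(0.001)     # 1 ms, as in poll(..., 1) on Solaris 9
fast = count_wakeups(0.000001)  # 1 us, as truss reports on Solaris 10
print("1 ms timeout:", slow, "wakeups")
print("1 us timeout:", fast, "wakeups")
```

The exact counts depend on the machine, but the 1 us loop should go
round far more often in the same window, spending its time entering and
leaving the kernel rather than sleeping.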

