From: Robert Klemme on
On 02.06.2010 06:57, Mike Schilling wrote:
>
>
> "Arne Vajhøj" <arne(a)vajhoej.dk> wrote in message
> news:4c059872$0$272$14726298(a)news.sunsite.dk...
>> On 01-06-2010 00:21, Kevin McMurtrie wrote:
>>> I've been assisting in load testing some new high performance servers
>>> running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
>>> suspending threads for time-slicing in very unfortunate locations.
>>
>> That should not come as a surprise.
>>
>> The thread scheduler does not examine the code for convenience.
>>
>> Correct code must work no matter when the moves in and out of
>> the CPU happen.
>>
>> High performance code must work efficiently no matter when the
>> moves in and out of the CPU happen.
>>
>>> For
>>> example, a thread might suspend in Hashtable.get(Object) after a call to
>>> getProperty(String) on the system properties. It's a synchronized
>>> global so a few hundred threads might pile up until the lock holder
>>> resumes. Odds are that those hundreds of threads won't finish before
>>> another one stops to time slice again. The performance hit has a ton of
>>> hysteresis so the server doesn't recover until it has a lower load than
>>> before the backlog started.
>>>
>>> The brute force fix is of course to eliminate calls to shared
>>> synchronized objects. All of the easy stuff has been done. Some
>>> operations aren't well suited to simple CAS. Bottlenecks that are part
>>> of well established Java APIs are time consuming to fix/avoid.
>>
>> High performance code needs to be designed not to synchronize
>> extensively.
>>
>> If the code does and there is a performance problem, then fix
>> the code.
>>
>> There are no miracles.
>
> Though giving a thread higher priority while it holds a shared lock
> isn't exactly rocket science; VMS did it back in the early 80s. JVMs
> could do a really nice job of this, noticing which monitors cause
> contention and how long they tend to be held. A shame they don't.

I can imagine that changing a thread's priority frequently causes
severe overhead because the OS scheduler has to readjust all the time.
Thread and process priorities are usually set once to indicate overall
processing priority - not to speed up certain operations. Also,
changing the priority does not guarantee anything - there could be other
threads with higher priority around.

I don't think it's a viable approach - especially if applied to fix
broken code (or even design).
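For what it's worth, the Java priority API itself makes this point: a
priority is only a hint to the scheduler, clamped to the thread group's
maximum, with no guarantee about what the OS does with it (a minimal
illustration; behavior is platform dependent):

```java
public class PriorityHint {
    public static void main(String[] args) {
        Thread t = new Thread(() -> { /* worker does nothing here */ });
        // The legal range is MIN_PRIORITY (1) .. MAX_PRIORITY (10); the
        // effective value is additionally capped by the thread group's maximum.
        t.setPriority(Math.min(Thread.MAX_PRIORITY,
                               t.getThreadGroup().getMaxPriority()));
        t.start();
        System.out.println(t.getPriority());
        // On Linux, without special scheduler configuration, all Java
        // priorities may map to the same nice level, so the call can be a
        // no-op in practice - which is exactly why it guarantees nothing.
    }
}
```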

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From: Kevin McMurtrie on
In article <4c048acd$0$22090$742ec2ed(a)news.sonic.net>,
Kevin McMurtrie <mcmurtrie(a)pixelmemory.us> wrote:

> I've been assisting in load testing some new high performance servers
> running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
> suspending threads for time-slicing in very unfortunate locations. For
> example, a thread might suspend in Hashtable.get(Object) after a call to
> getProperty(String) on the system properties. It's a synchronized
> global so a few hundred threads might pile up until the lock holder
> resumes. Odds are that those hundreds of threads won't finish before
> another one stops to time slice again. The performance hit has a ton of
> hysteresis so the server doesn't recover until it has a lower load than
> before the backlog started.
>
> The brute force fix is of course to eliminate calls to shared
> synchronized objects. All of the easy stuff has been done. Some
> operations aren't well suited to simple CAS. Bottlenecks that are part
> of well established Java APIs are time consuming to fix/avoid.
>
> Is there JVM or Linux tuning that will change the behavior of thread
> time slicing or preemption? I checked the JDK 6 options page but didn't
> find anything that appears to be applicable.

To clarify a bit, this isn't hammering a shared resource. I'm talking
about 100 to 800 synchronizations on a shared object per second for a
duration of 10 to 1000 nanoseconds. Yes, nanoseconds. That shouldn't
cause a complete collapse of concurrency.

My older 4-core Mac Xeon can have 64 threads call getProperty(String)
on a shared Properties instance 2 million times each in only 21 real
seconds. That's one call every 164 ns. It's not as good as
ConcurrentHashMap (one every 0.30 ns) but it's no collapse.
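A scaled-down sketch of that kind of micro-benchmark (thread and
iteration counts here are illustrative, not the 64-thread/2M-call run
above, and the harness itself is made up; written against a current JDK
for brevity):

```java
import java.util.Properties;
import java.util.concurrent.*;

public class LockBench {
    // Time N threads each performing M calls of the given lookup.
    public static long timeLookups(int threads, int iterations, Runnable lookup)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < iterations; i++) lookup.run();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();   // synchronized (extends Hashtable)
        props.setProperty("key", "value");
        ConcurrentHashMap<String, String> chm = new ConcurrentHashMap<>();
        chm.put("key", "value");

        long syncNs = timeLookups(8, 100_000, () -> props.getProperty("key"));
        long chmNs  = timeLookups(8, 100_000, () -> chm.get("key"));
        System.out.println("Properties: " + syncNs / 1e6 + " ms, "
                         + "ConcurrentHashMap: " + chmNs / 1e6 + " ms");
    }
}
```

Under contention the synchronized Properties numbers degrade much faster
than the ConcurrentHashMap numbers as the thread count grows.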

Many of the basic Sun Java classes are synchronized. Eliminating all
shared synchronized objects without making a mess of 3rd party library
integration is no easy task.
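For the system-properties case specifically, one possible workaround is
to snapshot the synchronized Properties into a lock-free map once and
serve reads from the copy (a sketch only - the class name is
hypothetical, and it deliberately misses any setProperty calls made
after the snapshot):

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical workaround: copy the synchronized system Properties into a
// lock-free ConcurrentHashMap once, so hot readers never touch the monitor.
public final class CachedProps {
    private static final Map<String, String> SNAPSHOT = new ConcurrentHashMap<>();

    static {
        Properties sys = System.getProperties();   // the synchronized global
        synchronized (sys) {                       // one-time copy under its lock
            for (String name : sys.stringPropertyNames()) {
                SNAPSHOT.put(name, sys.getProperty(name));
            }
        }
    }

    private CachedProps() {}

    public static String get(String name) {
        return SNAPSHOT.get(name);                 // no monitor, scales across threads
    }
}
```

The catch, of course, is staleness: anything that mutates system
properties at runtime bypasses the snapshot, so this only fits
properties that are effectively read-only after startup.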

Next up is looking at the Linux scheduler version and the HotSpot
spinlock timeout. Maybe the two don't mesh and a thread is very likely
to enter a semaphore right as its quantum runs out.
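For anyone poking at the same knobs, the relevant switches look roughly
like this - hedged heavily: the flag and sysctl names below are from the
JDK 6 / early-CFS era, the jar name is a placeholder, and everything
should be verified against the exact JVM and kernel in use:

```shell
# HotSpot adaptive spinning before a contended monitor parks the thread.
# UseSpinning existed in JDK 6 and was removed in later JDKs, where
# spinning became always-adaptive.
java -XX:+UseSpinning -jar app.jar

# CFS scheduler granularity on the Linux side (larger values mean fewer
# involuntary preemptions mid-critical-section):
sysctl kernel.sched_min_granularity_ns
sysctl kernel.sched_latency_ns
```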
--
I won't see Google Groups replies because I must filter them as spam
From: Mike Schilling on


"Robert Klemme" <shortcutter(a)googlemail.com> wrote in message
news:86m8vjF4ulU2(a)mid.individual.net...
> On 02.06.2010 06:57, Mike Schilling wrote:
>>
>>
>> "Arne Vajhøj" <arne(a)vajhoej.dk> wrote in message
>> news:4c059872$0$272$14726298(a)news.sunsite.dk...
>>> On 01-06-2010 00:21, Kevin McMurtrie wrote:
>>>> I've been assisting in load testing some new high performance servers
>>>> running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
>>>> suspending threads for time-slicing in very unfortunate locations.
>>>
>>> That should not come as a surprise.
>>>
>>> The thread scheduler does not examine the code for convenience.
>>>
>>> Correct code must work no matter when the moves in and out of
>>> the CPU happen.
>>>
>>> High performance code must work efficiently no matter when the
>>> moves in and out of the CPU happen.
>>>
>>>> For
>>>> example, a thread might suspend in Hashtable.get(Object) after a call
>>>> to
>>>> getProperty(String) on the system properties. It's a synchronized
>>>> global so a few hundred threads might pile up until the lock holder
>>>> resumes. Odds are that those hundreds of threads won't finish before
>>>> another one stops to time slice again. The performance hit has a ton of
>>>> hysteresis so the server doesn't recover until it has a lower load than
>>>> before the backlog started.
>>>>
>>>> The brute force fix is of course to eliminate calls to shared
>>>> synchronized objects. All of the easy stuff has been done. Some
>>>> operations aren't well suited to simple CAS. Bottlenecks that are part
>>>> of well established Java APIs are time consuming to fix/avoid.
>>>
>>> High performance code needs to be designed not to synchronize
>>> extensively.
>>>
>>> If the code does and there is a performance problem, then fix
>>> the code.
>>>
>>> There are no miracles.
>>
>> Though giving a thread higher priority while it holds a shared lock
>> isn't exactly rocket science; VMS did it back in the early 80s. JVMs
>> could do a really nice job of this, noticing which monitors cause
>> contention and how long they tend to be held. A shame they don't.
>
> I can imagine that changing a thread's priority frequently causes
> severe overhead because the OS scheduler has to readjust all the time.
> Thread and process priorities are usually set once to indicate overall
> processing priority - not to speed up certain operations.

Not at all. In time-sharing systems, it's a common scheduling algorithm to
adjust the effective priority of a process dynamically, e.g. processes that
require user input get a boost above compute-bound ones, to help keep
response times low. As I said, I'm not inventing this: it was state of the
art about 30 years ago.


From: Robert Klemme on
On 02.06.2010 08:02, Mike Schilling wrote:
>
>
> "Robert Klemme" <shortcutter(a)googlemail.com> wrote in message
> news:86m8vjF4ulU2(a)mid.individual.net...
>> On 02.06.2010 06:57, Mike Schilling wrote:
>>>
>>>
>>> "Arne Vajhøj" <arne(a)vajhoej.dk> wrote in message
>>> news:4c059872$0$272$14726298(a)news.sunsite.dk...
>>>> On 01-06-2010 00:21, Kevin McMurtrie wrote:
>>>>> I've been assisting in load testing some new high performance servers
>>>>> running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or
>>>>> Linux is
>>>>> suspending threads for time-slicing in very unfortunate locations.
>>>>
>>>> That should not come as a surprise.
>>>>
>>>> The thread scheduler does not examine the code for convenience.
>>>>
>>>> Correct code must work no matter when the moves in and out of
>>>> the CPU happen.
>>>>
>>>> High performance code must work efficiently no matter when the
>>>> moves in and out of the CPU happen.
>>>>
>>>>> For
>>>>> example, a thread might suspend in Hashtable.get(Object) after a
>>>>> call to
>>>>> getProperty(String) on the system properties. It's a synchronized
>>>>> global so a few hundred threads might pile up until the lock holder
>>>>> resumes. Odds are that those hundreds of threads won't finish before
>>>>> another one stops to time slice again. The performance hit has a
>>>>> ton of
>>>>> hysteresis so the server doesn't recover until it has a lower load
>>>>> than
>>>>> before the backlog started.
>>>>>
>>>>> The brute force fix is of course to eliminate calls to shared
>>>>> synchronized objects. All of the easy stuff has been done. Some
>>>>> operations aren't well suited to simple CAS. Bottlenecks that are part
>>>>> of well established Java APIs are time consuming to fix/avoid.
>>>>
>>>> High performance code needs to be designed not to synchronize
>>>> extensively.
>>>>
>>>> If the code does and there is a performance problem, then fix
>>>> the code.
>>>>
>>>> There are no miracles.
>>>
>>> Though giving a thread higher priority while it holds a shared lock
>>> isn't exactly rocket science; VMS did it back in the early 80s. JVMs
>>> could do a really nice job of this, noticing which monitors cause
>>> contention and how long they tend to be held. A shame they don't.
>>
>> I can imagine that changing a thread's priority frequently causes
>> severe overhead because the OS scheduler has to readjust all the time.
>> Thread and process priorities are usually set once to indicate overall
>> processing priority - not to speed up certain operations.
>
> Not at all. In time-sharing systems, it's a common scheduling algorithm
> to adjust the effective priority of a process dynamically, e.g.
> processes that require user input get a boost above compute-bound ones,
> to help keep response times low. As I said, I'm not inventing this: it
> was state of the art about 30 years ago.

That's true but in these cases it's the OS that does it - not the JVM.
From the OS point of view the JVM is just another process, and I doubt
there is an interface for adjusting the automatic priority (which in a
way would defeat "automatic"). The base priority, on the other hand,
indicates the general priority of a thread / process, and I still don't
think it's a good idea to change it all the time.

So, either the OS does honor thread state (mutex, IO etc.) and adjusts
prio accordingly or it doesn't. But I don't think it's the job of the JVM.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From: Robert Klemme on
On 02.06.2010 07:45, Kevin McMurtrie wrote:
> In article <4c048acd$0$22090$742ec2ed(a)news.sonic.net>,
> Kevin McMurtrie <mcmurtrie(a)pixelmemory.us> wrote:
>
>> I've been assisting in load testing some new high performance servers
>> running Tomcat 6 and Java 1.6.0_20. It appears that the JVM or Linux is
>> suspending threads for time-slicing in very unfortunate locations. For
>> example, a thread might suspend in Hashtable.get(Object) after a call to
>> getProperty(String) on the system properties. It's a synchronized
>> global so a few hundred threads might pile up until the lock holder
>> resumes. Odds are that those hundreds of threads won't finish before
>> another one stops to time slice again. The performance hit has a ton of
>> hysteresis so the server doesn't recover until it has a lower load than
>> before the backlog started.
>>
>> The brute force fix is of course to eliminate calls to shared
>> synchronized objects. All of the easy stuff has been done. Some
>> operations aren't well suited to simple CAS. Bottlenecks that are part
>> of well established Java APIs are time consuming to fix/avoid.
>>
>> Is there JVM or Linux tuning that will change the behavior of thread
>> time slicing or preemption? I checked the JDK 6 options page but didn't
>> find anything that appears to be applicable.
>
> To clarify a bit, this isn't hammering a shared resource. I'm talking
> about 100 to 800 synchronizations on a shared object per second for a
> duration of 10 to 1000 nanoseconds. Yes, nanoseconds. That shouldn't
> cause a complete collapse of concurrency.

That's the nature of locking issues. Up to a certain point it works
pretty well, and then locking delays explode because of the positive
feedback: waiting threads lengthen the backlog, which makes still more
threads wait.
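One way to see why it explodes, with a textbook M/M/1 queueing estimate
(my own illustration, not a measurement from the thread; the 10 ms
descheduling quantum is an assumed typical timeslice): mean time at the
lock is roughly S / (1 - lambda*S), which diverges as utilization
lambda*S approaches 1.

```java
public class LockQueue {
    // Mean time in system for an M/M/1 queue: W = S / (1 - rho), rho = lambda*S.
    // lambda = lock acquisitions per second, S = hold time in nanoseconds.
    public static double meanWaitNs(double arrivalsPerSec, double holdNs) {
        double rho = arrivalsPerSec * holdNs * 1e-9;     // utilization of the lock
        if (rho >= 1.0) return Double.POSITIVE_INFINITY; // saturated: unbounded backlog
        return holdNs / (1.0 - rho);
    }

    public static void main(String[] args) {
        // 800 acquisitions/s at a 1000 ns hold: rho = 8e-4, essentially no queueing.
        System.out.println(meanWaitNs(800, 1000));
        // But if a lock holder is descheduled for one ~10 ms quantum, the
        // effective hold time becomes 1e7 ns: rho = 8, and the queue diverges.
        System.out.println(meanWaitNs(800, 1.0e7));
    }
}
```

That matches the description upthread: nanosecond holds at 800 Hz are
harmless until a single untimely preemption stretches one hold across a
whole timeslice, at which point the lock is saturated.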

If you have "a few hundred threads" accessing a single shared lock with
a frequency of 800Hz then you have a design issue - whether you call it
"hammering" or not. It's simply not scalable, and if it doesn't break
now it will likely break with the next increase in load.

> My older 4-core Mac Xeon can have 64 threads call getProperty(String)
> on a shared Properties instance 2 million times each in only 21 real
> seconds. That's one call every 164 ns. It's not as good as
> ConcurrentHashMap (one every 0.30 ns) but it's no collapse.

Well, then stick with the old CPU. :-) It's not uncommon that moving to
newer hardware with increased processing resources uncovers issues like
this.

> Many of the basic Sun Java classes are synchronized. Eliminating all
> shared synchronized objects without making a mess of 3rd party library
> integration is no easy task.

It would certainly help the discussion if you pointed out exactly which
classes and methods you are referring to. I would readily agree that
Sun initially did a few things wrong in the std lib (Vector) which they
partly fixed later. But I am not inclined to believe in a massive (i.e.
affecting many areas) concurrency problem in the std lib.

If they synchronize, they do it for good reasons - and you simply need
to limit the number of threads that try to access the resource. A
globally synchronized, frequently accessed resource in a system with
several hundred threads is a design problem - not necessarily in the
implementation of the resource but rather in its usage.
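A minimal sketch of such a limit, using a plain
java.util.concurrent.Semaphore (the class name and the bound are made
up for illustration):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Caps how many threads may touch a contended resource at once, so a few
// hundred request threads cannot all pile onto one monitor simultaneously.
public class BoundedResource {
    private final Semaphore permits;

    public BoundedResource(int maxConcurrent) {
        // Fair mode: waiters are served in arrival order rather than piling
        // up and racing whenever the lock holder is descheduled.
        this.permits = new Semaphore(maxConcurrent, true);
    }

    // Run the given access under the concurrency bound.
    public <T> T access(Callable<T> body) throws Exception {
        permits.acquire();
        try {
            return body.call();
        } finally {
            permits.release();
        }
    }
}
```

Usage would be something like
`new BoundedResource(4).access(() -> props.getProperty("key"))` - the
synchronized resource stays as it is, but at most four threads ever
contend for its monitor.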

> Next up is looking at the Linux scheduler version and the HotSpot
> spinlock timeout. Maybe the two don't mesh and a thread is very likely
> to enter a semaphore right as its quanta runs out.

Btw, as far as I can see you didn't yet disclose how you found out about
the point where the thread is suspended. I'm still curious to learn how
you found out. Might be a valuable addition to my toolbox.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/