From: Andrew Gabriel on 6 Feb 2010 20:03

In article <hkhl3l$503$1(a)kil-nws-1.ucis.dal.ca>,
hume.spamfilter(a)bofh.ca writes:
> Andrew Gabriel <andrew(a)cucumber.demon.co.uk> wrote:
>>> In contrast, a Xeon box running Linux (2.2GHz) averages 40 ms. Yes, the
>>> x86 runs at twice the clock speed; but it delivers ten times the performance
>>> (both machines unloaded).
>>
>> You're only using somewhere between 1% and 6% of the T5140.
>
> I realize this. I KNOW the 5140 will blow the Xeon out of the water under
> massive load. But... for that one, single-threaded process... running at
> half the clock rate you'd expect the process to take twice as long, while
> the other 127 vcpus twiddled their thumbs because they couldn't help out.
>
> The question I'm being asked by the developers is: if the Sun runs at half
> the clock rate, 40 ms becomes 80 ms, being generous and round it up to 100 ms.

Not as simple as that. If you look at a Xeon, or UltraSPARC, or SPARC64,
these have long pipelines and process several instructions in parallel.
This enables them to look ahead, predict what memory accesses they'll
need, and fire off the requests in advance, so they don't waste as much
time later on a pipeline stall. The logic supporting this pipeline is much
bigger than the logic performing the conventional CPU functions.

The T series processors don't have this. Instead, they are designed to
handle pipeline stalls simply by doing a very fast context switch to
another thread, leaving the stalled thread to do its memory access whilst
another thread is running. This works very well when you have lots of
runnable threads: the stalled time when a T series core can't do anything
is typically much less than that of a long-pipeline core, which is why its
performance flies, and it doesn't carry all that extra heat-generating
pipeline logic.
However, if you only have one thread, that thread is going to hit far more
pipeline stalls than it would on a long-pipeline processor, so even at the
same clock speed it will be significantly slower.

> Where is the other 290 ms going? Is it being lost to context switching?

There's no context switching when you have only one thread. It's lost in
pipeline stalls, because the logic to avoid them isn't there.

> Is the nature of the way PHP does substring calls hostile to the cache?
> (I've run into that problem before, though not with PHP...) Something else?

There's something else which might add to this. If the flow of logic
through the compiled PHP binary keeps calling and returning through lots
of deeply stacked functions, it will generate lots of spill/fill register
window traps. SPARC is very fast at function calls because of the way it
keeps multiple register sets in the CPU, but when you exceed the CPU's
capacity to store them, it has to spill them out to memory, and conversely
fill them back in again as you return through the large number of stack
frames.

> I managed to squeeze another 14% performance out of PHP by recompiling PHP
> with SS12u1 and enabling the -fast CFLAGS.

If you aren't already, see if -xO4 makes any difference; this should
perform function inlining and tail-call optimisation, both of which will
reduce the number of register window sets used, if this is part of the
problem. (A longer read through the cc options might reveal some other
appropriate ones here - not something I know off the top of my head.)

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
From: Andrew Gabriel on 6 Feb 2010 20:07

In article <a8b10051-7348-4558-a4db-1d042242da67(a)a1g2000vbl.googlegroups.com>,
ChrisS <chris.scarff(a)gmail.com> writes:
> Not to start a fight between admins and developers, but after admins
> have thrown more horse-power at a web application it's time to get the
> developers to earnestly re-look at their own code. I've had our web
> developers do that after I've exhausted server-side solutions. The
> developers, more times than not, find a better way of writing their
> code, and speeding up their apps 2 or 3-fold. In a few instances it
> was simply changing the logical order of processing their code. I
> love when they admit defeat. :-) Having a truly open dialog between
> admin & devs is priceless.

Something I've done in this circumstance many times is to run analyzer(1)
on the app, and then hand the histograms back to the developers. It
usually results in comments like "but we shouldn't even be going in to
this code", whilst pointing at something which is using 90% of the CPU,
such as some debugging functions...

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
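[For reference, the analyzer(1) workflow mentioned above looks roughly
like this with the Sun Studio tools; a sketch, with the experiment name
and binary as placeholders.]

```shell
# Record a performance experiment while the app runs under collect(1).
# ./myapp stands in for the real binary.
collect -o test.1.er ./myapp

# Text-mode histogram of CPU time per function - the output you hand
# back to the developers.
er_print -functions test.1.er

# Or browse the same experiment interactively in the GUI.
analyzer test.1.er
```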
From: hume.spamfilter on 7 Feb 2010 06:44
Andrew Gabriel <andrew(a)cucumber.demon.co.uk> wrote:
> Not as simple as that. If you look at a Xeon, or Ultrasparc, or Sparc64,
> these have long pipelines and process several instructions in parallel.

This is exactly the kind of explanation I was looking for (and educational
for me to boot). Thanks for taking the time to write it out.

> If you aren't already, see if -xO4 makes any difference; this should
> perform function inlining and tail-call optimisation, both of which

-fast is a macro that turns on -xO5... so that's taken care of. The next
step is using -xprofile to turn on profile collect/use feedback, but that
increases compile time by orders of magnitude and I'm not experienced
enough to use it properly. There's a guide on wiki.sun.com, even one
specialized for profiling PHP, but the information there seems incomplete.

--
Brandon Hume - hume -> BOFH.Ca, http://WWW.BOFH.Ca/
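[The -xprofile collect/use cycle referred to above is a two-pass build;
a rough sketch of the shape it takes with Sun Studio cc - the profile
directory, output name, and workload script are illustrative, not taken
from the wiki.sun.com guide.]

```shell
# Pass 1: build with profile-collection instrumentation.
cc -fast -xprofile=collect:./php.profile -o php ...

# Run a representative workload so the profile data gets populated.
./php typical_workload.php

# Pass 2: rebuild, letting the collected profile guide optimisation
# (branch ordering, inlining decisions, etc.).
cc -fast -xprofile=use:./php.profile -o php ...
```

The catch the poster mentions is real: every source change invalidates the
profile, so the collect/run/use cycle multiplies build time.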