From: Andrew Gabriel on
In article <hkhl3l$503$1(a)kil-nws-1.ucis.dal.ca>,
hume.spamfilter(a)bofh.ca writes:
> Andrew Gabriel <andrew(a)cucumber.demon.co.uk> wrote:
>>> In contrast, a Xeon box running Linux (2.2GHz) averages 40 ms. Yes, the
>>> x86 runs at twice the clock speed; but it delivers ten times the performance
>>> (both machines unloaded).
>>
>> You're only using somewhere between 1% and 6% of the T5140.
>
> I realize this. I KNOW the 5140 will blow the Xeon out of the water under
> massive load. But... for that one, single-threaded process... running at
> half the clock rate you'd expect the process to take twice as long, while
> the other 127 vcpus twiddled their thumbs because they couldn't help out.
>
> The question I'm being asked by the developers is: if the Sun runs at half
> the clock rate, 40 ms becomes 80 ms, being generous and round it up to 100 ms.

Not as simple as that. If you look at a Xeon, or Ultrasparc, or Sparc64,
these have long pipelines and process several instructions in parallel.
This enables them to look ahead and predict what memory accesses they'll
need and fire off the requests in advance so they don't waste as much
time later with a pipeline stall. The logic supporting this pipeline is
much bigger than the logic performing the conventional CPU functions.

The T series processors don't have this. Instead, they are designed to
handle pipeline stalls simply by doing a very fast context switch to
another thread, and leaving the stalled thread to do its memory access
whilst another thread is running. This works very well when you have
lots of runnable threads - the time a T series core spends stalled
with nothing to run is typically much less than on a long-pipeline
core, which is why its performance flies, and it doesn't need all
that extra heat-generating pipeline logic. However, if you only
have one thread, that's going to get loads more pipeline stalls than
it would on a long pipeline processor, so even at the same clock speed,
it will be significantly slower.

> Where is the other 290 ms going? Is it being lost to context switching? Is

There's no context switching when you have only one thread. It's lost
in pipeline stalls because the logic to avoid them isn't there.

> the nature of the way PHP does substring calls hostile to the cache? (I've
> run into that problem before, though not with PHP...) Something else?

There's something else which might add to this. If the flow of logic
through the compiled PHP binary keeps calling and returning through lots
of deeply stacked functions, it will be generating lots of spill/fill
register window traps. Sparc is very fast at function calls because of the
way it keeps multiple register sets in the CPU, but when you exceed the
CPU's capability to store them, it has to spill them out to memory, and
conversely fill them back up again as you return through the large number
of stack frames.

> I managed to squeeze another 14% performance out of PHP by recompiling PHP
> with SS12u1 and enabling the -fast CFLAGS.

If you aren't already, see if -xO4 makes any difference; this should
perform function inlining and tail-call optimisation, both of which
will reduce the number of register windows used, if this is part of
the problem. (A longer read through the cc options might reveal some
other appropriate ones here - not something I know off the top of my
head.)

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
From: Andrew Gabriel on
In article <a8b10051-7348-4558-a4db-1d042242da67(a)a1g2000vbl.googlegroups.com>,
ChrisS <chris.scarff(a)gmail.com> writes:
> Not to start a fight between admins and developers, but after admins
> have thrown more horse-power at a web application it's time to get the
> developers to earnestly re-look at their own code. I've had our web
> developers do that after I've exhausted server-side solutions. The
> developers, more times than not, find a better way of writing their
> code, and speeding up their apps 2 or 3-fold. In a few instances it
> was simply changing the logical order of processing their code. I
> love when they admit defeat. :-) Having a truly open dialog between
> admin & devs is priceless.

Something I've done in this circumstance many times is to run
analyzer(1) on the app, and then hand the histograms back to the
developers. It usually results in comments like "but we shouldn't
even be going in to this code", whilst pointing at something which
is using 90% of the CPU, such as some debugging functions...

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
From: hume.spamfilter on
Andrew Gabriel <andrew(a)cucumber.demon.co.uk> wrote:
> Not as simple as that. If you look at a Xeon, or Ultrasparc, or Sparc64,
> these have long pipelines and process several instructions in parallel.

This is exactly the kind of explanation I was looking for (and educational
to myself to boot). Thanks for taking the time to write it out.

> If you aren't already, see if -xO4 makes any difference; this should
> perform function inlining and tail-call optimisation, both of which

-fast is a macro that turns on -xO5... so that's taken care of. The next
step is using -xprofile to turn on profiling collect/use, but that increases
compile time by orders of magnitude and I'm not experienced enough in how to
use it properly. There's a guide on wiki.sun.com, even specialized for
profiling PHP, but the information there seems incomplete.

--
Brandon Hume - hume -> BOFH.Ca, http://WWW.BOFH.Ca/