Prev: Looking for Sponsorship
Next: Processors stall on OLTP workloads about half the time--almost no matter what you do
From: Rick Jones on 23 Apr 2010 16:08

I like interrupt coalescing - when it is done well, anyway - when it is done badly, it does rather nasty things to latency. And I've seen a small but measurable number of folks who want very, very low latency on their "online transactions."

This write-up is probably old enough to be off on the syntax/constants (as it were), but the semantics should remain the same: ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt

rick jones
--
No need to believe in either side, or any side. There is no cause. There's only yourself. The belief is in your own precision. - Joubert
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
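A toy model of the trade-off Rick describes: coalescing cuts the interrupt rate by amortizing one interrupt over many packets, but every packet then waits for the coalescing window to close. The parameters below (packet rate, window length, per-interrupt handling cost) are illustrative assumptions, not figures from the HP brief.

```python
# Toy model: interrupts/sec and added per-packet latency under NIC
# interrupt coalescing. All parameters are illustrative assumptions.

def coalescing_stats(pkts_per_sec, coalesce_usec, intr_cost_usec):
    """One interrupt per coalescing window; a packet arriving in a
    window waits, on average, half the window before being seen."""
    window_sec = coalesce_usec / 1e6
    pkts_per_window = max(1.0, pkts_per_sec * window_sec)
    interrupts_per_sec = pkts_per_sec / pkts_per_window
    # Added latency: mean wait for the window to close, plus handling.
    avg_latency_usec = coalesce_usec / 2.0 + intr_cost_usec
    return interrupts_per_sec, avg_latency_usec

# No coalescing: one interrupt per packet, only the handling cost.
print(coalescing_stats(100_000, 0.0, 5.0))    # 100k intr/s, 5 us
# A 100 us window: 10x fewer interrupts, but ~55 us average latency.
print(coalescing_stats(100_000, 100.0, 5.0))  # 10k intr/s, 55 us
```

The point of the sketch is that the same knob that makes throughput benchmarks look good (fewer interrupts per second) is exactly what hurts the low-latency "online transaction" folks.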
From: Robert Myers on 22 Apr 2010 13:01

On Apr 22, 12:54 pm, n...(a)cam.ac.uk wrote:
>     2) The reason that throughput isn't pro-rata to the number of
> threads is that you also need cache size and bandwidth and memory
> bandwidth (and number of outstanding requests) pro rata to the
> number of threads.  Now, back in the real world ....

I had thought the idea of having lots of threads was precisely to get the memory requests out. You start a thread, get some memory requests out, and let it stall, because it's going to stall anyway. Cache size and bandwidth and memory bandwidth are another matter.

Robert.
From: nmm1 on 22 Apr 2010 13:13

In article <47ad6c50-de94-4a20-8316-bc10d6dce54d(a)g23g2000yqn.googlegroups.com>, Robert Myers <rbmyersusa(a)gmail.com> wrote:
>>     2) The reason that throughput isn't pro-rata to the number of
>> threads is that you also need cache size and bandwidth and memory
>> bandwidth (and number of outstanding requests) pro rata to the
>> number of threads.  Now, back in the real world ....
>
> I had thought the idea of having lots of threads was precisely to get
> the memory requests out. You start a thread, get some memory requests
> out, and let it stall, because it's going to stall, anyway.

Yeah. I was taken in, too, but couldn't follow their explanation, so I worked through the mathematics. I was pretty disgusted at what I found - that claim is marketing, pure and simple.

Regards,
Nick Maclaren.
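A sketch of the arithmetic behind Nick's objection, via Little's law: no matter how many threads you stall in parallel, the memory-level parallelism you actually exploit is capped by outstanding-miss slots (MSHRs) and by raw memory bandwidth. All the numbers below are invented for illustration.

```python
# Little's-law sketch: threaded memory throughput is capped by
# outstanding-miss slots (MSHRs) and bandwidth, not just thread count.
# All parameter values are invented for illustration.

def memory_throughput(threads, misses_per_thread, mshrs,
                      line_bytes=64, latency_ns=100, bw_gbs=10.0):
    """Cache lines/sec serviced under concurrency and bandwidth caps."""
    # Concurrency cap: no more misses in flight than MSHR slots.
    in_flight = min(threads * misses_per_thread, mshrs)
    # Little's law: lines/sec = misses in flight / miss latency.
    lines_per_sec = in_flight / (latency_ns * 1e-9)
    # Bandwidth cap.
    bw_lines_per_sec = bw_gbs * 1e9 / line_bytes
    return min(lines_per_sec, bw_lines_per_sec)

base = memory_throughput(1, 4, 16)
for n in (1, 2, 4, 8):
    print(n, memory_throughput(n, 4, 16) / base)  # speedup vs 1 thread
```

With these made-up numbers, scaling is pro-rata from 1 to 2 threads, then flattens around 3.9x as the bandwidth and MSHR caps bite - which is the shape of the gap between the marketing claim and the measured 1.33x.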
From: Robert Myers on 23 Apr 2010 13:11

On Apr 23, 3:51 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net> wrote:
> My lament is that folks like Keeton and Patterson, as well as folks like Itanicides, etc., do these comparisons, BUT
> ONLY TO THE ARCHITECTURES THAT THEY CHOOSE. In some ways it almost looks as if they choose deliberate strawmen that
> will make their preferences look good. More likely, they are just not aware of work in more advanced microarchitecture.
>
> Now, it is fair to say that simple cores beat first generation OOO cores for simple OLTP workloads. But what do the
> numbers look like for advanced SpMT (Gen2 OOO) microarchitectures, on complex transaction workloads? You'll never know
> unless you try.
>
> So, it's happened again. Just as OOO CPU research was the poor cousin during the early days of the RISC revolution, so
> SpMT is the poor cousin in the early days of MultiCore and ManyCore. If M&M run out of steam, I hope that Haitham
> Akkary's SpMT research will be there, still ongoing, to pick up the pieces.

So here's Intel management, more worried about Wall Street than they are about you or anyone else with deep technical insight. Only one architecture makes sense from a business point of view for them. Everything else is an expensive hobby. Wintel got us cheap computers for everyone, but now we're stuck.

Whether or not they liked what Patterson was telling them, and whether or not they were capable of examining the conclusions critically, the provocative message header with which I started this thread is something they can understand - even if it is nothing more than a grotesque oversimplification.

I don't know that any technical discipline works any differently. Create a startup. Get bought. Count your money. Don't worry about whether any of it really makes sense.

Robert.
From: MitchAlsup on 22 Apr 2010 23:05

On Apr 22, 2:33 pm, n...(a)cam.ac.uk wrote:
> That is certainly true, but we should compare a dual-threaded system
> with a dual-core one that shares at least level 2 cache.

No gain there. Going back to the numbers Nick quoted from above: the 2-threaded CPU got 1.33X throughput. However, one could build two 0.67-throughput CPUs for a lot less die area than the one big 2-threaded CPU, without having to share ANY resources (including cache or TLB resources).

Using really rough hand-waving back-of-the-envelope figures:

One big monothreaded CPU core will have an area of (say) 10 units and deliver 1.0 in some commercial performance metric.
One big dual-threaded CPU core will have an area of 11 units and deliver 1.33 in the performance metric.
One little completely in-order CPU will have an area of 1 unit and deliver about 0.4-0.5 in the performance metric.
One medium CPU with some bells and whistles will have an area of 3 units and deliver 0.67 in the performance metric.

Both little and medium cores occupy less area (and cost) and burn less power (active and leakage) and deliver the same performance as something twice as big (11 is approx 2*6). It is entirely possible that the medium CPU, enhanced to 10 units of size by throwing cache (/TLB/buffers) at the missing die area, would significantly outperform the great big OoO CPUs, threaded or not, in commercial apps.

Mitch
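Mitch's back-of-the-envelope comparison, spelled out as performance per unit area. The area and performance figures are copied from his post; the 0.45 for the little core is simply the midpoint of his 0.4-0.5 range.

```python
# Performance-per-area arithmetic using Mitch's figures (area in his
# arbitrary units, perf in his commercial-performance metric).

cores = {
    "big monothreaded":  {"area": 10, "perf": 1.0},
    "big dual-threaded": {"area": 11, "perf": 1.33},
    "little in-order":   {"area": 1,  "perf": 0.45},  # midpoint of 0.4-0.5
    "medium":            {"area": 3,  "perf": 0.67},
}

for name, c in cores.items():
    print(f"{name:18s} perf/area = {c['perf'] / c['area']:.3f}")

# Two medium cores match the dual-threaded core's throughput in ~55%
# of the area: 2 * 0.67 = 1.34 vs 1.33, on 6 area units instead of 11.
two_medium_perf = 2 * cores["medium"]["perf"]
two_medium_area = 2 * cores["medium"]["area"]
print(two_medium_perf, two_medium_area)
```

The perf/area ratios make the argument directly: the little and medium cores deliver roughly 2x the throughput per unit of silicon that either big core does, which is why replicating them looks attractive for throughput-bound commercial workloads.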