From: Stephen Fuld on 22 Apr 2010 11:40

On 4/22/2010 3:04 AM, Morten Reistad wrote:
> In article <86666a83-4bed-472c-aacd-9fc6ef47e9e6(a)k33g2000yqc.googlegroups.com>,
> MitchAlsup <MitchAlsup(a)aol.com> wrote:
>> On Apr 21, 11:02 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
>>> Even though the paper distinguishes between technical and commercial
>>> workloads, and draws its negative conclusion only for commercial
>>> workloads, it was interesting to me that, for instance, Blue Gene went
>>> the same direction--many simple processors--for a technical workload so
>>> as to achieve low power operation.
>>
>> Reading between the lines, commercial and DB workloads are better
>> served by slower processors accessing a thinner cache/memory hierarchy
>> than by faster processors accessing a thicker cache/memory hierarchy.
>> That is: a commercial machine is better served with a larger first-level
>> cache backed by a large second-level cache running at slower frequencies,
>> while a technical machine would be better served with smaller first-level
>> caches, a medium second-level cache, and a large third-level cache
>> running at higher frequencies.
>
> I can confirm this from benchmarks of real-life workloads for
> pretty static web servers, media servers, and SIP telephony systems.
> Cache size means everything in this context.

Isn't this the kind of workload (relatively small instruction footprint,
lots of cache misses, and lots of independent threads) that could benefit
from a multi-threaded CPU?

> Actually, the Hyperchannel cache interconnects work very nicely to
> make all the on-chip caches work as a system-wide cache, not just
> die-wide. I may even suggest an L4 cache attachment to static on-chip
> hyperchannel memory; like a Xeon with no CPUs.

Another advantage of this is that the "extra" chip provides more pins,
thus allowing higher system memory bandwidth for those applications
that need it.

--
- Stephen Fuld (e-mail address disguised to prevent spam)
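A back-of-the-envelope sketch of the latency trade-off Mitch describes,
in Python. All hit rates and latencies below are illustrative assumptions,
not measurements from any of the machines discussed:

# Average memory access time (AMAT) for two hierarchy shapes.
# Each level is (hit_rate, latency_ns); misses fall through to DRAM.

def amat(levels, mem_latency):
    """Compute AMAT; 'levels' is ordered innermost (L1) first."""
    total, miss_frac = 0.0, 1.0
    for hit_rate, latency in levels:
        total += miss_frac * latency      # every access reaching this level pays its latency
        miss_frac *= (1.0 - hit_rate)     # fraction continuing to the next level
    return total + miss_frac * mem_latency

# "Commercial" shape: large, slower L1 and L2, no L3.
commercial = amat([(0.95, 2.0), (0.80, 12.0)], mem_latency=80.0)

# "Technical" shape: small fast L1, medium L2, large L3, higher clocks.
technical = amat([(0.90, 1.0), (0.70, 6.0), (0.60, 25.0)], mem_latency=80.0)

print(f"commercial-style AMAT: {commercial:.1f} ns")
print(f"technical-style  AMAT: {technical:.1f} ns")

With these made-up numbers the "commercial" shape wins on AMAT despite
the slower caches, because the larger first levels catch more of the
pointer-chasing misses; change the hit rates and the conclusion flips,
which is exactly why the two workload classes want different hierarchies.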
From: nmm1 on 22 Apr 2010 12:54

In article <hqpqks$q26$1(a)news.eternal-september.org>,
Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
> On 4/22/2010 3:04 AM, Morten Reistad wrote:

snip

>> I can confirm this from benchmarks of real-life workloads for
>> pretty static web servers, media servers, and SIP telephony systems.
>> Cache size means everything in this context.
>
> Isn't this the kind of workload (relatively small instruction footprint,
> lots of cache misses, and lots of independent threads) that could benefit
> from a multi-threaded CPU?

Not really. That's a common mistake. There are two reasons:

1) Running lots of threads usually slows each thread down by causing
conflicts in some of the shared components. Some designs do better and
some worse, of course.

2) The reason that throughput isn't pro rata to the number of threads
is that you also need cache size, cache bandwidth, and memory bandwidth
(and number of outstanding requests) pro rata to the number of threads.

Now, back in the real world ....

Regards,
Nick Maclaren.
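Nick's second point can be put in a toy model: per-thread demand scales
with the thread count, but the shared memory system does not. The figures
below are illustrative assumptions, not measurements:

# Aggregate throughput saturates once shared bandwidth is the binding
# constraint, so scaling stops being pro rata to the thread count.

PER_THREAD_TPS = 1000       # transactions/s a single thread achieves alone
BW_PER_TXN_MB = 0.5         # memory traffic per transaction, in MB
SHARED_BW_MB_S = 3200       # total memory bandwidth all threads share

def throughput(n_threads):
    compute_bound = n_threads * PER_THREAD_TPS
    bandwidth_bound = SHARED_BW_MB_S / BW_PER_TXN_MB
    return min(compute_bound, bandwidth_bound)

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} threads: {throughput(n):7.0f} txn/s")

# Scaling is linear up to about 6 threads here, then flat: adding
# threads without adding bandwidth and outstanding-request capacity
# buys nothing, which is Nick's point 2.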
From: Stephen Fuld on 22 Apr 2010 13:10

On 4/22/2010 9:54 AM, nmm1(a)cam.ac.uk wrote:
> In article <hqpqks$q26$1(a)news.eternal-september.org>,
> Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>> On 4/22/2010 3:04 AM, Morten Reistad wrote:

snip

>>> I can confirm this from benchmarks of real-life workloads for
>>> pretty static web servers, media servers, and SIP telephony systems.
>>> Cache size means everything in this context.
>>
>> Isn't this the kind of workload (relatively small instruction footprint,
>> lots of cache misses, and lots of independent threads) that could benefit
>> from a multi-threaded CPU?
>
> Not really. That's a common mistake. There are two reasons:
>
> 1) Running lots of threads usually slows each thread down by causing
> conflicts in some of the shared components. Some designs do better and
> some worse, of course.

While I certainly agree that each thread will run slower than it would
without multi-threading, that isn't the relevant question for this type
of problem. Servers are throughput oriented, so the question is whether
you get more transactions per second. Thus even if each thread is slower,
it may be (depending, of course, upon workload and implementation) that
the sum of the throughputs is greater. I probably didn't state that well,
but I think you understand what I mean.

> 2) The reason that throughput isn't pro rata to the number of threads
> is that you also need cache size, cache bandwidth, and memory bandwidth
> (and number of outstanding requests) pro rata to the number of threads.

I didn't say that the throughput was pro rata to the number of threads,
just that the throughput may be greater with multiple threads. I agree
you need cache and memory bandwidth per thread. But in the example in
the original paper, and I presume in the one Morten was talking about,
there is a high hit rate in the I-cache, so adding another thread won't
hurt much there, and the loss of "effective" d-cache size per thread may
be overcome by the increased number of threads.

Of course it might not be so, but it could be, for example, that each
transaction takes 33% longer (due to higher cache miss rate, contention,
etc.), but you get two transactions in that 133% of the time, thus
making it worthwhile.

--
- Stephen Fuld (e-mail address disguised to prevent spam)
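Stephen's closing arithmetic, spelled out. The 33% slowdown and the
two-thread count are his hypothetical figures:

# Hypothetical: under SMT each transaction takes 33% longer, but two
# run concurrently. Net effect on server throughput:

single_thread_time = 1.00    # normalized transaction latency, one thread
smt_time = 1.33              # per-transaction latency under SMT (33% slower)
smt_threads = 2

single_tps = 1 / single_thread_time     # 1.00 transactions per unit time
smt_tps = smt_threads / smt_time        # 2 / 1.33, about 1.50

print(f"throughput gain: {smt_tps / single_tps:.2f}x")

# Each individual request finishes later, but the server completes
# roughly 50% more transactions per second -- the latency-for-throughput
# trade Stephen describes.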
From: nmm1 on 22 Apr 2010 13:22

In article <hqpvtr$k26$1(a)news.eternal-september.org>,
Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:

snip

>> 1) Running lots of threads usually slows each thread down by causing
>> conflicts in some of the shared components. Some designs do better and
>> some worse, of course.
>
> While I certainly agree that each thread will run slower than it would
> without multi-threading, that isn't the relevant question for this type
> of problem. Servers are throughput oriented, so the question is whether
> you get more transactions per second. Thus even if each thread is slower,
> it may be (depending, of course, upon workload and implementation) that
> the sum of the throughputs is greater. I probably didn't state that well,
> but I think you understand what I mean.

Oh, yes, but I probably didn't make myself clear. What I mean is that
they slow down relatively more in a threaded CPU than in a more separated
multi-core design.

When I first looked at the "SMT" papers, I thought that it was a neat
idea but, as I have just posted, I couldn't follow the explanation and
so worked out the mathematics. I then discovered the reason that the
throughput comparisons did NOT include an SMT design versus a multi-core
one with the same transistor count. The latter would have made the
former look silly.

>> 2) The reason that throughput isn't pro rata to the number of threads
>> is that you also need cache size, cache bandwidth, and memory bandwidth
>> (and number of outstanding requests) pro rata to the number of threads.
>
> I didn't say that the throughput was pro rata to the number of threads,
> just that the throughput may be greater with multiple threads. I agree
> you need cache and memory bandwidth per thread. But in the example in
> the original paper, and I presume in the one Morten was talking about,
> there is a high hit rate in the I-cache, so adding another thread won't
> hurt much there, and the loss of "effective" d-cache size per thread may
> be overcome by the increased number of threads.

Most of the experience that I have heard of is that it is not so.

> Of course it might not be so, but it could be, for example, that each
> transaction takes 33% longer (due to higher cache miss rate, contention,
> etc.), but you get two transactions in that 133% of the time, thus
> making it worthwhile.

Yes, but that's not my point, which was that threaded CPUs are a silly
idea. True multi-core with slower CPUs is better for a lot of HPC work,
too, incidentally.

Regards,
Nick Maclaren.
From: Stephen Fuld on 22 Apr 2010 16:15
On 4/22/2010 10:22 AM, nmm1(a)cam.ac.uk wrote:
> In article <hqpvtr$k26$1(a)news.eternal-september.org>,
> Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:

snip

> Oh, yes, but I probably didn't make myself clear. What I mean is that
> they slow down relatively more in a threaded CPU than in a more separated
> multi-core design.
>
> When I first looked at the "SMT" papers, I thought that it was a neat
> idea but, as I have just posted, I couldn't follow the explanation and
> so worked out the mathematics. I then discovered the reason that the
> throughput comparisons did NOT include an SMT design versus a multi-core
> one with the same transistor count.

I don't see how you get a multi-core design with the same transistor
count as a multi-threaded one. I have seen figures of about 5% additional
logic for a second thread: mostly you duplicate the registers and add a
little control logic. But with two cores you clearly get 100% overhead,
duplicating the registers, the execution units, the L1 caches, and all
the other logic.

Note that Intel's latest designs seem to use both: each chip has multiple
cores, each of which is multi-threaded. And a high-end system will have
several of those on a single motherboard. I suspect that is the correct
approach.

> The latter would have
> made the former look silly.

Of course a design with nearly twice the number of transistors can
outperform the single-core design.

--
- Stephen Fuld (e-mail address disguised to prevent spam)
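Stephen's transistor-budget argument, as a toy calculation. The overhead
figures (5% for a second SMT thread, 100% for a second core) are his;
the throughput figures (an SMT thread adding about 30%, a second core
about 90% after shared memory-system contention) are illustrative
assumptions, not measurements:

# Throughput per transistor under the assumptions above.

designs = {
    "single core":     (1.00, 1.0),   # (relative transistors, relative throughput)
    "1 core + SMT":    (1.05, 1.3),
    "2 cores, no SMT": (2.00, 1.9),
}

for name, (transistors, tput) in designs.items():
    print(f"{name:16s} throughput/transistor = {tput / transistors:.2f}")

# Under these assumptions SMT wins per transistor (about 1.24 vs 0.95),
# which is Stephen's point; Nick's counter is that at a FIXED transistor
# budget those transistors could instead buy more cores or cache, and
# which assumption holds depends on the workload.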