Subject: Processors stall on OLTP workloads about half the time--almost no matter what you do
From: nmm1 on 25 Apr 2010 14:32

In article <4BD47F60.9000709(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/24/2010 2:10 AM, nmm1(a)cam.ac.uk wrote:
>
>> The killer is the amount of code that involves a load or branch
>> based on the contents of something that needs loading. You can
>> do nothing except either wait for the first load, or follow all
>> possible paths. The latter is O(log N) in a good case or
>> O(log log N) in a bad one!
>
>Tjaden and Flynn showed that it is sqrt(N), way back (1968?).
>
>I.e. that the speedup you could get by following all possible paths
>N levels deep was sqrt(N).
>
>This was empirical, for whatever workloads they had way back then.

I didn't know that, but that needs A^N logic. I was using the
fixed-size logic formula.

One 'scout' thread is a plausible design, two, perhaps. But it just
isn't going to scale.

Regards,
Nick Maclaren.
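The scaling gap in this exchange can be made concrete. Following every branch outcome N levels deep needs on the order of A^N speculative paths (Nick's "A^N logic"), while the Tjaden and Flynn result says the payoff is only about sqrt(N). A minimal sketch of that arithmetic, assuming one two-way branch per level (the function names and the branch-per-level model are mine, not from the thread):

```python
import math

def eager_paths(depth, branch_factor=2):
    """Speculative paths needed to follow every outcome 'depth' levels deep:
    A**N for branching factor A (here A = 2, one two-way branch per level)."""
    return branch_factor ** depth

def observed_speedup(depth):
    """Tjaden & Flynn's empirical result: following all paths N deep
    yields only about sqrt(N) speedup."""
    return math.sqrt(depth)

for n in (4, 16, 64):
    print(f"depth {n:>2}: {eager_paths(n):>22} paths for ~{observed_speedup(n):.1f}x speedup")
```

The hardware cost grows exponentially while the benefit grows as a square root, which is why a fixed-size design tops out at one or two scout threads.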
From: "Andy "Krazy" Glew" on 26 Apr 2010 22:38

On 4/25/2010 11:32 AM, nmm1(a)cam.ac.uk wrote:
> In article <4BD47F60.9000709(a)patten-glew.net>,
> Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>> On 4/24/2010 2:10 AM, nmm1(a)cam.ac.uk wrote:
>>
>>> The killer is the amount of code that involves a load or branch
>>> based on the contents of something that needs loading. You can
>>> do nothing except either wait for the first load, or follow all
>>> possible paths. The latter is O(log N) in a good case or
>>> O(log log N) in a bad one!
>>
>> Tjaden and Flynn showed that it is sqrt(N), way back (1968?).
>>
>> I.e. that the speedup you could get by following all possible paths
>> N levels deep was sqrt(N).
>>
>> This was empirical, for whatever workloads they had way back then.
>
> I didn't know that, but that needs A^N logic. I was using the
> fixed-size logic formula.
>
> One 'scout' thread is a plausible design, two, perhaps. But it just
> isn't going to scale.

Actually, I have had good speedups in simulations with 16 threads.
And I have limit studies with many, many more speculative threads.
The biggest problem is choosing which subset of the possible threads
is worth running on hardware with a limited number of thread contexts.

I think the big problem with past work such as Haitham's DMT was that
it was often constrained to use too few threads or processors --
e.g. Haitham's Itanium work with two processors, as compared to his
earlier DMT work with 4 threads.

I'll do a separate followup post, in a (probably vain) attempt to
change the topic. I don't think I have ever described my SpMT ideas
to this newsgroup. Nothing proprietary, just the by-now really old
stuff that I worked on at Wisconsin.

--

I misremembered your post, Nick, and thought you had said something
like "SpMT won't scale (whereas explicit MP will)". I composed a
reply in my head which I will nevertheless share, because it is an
interesting topic.
My first reply was going to be: I know SpMT doesn't scale as well as
explicit parallelism. Who cares? Use explicit parallelism when you
have it, and use SpMT to speed up some single-threaded apps that have
not been explicitly parallelized.

I'm not against explicit parallelism. A good half of my career has
been devoted to it. (The other half of my career has been devoted to
out-of-order. With the remaining halves: security, and performance
measurement and tuning.) I'm just trying to leverage parallelism to
improve single-threaded performance because, AFAICT, this will still
help a microprocessor achieve commercial success.

My second reply was going to be: Nothing scales. Not even parallelism.
E.g. take an exponential algorithm on a uniprocessor, time O(exp(N)).
Assume perfect parallelism: O(1) work on each of O(exp(N)) processors.
The time to execute would be O(1) if communication were free. But
since communication is not free, the time to execute in 2D VLSI is
O(sqrt(#processors)) = O(sqrt(exp(N))), which is still O(exp(N/2)),
i.e. still exponential, since the square root of an exponential is an
exponential. Similarly for 3D, with cube roots.
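The second reply's argument can be checked numerically. Here is a sketch under the post's own model (the runtime model is an assumption taken from the post, not a measurement; function names are mine):

```python
import math

# Model from the post: an O(exp(N)) algorithm parallelized perfectly onto
# P = e**N processors doing O(1) work each. On a 2D VLSI layout the
# communication distance dominates, so parallel time T ~ sqrt(P)
# = sqrt(e**N) = e**(N/2).

def parallel_time_2d(n):
    """Communication-bound runtime with e**n processors on a 2D mesh."""
    processors = math.exp(n)
    return math.sqrt(processors)  # identical to math.exp(n / 2)

# The square root of an exponential is still an exponential: runtime
# grows by the same constant factor, e**0.5, for every unit increase in N.
ratios = [parallel_time_2d(n + 1) / parallel_time_2d(n) for n in range(1, 5)]
print([round(r, 4) for r in ratios])  # prints [1.6487, 1.6487, 1.6487, 1.6487]
```

A constant growth ratio per unit of N is exactly the signature of exponential growth, which is the point of "nothing scales": the root changes the exponent's constant, not its character.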
From: nmm1 on 27 Apr 2010 03:17

In article <4BD64E09.8080605(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>>
>> One 'scout' thread is a plausible design, two, perhaps. But it just
>> isn't going to scale.
>
>Actually, I have had good speedups in simulations with 16 threads.
>And I have limit studies that have many, many, more speculative threads.

I was actually thinking of the benefit relative to cost, because
those threads will consume power.

Regards,
Nick Maclaren.
From: Robert Myers on 27 Apr 2010 12:56
George Neuner wrote:
>
> I'm particularly interested in any previous work on compiler-driven
> speculation. AFAICT the research along that line died with Algol.
>

A search of portal.acm.org turns up over 1000 hits for "compiler
speculation," and much of the work is relatively recent. I have 346
PDF documents in my own files that match "speculative," many of them
related to compiler-driven speculation -- so many that digging through
them to find the most relevant would be a big undertaking.

Robert.