Subject: Processors stall on OLTP workloads about half the time--almost no matter what you do
From: nmm1 on 25 Apr 2010 14:32

In article <4BD47F60.9000709(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/24/2010 2:10 AM, nmm1(a)cam.ac.uk wrote:
>
>> The killer is the amount of code that involves a load or branch
>> based on the contents of something that needs loading. You can
>> do nothing except either wait for the first load, or follow all
>> possible paths. The latter is O(log N) in a good case or
>> O(log log N) in a bad one!
>
>Tjaden and Flynn showed that it is sqrt(N), way back (1968?).
>
>I.e. that the speedup you could get by following all possible paths
>N levels deep was sqrt(N).
>
>This was empirical, for whatever workloads they had way back then.

I didn't know that, but that needs A^N logic. I was using the
fixed-size logic formula.

One 'scout' thread is a plausible design, two, perhaps. But it just
isn't going to scale.

Regards,
Nick Maclaren.
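The scaling gap in this exchange can be made concrete. Following every branch outcome N levels deep needs on the order of A^N speculative paths (Nick's "A^N logic"), while the Tjaden and Flynn result says the payoff is only about sqrt(N). A minimal sketch of that arithmetic, assuming one two-way branch per level (the function names and the branch-per-level model are mine, not from the thread):

```python
import math

def eager_paths(depth, branch_factor=2):
    """Speculative paths needed to follow every outcome 'depth' levels deep:
    A**N for branching factor A (here A = 2, one two-way branch per level)."""
    return branch_factor ** depth

def observed_speedup(depth):
    """Tjaden & Flynn's empirical result: following all paths N deep
    yields only about sqrt(N) speedup."""
    return math.sqrt(depth)

for n in (4, 16, 64):
    print(f"depth {n:>2}: {eager_paths(n):>22} paths for ~{observed_speedup(n):.1f}x speedup")
```

The hardware cost grows exponentially while the benefit grows as a square root, which is why a fixed-size design tops out at one or two scout threads.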
From: "Andy "Krazy" Glew" on 26 Apr 2010 22:38

On 4/25/2010 11:32 AM, nmm1(a)cam.ac.uk wrote:
> In article <4BD47F60.9000709(a)patten-glew.net>,
> Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>> On 4/24/2010 2:10 AM, nmm1(a)cam.ac.uk wrote:
>>
>>> The killer is the amount of code that involves a load or branch
>>> based on the contents of something that needs loading. You can
>>> do nothing except either wait for the first load, or follow all
>>> possible paths. The latter is O(log N) in a good case or
>>> O(log log N) in a bad one!
>>
>> Tjaden and Flynn showed that it is sqrt(N), way back (1968?).
>>
>> I.e. that the speedup you could get by following all possible paths
>> N levels deep was sqrt(N).
>>
>> This was empirical, for whatever workloads they had way back then.
>
> I didn't know that, but that needs A^N logic. I was using the
> fixed-size logic formula.
>
> One 'scout' thread is a plausible design, two, perhaps. But it just
> isn't going to scale.

Actually, I have had good speedups in simulations with 16 threads.
And I have limit studies with many, many more speculative threads.
The biggest problem is choosing which subset of the possible threads
is worth running on hardware with a limited number of thread contexts.

I think the big problem with past work such as Haitham's DMT was that
it was often constrained to use too few threads or processors --
e.g. Haitham's Itanium work with two processors, as compared to his
earlier DMT work with 4 threads.

I'll do a separate followup post, in a (probably vain) attempt to
change the topic. I don't think I have ever described my SpMT ideas
to this newsgroup. Nothing proprietary, just the by-now really old
stuff that I worked on at Wisconsin.

--

I misremembered your post, Nick, and thought you had said something
like "SpMT won't scale (whereas explicit MP will)". I composed a
reply in my head which I will nevertheless share, because it is an
interesting topic.
My first reply was going to be: I know SpMT doesn't scale as well as
explicit parallelism. Who cares? Use explicit parallelism when you
have it, and use SpMT to speed up some single-threaded apps that have
not been explicitly parallelized.

I'm not against explicit parallelism. A good half of my career has
been devoted to it. (The other half of my career has been devoted to
out-of-order. With the remaining halves: security, and performance
measurement and tuning.) I'm just trying to leverage parallelism to
improve single-threaded performance because, AFAICT, this will still
help a microprocessor achieve commercial success.

My second reply was going to be: Nothing scales. Not even parallelism.
E.g. take an exponential algorithm on a uniprocessor, time O(exp(N)).
Assume perfect parallelism: O(1) work on each of O(exp(N)) processors.
The time to execute would be O(1) if communication were free. But
since communication is not free, the time to execute in 2D VLSI is
O(sqrt(#processors)) = O(sqrt(exp(N))), which is still O(exp(N/2)),
i.e. still exponential, since the square root of an exponential is an
exponential. Similarly for 3D, with cube roots.
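The second reply's argument can be checked numerically. Here is a sketch under the post's own model (the runtime model is an assumption taken from the post, not a measurement; function names are mine):

```python
import math

# Model from the post: an O(exp(N)) algorithm parallelized perfectly onto
# P = e**N processors doing O(1) work each. On a 2D VLSI layout the
# communication distance dominates, so parallel time T ~ sqrt(P)
# = sqrt(e**N) = e**(N/2).

def parallel_time_2d(n):
    """Communication-bound runtime with e**n processors on a 2D mesh."""
    processors = math.exp(n)
    return math.sqrt(processors)  # identical to math.exp(n / 2)

# The square root of an exponential is still an exponential: runtime
# grows by the same constant factor, e**0.5, for every unit increase in N.
ratios = [parallel_time_2d(n + 1) / parallel_time_2d(n) for n in range(1, 5)]
print([round(r, 4) for r in ratios])  # prints [1.6487, 1.6487, 1.6487, 1.6487]
```

A constant growth ratio per unit of N is exactly the signature of exponential growth, which is the point of "nothing scales": the root changes the exponent's constant, not its character.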
From: nmm1 on 27 Apr 2010 03:17

In article <4BD64E09.8080605(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>>
>> One 'scout' thread is a plausible design, two, perhaps. But it just
>> isn't going to scale.
>
>Actually, I have had good speedups in simulations with 16 threads.
>And I have limit studies that have many, many, more speculative threads.

I was actually thinking of the benefit relative to cost, because
those threads will consume power.

Regards,
Nick Maclaren.
From: Robert Myers on 27 Apr 2010 12:56
George Neuner wrote:
>
> I'm particularly interested in any previous work on compiler-driven
> speculation. AFAICT the research along that line died with Algol.
>

A search of portal.acm.org turns up over 1000 hits for "compiler
speculation," and much of the work is relatively recent. I have 346
PDF documents in my own files that match "speculative," many of them
related to compiler-driven speculation -- so many that digging through
them to find the most relevant would be a big undertaking.

Robert.