From: Robert Myers on 1 Jun 2010 14:28

On Jun 1, 10:55 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>
> I forgot to say that the Kilo-instruction paper speedups I report above make several assumptions that are probably
> conservative.  E.g. 4-wide.
>

It's a thought-provoking paper.

I really didn't mean to be insulting, just provocative.  I wanted you to talk more, and you did.

I was thinking of David Dinucci's version of coarse-grained parallelism when I read your proposal.  In that approach, you snip the problem into carefully-chosen pieces ahead of time, carefully specify what's coming in and what should go out, and leave it to a run-time scheduler to put it all back together.  Has some nice formal properties.  That was my context for trying to understand what you were proposing.

I pictured you looking at his proposal and saying, "You want parallel lumps?  Easy."  You take a meat cleaver to a serial instruction stream, whack it into bite-sized chunks, and say, "There.  What else are you looking for?"  There had to be some serious potential ugliness hiding in what you were proposing.  Or maybe the whole idea of trying to identify the parallel pieces on the programming end was just wrong-headed to begin with.  I'm still thinking.

Thinking about how much you might throw into the pot at once and what you might do with it once it's there is a useful exercise, whether it leads to a commercially-viable architecture or not.

Robert.
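For concreteness, here is a minimal sketch of the general shape of that model: pieces that declare their inputs and outputs, and a run-time scheduler that launches a piece once its inputs exist.  This is only an illustration of the idea, not Dinucci's actual system; the task table, data names, and scheduler below are invented for the example.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical illustration: each "piece" declares what it reads and what it
    # produces; a tiny run-time scheduler launches any piece whose inputs exist,
    # running independent pieces concurrently.
    TASKS = [
        ("load",   [],                    ["raw"],     lambda d: {"raw": list(range(8))}),
        ("square", ["raw"],               ["squares"], lambda d: {"squares": [x * x for x in d["raw"]]}),
        ("halve",  ["raw"],               ["halves"],  lambda d: {"halves": [x / 2 for x in d["raw"]]}),
        ("merge",  ["squares", "halves"], ["out"],
                   lambda d: {"out": list(zip(d["squares"], d["halves"]))}),
    ]

    def run(tasks):
        data, pending = {}, list(tasks)
        with ThreadPoolExecutor() as pool:
            while pending:
                # A piece is ready once everything it consumes has been produced.
                ready = [t for t in pending if all(i in data for i in t[1])]
                if not ready:
                    raise RuntimeError("cycle or missing input")
                futures = [pool.submit(t[3], dict(data)) for t in ready]
                for f in futures:
                    data.update(f.result())
                pending = [t for t in pending if t not in ready]
        return data

    print(run(TASKS)["out"])

Here "square" and "halve" have no dependence on each other, so the scheduler is free to run them in parallel; "merge" waits for both.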
From: nmm1 on 1 Jun 2010 14:48

In article <4C041839.90907(a)patten-glew.net>,
Andy 'Krazy' Glew <ag-news(a)patten-glew.net> wrote:
>
> So, first, it is well-known that there is some potential in simply
> increasing instruction window size.  The problem is how costly is the
> hardware, esp. power wise, to do so.

Yes.

> E.g. the kilo-instruction papers report 3.5x for FP and 1.5X for
> integer, going from instruction windows of 128 to 4K instructions.
>
> Now, that is not a very good payoff.  Definitely sublinear. ....

Bluntly, it's dire.  There are lots of other ways to optimise the 'floating-point' workloads to that scale, and a 50% improvement for a 32-fold increase in window size (for integer) is horrible.

> But the whole point is not that we are trying to build processors that
> are 4096/128 = 32X bigger.  First, the OOO hardware is only a fraction
> of processor area, especially if you have big vector units like some
> recent proposals.  Second, more importantly, to build big effective
> instruction windows you do not need to build all that hardware.  As the
> kilo-instruction papers, and my own work, shows.

I have downloaded the paper and will look at it, but the very fact that people are considering such extreme measures shows that we are running out of viable ideas.  I am not denying that small speedups can be obtained, expensively, but we have to rethink from a much more basic starting point if we want large ones.

And there is no way that we will get there without pain - by which I mean to the perpetrators of the current code!  We have absolutely got to simplify the hardware's task (which is one of the few points where I agree with the Itanic's designers).

Regards,
Nick Maclaren.
From: Robert Myers on 1 Jun 2010 15:31

On Jun 1, 2:48 pm, n...(a)cam.ac.uk wrote:
> In article <4C041839.90...(a)patten-glew.net>,
> Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>
> > So, first, it is well-known that there is some potential in simply
> > increasing instruction window size.  The problem is how costly is the
> > hardware, esp. power wise, to do so.
>
> Yes.
>
> > E.g. the kilo-instruction papers report 3.5x for FP and 1.5X for
> > integer, going from instruction windows of 128 to 4K instructions.
>
> > Now, that is not a very good payoff.  Definitely sublinear. ....
>
> Bluntly, it's dire.  There are lots of other ways to optimise the
> 'floating-point' workloads to that scale, and a 50% improvement for
> a 32-fold increase in window size (for integer) is horrible.
>

But what are the alternatives?  Computers deal with ever-larger datasets, with ever-increasing numbers of processors, and the data gradually slide off the latency-tolerance horizon of any given processor.  Faster memory?  Chaa, right.  Violate apparently basic laws of physics?  Maybe with some as yet undiscovered twist on the EPR paradox.

Personally, I think speed-up is the wrong focus.  Sooner or later, for reasons that seem pretty fundamental to me, we will be back to bragging, as Intel once did, about instructions in flight.

Robert.
From: nmm1 on 1 Jun 2010 15:52

In article <0b318bfc-a635-4867-bdae-0b51ecae7156(a)m21g2000vbr.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>
> But what are the alternatives?  Computers deal with ever-larger
> datasets, with ever-increasing numbers of processors, and the data
> gradually slide off the latency-tolerance horizon of any given
> processor.  Faster memory?  Chaa, right.  Violate apparently basic
> laws of physics?  Maybe with some as yet undiscovered twist on the EPR
> paradox.

Hang on.  You have seen me ride my hobby-horse off into the sunset often enough!

We aren't going to get anywhere until we change the current approaches to programming - not just the languages, but the paradigms used.  In the most extreme case, I take that as far as saying that we need to change the way that mathematics and algorithms are taught!  But let's not go that far today (it is getting late).

We know that we can do a lot better with far less difficulty than most people will admit - and will get better RAS, too!  Fortran has done a lot of that, and that's a language from way back.  But it assuredly doesn't come without effort.

> Personally, I think speed-up is the wrong focus.  Sooner or later, for
> reasons that seem pretty fundamental to me, we will be back to
> bragging, as Intel once did, about instructions in flight.

God help us, yes :-(

Regards,
Nick Maclaren.
From: Andy 'Krazy' Glew on 2 Jun 2010 00:15
On 6/1/2010 11:28 AM, Robert Myers wrote:
> On Jun 1, 10:55 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>
>> I forgot to say that the Kilo-instruction paper speedups I report above make several assumptions that are probably
>> conservative.  E.g. 4-wide.
>
> It's a thought-provoking paper.
>
> I really didn't mean to be insulting, just provocative.  I wanted you
> to talk more, and you did.
>
> I was thinking of David Dinucci's version of coarse-grained
> parallelism when I read your proposal.  In that approach, you snip the
> problem into carefully-chosen pieces ahead of time, carefully specify
> what's coming in and what should go out, and leave it to a run-time
> scheduler to put it all back together.  Has some nice formal
> properties.  That was my context for trying to understand what you
> were proposing.
>
> I pictured you looking at his proposal and saying, "You want parallel
> lumps?  Easy."  You take a meat cleaver to a serial instruction
> stream, whack it into bite-sized chunks, and say, "There.  What else
> are you looking for?"  There had to be some serious potential ugliness
> hiding in what you were proposing.  Or maybe the whole idea of trying
> to identify the parallel pieces on the programming end was just
> wrong-headed to begin with.  I'm still thinking.
>
> Thinking about how much you might throw into the pot at once and what
> you might do with it once it's there is a useful exercise, whether it
> leads to a commercially-viable architecture or not.

Although I have been having supper once a month with David Dinucci, ever since the Portland "Geek Dinner/Friends of Eugene Miya" started, I have not looked much at his work.  Must do so.

Actually, I have spent much of the last twenty years trying to identify the parallel pieces of the program.  SpMT is part of that - restricting the pieces to start/stop at function boundaries (or loops, or IF/ENDIF).  But while such control flow techniques work pretty well, I have not yet figured out how to build hardware to automatically analyze the data dependencies - or at least hardware that does such analysis in a way that is significantly cheaper than executing the code itself.  In my terminology, I say that such analysis is accomplished a posteriori, after forking the candidate parallel threads, rather than a priori, beforehand.

For this reason I am very, very happy if a programmer or compiler is willing to do so for me.  Even if just in the form of hints.

Now, by the way, I am reasonably happy with approaches that do such compiler-style analysis in "hardware", or, rather, firmware or microcode.  The log-based mechanisms that I am so fond of even provide such firmware with much more data than they would have if statically compiling, and more than most JIT code generators have access to.  You can execute the code once or a few times, record the log, and analyze the log to see where the dependences occur, and do not.  Such an approach is certainly as good as a compiler or programmer can be - since we would always give them HINT instructions to control the fork mechanism themselves - and potentially better, since the firmware has more information to analyze than does a compiler.

But it has been my observation that a priori predictors always do better than a posteriori.  The cost of making those few initial wrong guesses can dominate, especially for code, such as transaction processing, that tends to thrash the I-cache.
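As a toy illustration of the record-a-log-then-analyze idea: treat the log as the dynamic sequence of memory accesses from one run, and ask whether a candidate fork point would have had a read-after-write dependence crossing it.  This is only a sketch of the general shape of such an analysis, not the actual log-based mechanism; the log format, example addresses, and the check itself are invented for the example.

    def cross_fork_dependences(log, fork):
        """Addresses the speculative thread (entries >= fork) reads before writing
        them itself, and that the earlier thread (entries < fork) writes."""
        written_before = {addr for seq, addr, kind in log[:fork] if kind == "W"}
        spec_written, risky = set(), set()
        for seq, addr, kind in log[fork:]:
            if kind == "W":
                spec_written.add(addr)
            elif addr in written_before and addr not in spec_written:
                risky.add(addr)
        return risky

    # One recorded run: (dynamic sequence number, address, R/W).
    log = [
        (0, 0x100, "W"), (1, 0x104, "W"), (2, 0x100, "R"),   # before the candidate fork
        (3, 0x104, "R"), (4, 0x108, "W"), (5, 0x108, "R"),   # code after the candidate fork
    ]

    # Candidate fork at position 3, i.e. run entries 3..5 as a speculative thread.
    print(cross_fork_dependences(log, fork=3))   # -> {260}, i.e. 0x104 crosses the fork

An empty result would suggest the fork is a good candidate on this run; a non-empty one identifies the addresses the fork mechanism would have to synchronize on or risk a violation.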
--

You are correct in observing that my batch scheduling across processor/cluster boundaries is a retreat from such analysis.

But, I emphasize: "batch" is just a technique for creating a larger contiguous instruction window.  Its strength lies in its simplicity.  It does not have great upside, and it isn't supposed to.

I am just assembling a set of tools or components for a next generation out-of-order machine:

* Multicluster multithreading
* => cheap thread forking and migration between clusters and processors
* Batch scheduling across cluster boundaries to use those clusters for a single thread
* SpMT ... ditto ...
* Log-based mechanisms

All, or nearly all, of these mechanisms also being useful for parallel programming.
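To make the contiguous-batch picture concrete, here is a toy sketch: carve a dynamic instruction stream into contiguous batches, deal them out to clusters, and list the register live-ins each batch would need forwarded from earlier batches.  Again, this is only an illustration with an invented instruction format, not a description of the actual batch scheduling hardware.

    # Toy instruction: (dest_register, source_registers).  Carve the dynamic stream
    # into contiguous batches, assign them round-robin to clusters, and report each
    # batch's live-ins: registers it reads that were last written in an earlier batch.
    STREAM = [
        ("r1", []),          ("r2", ["r1"]),     ("r3", ["r1", "r2"]),
        ("r4", ["r3"]),      ("r1", ["r4"]),     ("r5", ["r1", "r2"]),
        ("r6", ["r5"]),      ("r7", ["r6", "r3"]),
    ]

    def batches(stream, batch_size, n_clusters):
        defined_so_far = set()
        for b, start in enumerate(range(0, len(stream), batch_size)):
            batch = stream[start:start + batch_size]
            local_defs, live_ins = set(), set()
            for dest, srcs in batch:
                # A source is a live-in if it was produced by an earlier batch
                # and has not been redefined within this batch before the read.
                live_ins |= {r for r in srcs if r not in local_defs and r in defined_so_far}
                local_defs.add(dest)
            defined_so_far |= local_defs
            yield b % n_clusters, batch, sorted(live_ins)

    for cluster, batch, live_ins in batches(STREAM, batch_size=3, n_clusters=2):
        print(f"cluster {cluster}: {len(batch)} insts, live-ins from earlier batches: {live_ins}")

The point of the sketch is how little analysis the contiguous carving needs: the only cross-cluster traffic it has to get right is the forwarding of those live-in register values.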