Processors stall on OLTP workloads about half the time--almost no matter what you do
From: Anne & Lynn Wheeler on 22 Apr 2010 13:40

Robert Myers <rbmyersusa(a)gmail.com> writes:
> I had thought the idea of having lots of threads was precisely to get
> the memory requests out. You start a thread, get some memory requests
> out, and let it stall, because it's going to stall, anyway.
>
> Cache size and bandwidth and memory bandwidth are another matter.

in mid-70s, there was a multithreaded project for the 370/195 (that never shipped). The 370/195 had a 64-instruction pipeline, but no branch prediction or speculative execution ... so common branches stalled the pipeline. Highly tuned codes with some kinds of looping branches within the pipeline could have peak thruput of 10mips ... however, branch stalls in most code tended to hold thruput to five mips.

the objective of the emulated two-processor (double registers, instruction address, etc ... but no additional pipeline or execution units) was to compensate for branch stalls (i.e. instructions, operations, resources in the pipeline would have a one-bit flag for the instruction stream they were associated with). Having a pair of instruction streams running normal code (each peaking at 5mip/sec thruput) ... then had a chance of effectively utilizing/saturating the available 195 resources (10mip aggregate).

however, retrofitting virtual memory to the 370/195 was effectively impossible ... which possibly accounted for it never getting out (the original 370/195 tweaked the 360/195 with the original newly announced 370 features ... but that was before virtual memory was announced). even retrofitting virtual memory to the 370/165 was a very difficult task ... and that difficulty accounted for dropping a lot of features from the original 370 virtual memory architecture.

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
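The stall-hiding idea above can be sketched with a toy issue-slot simulation. This is a hypothetical model, not 370/195 timings: each stream issues one instruction per cycle but stalls for STALL cycles after every BRANCH_EVERY instructions, and the constants are chosen only to reproduce the 5-vs-10 mips ratio described in the post. With one stream the pipeline sits idle half the time; a second stream fills the stall cycles.

```c
#define BRANCH_EVERY 4   /* instructions issued between branch stalls  */
#define STALL        4   /* stall cycles per branch (assumed, not 195) */
#define MAX_STREAMS  2

typedef struct {
    long issued;      /* instructions issued by this stream */
    int  stall_left;  /* remaining stall cycles             */
} stream_t;

/* Simulate `cycles` cycles with `n` interleaved instruction streams
 * sharing a single issue slot; return total instructions issued.
 * A stalled stream yields the slot to the next ready stream. */
long run(int n, long cycles) {
    stream_t s[MAX_STREAMS] = {{0, 0}, {0, 0}};
    long total = 0;
    for (long c = 0; c < cycles; c++) {
        int slot_used = 0;
        for (int i = 0; i < n; i++) {
            if (!slot_used && s[i].stall_left == 0) {
                s[i].issued++;          /* this stream gets the slot */
                total++;
                slot_used = 1;
                if (s[i].issued % BRANCH_EVERY == 0)
                    s[i].stall_left = STALL;   /* hit a branch: stall */
            } else if (s[i].stall_left > 0) {
                s[i].stall_left--;      /* stall cycles tick regardless */
            }
        }
    }
    return total;
}
```

With these constants, one stream achieves 0.5 instructions per cycle (the "5 mips" case) and two streams interleave to 1.0 (the "10 mip aggregate" case), because each stream's stall window is exactly covered by the other stream's issue window.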
From: Anne & Lynn Wheeler on 23 Apr 2010 09:32

Robert Myers <rbmyersusa(a)gmail.com> writes:
> This logic always made sense to me, but Nick claims it doesn't work.
> If it doesn't work, it has to be because of pressure on the cache or
> because the thread that stalls is holding a lock that the other thread
> needs.

re:
http://www.garlic.com/~lynn/2010h.html#44 Processors stall on OLTP workloads about half the time--almost no matter what you do
http://www.garlic.com/~lynn/2010h.html#45 Processors stall on OLTP workloads about half the time--almost no matter what you do

multiple processor operation introduces serialization operations that don't exist in purely single-processor operation. this can be as bad as a 20-30 percent overhead increase. in the single-processor case, this can wipe out any expected benefit from running it as an emulated two-processor using processor threads. it isn't as much of a factor if already running multi-processor operation (two or more real processors) and adding emulated additional processors with hardware threads.

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
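The break-even arithmetic behind that point can be made explicit. A hypothetical back-of-envelope model (not from the post): if switching to multiprocessor mode costs a fixed fraction of all cycles in serialization overhead, the threading speedup must exceed 1/(1-overhead) before the emulated two-processor beats the plain single processor.

```c
/* Minimum threading speedup needed to overcome a fixed SMP
 * serialization overhead (fraction of cycles lost):
 *   base * speedup * (1 - overhead) > base
 *   =>  speedup > 1 / (1 - overhead)                       */
double breakeven_speedup(double overhead) {
    return 1.0 / (1.0 - overhead);
}

/* Aggregate throughput (in mips) under the same naive model. */
double mp_mips(double base_mips, double thread_speedup, double overhead) {
    return base_mips * thread_speedup * (1.0 - overhead);
}
```

At 25% overhead the break-even speedup is about 1.33x. If the hardware threads deliver the full 2x of the 195 scenario, 5 mips becomes 5 * 2.0 * 0.75 = 7.5 mips, still a win; but a more modest 1.2x threading gain yields 5 * 1.2 * 0.75 = 4.5 mips, worse than leaving the machine in single-processor mode, which is the "wiped out" case.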
From: George Neuner on 28 Apr 2010 14:33

On Tue, 27 Apr 2010 18:08:41 -0700 (PDT), Robert Myers <rbmyersusa(a)gmail.com> wrote:

>On Apr 27, 5:08 pm, George Neuner <gneun...(a)comcast.net> wrote:
>> On Tue, 27 Apr 2010 12:56:14 -0400, Robert Myers
>
>> I'm not really seeking a discussion on all of this because it will
>> quickly become very technical (and redundant as some of the things
>> have been discussed in comp.compilers). I just wanted more
>> information on what Andy was doing because his description sounded
>> interesting.
>
>I'll probably have a look at what might have been said on
>comp.compilers, but, as to your tone, this list is *not*
>comp.compilers. Comp.arch has had long dry spells. At least people
>are talking. If you need a place to be pompous, I suggest you choose
>a moderated list where you are a part of the moderator's club.
>
>Robert.

I apologize for my choice of words. I didn't mean to be pompous or for there to be any tone wrt the forum ... I really only meant to convey that a discussion would be off-topic here.

George
From: George Neuner on 28 Apr 2010 15:36

On Tue, 27 Apr 2010 18:08:41 -0700 (PDT), Robert Myers <rbmyersusa(a)gmail.com> wrote:

>Most of the work I'm aware of is aimed at identifying those execution
>paths that can be speculatively executed to speed up garden variety
>computation with what were at the time standard test cases (gcc, bzip,
>etc.). The speculative paths are set up by the compiler without
>programmer intervention, other than making required profiling runs.

Yes, I've seen some of that work. The original lines of research pretty much dried up with the general adoption of hardware branch speculation (the ability to conditionally execute instructions while waiting for a branch condition to resolve, and to abort the path if the branch goes against it).

What remains mostly is research into ways of recognizing repetitious patterns of data access in linked data structures (lists, trees, graphs, tries, etc.) and automatically prefetching data in advance of its use. I haven't followed this research too closely, but my impression is that it remains a hard problem.

George
From: George Neuner on 29 Apr 2010 15:37
On Wed, 28 Apr 2010 17:35:32 -0700 (PDT), Robert Myers <rbmyersusa(a)gmail.com> wrote:

>On Apr 28, 3:36 pm, George Neuner <gneun...(a)comcast.net> wrote:
>
>> What remains mostly is research into ways of recognizing repetitious
>> patterns of data access in linked data structures (lists, trees,
>> graphs, tries, etc.) and automatically prefetching data in advance of
>> its use. I haven't followed this research too closely, but my
>> impression is that it remains a hard problem.
>
>I suspect that explains a mysterious private email I got while
>publicly discussing Itanium and profile-directed optimization. The
>email claimed that a well-known compiler developer that he worked for
>had found means to predict irregular data access from static analysis
>so that the compiler could supply prefetch hints even for an irregular
>memory stride.

Interesting ... if it's true, it's the first I heard about it. I've read about success prefetching in lists and in search trees (although prefetching N-way trees with large N creates a cache pollution problem), but AFAIK prefetching in more general graph structures has eluded a practical solution.

George
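The list case mentioned above can be sketched concretely. A minimal example of greedy one-node-lookahead software prefetching during a list traversal, assuming GCC/Clang (`__builtin_prefetch` is their builtin, and it never faults, so handing it a possibly NULL address is safe). It also illustrates why pointer chasing is hard to prefetch far ahead: the address of node k+2 is only known once node k+1 is already in cache.

```c
#include <stdlib.h>

typedef struct node {
    long value;
    struct node *next;
} node;

/* Traverse and sum the list, prefetching one node ahead of use:
 * while working on node p, request p->next->next so it is (with
 * luck) in cache by the time the traversal reaches it. */
long sum_with_prefetch(const node *p) {
    long total = 0;
    while (p) {
        if (p->next)  /* args: address, rw=0 (read), locality=1 (low) */
            __builtin_prefetch(p->next->next, 0, 1);
        total += p->value;
        p = p->next;
    }
    return total;
}

/* Demo helper (hypothetical, for testing only): build the list
 * 1 -> 2 -> ... -> n by pushing nodes onto the front. */
node *make_list(int n) {
    node *head = NULL;
    for (int i = n; i >= 1; i--) {
        node *q = malloc(sizeof *q);
        q->value = i;
        q->next = head;
        head = q;
    }
    return head;
}
```

For a regular array stride the hardware prefetcher would do this automatically; the whole research problem Neuner describes is that for trees and general graphs there is no single "next" pointer to chase, and fanning the prefetch out across N children trades latency for the cache pollution noted above.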