From: Morten Reistad on 22 Apr 2010 17:33

In article <hqpqks$q26$1(a)news.eternal-september.org>,
Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>On 4/22/2010 3:04 AM, Morten Reistad wrote:
>> In article <86666a83-4bed-472c-aacd-9fc6ef47e9e6(a)k33g2000yqc.googlegroups.com>,
>> MitchAlsup <MitchAlsup(a)aol.com> wrote:
>>> On Apr 21, 11:02 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
>>>> Even though the paper distinguishes between technical and commercial
>>>> workloads, and draws its negative conclusion only for commercial
>>>> workloads, it was interesting to me that, for instance, Blue Gene went
>>>> the same direction--many simple processors--for a technical workload so
>>>> as to achieve low power operation.
>>>
>>> Reading between the lines, commercial and DB workloads are better
>>> served by slower processors accessing a thinner cache/memory hierarchy
>>> than by faster processors accessing a thicker cache/memory hierarchy.
>>> That is: a commercial machine is better served with a larger first-level
>>> cache backed up by a large second-level cache running at slower
>>> frequencies, while a technical machine would be better served with
>>> smaller first-level caches, a medium second-level cache and a large
>>> third-level cache running at higher frequencies.
>>
>> I can confirm this from benchmarks for real-life workloads for
>> pretty static web servers, media servers and SIP telephony systems.
>> The cache size means everything in this context.
>
>Isn't this the kind of workload (relatively small instruction footprint,
>lots of cache misses, and lots of independent threads)
>that could benefit from a multi-threaded CPU?

Lots of these Internet services have such a signature: relatively light
processing compared to I/O, with simple transforms done on the data. The
instruction footprint is small, and there is relatively good short-term
(nanoseconds) locality on the data, but the mid-term (milliseconds)
working set expands linearly, or faster, with load.

This applies to web caches, real media broad/manycasters, telephony, and
all kinds of authentication/profile servers (from DHCP to LDAP/RADIUS).
They either end up cache-limited, or they have sufficient processing for
the CPUs to get loaded and the power usage to go up.

>> Actually, the Hyperchannel cache interconnects work very nicely to
>> make all the on-chip caches work as a system-wide cache, not as
>> die-wide. I may even suggest L4 cache attachment to static on-chip
>> hyperchannel memory; like a Xeon with no CPUs.
>
>Another advantage of this is that this "extra" chip provides more pins
>thus allowing higher system memory bandwidth for those applications that
>need it.

Yep. The 8-way Xeon (4 chips, multiple hyperchannels) has fantastic
performance for these services. The 32-way, with no cache-to-cache links
off-chip and a smaller L3 cache, performs abysmally in comparison. But it
makes a good transcoding and media-mixing engine.

-- mrr
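To make the "working set expands linearly with load" signature concrete,
here is a small illustrative microbenchmark. It is a sketch, not code from
the thread: the 16 KB per-session state, the session counts and the access
pattern are arbitrary assumptions. On typical hardware the time per batch
of requests jumps once the aggregate session state no longer fits in the
last-level cache, which is the cache-limited regime described above.

    // Illustrative sketch only: per-session state is small, but the aggregate
    // working set grows linearly with the number of concurrent sessions, so a
    // service like this becomes cache-limited long before it becomes CPU-limited.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t kSessionBytes = 16 * 1024;   // assumed per-session state
        const std::size_t kRequests     = 1 << 18;     // requests per configuration
        for (std::size_t sessions : {64, 512, 4096, 32768}) {  // 1 MB .. 512 MB total
            std::vector<std::uint8_t> state(sessions * kSessionBytes, 1);
            std::uint64_t sum = 0;
            std::size_t s = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (std::size_t r = 0; r < kRequests; ++r) {
                s = (s * 2654435761u + 1) % sessions;  // pick a pseudo-random session
                const std::uint8_t* p = &state[s * kSessionBytes];
                for (std::size_t b = 0; b < kSessionBytes; b += 64)  // touch one line per 64 B
                    sum += p[b];
            }
            auto t1 = std::chrono::steady_clock::now();
            double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
            std::printf("%6zu sessions (%4zu MB total): %10.0f us (checksum %llu)\n",
                        sessions, (sessions * kSessionBytes) >> 20, us,
                        static_cast<unsigned long long>(sum));
        }
        return 0;
    }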
From: Morten Reistad on 22 Apr 2010 17:39

In article <hqpv0v$3vq$1(a)smaug.linux.pwf.cam.ac.uk>, <nmm1(a)cam.ac.uk> wrote:
>In article <hqpqks$q26$1(a)news.eternal-september.org>,
>Stephen Fuld <SFuld(a)Alumni.cmu.edu.invalid> wrote:
>>On 4/22/2010 3:04 AM, Morten Reistad wrote:
>>Isn't this the kind of workload (relatively small instruction footprint,
>>lots of cache misses, and lots of independent threads)
>>that could benefit from a multi-threaded CPU?
>
>Not really. That's a common mistake. There are two reasons:
>
> 1) Running lots of threads usually slows each thread down by
>causing conflicts in some of the shared components. Some designs
>do better and some worse, of course.

Some tasks are well scalable and have good locality, and we can let Linux
handle the coordination. Thus the threads are full processes with just
minimal semaphore etc. handling.

> 2) The reason that throughput isn't pro-rata to the number of
>threads is that you also need cache size and bandwidth and memory
>bandwidth (and number of outstanding requests) pro rata to the
>number of threads. Now, back in the real world ....

Exactly. But there are three things that help enormously.

Interrupt coalescing, in Linux since 2.6.24, gives an order of magnitude
more I/O performance on networks and similar devices.

Inter-cache links help with the cache flushes. It is amazing how much this
helps. It leads me to suggest handling the whole cache more intelligently;
those tasks that are not CPU-bound become cache-bound.

And this is from the real world. I compare such systems every day.

-- mrr
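As a rough illustration of why interrupt coalescing buys so much on this
kind of light-processing, heavy-I/O workload, here is a toy model. It is
not kernel code, and the 5 us interrupt cost, 1 us per-packet cost and
batch size of 64 are made-up assumptions; the point is only that the fixed
per-interrupt overhead gets paid once per batch instead of once per packet.

    // Toy model of interrupt coalescing: the per-interrupt overhead is paid
    // once per batch instead of once per packet. All costs are assumed,
    // illustrative numbers, not measurements.
    #include <cstdio>

    int main() {
        const double irq_overhead_us = 5.0;   // assumed cost of taking one interrupt
        const double per_packet_us   = 1.0;   // assumed cost of processing one packet
        const double packets         = 1e6;
        const double batch           = 64;    // packets handled per coalesced interrupt

        double naive     = packets * (irq_overhead_us + per_packet_us);
        double coalesced = (packets / batch) * irq_overhead_us + packets * per_packet_us;

        std::printf("one interrupt per packet: %.0f us of CPU\n", naive);
        std::printf("coalesced (batch of 64):  %.0f us of CPU (%.1fx less)\n",
                    coalesced, naive / coalesced);
        return 0;
    }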
From: "Andy "Krazy" Glew" on 23 Apr 2010 10:58 On 4/23/2010 12:51 AM, Andy "Krazy" Glew wrote: > There are two sorts of database workloads: > ... > > The second class tends to do things like joins. Which often have parts > that look like > > LOOP small partition of left that stays in cache > traverse left Btrees > LOOP over all right that misses cache > traverse Btree for right > END LOOP > END LOOP > > for various permutations of the inner and outer loops. > > For such JOINs, the inner loop may get clogged up wuth the cache misses, > but the next iteration of the outer loop is essentially independent. > E.g. if you are joining records, and not computing a sum. Even if you > are computing a sum, you can parallelize all except the sum dependence. > > > This is the classic example of code that does well with loop based > speculative multithreading. > > Yes, software can do the parallelization too. And it often does. But it > still tends to leave large chunks that look like the above. Which SpMT > can handle. By the way: in many ways such JOIN code is best expressed, in software, with explicit but lightweight multithreading: LOOP small partition of left that stays in cache traverse left Btrees LOOP over all right that misses cache FORK a thread to traverse Btree for right END LOOP END LOOP where "best expressed" means natural, easy to read, etc. (The non-threaded version above is okay to read; but I have seen attempts by compilers AND human programmers to software pipeline these things, i.e. interleaving separate index tree traversals in the same loop. Mostly unsuccessful. IMHO one of the most important things about multithreaded code is when it is easier and more natural to express than serial code.) i.e. a more-outer serial loop, that is managing cache locality, and a more-inner loop that can be expressed as parallel. This is a common pattern: serial outer, parallel inner, with the serial code managing locality. Probably with even more-outer loops parallelizing across multiple processors, multiple machines, etc. Why so? Management of locality. You could express everything in the maximally parallel form, but then your thread scheduler, your run-time, has to take into account everything related to parallelism and performance. Nikhil's comment on dataflow for serial machines versus paralll dataflow was the the Von Neuman "bottleneck" provides a lot of clues to things like load balancing, etc. There is more thnan enough parallelism. The problem is how to manage it. In the semi-serial form, your programmer can manage. But... if you use the semi-serial/parallel form, you don't ALWAYS want to spawn a new thread. Nor do you ALWAYS want to use task queues over existing threads. Rather, you want to do something like LOOP small partition of left that stays in cache traverse left Btrees LOOP over all right that misses cache IF the hardware situation is such that a new thread can run efficiently, and would improve performance THEN spawn a new thread ELSIF there are too many threads THEN kill a thread (perhaps after it has finished its current task, perhaps not) END IF Put an item on the task queue for one of the threads to run that: traverse Btree for right END LOOP END LOOP Two problems with this: (1) the IF statement in the innermost loop is really hardware and workload specific, and is also often dynamic. It depends stuff like CPU utilization that the software has trouble handling. Basically, it is something dynamic. Like managing a cache. Like a branch predictor. 
(2) "Putting an item on a task queue" is basically creating a continuation or closure. Which is a wonderful programming style. But which is not something available to people in many programing languages, like C and C++. I.e. you have to roll your own. Also, preparing a continuation or closure is often expensive. It's often a lot cheaper to just roll into the next iteration of the loop than it is to prepare a task packet. So you get into the issue "Do I prepare a task packet every 4 loop iterations? Every 16? ..." Whatever the programmer decies will change (a) by the time you are on a different machine, but also, worse (b) depending on workload, the type of datastructures or database you are looking at. My point is that the decisions for when to go parallel and when not are dynamic and hardware dependent. In some ways things like speculative multithreading amount to providing efficient support for closures and continuations - but only so long as those closures and continuations can fit within some register storage. Not generic support, where the data has to spill to memory. I personally actually prefer the EXPLICIT parallelism form of the code. Fork as many threads as possible, let the run-time, both hardware and software, take care of it. It's especially nice because those explicit threads are explicitly indicated not to have cross thread dependencies, except at synch points. Therefore, you don't need to have all of memory speculation hardware that OOO has. I.e. I like explicit parallelism because it allows us avoid unnecessary hardware. SpMT on single threaded code is just a way of making such dynamic hardware and workload dependent thread scheduling systems. In many ways, I would like to such dynamic hardware and workload dependent thread scheduling for fine grained explicit parallel lightweight threaded code. That is nicer. --- In my mind the debate is not between Heavyweight single threaded hardware OOO vs Manycore explicit parallelism on sngle core. It is, rather, between statically expressed parallelism, and dynamic parallelism. Where the dynamic parallelism can be managed (a) at compile time, (b) by dynamic JIT-like software, but also (c) via hardware mechanisms. Where the latter are more likely to be able to respond to things like current CPU utilization, etc. One of the reasons I am so excited about GPU microarchitectures is that they have founbd a new point in this space. They have fairly sophisticated dynamic schedulers in hardware - e.g. the "Ultra Threaded Dispatch Processor" - but the units of scheduling are heavier than individual instructions in OOO, albeit much lighter that threads and processes on conventional systems.
From: Morten Reistad on 23 Apr 2010 10:58

In article <iIadnbiYmNH4ak3WnZ2dnUVZ_vCdnZ2d(a)speakeasy.net>,
Rob Warnock <rpw3(a)rpw3.org> wrote:
>Morten Reistad <first(a)last.name> wrote:
>+---------------
>| Interrupt coalescing; from Linux 2.6.24; gives an order of
>| magnitude more i/o performance on networks and similar devices.
>+---------------
>
>While your first and last phrases are certainly quite correct,
>the middle one gives the impression that Linux invented interrupt
>coalescing. Actually, it came *quite* late to that game!! E.g., I
>was doing interrupt coalescing in terminal device drivers[1] in
>TOPS-10 circa 1972 [and, yes, it *seriously* improved terminal I/O
>performance!!]; in Fortune Systems's FOR:OS in 1985; in SGI's Irix
>network code circa 1990; etc. And I certainly wasn't the only one.
>Linux is reinventing a very old wheel here.

Yep, Linux is picking up old tricks. But it is getting pretty good at it,
and has adopted a pretty long list of old tricks by now. I just point out
how dramatic such stuff can be for real-life performance. It isn't just
for theorists; this is about real kilowatts in real Internet services
with tens or hundreds of thousands of daily users. And as much of an old
BSD fan as I may be, Linux is where the big open-source Internet
applications happen.

I was also referring to Linux because I have done extensive benchmarking
on a lot of Internet-based service farms, and we end up, time and time
again, identifying either memory performance/cache size or CPU power per
watt as the limiting factor. All the Internet services we are dealing
with either are extremely frugal on system power, or handle 64+
processors reasonably well, as long as the logic they are configured with
allows parallelism.

We had a stunning comparison of a 32-way system and an 8-way system with
almost exactly the same clock rates and per-processor Dhrystone
performance, but with cache size, cache interconnect and interrupt
coalescing stacked in the disfavour of the 32-way system. Otherwise we
had the same kernel version and the same application. The 8-way system
handles 12800 streams; the 32-way system struggles with 2400. In terms of
CPU power they are almost 5:1 (independent Dhrystone sums), but in terms
of throughput they are almost 1:5, in the other direction. That is more
than a 20:1 performance hit due to these three factors.

-- mrr
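Spelling out the arithmetic behind that last figure, using only the
numbers given above and assuming "almost 5:1" means the 32-way box has
roughly five times the aggregate Dhrystone rating:

\[
\frac{\text{streams per Dhrystone unit, 8-way}}{\text{streams per Dhrystone unit, 32-way}}
  \approx \frac{12800/1}{2400/5} = \frac{12800 \times 5}{2400} \approx 26.7
\]

i.e. roughly 25:1 in delivered throughput per unit of nominal CPU power,
consistent with the "more than 20:1" claim.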
From: nmm1 on 23 Apr 2010 13:34
In article <4BD15197.3050202(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>So, it's happened again. Just as OOO CPU research was the poor cousin
>during the early days of the RISC revolution, so SpMT is the poor
>cousin in the early days of MultiCore and ManyCore. If M&M run out of
>steam, I hope that Haitham Akkary's SpMT research will be there, still
>ongoing, to pick up the pieces.

Er, what have I missed?

"A Minimal Dual-Core Speculative Multi-Threading Architecture"

That abstract states a potential dual-core gain of 20% - not bad, but not
exciting. Now, extensive experience with branch prediction and preloading
analyses indicates that the gain with the number of parallel lookaheads
is rarely much better than log(N), so that isn't going to give more than
a factor of two or so.

Multi-core, on the other hand, has a potentially LARGE gain, once people
bite the bullet and start designing software for it. Yes, that's a big
hurdle, and we know that some applications can't make use of it, but
that's a hell of a lot more scalable. We know that some applications can
get gains in the hundreds, and a few can do even better.

Regards,
Nick Maclaren.
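For concreteness, taking the log(N) rule of thumb with base 2 (the base
is an assumption; the post does not specify it), lookahead-style speedup
saturates quickly, whereas explicitly parallel code can in principle keep
scaling with core count:

\[
S_{\mathrm{lookahead}}(N) \lesssim \log_2 N
  \;\Rightarrow\; S(4) \approx 2,\; S(8) \approx 3,\; S(16) \approx 4
\]
\[
S_{\mathrm{multicore}}(N) \approx N
  \quad\text{(for the fraction of the work that parallelizes)}
\]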