From: Robert Myers on 22 Apr 2010 13:49

Anne & Lynn Wheeler wrote:
> Robert Myers <rbmyersusa(a)gmail.com> writes:
>> I had thought the idea of having lots of threads was precisely to get
>> the memory requests out. You start a thread, get some memory requests
>> out, and let it stall, because it's going to stall, anyway.
>>
>> Cache size and bandwidth and memory bandwidth are another matter.
>
> in mid-70s, there was a multithreaded project for the 370/195 (that
> never shipped). The 370/195 had a 64-instruction pipeline, but no
> branch prediction or speculative execution ... so common branches
> stalled the pipeline. Highly tuned codes with some kinds of looping
> branches within the pipeline could have peak thruput of 10mips ...
> however, branch stalls in most code tended to hold thruput to five
> mips.
>
> the objective of the emulated two-processor (double registers,
> instruction address, etc ... but no additional pipeline or execution
> units) was to compensate for branch stalls (i.e. instructions,
> operations, resources in the pipeline would have a one-bit flag as to
> the instruction stream that they were associated with). Having a pair
> of instruction streams with normal codes (peaking at 5mip/sec thruput)
> ... then had a chance of effectively utilizing/saturating the
> available 195 resources (10mip aggregate).

This logic always made sense to me, but Nick claims it doesn't work. If
it doesn't work, it has to be because of pressure on the cache or
because the thread that stalls is holding a lock that the other thread
needs.

Robert.
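The scheme Lynn describes is easy to model crudely. Below is a minimal
sketch, in Python, of two instruction streams sharing a single issue
slot, with each operation owned by one stream (the stream index standing
in for the one-bit tag) and a stream going quiet for a fixed number of
cycles whenever it takes a branch. The branch frequency and stall length
are round numbers invented for illustration, not 370/195 figures; the
point is only the shape of the result: one stream sustains roughly half
the peak issue rate, and a second stream recovers most of the slots the
first one wastes.

import random

random.seed(1)
BRANCH_PROB  = 1 / 6    # assumed: each instruction is a branch with p = 1/6
STALL_CYCLES = 6        # assumed: a branch freezes its stream this long
CYCLES       = 200_000

def run(n_streams):
    """Instructions retired per cycle with n interleaved streams."""
    stall_until = [0] * n_streams  # cycle at which each stream may issue again
    retired = 0
    for cycle in range(CYCLES):
        # One issue slot per cycle; rotate priority so no stream starves.
        for k in range(n_streams):
            s = (cycle + k) % n_streams
            if cycle >= stall_until[s]:
                retired += 1
                if random.random() < BRANCH_PROB:
                    stall_until[s] = cycle + STALL_CYCLES
                break
    return retired / CYCLES

for n in (1, 2):
    print(f"{n} stream(s): {run(n):.2f} of peak issue rate")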
From: nmm1 on 22 Apr 2010 14:28

In article <V_%zn.25199$Db6.3878(a)newsfe05.iad>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>Anne & Lynn Wheeler wrote:
>> Robert Myers <rbmyersusa(a)gmail.com> writes:
>>> I had thought the idea of having lots of threads was precisely to get
>>> the memory requests out. You start a thread, get some memory requests
>>> out, and let it stall, because it's going to stall, anyway.
>>>
>>> Cache size and bandwidth and memory bandwidth are another matter.
>>
>> in mid-70s, there was a multithreaded project for the 370/195 (that
>> never shipped). The 370/195 had a 64-instruction pipeline, but no
>> branch prediction or speculative execution ... so common branches
>> stalled the pipeline. Highly tuned codes with some kinds of looping
>> branches within the pipeline could have peak thruput of 10mips ...
>> however, branch stalls in most code tended to hold thruput to five
>> mips.
>>
>> the objective of the emulated two-processor (double registers,
>> instruction address, etc ... but no additional pipeline or execution
>> units) was to compensate for branch stalls (i.e. instructions,
>> operations, resources in the pipeline would have a one-bit flag as to
>> the instruction stream that they were associated with). Having a pair
>> of instruction streams with normal codes (peaking at 5mip/sec thruput)
>> ... then had a chance of effectively utilizing/saturating the
>> available 195 resources (10mip aggregate).
>
>This logic always made sense to me, but Nick claims it doesn't work. If
>it doesn't work, it has to be because of pressure on the cache or
>because the thread that stalls is holding a lock that the other thread
>needs.

Not quite. I have never claimed that it is without effect, merely
that the effect isn't what is claimed! I omitted a paragraph where
I said that there WAS a time when the technique would have worked,
but it wasn't when it was used.

Back in the 1970s, computational units were a scarce resource in
CPU design, and the thing that the SMT approach does make better
use of is computational units. So it would have worked then, as
it would in the 1980s on microprocessors (when, again, computational
units were a scarce resource, because of limited transistor count).

However, by the year 2000, and even in the 1990s, they were NOT a
scarce resource any longer, and the limits were invariably memory
and cache bandwidth, transaction rate and conflict resolution.
How would they help with that?

Well, as the Tera MTA showed, they could - in a machine designed
for that purpose. But in what we now know as a general-purpose
CPU?

To a first approximation, two threads or two cores have the same
memory and cache requirements, so they don't do any better than
multiple cores there. They still make better use of computational
units, but at the expense of some extra logic and less performance
compared to multi-core designs. How much?

Well, when I looked at the papers, their efficiency was good for
2-way threading, but dropped off badly for 4-way and was definitely
poor for 8-way. And that was analysing the simple, clean MIPS
architecture - even done well, x86 would not have been as good.

So a much better, more scalable, design is to forget about threading
and simply go for more cores. Notice that even Intel has never
delivered a CPU with more than 2-way threading, and there are a
lot of people who say the route to performance is to disable even
that.

To put it another way, they are a solution to a problem of the
1970s and 1980s, not to one of the 1990s and later.
Regards,
Nick Maclaren.
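Nick's pattern - fine at 2-way, falling off at 4-way and beyond - drops
out of even the crudest shared-bandwidth model. The sketch below caps
aggregate throughput at whichever limit binds first: per-thread demand,
issue width, or the miss bandwidth of the memory system. Every constant
is invented for illustration; none of this is taken from the papers he
mentions.

def aggregate_ipc(threads, demand=0.6, miss_rate=0.02, miss_bw=0.03, width=4):
    """Toy model of N-way SMT against shared compute and memory limits.

    demand    - IPC one thread could sustain running alone (assumed)
    miss_rate - cache misses per instruction (assumed)
    miss_bw   - misses per cycle the memory system can service (assumed)
    width     - issue width; execution units are no longer the bottleneck
    """
    memory_cap = miss_bw / miss_rate   # highest IPC the memory system can feed
    return min(threads * demand, width, memory_cap)

for t in (1, 2, 4, 8):
    ipc = aggregate_ipc(t)
    print(f"{t}-way: {ipc:.2f} IPC aggregate, "
          f"per-thread efficiency {ipc / (t * 0.6):.0%}")

With these numbers the memory system tops out at 1.5 IPC, so 2-way
threading still scales perfectly while 4-way and 8-way buy nothing,
which is the qualitative picture Nick reports.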
From: Robert Myers on 22 Apr 2010 15:07

nmm1(a)cam.ac.uk wrote:
> In article <V_%zn.25199$Db6.3878(a)newsfe05.iad>,
>
> Not quite. I have never claimed that it is without effect, merely
> that the effect isn't what is claimed! I omitted a paragraph where
> I said that there WAS a time when the technique would have worked,
> but it wasn't when it was used.
>
> Back in the 1970s, computational units were a scarce resource in
> CPU design, and the thing that the SMT approach does make better
> use of is computational units. So it would have worked then, as
> it would in the 1980s on microprocessors (when, again, computational
> units were a scarce resource, because of limited transistor count).
>
> However, by the year 2000, and even in the 1990s, they were NOT a
> scarce resource any longer, and the limits were invariably memory
> and cache bandwidth, transaction rate and conflict resolution.
> How would they help with that?
>
> Well, as the Tera MTA showed, they could - in a machine designed
> for that purpose. But in what we now know as a general-purpose
> CPU?
>
> To a first approximation, two threads or two cores have the same
> memory and cache requirements, so they don't do any better than
> multiple cores there. They still make better use of computational
> units, but at the expense of some extra logic and less performance
> compared to multi-core designs. How much?
>
> Well, when I looked at the papers, their efficiency was good for
> 2-way threading, but dropped off badly for 4-way and was definitely
> poor for 8-way. And that was analysing the simple, clean MIPS
> architecture - even done well, x86 would not have been as good.
>
> So a much better, more scalable, design is to forget about threading
> and simply go for more cores. Notice that even Intel has never
> delivered a CPU with more than 2-way threading, and there are a
> lot of people who say the route to performance is to disable even
> that.
>
> To put it another way, they are a solution to a problem of the
> 1970s and 1980s, not to one of the 1990s and later.

I think we've been through the "computational resources are no longer
scarce" discussion wrt hyperthreading in this forum.

But suppose the scarce resource isn't computational resources, but
other things, like L1 and L2 cache and watts. You add more cores, you
need more of both.

I think that, with proper cache management, trashing L1 and perhaps
even L2 for the thread that *can* advance makes more sense than
duplicating expensive cache that will be idled on a separate core. I'm
in *way* over my head here.

As to the 2-threads vs. many-threads argument, I suspect that I agree
with you, but that's based purely on seeing the point of diminishing
returns with hyperthreading and on the fact that a factor of two seems
just about right for core overloading.

Robert.
From: nmm1 on 22 Apr 2010 15:33

In article <781An.11623$0_7.8171(a)newsfe25.iad>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>
>I think we've been through the "computational resources are no longer
>scarce" discussion wrt hyperthreading in this forum.

Yes.

>But suppose the scarce resource isn't computational resources, but
>other things, like L1 and L2 cache and watts. You add more cores, you
>need more of both.

Right. But see below.

>I think that, with proper cache management, trashing L1 and perhaps
>even L2 for the thread that *can* advance makes more sense than
>duplicating expensive cache that will be idled on a separate core. I'm
>in *way* over my head here.

That is certainly true, but we should compare a dual-threaded system
with a dual-core one that shares at least level 2 cache. No gain
there. The question is whether the duplication and synchronisation of
level 1 cache costs more than the register set juggling needed to run
the two threads. No, I can't answer that any more than you can, but
it looks as if it is pretty well balanced.

So, on the above basis, it's purely a matter of taste. But now let's
consider performance registers and tunability - threading more-or-less
sacrifices those, which, in turn, lowers the efficiency of the system
because the applications are less well tuned.

Well, somewhat. I don't think that CPU threading is completely
insane, but it's not a solution to the problems it is often claimed
to solve.

Regards,
Nick Maclaren.
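The level-1 half of that question can at least be poked at numerically.
Here is a small LRU experiment - synthetic uniform-random address
streams, all sizes invented - pitting one shared cache against two
private caches of half the size, i.e. the same total capacity, as in
the dual-thread vs. dual-core framing above. With the working sets
chosen here (one thread needing more than half the total), sharing
wins, which is Robert's point about not idling duplicated cache; let
the combined working sets overflow the shared cache and the private
split pulls ahead instead.

import random
from collections import OrderedDict

class LRUCache:
    def __init__(self, lines):
        self.capacity = lines
        self.data = OrderedDict()
        self.hits = self.refs = 0
    def access(self, addr):
        self.refs += 1
        if addr in self.data:
            self.hits += 1
            self.data.move_to_end(addr)        # mark most recently used
        else:
            self.data[addr] = True
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict least recently used

def stream(working_set, length, seed):
    rng = random.Random(seed)
    return [rng.randrange(working_set) for _ in range(length)]

N = 50_000
a = stream(40, N, 1)   # thread A: 40-line working set (assumed)
b = stream(24, N, 2)   # thread B: 24-line working set (assumed)

shared = LRUCache(64)                  # SMT-style: one 64-line cache
for x, y in zip(a, b):                 # the two threads interleave accesses
    shared.access(("A", x))
    shared.access(("B", y))

pa, pb = LRUCache(32), LRUCache(32)    # dual-core-style: same total capacity
for x in a:
    pa.access(x)
for y in b:
    pb.access(y)

print(f"shared 64-line cache    : {shared.hits / shared.refs:.1%} hit rate")
print(f"two private 32-line ones: "
      f"{(pa.hits + pb.hits) / (pa.refs + pb.refs):.1%} hit rate")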
From: Morten Reistad on 23 Apr 2010 10:38
In article <SzbAn.202687$Ye4.66545(a)newsfe11.iad>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>MitchAlsup wrote:
>
>> Also note: if you look at the volume of chips that go into servers and
>> other big iron, it represents an afternoon in the FAB per year compared
>> to desktops and notebooks,... A profitable afternoon, but not big
>> enough for an Intel or AMD to alter design team directions.
>
>If you are Google, though, you can make your own rules, if you want to
>badly enough:
>
>http://www.channelregister.co.uk/2010/04/22/google_the_server_chip_designer/
>
><quote>
>
>But an earlier Times story indicated that Agnilux [recently acquired by
>Google] was brewing "some kind of server."
>
></quote>
>
>If anyone has the incentive to build a no-frills, low-power chip that
>can afford to wait, if necessary, it would be Google.
>
>Data centers may not account for much chip volume, but they sure do
>gobble electricity. These designs seem to be low-hanging fruit.

If you are Intel, or even AMD, and possibly Via, you should be able
to take a couple of years old design, implement it in a modern
process, and use the leftover space for cache, cache, more cache and
a cache interconnect.

If bits and pieces could be powered on and off under os / hypervisor
control, it could be a real winner for laptops too: keep a small cache
and a single processor running when not doing anything major, and fire
it all up when there is system load.

The next issue is to be a little more intelligent about cache
replacement, since it is so vital for the performance of the system.

-- 
mrr
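What Morten sketches is, in effect, a power governor with the cache in
the loop. A minimal illustration of such a policy follows, with every
number (core count, way count, the scaling rule itself) made up for the
sketch rather than taken from any real part.

def plan_power(load_pct, max_cores=8, max_ways=16):
    """Pick how many cores and cache ways to keep powered at a given load.

    Toy policy: scale cores with load, scale cache ways with the active
    cores, but never drop below one core and two ways so a lightly
    loaded laptop still has a processor and a sliver of cache running.
    """
    cores = max(1, min(max_cores, round(load_pct / 100 * max_cores)))
    ways = max(2, max_ways * cores // max_cores)
    return cores, ways

for load in (2, 30, 75, 100):
    cores, ways = plan_power(load)
    print(f"{load:3d}% load -> {cores} core(s), {ways} cache way(s) powered")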