From: jgd on 13 Dec 2009 09:02

In article <4B21DA4C.8010402(a)patten-glew.net>, ag-news(a)patten-glew.net (Glew) wrote:

> At SC09 the watchword was heterogeneity.
>
> E.g. a big OOO x86 core, with small efficient cores of your favorite
> flavour. On the same chip.

It's a nice idea, but it leaves some questions unanswered. The small cores are going to need access to memory, and that means more controllers in the packages, and more legs on the chip. That costs, whatever.

Now, are the small cores cache-coherent with the big one? If so, that's more complexity; if not, it's harder to program. I suspect that if they share an instruction set with the big core, cache coherency is worthwhile, but if not, not.

Overall, the main advantage of this idea seems to be having a low-latency link between the main and small cores. That is not to be sneezed at: we've given up on a co-processor project because of the geological ages needed to communicate across PCI-Express busses. Back-of-the-envelope calculations made it clear that even if the co-processor took zero time to do its work, we made a speed loss overall.

> While you could put a bunch of small x86 cores on the side, I think
> that you would probably be better off putting a bunch of small
> non-x86 cores on the side. Like GPU cores. Like Nvidia. Or AMD/ATI
> Fusion.
>
> Although this makes sense to me, I wonder if the people who want x86
> really want x86 everywhere - on both the big cores, and the small.
>
> Nobody likes the hetero programming model. But if you get a 100x
> perf benefit from GPGPU...

The stuff I produce is libraries, which get licensed to third parties and put into a wide range of apps. Those get run on all sorts of machines, from all sorts of manufacturers; we need to run on whatever the customer has, rather than simply what the software developers' managers chose to buy. That means "small efficient cores of your favourite flavour" are something of a pain: if there are several different varieties of such things out there, I have to support (and thus build for and test) most of them, or plump for one with a significant chance of being wrong, or wait for a dominant one to emerge. Which is easiest?

That's the attraction of OpenCL as opposed to CUDA: it isn't tied to one manufacturer's hardware. However, AMD don't seem to be doing a great job of spreading it around at present.

The great potential advantage, to me, of the small cores being x86 is not the x86 instruction set, or its familiarity, or its widespread development tools. It's having them standardised. That doesn't solve the problem of making good use of them, but it takes some logistic elements (and thus costs) out of it.

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
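The portability argument for OpenCL above is that the same host code can discover and use whatever compute devices the customer's machine actually contains, whichever vendor made them. A minimal sketch of that discovery step, assuming an OpenCL 1.x implementation is installed and the program is linked with -lOpenCL (the array sizes are arbitrary choices for illustration; real machines will report whatever they have):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;

        /* One machine may expose several vendors' platforms side by side. */
        if (clGetPlatformIDs(8, platforms, &nplat) != CL_SUCCESS)
            return 1;
        if (nplat > 8) nplat = 8;          /* only the first 8 were filled in */

        for (cl_uint p = 0; p < nplat; p++) {
            cl_device_id devices[16];
            cl_uint ndev = 0;

            /* CL_DEVICE_TYPE_ALL picks up CPUs, GPUs and accelerators alike. */
            if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                               16, devices, &ndev) != CL_SUCCESS)
                continue;
            if (ndev > 16) ndev = 16;

            for (cl_uint d = 0; d < ndev; d++) {
                char name[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                                sizeof(name), name, NULL);
                printf("platform %u, device %u: %s\n", p, d, name);
            }
        }
        return 0;
    }

A library built this way can pick a GPU when one is present and fall back to a CPU device otherwise, which is the logistic advantage being described: one build, no per-vendor variants.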
From: "Andy "Krazy" Glew" on 14 Dec 2009 09:57 jgd(a)cix.compulink.co.uk wrote: > In article <4B21DA4C.8010402(a)patten-glew.net>, ag-news(a)patten-glew.net ( > Glew) wrote: > >> At SC09 the watchword was heterogeneity. >> >> E.g. a big OOO x86 core, with small efficient cores of your favorite >> flavour. On the same chip. > > It's a nice idea, but it leaves some questions unanswered. The small > cores are going to need access to memory, and that means more > controllers in the packages, and more legs on the chip. That costs, > whatever. > > Now, are the small cores cache-coherent with the big one? If so, that's > more complexity, if not, it's harder to program. I suspect that if they > share an instruction set with the big core, cache coherency is > worthwhile, but if not, not. I must admit that I do not understand your term "legs on the chip". When I first saw it, I thought that you meant pins. Like, the old two chips in same package, or on same chip, not sharing a memory controller. But that does not make sense here. Whenever you have multicore, you have to arrange for memory access. The main way this is done is to arrange for all to access the same memory controller. (Multiple memory controllers are a possibility. Multiple MCs subdividing the address space, either by address ranges or by interleaved cache lines or similar blocks, a possibility. Multiple MCs with separate address spaces, dedicated to separate groups of processors, are possible. But I don't know what would would motivate that. Bandwidth - but non-cache coherent shared memory has the same bandwidth advantages. Security?) I therefore do not understand you when you say "that means more controllers in the package". The hetero chips would probably share the same memory controller. If you mean cache controllers, yes: if you want cache consistency, you will need cache controlers for every small processor, or at least group of processors. If you have a scalable interconnect on chip, then both big and small processors will connect to it. Having N big cores + M small cores is no more complex in that regard than having N+M big cores. Except... since the sizes and shapes of the big and small cores is different, the physical layout will be different. Timing, etc. (But if you are creating a protocol that is timing and layout sensitive, you deserve to be cancelled.) Logically, same complexity. Testing wise, of course, different complexity. You would have to test all of the combinations big/big, big/small, small/small, small/small on the ends of the IC, ... -- As for cache consistency, that is on and off. Folks like me aren't afraid to take the cache protocols that work on multichip systems, and put them on-chip. Integration is obvious. Where you get into problems is wrt tweaking. On the other hand, big MP / HPC systems tend to have nodes that consist of 4-8-16 cache consistent shared memory cores, and then run PGAS style non-cache-coherent shared memory between them, or MPI message passing. Since integration is inevitable as well as obvious, inevitably we will have more than one cache coherent domains on chip, which are PGAS or MPI non-cache coherent between the domains.
From: nmm1 on 14 Dec 2009 10:20

In article <4B265271.6020809(a)patten-glew.net>, Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:

> jgd(a)cix.compulink.co.uk wrote:
>
>>> At SC09 the watchword was heterogeneity.
>>>
>>> E.g. a big OOO x86 core, with small efficient cores of your favorite
>>> flavour. On the same chip.
>>
>> It's a nice idea, but it leaves some questions unanswered. ...
>>
>> Now, are the small cores cache-coherent with the big one? If so, that's
>> more complexity; if not, it's harder to program. I suspect that if they
>> share an instruction set with the big core, cache coherency is
>> worthwhile, but if not, not.
>
> As for cache consistency, that is on and off. Folks like me aren't
> afraid to take the cache protocols that work on multichip systems, and
> put them on-chip. Integration is obvious. Where you get into problems
> is wrt tweaking.

Precisely. Therefore, when considering larger multi-core than today, one should look at the systems that have already delivered that using multiple chips, and see how they have done. It's not pretty.

Now, it is POSSIBLE that multi-core coherence is easier to make reliable and efficient than multi-chip coherence, but a wise man will not assume that until he has investigated the causes of the previous problems and seen at least draft solutions. 8-way shouldn't be a big deal, 32-way will be a lot trickier, 128-way will be a serious problem and 512-way will be a nightmare. All numbers subject to scaling :-)

> On the other hand, big MP / HPC systems tend to have nodes that consist
> of 4-8-16 cache consistent shared memory cores, and then run PGAS style
> non-cache-coherent shared memory between them, or MPI message passing.

The move to that was a response to the reliability, efficiency and (most of all) cost problems of the previous multi-chip coherent systems.

> Since integration is inevitable as well as obvious, inevitably we
> will have more than one cache-coherent domain on chip, with PGAS
> or MPI non-cache-coherent communication between the domains.

Extremely likely - nay, almost certain. Whether those domains will share an address space or not is hard to say. My suspicion is that they will, but there will be a SHMEM-like interface to them from their non-owning cores.

Regards,
Nick Maclaren.
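For a flavour of the SHMEM-like interface mentioned above, here is a minimal sketch in the style of the OpenSHMEM C API: one-sided puts into a symmetric heap owned by another processing element, with no cache coherence assumed between them. It illustrates the programming model only, not any particular future on-chip interface; compile and launch details (e.g. oshcc and oshrun) depend on the implementation.

    #include <stdio.h>
    #include <shmem.h>

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric allocation: every PE gets its own copy of 'slot',
           addressable remotely by the others. */
        long *slot = (long *)shmem_malloc(sizeof(long));
        *slot = -1;
        shmem_barrier_all();

        /* Each PE writes its rank into its right-hand neighbour's slot:
           a one-sided put, no shared cache-coherent memory assumed. */
        long val = me;
        shmem_long_put(slot, &val, 1, (me + 1) % npes);
        shmem_barrier_all();

        printf("PE %d of %d received %ld\n", me, npes, *slot);

        shmem_free(slot);
        shmem_finalize();
        return 0;
    }

Each processing element sees only its own copy of the symmetric variable until a remote put lands in it, which is the "non-owning core reaches into another domain's memory" pattern in miniature.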
From: jgd on 14 Dec 2009 16:03

In article <4B265271.6020809(a)patten-glew.net>, ag-news(a)patten-glew.net (Glew) wrote:

> I must admit that I do not understand your term "legs on the chip".
> When I first saw it, I thought that you meant pins. Like, the old two
> chips in same package, or on same chip, not sharing a memory
> controller. But that does not make sense here.

That is what I meant. I just wasn't clear enough.

> Whenever you have multicore, you have to arrange for memory access.
> The main way this is done is to arrange for all to access the same
> memory controller. (Multiple memory controllers are a possibility.

I wasn't explaining enough. A single memory channel does not seem to be enough for today's big OOO x86 cores: a Core 2 Duo system typically has two memory channels, and a Core i7 has three. This is inevitably pushing up pin count. If you add a bunch more small cores, you're going to need even more memory bandwidth, and thus presumably more memory channels and controllers. That is no doubt achievable, but the price may be a problem.

--
John Dallman, jgd(a)cix.co.uk, HTML mail is treated as probable spam.
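A rough back-of-the-envelope version of that bandwidth argument, with every number an explicit assumption (roughly DDR3-era figures, per-core demands guessed, none of it measured):

    #include <stdio.h>

    int main(void) {
        /* Assumed figures, not measurements. */
        double gbps_per_channel = 10.7;  /* e.g. one DDR3-1333 channel        */
        double gbps_per_big     = 4.0;   /* sustained demand of a big OOO core */
        double gbps_per_small   = 1.0;   /* sustained demand of a small core   */
        int big = 4, small = 32;

        double demand   = big * gbps_per_big + small * gbps_per_small;
        double channels = demand / gbps_per_channel;

        printf("demand ~%.0f GB/s -> ~%.1f channels' worth of pins\n",
               demand, channels);
        return 0;
    }

With those guesses, four big cores plus 32 small ones want roughly 48 GB/s, i.e. four to five channels' worth of pins, which is where the pin-count and price pressure comes from.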
From: Robert Myers on 14 Dec 2009 19:55
On Dec 14, 4:03 pm, j...(a)cix.compulink.co.uk wrote:
>
> I wasn't explaining enough. A single memory channel does not seem to
> be enough for today's big OOO x86 cores: a Core 2 Duo system typically
> has two memory channels, and a Core i7 has three. This is inevitably
> pushing up pin count. If you add a bunch more small cores, you're going
> to need even more memory bandwidth, and thus presumably more memory
> channels and controllers. That is no doubt achievable, but the price
> may be a problem.

Bandwidth. Bandwidth. Bandwidth. It must be in scripture somewhere. It is, but no one reads the Gospel according to Seymour any more.

Is an optical fat link out of the question? I know that optical on-chip will take a miracle and maybe a Nobel prize, but just one fat link. Is that too much to ask?

Robert.