From: Thomas Womack on 23 Jul 2010 13:19

In article <34ea667e-779a-44d8-ab63-c032df1cb067(a)q35g2000yqn.googlegroups.com>, Robert Myers <rbmyersusa(a)gmail.com> wrote:

>At a time when vector processors were still a fading memory (even in
>the US), an occasional article would mention that "vector computers"
>were easier to use for many scientists than thousands of COTS
>processors hooked together by whatever.

Yes, this is certainly true. Earth Simulator demonstrated that you could build a pretty impressive vector processor, which managed 90% performance on lots of tasks (the Journal of the Earth Simulator is one of the really good resources here, since it talks about both the science and the implementation issues), partly because using it was very prestigious and you weren't allowed to use the whole machine on jobs which didn't manage very high performance on a 10% subset. But it was a $400 million project to build a 35 Tflops machine, and the subsequent project to spend a similar amount this decade on a heftier machine came to nothing.

I've worked at an establishment with an X1, and it was woefully under-used because the problems that came up didn't fit the vector organisation terribly well; it is not at all clear why they bought the X1 in the first place.

>The real problem is not in how the computation is organized, but in
>how memory is accessed. Replicating the memory access style of the
>early Cray architectures isn't possible beyond a very limited memory
>size, but it sure would be nice to figure out a way to simulate the
>experience.

I _think_ this starts, particularly for the crystalline memory access case, to be almost a language-design issue.

Tom
From: Robert Myers on 23 Jul 2010 14:30

On Jul 23, 1:19 pm, Thomas Womack <twom...(a)chiark.greenend.org.uk> wrote:
> In article <34ea667e-779a-44d8-ab63-c032df1cb...(a)q35g2000yqn.googlegroups.com>,
> Robert Myers <rbmyers...(a)gmail.com> wrote:
>
> >At a time when vector processors were still a fading memory (even in
> >the US), an occasional article would mention that "vector computers"
> >were easier to use for many scientists than thousands of COTS
> >processors hooked together by whatever.
>
> Yes, this is certainly true. Earth Simulator demonstrated that you
> could build a pretty impressive vector processor, which managed 90%
> performance on lots of tasks (the Journal of the Earth Simulator is
> one of the really good resources here, since it talks about both the
> science and the implementation issues), partly because using it was
> very prestigious and you weren't allowed to use the whole machine on
> jobs which didn't manage very high performance on a 10% subset. But it
> was a $400 million project to build a 35 Tflops machine, and the
> subsequent project to spend a similar amount this decade on a heftier
> machine came to nothing.
>
> I've worked at an establishment with an X1, and it was woefully
> under-used because the problems that came up didn't fit the vector
> organisation terribly well; it is not at all clear why they bought the
> X1 in the first place.

So, if you can cheaply build a machine with lots of flops that you sometimes can't use, who cares, so long as the flops you *can* use are still more plentiful and less expensive than, say, an Earth Simulator style effort would provide, especially if there are lots of problems for which the magnificently awesome vector processor is useless?

That's essentially the argument used to defend the purchasing decisions being made at a national level in the US. I would agree, if only I could wrestle a tiny concession from the empire-builders: the machines they are building are *not* scalable, and I wish they'd stop claiming they are.

It would be like my cable company claiming that its system is scalable because it can hang as many users off the same cable as it can get away with. It's all very well until too many of them try to use the bandwidth at once. Having the bandwidth per flop drop to zero is no different from having the bandwidth per user drop to zero, and even my cable company, which has lots of gall, wouldn't have the gall to claim that this isn't a problem it has to worry about, because it does.

> >The real problem is not in how the computation is organized, but in
> >how memory is accessed. Replicating the memory access style of the
> >early Cray architectures isn't possible beyond a very limited memory
> >size, but it sure would be nice to figure out a way to simulate the
> >experience.
>
> I _think_ this starts, particularly for the crystalline memory access
> case, to be almost a language-design issue.

Engineers apparently find Matlab easy to use. No slight to Matlab, but the disconnect with the hardware can be painful. I don't think the hardware and software issues can be separated.

Robert.
From: Andy Glew "newsgroup at on 23 Jul 2010 15:01

On 7/21/2010 3:18 PM, George Neuner wrote:
> On Tue, 20 Jul 2010 15:41:13 +0100 (BST), nmm1(a)cam.ac.uk wrote:
>
>> In article<04cb46947eo6mur14842fqj45pvrqp61l1(a)4ax.com>,
>> George Neuner<gneuner2(a)comcast.net> wrote:
>>>
>>> ISTM bandwidth was the whole point behind pipelined vector processors
>>> in the older supercomputers. ...
>>> ... the staging data movement provided a lot of opportunity to
>>> overlap with real computation.
>>>
>>> YMMV, but I think pipeline vector units need to make a comeback.
>>
>> NO chance! It's completely infeasible - they were dropped because
>> the vendors couldn't make them for affordable amounts of money any
>> longer.
>
> Actually I'm a bit skeptical of the cost argument ... obviously it's
> not feasible to make large banks of vector registers fast enough for
> multiple GHz FPUs to fight over, but what about a vector FPU with a
> few dedicated registers?

I have been reading this thread somewhat bemused.

To start, full disclosure: I have proposed having pipelined vector instructions make a comeback, in my postings to this group and in my presentations, e.g. at Berkeley Parlab (linked to on some web page). Reason: not to improve performance, but to reduce costs compared to what is done now.

What is done now?

There are CPUs with FPUs pipelined circa 5-7 cycles deep. Commonly 2 sets of 4 32-bit SP elements wide, sometimes 8 or 16 wide.

There are GPUs with 256-1024 SP FPUs on them. I'm not so sure about pipeline depth, but it is indicated to be deep by recommendations that dependent ops not be closer together than 40 cycles.

The GPUs often have 16KB of registers for each group of 32 or so FPUs.

I.e. we are building systems with more FPUs, more deeply pipelined FPUs, and more registers than the vector machines I am most familiar with, Cray-1 era machines. I don't know by heart the specs for the last few generations of vector machines before they died back, but I suspect that modern CPUs and, especially, GPUs, are comparable.

Except
(1) they are not organized as vector machines, and
(2) the memory subsystems are less powerful, in proportion to the FPUs, than in the old days.

I'm going to skate past the memory subsystems, since we have talked about this at length elsewhere and since it will be the topic of Robert Myers' new mailing list. Except to say (a) high end GPUs often have memory separate from the main CPU memory, made with more expensive graphics DRAMs (GDDR) rather than conventional DRAMs, and (b) modern DRAMs emphasize sequential burst accesses in ways that the Cray-1's SRAM based memory subsystem did not. Basically, commodity DRAM does not lend itself to non-unit-stride access patterns, and building a big system out of non-commodity memory is much more prohibitive than it was back in the day of the Cray-1. This was becoming evident in the last years of the old vector processors.

But let's get back to the fact that these modern machines, with more FPUs, more deeply pipelined, and with more registers than the classic vector machines, are not organized as pipelined vector machines. To some limited extent they are small parallel vector machines, operating on 4 32b SP elements in parallel in one instruction. The actual FPU operation is pipelined. There may be a small degree of vector pipelining, e.g. spreading an 8 element vector over 2 cycles. But not the same degree of vector pipelining as in the old days, where a single instruction may be pipelined over 16 or 64 cycles.
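As a minimal sketch of the unit-stride point above (illustrative only, not from the original post; CUDA, with hypothetical kernel names and sizes): when adjacent threads of a warp touch adjacent words, the warp's loads map onto a few full DRAM bursts, whereas with a non-unit stride each load lands in a different burst and most of every burst is wasted.

#include <cuda_runtime.h>

__global__ void copy_unit_stride(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread i -> element i
    if (i < n)
        out[i] = in[i];                              // unit stride: fully coalesced
}

__global__ void gather_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)i * stride;                       // thread i -> element i*stride
    if (j < n)
        out[i] = in[j];                              // non-unit stride: roughly one burst per load
}

int main(void)
{
    const int n = 1 << 24;                           // hypothetical problem size
    float *in, *out;
    cudaMalloc((void **)&in,  n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    copy_unit_stride<<<grid, block>>>(in, out, n);       // bandwidth-friendly
    gather_strided<<<grid, block>>>(in, out, n, 17);     // stride-17: each warp's loads scatter across many bursts
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}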
Why aren't modern CPUs and GPUs vector pipelined?

I think one of the not widely recognized things is that we are significantly better at pipelining now than in the old days. The Cray-1 had 8 gate delays per cycle. I suspect that one of the motivations for vectors was that it was a challenge to decode back-to-back dependent instructions at that rate, whereas it was easier to decode an instruction, set up a vector, and then run that vector instruction for 16 to 64 cycles. Yes, there was chaining to arrange, and yes, I know that one of the Cray-1's claims to fame was better scalar instruction performance.

If you can run individual scalar instructions as fast as you can run vector instructions, giving the same FLOPS, wouldn't you rather? Why use vectors rather than scalars? I'll answer my own question: (a) vectors allow you to use the same number of register bits to specify a lot more registers - #vector-registers * #elements-per-vector; (b) vectors save power - you only decode the instruction once, and the decoding and scheduling logic gets amortized over the entire vector. But if you aren't increasing the register set or adding new types of registers, and if you aren't that worried about power, then you don't need vectors.

But we are worried about power, aren't we?

Why aren't modern GPUs vector pipelined? Basically because they are SIMD, or, rather, SIMD in its modern evolution of SIMT, CIMT, Coherent Threading. This nearly always gets 4 cycles' worth of amortization of instruction decode and schedule cost. And it seems to be easier to program. And it promotes portability.

When I started working on GPUs, I thought, like many on this newsgroup, that vector ISAs were easier to program than SIMD GPUs. I was quite surprised to find out that this is NOT the case. Graphics programmers consistently prefer the SIMD programming model. Or, rather, they consistently prefer to have lots of little threads executing scalar or moderate VLIW or short vector instructions, rather than fewer, heavyweight threads executing longer vector instructions. Partly because their problems tend to be short vector, 4 element, rather than long vector operations. Perhaps because SIMD is what they are familiar with - although, again, I emphasize that SIMT/CIMT is not the same as classic Illiac-IV SIMD. I think that one of the most important aspects is that SIMD/SIMT/CIMT code is more portable - it runs fairly well on both GPUs and CPUs. And it runs on GPUs no matter whether the parallel FPUs, what would be the vector FPUs, are 16 wide x 4 cycles, or 8 wide x 8 cycles, or ...
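Since the contrast between lots of little scalar threads and fewer heavyweight vector threads is central here, a minimal sketch of the SIMT side may help (illustrative only, not from the post; CUDA, hypothetical names). Each thread is written as plain scalar code for one element; the hardware gangs 32 such threads into a warp, which is where the decode/schedule amortization comes from. A classic vector ISA would express the same loop as a few vector instructions, each covering, say, 64 elements.

// One scalar thread per element; the warp hardware supplies the "vector".
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n)                                       // divergence handled by masking
        y[i] = a * x[i] + y[i];                      // ordinary scalar arithmetic
}

// Hypothetical launch: one thread per element, 256 threads per block.
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);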
From: Andy Glew "newsgroup at on 23 Jul 2010 20:28

The workday is officially over at 5pm, so I can continue the post I started at lunch. (Although I am pretty sure to get back to work this evening.)
Continuing the discussion of the advantages of vector instruction sets and hardware:

Vector ISAs allow you to have a whole lot of registers accessible from relatively small register numbers in the instruction. GPU SIMD/SIMT/CIMT gets the same effect by having a whole lot of threads, each given a variable number of registers.
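To put rough numbers on that (a back-of-envelope illustration using the 16KB-per-32-FPUs figure from the earlier post; this describes no particular GPU): 16KB is 4096 32-bit registers, or 128 registers per lane if each lane holds one thread; give each thread 32 registers and 4 threads can be resident per lane, give each 16 and you get 8, with correspondingly more latency hiding. In CUDA the per-thread register budget is the knob that is actually exposed, e.g.:

// Hypothetical kernel: cap threads per block at 256 and ask the compiler to
// keep per-thread register use low enough that 8 blocks can be resident on a
// multiprocessor at once - i.e. trade registers per thread for threads in flight.
__global__ void __launch_bounds__(256, 8)
smooth(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}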
Basically, reducing the number of registers allocated to threads (which run in warps or wavefronts, say 16 wide over 4 cycles) is equivalent to, and probably better than, having a variable vector length - variable on a per vector register basis. I'm not aware of many classic vector ISAs doing this - and if they did, they would lose the next advantage.

Vector register files can be cheaper than ordinary register files. Instead of having to allow any register to be accessed at full speed, vector ISAs only require you to index the first element of a vector quickly; subsequent elements can stream along with greater latency. However, I'm not aware of any recent vector hardware uarch that has taken advantage of this possibility. Usually they just build a great big wide register file.

Vector ISAs are on a slippery slope of ISA complexity. First you have vector+vector -> vector ops. Then you add vector sum reductions. Inner products. Prefix calculations. Operate under mask. Etc. This slippery slope seems much less slippery for CIMT, since most of these operations can be synthesized simply out of the scalar operations that are their basis.

Vector chaining is a source of performance - and complexity. It happens somewhat for free with Nvidia-style scalar SIMT, and the equivalent of more complicated chaining complexes can be set up using ATI/AMD's VLIW SIMT.

All this being said, why would I be interested in reviving vector ISAs?

Mainly because vector ISAs allow the cost of instruction decode and scheduling to be amortized.

But also because, as I discussed in my Berkeley Parlab presentation of Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate somewhat the deficiencies of coherent threading, specifically the problem of divergence.
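As a concrete, purely illustrative sketch of that "synthesized out of scalar operations" point in CUDA terms (not from the post): operate-under-mask is just each thread predicating its own scalar value, and a vector sum reduction is a short tree of scalar adds plus a lane-exchange primitive. This assumes CUDA 9+ (__shfl_down_sync) and a block size that is a multiple of 32; the kernel and names are hypothetical.

__global__ void masked_sum(const float *x, const int *keep, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // "Operate under mask": each thread predicates its own scalar contribution.
    float v = (i < n && keep[i]) ? x[i] : 0.0f;

    // "Vector sum reduction": log2(32) steps of scalar adds across the warp.
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, off);

    if ((threadIdx.x & 31) == 0)        // lane 0 now holds the warp's partial sum
        atomicAdd(out, v);
}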
From: Terje Mathisen "terje.mathisen at on 24 Jul 2010 02:48
Andy Glew wrote:
> Vector register files can be cheaper than ordinary register files.
> Instead of having to allow any register to be accessed at full speed,
> vector ISAs only require you to index the first element of a vector
> quickly; subsequent elements can stream along with greater latency.
> However, I'm not aware of any recent vector hardware uarch that has
> taken advantage of this possibility. Usually they just build a great
> big wide register file.

And this is needed! If you check actual SIMD-type code, you'll notice that various forms of permutation are _very_ common, i.e. you need to rearrange the order of data in one or more vector registers.

If vectors were processed in streaming mode, we would have the same situation as for the Pentium 4, which did half a register in each half cycle in the fast core, but had to punt each time you did a right shift (or any other operation which could not be processed in little-endian order).

I once saw a reference to Altivec code that used the in-register permute operation more than any other opcode.

> Vector ISAs are on a slippery slope of ISA complexity. First you have
> vector+vector -> vector ops. Then you add vector sum reductions. Inner
> products. Prefix calculations. Operate under mask. Etc. This slippery
> slope seems much less slippery for CIMT, since most of these operations
> can be synthesized simply out of the scalar operations that are their
> basis.

Except that even scalar code needs prefix/mask type operations in order to get rid of some branches, right? All (most of?) the others seem to boil down to a need for a fast vector permute...

> Vector chaining is a source of performance - and complexity. It happens
> somewhat for free with Nvidia-style scalar SIMT, and the equivalent of
> more complicated chaining complexes can be set up using ATI/AMD's VLIW
> SIMT.
>
> All this being said, why would I be interested in reviving vector ISAs?
>
> Mainly because vector ISAs allow the cost of instruction decode and
> scheduling to be amortized.
>
> But also because, as I discussed in my Berkeley Parlab presentation of
> Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
> somewhat the deficiencies of coherent threading, specifically the
> problem of divergence.

Please tell!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
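For what it's worth, the SIMT-side analogue of the in-register permute Terje describes is a warp shuffle: where Altivec uses vperm (or SSE a shuffle such as pshufd) to rearrange lanes inside one register, a warp of 32 scalar threads rearranges the same data by having every lane fetch the value held by some other lane. A minimal sketch, assuming CUDA 9+ __shfl_sync; the kernel is illustrative only:

__global__ void reverse_within_warps(const float *in, float *out, int n)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;                  // lane id within the warp
    float v  = (i < n) ? in[i] : 0.0f;

    // Arbitrary permutation across the 32-element "register": here lane k
    // fetches the value held by lane 31-k, i.e. a full reversal.
    float permuted = __shfl_sync(0xffffffffu, v, 31 - lane);

    if (i < n)
        out[i] = permuted;
}

Note that a shuffle only exchanges data within one warp; a permutation spanning more than 32 elements would have to go through shared memory instead.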