From: Brett Davis on 17 Apr 2010 18:58

In article <Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
 EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
> Brett Davis wrote:
> >>>>> If you do the following you can get a 2x speedup; it looks like
> >>>>> more code, but it will generate less, and the results will be
> >>>>> pipelined correctly.
> >>>>> Extra bonus points to those that understand why. Half the posters here?
> >>>>>
> >>>>> {
> >>>>>     unsigned char *C = B+(H*j+H-i-1)*3;
> >>>>>     temp0 = A[0];
> >>>>>     temp1 = A[1];
> >>>>>     temp2 = A[2];
> >>>>>     C[0] = temp0;
> >>>>>     C[1] = temp1;
> >>>>>     C[2] = temp2;
> >>>>>     A += 3;
> >>>>> }
> >>>>>
> >>>>> Do not use *C++ = *A++;
> >>>
> >>> Only one in a hundred programmers knows an optimization like that; for
> >>> half of comp.arch to be that good says good things about comp.arch.
> >
> > Benchmark it, and try to stay in cache, as a 66 MHz embedded CPU does
> > not have memory 600 cycles away.
> > Only an Intel-class chip has enough OoO machinery to have a chance
> > of coming close to the same speed as my code.
> >
> > Brett
>
> Oh, you are avoiding read after write data
> dependency pipeline stalls on in-order cpus.

No, nice guess.
I don't know of any CPU that stalls on a read after write; instead they try to forward the data, and in the rare case where a violation occurs the CPU will throw an interrupt and restart the instructions. This is an important comp.arch point, so someone will correct me if I am wrong.

> The buffer overlap aliasing considerations in C prevent the
> compiler from automatically rearranging the original LD ST
> order to be more pipeline friendly for a particular cpu,
> but Fortran would allow it.

This is close; the answer is a mundane but important point. CPUs today have a load-to-use delay of about 2 cycles from the level-one cache (less in slower chips, more in faster chips). An OoO chip can try to find other instructions to execute, but even those are subject to this delay. (I believe.)
This copy loop is so small that I don't think there is much even an Intel/AMD chip can do with it. I was hoping you would benchmark it. ;)

A lot of the dual-issue CPUs are partial OoO and support hit-under-miss for reads, but these will epic fail at running code like this faster.

The future of computing (this decade) is lots of simple in-order CPUs. The rules of die size, heat, and efficiency kick in. Like ATI chips.

> Yeah ok that could be about 2x speed for that kind of cpu.
>
> Eric
From: "Andy "Krazy" Glew" on 17 Apr 2010 23:07

On 4/17/2010 3:58 PM, Brett Davis wrote:
> In article <Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
> EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>
> The future of computing (this decade) is lots of simple in-order CPUs.
> Rules of die size, heat and efficiency kick in. Like ATI chips.

This remains to be seen. I am tempted to say "That is so last decade 200x".

Wrt GPUs, perhaps. However, in the last few months I have been seeing evidence of a trend the other way:

The ARM Cortex A9 CPU is out-of-order, and is becoming more and more widely used in things like cell phones and iPads. Apple's PA Semi team's last processor was a low-power PowerPC. I suspect that we will soon see out-of-order processors in the Qualcomm Snapdragon family and the Intel Atom family.

Intel has delayed Larrabee, in-order vector SIMD (as opposed to GPU-style threaded SIMD). I would not be surprised to see an out-of-order flavor of such an x86 vector SIMD. De facto, AVX is that, although not in the same space as Larrabee.

I suspect that we will end up in a bifurcated market: out-of-order for the high-performance general-purpose computation in cell phones and other important portable computers, and in-order in the SIMD/SIMT/CoherentThreading GPU-style microarchitectures.

The annoying thing about such bifurcation is that it leads to hybrid heterogeneous architectures, and you never know how much to invest in either half. Whatever resource allocation you make to in-order SIMD vs. OoO scalar will be wrong for some workloads.

I think that the most interesting thing going forward will be microarchitectures that are hybrids, but homogeneous: where OoO code can run reasonably efficiently on a microarchitecture that can also run GPU-style threaded SIMD / Coherent Threading, or vice versa, minimizing the amount of hardware that can only be used for one class of computation.
From: John W Kennedy on 17 Apr 2010 23:40

On 2010-04-17 04:44:49 -0400, robin said:
> Now 3 octets will be 9 bits,

Robin, will you /please/ stop blithering on about things you don't understand?! Buy a Data Processing dictionary, for God's sake!

The rest of this is addressed to the original poster: I don't understand why you're using int variables for octets; they should be char. For the rest, I'd do the following:

typedef struct __Pixel {
    unsigned char red, green, blue;
} Pixel;

Pixel src[W][H];
Pixel dest[H][W];

for (int i = 0; i < W; ++i)
    for (int j = 0; j < H; ++j) {
        dest[j][i].red = src[i][j].red;
        dest[j][i].green = src[i][j].green;
        dest[j][i].blue = src[i][j].blue;
    }

I'd also try:

for (int i = 0; i < W; ++i)
    for (int j = 0; j < H; ++j)
        memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));

and see which one is faster; it will depend on the individual compiler. (Make sure you test at the same optimization level you plan to use.)

-- 
John W Kennedy
"There are those who argue that everything breaks even in this old dump of a world of ours. I suppose these ginks who argue that way hold that because the rich man gets ice in the summer and the poor man gets it in the winter things are breaking even for both. Maybe so, but I'll swear I can't see it that way."
  -- The last words of Bat Masterson
From: nmm1 on 18 Apr 2010 04:29

In article <ggtgp-2839D7.17581617042010(a)news.isp.giganews.com>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>In article <Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
> EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>
>> Oh, you are avoiding read after write data
>> dependency pipeline stalls on in-order cpus.
>
>No, nice guess.
>I dont know of any CPU that stalls on a read after write, instead
>they try and forward the data, and in the rare case when a violation
>occurs the CPU will throw an interrupt and restart the instructions.
>This is an important comp.arch point, so someone will correct me
>if I am wrong.

There used to be a fair number, and I suspect there still are, though I doubt that stalling when the read and write are on a single CPU will return.

However, I would expect at least some multi-core CPUs to stall at least sometimes, because there will not be cache-to-cache links at all levels, and they will have to wait until the write reaches the next cache (or memory) with a link. But that's guessing on the basis of history and general principles, not actual knowledge.

Regards,
Nick Maclaren.
From: nmm1 on 18 Apr 2010 04:36
In article <4BCA775A.8040604(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/17/2010 3:58 PM, Brett Davis wrote:
>> In article <Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
>> EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>>
>> The future of computing (this decade) is lots of simple in-order CPUs.
>> Rules of die size, heat and efficiency kick in. Like ATI chips.
>
>This remains to be seen. I am tempted to say "That is so last decade 200x".
>
>Wrt GPUs, perhaps.
>
>However, in the last few months I have been seeing evidence of a trend the other way:
>
>The ARM Cortex A9 CPU is out-of-order, and is becoming more and more widely used in things like cell phones and iPads.
>Apple's PA Semi team's last processor was a low-power PowerPC.
>
>I suspect that we will soon see out-of-order processors in the Qualcomm Snapdragon family and the Intel Atom family.
>
>Intel has delayed Larrabee, in-order vector SIMD (as opposed to GPU-style threaded SIMD). I would not be surprised to
>see an out-of-order flavor of such an x86 vector SIMD.
>
>I suspect that we will end up in a bifurcated market: out-of-order for
>the high performance general purpose computation in cell phones and
>other important portable computers, in-order in the
>SIMD/SIMT/CoherentThreading GPU style microarchitectures.
>
>The annoying thing about such bifurcation is that it leads to hybrid
>heterogenous architectures - and you never know how much to invest in
>either half. Whatever resource allocation you make to in-order SIMD
>vs. ooo scalar will be wrong for some workloads.

Well, yes, but that's no different from any other choice. As I have posted before, I favour a heterogeneous design on-chip:

Essentially uninterruptible, user-mode-only, out-of-order CPUs for applications etc.

Interruptible, system-mode-capable, in-order CPUs for the kernel and its daemons.

Most programs could be run on either, whichever there was more of, but affinity could be used to select which.
CPUs designed for HPC would be many:one; ones designed for file serving etc. would be one:many. But all systems would run on all CPUs.

Regards,
Nick Maclaren.