From: Brett Davis on
In article <Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
EricP <ThatWouldBeTelling(a)thevillage.com> wrote:

> Brett Davis wrote:
> >>>>> If you do the following you can get a 2x speedup, it looks like
> >>>>> more code, but will generate less, and the results will be
> >>>>> pipelined correctly.
> >>>>> Extra bonus points to those that understand why. Half the posters here?
> >>>>>
> >>>>> {
> >>>>> unsigned char *C = B+(H*j+H-i-1)*3;
> >>>>> temp0 = A[0];
> >>>>> temp1 = A[1];
> >>>>> temp2 = A[2];
> >>>>> C[0] = temp0;
> >>>>> C[1] = temp1;
> >>>>> C[2] = temp2;
> >>>>> A += 3;
> >>>>> }
> >>>>>
> >>>>> Do not use *C++ = *A++;
> >>> Only one in a hundred programmers knows an optimization like that, for
> >>> half of comp.arch to be that good says good things about comp.arch.
> >
> > Benchmark it, and try to stay in cache; a 66 MHz embedded CPU does
> > not have memory 600 cycles away.
> > Only an Intel-class chip has enough OoO machinery to have a chance
> > of coming close to the same speed as my code.
> >
> > Brett
>
> Oh, you are avoiding read-after-write data dependency
> pipeline stalls on in-order CPUs.

No, nice guess.
I don't know of any CPU that stalls on a read after write; instead
they try to forward the data, and in the rare case when a violation
occurs the CPU will throw an interrupt and restart the instructions.
This is an important comp.arch point, so someone will correct me
if I am wrong.

> The buffer overlap aliasing considerations in C prevent the
> compiler from automatically rearranging the original LD ST
> order to be more pipeline-friendly for a particular CPU,
> but Fortran would allow it.

This is close; the answer is a mundane but important point.
CPUs today have a load-to-use delay of ~2 cycles from level-one
cache (less for slower chips, more for faster chips).
An OoO chip can try to find other instructions to execute, but
even it is subject to this delay. (I believe.)
This copy loop is so small I don't think there is much even an
Intel/AMD chip can do. I was hoping you would benchmark it. ;)
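
Roughly, the two inner-loop shapes being compared look like this
(a sketch: the function wrappers and names are made up, and the
rotation address math for C from the original code is left out):

#include <stddef.h>

/* Naive form: each store uses the byte loaded on the line before
   it, so an in-order pipeline pays the full load-to-use delay
   three times per pixel. */
static void copy_naive(unsigned char *C, const unsigned char *A, size_t pixels)
{
    while (pixels--) {
        *C++ = *A++;
        *C++ = *A++;
        *C++ = *A++;
    }
}

/* Batched form: issue all three loads, then all three stores, so
   the loads' latencies overlap instead of stalling every store. */
static void copy_batched(unsigned char *C, const unsigned char *A, size_t pixels)
{
    while (pixels--) {
        unsigned char temp0 = A[0];
        unsigned char temp1 = A[1];
        unsigned char temp2 = A[2];
        C[0] = temp0;
        C[1] = temp1;
        C[2] = temp2;
        A += 3;
        C += 3;
    }
}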

A lot of the dual-issue CPUs are partially OoO and support
hit-under-miss for reads, but these will still epically fail at
running code like this any faster.

The future of computing (this decade) is lots of simple in-order CPUs.
Rules of die size, heat and efficiency kick in. Like ATI chips.

> Yeah ok that could be about a 2x speedup for that kind of CPU.
>
> Eric
From: "Andy "Krazy" Glew" on
On 4/17/2010 3:58 PM, Brett Davis wrote:
> In article<Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
> EricP<ThatWouldBeTelling(a)thevillage.com> wrote:
>

> The future of computing (this decade) is lots of simple in-order CPUs.
> Rules of die size, heat and efficiency kick in. Like ATI chips.

This remains to be seen. I am tempted to say "That is so last decade 200x".

Wrt GPUs, perhaps.

However, in the last few months I have been seeing evidence of a trend the other way:

The ARM Cortex A9 CPU is out-of-order, and is becoming more and more widely used in things like cell phones and iPads.
Apple's PA-Semi team's last processor was a low-power PowerPC.

I suspect that we will soon see out-of-order processors in the Qualcomm SnapDragon family and the Intel Atom family.

Intel has delayed Larrabee, its in-order vector SIMD (as opposed to GPU-style threaded SIMD). I would not be surprised to
see an out-of-order flavor of such an x86 vector SIMD. De facto, AVX is that, although not in the same space as Larrabee.

I suspect that we will end up in a bifurcated market: out-of-order for the high performance general purpose computation
in cell phones and other important portable computers, in-order in the SIMD/SIMT/CoherentThreading GPU-style
microarchitectures.

The annoying thing about such bifurcation is that it leads to hybrid heterogeneous architectures - and you never know how
much to invest in either half. Whatever resource allocation you make to in-order SIMD vs. OoO scalar will be wrong for
some workloads.

I think that the most interesting thing going forward will be microarchitectures that are hybrids, but which are
homogeneous: where OoO code can run reasonably efficiently on a microarchitecture that can also run GPU-style threaded
SIMD / Coherent Threading. Or vice versa. Minimizing the amount of hardware that can only be used for one class of
computation.
From: John W Kennedy on
On 2010-04-17 04:44:49 -0400, robin said:
> Now 3 octets will be 9 bits,

Robin, will you /please/ stop blithering on about things you don't
understand?! Buy a Data Processing dictionary, for God's sake!

The rest of this is addressed to the original poster: I don't
understand why you're using int variables for octets; they should be
char. For the rest, I'd do the following:

typedef struct Pixel {
    unsigned char red, green, blue;
} Pixel;

Pixel src[W][H];
Pixel dest[H][W];

/* Copy field by field, transposing each pixel's position. */
for (int i = 0; i < W; ++i)
    for (int j = 0; j < H; ++j) {
        dest[j][i].red = src[i][j].red;
        dest[j][i].green = src[i][j].green;
        dest[j][i].blue = src[i][j].blue;
    }

I'd also try:

for (int i = 0; i < W; ++i)
    for (int j = 0; j < H; ++j)
        memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));

and see which one is faster; it will depend on the individual compiler.
(Make sure you test at the same optimization level you plan to use.)
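
A rough way to time the two (the dimensions, the repeat count, and
the use of clock() here are just placeholders; substitute your real
data and build settings):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define W 1024
#define H 768

typedef struct Pixel { unsigned char red, green, blue; } Pixel;

static Pixel src[W][H];
static Pixel dest[H][W];

/* Field-by-field transpose (first version above). */
static void rotate_fields(void)
{
    for (int i = 0; i < W; ++i)
        for (int j = 0; j < H; ++j) {
            dest[j][i].red = src[i][j].red;
            dest[j][i].green = src[i][j].green;
            dest[j][i].blue = src[i][j].blue;
        }
}

/* memcpy-per-pixel transpose (second version above). */
static void rotate_memcpy(void)
{
    for (int i = 0; i < W; ++i)
        for (int j = 0; j < H; ++j)
            memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));
}

int main(void)
{
    clock_t t0 = clock();
    for (int n = 0; n < 100; ++n)
        rotate_fields();
    clock_t t1 = clock();
    for (int n = 0; n < 100; ++n)
        rotate_memcpy();
    clock_t t2 = clock();

    /* Touch the result so the work is not optimized away. */
    printf("%u fields: %.3f s  memcpy: %.3f s\n", dest[0][0].red,
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}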

--
John W Kennedy
"There are those who argue that everything breaks even in this old dump
of a world of ours. I suppose these ginks who argue that way hold that
because the rich man gets ice in the summer and the poor man gets it in
the winter things are breaking even for both. Maybe so, but I'll swear
I can't see it that way."
-- The last words of Bat Masterson

From: nmm1 on
In article <ggtgp-2839D7.17581617042010(a)news.isp.giganews.com>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>In article <Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
> EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>
>> Oh, you are avoiding read-after-write data dependency
>> pipeline stalls on in-order CPUs.
>
>No, nice guess.
>I don't know of any CPU that stalls on a read after write; instead
>they try to forward the data, and in the rare case when a violation
>occurs the CPU will throw an interrupt and restart the instructions.
>This is an important comp.arch point, so someone will correct me
>if I am wrong.

There used to be a fair number, and I suspect there still are. However,
I doubt that stalling when the read and write are on a single CPU
will return. I would, though, expect that at least some multi-core
CPUs stall at least sometimes, because there will not be cache-to-cache
links at all levels, and they will have to wait until the
write reaches the next cache (or memory) with a link.

But that's guessing on the basis of history and general principles,
not actual knowledge.


Regards,
Nick Maclaren.
From: nmm1 on
In article <4BCA775A.8040604(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/17/2010 3:58 PM, Brett Davis wrote:
>> In article<Kw6yn.224427$wr5.103281(a)newsfe22.iad>,
>> EricP<ThatWouldBeTelling(a)thevillage.com> wrote:
>>
>
>> The future of computing (this decade) is lots of simple in-order CPUs.
>> Rules of die size, heat and efficiency kick in. Like ATI chips.
>
>This remains to be seen. I am tempted to say "That is so last decade 200x".
>
>Wrt GPUs, perhaps.
>
>However, in the last few months I have been seeing evidence of a trend the other way:
>
>The ARM Cortex A9 CPU is out-of-order, and is becoming more and more widely used in things like cell phones and iPads.
>Apple's PA-Semi team's last processor was a low-power PowerPC.
>
>I suspect that we will soon see out-of-order processors in the Qualcomm SnapDragon family and the Intel Atom family.
>
>Intel has delayed Larrabee, its in-order vector SIMD (as opposed to GPU-style threaded SIMD). [...]
>
>I suspect that we will end up in a bifurcated market: out-of-order for
>the high performance general purpose computation in cell phones and
>other important portable computers, in-order in the
>SIMD/SIMT/CoherentThreading GPU-style microarchitectures.
>
>The annoying thing about such bifurcation is that it leads to hybrid
>heterogeneous architectures - and you never know how much to invest in
>either half. Whatever resource allocation you make to in-order SIMD
>vs. OoO scalar will be wrong for some workloads.

Well, yes, but that's no different from any other choice. As I have
posted before, I favour a heterogeneous design on-chip:

Essentially uninterruptible, user-mode-only, out-of-order CPUs
for applications etc.
Interruptible, system-mode-capable, in-order CPUs for the kernel
and its daemons.

Most programs could be run on either, whichever there was more of,
but affinity could be used to select which. CPUs designed for HPC
would be many:one; ones designed for file serving etc would be
one:many. But all systems would run on all CPUs.
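
On a current OS, the affinity selection might look something like
this rough Linux-specific sketch (assuming, purely for illustration,
that cores 0-3 are the out-of-order application CPUs):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to the (hypothetical) out-of-order
   application cores, assumed here to be CPUs 0-3. */
static int pin_to_app_cores(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; ++cpu)
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}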


Regards,
Nick Maclaren.