From: bartc on

"John W Kennedy" <jwkenne(a)attglobal.net> wrote in message
news:4bca7f2a$0$22520$607ed4bc(a)cv.net...
> On 2010-04-17 04:44:49 -0400, robin said:
>> Now 3 octets will be 9 bits,
>
> Robin, will you /please/ stop blithering on about things you don't
> understand?! Buy a Data Processing dictionary, for God's sake!
>
> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char. For the
> rest, I'd do the following:
>
> typedef struct Pixel {
>     unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j) {
>         dest[j][i].red = src[i][j].red;
>         dest[j][i].green = src[i][j].green;
>         dest[j][i].blue = src[i][j].blue;
>     }
...
> memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));

And perhaps dest[j][i] = src[i][j];

But in practice W,H might only be known at runtime, making the code rather
different. Depending on the exact format of the image data, there might also
be padding bytes (nothing to do with C's struct padding), for example at the
end of each row.
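
A minimal sketch of that case, assuming 24-bit pixels and a per-row
source stride (the function and parameter names here are mine, not
anything from the original post; BMP-style padding of each row to a
4-byte boundary is one reason the stride can exceed w * sizeof(Pixel)):

#include <stddef.h>

typedef struct Pixel {
    unsigned char red, green, blue;
} Pixel;

/* Transpose an h-row by w-column image into a w-row by h-column one.
   Source rows are src_stride bytes apart, which may be more than
   w * sizeof(Pixel) when rows are padded; dest is densely packed. */
void transpose(Pixel *dest, const unsigned char *src,
               size_t w, size_t h, size_t src_stride)
{
    for (size_t j = 0; j < h; ++j) {
        const Pixel *row = (const Pixel *)(src + j * src_stride);
        for (size_t i = 0; i < w; ++i)
            dest[i * h + j] = row[i];    /* dest[i][j] = src[j][i] */
    }
}

The cast is safe here because Pixel is three chars: its alignment is 1
and sizeof (Pixel) is 3, with no internal padding.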

--
Bartc


From: Niels Jørgen Kruse on
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:

> The ARM Cortex A9 CPU is out-of-order, and is becoming more and more
> widely used in things like cell phones and iPads.

Cortex A9 is not shipping in any product yet (I believe). Lots of
preannouncements though. The Apple A4 CPU is currently believed to be a
tweaked Cortex A8, perhaps related to the tweaked A8 that Intrinsity did
for Samsung before being acquired by Apple.

Someone with a jailbroken iPad (or having paid the $99 developer fee)
could run benchmarks to probe the properties of the CPU.
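
A minimal sketch of one such probe - a dependent pointer chase, where
each load's address comes from the previous load, so the time per step
approximates load-to-use latency at whatever cache level the working
set fits in. (This is a generic sketch, not iPad-specific code; clock()
is a crude timer, and a libc with a large RAND_MAX is assumed.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = 1 << 20;           /* elements; vary to sweep cache levels */
    void **ring = malloc(n * sizeof *ring);
    size_t *perm = malloc(n * sizeof *perm);
    if (!ring || !perm) return 1;

    for (size_t i = 0; i < n; ++i) perm[i] = i;
    for (size_t i = n - 1; i > 0; --i) {       /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; ++i)             /* link into a single cycle */
        ring[perm[i]] = &ring[perm[(i + 1) % n]];

    size_t steps = 50000000;
    void **p = &ring[perm[0]];
    clock_t t0 = clock();
    for (size_t i = 0; i < steps; ++i)
        p = *p;                                /* serially dependent loads */
    clock_t t1 = clock();

    printf("%p %.2f ns/load\n", (void *)p,     /* print p: keep loop live */
           (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / steps);
    free(ring); free(perm);
    return 0;
}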

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Robert Myers on
On Apr 17, 11:07 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:

> I suspect that we will end up in a bifurcated market: out-of-order for the high performance general purpose computation
> in cell phones and other important portable computers, in-order in the SIMD/SIMT/CoherentThreading GPU style
> microarchitectures.
>
> The annoying thing about such bifurcation is that it leads to hybrid heterogeneous architectures - and you never know how
> much to invest in either half.  Whatever resource allocation you make to in-order SIMD vs. ooo scalar will be wrong for
> some workloads.
>
A significant part of the resources of the Cray 1 was nearly useless
to almost any customer who bought and/or used such machines. Machines
were bought by customers who had no use for the vector registers, and
by customers for whom a whole class of scalar registers was nearly
beside the point.

However difficult those choices may have been (and I'm not sure they
weren't less important than the cost and cooling requirements of the
memory), the machines were built and people bought and used them.

I don't think the choices are nearly as hard now. Transistors are
nearly free, but active transistors consume watts, which aren't free.
There are design costs to absorb, but you'd rather spread those costs
over as many chips as possible, even if it means that most customers
have chips with capabilities they never use. So long as the useless
capabilities are idle and consume no watts, everyone is happy.

> I think that the most interesting thing going forward will be microarchitectures that are hybrids, but which are
> homogeneous: where ooo code can run reasonably efficiently on a microarchitecture that can run GPU-style threaded SIMD /
> Coherent threading as well.  Or vice versa.  Minimizing the amount of hardware that can only be used for one class of
> computation.

I thought that was one of the goals of pushing scheduling out to the
compiler. I still don't know whether the goal was never possible or
Itanium was just a hopelessly clumsy design.

Robert.
From: Robert Myers on
On Apr 17, 6:58 pm, Brett Davis <gg...(a)yahoo.com> wrote:

>
> No, nice guess.
> I don't know of any CPU that stalls on a read after write; instead
> they try to forward the data, and in the rare case when a violation
> occurs the CPU will throw an interrupt and restart the instructions.
> This is an important comp.arch point, so someone will correct me
> if I am wrong.
>
> > The buffer overlap aliasing considerations in C prevents the
> > compiler from automatically rearranging the original LD ST
> > order to be more pipeline friendly for a particular cpu,
> > but Fortran would allow it.
>
> This is close; the answer is a mundane but important point.
> CPUs today have a load-to-use delay of ~2 cycles from level-one
> cache (less for slower chips, more in faster chips).
> An OoO chip can try to find other instructions to execute, but
> even they are subject to this delay. (I believe.)
> This copy loop is so small I don't think there is much even an
> Intel/AMD chip can do. I was hoping you would benchmark it. ;)
>
> A lot of the dual-issue CPUs are partially OoO and support hit-
> under-miss for reads, but these will epic-fail at running code
> like this faster.
>
> The future of computing (this decade) is lots of simple in-order CPUs.
> Rules of die size, heat and efficiency kick in. Like ATI chips.
>
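To make the quoted aliasing point concrete, here is a minimal C sketch
(the function names are mine): with plain pointers the compiler must
assume dst and src may overlap and so must keep the written load/store
order, while C99 restrict asserts no overlap, giving the compiler the
reordering freedom Fortran gets from its argument-aliasing rules.

#include <stddef.h>

/* dst and src may alias, so each store could feed a later load;
   the compiler must preserve the load/store order as written. */
void copy_may_alias(unsigned char *dst, const unsigned char *src,
                    size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* restrict promises no overlap, so the compiler may hoist loads
   ahead of stores, unroll, or use wide/vector moves. */
void copy_no_alias(unsigned char *restrict dst,
                   const unsigned char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
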
Regardless of how natural and even gratifying it may be to you, the
outlook you espouse is job security for you, but what I would deem an
unacceptable burden for ordinary mortals.

I believe that even L1 cache delays have varied between at least one
and two cycles; that L2 remains very important even for OoO chips, and
that L2 delays vary even more; that L3 delay tradeoffs are something
that someone like, say, Andy would understand, but that most others
wouldn't; and that the circumstances that cause a stall are not always
clear, as evidenced by the discussion here.

If you can write code for the exact CPU and memory setup and test it
and have the time to do lots of tinkering, then super-slick hand-coded
optimizations might be worth talking about in something other than a
programming forum, and there not because the ideas have general
applicability, but because that's the kind of detail that so many
programmers seem keen on.

As it is, the computer architecture tradeoffs, like the tradeoffs in
cache delays, are probably obsessed over by computer architects, but I
can't see the relevance of a particular slick trick in C to any such
decision-making.

Robert.
From: "Andy "Krazy" Glew" on
On 4/18/2010 1:36 AM, nmm1(a)cam.ac.uk wrote:
> As I have
> posted before, I favour a heterogeneous design on-chip:
>
> Essentially uninterruptible, user-mode-only, out-of-order CPUs
> for applications etc.
> Interruptible, system-mode-capable, in-order CPUs for the kernel
> and its daemons.

This is almost opposite what I would expect.

Out-of-order tends to benefit OS code more than many user codes. In-order coherent threading benefits mainly fairly
stupid codes that run in user space, like multimedia.

I would guess that you are motivated by something like the following:

System code tends to have unpredictable branches, which hurt many OOO machines.

In system code you may want to be able to respond to interrupts quickly. I am guessing that you believe that OOO has
worse interrupt latency. That is a misconception: OOO machines tend to have better interrupt latency, since they
usually redirect to the interrupt handler at retirement. However, they lose more work.

(Anecdote: in P6 I asked the Novell NetWare guys if they wanted better interrupt latency or minimal work lost. They
preferred the latter, even at the cost of longer interrupt latency. However, we gave them the former, because it was
easier.)

Also, OOO CPUs tend to have more state like TLBs and caches that is not directly related to interrupts, but which
affects interrupt latency.

Finally, it is true that tricks like alternate register sets for interrupt handlers tend to be more prevalent on in-order machines.

--

I think workloads may be divided into several categories:

a) Trivial throughput-oriented - the sort of workload that benefits most from in-order coherent-threaded GPU-style
microarchitectures. Lots of parallelism. Simple instruction and memory coherency.

b) Classical out-of-order workloads: irregular parallelism, pointer chases but also sizeable work at each pointer miss.
Predictable branches.

c) Intermediate: unpredictable branches, but pointers with fan-out/MLP. Classic system code. For that matter, let's
throw interrupt latency into the mix.


OOO dataflow can help speed up system code in class c), but you may lose the benefit due to branch mispredictions.
Better to switch threads than to predict a flakey branch.

Now, code in class c) can also be executed on in-order thread-switching systems. OOO dataflow just improves the latency
of such code, which amounts to reducing the number of threads needed for a given performance level. Since, in my
experience, there are far fewer threads in class c) than in either of the other classes, reducing the number of system
threads required seems like a good tradeoff.



The taxonomy is not complete. These are just the combinations that I see as most important.