From: "Andy "Krazy" Glew" on
On 4/18/2010 2:29 PM, nmm1(a)cam.ac.uk wrote:
> In article<3782bf12-b3f5-4003-94a9-0299859358ed(a)y17g2000yqd.googlegroups.com>,
> MitchAlsup<MitchAlsup(a)aol.com> wrote:
>> On Apr 18, 1:15 pm, "Andy \"Krazy\" Glew"<ag-n...(a)patten-glew.net>
>> wrote:
>>
>>> System code tends to have unpredictable branches, which hurt many OOO
>>> machines.
>>
>> I think it is easier to think that system codes have so much inherent
>> serialization that the efforts applied in doing OoO are "for naught"
>> and that these great big OoO machines degrade down to just about the
>> same performance as their absolutely in-order cousins.
>>
>> It's a far bigger issue than simple branch mispredictability. Pointer
>> chasing into poorly cached data structures is rampant; "dangerous"
>> instructions are inherently serialized; and TLB translation success
>> rates are poor. Overall, there just is not that much ILP left in many
>> of the paths through system codes.
>
> That was the experience in the days of the System/370. User code
> got a factor of two better ILP than system code.


I surprised a friend who is working on speculative multithreading when he asked what benchmark I used for my SpMT work.
I said "gcc". In my experience, gcc is the user mode benchmark tha is most challenging, and which most resembles system
code.

I reject "inherently serialized" instructions. Very little need be inherently serialized. Such serialiations tend to
happen because you have not wanted to rename or predict the result. Only true MSR/creg accesses need be inherently
serialized.

Pointer chasing: I'm the MLP guy. I can show you a dozen ways to make pointer chasing run faster. Mainly: very
seldom do you just access the pointer. Usually you access p = p->next or p = p->link, plus several fields p->field1,
p->field2. You always need to consider the ratio of non-pointer chases to pointer chases. Of late, the ratio has been
INCREASING, i.e. system code has been becoming more amenable.
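
To make that concrete, here is a minimal C sketch (the node layout and names are hypothetical, and the prefetch uses
GCC's __builtin_prefetch): only the load of p->next sits on the serial dependence chain, so an out-of-order core can
overlap the field loads of several nodes in flight, and a prefetch can start the next chase early.

#include <stddef.h>

struct node {
    struct node *next;     /* the chased pointer      */
    long field1, field2;   /* the non-pointer payload */
};

/* Only the load of p->next is serializing; the field loads are
   independent of it, so the higher the field-to-chase ratio, the
   more memory-level parallelism there is to extract. */
long sum_fields(struct node *p)
{
    long sum = 0;
    while (p != NULL) {
        struct node *n = p->next;       /* the one serial load      */
        __builtin_prefetch(n);          /* start the next chase now */
        sum += p->field1 + p->field2;   /* overlaps with the chase  */
        p = n;
    }
    return sum;
}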

TLB miss rates: again, I can show/have shown many ways to improve these. One of my favorites is to cache a predicted
TLB translation inside a data memory cache line, possibly using space freed up by compression.
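
A rough software model of that idea, purely illustrative (all structure and function names here are hypothetical;
the real mechanism would live in hardware): each data line carries a predicted physical page for the pointer it
holds, so a dependent access can start before the page walk confirms, and is replayed on a mispredict.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical data cache line that also caches a predicted
   translation, e.g. in space freed up by compression. */
struct dcache_line {
    uint64_t tag;
    uint64_t data[8];          /* 64-byte line                         */
    uint64_t pred_phys_page;   /* predicted translation (may be stale) */
    bool     pred_valid;
};

/* Stand-in for the slow, authoritative TLB/page walk
   (identity-mapped here just so the model runs). */
static uint64_t page_walk(uint64_t vaddr) { return vaddr & ~0xFFFull; }

/* Chase a pointer held in this line: start the dependent access with
   the predicted page, verify against the real translation, and repair
   on a mispredict. */
uint64_t chase(struct dcache_line *line, uint64_t vptr)
{
    uint64_t offset = vptr & 0xFFFu;
    uint64_t spec = line->pred_valid ? (line->pred_phys_page | offset)
                                     : (page_walk(vptr) | offset);
    uint64_t real = page_walk(vptr) | offset;  /* in hw: in parallel   */
    if (spec != real)
        spec = real;               /* misprediction: replay the access */
    line->pred_phys_page = real & ~0xFFFull;
    line->pred_valid = true;
    return spec;
}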

Mitch: you're a brilliant guy, but you have only seen a small fraction of my ideas. Too bad we never got to work
together at AMD or Motorola.
From: "Andy "Krazy" Glew" on
On 4/19/2010 9:17 AM, MitchAlsup wrote:
> On Apr 18, 4:32 pm, n...(a)cam.ac.uk wrote:
>> Yup. In my view, interrupts are doubleplus ungood - message passing
>> is good.
>
> CDC was the only company to get this one right. The OS ran mostly* in
> the peripheral processors, leaving the great big number cruncher to
> (ahem) crunch numbers.
>
> (*) The interrupt processing and I/O were run in the PPs and most of the
> OS scheduling was run in the PPs.
>
> I remember a time back in 1979, I was logged into a CDC 7600 in
> California doing text editing. There were a dozen other members of the
> SW team doing similarly. There was a long silent pause where no
> character echoing was taking place. A few moments later (about 30
> seconds) the processing returned to normal. However, we found out that
> we were now logged into a CDC 7600 in Chicago. The California machine had
> crashed, and the OS had picked up all the nonfaulting tasks, shipped
> them up to another machine half way across the country and restarted
> the processes.
>
> Why can't we do this today? We could 30 years ago!
>
> Mitch


At Intel, on P6, we made some deliberate decisions that prevented this. We "deliberately" decided not to provide fault
containment within shared memory - most notably, we had incomplete cache tag snooping. When an error was detected, we
could not guarantee how far it had propagated - it might have propagated anywhere in cache coherent shared memory.

I quote "deliberately" because I was aware of this decision - I flagged it, and its consequences - I don't know how far
up the chain of command it propagated. Actually, I don't think it mattered - we would probably have made the smae
decision no matter what. The real problem was that, when demand arose to have better error containment, the knowledge
was lost, and had to be reconstructed. Usually without involving the original designer (me).

Nehalem has added error poison propagation, so this sort of thing can now be done.

When will you see OSes taking advantage? Don't hold your breath.

By the way, OSes have nearly always been able to do this using message passing. But apparently there was not enough demand.
From: nmm1 on
In article <4BCC9FCA.5010007(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/18/2010 2:29 PM, nmm1(a)cam.ac.uk wrote:
>> In article<3782bf12-b3f5-4003-94a9-0299859358ed(a)y17g2000yqd.googlegroups.com>,
>> MitchAlsup<MitchAlsup(a)aol.com> wrote:
>>> On Apr 18, 1:15 pm, "Andy \"Krazy\" Glew"<ag-n...(a)patten-glew.net>
>>> wrote:
>>>
>>>> System code tends to have unpredictable branches, which hurt many OOO
>>>> machines.
>>>
>>> I think it is easier to think that system codes have so much inherent
>>> serialization that the efforts applied in doing OoO are "for naught"
>>> and that these great big OoO machines degrade down to just about the
>>> same performance as their absolutely in-order cousins.
>>>
>>> It's a far bigger issue than simple branch mispredictability. Pointer
>>> chasing into poorly cached data structures is rampant; "dangerous"
>>> instructions are inherently serialized; and TLB translation success
>>> rates are poor. Overall, there just is not that much ILP left in many
>>> of the paths through system codes.
>>
>> That was the experience in the days of the System/370. User code
>> got a factor of two better ILP than system code.
>
>I surprised a friend who is working on speculative multithreading when
>he asked what benchmark I used for my SpMT work. I said "gcc". In my
>experience, gcc is the user mode benchmark that is most challenging, and
>which most resembles system code.

Isn't there a GUI benchmark? Most of that code is diabolical. But I
agree that gcc is an excellent bellwether for a lot of kernel, daemon
and utility code.

>I reject "inherently serialized" instructions. Very little need be
>inherently serialized. Such serializations tend to happen because you
>have not wanted to rename or predict the result. Only true MSR/creg
>accesses need be inherently serialized.

There are some, but they tend to be used sparsely in run-time systems
and language libraries, rather than open code. But I don't know
what you are counting as MSR/creg accesses.


Regards,
Nick Maclaren.
From: Chris Gray on
Noob <root(a)127.0.0.1> writes:

> The loop kernel is

> 24: 06 61 mov.l @r0+,r1                 ! load longword, post-increment r0
> 26: 10 42 dt r2                         ! decrement r2, set T when it reaches zero
> 28: 12 23 mov.l r1,@r3                  ! store longword to @r3
> 2a: fb 8f bf.s 24 <_rotate_right+0x24>  ! loop while T is clear (delayed branch)
> 2c: 7c 33 add r7,r3                     ! delay slot: advance r3 by stride r7

[Not referring to this specific code, but just following up.]

Why can't modern CPUs optimize the heck out of the relatively simple
code that a compiler might produce for a block copy? They have all of
the information they need - the addresses, the length, the alignments,
the position relative to page boundaries, cache lines, write buffers, etc.

Compilers often look at large chunks of code to figure out what they
are doing (e.g. Sun's "heroic optimizations" of a few years ago). CPUs
have transistors to burn now; why can't they look for patterns that
can be executed faster? Detect block copies, and turn them into
streaming fetches and stores, limited only by memory speeds. Don't
cache the data, don't purge any existing nonconflicting write buffers,
etc. Is the latency of detecting the situation too large?

Lots of code does a lot of copying - there could be a real benefit.
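
For what it's worth, software can already express part of that "don't cache the data" hint today. A minimal sketch,
assuming x86 with SSE2 and 16-byte-aligned buffers whose size is a multiple of 16 (non-temporal stores bypass the
caches, which is roughly what a copy-detecting CPU would do on its own):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Stream a block copy past the caches with non-temporal stores.
   Assumes 16-byte alignment and a size that is a multiple of 16. */
void stream_copy(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);   /* ordinary cached load  */
        _mm_stream_si128(&d[i], v);          /* store bypasses caches */
    }
    _mm_sfence();   /* make the streaming stores visible before later stores */
}

The detection question still stands, though: the hardware could recognize the plain compiled loop and do this
automatically, without the programmer opting in.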

--
Experience should guide us, not rule us.

Chris Gray
From: Morten Reistad on
In article <4BCB4C2A.8080601(a)patten-glew.net>,
Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/18/2010 1:36 AM, nmm1(a)cam.ac.uk wrote:
>> As I have
>> posted before, I favour a heterogeneous design on-chip:
>>
>> Essentially uninterruptible, user-mode only, out-of-order CPUs
>> for applications etc.
>> Interruptible, system-mode capable, in-order CPUs for the kernel
>> and its daemons.
>
>This is almost opposite what I would expect.
>
>Out-of-order tends to benefit OS code more than many user codes. In-order coherent threading benefits mainly fairly
>stupid codes that run in user space, like multimedia.
>
>I would guess that you are motivated by something like the following:
>
>System code tends to have unpredictable branches, which hurt many OOO machines.
>
>In system code you may want to be able to respond to interrupts easily. I am guessing that you believe that OOO has worse
>interrupt latency. That is a misconception: OOO tends to have better interrupt latency, since they usually redirect to
>the interrupt handler at retirement. However, they lose more work.

...... interesting perspectives deleted ....

This general approach of throwing resources at the CPU and at
the compiler so we can work around all kinds of stalls has rapidly
diminishing returns at this point, with our deep pipelines, pretty
large 2-4 levels of cache, and code that is written without regard
to deep parallelism.

We can win the battle, but we will lose the war if we continue down
that path. We must let the facts sink in: the two main
challenges for modern processing are the "memory wall" and the "watts
per MIPS" challenge.

The memory wall is a profound problem, but bigger and better caches
can alleviate it. At this point, I mean lots and lots of
caches, and well-interconnected ones too.

Return to the RISC mindset: back down a little on CPU
power, and instead give us lots of CPUs, and lots and lots of cache.

It is amazing how well that works.

Then we will have to adapt software, which happens pretty fast
in the Open Source world nowadays, when there are real performance
gains to be had.

For the licensing problems, specifically Windows, perhaps a
hypervisor can address that, and keep the core systems like databases,
transaction servers etc. running either under some second OS
or directly under the hypervisor, and let Windows be a window
onto the user code. And I am sure licensing will be adapted
if such designs threaten the revenue stream.

For the recalcitrant, single-thread code I would suggest taking
the autotranslation path: recode on the fly. The Alpha team
and Transmeta have proven that this is viable.

Or, we may keep a 2-core standard chip for the monolithic
code, and add a dozen smaller cores and a big cache for the
stuff that is already parallelized. This seems like the
path the GPU coders are taking. Just integrate the GPUs
with the rest of the system, and add a hypervisor.


-- mrr