From: nmm1 on 15 Dec 2009 05:23

In article <JfednUfEbp5qwrrWnZ2dnUVZ_hOdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>Andy "Krazy" Glew wrote:
>
>> In strict PGAS (Private/Global Address Space) there are only two forms
>> of memory access:
>> 1. local private memory, inaccessible to other processors
>> 2. global shared memory, accessible by all other processors,
>>    although implicitly accessible everywhere the same. Not local to anyone.
>>
>> Whereas SHMEM allows more types of memory accesses, including
>> a. local memory, that may be shared with other processors
>> b. remote accesses to memory that is local to other processors,
>>    as well as remote access to memory that isn't local to anyone.
>> And potentially other memory types.
>
>I can't see that there is any benefit to having strictly private
>memory (PGAS 1. above), at least on a high-performance MP system.
>
>The CPUs are going to access memory via a cache. I doubt that there
>will be 2 separate kinds of caches, one for private and one for the
>rest of the memory. So, as far as the CPUs are concerned, there is no
>distinction.
>
>Since the CPUs are still going to have to talk to a shared memory (PGAS
>2. above), there will still be a path/controller between the bottom of
>the cache hierarchy and the shared memory. This "controller" will have
>to implement whatever snooping/cache-coherence/transfer protocol is
>needed by the global memory.
>
>The difference between shared local memory (SHMEM a) and strictly
>private local memory (PGAS 1) is whether the local memory sits below
>the memory controller or bypasses it. It's not obvious (to me at
>least) whether there are any benefits to be had by bypassing it. Can
>anyone come up with something?

I don't think you realise how much cache coherence costs, once you
get beyond small core-counts. There are two main methods: snooping
is quadratic in the number of packets, and directories are quadratic
in the amount of logic (for constant-time accesses). As usual, there
are intermediates, e.g. directories that are (say) N*sqrt(N) in both
logic and number of packets.

The main advantage of truly private memory, rather than incoherent
sharing across domains, is reliability. You can guarantee that it
won't change because of a bug in the code being run on another
processor.

Regards,
Nick Maclaren.
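
As a rough illustration of the scaling Nick describes, here is a
back-of-the-envelope C sketch (not from the thread; every constant is
illustrative, nothing is measured): snooping broadcasts each of ~N
cores' requests to all N cores, a full directory keeps ~N bits of
sharer state per line across ~N nodes, and a sparse/hierarchical
scheme can land around N*sqrt(N) in both.

    /* Back-of-the-envelope comparison of coherence costs vs. core
     * count N. Illustrative scaling only:
     *   snoop traffic   ~ N requests x N recipients      = N^2
     *   directory logic ~ N nodes x N-bit sharer vectors = N^2
     *   hybrid          ~ N*sqrt(N) (e.g. sparse directory)        */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        for (int n = 4; n <= 1024; n *= 4) {
            double snoop  = (double)n * n;
            double dir    = (double)n * n;
            double hybrid = (double)n * sqrt((double)n);
            printf("N=%4d  snoop~%8.0f  dir-logic~%8.0f  hybrid~%6.0f\n",
                   n, snoop, dir, hybrid);
        }
        return 0;
    }
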
From: Mayan Moudgill on 15 Dec 2009 06:07

nmm1(a)cam.ac.uk wrote:
>
> I don't think you realise how much cache coherence costs, once you
> get beyond small core-counts.

That has nothing to do with truly private vs. shared-local memory:
that's in the cache-coherence protocol. One can (in theory) have the
cross product of {local,global} x {coherent,non-coherent}.

And you really need to stop assuming what other people do and don't
know about stuff...

>
> The main advantage of truly private memory, rather than incoherent
> sharing across domains, is reliability. You can guarantee that it
> won't change because of a bug in the code being run on another
> processor.
>

If I wanted to absolutely guarantee that, I would put the access
control in the memory controller. If I wanted to somewhat guarantee
that, I would use the VM access right bits.
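
A minimal user-level sketch of the weaker "VM access right bits"
option, using POSIX mmap()/mprotect(); the region size is an arbitrary
assumption. Note that this constrains only code running under this
mapping; as the next post points out, a privileged process or a DMA
master is not stopped.

    /* Sketch: guard "pseudo-private" memory with VM access rights.
     * Only code subject to this mapping is constrained; privileged
     * processes and DMA bypass it entirely.                        */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define REGION_SIZE (1 << 20)   /* 1 MiB, illustrative */

    int main(void) {
        unsigned char *region = mmap(NULL, REGION_SIZE,
                                     PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) { perror("mmap"); return 1; }

        memset(region, 0xAB, REGION_SIZE);   /* owner writes freely */

        /* Drop write permission before exposing the address: a stray
         * store now faults instead of silently corrupting the data. */
        if (mprotect(region, REGION_SIZE, PROT_READ) != 0) {
            perror("mprotect");
            return 1;
        }
        printf("region now read-only: first byte = 0x%02X\n", region[0]);
        return 0;
    }
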
From: nmm1 on 15 Dec 2009 07:08

In article <B7SdnYVDl8Wf87rWnZ2dnUVZ_sSdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>> I don't think you realise how much cache coherence costs, once you
>> get beyond small core-counts.
>
>That has nothing to do with truly private vs. shared-local memory:
>that's in the cache-coherence protocol. One can (in theory) have the
>cross product of {local,global} x {coherent,non-coherent}.

One can in theory do many things that have proved to be infeasible
in practice. It is true that I misunderstood what you were trying
to say, but I assert that your words (which I quote below) match my
understanding better than your intent does.

    I can't see that there is any benefit to having strictly private
    memory (PGAS 1. above), at least on a high-performance MP system.

    The CPUs are going to access memory via a cache. I doubt that
    there will be 2 separate kinds of caches, one for private and one
    for the rest of the memory. So, as far as the CPUs are concerned,
    there is no distinction.

    Since the CPUs are still going to have to talk to a shared memory
    (PGAS 2. above), there will still be a path/controller between
    the bottom of the cache hierarchy and the shared memory. This
    "controller" will have to implement whatever snooping/cache-
    coherence/transfer protocol is needed by the global memory.

>And you really need to stop assuming what other people do and don't
>know about stuff...

I suggest that you read what I post before responding like that.
I can judge what you know only from your postings, and this is not
the first time that you have posted assertions that fly in the face
of all HPC experience, without posting any explanation of why you
think that experience is mistaken, even after being queried.

In particular, using a common cache with different coherence
protocols for different parts of it has been done, but has never
been very successful. I have no idea why you think that the
previous experience of its unsatisfactoriness is misleading.

>> The main advantage of truly private memory, rather than incoherent
>> sharing across domains, is reliability. You can guarantee that it
>> won't change because of a bug in the code being run on another
>> processor.
>
>If I wanted to absolutely guarantee that, I would put the access
>control in the memory controller. If I wanted to somewhat guarantee
>that, I would use the VM access right bits.

Doubtless you would. And that is another example of what I said
earlier. That does not "absolutely guarantee" it - indeed, it
doesn't even guarantee it, because it still leaves the possibility
of a privileged process on another processor accessing the pseudo-
local memory. And, yes, I have seen that cause trouble. You might
claim that it is a bug, but you would be wrong if you did.

Consider the case when processor A performs some DMA-capable I/O on
its pseudo-local memory. You now have different consistency
semantics according to where the I/O process runs.

Regards,
Nick Maclaren.
From: nmm1 on 15 Dec 2009 07:42

In article <4B271A90.8060302(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>>> Since integration is inevitable as well as obvious, inevitably we
>>> will have more than one cache coherent domain on chip, which are
>>> PGAS or MPI non-cache coherent between the domains.
>>
>> Extremely likely - nay, almost certain. Whether those domains will
>> share an address space or not, it's hard to say. My suspicion is
>> that they will, but there will be a SHMEM-like interface to them
>> from their non-owning cores.
>
>Actually, it's not an either/or choice. There aren't just two points
>on the spectrum. We have already mentioned three, including the MPI
>space. I like thinking about a few more:

Gug. I need to print those out and study them! Yes, I agree that
it's not an either/or choice, but I hadn't thought out that many
possibilities.

>I am particularly intrigued by the possibility of
>
>0.6) SMP-WB-bitmask: non-cache-coherent. However, "eventually
>coherent". Track which bytes have been written by a bitmask per
>cache line. When evicting a cache line, evict with the bitmask, and
>write back only the written bytes. (Or words, if you prefer.)
>
>What I like about this is that it avoids one of the hardest aspects
>of non-cache-coherent systems: (a) the fact that writes can disappear
>- not just be observed in a different order, but actually disappear
>and the old data reappear, (b) tied to cache line granularity.
>
>Tracking bitmasks in this way means that you will never lose writes.
>
>You may not know what order they get done in. There may be no global
>order.
>
>But you will never lose writes.
>
>I think SMP-WB-bitmask is more likely to be important than the weak
>models 0.7 and 0.9, in part because I am in love with new ideas, but
>also because I think it scales better.

It also matches language specifications much better than most of the
others, which is not a minor advantage. That could well be the
factor that gets it accepted, if it is.

Regards,
Nick Maclaren.
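
A C sketch of one plausible reading of the SMP-WB-bitmask mechanism:
each line carries a per-byte written-mask, and writeback merges only
the written bytes into memory, so evictions from different cores never
clobber each other's stores to disjoint bytes of the same line. All
names and sizes below are illustrative, not from any real design.

    /* Per-byte written-mask per cache line; merge-on-writeback. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    struct cache_line {
        uint8_t  data[LINE_BYTES];
        uint64_t written;   /* bit i set => byte i was stored locally */
    };

    /* Local store: update the data, record exactly which bytes changed. */
    static void line_store(struct cache_line *l, unsigned off,
                           const uint8_t *src, unsigned len) {
        memcpy(&l->data[off], src, len);
        for (unsigned i = 0; i < len; i++)
            l->written |= (uint64_t)1 << (off + i);
    }

    /* Eviction: merge only written bytes into memory. Unwritten bytes
     * keep other processors' values - order is unconstrained, but no
     * write is ever silently lost.                                    */
    static void line_writeback(const struct cache_line *l, uint8_t *mem) {
        for (unsigned i = 0; i < LINE_BYTES; i++)
            if (l->written & ((uint64_t)1 << i))
                mem[i] = l->data[i];
    }

    int main(void) {
        uint8_t mem[LINE_BYTES] = {0};
        struct cache_line a = {{0}, 0}, b = {{0}, 0};
        uint8_t x = 1, y = 2;
        line_store(&a, 0,  &x, 1);   /* core A writes byte 0  */
        line_store(&b, 63, &y, 1);   /* core B writes byte 63 of same line */
        line_writeback(&a, mem);
        line_writeback(&b, mem);     /* neither eviction clobbers the other */
        printf("mem[0]=%u mem[63]=%u\n", mem[0], mem[63]);
        return 0;
    }
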
From: Del Cecchi on 15 Dec 2009 13:37
"Robert Myers" <rbmyersusa(a)gmail.com> wrote in message news:eb1c4904-abd3-4646-8e61-dd833f806776(a)b15g2000yqd.googlegroups.com... On Dec 14, 4:03 pm, j...(a)cix.compulink.co.uk wrote: > > I wasn't explaining enough. A single memory controller does not seem > to be enough for today's big OOO x86 cores. A Core 2 Duo has two > memory > controllers; a Core i7 has three. This is inevitably pushing up pin > count. If you add a bunch more small cores, you're going to need > even > more memory bandwidth, and thus presumably more memory controllers. > This > is do doubt achievable, but the price may be a problem. Bandwidth. Bandwidth. Bandwidth. It must be in scripture somewhere. It is, but no one reads the Gospel according to Seymour any more. Is an optical fat link out of the question? I know that optical on- chip will take a miracle and maybe a Nobel prize, but just one fat link. Is that too much to ask? Robert. ---------------------- Yes it is at the moment. On the other hand you can do 10Gb/sec/differential pair on copper if you don't want to go too far. So you don't really need optics. But all the fancy dancy interface stuff adds latency, if that's ok. del. |