From: "Andy "Krazy" Glew" on 30 Dec 2009 20:27 nmm1(a)cam.ac.uk wrote: > In article <4B399C3B.2050207(a)patten-glew.net>, > Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote: >>> That doesn't help significantly. For the real benefit, you need to >>> be able to ensure that objects that are accessed in separate threads >>> can be separated by the implementation to any degree that it needs. >>> That's a problem. >> How can you do this with linked datastructures? >> >> E.g. the flavor of sparse array that has cells >> (row,col,data,ptr_to_next_in_row,ptr_to_next_in_col) >> with linked lists on both dimensions. >> Sometimes parallelized by row, sometimes by columns; >> sometimes both at the same time. > > Just as for arrays. You copy in and copy back, when and as necessary. > Yes, it costs, which is why most Fortran compilers use copy-in/copy-out > for arrays only when essential. > > My point is that the language must be such as to ALLOW the compiler > to do that. Fortran, Python etc. by and large do. C, C++ and Java, > by and large don't. Copy in/out helps, but generalize: Some funky linked data structure where you can't tell a priori what subsections there are. Or, heck, an N-dimensional array, where you sometimes want row access, sometimes columns, sometimes any of the diagonals, sometimes sub-matrices - where, in general, you can't tell what subset of the data structure is being manipulated. And where there is not some hierarchy that is natural for locking or other mutex. Copy in/out helps. Particularly for dense arrays - because the parallel threads are not deallocating structure nodes, but are just changing values. You may get inconsistent values when you copy back in - values that are not consistent with their surrounding values - but the dense array isn't broken. Whereas with a subset of linked datastructure that does not have a single dominator, you can't necessarily copy back in a changed datastructure, with nodes that have been added or deleted. Herlihy's non-blocking lock free algorithms using compare-and-swap depend on there being a single pointer that dominates the data structure you are changing. Failing that, you need either transaction - K-way compare and swap, or full transactions. Or you need locks. If you are operating on something like an irregular but static mesh, okay, you may be able to copy in/out. But if the mesh is evolving as objects move. Anyway, yes, copy in/out helps. But it is not the final answer. The real problem in so much of this is that we have memory. Data structures that are operated on in place. And, no matter what advocates of data flow, logic programming, applicative or functional languages say, I don't think that we are going to do away with them.
From: "Andy "Krazy" Glew" on 30 Dec 2009 20:35 Terje Mathisen wrote: > nmm1(a)cam.ac.uk wrote: >> In article<MMOdnSSpXI_wwaTWnZ2dnUVZ_tGdnZ2d(a)bestweb.net>, >> Mayan Moudgill<mayan(a)bestweb.net> wrote: >>> Sorry; don't see how this suffers from any kind of degradation. Could >>> you elaborate? >> >> Because, if simultaneous updates of the same cache line are not >> coherent (in whatever sense required by the language), that code is >> incorrect. This has nothing to do with performance, and a great >> deal to do with programs giving occasional, non-repeatable wrong >> answers. > > So then you need the compiler to know that all accesses within a cache > line (or whatever the coherency size is) of both edges of the range are > special, and must be handled with totally separate instructions, right? > > Sounds like something it should be trivial to add to any given C > compiler! :-) > > Actually it seems to me that a compiler which supports OpenMP or similar > must do quite a bit of this when automatically partitioning a loop... void par_func( char* a, char* b ) { if( in_different_cache_lines(a,b) ) { par { operate_in_processor_1(func,a); operate_in_processor_2(func,b); } } else { // because a and b are in the same cache line // and because I am in a non-cache-coherent-system // cannot run in parallel // so just run locally. func(a); func(b); } } Sure, some compilers do stuff like this. Although they hate doing it, because it leads to combinatoric explosion. But that's PERFORMANCE tuning. Whereas the code above is necessary for CORRECTNESS. And what about when a and b are not char*, but are pointers to some possibly intersecting linked data structures?
From: "Andy "Krazy" Glew" on 30 Dec 2009 20:45 Mayan Moudgill wrote: > nmm1(a)cam.ac.uk wrote: >> In article <MMOdnSSpXI_wwaTWnZ2dnUVZ_tGdnZ2d(a)bestweb.net>, >> Mayan Moudgill <mayan(a)bestweb.net> wrote: >> >>>>> Sorry, don't understand. Can you give a more concrete example? What >>>>> do you mean by "byte-separation accesses"? >>>> >> >> Because, if simultaneous updates of the same cache line are not >> coherent (in whatever sense required by the language), that code is >> incorrect. > > I see. You are worried that, given code of the form: > > for p in processors > x[p] = foo(p,...) > > in the case of *non-coherent* memory, the writeback of x[M] may > overwrite the already written value of x[M-1] (or any other value of x > sharing a cache line with x[M]). > > This is exacerbated if x[] is a byte array, since it increases the > chances of collisions, and because one hardware workaround (write masks) > would require more bits (you need to keep track of dirty bytes as > opposed to dirty words). > > If I have understood your problem properly, why would making the entire > array write-through not be an adequate solution? Making the data structure write-through, with byte enables on the write throughs, would be adequate. Except: I'm trying to come up with a policy that allows read caching, potentially of stale data. That does not suffer all of the performance problems of write-through. That does not mandate the immediate invalidations of cache coherency protocols. Note, by the way, that there are at least two approaches to write through. (1) The write-through that I grew up with did not obtain ownership of the line being written to: you wrote, a transaction with byte enables went out, and that transaction invalidated all other copies as it went around. This is correct for certain memory ordering models, and for certain interconnection networks. E.g. bus; e.g. tree-hierarchical with PC, but not SC. (2) IBM and, lately, QPI, insist on doing the invalidation of all other copies of write through lines before the write is done. Call them invalidate-before-write-through versus invalidate-as-writing-through. These write through schemes have cache coherency overhead: you've got to be able to find the lines that you want to invalidate. Plus, they invalidate other copies of the line, even if the user does not depend on the bytes being written. I am trying to come up with a scheme that does not have the cache coherence protocol - software has to do explicit invalidations and flushes - and allows other copies to be kept around, not invalidated on the first write to any part.
From: "Andy "Krazy" Glew" on 30 Dec 2009 21:16 nmm1(a)cam.ac.uk wrote: > THE memory model? That's my point - there isn't one, there are at > least two - the one provided by the hardware and the one assumed by > the programming language. > > ... Currently, almost all shared-memory parallel > models being introduced by programming languages are MORE consistent > than what the hardware currently provides, rather than less (which > would allow freedom to change caching technology). In my experience somewhat the reverse. The languages have a weaker memory ordering model than the hardware. (Admittedly, at Intel and AMD I worked on the processors that were put into large systems - the system vendor did not need to implement the memory model we were implementing.) E.g. the Java standard's sequential atomics. The language allows arbitrary reordering, except at well identified places. Whereas the hardware enforces ordering on all memory operations. Actually, for historical reasons the old Intel official memory ordering model was PC, and the semantics of fences and synchronizing atomic RMWs were not well defined. But the actual hardware always implemented something stricter, except for bugs. With the publication in the last few years of the reformed Intel memory ordering model (hmm, I may have forgotten the plaque for that when I left Intel), we just published and made official the stronger memory ordering model that the hardware had been doing since time immemorial. I.e. at the start, say 1991 Programming languages specified a really weak memory ordering model, say MO-0 (almost no model at all). Hardware built a fairly strong memory ordering model, say MO-0.8. (TSO, almost.) But we published a slightly weaker memory ordering model, say MO-0.7. (PC) Over the years, programming languages got stricter and stricter. Say MO-0.5, Sarita Adve's sequential atomics. Somebody realized that there were incompatibilities with the published documents: it wasn't that MO-0.7 was strictly stronger, it was a partial order. MO-0.7 PC ?>? MO-0.5 SA But the hardware model actually implemented was strictly stronger, MO-0.8 TSO-like > MO-0.5 SA So the published model was updated. === But I agree with Nick's point about implementation freedom. Nick just has a different world of experience than I do. The supercomputer people keep saying "We don't want cache coherency." and "We don't want memory ordering." "We want to manage both in software, if you give us the primitives." Which is fine... except that caches really do work pretty well in many cases. Some people think that we will have domains that are cache coherent, in combination with domains that share memory, but are not cache coherent. If you think about it, the really hard aspects of cache consistency protocols all have to do with memory ordering: making sure that there is only one writer to a cache line at any time. But the programmers say they are okay on handling that themselves. Caching per se is easy enough to implement. There are some interesting issues in how to give software cache control primitives, but let's assume that you can do that. So, let's see: * I want caching * Weak ordering * I do not want to have to invalidate other copies of a cache line, etc., etc. This is easy to implement. Heck, I can change cache line size whenever I want. If I want to build 512 byte cache lines, I can. If I want to build vector strided cache lines, I can. (Yes, vector strided caches have been proposed, that store not contiguous cache lines but, say, 4 bytes every N. 
They are quite implementable in a non-MP sense But in MP, the coherency overhead gets ugly.) But I can't change cache line size... not because of cache coherency, but because software that was allocating stuff at 64B cache lines will break if I have a different cache line size, larger, or strided, or ... So that's why I am thinking about tyhe bitmasks. Whether byte or word. I'm wlling to suspend disbelief and believe software when they say they can manage memory ordering and cache consistency in software. That reduces a lot of overhead; ad if we make the cache line granularity invisible to software correctness wise, it opens up a whol;ew slew of possibilities for hardware. --- Unfortunately, I suspect that software really can't manage cache consistency or weak ordering anywhere near as well as it thinks. I suspect they may have hidden assumptions, like "I can manage weak ordering as long as the inconsistencies of the eventual consistency model are only short lived; but if the inconistencies can last arbitrarily long times, not so good..." (e.g. you mean "I *have* to flush? I can't just wait in a spin loop?) But it is fun to explore new possibilities.
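[An aside on the "ordering only at well-identified places" point: today's C++ atomics express exactly that language-level view, regardless of how strong the underlying hardware is. A small illustration (my example, not Glew's):

    #include <atomic>

    int payload = 0;                    // plain data, freely reorderable
    std::atomic<bool> ready{false};     // the one "well identified place"

    void producer() {
        payload = 42;                                   // ordinary store
        ready.store(true, std::memory_order_release);   // ordering imposed here only
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { }  // and here
        int v = payload;   // guaranteed visible after the acquire
        (void)v;
    }

On an x86-TSO machine the hardware orders far more than this program requires; the language model is the weaker of the two, which is Glew's point about languages versus hardware.]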
From: "Andy "Krazy" Glew" on 30 Dec 2009 22:03
Robert Myers wrote:
> This exchange, while impressive for its depth of experience and
> understanding, is a bit like a discussion of how to save the Ptolemaic
> universe.
>
> Keeping track of things by memory locations that can be written to and
> accessed by multiple processes and that can have multiple meanings
> based on context is a losing enterprise.
>
> No matter what you propose, you'll wind up with specifications that
> read like the health care bill now before Congress. And, as I think
> Nick is pointing out with current solutions, the bugs won't always be
> the result of programmer incompetence.
>
> You've both got so much invested in thinking about these things that
> you'll beat it to death forever, rather than admitting that the entire
> model of stored variables simply doesn't work.
>
> Other models have been proposed. Time to get on with it.

I agree with you that memory and stored variables are the fundamental problem. I am sympathetic to languages and programming models that seek to ameliorate this fundamental problem. I think that we must encourage programming models such as functional, applicative, etc.

But... there's just so much memory-oriented stuff.

Here's a challenge: rewrite all of Knuth's Art of Computer Programming, all of the volumes on datastructures, etc., in terms of your favorite non-memory programming model. Better yet, show an automatic transform. Show that no efficiency is lost via the transform. Even better, if you can completely automate this, then accept ordinary programs written for memory datastructures as input, and output your favorite non-memory paradigm.

---

It's easy to do this for arrays. It's harder to do it for linked data structures. Oftentimes, mechanisms such as futures use linked datastructures behind the scenes.

Maybe linked datastructures are the problem? Maybe LDSes are the datastructure equivalent of assembly code? Maybe only programming gods should use them to implement the highly parallel systems, while we just use the highly parallel stuff that the gods have produced?

---

I think that part of the problem is that it is just so much easier to write some forms of memory / stored-variable code. E.g. compare the following code snippets for finding the sum of an array:

    sum = 0;
    for(int i = 0; i < n; i++) {
        sum += a[i];
    }

versus

    sum = 0 +
      FORC(int i = 0; i < n; i++) DOC
        a[i]
      ENDC

(the latter being a trick I created in iHDL. Particularly fun with IFs.)

Oh, sure, you can use some sort of map_reduce function. Now, which of the umpteen flavors of map_reduce do you want to use? In C++, do you write for(T::iterator i = ds.begin(); i != ds.end(); i++)? Or do you know the STL's map/reduce templates by heart? And do you create functors on a whim? (I do... but I still find the for-iterator code easier to read and write.)

Sure, a good compiler will handle some of the above examples. But it probably won't handle fancier examples.

It's just a simple matter of retraining all of the programmers in the world, and rewriting most of the textbooks. We need to do much of this. But it won't happen overnight. And it probably won't happen completely. We'll probably have stored variables for a very long time, even if non-memory programming models increase from <<0.1% to a reasonable fraction of the code in the world.

---

By the way, one thing I think will help is developing good hybrid notations. E.g. a lambda with an annotation that says "in this scope, I am using sequential notation."

    sum = {seq:
        sum_so_far = 0;
        for(int i = 0; i < n; i++) {
            sum_so_far += a[i];
        }
        return sum_so_far;
    }

Allow single assignment to be called from non-single-assignment code, and vice versa. Arbitrarily layered.
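[In today's C++ the closest analogue to that {seq: ...} sketch is probably an immediately invoked lambda: ordinary sequential, stored-variable code inside, a single-assignment binding outside. A minimal sketch of the idea, mine rather than Glew's:

    #include <vector>
    #include <cstddef>

    int sum_of(const std::vector<int>& a) {
        // Outside: a single-assignment style binding.
        // Inside the lambda: plain sequential code over a local variable.
        const int sum = [&] {
            int sum_so_far = 0;
            for (std::size_t i = 0; i < a.size(); ++i) {
                sum_so_far += a[i];
            }
            return sum_so_far;
        }();
        return sum;
    }

The mutation stays scoped to the lambda body, so the surrounding code can be written in a single-assignment style - the "arbitrarily layered" hybrid the post is asking for.]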