From: "Andy "Krazy" Glew" on 26 Dec 2009 22:56

Mayan Moudgill wrote:
> Bernd Paysan wrote:
> >
> > Sending chunks of code around which are automatically executed by the
> > receiver is called "active messages".
>
> I'm not so sure. The original Active Messages stuff from Thorsten von
> Eicken et al. was more like passing a pointer to a user-space interrupt
> handler along with an inter-processor message, so that the message
> could be handled with zero copies / low latency (OK, it wasn't always
> quite that - but it's close in flavor). The interrupt handler code was
> already resident on the processor.
>
> I've never heard of pushing code for execution on another processor
> being called "active messages" - citations? references?

True - the only references I have seen to active messages say "pointer to code".

But a pointer to code only works with a shared address space. It doesn't work with message passing, unless you can assume that all code has been distributed in advance to all of the non-shared-address-space processors.

Now, since we are sometimes talking about PGAS/SHMEM, and sometimes about message passing with no shared memory, sometimes code pointers are sufficient. But even with code pointers in a shared-memory system, if the code is being generated on the fly, the fencing and flushing and pushing of the code gets interesting.

So, I am thinking along two tracks: pushing the code, as well as pushing pointers when that can be arranged.

---

Here's another reason: Tom Pawlowski of Micron, at SC09's "Future Challenges" panel, made a pitch for an abstract interface for DRAM. I.e. no more RAS, CAS: just read, write, and a few other operations. In some ways, this amounts to hiding the present (remote) memory controller in the stack of DRAMs. If those few other operations are to include active messages, you can't rely on shared code pointers.
From: Andrew Reilly on 26 Dec 2009 23:35

On Sat, 26 Dec 2009 19:44:12 -0800, Andy "Krazy" Glew wrote:

> Andrew Reilly wrote:
>> On Wed, 23 Dec 2009 21:17:07 -0800, Andy "Krazy" Glew wrote:
>> Why prefer adding layers of protocol and mechanism, so that you can
>> coordinate the overwriting of memory locations, instead of just writing
>> your results to a different memory location (or none at all, if the
>> result is immediately consumed by the next computation)?
>>
>> [I suspect that the answer has something to do with caching and hit
>> rates, but clearly there are trade-offs and optimizations that can be
>> made on both sides of the fence.]
>
> Yep. It's caching. Reuse of memory.
>
> I've run experiments where you never reuse a memory location. (You can
> do it even without language support, by a mapping in your simulator.)
> Performance very bad. 2X-4X worse.

That doesn't sound so bad to me, for a maximally pessimistic test. These days I find I'm pretty happy to run at 50% of theoretical peak performance if I'm getting something useful in return (and, as I explain below, I expect that the result would typically be much better than that.)

> You have to somehow reuse memory to benefit from caches.

Sure, and you still get most of that with functional code: all re-reading is still cache hits, and most functional language implementations these days make some effort to ensure that the purely LIFO (stack) behaviour common for local variables and arguments does wind up re-using the stack space, resulting in hits on writes.

So, of the remaining data stores (new allocations, probably missing in cache), how does that cost (with appropriate write buffering and what-not) compare to protecting those writes with explicitly atomic operations or mutexes? I don't think that we've got much data to go on, yet: most of the functional languages are still pretty green when it comes to actually using lots of processors.
Many still use green threads or similar for their threading operations (limiting them to single cores). So there isn't (as far as I know) much in the way of scale-up functional applications to base measurements on. There are some in Erlang (like the messaging system underneath Google Wave, for instance), but there are other reasons why it's difficult to use Erlang code as a performance benchmark. It's coming, though. PLT is experimenting with a lightweight mechanism for using multiple cores called "futures", which sounds to me as though it could be useful (not vastly different from Apple's GCD "blocks" and golang's "goroutines", I think.)

Cheers,
--
Andrew
From: Mayan Moudgill on 27 Dec 2009 01:22

Andy "Krazy" Glew wrote:
> Yep. It's caching. Reuse of memory.
>
> I've run experiments where you never reuse a memory location. (You can
> do it even without language support, by a mapping in your simulator.)
> Performance very bad. 2X-4X worse.
>
> You have to somehow reuse memory to benefit from caches.

Or use an architecture which supports allocate-and-zero-line-without-fetching - the PowerPC dcbz instruction. Or write an entire line/block at a time (SIMD registers where SIMD size = cache-block size, or store-multiple-registers on implementations where a store-multiple doesn't get split up into single-register transfers).
From: Anne & Lynn Wheeler on 27 Dec 2009 01:23

re:
http://www.garlic.com/~lynn/2009s.html#34 Larrabee delayed: anyone know what's happening?
http://www.garlic.com/~lynn/2009s.html#35 Larrabee delayed: anyone know what's happening?

The communication group had other mechanisms besides outright opposition. At one point the disk division had pushed thru corporate approval for a kind of distributed environment product ... and the communication group changed tactics (from outright opposition) to claiming that the communication group had corporate strategic responsibility for selling such products. The product then had a price increase of nearly ten times (compared to what the disk division had been planning on selling it for).

The other problem with the product was that the shipped mainframe support only got about 44kbytes/sec thruput while using up a 3090 processor(/cpu). I did the enhancements that added RFC1044 to the product and in some tuning tests at Cray Research got 1mbyte/sec thruput while using only a modest amount of a 4341 processor (an improvement of approx. 500 times in terms of instructions executed per byte moved) ... the tuning tests were memorable in other ways ... the trip was a NW flight to Minneapolis that left the gate 20 minutes late ... however it was still wheels up out of SFO five minutes before the earthquake hit.

misc. past posts mentioning rfc1044 support
http://www.garlic.com/~lynn/subnetwork.html#1044

also slightly related:
http://www.garlic.com/~lynn/2009s.html#32 Larrabee delayed: anyone know what's happening?

slight digression: the mainframe product had done its tcp/ip protocol stack in vs/pascal. It had none of the common buffer-related exploits that are common in C language implementations. It wasn't that it was impossible to make such errors in pascal ... it was that it was nearly as hard to make such errors as it is hard not to make such errors in C. misc.
past posts
http://www.garlic.com/~lynn/subintegrity.html#overflow

In the time-frame of doing the rfc 1044 support, I was also getting involved in the HIPPI standards and what was to become the FCS standards ... at the same time as trying to figure out what to do about SLA when the rs/6000 shipped. ESCON was the mainframe variant that ran 200mbits/sec ... but got only about 17mbytes/sec aggregate thruput, minor reference:
http://www-01.ibm.com/software/htp/tpf/tpfug/tgs03/tgs03l.txt

RS/6000 SLA was tweaked to 220mbits/sec ... and was looking at significantly better than 17mbytes/sec sustained, but full-duplex, in each direction (not aggregate; in each direction ... in large part because it wasn't simulating half-duplex with the end-to-end synchronous latencies).

also, while the communication group was doing things like trying to shut down client/server, as part of preserving the terminal emulation install base ... we had come up with 3-tier architecture and were out pitching it to customer executives (and taking more than a few barbs from the communication group) ... misc. past posts mentioning 3-tier
http://www.garlic.com/~lynn/subnetwork.html#3tier

also these old posts with references to the (earlier) period ... with pieces from an '88 3-tier marketing pitch
http://www.garlic.com/~lynn/96.html#16 middle layer
http://www.garlic.com/~lynn/96.html#17 middle layer

this is a reference to a jan92 meeting looking at part of ha/cmp scaleup (commercial & database, as opposed to numerical intensive) & FCS
http://www.garlic.com/~lynn/95.html#13
where FCS was looking better than 100mbyte/sec full-duplex (i.e. 100mbyte/sec in each direction).

for other drift ... some old email more related to ha/cmp scaleup for numerical intensive, and some other national labs issues:
http://www.garlic.com/~lynn/2006x.html#3 Why so little parallelism?

now part of client/server ... two of the people mentioned in the jan92 meeting reference ...
later left and showed up at a small client/server startup responsible for something called a "commerce server" (we had also left, in part because the ha/cmp scaleup work had been transferred and we were told we weren't to work on anything with more than four processors) ... and we were brought in as consultants because they wanted to do payment transactions. The startup had also invented this technology called "SSL" that they wanted to use ... and the result is now frequently called "electronic commerce".

Part of this "electronic commerce" thing was something called a "payment gateway" (which we periodically claim was the original "SOA") ... some past posts
http://www.garlic.com/~lynn/subnetwork.html#gateway

which required a lot of availability ... taking payment transactions from webservers potentially all over the world; for part of the configuration we used rs/6000 ha/cmp configurations.
http://www.garlic.com/~lynn/subtopic.html#hacmp

in any case ... one of the latest buzzwords is "cloud computing" ... which appears to be trying to (at least) move all the data back into a datacenter ... with some resemblance to old-time commercial time-sharing ... for other drift, misc. past posts mentioning (mainframe) virtual-machine-based commercial time-sharing service bureaus starting in the late 60s and going at least into the mid-80s
http://www.garlic.com/~lynn/submain.html#timeshare

--
40+yrs virtualization experience (since Jan68), online at home since Mar1970
From: nmm1 on 27 Dec 2009 05:50
In article <4B36D748.3060904(a)patten-glew.net>, Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:

>Terje Mathisen wrote:
>>
>> I accept however that if both you and Andy think this is bad, then it
>> probably isn't such a good idea to allow programmers to be surprised by
>> the difference between one size of data objects and another, both of
>> which can be handled inside a register and with size-specific load/store
>> operations available.
>> :-(
>
>I'm not as sure as Nick is.

As I posted later, I am only sure in the context of the currently favoured programming languages. You may remember that I keep banging on about how they are obstacles and not assistances ....

>Since this is a somewhat new approach, I tend to think in terms of extremes.

People have been trying to tweak parallelism into existing programming paradigms, as well as trying to resolve the 'memory wall' problem by tweaking hardware, for 30 years - and the success has been, at best, limited. For example, caches work well, but only on codes for which caches work well, and that is as true today as it was 30 years ago. So I believe that any major improvement must come from a radically new approach, which will look extreme to some people.

Regards,
Nick Maclaren.