From: "Andy "Krazy" Glew" on 26 Dec 2009 22:56

Mayan Moudgill wrote:
> Bernd Paysan wrote:
> >
> > Sending chunks of code around which are automatically executed by the
> > receiver is called "active messages".
>
> I'm not so sure. The original Active Messages stuff from Thorsten von
> Eicken et al. was more like passing a pointer to a user-space interrupt
> handler along with an inter-processor message, so that the message
> could be handled with zero copies / low latency (OK, it wasn't always
> quite that - but it's close in flavor). The interrupt handler code was
> already resident on the processor.
>
> I've never heard of pushing code for execution on another processor
> being called "active messages" - citations? references?

True - the only references I have seen to active messages say "pointer to code".

But a pointer to code only works with a shared address space. It doesn't work with message passing, unless you can assume that all code has been distributed in advance to all of the non-shared-address-space processors.

Now, since we are sometimes talking about PGAS/SHMEM, and sometimes about message passing with no shared memory, sometimes code pointers are sufficient. But even with code pointers in a shared-memory system, if the code is being generated on the fly, the fencing and flushing and pushing of the code gets interesting.

So, I am thinking along two tracks: pushing the code, as well as pushing pointers when that can be arranged.

---

Here's another reason: Tom Pawlowski of Micron, at SC09's "Future Challenges" panel, made a pitch for an abstract interface for DRAM. I.e. no more RAS, CAS: just read, write, and a few other operations. In some ways, this amounts to hiding the present (remote) memory controller in the stack of DRAMs. If those few other operations are to include active messages, you can't rely on shared code pointers.
From: Andrew Reilly on 26 Dec 2009 23:35

On Sat, 26 Dec 2009 19:44:12 -0800, Andy "Krazy" Glew wrote:

> Andrew Reilly wrote:
>> On Wed, 23 Dec 2009 21:17:07 -0800, Andy "Krazy" Glew wrote:
>> Why prefer adding layers of protocol and mechanism, so that you can
>> coordinate the overwriting of memory locations, instead of just writing
>> your results to a different memory location (or none at all, if the
>> result is immediately consumed by the next computation)?
>>
>> [I suspect that the answer has something to do with caching and hit
>> rates, but clearly there are trade-offs and optimizations that can be
>> made on both sides of the fence.]
>
> Yep. It's caching. Reuse of memory.
>
> I've run experiments where you never reuse a memory location. (You can
> do it even without language support, by a mapping in your simulator.)
> Performance very bad. 2X-4X worse.

That doesn't sound so bad to me, for a maximally pessimistic test. These days I find I'm pretty happy to run at 50% of theoretical peak performance if I'm getting something useful in return (and, as I explain below, I expect that the result would typically be much better than that.)

> You have to somehow reuse memory to benefit from caches.

Sure, and you still get most of that with functional code: all re-reading is still cache hits, and most functional language implementations these days make some effort to ensure that the purely LIFO (stack) behaviour common for local variables and arguments does wind up re-using the stack space, resulting in hits on writes.

So, of the remaining data stores (new allocations, probably missing in cache), how does that cost (with appropriate write buffering and what-not) compare to protecting those writes with explicitly atomic operations or mutexes? I don't think that we've got much data to go on, yet: most of the functional languages are still pretty green when it comes to actually using lots of processors.
Many still use green threads or similar for their threading operations (limiting them to single cores). So there isn't (as far as I know) much in the way of scale-up functional applications to base measurements on. There are some in Erlang (like the messaging system underneath Google Wave, for instance), but there are other reasons why it's difficult to use Erlang code as a performance benchmark. It's coming, though. PLT is experimenting with a lightweight mechanism for using multiple cores called "futures", which sounds to me as though it could be useful (not vastly different from Apple's GCD "blocks" and golang's "goroutines", I think.)

Cheers,
--
Andrew
From: Mayan Moudgill on 27 Dec 2009 01:22

Andy "Krazy" Glew wrote:
> Yep. It's caching. Reuse of memory.
>
> I've run experiments where you never reuse a memory location. (You can
> do it even without language support, by a mapping in your simulator.)
> Performance very bad. 2X-4X worse.
>
> You have to somehow reuse memory to benefit from caches.

Or use an architecture which supports allocate-and-zero-line-without-fetching - the PowerPC dcbz instruction. Or write an entire line/block at a time (SIMD registers where SIMD size = cache-block size, or store-multiple-registers on implementations where a store-multiple doesn't get split up into single-register transfers).
From: Anne & Lynn Wheeler on 27 Dec 2009 01:23

re:
http://www.garlic.com/~lynn/2009s.html#34 Larrabee delayed: anyone know what's happening?
http://www.garlic.com/~lynn/2009s.html#35 Larrabee delayed: anyone know what's happening?

The communication group had other mechanisms besides outright opposition. At one point the disk division had pushed thru corporate approval for a kind of distributed environment product ... and the communication group changed tactics (from outright opposition) to claiming that the communication group had corporate strategic responsibility for selling such products. The product then had a price increase of nearly ten times (compared to what the disk division had been planning on selling it for).

The other problem with the product was that the shipped mainframe support only got about 44kbytes/sec thruput while using up a 3090 processor(/cpu). I did the enhancements that added RFC1044 to the product and in some tuning tests at Cray Research got 1mbyte/sec thruput while using only a modest amount of a 4341 processor (an improvement of approx. 500 times in terms of instructions executed per byte moved) ... the tuning tests were memorable in other ways ... the trip was a NW flight to Minneapolis that left the gate 20 minutes late ... however it was still wheels up out of SFO five minutes before the earthquake hit.

misc. past posts mentioning rfc1044 support
http://www.garlic.com/~lynn/subnetwork.html#1044

also slightly related:
http://www.garlic.com/~lynn/2009s.html#32 Larrabee delayed: anyone know what's happening?

slight digression: the mainframe product had done its tcp/ip protocol stack in vs/pascal. It had none of the common buffer-related exploits that are common in C language implementations. It wasn't that it was impossible to make such errors in pascal ... it was that it was nearly as hard to make such errors as it is hard not to make such errors in C. misc.
past posts
http://www.garlic.com/~lynn/subintegrity.html#overflow

In the time-frame of doing the rfc 1044 support, I was also getting involved in the HIPPI standards and what was to become the FCS standards ... at the same time as trying to figure out what to do about SLA when the rs/6000 shipped. ESCON was the mainframe variant that ran 200mbits/sec ... but got only about 17mbytes/sec aggregate thruput, minor reference:
http://www-01.ibm.com/software/htp/tpf/tpfug/tgs03/tgs03l.txt

RS/6000 SLA was tweaked to 220mbits/sec ... and was looking at significantly better than 17mbytes/sec sustained, but full-duplex, in each direction (not aggregate; in each direction ... in large part because it wasn't simulating half-duplex with the end-to-end synchronous latencies).

also, while the communication group was doing things like trying to shut down client/server, as part of preserving the terminal emulation install base ... we had come up with 3-tier architecture and were out pitching it to customer executives (and taking more than a few barbs from the communication group) ... misc. past posts mentioning 3-tier
http://www.garlic.com/~lynn/subnetwork.html#3tier

also these old posts with references to the (earlier) period ... with pieces from an '88 3-tier marketing pitch
http://www.garlic.com/~lynn/96.html#16 middle layer
http://www.garlic.com/~lynn/96.html#17 middle layer

this is a reference to a jan92 meeting looking at part of ha/cmp scaleup (commercial & database, as opposed to numerical intensive) & FCS
http://www.garlic.com/~lynn/95.html#13
where FCS was looking better than 100mbyte/sec full-duplex (i.e. 100mbyte/sec in each direction).

for other drift ... some old email more related to ha/cmp scaleup for numerical intensive, and some other national labs issues:
http://www.garlic.com/~lynn/2006x.html#3 Why so little parallelism?

now part of client/server ... two of the people mentioned in the jan92 meeting reference ...
later left and showed up at a small client/server startup responsible for something called a "commerce server" (we had also left, in part because the ha/cmp scaleup work had been transferred and we were told we weren't to work on anything with more than four processors) ... and we were brought in as consultants because they wanted to do payment transactions. The startup had also invented this technology called "SSL" that they wanted to use ... and the result is now frequently called "electronic commerce".

Part of this "electronic commerce" thing was something called a "payment gateway" (which we periodically claim was the original "SOA") ... some past posts
http://www.garlic.com/~lynn/subnetwork.html#gateway

which required a lot of availability ... taking payment transactions from webservers potentially all over the world; for part of the configuration we used rs/6000 ha/cmp configurations.
http://www.garlic.com/~lynn/subtopic.html#hacmp

in any case ... one of the latest buzzwords is "cloud computing" ... which appears to be trying to (at least) move all the data back into a datacenter ... with some resemblance to old-time commercial time-sharing ... for other drift, misc. past posts mentioning (mainframe) virtual-machine-based commercial time-sharing service bureaus starting in the late 60s and going at least into the mid-80s
http://www.garlic.com/~lynn/submain.html#timeshare

--
40+yrs virtualization experience (since Jan68), online at home since Mar1970
From: nmm1 on 27 Dec 2009 05:50
In article <4B36D748.3060904(a)patten-glew.net>, Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:

>Terje Mathisen wrote:
>>
>> I accept however that if both you and Andy think this is bad, then it
>> probably isn't such a good idea to allow programmers to be surprised by
>> the difference between one size of data objects and another, both of
>> which can be handled inside a register and with size-specific load/store
>> operations available.
>> :-(
>
>I'm not as sure as Nick is.

As I posted later, I am only sure in the context of the currently favoured programming languages. You may remember that I keep banging on about how they are obstacles and not assistances ....

>Since this is a somewhat new approach, I tend to think in terms of extremes.

People have been trying to tweak parallelism into existing programming paradigms, as well as trying to resolve the 'memory wall' problem by tweaking hardware, for 30 years - and the success has been, at best, limited. For example, caches work well, but only on codes for which caches work well, and that is as true today as it was 30 years ago. So I believe that any major improvement must come from a radically new approach, which will look extreme to some people.

Regards,
Nick Maclaren.