From: Mayan Moudgill on 1 Jan 2010 03:35

Bernd Paysan wrote:
> Mayan Moudgill wrote:
>
>> So: summarizing - I still don't think active messages is the right
>> name. I haven't encountered any real-life instances where people
>> actually send code to be executed (or even interpreted) at a low-level
>> inside the device driver.
>
> I have. Several different ones, to tell you. One guy (Heinz Schnitter)
> sends source code around - this is a distributed programming system.
> Another one (Chuck Moore) sends instructions around - this is a small
> chip full of tiny CPUs. They all did not really generalize, though I
> know that Chuck Moore knows what Heinz did.

Got any references that are publicly available?

Thanks
Mayan
From: Gavin Scott on 1 Jan 2010 04:22

Bernd Paysan <bernd.paysan(a)gmx.de> wrote:
> The tragedy is that it was never designed for that job. It just
> happened. The consequence of all these accidental pieces growing
> together as the "next OS" is a nightmare.

Indeed, but that hardly makes it unique in this industry :)

G.
From: nmm1 on 1 Jan 2010 07:30

In article <4B3CF140.2000504(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>> The key is that you can't do ANYTHING when handicapped by a memory
>> model like those of C, C++ or Java - once the address of a subobject
>> is required to be fixed, you are dead.
>
> Ah, now we are getting somewhere.
>
> Now, I *like* the fact that C in some ways is essentially a structured
> assembly - at a particular point in time an object or subobject has an
> address, and that can be taken and used.

If it had been left at that, it would have been fine. The disaster
was to put a layer of high-level language paint on it.

> However, I agree that it is the possibility of such an address
> persisting forever that gets in the way of so many things.
>
> Although... garbage collection systems can at least locate all such.
> However, we would not want to have to do such a scan from the GC roots
> frequently, as part of some synchronization or parallelization
> operations.

C99 allows you to encrypt addresses and/or save them on disk.
Seriously.

> So, I agree that it might be nice to have a language "mode" that
> forbids or deprecates the address-of operator. E.g. one that had a
> special representation for parameters, that allowed reference,
> copy-in/out, etc. semantics. E.g. one that had a special
> representation for the pointers that so many people use to walk down
> an array or linked list.

You couldn't add one to C without making it unusable. Even ignoring
the fact that subscription needs the address-of primitive, there is
just SO much else that depends on the basic static model. You can't
read data, handle strings or subsection arrays without the address-of
primitive, to name just three features. Simply forbidding the
operator isn't enough.

Also, modes have never worked in the past, and I can't see anything
that has changed. Obviously, the language needs to be designed to
support the modes, or else they are merely 'style guidelines', which
have failed time and time again. But that's not enough. There isn't
any real likelihood of starting with a C-like language and adding
enough constraints to make it reliable or optimisable.

I looked at C# and wasn't impressed. Yes, it tackled many of the
details, but it skirted around most of the underlying problems. This
problem has been tackled many times over the past 40 years, and is
well-known to be hard. It needs a more radical approach than C#.

Regards,
Nick Maclaren.
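To make the point about implicit addresses concrete, here is a minimal
sketch in plain ISO C (the program is invented purely to illustrate,
not taken from any post). It never writes '&', yet addresses are taken
on nearly every line anyway:

    #include <stdio.h>

    int main(void)
    {
        char line[80];
        int  hist[26] = {0};

        /* 'line' decays to a pointer to its first element here:
           an address is taken without any '&' appearing. */
        if (fgets(line, sizeof line, stdin) == NULL)
            return 1;

        /* line[i] is defined as *(line + i): every subscription
           is address arithmetic plus a dereference. */
        for (size_t i = 0; line[i] != '\0'; i++) {
            if (line[i] >= 'a' && line[i] <= 'z')
                hist[line[i] - 'a']++;
        }

        printf("'e' occurred %d times\n", hist['e' - 'a']);
        return 0;
    }

Forbidding the operator while keeping these semantics forbids
essentially nothing, which is the point made above.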
From: nmm1 on 1 Jan 2010 07:37

In article <4B3D7D40.4070807(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>>
>>> Herlihy's non-blocking lock free algorithms using compare-and-swap
>>> depend on there being a single pointer that dominates the data
>>> structure you are changing.
>>
>> One of the many reasons that so many people feel that a double form
>> of compare-and-swap is essential.
>
> I've read Greenwald's thesis, and I am sympathetic. In fact, I have
> proposed implementations for DCAS, twice at Intel and once at AMD.
>
> However
>
> 1) It doesn't meet all needs - witness the folks who ask for KCSS
> (K-way Compare and Swap Synchronization) and Transactional Memory
> with arbitrary numbers of locations involved.
>
> 2) It doesn't seem to be necessary in a theoretical sense:

Agreed, but with reservations on the second point. Most of the
alternative approaches have ignored the performance implications.
Floating-point hardware isn't needed in a theoretical sense, either,
but there are people who would scream if you dropped it :-)

What is clear is that there are a zillion ways to provide adequate
'atomic' primitives, and most of them can be claimed to be 'best' if
you choose the right criteria. I doubt that agreement will be
reached in my lifetime.

> The thing I don't like about DCAS or KCSS or TM is that most of the
> hardware proposals depend on having a cache coherence protocol.
>
> As explained in other posts, of late I have been thinking in terms of
> supercomputers without a cache coherence protocol.

It's a thorny problem. I don't have any good ideas.

Regards,
Nick Maclaren.
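For readers who haven't met the Herlihy construction: here is a
minimal sketch of the "single dominating pointer" pattern, written
with C11 atomics (the node type and names are invented for
illustration). A stack push only ever has to swing one word - the
head pointer - so a single CAS suffices; structures where two
independent words must change together are exactly where people start
asking for DCAS:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct node {                     /* illustrative node type */
        int          value;
        struct node *next;
    };

    /* The one pointer that dominates the whole structure. */
    static _Atomic(struct node *) head = NULL;

    /* Lock-free push: the only shared word modified is 'head', so
       a single compare-and-swap makes the whole update atomic. */
    void push(int value)
    {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)
            abort();
        n->value = value;
        n->next  = atomic_load(&head);

        /* On failure, n->next is refreshed with the current head,
           so we simply retry until no other thread has raced us. */
        while (!atomic_compare_exchange_weak(&head, &n->next, n))
            ;
    }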
From: Mayan Moudgill on 1 Jan 2010 13:59
Robert Myers wrote:
> On Dec 31, 8:30 am, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>
>> Any problem that is not I/O bound that can be solved using a
>> non-imperative language can be made to work faster in an imperative
>> language with the appropriate library (i.e. if you can write it in
>> ML, it can be re-written in C-as-high-level-assembly, and it will
>> perform better).
>
> If your original claim is correct even for one machine (and the truth
> of your claim is not obvious to me, given the lengthy discussion here
> about memory models),

It's not necessarily simple, but it is definitely do-able. One of the
economic costs is being able to find the programmer(s) who can pull
it off.

> does it then follow that it is also possible for a second, or a
> third machine? If the cost of rewrite is bounded for one machine
> (that is to say, you eventually succeeded in getting the program to
> work - at least you *think* you got it to work - correctly), is it
> bounded for a second or a third machine?

Yes - generally, to get it done for one machine, one will have had to
identify all sharing/communication between the parallel entities.
This is work you don't have to do again. There is a caveat - if the
code (algorithms, task partitioning) has to be reworked because of
large differences between the systems, then much less code/learning
carries over.

> I'll accept that Nick is an expert at finding and fixing weird memory
> problems. From reading his posts, I'm convinced that the memory model
> of C is not well-defined or even agreed-upon in any practical sense.

Which is why it's C-as-high-level-assembly, not C-as-an-ANSI-standard,
that will be used for getting the job done.

Actually, that's not strictly true - you have to partition stuff into
things that are isolated to one process/CPU and code dealing with the
sharing of state/synchronization between processes. The isolated case
can generally be written in vanilla C (or any other language), while
the sharing case has to be written non-portably - perhaps actually in
assembly. Hopefully, a lot of the shared code can be hidden behind
macros, function calls or code generators to isolate the behavior, as
in the sketch below.

> Thus, not only are you worried about whether your program will work
> with a second machine, but even with a second compiler with a
> different notion of critical details of C.

Again: you don't rely on the compiler to get those details correct.
You have to ensure that they are correct. This may mean constraining
the compiler in ways that make it no more than a high-level assembler,
or using inline assembly, or even straight assembly where necessary.

The problem is that unless you've done it, you don't know where the
friction points are, and you assume that it's too difficult. It isn't
- it's just complicated engineering. I can think of lots of systems
codes which are in many ways more complicated.

> From an economic point of view, the proposed trade doesn't even seem
> rational: trade a program with transparently correct memory semantics
> (perhaps because there are no memory semantics to be incorrect) for
> one that is faster but may or may not do the same thing under some
> limited set of circumstances.

Generally, tasks that are not trivially parallelizable/distributable
and are not I/O bound are parallelized because the performance is
inadequate. If the performance is inadequate, it may be because we
don't have the best serial implementation, or because the best serial
implementation is itself not sufficient.
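To make the "hide the sharing behind macros" idea concrete, here is a
minimal sketch of such an isolation layer (the file name, macro names
and the two ports are invented for the example; a real layer would be
audited by hand for each machine). The non-portable, memory-model-
dependent text is a few lines in one header; everything that calls
PUBLISH() stays vanilla C:

    /* sync_port.h - hypothetical portability layer for the sketch.
       All knowledge of the machine's memory model lives here; the
       bulk of the program never touches it. */
    #ifndef SYNC_PORT_H
    #define SYNC_PORT_H

    #if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
      /* x86: stores are not reordered with older stores, so a
         compiler barrier is enough for release-style publication. */
      #define BARRIER()  __asm__ __volatile__("" ::: "memory")
      #define PUBLISH(slot, val) \
          do { BARRIER(); (slot) = (val); } while (0)
    #elif defined(__GNUC__) && defined(__powerpc__)
      /* PowerPC: needs a real fence before the publishing store. */
      #define BARRIER()  __asm__ __volatile__("lwsync" ::: "memory")
      #define PUBLISH(slot, val) \
          do { BARRIER(); (slot) = (val); } while (0)
    #else
      #error "port me: audit this machine's sharing rules by hand"
    #endif

    #endif /* SYNC_PORT_H */

A second machine then means re-deriving this header, not re-examining
the whole program - which is why the rewrite cost stays bounded.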
What is the slowdown between approach X (for your favorite value of
X) and the best serial implementation? This slowdown matters - a lot.
If, for instance, the slowdown is 4x, does that mean that we will end
up with identical performance using 4-way parallelism? Probably not -
the parallel inefficiencies will probably mean that we break even at
6x-8x.

So: is it more economic to focus on writing a serial imperative
program or a parallel approach X program?

How about the case where we're going to *have to* parallelize - when
even the best serial program is just too slow? In that case, both the
imperative approach and the alternative(s) will have to be parallel.
What are the inefficiencies here? The hierarchical nature of
communication between processors means that a 4-way parallel machine
will have better communication properties than a 16-way parallel
machine, which in turn will be better than a 64-way, and so on. This
means that if we can fit an imperative parallel program onto a 4-way
machine, and approach X is 4x slower, then approach X will be forced
onto a 16-way machine. But since it is now one level down the
communication hierarchy, it is quite possible that it will be even
slower, requiring, say, a 32-way machine to be competitive.

Also, in some programs it is easy to extract a small amount of (task)
parallelism, but it is not possible to extract large (or unbounded)
parallelism.

It is possible that we have access to an N-way machine, that there is
N-way parallelism available in the program, that the N-way solution
using approach X is fast enough, and that we prioritize the advantages
of using approach X (time-to-market, programmer availability, etc.)
over the baseline, highest-performance approach. In that case, we are
free to speculate about the various alternative programming
approaches.

As an aside: if approach X uses garbage collection, then there may be
a significant penalty for going from a serial implementation to a
parallel implementation; a simple comparison of the performance of
the sequential implementations in approach X and an imperative
language may understate the comparative performance gap.
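A back-of-the-envelope version of that break-even argument, as a
sketch (the 4x slowdown is from the paragraph above; the efficiency
figures are invented to mimic the communication hierarchy, not
measurements):

    #include <stdio.h>

    /* Assumed model (invented for the sketch): approach X runs S
       times slower serially, and parallel efficiency decays as the
       machine scales down the communication hierarchy. */
    static double efficiency(int nway)
    {
        if (nway <= 4)  return 0.90;
        if (nway <= 16) return 0.75;
        if (nway <= 64) return 0.60;
        return 0.50;
    }

    int main(void)
    {
        const double S = 4.0;   /* approach X is 4x slower serially */

        for (int n = 4; n <= 64; n *= 2) {
            /* Time relative to the best serial program: S / (n*E). */
            double t = S / (n * efficiency(n));
            printf("%2d-way: relative time %.2f%s\n",
                   n, t, t <= 1.0 ? "  (break-even reached)" : "");
        }
        return 0;
    }

With these numbers the 4x-slower approach breaks even somewhere
between 4-way and 8-way - the 6x-8x intuition above. Make the
efficiency decay steeper, and the forced move down the communication
hierarchy pushes break-even out much further, which is the 16-way vs.
32-way scenario in the post.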