From: Jeff Fox on 1 Jan 2010 16:01

On Jan 1, 11:48 am, Bernd Paysan <bernd.pay...(a)gmx.de> wrote:
> >>>So: summarizing - I still don't think active messages is the right
> >>>name. I haven't encountered any real-life instances where people
> >>>actually send code to be executed (or even interpreted) at a
> >>>low-level inside the device driver.
>
> >> I have. Several different, to tell you. One guy (Heinz Schnitter)
> >> sends source code around - this is a distributed programming system.
> >> Another one (Chuck Moore) sends instructions around - this is a small
> >> chip full of tiny CPUs. They all did not really generalize, though I
> >> know that Chuck Moore knows what Heinz did.
>
> > Got any references that are publicly available?
>
> Chuck's SeaForth:
>
> http://www.intellasys.net/index.php?option=com_content&task=view&id=3...
>
> This is a commercial product, but Intellasys has basically folded down
> since patent trolls and engineers can't work together in the long run.

Green Array Chips (http://www.greenarraychips.com) has continued the development and in 2009 produced working chips in a couple of geometries and several configurations. Designs using some number of the 20k-transistor, $.01-manufacture-cost, 700-Forth-MIP, 5mW cores (in .18u) include the GA4, GA32, GA40, and GA144. Arbitration of the contract between Chuck Moore and TPL will take place soon. IntellaSys, a TPL Group company, mostly shut down in January of 2009, although they did continue with the hearing enhancement project, as reported at the Silicon Valley Forth Interest Group Forth Day meeting in November of 2009. You can read Chuck's opinions about the legal case at his website http://www.colorforth.com

The previous generation of full custom VLSI Forth chips included a network router coprocessor integrated into the design for active messaging. It routed messages, did DMA if the individual or group address bits in a message matched the node routing the message, and could interrupt the CPU to execute messages after they were in RAM. The active message processor used about 300 transistors and two of the $.01 pins. It ran autonomously at up to several hundred Mbps, but because it shared memory with everything else, bandwidth limited performance to 40 Mbps. Maximum CPU throughput was 220 Forth MIPS at 50mW; the chip also had a 40 MSPS analog coprocessor and a video I/O coprocessor/accelerator, and a manufacture cost of about $.85 in quantity due to the size of the die and the use of 68 pins. This was back in the early nineties. More information about the old F21 and the history of the chips is at my website http://www.ultratechnology.com

The lack of on-chip memory in those designs meant that each node required some external memory, and nodes were networked with several chips at each node. Adding internal RAM and ROM to the core made it reasonable to put multiple cores per chip package, which led to SEAforth (Scalable Embedded Arrays). In these designs some cores have pins and some are just connected to other cores. Some packages have enough pins for flash, some for external RAM, etc. Some cores have A/D and D/A, and some have serdes.

We took the Occam-style communication channels and implemented them as shared registers, and added the ability to address up to four of these ports at once. These ports require only a few transistors and block a node until a neighbor reads or writes. A port can be read with a pointer as data or by the program counter as instructions.
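A toy C model of such a blocking port (an illustrative sketch only - the names, the polling loops, and the use of threads are invented here; the real silicon does this with a handful of transistors and true hardware blocking, not polling):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* A SEAforth-style port modeled as a shared register plus a
       full/empty flag.  A writer blocks until the neighbor consumes
       the word; a reader blocks until a word arrives. */
    typedef struct {
        _Atomic int full;   /* 0 = empty, 1 = holds a word */
        int word;
    } port_t;

    static void port_write(port_t *p, int w) {
        while (atomic_load(&p->full)) ;   /* block until neighbor reads */
        p->word = w;
        atomic_store(&p->full, 1);
    }

    static int port_read(port_t *p) {
        while (!atomic_load(&p->full)) ;  /* block until neighbor writes */
        int w = p->word;
        atomic_store(&p->full, 0);
        return w;
    }

    static port_t port;

    /* The receiving node treats arriving words as data here; on the
       real chip the program counter can also fetch them directly as
       instructions. */
    static void *neighbor(void *arg) {
        (void)arg;
        for (int i = 0; i < 3; i++)
            printf("received %d\n", port_read(&port));
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, neighbor, NULL);
        for (int i = 0; i < 3; i++)
            port_write(&port, i * 10);
        pthread_join(&t, NULL);
        return 0;
    }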
Routing is done by packets that execute some instructions on a port and read and write data to/from ports or memory. This allows one one-cent processor to send a program to another one-cent processor in about a nanosecond and have it wake up and execute it within a couple of hundred picoseconds. We have a similar mechanism for waking up on pin changes in a couple of hundred picoseconds to process real-time events. There are some ports with serializer/deserializer hardware so that messages can go from chip to chip in the same way that they move between cores on the same chip, except slower.

The design is Forth CSP with multiport addressing capability, which makes for very small programs. From a software standpoint the design has some things in common with parallel designs like the CELL. The big differences are the CELL's large RAM spaces and floating-point hardware, which result in a 20,000:1 ratio between core sizes, so they are very different in most ways. Each CELL core equates to about 20,000 of these 700-integer-MIP Forth cores. This is also why the tiny cores are less likely to have fatal flaws and why yield has been very close to 100%.

These Forth cores have to have dense code: they have only 64 words of internal RAM and 64 words of internal ROM each. Most Forth words, by frequency of execution, are five-bit native opcodes that pack together, which makes for very dense target code. The target code, the development code, even the development tools are remarkably small and fast. We tell people that the boot code, the OS, the editor, the compiler, the full custom VLSI CAD suite with a dozen programs, target compilers, hardware and software simulators, design rule check and GDS extract utilities, and the source code to several chips all fit on a fraction of a floppy disk. When they see us do in a few seconds things that take other people all day with their tools, they are often very surprised by how our tools operate. It is also interesting to me that SPICE-based tools claim these designs are impossible and won't run at all.

These things are unusual and not what people are used to. I have not worked on a target chip for which there was a C compiler in about twenty-five years. I have seen threads about whether C is close to the machine, but I never see people ask if the machine is close to C. These chips aren't, but they have so much of Forth in hardware that much of traditional Forth isn't needed. The tiny multicore chips are different. I noticed at one trade show that we had the only multicore chips that didn't need fans. As they are not supported by mainstream tools they will most likely remain a niche product.

I don't know if 'active message' should apply to the SEAforth design or not. I think it did apply to the F21 we did long ago.

Best Wishes
From: Bernd Paysan on 1 Jan 2010 16:29

nmm1(a)cam.ac.uk wrote:
>>Hm, the most essential usage of address-of (unitary prefix &) in C is
>>to return multiple parameters. ...
>
> The mind boggles. That is so far down the list of its uses that I
> hadn't even thought of it.
>
> You could start with trying to work out how to use scanf without it,

scanf *is* precisely what I'm talking about: returning multiple values.

    int a;
    float b;
    char c[];

    (a, b, c) = tuple_fscanf(file, "%d%f%s");

The problem that this sort of "format string" based procedure is completely bonkers as an API isn't solved ;-). Of course you'd need some way to accumulate an arbitrary run-time-defined tuple (similar to the problems of varargs), if you want to keep this crazy stuff. The good news is: such a tuple as return value on the stack will not mess around with addresses that are not there, though it may push more values on the stack than needed - but the stack cleanup after calling tuple_fscanf will deal with that. Format string errors then will still lead to wrong values in the assigned tuple, but *not* in stuff written into the return address (code space).

> and then try to pass a subsection of an array to a function (which
> then treats the subsection as a complete array).

Ah, that's easy:

    int foo[10];
    bar(foo+5);

No "address of" required; foo is an array object, +n is the operator to create an array subsection. If you want to change the end as well, cast:

    bar((int[3])(foo+5));

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
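For comparison with what standard C can already do (a minimal sketch; the names pair, parse, and sum are invented): returning a small struct by value is the closest C gets to the tuple return above without handing out addresses, and foo+5 works today for passing an array tail, though the callee must be told the length separately since plain C lacks the (int[3]) section cast proposed above.

    #include <stdio.h>

    /* Multiple return values without address-of: bundle them in a
       struct and return it by value. */
    struct pair { int a; float b; };

    static struct pair parse(void) {
        struct pair p = { 42, 3.14f };
        return p;
    }

    /* A subsection of an array is passed as pointer + length. */
    static int sum(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    int main(void) {
        struct pair p = parse();
        printf("%d %g\n", p.a, p.b);

        int foo[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        printf("%d\n", sum(foo + 5, 3));  /* elements 5..7 */
        return 0;
    }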
From: Robert Myers on 1 Jan 2010 18:07

On Jan 1, 1:59 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
> Robert Myers wrote:
> > On Dec 31, 8:30 am, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>
> >>Any problem that is not I/O bound that can be solved using a
> >>non-imperative language can be made to work faster in an imperative
> >>language with the appropriate library (i.e. if you can write it in ML,
> >>it can be re-written in C-as-high-level-assembly, and it will perform
> >>better).
>
> > If your original claim is correct even for one machine (and the truth
> > of your claim is not obvious to me, given the lengthy discussion here
> > about memory models),
>
> Its not necessarily simple, but it is definitely do-able. One of the
> economic costs is being able to find the programmer(s) who can pull this
> off.

The closest thing to a proof I can examine that such programmers even exist, and even then only to some approximation, is the Linux kernel. As far as I'm concerned, Windows and nearly every bit of software that runs on it is a glaring counterexample. I assume the industrial-strength commercial *ixes might also be examples, but I don't (for the most part) use them, and I can't examine the source.

I doubt if operating systems will ever be written in an elegant, transparent, programmer-friendly language. For that universe (and maybe some OS-like systems like database software) your advice is, at least in some practical sense, probably correct.

For the kind of scientific computing with which I have the most experience, one naive calculation shows the amount of physics you can do scaling as N^(1/4) (three space dimensions plus time, leaving out lots of details). If your computer is 4x faster, then you can do about 40% more physics (4^(1/4) is roughly 1.41) in the same amount of time, but with a 4x expenditure of energy (and investment in hardware, but that's money into the pockets of hardware mfrs, which is OK by me). The bomb labs, of course, don't have to fuss with that sort of pedestrian consideration, but I assume that much of the rest of the world does.

You are, I think, examining a universe in which the payoff for performance is linear or even possibly better. I'm generally looking at problems where the payoff for performance increases is marginal. On the other hand, energy and computers are expensive, but scientists and scientific programmers are also expensive, and the costs associated with non-transparency and non-portability are very high. Actually, I'd say those costs are unacceptable, but the world of science has not yet advanced to my level of thinking. ;-)

People are perfectly happy to look at the end result of computations, taking it largely on faith that the computations are correct or even make sense, just so long as they fit "data" or prevailing prejudices. I don't know why people bother with models that ostensibly mimic physics. In the old days, people just fit curves, and I'm not sure how far beyond curve-fitting we have actually advanced for pure science. I'm sure the picture doesn't look nearly as dismal for some kinds of applications. Some engineering applications make good use of large-scale computation. One aerodynamicist I talked to who used CFD as a black box said he was convinced there was a bug in a program that was widely relied upon for aerodynamics. Even there, the successes may be substantially delusional/luck.

> > does it then follow that it is also possible for
> > a second, or a third machine? If the cost of rewrite is bounded for
> > one machine (that is to say that you eventually succeeded in getting
> > the program to work - at least you *think* you got it to work
> > correctly), it is bounded for a second or a third machine?
>
> Yes - generally, to get it done for one machine, one will have to have
> identified all sharing/communication between the parallel entities. This
> is work you don't have to do again.
>
> There is a caveat - if the code (algorithms, task partitioning) has to
> be reworked because of large differences between the systems, then much
> less code/learning carries over.

But, for some applications, the costs may simply be unacceptable. You can't invest the money to duplicate results? Too bad, then, I guess you'll have to accept my results at face value, and, of course, I'm the only one who will ever get support for working this problem, because it's too expensive to move it anywhere. Good deal.

> > I'll accept that Nick is an expert at finding and fixing weird memory
> > problems. From reading his posts, I'm convinced that the memory model
> > of C is not well-defined or even agreed-upon in any practical sense.
>
> Which is why it's C-as-high-level-assembly, not C-as-an-ANSI-standard,
> that will be used for getting the job done. Actually, that's not strictly
> true - you have to partition stuff into things that are isolated to one
> process/CPU and code dealing with the sharing of state/synchronization
> between processes. The isolated case can generally be written in vanilla
> C (or any other language), while the sharing case has to be written
> non-portably - perhaps actually in assembly. Hopefully, a lot of the
> shared code can be hidden behind macros, function calls or code
> generators to isolate the behavior.

Does this "C-as-high-level-assembly" compiler exist?

> > Thus, not only are you worried about whether your program will work
> > with a second machine, but even with a second compiler with a
> > different notion of critical details of C.
>
> Again: you don't rely on the compiler to get those details correct. You
> have to ensure that it is correct. This may mean constraining the
> compiler in ways that make it no more than a high-level assembler, or
> using inline assembly, or even straight assembly where necessary.

Does an appropriately-constrained compiler exist? People seem to want to add features, not remove them.

> The problem is that unless you've done it, you don't know where the
> friction points are, and you assume that it's too difficult. It isn't -
> it's just complicated engineering. I can think of lots of systems codes
> which are in many ways more complicated.

For the kinds of problems you are most accustomed to thinking about, perhaps.

> > From an economic point of view, the proposed trade doesn't even seem
> > rational: trade a program with transparently correct memory semantics
> > (perhaps because there are no memory semantics to be incorrect) for
> > one that is faster but may or may not do the same thing under some
> > limited set of circumstances.
>
> Generally tasks that are not trivially parallelizable/distributable and
> are not I/O bound are parallelized because the performance is inadequate.
> If the performance is inadequate, it may be because we don't have the
> best serial implementation, or because the best serial implementation is
> itself not sufficient.
>
> What is the slowdown between approach X (for your favorite value of X)
> and the best serial implementation? This slowdown matters - a lot.
> If, for instance, the slowdown is 4x, does that mean that we will end up
> with identical performance using 4-way parallelism? Probably not - the
> parallel inefficiencies will probably mean that we break even at 6x-8x.
>
> So: is it more economic to focus on writing a serial imperative program
> or a parallel approach-X program?
>
> How about the case where we're going to *have to* parallelize - when even
> the best-case serial program is just too slow? In that case, both
> the imperative approach and the alternative(s) will have to be parallel.
> What are the inefficiencies here?
>
> The hierarchical nature of communication between processors means that a
> 4-way parallel machine will have better communication properties than a
> 16-way parallel machine, which in turn will be better than a 64-way and
> so on. This means that if we can fit an imperative parallel program onto
> a 4-way, and approach X is 4x slower, then approach X will be forced
> onto a 16-way. But since it is now one level down the communication
> hierarchy, it is quite possible that it will be even slower, requiring,
> say, a 32-way machine to be competitive.

The scientists I know generally want to speed things up because they are in a hurry. The question is: is it better to do a bit less physics and/or let the machine run longer, or is it better to use up expensive scientist/scientific-programmer time and, at the same time, make the code opaque and not easily transportable?

> Also, in some programs, it is easy to extract a small amount of (task)
> parallelism, but it is not possible to extract large (or unbounded)
> parallelism.

If we can't do "unbounded" ("scalable") parallelism, then there is an end of the road as far as some kinds of science are concerned, and we may already be close to it, or even there, in terms of massive parallelism (geophysical fluid dynamics would be an example). The notion that current solutions "scale" is pure bureaucratic fraud. Manufacturers who want to keep selling more of the same (do you know any?) cooperate in this fraud, since the important thing is what the customer thinks.

> It is possible that we have access to an N-way machine, there is N-way
> parallelism available in the program, the N-way solution using
> approach X is fast enough, and we prioritize the advantages of using
> approach X (time-to-market, programmer availability, etc.) over the
> baseline, highest-performance approach. In that case, we are free to
> speculate about the various alternative programming approaches.

Which is mostly the kind of problem I am familiar with. Within a constrained universe, your advice seems eminently sensible.

My bitter observation (and maybe Nick will agree) is that the world has come to be dominated by a language (C) that is best suited for writing operating systems, while most of us never have such a need.

Robert.
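One way to read the 6x-8x break-even figure quoted above (a reconstruction under assumed efficiency numbers, not anything stated in the post): if approach X is 4x slower than the best serial program and parallel efficiency on p processors is e(p), break-even requires

    p * e(p) = 4

so plausible efficiencies of e = 1/2 to 2/3 at that scale give p = 6 to 8, matching the quoted range.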
From: Del Cecchi on 1 Jan 2010 19:19

"Mike" <mike(a)mike.net> wrote in message news:v_qdnUeuT-97zKPWnZ2dnUVZ_hadnZ2d(a)earthlink.com...
>
> "Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote in message
> news:4B3E4928.7060703(a)patten-glew.net...
> | nmm1(a)cam.ac.uk wrote:
> | > C99 allows you to encrypt addresses and/or save them on disk.
> | > Seriously.
> |
> | Which is, seriously, a good idea. For certain classes of applications.
> | Such as when you want to persist a data structure to disk, that you
> | will later load into exactly the same machine, at the same locations.
> | Like in a phone switch.
> |
> | However, for 99% of the jobs we need to do, not such a good idea.
> |
> | Except... you can do stupid persistence packages for single threaded
> | machines, on OSes that guarantee that data is always allocated at the
> | same address. Ditto simplistic checkpoint recovery schemes.
> |
> | So I guess that it is not all that stupid for those apps.
> |
> | But it sure does get in the way for alias analysis.
>
> The IBM System i (not single threaded) places the file system in a
> single virtual address space in which all objects have a single
> constant virtual location which is never reassigned. That may provide
> a lead to a practical approach.

Back in the day it used to be said that System i (OS/400, S/38) didn't really have a file system, since it had a very large virtual address space in which objects were located. But I was a hardware guy and didn't really get the details.

del
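The "data always allocated at the same address" persistence trick Glew describes above can be sketched with POSIX mmap (a toy, not any real package; the fixed address is arbitrary, and MAP_FIXED will clobber whatever is already mapped there, which is exactly why this only works under the cooperative conditions he lists):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION ((void *)0x40000000)   /* same address every run */
    #define SIZE   4096

    struct node { struct node *next; int value; };

    int main(void) {
        int fd = open("heap.img", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, SIZE) < 0) return 1;

        /* Map the file at a fixed virtual address so the raw pointers
           stored inside it remain valid across runs. */
        struct node *heap = mmap(REGION, SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_FIXED, fd, 0);
        if (heap == MAP_FAILED) return 1;

        if (heap[0].next == NULL) {       /* fresh zeroed file: build a list */
            heap[0].value = 1;
            heap[1].value = 2;
            heap[0].next = &heap[1];      /* raw pointer persisted to disk */
        }
        for (struct node *n = &heap[0]; n; n = n->next)
            printf("%d\n", n->value);     /* works again on the next run */

        msync(heap, SIZE, MS_SYNC);
        return 0;
    }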
From: Bill Todd on 1 Jan 2010 19:57
Del Cecchi wrote:
> "Mike" <mike(a)mike.net> wrote in message
> news:v_qdnUeuT-97zKPWnZ2dnUVZ_hadnZ2d(a)earthlink.com...
...
>> The IBM System i (not single threaded) places the file system in a
>> single virtual address space in which all objects have a single
>> constant virtual location which is never reassigned. That may provide
>> a lead to a practical approach.
>
> Back in the day it used to be said that System i (OS/400, S/38) didn't
> really have a file system, since it had a very large virtual address
> space in which objects were located.

Well, sort of - at least in the sense that it didn't have a file system that was exposed to applications. But it must have had something resembling a file system internally if it allowed objects to grow: despite the fact that it had (for the time) an effectively infinite virtual address space into which to map them, it had decidedly finite physical storage space on disk in which to hold them. Hence it needed a mechanism to map an arbitrarily large expandable object onto multiple separate areas on disk while preserving its virtual contiguity (and it likely also required a means to instantiate new objects too large to fit into any existing physically contiguous area of free space).

The normal way a file system (just like almost everyone else) supports movable/expandable objects with unvarying addresses is via indirection, substituting the unvarying address of a small pointer for that of an awkwardly large and/or variable-size object. That unvarying address need not be physical, of course - e.g., the i-series may have hashed the constant virtual address to a chain address and then walked the chain entries until it found one stamped with the desired target virtual address.

But it's not clear how applicable this kind of solution would be to the broader subject under discussion here.

- bill
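A sketch of the indirection described above - hash the constant virtual address to a chain, walk the chain for the entry stamped with that address, and keep the object's current location in the entry (an invented toy, not the actual OS/400 mechanism; all names made up):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBUCKETS 256

    struct entry {
        uint64_t vaddr;      /* unvarying virtual address (the stamp) */
        void *location;      /* where the object currently lives */
        size_t size;
        struct entry *next;  /* hash chain */
    };

    static struct entry *buckets[NBUCKETS];

    static unsigned hash(uint64_t vaddr) {
        return (unsigned)((vaddr >> 4) % NBUCKETS);
    }

    static struct entry *find(uint64_t vaddr) {
        for (struct entry *e = buckets[hash(vaddr)]; e; e = e->next)
            if (e->vaddr == vaddr)   /* walk chain for matching stamp */
                return e;
        return NULL;
    }

    static void create(uint64_t vaddr, size_t size) {
        struct entry *e = malloc(sizeof *e);
        e->vaddr = vaddr;
        e->location = malloc(size);
        e->size = size;
        e->next = buckets[hash(vaddr)];
        buckets[hash(vaddr)] = e;
    }

    /* Growing may move the object, but its virtual address - and thus
       every reference held by applications - never changes. */
    static void grow(uint64_t vaddr, size_t newsize) {
        struct entry *e = find(vaddr);
        e->location = realloc(e->location, newsize);
        e->size = newsize;
    }

    int main(void) {
        create(0x1000, 64);
        uintptr_t before = (uintptr_t)find(0x1000)->location;
        grow(0x1000, 1 << 20);            /* object may relocate */
        uintptr_t after = (uintptr_t)find(0x1000)->location;
        printf("moved: %s; address 0x1000 still valid\n",
               before == after ? "no" : "yes");
        return 0;
    }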