From: Robert Myers on 31 Dec 2009 14:38

On Dec 31, 8:30 am, Mayan Moudgill <ma...(a)bestweb.net> wrote:
> Any problem that is not I/O bound that can be solved using a
> non-imperative language can be made to work faster in an imperative
> language with the appropriate library (i.e. if you can write it in ML,
> it can be re-written in C-as-high-level-assembly, and it will perform
> better).
>
> The questions that arise are:
> - how much speedup does the C version get?
> - how much more effort is it to write the C version?
> - how many machines will this version need to support, and how much
>   rewrite is required to support all these machines?

If your original claim is correct even for one machine (and the truth of your claim is not obvious to me, given the lengthy discussion here about memory models), does it then follow that it is also possible for a second, or a third, machine? If the cost of the rewrite is bounded for one machine (that is to say, you eventually succeeded in getting the program to work correctly--at least you *think* you got it to work correctly), is it bounded for a second or a third machine?

I'll accept that Nick is an expert at finding and fixing weird memory problems. From reading his posts, I'm convinced that the memory model of C is not well-defined or even agreed upon in any practical sense. Thus, you have to worry not only about whether your program will work on a second machine, but even with a second compiler that has a different notion of critical details of C. So, even if you could do such a rewrite once, whether or not you had a correct translation in any other circumstance would always be an open question.

From an economic point of view, the proposed trade doesn't even seem rational: trade a program with transparently correct memory semantics (perhaps because there are no memory semantics to be incorrect) for one that is faster but may or may not do the same thing under some limited set of circumstances.

From the POV of scientific computing, where reproducibility is essential, the idea that you're never quite sure whether the software will work correctly on two different machines or under two different compilers seems completely unacceptable. Perhaps, on the other hand, because the world of finance is essentially a casino, it's a plausible proposal for a manufacturer of business machines.

Robert.
From: Gavin Scott on 31 Dec 2009 18:14

"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:
> But that's PERFORMANCE tuning.

Speaking of the P word, offhand I can only think of one interesting front in the battle for performance these days: that of executing JavaScript inside web browsers.

There are at least three groups actively competing to produce the fastest JavaScript execution engine, based on the idea that this is going to be a key part of success in the upcoming "web os" age.

It might be interesting to look at some of that work and their specific challenges as a counterpoint to the usual "we can't do it because C doesn't let us" arguments. It's a different language and environment which is having real money and brainpower thrown at it.

G.
From: Bernd Paysan on 31 Dec 2009 20:35

Gavin Scott wrote:
> There are at least three groups actively competing to produce the
> fastest JavaScript execution engine, based on the idea that this
> is going to be a key part of success in the upcoming "web os" age.
>
> It might be interesting to look at some of that work and their
> specific challenges as a counterpoint to the usual "we can't do
> it because C doesn't let us" arguments. It's a different language
> and environment which is having real money and brainpower thrown
> at it.

The tragedy is that JavaScript was never designed for that job; it just happened. The consequence of all these accidental pieces growing together as the "next OS" is a nightmare. We already have enough bad operating systems, bad programming languages, bad protocols, and bad user interfaces, but apparently the direction is straight downhill.

Fefe recently gave a talk about bad APIs at the 26C3; if you understand German, you can see the talk on CCC-TV:

http://media.ccc.de/browse/congress/2009/26c3-3691-de-vier_fuste_fr_ein_halleluja.html

JavaScript is not even part of that presentation; he got lost in the layers below ;-).

Making JavaScript fast is actually not really a challenge. After all, JavaScript is a Lisp derivative in disguise, so the compiler technology has to be a mix between a fast Lisp-like compiler (OCaml or so) and Forth (the latter because compiler performance matters, and it has to be an incremental compiler). And in fact that's what the people above do (at least two use Forth as part of their JavaScript engine).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: "Andy "Krazy" Glew" on 31 Dec 2009 23:42 nmm1(a)cam.ac.uk wrote: > In article <4B3BFDF9.2070607(a)patten-glew.net>, > Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote: > >> Herlihy's non-blocking lock free algorithms using compare-and-swap >> depend on there being a single pointer that dominates the data structure >> you are changing. > > One of the many reasons that so many people feel that a double form > of compare-and-swap is essential. I've read Greenwald's thesis, and I am sympathetic. In fact, I have proposed implementations for DCAS, twice at Intel and once at AMD. However 1) It doesn't meet all needs - witness the folks who ask for KCSS (K-way Compare and Swap Synchronization) and Transactional Memory with arbitrary numbers of locations involved. 2) It doesn't seem to be necessary in a theoretical sense: "DCAS is not a silver bullet for nonblocking algorithm design". 16th annual ACM symposium on Parallelism in algorithms and architectures, 2004, pp. 216�224. Doherty et al, inter alia Mark Moir and Guy Steele. http://en.wikipedia.org/wiki/Double_compare-and-swap: "More recently, however, it has been shown that an STM can be implemented with comparable properties using only CAS." >> Whereas with a subset of linked datastructure that does not have a >> single dominator, you can't necessarily copy back in a changed >> datastructure, with nodes that have been added or deleted. > > No. You have to add other constraints on what can be done. Not > impossible, but not easy, either. That's the clincher. Algorithms and datastructures that are efficient non-parallel are not necssarily efficient in parallel. In a non-parallel code you can operate on an arbitrary sub-mesh of a linked datastructure, with arbitrary numbers of links in and out. You can operate in place, replace it, etc. Whereas in parallel you either need complicated locking. Or, if you want to do it lock-free, you have to change the datastructure to have only a single pointer entry point, or maybe 2 with DCAS. Or maybe K with KCSS. Or maybe ... well, TM (Transactional Memory) doesn't really give you an arbitrary number of locations. Maybe an L1 cache full of locations. Maybe. (OK, some forms of TM fall back to locking, e.g. a single global lock, when the L1 cache or whatever structure they are tracking the transaction footprint in overflows. De facto, that means you don't want to go larger than some fraction of the L1 cache, reduced by a factor related to the load.) (Which would you rather have: a TM implementation that is limited to an L1 cache worth of data lines, for *all* of the data being manipulated? Or a KCSS where K was an L1 cache - where you could use KCSS to handle all of the links into and out of the sub-mesh, but where you could access an arbitrary amount of data within the submesh? === The thing I don't like about DCAS or KCSS or TM is that most of the hardware proposals depend on having a cache coherence protocol. As explained in other posts, of late I have been thinking in terms of supercomputers without a cache coherence protocol. Whereas a single location CAS (of any size up to the interleave or bank width) can be implemented in a machine with no caches - by letting the memory interface implement it. DCAS or KCSS or TM, in a system without caches, requires the ability to lock memory locations. This can be arbitrarily sized if done via a memory tag bit, e.g. stolen from ECC. But many people want to avoid that, which introduces an arbitrary limit on the number of locations involved. 
There're always the deadlock issues. Those can be resolved, e.g. by physical address sorting. (The unwillingness to provide hardware to do the physical address sorting efficiently is why DCAS is not in the x86 instruction set.)
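(In software terms the address-sorting trick is just the familiar lock-ordering discipline. A rough sketch, with ordinary pthread mutexes standing in for whatever locations the hardware would lock - an illustration of the ordering rule, not of how hardware would implement it:)

    #include <pthread.h>
    #include <stdint.h>

    /* Deadlock avoidance by address ordering: when two locks must be
       held at once (e.g. a software emulation of DCAS), always acquire
       the one at the lower address first, so no two threads can each
       hold one lock while waiting for the other. */
    void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        if ((uintptr_t)a > (uintptr_t)b) {
            pthread_mutex_t *t = a; a = b; b = t;
        }
        pthread_mutex_lock(a);
        if (b != a)
            pthread_mutex_lock(b);
    }

    void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        if (b != a)
            pthread_mutex_unlock(b);
        pthread_mutex_unlock(a);
    }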
From: "Andy "Krazy" Glew" on 31 Dec 2009 23:45
Gavin Scott wrote:
> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:
>> But that's PERFORMANCE tuning.
>
> Speaking of the P word, offhand I can only think of one
> interesting front in the battle for performance these days,
> that of executing JavaScript inside web browsers.
>
> There are at least three groups actively competing to produce the
> fastest JavaScript execution engine, based on the idea that this
> is going to be a key part of success in the upcoming "web os" age.
>
> It might be interesting to look at some of that work and their
> specific challenges as a counterpoint to the usual "we can't do
> it because C doesn't let us" arguments. It's a different language
> and environment which is having real money and brainpower thrown
> at it.

Care to share pointers to the three leading JavaScript groups? Plus maybe add some commentary? Or are you going to make us google it ourselves? :-(