From: Roger Ivie on 26 Apr 2010 15:40

On 2010-04-26, Quadibloc <jsavard(a)ecn.ab.ca> wrote:
> On Apr 26, 7:03 am, Bernd Paysan <bernd.pay...(a)gmx.de> wrote:
>
>> I remember that the HP Fortran compiler compiled a hand-optimized matrix
>> multiplication whenever it found something resembling a matrix
>> multiplication (more than 15 years ago), and I'm quite ok with that
>> approach.
>
> Well, I'm not. Not because it's cheating on benchmarks. But because it
> should only replace Fortran code with a routine that performs a matrix
> multiplication if, in fact, what it found *really is* a matrix
> multiplication.

I have actually been fighting just this sort of battle on an Itanium
machine. Not specifically matrix multiplication, but...

In my situation, we're doing real-time work on an Itanium VMS box using
FORTRAN code that's been around since VAX-11/750s roamed the earth. Since
the only thing these particular boxes do is run our application, we do
things like create global sections to specific hardware addresses to
allow our FORTRAN code to get at the registers.

Had a bit of trouble a while ago with a Bit3 PCI-to-VMEbus adapter: code
that creates mappings on the VMEbus was unable to map more than one
region. The code worked by walking the scatter/gather map looking for an
unused region in which the mapping could be performed. In this specific
case, that means walking through an array of longwords looking for an
entry that has bit 0 clear. The array is declared INTEGER*4, VOLATILE
(although the code is FORTRAN and has been around since the /750, that
doesn't mean it hasn't had a few facelifts over the years).

The Itanium compiler noticed that I was only looking at bit 0, so it
performed *byte* fetches from the scatter/gather map. The Bit3 hardware
doesn't support byte accesses to the map; I suspect it uses the size of
an access to decide between the CSR data path (bytes only) and the
scatter/gather map data path (longwords).
As a result, I was seeing 0x0f (an unaddressed byte CSR) always returned
for the first map register, resulting in my code always believing the
first register was available.

Similarly, once the code has allocated a chunk of map registers, it
clears them to mark them as in use. This involves walking through the
array, plunking a zero in each longword. The compiler noticed that this
was a block clear and replaced my code with an unrolled block-clear loop
that did either byte or word clears, depending on alignment.

Furthermore, the compiler saw through all my simple-minded attempts to
trick it, and compiling /NOOPTIMIZE didn't fix the first problem, which
involved using a byte fetch to snag only the "interesting" portion of a
longword. I wound up having to do map accesses through a function
similar to this:

      integer*4 function peek( address )
      implicit none
      integer*4, volatile :: address
      peek = address
      end

But this makes me worry about *all* of the other CSR accesses in the
system, *especially* those that go through the scatter/gather map to a
bus that has another byte order. Using the *one* byte-swapping mode that
makes stuff at the other end of the system look enough like memory to
tolerate changes in access size moves the bits around.
-- 
roger ivie
rivie(a)ridgenet.net
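The same discipline can be sketched in C (a hypothetical analogue of the
VMS FORTRAN above, not the original code): funnel every map access
through helpers that take a volatile pointer of the full register width,
so the compiler has no licence to narrow a read to a byte fetch or
rewrite a clearing loop as byte stores. `MAP_ENTRIES`, `peek32`, and
`poke32` are invented names.

```c
#include <stdint.h>

/* Assumed size of the scatter/gather map, for illustration. */
#define MAP_ENTRIES 64

/* Read a map register with a single 32-bit load.  Mainstream compilers
 * perform a volatile access at the width of the lvalue's type, which is
 * what hardware like the Bit3 map requires. */
static inline uint32_t peek32(volatile uint32_t *reg)
{
    return *reg;
}

/* Write a map register with a single 32-bit store. */
static inline void poke32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* Find the first map entry with bit 0 clear (unused), or -1 if none. */
int find_free_entry(volatile uint32_t *map)
{
    for (int i = 0; i < MAP_ENTRIES; i++)
        if ((peek32(&map[i]) & 1u) == 0)
            return i;
    return -1;
}

/* Mark 'count' entries starting at 'first' as in use by clearing them,
 * one full longword at a time. */
void clear_entries(volatile uint32_t *map, int first, int count)
{
    for (int i = first; i < first + count; i++)
        poke32(&map[i], 0);
}
```

Strictly speaking, C's volatile doesn't nail down access width either;
it only forbids eliding or merging accesses. For hardware this fussy,
one would still want to inspect the generated code, as Roger had to.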
From: Rick Jones on 26 Apr 2010 17:57

Anne & Lynn Wheeler <lynn(a)garlic.com> wrote:
> HP: last Itanium man standing
> http://www.theregister.co.uk/2010/04/26/itanium_hp_last_standing/
>
> from above:
> Make no mistake: If Hewlett-Packard had not coerced chip maker Intel
> into making Itanium into something it never should have been, the
> point we have come to in the history of the server business would
> have got here a hell of a lot sooner than it has. But the flip side
> is that a whole slew of chip innovation outside of Intel might never
> have happened.

I read the article, but clearly not closely enough - what was it that HP
coerced Intel into making Itanium into that it never should have been?

Also, the bit about emulators is a little off:

"(The Itanium chips had an x86 emulator, you will remember, and also
emulated some PA-RISC instructions that HP-UX needed)"

I have never been a HW guy, but I don't recall there being any sort of
PA-RISC instruction emulation in the Itanium chips. There is the Aries
PA-RISC emulator *SW* available with HP-UX to allow customers to run
PA-RISC binaries.

rick jones
-- 
The glass is neither half-empty nor half-full. The glass has a leak. The
real question is "Can it be patched?"
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: MitchAlsup on 26 Apr 2010 18:43

On Apr 26, 12:43 am, Brett Davis <gg...(a)yahoo.com> wrote:
> In article
> <b24c8bb2-fcc3-4f4a-aa0d-0d18601b0...(a)11g2000yqr.googlegroups.com>,
> MitchAlsup <MitchAl...(a)aol.com> wrote:
> > I think there are a number of semi-fundamental issues to be resolved;
>
> > The realization that "one can synchronize" a hundred thousand threads
> > running in a system the size of a basketball court
>
> ATI chips already have ~2000 processors, simple scaling over the next
> decade states that the monitor in your iMac a decade from now will
> have 100,000 CPUs. Which means that a desktop server will have a
> million CPUs. One for each 10 pixels on your monitor.

These ATI chips are the size of a basketball court? I suspect you mean
pipelines or pipeline stages in a single chip.

The problem I was alluding to was one of size versus speed (equivalent
to time and distance): where the system is massively bigger (>1000X)
than the distance a signal can cover in one pipeline clock, and where
the notions that "nothing can happen simultaneously" and "everyone can
agree on exactly what time it is" become unresolvable.

BTW a basketball court is about the size of some really large
supercomputer systems, so I was not talking about systems-in-the-small
with those assertions.

Mitch
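Mitch's time-and-distance point is easy to quantify. A minimal sketch in
C, with the court length, signal propagation speed, and clock rate all
assumed for illustration (none of these figures come from the post):

```c
/* One-way signal-crossing time across a span, expressed in pipeline
 * clocks.  signal_frac is the fraction of c at which signals travel. */
double crossing_clocks(double span_m, double signal_frac, double clock_hz)
{
    const double c = 3.0e8;  /* speed of light, m/s */
    return span_m / (c * signal_frac) * clock_hz;
}

/* With an assumed ~28 m court, signals at ~0.5c, and a 3 GHz pipeline
 * clock, crossing_clocks(28.0, 0.5, 3.0e9) comes to roughly 560 clocks
 * one way, so a round trip across the machine costs over a thousand
 * clocks - consistent with the >1000X figure above. */
```

At that scale, any scheme that assumes a shared "now" across the whole
machine is paying hundreds of clocks per agreement, which is the wall
being described.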
From: MitchAlsup on 26 Apr 2010 18:56

On Apr 26, 12:22 pm, Robert Myers <rbmyers...(a)gmail.com> wrote:
> On Apr 25, 10:15 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
>
> > Perhaps along with the notion of the "Memory Wall" and the "Power
> > Wall" we have (or are about to) run into the "Multi-Processing" Wall.
> > That is, we think we understand the problem of getting applications
> > and their necessary data and disk structures parallel-enough and
> > distributed-enough. And we remain under the impression that we are
> > "expression limited" in applying our techniques to the machines that
> > have been built; but in reality we are limited by something entirely
> > more fundamental, and one we do not yet grasp or cannot yet enumerate.
>
> A misbegotten imaginary generalization of the Turing machine is at the
> root of all this, along with a misbegotten conception of how
> intelligence works.
>
> One of these days, we'll recognize a Turing machine as an interesting
> first step, but ultimately a dead end. Along with it, we'll
> eventually grasp that the entire notion of "programming" is a very
> limiting concept. Eventually, the idea of a "programmer", as we now
> conceive it, will seem preposterously dated and strange.

I, personally, blame the von Neumann programming model. But it is so
intimately intertwined with the Turing Machine fundamentals that little
is bought by drawing a distinction between them. Blame apart, though, I
entirely agree with you.

> Nature has evolved very sophisticated ways of coding the design for an
> organism that will interact with an environment with certain expected
> characteristics to evolve into a very sophisticated mature organism
> that it is hard to believe arose from such compact code--and it
> didn't. It evolved from that compact code through interaction with an
> appropriate environment, from which it "learned."
One could say the same about LISP programs... or the proponents of LISP
programs {LISP = all languages derived from the notions first
established by LISP}.

<snip>

> Does any of this have to do with hardware? I think it does. So long
> as processes are so limited and clumsy in the way they communicate,
> we'll wind up with machines that are at best an outer product of
> Turing machines.

It is not just communications; it's the fundamental nature of one step
(instruction) at a time that must die to break out of the Turing/von
Neumann bottleneck.

Parallelism in the memory interconnect (the communications mechanism) is
entirely stifled by "memory models" and "cache coherence". Parallelism
from system to system (the communication mechanisms) is entirely stifled
by physical distance (latency), data rate (BW), and the notion that I/O
is too dangerous for user-level programs to manage, therefore the OS
(device drivers) needs to do it all - and this requires
synchronization(s) on the scale of "threads in a box".

Mitch
From: Robert Myers on 26 Apr 2010 20:15
Quadibloc wrote:
> On Apr 26, 11:22 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
>
>> One of these days, we'll figure out how to mimic that magic.
>
> Well, "genetic algorithms" are already used to solve certain types of
> problem.

Genetic programming is only one possible model.

The current programming model is to tell the computer in detail what to
do. The proposed paradigm is to shift from explicitly telling the
computer what to do to telling the computer what you want and letting it
figure out the details of how to go about it, with appropriate
environmental feedback, which could include human intervention.

This changes what is now programming into more of a systems engineering
problem, since it is unlikely that, in the foreseeable future, computers
will be able to write "programs" without significant help, and telling
the computer what to focus on will remain the province of the human
user.

The envisioned outcome is a way of using computers that is less brittle,
that is less sensitive to the kind of timing issues Mitch has
identified, that is naturally parallel, and that will produce more
reusable "software."

Robert.