From: Chris Gray on 2 May 2010 16:50

"nedbrek" <nedbrek(a)yahoo.com> writes:

> I'm particularly interested in parallel linking.

I wrote the original Myrias linker/librarian. One of my goals was to learn how to use the Myrias "pardo" model on that kind of application, so the linker was a parallel application. Normally it ran on a workstation, where the pardo loops serialized. We ran it on the actual hardware as a bit of a test, and it was basically I/O bound.

One of the tools that it needed, which I may have described here before, was the set of new system I/O calls we added. In particular, the linker needed "seekread", which was an atomic seek/read combination. The linker also needed "tellwrite", which was an atomic write to the current end of the file and returned the file position at which the write occurred. Those calls allowed parallel I/O without confusion.

Those, in combination with a new object file format (officially known as SCOFF - Super Computer Object File Format, but which was really "Stuart and Chris's Object File Format"), allowed me to write the linker. The object file format started with a magic number and then a pointer to the directory. The directory was at the end of the file, written after all of the code/data/whatever sections had been written out in parallel using tellwrite.

I *think* the file format also represented each function separately, so that functions could be linked as separate entities. That problem has always bugged me about things like ELF, which presumably was based on the way the original C compiler translated each source file into a single large assembly file that was then assembled as one indivisible blob. I think Tera was one of several projects which worked around that.

I don't have the code here, and it was a long time ago, so I don't remember much about how it worked internally, but I believe there were 2 or 3 pardos in the code. After I had finished with it, the new compiler group decided to break the linker/librarian into two separate programs - I don't recall why. So, even if I could get hold of the latest version of it, I likely wouldn't be very familiar with the code.

Basically, the parallelism was over the input files, and then over the functions and data sections within them. I believe there was an outer loop to iterate over what was learned about unresolved symbols (you don't want the parallel tasks to grab those things themselves, else you can end up with many copies of them). We did not use the Myrias memory semantics - the child tasks simply allocated what they needed, possibly reading code/data into it, and that memory was then given to the main task as the result of the child's work.

--
Experience should guide us, not rule us.

Chris Gray     cg(a)GraySage.COM
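[A rough sketch of the semantics being described, in C, using standard POSIX calls. This is an illustration of the idea only, not the Myrias implementation: the signatures, the locking, and the dir_entry layout are assumptions, and pread() already happens to provide the atomic seek+read behaviour.]

    /* "seekread": pread() is an atomic seek+read, so concurrent tasks can
       pull sections from an input file without sharing a file offset. */
    #include <unistd.h>
    #include <pthread.h>
    #include <sys/types.h>

    ssize_t seekread(int fd, void *buf, size_t len, off_t pos)
    {
        return pread(fd, buf, len, pos);
    }

    /* "tellwrite": append to the current end of file and return the position
       the data landed at.  POSIX has no single call for this, so this sketch
       serializes the end-of-file probe and the write under a lock. */
    static pthread_mutex_t eof_lock = PTHREAD_MUTEX_INITIALIZER;

    off_t tellwrite(int fd, const void *buf, size_t len)
    {
        pthread_mutex_lock(&eof_lock);
        off_t pos = lseek(fd, 0, SEEK_END);    /* where this write will land */
        ssize_t n = pwrite(fd, buf, len, pos);
        pthread_mutex_unlock(&eof_lock);
        return n < 0 ? (off_t)-1 : pos;        /* caller records pos */
    }

    /* Hypothetical directory entry: each parallel task appends its section
       with tellwrite() and keeps the returned offset; the main task writes
       the directory at the end and patches the header's directory pointer. */
    struct dir_entry { char name[32]; off_t offset; size_t size; };

With calls like these, no task ever depends on a shared file position, which is what makes the write-sections-in-parallel, directory-last layout work.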
From: Paul Wallich on 3 May 2010 12:06

nedbrek wrote:
> Hello all,
>
> <nmm1(a)cam.ac.uk> wrote in message
> news:hre2p7$3nf$1(a)smaug.linux.pwf.cam.ac.uk...
>> That being said, MOST of the problem IS only that people are very
>> reluctant to change.  We could parallelise ten or a hundred times
>> as many tasks as we do before we hit the really intractable cases.
>
> I'm curious what sort of problems these are?  My day-to-day tasks are:
> 1) Compiling (parallel)
> 2) Linking (serial)
> 3) Running a Tcl interpreter (serial)
> 4) Simulating microarchitectures (serial, but I might be able to run
>    multiple simulations at once, given enough RAM).

I know I'm not well-versed here, but isn't simulating microarchitectures at least small-n parallel?
From: MitchAlsup on 3 May 2010 18:56

On May 3, 11:06 am, Paul Wallich <p...(a)panix.com> wrote:
> I know I'm not well-versed here, but isn't simulating microarchitectures
> at least small-n parallel?

Where n is at least pipe-length, and might be as big as
pipe-length*SuperScalarity + cache-hierarchy + memory-system.

That is, n approaches 32-64 easily blocked-off units of work.

Mitch
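[A rough worked example with assumed numbers, just to make the formula concrete: a 10-stage pipeline modeled 4-wide gives 10 * 4 = 40 stage/slot units, and adding a few cache levels plus a memory-system model pushes the count into the 32-64 range quoted above - each a fairly separable chunk of per-cycle work.]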
From: nedbrek on 4 May 2010 07:21
Hello all,

"MitchAlsup" <MitchAlsup(a)aol.com> wrote in message
news:db84f8dd-54e1-4a9a-9f8c-29768da1e9be(a)d19g2000yqf.googlegroups.com...
> On May 3, 11:06 am, Paul Wallich <p...(a)panix.com> wrote:
>> I know I'm not well-versed here, but isn't simulating microarchitectures
>> at least small-n parallel?
>
> Where n is at least pipe-length, and might be as big as
> pipe-length*SuperScalarity + cache-hierarchy + memory-system.
>
> That is, n approaches 32-64 easily blocked-off units of work.

Potentially. The question is how much additional complexity it costs. Accuracy already costs complexity; you don't want to pay still more of it just for simulator performance, at the expense of exploring ideas or getting reliable data.

There is much greater parallelism across processes (traces * configurations). When I was at Intel, we had ~400 traces, so you had a parallelism of 400 for a single configuration. And the number of configurations can grow exponentially (4 cache sizes * 4 cache latencies * 10 ROB sizes * 10 scheduler sizes - not that it has to).

Ned
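[A minimal sketch of the kind of sweep described above - enumerating the cross product of configurations and pairing each with every trace so the runs can be dispatched independently. The parameter values and the run_simulation() hook named in the comment are assumptions for illustration, not Intel's actual harness.]

    /* Every (config, trace) pair is an independent job: with 400 traces and
       the 4*4*10*10 = 1600 configs below, that is 640,000 parallel runs. */
    #include <stdio.h>

    int main(void)
    {
        int cache_kb[]   = {256, 512, 1024, 2048};   /* assumed sizes */
        int cache_lat[]  = {2, 3, 4, 5};             /* assumed latencies */
        int rob_size[]   = {32, 48, 64, 80, 96, 112, 128, 160, 192, 256};
        int sched_size[] = {16, 20, 24, 28, 32, 36, 40, 48, 56, 64};
        int num_traces   = 400;
        long jobs = 0;

        for (int a = 0; a < 4; a++)
          for (int b = 0; b < 4; b++)
            for (int c = 0; c < 10; c++)
              for (int d = 0; d < 10; d++)
                for (int t = 0; t < num_traces; t++) {
                    /* a real harness would queue run_simulation(t,
                       cache_kb[a], cache_lat[b], rob_size[c], sched_size[d]) */
                    jobs++;
                }

        printf("%ld independent simulation jobs\n", jobs);  /* 640000 */
        return 0;
    }

The point is that this outer, embarrassingly parallel dimension needs no changes to the simulator itself, unlike parallelizing the per-cycle model.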