From: Anne & Lynn Wheeler on 4 Apr 2010 09:43

Morten Reistad <first(a)last.name> writes:
> The classic mainframe paging got underway when a disk access took somewhat
> more than 200 instructions to perform. Now it is main memory that takes a
> similar number of instructions to access. We have a problem with handling
> interrupts though; then an interrupt would cost ~20 instructions; now it
> costs several hundred.
>
> On a page fault we would normally want to schedule a read of the
> missing page into a vacant page slot, mark the faulting thread in IO
> wait, and pick the next process from the scheduler list to run. On IO
> complete we want to mark the page good, and accessed, put the thread
> back on the scheduler list as runnable; and possibly run it. These
> bits can be done in hardware by an MMU. But for a prototype we just need
> to generate a fault whenever the page is not in on-chip memory.

re:
http://www.garlic.com/~lynn/2010g.html#42 Interesting presentation

"Store-in caches" required the selected real memory slot to have any "changed" contents first written back to disk. The latency of this operation could be somewhat masked by running replacement selection slightly "ahead" (asynchronously) of demand ... so any required writing of the selected replaced page is completed by the time the slot is needed (this masks the latency of writing pages back to disk, but still costs the pathlength to perform the operations).

In '68 I had rewritten pieces of CP67 to get the whole average pathlength down from several thousand instructions to approximately 500. (Long-winded) old post:
http://www.garlic.com/~lynn/93.html#31 Big I/O or Kicking the Mainframe out the Door

That is separate from latency. Original CP67 used a fixed-head rotating drum with a single page transfer per I/O ... which paid an average rotational delay per transfer ... limiting it to approximately 80 I/Os per second. I changed that to "chain" all pending requests in rotational order into one single I/O operation ... raising the peak to 300 I/Os per second (a rough sketch of the idea follows below).

As machines got faster and caches became a critical component ... asynchronous I/O interrupts became a major throughput issue ... caches were relatively small ... so there was effectively complete cache replacement on every I/O interrupt (followed by separate cache replacement again on return to whatever was going on when the interrupt occurred). The cache reloading associated with asynchronous interrupts would start to dominate any highly optimized pathlength. In the 80s, some mainframe systems could have a pathlength approaching 10,000 instructions to do the page fault, replacement, task switch, I/O pathlength, interrupt, and task switch again.

The 3090 introduced extended store (electronic paging) and a synchronous page transfer instruction between main memory and extended store. Extended store was the solution to the 3090 not being able to provide uniform memory access across the physical packaging of increasing amounts of real storage. The memory that was physically farther away and had higher latency became "extended store" ... and software moved pages between "processor real storage" and "extended store". The synchronous instruction eliminated the really long pathlength associated with asynchronous operation (in many systems of the period) ... as well as some of the cache reload overhead. The issue was that extended store tended to be on the order of the size of real storage (and the same technology ... just longer latency) ... so there still had to be disk paging ...
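The drum chaining mentioned above, in rough C terms (the 9-slot track, the request list, and all names are invented for illustration; this is not the CP67 channel programming):

    /* Rough sketch of the idea: instead of one page transfer per I/O
     * (paying average rotational latency each time), chain all pending
     * drum requests in rotational order into a single operation. */
    #include <stdio.h>
    #include <stdlib.h>

    #define SLOTS_PER_TRACK 9      /* page records passing the heads per revolution */

    struct page_req {
        int page_no;               /* which page to transfer */
        int slot;                  /* rotational position of its drum record */
    };

    static int current_slot = 4;   /* where the drum "is" right now */

    /* Order requests by how soon their slot comes around after current_slot. */
    static int by_rotational_distance(const void *a, const void *b)
    {
        const struct page_req *ra = a, *rb = b;
        int da = (ra->slot - current_slot + SLOTS_PER_TRACK) % SLOTS_PER_TRACK;
        int db = (rb->slot - current_slot + SLOTS_PER_TRACK) % SLOTS_PER_TRACK;
        return da - db;
    }

    int main(void)
    {
        struct page_req pending[] = {
            { 101, 7 }, { 102, 1 }, { 103, 5 }, { 104, 8 }, { 105, 2 },
        };
        size_t n = sizeof pending / sizeof pending[0];

        /* One chained operation: sort once, then transfer every pending
         * page in the order the records pass the heads, with no extra
         * rotational delay between them. */
        qsort(pending, n, sizeof pending[0], by_rotational_distance);

        printf("single chained I/O, starting at slot %d:\n", current_slot);
        for (size_t i = 0; i < n; i++)
            printf("  transfer page %d at slot %d\n",
                   pending[i].page_no, pending[i].slot);
        return 0;
    }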
"extended store" management required moving replaced "extended store" pages back into real processor storage before writting to disk. Later processors and physical packaging eliminated "extended store" .... but some configuration continued to be configured with the two level electronic memory (machine microcode configuration setup to simulate "extended store" operation). The issue was that some operating system page replacement algorithms had become heavily tuned to operating with bimodel "extended store" ... and when presented with one large homogeneous storage, the replacement page selection performed less well than when real storage was divided into the two areas. -- 42yrs virtualization experience (since Jan68), online at home since Mar1970
From: Bernd Paysan on 4 Apr 2010 14:50

Morten Reistad wrote:
> And 10G ethernet is reasonably easily switched over longer distances.
> Fiber hops can be 70km between amplification, and 400 km between full
> regeneration. It is fully possible to build a planet-wide 10G switched
> ethernet. But then the latency would be a problem again.

Switched Ethernet has other problems that would prevent a planet-wide switched Ethernet. OK, it's not the distance that's the problem, it's the population (the number of stations a single flat switched domain has to keep track of); so as long as you keep the population low, it's possible.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
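For scale on the latency point (ballpark, assuming light in fiber propagates at roughly 200,000 km/s): a 20,000 km path, about halfway around the planet, is on the order of 100 ms one way and 200 ms round trip, before any switching, amplification, or regeneration delay is added.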
From: Stefan Monnier on 4 Apr 2010 15:06

> The classic mainframe paging got underway when a disk access took somewhat
> more than 200 instructions to perform. Now it is main memory that takes a
> similar number of instructions to access.

Are you saying that the classic mainframes switched from hardware paging to software paging more or less when disk access latency reached the "200 instructions" limit? I didn't know they used hardware paging before.

But in any case the situation nowadays is slightly different: we have a lot of cheap hardware real estate, so we can afford to implement somewhat complex "paging" schemes in hardware, so I'd expect the threshold to be a good bit higher than 200 instructions.

        Stefan
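As a rough sanity check on that comparison (ballpark figures, not measurements): a DRAM access in the neighborhood of 100 ns on a ~3 GHz core is roughly 300 cycles, and with a core able to issue several instructions per cycle that is anywhere from a few hundred to over a thousand instruction slots per miss that the cache hierarchy and out-of-order machinery have to hide. So "main memory costs a couple of hundred instructions" is, if anything, on the low side for a full miss.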
From: Morten Reistad on 4 Apr 2010 15:26

In article <jwv4ojrjb4c.fsf-monnier+comp.arch(a)gnu.org>,
Stefan Monnier <monnier(a)iro.umontreal.ca> wrote:
>> The classic mainframe paging got underway when a disk access took somewhat
>> more than 200 instructions to perform. Now it is main memory that takes a
>> similar number of instructions to access.
>
> Are you saying that the classic mainframes switched from hardware paging
> to software paging more or less when disk access latency reached the
> "200 instructions" limit?
> I didn't know they used hardware paging before.

The first pagers were very simple and primitive, like the ones on the KI10 or the first IBM 370s. We would call it "hardware-assisted software paging" today. Real, full hardware paging came in the next generation of machines. But the "hardware-assisted" paging was still a huge win.

> But in any case the situation nowadays is slightly different: we have
> a lot of cheap hardware real estate, so we can afford to implement
> somewhat complex "paging" schemes in hardware, so I'd expect the
> threshold to be a good bit higher than 200 instructions.

If we replace the cache algorithms with "real" paging, where the hardware does all the time-critical work but software still has control over the process, I expect there to be substantial gains, but there are no firm figures for this. This could be a nice job for a handful of grad students somewhere to build and evaluate.

-- mrr
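For concreteness, the software slow path being talked about is roughly the following; everything here (structures, queues, the simulated I/O) is invented for the sketch, and this is the part a "real" hardware pager would take over for the time-critical cases while software kept the policy:

    #include <stdio.h>

    enum state { RUNNABLE, IO_WAIT };

    struct thread {
        int id;
        enum state state;
        int waiting_page;          /* page being read in, if any */
    };

    struct page {
        int present;
        int referenced;
    };

    static struct page page_table[8];

    /* Page fault: note the pending read and park the thread.  A real
     * kernel would queue the disk I/O here and dispatch another thread. */
    static void page_fault(struct thread *t, int page_no)
    {
        printf("thread %d faults on page %d: queue disk read, mark IO_WAIT\n",
               t->id, page_no);
        t->state = IO_WAIT;
        t->waiting_page = page_no;
    }

    /* I/O completion: mark the page good and accessed, requeue the thread. */
    static void io_complete(struct thread *t)
    {
        page_table[t->waiting_page].present = 1;
        page_table[t->waiting_page].referenced = 1;
        t->state = RUNNABLE;
        printf("thread %d: page %d now present, back on the run queue\n",
               t->id, t->waiting_page);
    }

    int main(void)
    {
        struct thread t = { .id = 1, .state = RUNNABLE };
        page_fault(&t, 5);
        io_complete(&t);
        return 0;
    }

The interesting split is that hardware could perform the equivalent of page_fault() and io_complete() on its own for the fast cases, while software still chooses the replacement policy and handles the disk.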
From: Robert Myers on 4 Apr 2010 21:08
Andy "Krazy" Glew wrote:
> ... you must remember that the guys doing the
> hardware (and, more importantly, microcode and firmware, which is just
> another form of software) for such multilevel memory systems have the
> same ideas and read the same papers.

So is the lesson: if you're not IBM or Intel, why waste time talking about it? AMD is too busy playing catch-up with two-for-one deals (no mention of it not being a *real* twelve-core die). License Intel's Atom IP?

Robert.