From: Anne & Lynn Wheeler on 4 Apr 2010 09:43

Morten Reistad <first(a)last.name> writes:
> The classic mainframe paging got underway when a disk access took somewhat
> more than 200 instructions to perform. Now it is main memory that takes a
> similar number of instructions to access. We have a problem with handling
> interrupts though; then an interrupt would cost ~20 instructions; now it
> costs several hundred.
>
> On a page fault we would normally want to schedule a read of the
> missing page into a vacant page slot, mark the faulting thread in IO
> wait, and pick the next process from the scheduler list to run. On IO
> complete we want to mark the page good, and accessed, put the thread
> back on the scheduler list as runnable; and possibly run it. These
> bits can be done in hardware by an MMU. But for a prototype we just need
> to generate a fault whenever the page is not in on-chip memory.

re:
http://www.garlic.com/~lynn/2010g.html#42 Interesting presentation

"Store-in caches" required the selected real memory slot to have any "changed" contents first written back to disk. The latency of this operation could be somewhat masked by running replacement selection slightly "ahead" (asynchronously) of demand ... so any required writing of the selected replaced page is completed by the time the slot is needed (this masks the latency of writing pages back to disk, but still costs the pathlength to perform the operations).

In '68 I had rewritten pieces of CP67 to get the whole average pathlength down from several thousand instructions to approximately 500. (Long-winded) old post:
http://www.garlic.com/~lynn/93.html#31 Big I/O or Kicking the Mainframe out the Door

That is separate from latency. Original CP67 used a fixed-head rotating drum with a single page transfer per I/O ... which paid an average rotational delay per transfer ... limiting it to approximately 80 I/Os per second. I changed that to "chain" all pending requests in rotational order into one single I/O operation ... raising the peak to 300 I/Os per second (a rough sketch of the idea follows below).

As machines got faster and caches became a critical component ... asynchronous I/O interrupts became a major throughput issue ... caches were relatively small ... so there was effectively complete cache replacement on every I/O interrupt (followed by separate cache replacement again on return to whatever was going on when the interrupt occurred). The cache reloading associated with asynchronous interrupts would start to dominate any highly optimized pathlength. In the 80s, some mainframe systems could have a pathlength approaching 10,000 instructions to do the page fault, replacement, task switch, I/O pathlength, interrupt, and task switch again.

The 3090 introduced extended store (electronic paging) and a synchronous page transfer instruction between main memory and extended store. Extended store was the solution to the 3090 not being able to provide uniform memory access across the physical packaging of increasing amounts of real storage. The memory that was physically farther away and had higher latency became "extended store" ... and software moved pages between "processor real storage" and "extended store". The synchronous instruction eliminated the really long pathlength associated with asynchronous operation (in many systems of the period) ... as well as some of the cache reload overhead. The issue was that extended store tended to be on the order of the size of real storage (and the same technology ... just longer latency) ... so there still had to be disk paging ...
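The drum chaining mentioned above, in rough C terms (the 9-slot track, the request list, and all names are invented for illustration; this is not the CP67 channel programming):

    /* Rough sketch of the idea: instead of one page transfer per I/O
     * (paying average rotational latency each time), chain all pending
     * drum requests in rotational order into a single operation. */
    #include <stdio.h>
    #include <stdlib.h>

    #define SLOTS_PER_TRACK 9      /* page records passing the heads per revolution */

    struct page_req {
        int page_no;               /* which page to transfer */
        int slot;                  /* rotational position of its drum record */
    };

    static int current_slot = 4;   /* where the drum "is" right now */

    /* Order requests by how soon their slot comes around after current_slot. */
    static int by_rotational_distance(const void *a, const void *b)
    {
        const struct page_req *ra = a, *rb = b;
        int da = (ra->slot - current_slot + SLOTS_PER_TRACK) % SLOTS_PER_TRACK;
        int db = (rb->slot - current_slot + SLOTS_PER_TRACK) % SLOTS_PER_TRACK;
        return da - db;
    }

    int main(void)
    {
        struct page_req pending[] = {
            { 101, 7 }, { 102, 1 }, { 103, 5 }, { 104, 8 }, { 105, 2 },
        };
        size_t n = sizeof pending / sizeof pending[0];

        /* One chained operation: sort once, then transfer every pending
         * page in the order the records pass the heads, with no extra
         * rotational delay between them. */
        qsort(pending, n, sizeof pending[0], by_rotational_distance);

        printf("single chained I/O, starting at slot %d:\n", current_slot);
        for (size_t i = 0; i < n; i++)
            printf("  transfer page %d at slot %d\n",
                   pending[i].page_no, pending[i].slot);
        return 0;
    }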
"extended store" management required moving replaced "extended store" pages back into real processor storage before writting to disk. Later processors and physical packaging eliminated "extended store" .... but some configuration continued to be configured with the two level electronic memory (machine microcode configuration setup to simulate "extended store" operation). The issue was that some operating system page replacement algorithms had become heavily tuned to operating with bimodel "extended store" ... and when presented with one large homogeneous storage, the replacement page selection performed less well than when real storage was divided into the two areas. -- 42yrs virtualization experience (since Jan68), online at home since Mar1970
From: Bernd Paysan on 4 Apr 2010 14:50

Morten Reistad wrote:
> And 10G ethernet is reasonably easily switched over longer distances.
> Fiber hops can be 70km between amplification, and 400 km between full
> regeneration. It is fully possible to build a planet-wide 10G switched
> ethernet. But then the latency would be a problem again.

Switched Ethernet has other problems that would prevent a planet-wide switched Ethernet. OK, it's not the distance that's the problem, it's the population (the number of stations a single flat switched domain has to keep track of); so as long as you keep the population low, it's possible.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
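For scale on the latency point (ballpark, assuming light in fiber propagates at roughly 200,000 km/s): a 20,000 km path, about halfway around the planet, is on the order of 100 ms one way and 200 ms round trip, before any switching, amplification, or regeneration delay is added.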
From: Stefan Monnier on 4 Apr 2010 15:06

> The classic mainframe paging got underway when a disk access took somewhat
> more than 200 instructions to perform. Now it is main memory that takes a
> similar number of instructions to access.

Are you saying that the classic mainframes switched from hardware paging to software paging more or less when disk access latency reached the "200 instructions" limit? I didn't know they used hardware paging before.

But in any case the situation nowadays is slightly different: we have a lot of cheap hardware real estate, so we can afford to implement somewhat complex "paging" schemes in hardware, so I'd expect the threshold to be a good bit higher than 200 instructions.

        Stefan
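As a rough sanity check on that comparison (ballpark figures, not measurements): a DRAM access in the neighborhood of 100 ns on a ~3 GHz core is roughly 300 cycles, and with a core able to issue several instructions per cycle that is anywhere from a few hundred to over a thousand instruction slots per miss that the cache hierarchy and out-of-order machinery have to hide. So "main memory costs a couple of hundred instructions" is, if anything, on the low side for a full miss.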
From: Morten Reistad on 4 Apr 2010 15:26

In article <jwv4ojrjb4c.fsf-monnier+comp.arch(a)gnu.org>,
Stefan Monnier <monnier(a)iro.umontreal.ca> wrote:
>> The classic mainframe paging got underway when a disk access took somewhat
>> more than 200 instructions to perform. Now it is main memory that takes a
>> similar number of instructions to access.
>
> Are you saying that the classic mainframes switched from hardware paging
> to software paging more or less when disk access latency reached the
> "200 instructions" limit?
> I didn't know they used hardware paging before.

The first pagers were very simple and primitive, like the ones on the KI10 or the first IBM 370s. We would call it "hardware-assisted software paging" today. Real, full hardware paging came in the next generation of machines. But the "hardware-assisted" paging was still a huge win.

> But in any case the situation nowadays is slightly different: we have
> a lot of cheap hardware real estate, so we can afford to implement
> somewhat complex "paging" schemes in hardware, so I'd expect the
> threshold to be a good bit higher than 200 instructions.

If we replace the cache algorithms with "real" paging, where the hardware does all the time-critical work but software still has control over the process, I expect there to be substantial gains, but there are no firm figures for this. This could be a nice job for a handful of grad students somewhere to build and evaluate.

-- mrr
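For concreteness, the software slow path being talked about is roughly the following; everything here (structures, queues, the simulated I/O) is invented for the sketch, and this is the part a "real" hardware pager would take over for the time-critical cases while software kept the policy:

    #include <stdio.h>

    enum state { RUNNABLE, IO_WAIT };

    struct thread {
        int id;
        enum state state;
        int waiting_page;          /* page being read in, if any */
    };

    struct page {
        int present;
        int referenced;
    };

    static struct page page_table[8];

    /* Page fault: note the pending read and park the thread.  A real
     * kernel would queue the disk I/O here and dispatch another thread. */
    static void page_fault(struct thread *t, int page_no)
    {
        printf("thread %d faults on page %d: queue disk read, mark IO_WAIT\n",
               t->id, page_no);
        t->state = IO_WAIT;
        t->waiting_page = page_no;
    }

    /* I/O completion: mark the page good and accessed, requeue the thread. */
    static void io_complete(struct thread *t)
    {
        page_table[t->waiting_page].present = 1;
        page_table[t->waiting_page].referenced = 1;
        t->state = RUNNABLE;
        printf("thread %d: page %d now present, back on the run queue\n",
               t->id, t->waiting_page);
    }

    int main(void)
    {
        struct thread t = { .id = 1, .state = RUNNABLE };
        page_fault(&t, 5);
        io_complete(&t);
        return 0;
    }

The interesting split is that hardware could perform the equivalent of page_fault() and io_complete() on its own for the fast cases, while software still chooses the replacement policy and handles the disk.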
From: Robert Myers on 4 Apr 2010 21:08
Andy "Krazy" Glew wrote:
> ... you must remember that the guys doing the
> hardware (and, more importantly, microcode and firmware, which is just
> another form of software) for such multilevel memory systems have the
> same ideas and read the same papers.

So is the lesson: if you're not IBM or Intel, why waste time talking about it? AMD is too busy playing catch-up with two-for-one deals (no mention of it not being a *real* twelve-core die). License Intel's Atom IP?

Robert.