From: "Andy "Krazy" Glew" on 3 Apr 2010 13:49 On 4/3/2010 10:29 AM, nmm1(a)cam.ac.uk wrote: > What I don't understand is why everybody is so attached to demand > paging - it was near-essential in the 1970s, because memory was > very limited, but this is 35 years later! As far as I know, NONE > of the facilities that demand paging provides can't be done better, > and more simply, in other ways (given current constraints). I think my last post identified a slightly far-out reason: you can page without knowledge of software. But you can't swap. Paging (and other oblivious caching) has a more graceful fall-off. Note that I can't say truly graceful, because modern OSes don't page well. If you are doing whole program swapping, you can't run a program larger than memory With paging you can - although you better not be accessing a footprint larger than memory If you are doing segment swapping, and if data structures are constrained to be no larger than a segment, then you can't have a data structure larger than main memory. Or even approaching in size, since you will hae other segments. With paging you can. Sure, there have been compilers that automatically split up stuff like "char array[1024*1024*1024*16]" into multiple segments (where segments had a fixed maximum size). But that sure does look like paging. Just potentially with variable size pages. Hmmm... modern techniques like COW zero filled pages and forking work better with pages than with segments. You could COW whole segments. But it really is better to have a level underneath the segments. It could be segments and sub-segments, but it so often is segments and paging. And if you have paging, you don't need segments. Segments are associated far too often with restrictions such as "you can't have data structures bigger than 64K." Or 1G, Or 2^40 Such restrictions are the kiss of death. === Or are you talking not about pages vs. segments, but why bother with either?
From: nmm1 on 3 Apr 2010 14:29

In article <4BB77FAE.5030203(a)patten-glew.net>,
Andy \"Krazy\" Glew  <ag-news(a)patten-glew.net> wrote:
>
>> What I don't understand is why everybody is so attached to demand
>> paging - it was near-essential in the 1970s, because memory was
>> very limited, but this is 35 years later!  As far as I know, NONE
>> of the facilities that demand paging provides can't be done better,
>> and more simply, in other ways (given current constraints).
>
>I think my last post identified a slightly far-out reason: you can page
>without knowledge of software.  But you can't swap.

Hmm.  Neither is entirely true - think of truly time-critical code.
I have code that will work together with swapping, but not paging.

>Paging (and other oblivious caching) has a more graceful fall-off.  Note
>that I can't say truly graceful, because modern OSes don't page well.

Not in my experience, sorry.  Even under MVS, which did page well, IBM
learnt that paging was a disaster area for far too many uses and backed
off to use swapping wherever at all possible.  And "don't page well"
often translates to "hang up and crash in truly horrible ways, often
leaving filesystems corrupted."

>If you are doing whole-program swapping, you can't run a program larger
>than memory.  With paging you can - although you had better not be
>accessing a footprint larger than memory.
>
>If you are doing segment swapping, and if data structures are
>constrained to be no larger than a segment, then you can't have a data
>structure larger than main memory - or even approaching it in size,
>since you will have other segments.  With paging you can.
>
>Sure, there have been compilers that automatically split up stuff like
>"char array[1024*1024*1024*16]" into multiple segments (where segments
>had a fixed maximum size).  But that sure does look like paging - just
>potentially with variable-size pages.

All of the above is true, in theory, but I haven't even heard of a
current example where those really matter, in practice - one that isn't
due to a gross (software) misdesign.  The point is that paging works
(in practice) only when your access is very sparse (either in space or
time) AND when the sparse access is locally dense.  I don't know of any
uses where there are not much better solutions.

>Hmmm... modern techniques like COW zero-filled pages and forking work
>better with pages than with segments.  You could COW whole segments,
>but it really is better to have a level underneath the segments.  It
>could be segments and sub-segments, but it so often is segments and
>paging.  And if you have paging, you don't need segments.

Why?  That's a serious question.  See above.  In any case, the
fork-join model is a thoroughgoing misdesign for almost all purposes,
and most COW uses are to kludge up its deficiencies.

Please note that I was NOT proposing a hardware-only solution!

>Segments are associated far too often with restrictions such as "you
>can't have data structures bigger than 64K."  Or 1G.  Or 2^40.  Such
>restrictions are the kiss of death.

I am not talking about such misdesigns, but something competently
designed.  After all, handling segments properly isn't exactly a new
technology.


Regards,
Nick Maclaren.
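[A sketch of the "sparse but locally dense" condition, assuming a POSIX
mmap() region larger than free RAM; the region size, stride, and both
traversal routines are illustrative only - the point is the access
pattern, not the numbers:]

/* Sketch: the same region, two access patterns.  Assumes a POSIX
 * system; MAP_NORESERVE is Linux-specific.  REGION is meant to be
 * larger than free RAM so that paging actually happens. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION  ((size_t)4 << 30)   /* 4 GiB, assumed > free RAM */
#define PAGE    4096

static void touch_dense(char *p)
{
    /* Locally dense: stream through one contiguous sixteenth, so
     * every page that is faulted in gets written 4096 times. */
    memset(p, 1, REGION / 16);
}

static void touch_sparse(char *p)
{
    /* Sparse and NOT locally dense: one byte every 16 pages, so
     * nearly every access takes its own page fault. */
    for (size_t off = 0; off < REGION; off += (size_t)PAGE * 16)
        p[off] = 1;
}

int main(void)
{
    char *p = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    touch_dense(p);    /* pages tolerably under memory pressure */
    touch_sparse(p);   /* thrashes once the region exceeds RAM  */

    munmap(p, REGION);
    return 0;
}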
From: Anne & Lynn Wheeler on 3 Apr 2010 16:09 i got called in to talk to some of the people regardubg mvt to initial virtual memory vs2/svs ... page replacement algorithms, local vs-a-vis global replacement, some other stuff. for some reason, they decided to make a micro trade-off ... selecting non-changed pages for replacement before changed pages (because they didn't have to write first ... the slot was immediately available) .... didn't matter how much argued with them ... they were determined to do it anyway. it wasn't until well into the vs2/mvs time-frame that somebody realized that they were replacing high-use shared executable library pages before private, lower-use data pages. recent thread mentioning old issue with local vis-a-vis global lru replacement: http://www.garlic.com/~lynn/2010f.html#85 16:32 far pointers in OpenWatcom C/C++ http://www.garlic.com/~lynn/2010f.html#89 16:32 far pointers in OpenWatcom C/C++ http://www.garlic.com/~lynn/2010f.html#90 16:32 far pointers in OpenWatcom C/C++ http://www.garlic.com/~lynn/2010f.html#91 16:32 far pointers in OpenWatcom C/C++ jim getting me into middle of academic dispute based on some work i had done as undergraduate in the 60s ... past reference http://www.garlic.com/~lynn/2006w.html#46 The Future of CPUs: What's After Multi-Core? in the 80s ... there was "big page" change for both mvs and vm ... sort of part-way swapping & demand page. at queue drop (aka swap) .... resident virtual pages would be formed into ten page collections and written as full-track 3380 disk write ... to the nearest free track arm position (moving cursor, slight similarity to log-structured file system). Subsequent page fault, if the page was part of a big-page, the whole big page would be brought in. the whole orientation was arm motion is the primary bottleneck with trading off both transfer capacity and real-storage against arm motion (fetch of ten pages might bring in pages that weren't actually needed the next time, the approach was slightly better than brute force move to 40k pages ... since it was some ten 4k pages that had actually been recently used ... as opposed to 40kbyte contiguous area). recent post mentioning big pages: http://www.garlic.com/~lynn/2010g.html#23 16:32 far pointers in OpenWatcom C/C++ -- 42yrs virtualization experience (since Jan68), online at home since Mar1970
From: Muzaffer Kal on 3 Apr 2010 16:46

On Sat, 3 Apr 2010 04:45:56 +0200, Morten Reistad <first(a)last.name>
wrote:
>In article <tpqar5pfc05ajvn3329an6v9tcodh58i65(a)4ax.com>,
>Muzaffer Kal  <kal(a)dspia.com> wrote:
>>On Thu, 01 Apr 2010 22:48:47 -0500, Del Cecchi` <delcecchi(a)gmail.com>
>>wrote:
>>
>>>MitchAlsup wrote:
>>>> On Apr 1, 5:40 pm, timcaff...(a)aol.com (Tim McCaffrey) wrote:
>>>>
>>>>>The PCIe 2.0 links on the Clarkdale chips run at 5G.
>>>>
>>>> And how many dozen meters can these wires run?
>>>>
>>>> Mitch
>>>
>>>Maybe 1 dozen meters, depending on thickness of wire.  Wire thickness
>>>depends on how many you want to be able to put in a cable, and if you
>>>want to be able to bend those cables.
>>>
>>>10GbaseT or whatever it is called gets 10 gbits/second over 4 twisted
>>>pairs for 100 meters by what I classify as unnatural acts torturing
>>>the bits.
>>
>>You should also add that this is full duplex, i.e. simultaneous
>>transmission of 10G in both directions.  One needs 4 equalizers, 4 echo
>>cancellers, 12 NEXT and 12 FEXT cancellers in addition to a fully
>>parallel LDPC decoder (don't even talk about the insane requirement on
>>the clock recovery block).  Over the last 5 years probably US$ 100M of
>>VC money got spent to develop 10GBT PHYs, with several startups
>>disappearing with not much to show for it.  Torturing the bits indeed
>>(not to mention torture of the engineers trying to make this thing
>>work.)
>>--
>>Muzaffer Kal
>
>Except for the optics, fiber equipment is a lot simpler.

That reminds me of the school principal who said "except for the
students, running schools is very easy."  The nice thing about copper
is that there are no optics.  Once you get over the hurdle of making a
PHY, you can make cables easily, bend them as you want/need (within
reason) and change connectivity as needed.  That said, 10GBT is
probably the last copper PHY for 100m cable.
--
Muzaffer Kal
DSPIA INC.
ASIC/FPGA Design Services
http://www.dspia.com
From: Morten Reistad on 4 Apr 2010 01:51
In article <4BB7763D.9020103(a)patten-glew.net>,
Andy \"Krazy\" Glew  <ag-news(a)patten-glew.net> wrote:
>On 4/2/2010 9:45 PM, Morten Reistad wrote:
>> If the 6 MB (or 12, or 24) of RAM are superfast, have a minimal mmu,
>> at least capable of process isolation and address translation; and do
>> the "paging" to main memory, then you could run one of these minimal,
>> general purpose operating systems inside each gpu/cpu/whatever, and
>> live with the page faults.  It will be several orders of magnitude
>> faster and lower latency than the swapping and paging we normally love
>> to hate.  We already have that "swapping", except we call it "memory
>> access".
>>
>> The theory is old, stable and well validated.  The code is done, and
>> still in many operating systems.  We "just need drivers".
>
>I know of at least one startup whose founder called me up asking for
>somebody able to write such a driver and/or tune the paging subsystem
>to create a hierarchy of ultra fast and conventional DRAM.
>
>In Linux.
>
>A few months later he was on to hardware state machines to do the
>paging.  Linux paging was too high overhead.

I could have told you that.  Linux has done integrated paging and file
systems more or less straight from Tops-20, or at least Denning's
papers.  I would try OpenBSD, which still has that ultratight code from
the classic BSDs, or QNX, where they are proud to have short code paths.

>Now, I suspect that a highly tuned paging system might be able to be
>10X faster than Linux's.  (In the same way that Linux ages much better
>than Windows.)  (I hearken back to the day when people actually cared
>about paging, e.g. at Gould.)
>
>But the lesson is that modern software, modern OSes, don't seem to page
>worth a damn.  And if you are thinking to brush off, re-tune, and
>completely reinvent virtual memory algorithms from the good old days
>(paging and/or swapping), you must remember that the guys doing the
>hardware (and, more importantly, microcode and firmware, which is just
>another form of software) for such multilevel memory systems have the
>same ideas and read the same papers.

I am not advocating a "multilevel memory system" in the sense of NUMA
etc.  I am advocating utilising standard, 41-year-old virtual memory
techniques on the chip boundary.

>It's just the usual:
>
>If you do it in software in the OS
>- you have to do it in every OS
>- you have to ship the new OS, or at least the new drivers
>+ you can take advantage of more knowledge of software, processes, tasks
>+ you can prototype quicker, since you have Linux source code

Or you can build yourself a hypervisor, like Apple has done, and have
that present the "standard" x86 interface to the OS.  And, as you say,
once the tables are set up you can have hardware perform the vast bulk
of the paging activities.

>If you do it in hardware/firmware/microcode
>+ it works for every OS, including legacy OSes without device drivers
>- you have less visibility to software constructs
>+ if you are the hardware company, you can do it without OS (Microsoft OS) source code
>
>===
>
>Hmmm... HW/FW/UC can only really do paging.  Can't really do swapping.
>Swapping requires SW knowledge.
>
>===
>
>My take: if the "slower" memory is within 1,000 cycles, you almost
>definitely need a hardware state machine.  Block the process (hardware
>thread), switch to another thread.

The classic mainframe paging got underway when a disk access took
somewhat more than 200 instructions to perform.  Now it is main memory
that takes a similar number of instructions to access.
We have a problem with handling interrupts though; back then an
interrupt would cost ~20 instructions; now it costs several hundred.

On a page fault we would normally want to schedule a read of the
missing page into a vacant page slot, mark the faulting thread as in IO
wait, and pick the next process from the scheduler list to run.  On IO
completion we want to mark the page good and accessed, put the thread
back on the scheduler list as runnable, and possibly run it.

These bits can be done in hardware by an MMU.  But for a prototype we
just need to generate a fault whenever the page is not in on-chip
memory.

>If within 10,000 cycles, I think hardware/firmware will still win.
>
>I think software OS level paging only is a clear win at >= 100,000 cycles.

The performance problem is with the interrupt load.  Apart from that we
saw clear wins in paging at the quarter-thousand-cycle level.

-- mrr
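[A sketch of that fault path in C.  Every type and helper below
(grab_vacant_slot, schedule_read, next_runnable, and so on) is
hypothetical - a stand-in for whatever the MMU state machine, paging
firmware, or a prototype driver would actually provide:]

/* Sketch of the fault path described above.  All helpers are
 * hypothetical services of the scheduler and the paging engine. */
#include <stddef.h>

enum thread_state { RUNNABLE, IO_WAIT };

struct page   { int present; int accessed; };
struct thread { enum thread_state state; };

/* Hypothetical services supplied by the scheduler and paging engine. */
extern struct page   *grab_vacant_slot(void);
extern void           schedule_read(struct page *missing, struct page *slot);
extern struct thread *next_runnable(void);
extern void           dispatch(struct thread *t);
extern void           enqueue_runnable(struct thread *t);

void on_page_fault(struct thread *t, struct page *missing)
{
    struct page *slot = grab_vacant_slot();   /* pick a free on-chip frame    */
    schedule_read(missing, slot);             /* start the fill from DRAM     */
    t->state = IO_WAIT;                       /* park the faulting thread     */
    dispatch(next_runnable());                /* run something else meanwhile */
}

void on_io_complete(struct thread *t, struct page *p)
{
    p->present  = 1;                          /* page is good again           */
    p->accessed = 1;                          /* and counts as recently used  */
    t->state = RUNNABLE;
    enqueue_runnable(t);                      /* scheduler may run it next    */
}

[Whether these two routines live in an interrupt handler or in a
hardware state machine is exactly the cost question discussed above.]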