From: nmm1 on 6 Apr 2010 03:50
In article <fFzun.305575$OX4.101427(a)newsfe25.iad>, EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>Anne & Lynn Wheeler wrote:
>>
>> original cp67 just did single page transfer per i/o ... so 2301 would
>> saturate at about 80 page/sec. i redid several things in cp67
>> ... including chaining multiple page transfers in single i/o
>> ... chaining in rotational order. this resulted in still half rotational
>> delay per i/o ... but tended to be amortized over several page
>> transfers. this resulted in being able to drive 2301 up to nearly 300
>> page transfers per second (each i/o took longer ... but the queue delay
>> was significantly reduced under heavy load ... since it had almost four
>> times the peak thruput).
>
>I vaguely recall someone telling me that 370 VM had a page file
>defragger process that would coalesce pages for long running
>processes so they were contiguous in the page file,
>to facilitate multi page read ahead without multiple seeks.
>
>Does any of that sound familiar to you?

Yes. MVS did that, too.

Regards,
Nick Maclaren.
From: Anne & Lynn Wheeler on 6 Apr 2010 09:07
EricP <ThatWouldBeTelling(a)thevillage.com> writes:
> I vaguely recall someone telling me that 370 VM had a page file
> defragger process that would coalesce pages for long running
> processes so they were contiguous in the page file,
> to facilitate multi page read ahead without multiple seeks.
>
> Does any of that sound familiar to you?

"big pages" ... discussion/description in this recent post (done for both mvs & vm in the 80s)
http://www.garlic.com/~lynn/2010g.html#23 16:32 far pointers in OpenWatcom C/C++
http://www.garlic.com/~lynn/2010g.html#42 Interesting presentation

basically attempting to tailor paging to the operational characteristics of the 3380 disk drive ... part of it was analogous to log-structured filesystems ... sort of a moving cursor across the drive, with writes going to the empty track closest to the current cursor position. one or more drives ... and the available (big page 3380) page space was kept possibly ten times larger than expected use ... so that the leading edge of the cursor was as empty as possible.

re:
http://www.garlic.com/~lynn/2010g.html#71 Interesting presentation

as to the 2301 ... it was formatted 9 4k pages per pair of tracks ... with page "5" spanning the end of one track and the start of the next. at 60 revs/sec ... peak sustained was actually 270 page transfers/sec.

some regression analysis associated 1.5 mills of cpu per page fault/read ... that is the fault, page replacement algorithm, some fraction of a page write (when the selected replaced page was changed and had to be written), marking the current task non-executable, task switch, later i/o interrupt processing, and the switch back to the faulting task. somewhere between 500-700 instructions. cp67 avg. behavior had a ratio of 3 reads per write (so 1/3rd of the page write overhead was attributed to each read).

in the initial simplification morph from cp67 to vm370 ... a lot of stuff i had done for cp67 as an undergraduate was dropped (some amount of paging, paging algorithms, dispatching algorithms, "fastpath" for commonly executed instruction paths, etc). some of the fastpath stuff leaked back in late in vm370 release 1. there is this reference to doing a major migration of the remaining cp67 changes to (internal) vm370 ... and doing internal product distribution as "csc/vm"
http://www.garlic.com/~lynn/2006v.html#email731212
http://www.garlic.com/~lynn/2006w.html#email750102
http://www.garlic.com/~lynn/2006w.html#email750430

a small subset of the virtual memory management changes leaked back into the product for vm370 release 3. i was then asked to do a separately charged kernel product for some of the remaining changes ... resource management algorithms, some newer paging stuff, etc. some of the newer paging stuff included a background sweeper task that would pull low-active pages off the drum and move them to disk (increasing the effective use of the relatively small capacity, faster paging device).

the big pages stuff somewhat traded off i/o capacity and real storage ... attempting to have the 3380 approximate a lower latency fixed-head device (attempting to minimize the penalty of disk arm motion ... partially by forcing the overhead to always be amortized over a ten page transfer). some of the "big page" work was motivated by no longer having a "corporate" paging device product ...

some recent discussion and old email
http://www.garlic.com/~lynn/2010g.html#11 Mainframe Executive article on the death of tape
http://www.garlic.com/~lynn/2010g.html#22 Mainframe Executive article on the death of tape
http://www.garlic.com/~lynn/2010g.html#55 Mainframe Executive article on the death of tape

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
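[A quick back-of-the-envelope sketch of the 2301 arithmetic quoted above. The drum geometry (9 4k pages per track pair, 60 revs/sec) comes from the post; the chaining model below is my own simplification, not the original cp67 code.]

REVS_PER_SEC = 60
PAGES_PER_TRACK_PAIR = 9          # page "5" spans the track boundary
ROTATION_MS = 1000 / REVS_PER_SEC # ~16.7 ms per revolution

# 9 pages per two revolutions -> the 270 page/sec peak quoted above
peak = (PAGES_PER_TRACK_PAIR / 2) * REVS_PER_SEC
print(f"peak sustained transfer: {peak:.0f} pages/sec")

def pages_per_sec(chained_pages):
    """Throughput when each i/o pays half a rotation of latency plus
    the transfer time for `chained_pages` pages."""
    latency_ms = ROTATION_MS / 2  # average rotational delay per i/o
    transfer_ms = chained_pages * ROTATION_MS * 2 / PAGES_PER_TRACK_PAIR
    return chained_pages / ((latency_ms + transfer_ms) / 1000)

for n in (1, 9, 27):
    print(f"{n:2d} pages chained per i/o: {pages_per_sec(n):3.0f} pages/sec")
# single-page i/o saturates near the ~80/sec the post mentions;
# longer chains amortize the half rotation and approach the 270 peak.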
From: Robert Myers on 6 Apr 2010 14:18
On Apr 5, 6:03 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net> wrote:
> If you are a hardware company paging is attractive. Sure, maybe you want
> to talk about hints to promote more efficient SW/OS use.

I came across this document

http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

Table 2: Data Source Latency

  L3 CACHE hit, line unshared                 ~40 cycles
  L3 CACHE hit, shared line in another core   ~65 cycles
  L3 CACHE hit, modified in another core      ~75 cycles
  remote L3 CACHE                             ~100-300 cycles
  Local Dram                                  ~60 ns
  Remote Dram                                 ~100 ns

L1 and L2 cache latencies are apparently 4 cycles and 10 cycles, respectively.

The first thing I notice is that there is a cache-memory continuum (L1, L2, L3, local dram, nonlocal dram), with a geometric progression of latency where the latency is multiplied by a factor of 2-4 with each step down the hierarchy.

In order to get a modern working set into some level of cache, we have to be talking about L3 cache or perhaps an L4 cache. We could

1. Make L3 even larger. To do that, you have to be Intel.

2. Create an off-chip L4 that is faster than local dram. You don't have to be Intel to do that.

3. Conceivably, you could use L4 the way IBM sometimes does, which is to cache memory requests to nonlocal dram in local dram. I don't think you have to be Intel to do that.

The payoff for caching nonlocal dram storage seems like it could be dramatic, as could the payoff for increasing the size of L3 and implementing a paging scheme. I suppose that any improvement in latency is important, but I wonder how dramatic the gains from creating an off-chip L4 that is faster than local dram could possibly be, compared to local dram access.

Robert.
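[To make the "factor of 2-4 per level" observation concrete, here is a small sketch converting the quoted figures to a common unit. The ~3 GHz clock is my assumption and is not in the post; the cycle counts and ns numbers are the ones in the table above.]

CLOCK_GHZ = 3.0  # assumed core clock: cycles / GHz = nanoseconds

latency_ns = [
    ("L1 hit",            4 / CLOCK_GHZ),
    ("L2 hit",           10 / CLOCK_GHZ),
    ("L3 hit, unshared", 40 / CLOCK_GHZ),
    ("local DRAM",       60.0),
    ("remote DRAM",     100.0),
]

prev = None
for level, ns in latency_ns:
    ratio = f" ({ns / prev:.1f}x previous)" if prev else ""
    print(f"{level:18s} ~{ns:6.1f} ns{ratio}")
    prev = ns
# prints roughly 1.3, 3.3, 13.3, 60, 100 ns: about 2.5x-4.5x per step
# until the last one, which is why an L4 between L3 and local DRAM has
# relatively little latency headroom to exploit.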
From: Anne & Lynn Wheeler on 6 Apr 2010 15:50
re:
http://www.garlic.com/~lynn/2010g.html#72 Interesting presentation

w/o bigpage ... pages retained a "home" location on disk after being brought into storage (if a page was later selected for replacement and hadn't been changed ... it avoided the disk write since the home location hadn't gone stale).

with bigpage, there was no longer a home location on disk ... whenever a full-track bigpage was read into storage ... the corresponding disk track was released. subsequent replacement would require a write (since there was no copy retained on disk). this effectively drove up writes ... effectively a write for every replaced page ... to make room for a read (just another example of trading off transfer capacity as part of optimizing the 3380 moveable arm bottleneck for a paging device).
http://www.garlic.com/~lynn/2010g.html#23 16:32 far pointers in OpenWatcom C/C++
http://www.garlic.com/~lynn/2010g.html#42 Interesting presentation
http://www.garlic.com/~lynn/2010g.html#72 Interesting presentation

this is somewhat a variation on the dup/no-dup trade-off (i.e. maintaining a duplicate copy on disk when the page is resident in processor memory). this shows up when preferred paging storage becomes relatively small compared to real storage size (and/or total allocated virtual memory). it shows up more recently when some machine configurations had gbyte real-storage and relatively small multi-gbyte disks ... where space for paging was limited ... a trade-off between total virtual pages being limited by available space on disk (duplicate) ... or by available space on disk plus real storage (no-duplicate). a 2gbyte paging space and 1gbyte real storage ... might be able to handle 3gbyte of total virtual pages with a no-dup strategy.

i had also added some code to vm370 (besides the fixed-head paging device cleaner/sweeper) that would switch from a duplicate strategy to a no-duplicate strategy when fixed-head paging device space became especially constrained (do sweep/clean ... and then start non-duplicate for the fixed-head paging device ... but typically retain duplicate for the much larger paging areas on regular disks).

misc. past posts mentioning dup/no-dup strategies:
http://www.garlic.com/~lynn/93.html#13 managing large amounts of vm
http://www.garlic.com/~lynn/2000d.html#13 4341 was "Is a VAX a mainframe?"
http://www.garlic.com/~lynn/2001i.html#42 Question re: Size of Swap File
http://www.garlic.com/~lynn/2001l.html#55 mainframe question
http://www.garlic.com/~lynn/2002b.html#10 hollow files in unix filesystems?
http://www.garlic.com/~lynn/2002b.html#20 index searching
http://www.garlic.com/~lynn/2002e.html#11 What are some impressive page rates?
http://www.garlic.com/~lynn/2002f.html#20 Blade architectures
http://www.garlic.com/~lynn/2002f.html#26 Blade architectures
http://www.garlic.com/~lynn/2003f.html#5 Alpha performance, why?
http://www.garlic.com/~lynn/2004g.html#17 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004g.html#18 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004g.html#20 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004h.html#19 fast check for binary zeroes in memory
http://www.garlic.com/~lynn/2004i.html#1 Hard disk architecture: are outer cylinders still faster than inner cylinders?
http://www.garlic.com/~lynn/2005c.html#27 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005m.html#28 IBM's mini computers--lack thereof
http://www.garlic.com/~lynn/2006c.html#8 IBM 610 workstation computer
http://www.garlic.com/~lynn/2006f.html#18 how much swap size did you take?
http://www.garlic.com/~lynn/2007c.html#0 old discussion of disk controller chache
http://www.garlic.com/~lynn/2008f.html#19 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008k.html#80 How to calculate effective page fault service time?

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
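[A toy illustration of the dup vs no-dup capacity trade-off described above, using the 1 gbyte real storage / 2 gbyte paging space example from the post; the code is mine, not anything from vm370.]

real_storage_gb = 1
paging_space_gb = 2

# duplicate strategy: every virtual page keeps its "home" slot on disk even
# while resident in real storage, so disk space bounds total virtual pages
dup_capacity_gb = paging_space_gb

# no-duplicate strategy: a page lives either in real storage or on disk,
# never both, so the two pools add up
nodup_capacity_gb = paging_space_gb + real_storage_gb

print(f"duplicate strategy   : ~{dup_capacity_gb} GB of virtual pages")    # 2
print(f"no-duplicate strategy: ~{nodup_capacity_gb} GB of virtual pages")  # 3

# the cost of no-dup (and of full-track big pages that release the home
# track on read): a page selected for replacement must be written back even
# if unchanged, so writes rise toward one write per replaced page.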
From: Morten Reistad on 7 Apr 2010 03:13
In article <4BBC0C0C.8010803(a)patten-glew.net>, Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/6/2010 11:18 AM, Robert Myers wrote:
>> The first thing I notice is that there is a cache-memory continuum
>> (L1, L2, L3, local dram, nonlocal dram), with a geometric progression
>> of latency where the latency is multiplied by a factor of 2-4 with
>> each step down the hierarchy.
>>
>> In order to get a modern working set into some level of cache, we have
>> to be talking about L3 cache or perhaps an L4 cache.
>>
>> We could
>>
>> 1. Make L3 even larger. To do that, you have to be Intel.
>>
>> 2. Create an off-chip L4 that is faster than local dram. You don't
>> have to be Intel to do that.

The Hypertransport interconnects between CPU L3 caches make the other
chip's L3 caches an "L4". Making a "cache-only" L3 cache chip and
interconnecting it with Hypertransport would do this. But you would have
to be Intel or AMD to do it.

>Note, however, that building an L3 cache is now more complicated, since
>there is no longer an FSB. Instead, the processor has an integrated
>memory controller and DRAM channels, and QPI.
>
>You *could* just mark all memory as remote, and just use QPI, with a
>cache on QPI. Leaving the DRAM memory channels unused.
>
>However, that wastes a lot of chip pin bandwidth.
>
>It would be a fun challenge to create a cache that was attached to the
>DRAM channels. If uniprocessor, that might be sufficient. If
>multiprocessor, you would want to connect to both the DRAM channels and
>QPI. Simplest would be to use QPI for coherency traffic - invalidations -
>while the memory channels would be used for fills and evictions. Or you
>could spread memory across both.
>
>> The payoff for caching nonlocal dram storage seems like it could be
>> dramatic. The payoff for increasing the size of L3 and implementing a
>> paging scheme could be dramatic. I suppose that any improvement in
>> latency is important, but I wonder how dramatic the gains from
>> creating an off-chip L4 that is faster than local dram could possibly
>> be compared to local dram access.
>
>These are empirical questions.
>
>IIRC the payoff for increasing L3 cache much beyond the 2M/core (that we
>see now) to 16M per core is small. Well into diminishing returns. But
>there will certainly be apps for which incrementally larger cache helps
>right away, there will certainly be a point at which a new working set
>plateau fits, and certainly software footprint increases over time.

Do you have references for that?

I see clear and consistent results from benchmarks on kamailio, asterisk,
apache, linux itself, and various media servers that fitting the working
set of these applications in L3 is essential. The number of simultaneous
calls in asterisk went from 1400 to 9000 just by increasing the L3 size
from 16M to 72M and doubling the L2 cache sizes.

-- mrr
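[A rough back-of-the-envelope reading of the asterisk numbers above; the arithmetic and the per-call-footprint interpretation are mine, not Morten's, and assume the call count is limited by how many per-call working sets fit in L3.]

MB = 1024 * 1024

configs = [
    ("16 MB L3", 16 * MB, 1400),   # simultaneous calls, as quoted above
    ("72 MB L3", 72 * MB, 9000),
]

for name, l3_bytes, calls in configs:
    per_call_kb = l3_bytes / calls / 1024
    print(f"{name}: {calls} calls -> ~{per_call_kb:.0f} KB of L3 per call")

# ~12 KB vs ~8 KB of L3 per call: both in the same ballpark, which is at
# least consistent with the claim that each call's working set (plus shared
# code/data) has to stay L3-resident for throughput to scale.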