From: nmm1 on 6 Apr 2010 03:50
In article <fFzun.305575$OX4.101427(a)newsfe25.iad>, EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>Anne & Lynn Wheeler wrote:
>>
>> original cp67 just did single page transfer per i/o ... so 2301 would
>> saturate at about 80 page/sec. i redid several things in cp67
>> ... including chaining multiple page transfers in single i/o
>> ... chaining in rotational order. this resulted in still half rotational
>> delay per i/o ... but tended to be amortized over several page
>> transfers. this resulted in being able to drive 2301 up to nearly 300
>> page transfers per second (each i/o took longer ... but the queue delay
>> was significantly reduced under heavy load ... since it had almost four
>> times the peak thruput).
>
>I vaguely recall someone telling me that 370 VM had a page file
>defragger process that would coalesce pages for long running
>processes so they were contiguous in the page file,
>to facilitate multi page read ahead without multiple seeks.
>
>Does any of that sound familiar to you?

Yes. MVS did that, too.

Regards,
Nick Maclaren.
From: Anne & Lynn Wheeler on 6 Apr 2010 09:07
EricP <ThatWouldBeTelling(a)thevillage.com> writes:
> I vaguely recall someone telling me that 370 VM had a page file
> defragger process that would coalesce pages for long running
> processes so they were contiguous in the page file,
> to facilitate multi page read ahead without multiple seeks.
>
> Does any of that sound familiar to you?

"big pages" ... discussion/description in this recent post (done for both mvs & vm in the 80s)
http://www.garlic.com/~lynn/2010g.html#23 16:32 far pointers in OpenWatcom C/C++
http://www.garlic.com/~lynn/2010g.html#42 Interesting presentation

basically attempting to tailor paging to the operational characteristics of the 3380 disk drive ... part of it was analogous to log-structured filesystems ... sort of a moving cursor across the drive, with writes going to the empty track closest to the current cursor position. one or more drives ... and the available (big page 3380) page space was kept possibly ten times larger than expected use ... so that the leading edge of the cursor was as empty as possible.

re:
http://www.garlic.com/~lynn/2010g.html#71 Interesting presentation

as to the 2301 ... it was formatted 9 4k pages per pair of tracks ... with page "5" spanning the end of one track and the start of the next. at 60 revs/sec ... peak sustained was actually 270 page transfers/sec.

some regression analysis associated 1.5 mills of cpu per page fault/read ... that is the fault, page replacement algorithm, some fraction of a page write (when the selected replaced page was changed and had to be written), marking the current task non-executable, task switch, later i/o interrupt processing, and the switch back to the faulting task. somewhere between 500-700 instructions. cp67 avg. behavior had a ratio of 3 reads per write (so 1/3rd of the page write overhead was attributed to each read).

in the initial simplification morph from cp67 to vm370 ... a lot of stuff i had done for cp67 as an undergraduate was dropped (some amount of paging, paging algorithms, dispatching algorithms, "fastpath" for commonly executed instruction paths, etc). some of the fastpath stuff leaked back in late in vm370 release 1. there is this reference to doing a major migration of the remaining cp67 changes to (internal) vm370 ... and doing internal product distribution as "csc/vm"
http://www.garlic.com/~lynn/2006v.html#email731212
http://www.garlic.com/~lynn/2006w.html#email750102
http://www.garlic.com/~lynn/2006w.html#email750430

a small subset of the virtual memory management changes leaked back into the product for vm370 release 3. i was then asked to do a separately charged kernel product for some of the remaining changes ... resource management algorithms, some newer paging stuff, etc. some of the newer paging stuff included a background sweeper task that would pull low-active pages off the drum and move them to disk (increasing the effective use of the relatively small capacity, faster paging device).

the big pages stuff somewhat traded off i/o capacity and real storage ... attempting to have the 3380 approximate a lower latency fixed-head device (attempting to minimize the penalty of disk arm motion ... partially by forcing the overhead to always be amortized over a ten page transfer). some of the "big page" work was motivated by no longer having a "corporate" paging device product ...

some recent discussion and old email
http://www.garlic.com/~lynn/2010g.html#11 Mainframe Executive article on the death of tape
http://www.garlic.com/~lynn/2010g.html#22 Mainframe Executive article on the death of tape
http://www.garlic.com/~lynn/2010g.html#55 Mainframe Executive article on the death of tape

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
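[A quick back-of-the-envelope sketch of the 2301 arithmetic quoted above. The drum geometry (9 4k pages per track pair, 60 revs/sec) comes from the post; the chaining model below is my own simplification, not the original cp67 code.]

REVS_PER_SEC = 60
PAGES_PER_TRACK_PAIR = 9          # page "5" spans the track boundary
ROTATION_MS = 1000 / REVS_PER_SEC # ~16.7 ms per revolution

# 9 pages per two revolutions -> the 270 page/sec peak quoted above
peak = (PAGES_PER_TRACK_PAIR / 2) * REVS_PER_SEC
print(f"peak sustained transfer: {peak:.0f} pages/sec")

def pages_per_sec(chained_pages):
    """Throughput when each i/o pays half a rotation of latency plus
    the transfer time for `chained_pages` pages."""
    latency_ms = ROTATION_MS / 2  # average rotational delay per i/o
    transfer_ms = chained_pages * ROTATION_MS * 2 / PAGES_PER_TRACK_PAIR
    return chained_pages / ((latency_ms + transfer_ms) / 1000)

for n in (1, 9, 27):
    print(f"{n:2d} pages chained per i/o: {pages_per_sec(n):3.0f} pages/sec")
# single-page i/o saturates near the ~80/sec the post mentions;
# longer chains amortize the half rotation and approach the 270 peak.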
From: Robert Myers on 6 Apr 2010 14:18
On Apr 5, 6:03 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net> wrote:
> If you are a hardware company paging is attractive. Sure, maybe you want
> to talk about hints to promote more efficient SW/OS use.

I came across this document

http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

Table 2: Data Source Latency

  L3 CACHE hit, line unshared                 ~40 cycles
  L3 CACHE hit, shared line in another core   ~65 cycles
  L3 CACHE hit, modified in another core      ~75 cycles
  remote L3 CACHE                             ~100-300 cycles
  Local Dram                                  ~60 ns
  Remote Dram                                 ~100 ns

L1 and L2 cache latencies are apparently 4 cycles and 10 cycles, respectively.

The first thing I notice is that there is a cache-memory continuum (L1, L2, L3, local dram, nonlocal dram), with a geometric progression of latency where the latency is multiplied by a factor of 2-4 with each step down the hierarchy.

In order to get a modern working set into some level of cache, we have to be talking about L3 cache or perhaps an L4 cache. We could

1. Make L3 even larger. To do that, you have to be Intel.

2. Create an off-chip L4 that is faster than local dram. You don't have to be Intel to do that.

3. Conceivably, you could use L4 the way IBM sometimes does, which is to cache memory requests to nonlocal dram in local dram. I don't think you have to be Intel to do that.

The payoff for caching nonlocal dram storage seems like it could be dramatic, as could the payoff for increasing the size of L3 and implementing a paging scheme. I suppose that any improvement in latency is important, but I wonder how dramatic the gains from creating an off-chip L4 that is faster than local dram could possibly be, compared to local dram access.

Robert.
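[To make the "factor of 2-4 per level" observation concrete, here is a small sketch converting the quoted figures to a common unit. The ~3 GHz clock is my assumption and is not in the post; the cycle counts and ns numbers are the ones in the table above.]

CLOCK_GHZ = 3.0  # assumed core clock: cycles / GHz = nanoseconds

latency_ns = [
    ("L1 hit",            4 / CLOCK_GHZ),
    ("L2 hit",           10 / CLOCK_GHZ),
    ("L3 hit, unshared", 40 / CLOCK_GHZ),
    ("local DRAM",       60.0),
    ("remote DRAM",     100.0),
]

prev = None
for level, ns in latency_ns:
    ratio = f" ({ns / prev:.1f}x previous)" if prev else ""
    print(f"{level:18s} ~{ns:6.1f} ns{ratio}")
    prev = ns
# prints roughly 1.3, 3.3, 13.3, 60, 100 ns: about 2.5x-4.5x per step
# until the last one, which is why an L4 between L3 and local DRAM has
# relatively little latency headroom to exploit.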
From: Anne & Lynn Wheeler on 6 Apr 2010 15:50
re:
http://www.garlic.com/~lynn/2010g.html#72 Interesting presentation

w/o bigpage ... pages retained a "home" location on disk after being brought into storage (if a page was later selected for replacement and hadn't been changed ... it avoided the disk write since the home location hadn't gone stale).

with bigpage, there was no longer a home location on disk ... whenever a full-track bigpage was read into storage ... the corresponding disk track was released. subsequent replacement would require a write (since there was no copy retained on disk). this effectively drove up writes ... effectively a write for every replaced page ... to make room for a read (just another example of trading off transfer capacity as part of optimizing the 3380 moveable arm bottleneck for a paging device).
http://www.garlic.com/~lynn/2010g.html#23 16:32 far pointers in OpenWatcom C/C++
http://www.garlic.com/~lynn/2010g.html#42 Interesting presentation
http://www.garlic.com/~lynn/2010g.html#72 Interesting presentation

this is somewhat a variation on the dup/no-dup trade-off (i.e. maintaining a duplicate copy on disk when the page is resident in processor memory). this shows up when preferred paging storage becomes relatively small compared to real storage size (and/or total allocated virtual memory). it shows up more recently when some machine configurations had gbyte real-storage and relatively small multi-gbyte disks ... where space for paging was limited ... a trade-off between total virtual pages being limited by available space on disk (duplicate) ... or by available space on disk plus real storage (no-duplicate). a 2gbyte paging space and 1gbyte real storage ... might be able to handle 3gbyte of total virtual pages with a no-dup strategy.

i had also added some code to vm370 (besides the fixed-head paging device cleaner/sweeper) that would switch from a duplicate strategy to a no-duplicate strategy when fixed-head paging device space became especially constrained (do sweep/clean ... and then start non-duplicate for the fixed-head paging device ... but typically retain duplicate for the much larger paging areas on regular disks).

misc. past posts mentioning dup/no-dup strategies:
http://www.garlic.com/~lynn/93.html#13 managing large amounts of vm
http://www.garlic.com/~lynn/2000d.html#13 4341 was "Is a VAX a mainframe?"
http://www.garlic.com/~lynn/2001i.html#42 Question re: Size of Swap File
http://www.garlic.com/~lynn/2001l.html#55 mainframe question
http://www.garlic.com/~lynn/2002b.html#10 hollow files in unix filesystems?
http://www.garlic.com/~lynn/2002b.html#20 index searching
http://www.garlic.com/~lynn/2002e.html#11 What are some impressive page rates?
http://www.garlic.com/~lynn/2002f.html#20 Blade architectures
http://www.garlic.com/~lynn/2002f.html#26 Blade architectures
http://www.garlic.com/~lynn/2003f.html#5 Alpha performance, why?
http://www.garlic.com/~lynn/2004g.html#17 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004g.html#18 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004g.html#20 Infiniband - practicalities for small clusters
http://www.garlic.com/~lynn/2004h.html#19 fast check for binary zeroes in memory
http://www.garlic.com/~lynn/2004i.html#1 Hard disk architecture: are outer cylinders still faster than inner cylinders?
http://www.garlic.com/~lynn/2005c.html#27 [Lit.] Buffer overruns
http://www.garlic.com/~lynn/2005m.html#28 IBM's mini computers--lack thereof
http://www.garlic.com/~lynn/2006c.html#8 IBM 610 workstation computer
http://www.garlic.com/~lynn/2006f.html#18 how much swap size did you take?
http://www.garlic.com/~lynn/2007c.html#0 old discussion of disk controller chache
http://www.garlic.com/~lynn/2008f.html#19 Fantasy-Land_Hierarchal_NUMA_Memory-Model_on_Vertical
http://www.garlic.com/~lynn/2008k.html#80 How to calculate effective page fault service time?

--
42yrs virtualization experience (since Jan68), online at home since Mar1970
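[A toy illustration of the dup vs no-dup capacity trade-off described above, using the 1 gbyte real storage / 2 gbyte paging space example from the post; the code is mine, not anything from vm370.]

real_storage_gb = 1
paging_space_gb = 2

# duplicate strategy: every virtual page keeps its "home" slot on disk even
# while resident in real storage, so disk space bounds total virtual pages
dup_capacity_gb = paging_space_gb

# no-duplicate strategy: a page lives either in real storage or on disk,
# never both, so the two pools add up
nodup_capacity_gb = paging_space_gb + real_storage_gb

print(f"duplicate strategy   : ~{dup_capacity_gb} GB of virtual pages")    # 2
print(f"no-duplicate strategy: ~{nodup_capacity_gb} GB of virtual pages")  # 3

# the cost of no-dup (and of full-track big pages that release the home
# track on read): a page selected for replacement must be written back even
# if unchanged, so writes rise toward one write per replaced page.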
From: Morten Reistad on 7 Apr 2010 03:13
In article <4BBC0C0C.8010803(a)patten-glew.net>, Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote:
>On 4/6/2010 11:18 AM, Robert Myers wrote:
>> The first thing I notice is that there is a cache-memory continuum
>> (L1, L2, L3, local dram, nonlocal dram), with a geometric progression
>> of latency where the latency is multiplied by a factor of 2-4 with
>> each step down the hierarchy.
>>
>> In order to get a modern working set into some level of cache, we have
>> to be talking about L3 cache or perhaps an L4 cache.
>>
>> We could
>>
>> 1. Make L3 even larger. To do that, you have to be Intel.
>>
>> 2. Create an off-chip L4 that is faster than local dram. You don't
>> have to be Intel to do that.

The Hypertransport interconnects between CPU L3 caches make the other
chip's L3 caches an "L4". Making a "cache-only" L3 cache chip and
interconnecting it with Hypertransport would do this. But you would have
to be Intel or AMD to do it.

>Note, however, that building an L3 cache is now more complicated, since
>there is no longer an FSB. Instead, the processor has an integrated
>memory controller and DRAM channels, and QPI.
>
>You *could* just mark all memory as remote, and just use QPI, with a
>cache on QPI. Leaving the DRAM memory channels unused.
>
>However, that wastes a lot of chip pin bandwidth.
>
>It would be a fun challenge to create a cache that was attached to the
>DRAM channels. If uniprocessor, that might be sufficient. If
>multiprocessor, you would want to connect to both the DRAM channels and
>QPI. Simplest would be to use QPI for coherency traffic - invalidations -
>while the memory channels would be used for fills and evictions. Or you
>could spread memory across both.
>
>> The payoff for caching nonlocal dram storage seems like it could be
>> dramatic. The payoff for increasing the size of L3 and implementing a
>> paging scheme could be dramatic. I suppose that any improvement in
>> latency is important, but I wonder how dramatic the gains from
>> creating an off-chip L4 that is faster than local dram could possibly
>> be compared to local dram access.
>
>These are empirical questions.
>
>IIRC the payoff for increasing L3 cache much beyond the 2M/core (that we
>see now) to 16M per core is small. Well into diminishing returns. But
>there will certainly be apps for which incrementally larger cache helps
>right away, there will certainly be a point at which a new working set
>plateau fits, and certainly software footprint increases over time.

Do you have references for that?

I see clear and consistent results from benchmarks on kamailio, asterisk,
apache, linux itself, and various media servers that fitting the working
set of these applications in L3 is essential. The number of simultaneous
calls in asterisk went from 1400 to 9000 just by increasing the L3 size
from 16M to 72M and doubling the L2 cache sizes.

-- mrr
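[A rough back-of-the-envelope reading of the asterisk numbers above; the arithmetic and the per-call-footprint interpretation are mine, not Morten's, and assume the call count is limited by how many per-call working sets fit in L3.]

MB = 1024 * 1024

configs = [
    ("16 MB L3", 16 * MB, 1400),   # simultaneous calls, as quoted above
    ("72 MB L3", 72 * MB, 9000),
]

for name, l3_bytes, calls in configs:
    per_call_kb = l3_bytes / calls / 1024
    print(f"{name}: {calls} calls -> ~{per_call_kb:.0f} KB of L3 per call")

# ~12 KB vs ~8 KB of L3 per call: both in the same ballpark, which is at
# least consistent with the claim that each call's working set (plus shared
# code/data) has to stay L3-resident for throughput to scale.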