From: Jim Nasby on 5 Mar 2007 23:11

On Mar 5, 2007, at 11:46 AM, Josh Berkus wrote:
> Tom,
>
>> I seem to recall that we've previously discussed the idea of letting the
>> clock sweep decrement the usage_count before testing for 0, so that a
>> buffer could be reused on the first sweep after it was initially used,
>> but that we rejected it as being a bad idea. But at least with large
>> shared_buffers it doesn't sound like such a bad idea.
>
> We did discuss a number of formulas for setting buffers with different
> clock-sweep numbers, including ones with higher usage_count for indexes
> and starting numbers of 0 for large seq scans as well as vacuums. However,
> we didn't have any way to prove that any of these complex algorithms would
> result in higher performance, so we went with the simplest formula, with
> the idea of tinkering with it when we had more data. So maybe now's the
> time.
>
> Note, though, that the current algorithm is working very, very well for
> OLTP benchmarks, so we'd want to be careful not to gain performance in one
> area at the expense of another. In TPC-E testing, we've been able to
> increase shared_buffers to 10GB with beneficial performance effect
> (numbers posted when I have them), and we even found that "taking over
> RAM" with the shared_buffers (a la Oracle) gave us performance equivalent
> to using the FS cache. (Yes, this means that with a little I/O management
> engineering we could contemplate discarding use of the FS cache for a net
> performance gain. Maybe for 8.4.)

An idea I've been thinking about would be to have the bgwriter or some
other background process actually try to keep the free list populated, so
that backends needing to grab a page would be much more likely to find one
there (and not have to wait to scan through the entire buffer pool, perhaps
multiple times).
My thought is to keep track of how many page requests occurred during a
given interval, and use that value (probably averaged over time) to
determine how many pages we'd like to see on the free list. The background
process would then run through the buffers decrementing usage counts until
it found enough for the free list. Before putting a buffer on the free
list, it would write the buffer out; I'm not sure whether it would make
sense to de-associate the buffer from whatever it had been storing, though.
If we don't do that, we could pull pages back off the free list if we
wanted to. That would be helpful if the background process got a bit
over-zealous.
--
Jim Nasby                                            jim(a)nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
From: Jim Nasby on 5 Mar 2007 23:02

On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a variable
> like that at all, but to scan the buffer cache for pages belonging to the
> table you're scanning to initialize the scan. Scanning all the
> BufferDescs is a fairly CPU- and lock-heavy operation, but it might be OK
> given that we're talking about large I/O-bound sequential scans. It would
> require no DBA tuning and would work more robustly in varying conditions.
> I'm not sure where you would continue after scanning the in-cache pages.
> At the highest in-cache block number, perhaps.

If there were some way to do that, it'd be what I'd vote for. Given the
partitioning of the buffer lock that Tom did, it might not be that
horrible for many cases, either, since you'd only need to scan through one
partition at a time. We don't need an exact count, either. Perhaps there's
some way we could keep a counter or something...
--
Jim Nasby                                            jim(a)nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
From: Tom Lane on 6 Mar 2007 02:17

Jim Nasby <decibel(a)decibel.org> writes:
> An idea I've been thinking about would be to have the bgwriter or some
> other background process actually try to keep the free list populated,

The bgwriter already tries to keep pages "just in front" of the clock
sweep pointer clean.

			regards, tom lane
From: "Simon Riggs" on 6 Mar 2007 03:14

On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote:
> Simon Riggs wrote:
> But it would break the idea of letting a second seqscan follow in the
> first's hot cache trail, no?

No, but it would make it somewhat harder to achieve without direct
synchronization between the scans. It could still work; let's see.

I'm not sure that's an argument against fixing the problem with the buffer
strategy, though. We really want both, not just one or the other.
--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
From: Sherry Moore on 6 Mar 2007 00:34
Hi Tom,

Sorry about the delay. I have been away from computers all day.

In the current Solaris release in development (code name Nevada, available
for download at http://opensolaris.org), I have implemented non-temporal
access (NTA), which bypasses L2 for most writes, and for reads larger than
copyout_max_cached (patchable, defaults to 128K). The block size used by
Postgres is 8KB. If I patch copyout_max_cached to 4KB to trigger NTA for
reads, the access times with a 16KB buffer and a 128MB buffer are very
close.

I wrote readtest to simulate the access pattern of VACUUM (attached).
tread is a 4-socket dual-core Opteron box.

<81 tread >./readtest -h
Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
        -v: Verbose mode
        -N: Normalize results by number of reads
        -s <size>: Working set size (may specify K,M,G suffix)
        -n iter: Number of test iterations
        -f filename: Name of the file to read from
        -d [+|-]delta: Distance between subsequent reads
        -c count: Number of reads
        -h: Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

<82 tread >./readtest -s 16k -f boot_archive
46445262
<83 tread >./readtest -s 128M -f boot_archive
118294230
<84 tread >./readtest -s 16k -f boot_archive -n 100
4230210856
<85 tread >./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

<89 tread >./readtest -s 16k -f boot_archive
43606882
<90 tread >./readtest -s 128M -f boot_archive
100547909
<91 tread >./readtest -s 16k -f boot_archive -n 100
4251823995
<92 tread >./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration count is 1 (the default), the timing difference between
the 16K buffer and the 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses. When the
iteration count is bigger, most of the page tables are in the Page
Descriptor Cache for the later page accesses, so the overhead of TLB
misses becomes smaller.
As you can see, when we do bypass L2, the performance with either buffer
size is comparable.

I am sure your next question is why the 128K limit for reads. Here are the
main reasons:

- Based on many of the benchmarks and workloads I traced, the target
  buffer of a read operation is typically accessed again shortly after the
  read, while write targets usually are not. Therefore, the default mode
  of operation is to bypass L2 for writes, but not for reads.

- The Opteron's L1 cache size is 64K. If a read is larger than 128K, it
  would displacement-flush L1 anyway, so for large reads I also bypass L2.

I am working on setting copyout_max_cached dynamically based on the L1
D-cache size of the system.

The above heuristic should have worked well in Luke's test case. However,
because the read was done as 16,000 8K reads rather than one 128MB read,
the NTA code was not triggered. Since the OS code has to be general enough
to handle most workloads, we have to pick defaults that might not work
best for some specific operations. It is a calculated balance.

Thanks,
Sherry

On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> "Luke Lonergan" <LLonergan(a)greenplum.com> writes:
> > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > wrote it).
>
> Cool. Maybe Sherry can comment on the question whether it's possible for
> a large-scale memcpy to not take a hit on filling a cache line that
> wasn't previously in cache?
>
> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler, which is something I've never studied :-(.
>
> 			regards, tom lane
--
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym