From: Josh Berkus on 5 Mar 2007 13:46

Tom,

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea.  But at least with large
> shared_buffers it doesn't sound like such a bad idea.

We did discuss a number of formulas for setting buffers with different
clock-sweep numbers, including ones with higher usage_count for indexes and
starting counts of 0 for large seq scans as well as vacuums. However, we had
no way to prove that any of these more complex algorithms would result in
higher performance, so we went with the simplest formula, with the idea of
tinkering with it once we had more data.

So maybe now's the time.

Note, though, that the current algorithm is working very, very well for OLTP
benchmarks, so we'd want to be careful not to gain performance in one area
at the expense of another.

In TPC-E testing, we've been able to increase shared_buffers to 10GB with
beneficial performance effect (numbers posted when I have them), and we even
found that "taking over RAM" with shared_buffers (a la Oracle) gave us
performance equivalent to using the FS cache. (Yes, this means that with a
little I/O management engineering we could contemplate discarding use of the
FS cache for a net performance gain. Maybe for 8.4.)

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
From: "Pavan Deolasee" on 5 Mar 2007 13:46 Tom Lane wrote: > > Nope, Pavan's nailed it: the problem is that after using a buffer, the > seqscan leaves it with usage_count = 1, which means it has to be passed > over once by the clock sweep before it can be re-used. I was misled in > the 32-buffer case because catalog accesses during startup had left the > buffer state pretty confused, so that there was no long stretch before > hitting something available. With a large number of buffers, the > behavior is that the seqscan fills all of shared memory with buffers > having usage_count 1. Once the clock sweep returns to the first of > these buffers, it will have to pass over all of them, reducing all of > their counts to 0, before it returns to the first one and finds it now > usable. Subsequent tries find a buffer immediately, of course, until we > have again filled shared_buffers with usage_count 1 everywhere. So the > problem is not so much the clock sweep overhead as that it's paid in a > very nonuniform fashion: with N buffers you pay O(N) once every N reads > and O(1) the rest of the time. This is no doubt slowing things down > enough to delay that one read, instead of leaving it nicely I/O bound > all the time. Mark, can you detect "hiccups" in the read rate using > your setup? > > Cool. You posted the same analysis before I could hit the "send" button :) I am wondering whether seqscan would set the usage_count to 1 or to a higher value. usage_count is incremented while unpinning the buffer. Even if we use page-at-a-time mode, won't the buffer itself would get pinned/unpinned every time seqscan returns a tuple ? If thats the case, the overhead would be O(BM_MAX_USAGE_COUNT * N) for every N reads. > I seem to recall that we've previously discussed the idea of letting the > clock sweep decrement the usage_count before testing for 0, so that a > buffer could be reused on the first sweep after it was initially used, > but that we rejected it as being a bad idea. But at least with large > shared_buffers it doesn't sound like such a bad idea. > > How about smaller value for BM_MAX_USAGE_COUNT ? > Another issue nearby to this is whether to avoid selecting buffers that > are dirty --- IIRC someone brought that up again recently. Maybe > predecrement for clean buffers, postdecrement for dirty ones would be a > cute compromise. > Can we use a 2-bit counter where the higher bit is set if the buffer is dirty and lower bit is set whenever the buffer is used. The clock-sweep then decrement this counter and chooses a victim with counter value 0. ISTM that we should optimize for large shared buffer pool case, because that would be more common in the coming days. RAM is getting cheaper everyday. Thanks, Pavan ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster
From: Tom Lane on 5 Mar 2007 13:55

"Pavan Deolasee" <pavan(a)enterprisedb.com> writes:
> I am wondering whether a seqscan would set the usage_count to 1 or to a
> higher value. usage_count is incremented while unpinning the buffer. Even
> if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
> every time the seqscan returns a tuple? If that's the case, the overhead
> would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

No, it's only once per page.  There's a good deal of PrivateRefCount
thrashing that goes on while examining the individual tuples, but the
shared state only changes when we leave the page, because we hold the pin
continuously on the current page of a seqscan.  If you don't believe me,
insert some debug printouts for yourself.

> How about a smaller value for BM_MAX_USAGE_COUNT?

This is not relevant to the problem: we are concerned about usage count
1 versus 0, not the other end of the range.

			regards, tom lane
From: Gregory Stark on 5 Mar 2007 14:02

"Tom Lane" <tgl(a)sss.pgh.pa.us> writes:

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea.  But at least with large
> shared_buffers it doesn't sound like such a bad idea.

I seem to recall the classic clock sweep algorithm was to just use a single
bit. Either a buffer was touched recently or it wasn't.

I also vaguely recall a refinement involving keeping a bitmask and shifting
it right each time the clock hand comes around. So you have a precise
history of which recent clock sweeps the buffer was used in and which it
wasn't.

I think the coarseness of not caring how heavily a buffer was used is a key
part of the algorithm. By only caring that it was used at least once during
the sweep, not whether it was used lightly or heavily, it avoids sticking
with incorrect deductions about things like sequential scans, even if
multiple sequential scans pass by. As soon as they stop seeing the buffer it
immediately reacts and discards it.

I would check out my OS book from school but it's on another continent :(

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
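For reference, here is a tiny C sketch of the bit-shifting refinement
described above, usually presented in OS textbooks as the "aging" variant of
the clock algorithm. All names and sizes are illustrative assumptions; this
is not how PostgreSQL's buffer manager is implemented, and for brevity the
sketch scans every frame per sweep rather than stopping at a clock hand.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_FRAMES 8

    static bool    referenced[NUM_FRAMES];  /* the classic single "used" bit  */
    static uint8_t history[NUM_FRAMES];     /* one bit per recent clock sweep */

    /* Called whenever a frame is touched between sweeps. */
    void
    touch(int frame)
    {
        referenced[frame] = true;
    }

    /*
     * One pass of the aging sweep: shift each frame's history right, fold the
     * current referenced bit into the top, and clear it.  The frame with the
     * smallest history value was referenced least recently (and least often)
     * over the last eight sweeps, so it becomes the victim.  The plain
     * single-bit clock is the degenerate case where the history is only that
     * one bit and the hand stops at the first frame whose bit is already clear.
     */
    int
    aging_sweep_pick_victim(void)
    {
        int victim = 0;

        for (int i = 0; i < NUM_FRAMES; i++)
        {
            history[i] = (uint8_t) ((history[i] >> 1) | (referenced[i] ? 0x80 : 0));
            referenced[i] = false;          /* give the frame its second chance */

            if (history[i] < history[victim])
                victim = i;
        }
        return victim;
    }

As with usage_count, the history decays by itself: a buffer that multiple
sequential scans once touched but no longer see loses its high bits within a
few sweeps and becomes eligible for eviction again.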
From: "Simon Riggs" on 5 Mar 2007 14:34
On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
> Tom,
>
> > I seem to recall that we've previously discussed the idea of letting the
> > clock sweep decrement the usage_count before testing for 0, so that a
> > buffer could be reused on the first sweep after it was initially used,
> > but that we rejected it as being a bad idea.  But at least with large
> > shared_buffers it doesn't sound like such a bad idea.

> Note, though, that the current algorithm is working very, very well for
> OLTP benchmarks, so we'd want to be careful not to gain performance in one
> area at the expense of another.

Agreed.

What we should also add to the analysis is that this effect only occurs when
a uniform workload such as SeqScan, VACUUM or COPY is present. When you have
lots of indexed access, the scan workloads don't have as much effect on
cache pollution as we are seeing in these tests.

Itagaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32 buffers.
When we cycle back round, if usage_count==1 then we assume that we can reuse
that buffer. This avoids cache swamping for read and write workloads, and
also avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it with a
hint StrategyHintCyclicBufferReuse() in a similar manner to the hint VACUUM
provides now.

This would maintain the beneficial behaviour for OLTP, while keeping data
within the L2 cache for DSS and bulk workloads.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com
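A minimal sketch of the ring-buffer (cache-looping) idea follows, assuming a
per-backend ring that recycles a buffer only when its usage_count shows no
one else has touched it. All names below are hypothetical (the proposed
StrategyHintCyclicBufferReuse() hint does not exist in the tree), and the
shared clock-sweep fallback is reduced to a toy stub.

    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 32                /* the ~32-buffer ring described above */

    /* Minimal stand-in for a shared buffer descriptor. */
    typedef struct
    {
        int buf_id;
        int usage_count;
    } BufDesc;

    /* Per-backend ring state; all names here are hypothetical. */
    typedef struct
    {
        BufDesc *slots[RING_SIZE];
        int      next;                  /* slot to reuse / fill next */
        bool     active;                /* toggled by the proposed strategy hint */
    } BufferRing;

    /* Toy stub standing in for the existing shared clock sweep. */
    static BufDesc *
    clock_sweep_get_buffer(void)
    {
        static BufDesc pool[1024];
        static int     hand = 0;

        BufDesc *buf = &pool[hand];

        buf->buf_id = hand;
        buf->usage_count = 1;
        hand = (hand + 1) % 1024;
        return buf;
    }

    /*
     * When the ring is active, try to recycle the buffer we used RING_SIZE
     * pages ago.  If its usage_count is still 1 (only our own access), reuse
     * it; otherwise someone else wants it, so leave it in the shared pool and
     * fall back to the ordinary clock sweep for a fresh buffer.
     */
    BufDesc *
    ring_get_buffer(BufferRing *ring)
    {
        BufDesc *buf;

        if (!ring->active)
            return clock_sweep_get_buffer();

        buf = ring->slots[ring->next];
        if (buf == NULL || buf->usage_count != 1)
            buf = clock_sweep_get_buffer();   /* slot empty, or another
                                               * backend is using that buffer */

        ring->slots[ring->next] = buf;
        ring->next = (ring->next + 1) % RING_SIZE;
        return buf;
    }

The intended effect is that a bulk scan keeps cycling through the same ~32
buffers, small enough to stay in the L2 cache, instead of flushing the whole
shared pool, while any buffer other backends have started using simply drops
out of the ring and is replaced via the normal policy.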