From: Josh Berkus on
Tom,

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea.  But at least with large
> shared_buffers it doesn't sound like such a bad idea.

We did discuss a number of formulas for setting buffers with different
clock-sweep numbers, including ones with a higher usage_count for indexes and
a starting count of 0 for large seq scans as well as vacuums. However, we
didn't have any way to prove that any of these more complex algorithms would
result in higher performance, so we went with the simplest formula, with the
idea of tinkering with it once we had more data. So maybe now's the time.

Note, though, that the current algorithm is working very, very well for OLTP
benchmarks, so we'd want to be careful not to gain performance in one area at
the expense of another. In TPC-E testing, we've been able to increase
shared_buffers to 10GB with a beneficial performance effect (numbers posted
when I have them), and we even found that "taking over RAM" with
shared_buffers (à la Oracle) gave us performance equivalent to using the FS
cache. (Yes, this means that with a little I/O management engineering we could
contemplate discarding use of the FS cache for a net performance gain. Maybe
for 8.4.)

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

From: "Pavan Deolasee" on
Tom Lane wrote:
>
> Nope, Pavan's nailed it: the problem is that after using a buffer, the
> seqscan leaves it with usage_count = 1, which means it has to be passed
> over once by the clock sweep before it can be re-used. I was misled in
> the 32-buffer case because catalog accesses during startup had left the
> buffer state pretty confused, so that there was no long stretch before
> hitting something available. With a large number of buffers, the
> behavior is that the seqscan fills all of shared memory with buffers
> having usage_count 1. Once the clock sweep returns to the first of
> these buffers, it will have to pass over all of them, reducing all of
> their counts to 0, before it returns to the first one and finds it now
> usable. Subsequent tries find a buffer immediately, of course, until we
> have again filled shared_buffers with usage_count 1 everywhere. So the
> problem is not so much the clock sweep overhead as that it's paid in a
> very nonuniform fashion: with N buffers you pay O(N) once every N reads
> and O(1) the rest of the time. This is no doubt slowing things down
> enough to delay that one read, instead of leaving it nicely I/O bound
> all the time. Mark, can you detect "hiccups" in the read rate using
> your setup?
>
>

Cool. You posted the same analysis before I could hit the "send" button :)

I am wondering whether a seqscan would set the usage_count to 1 or to a
higher value. usage_count is incremented while unpinning the buffer. Even
if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
every time the seqscan returns a tuple? If that's the case, the overhead
would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea. But at least with large
> shared_buffers it doesn't sound like such a bad idea.
>
>
How about a smaller value for BM_MAX_USAGE_COUNT?

> Another issue nearby to this is whether to avoid selecting buffers that
> are dirty --- IIRC someone brought that up again recently. Maybe
> predecrement for clean buffers, postdecrement for dirty ones would be a
> cute compromise.
>
Can we use a 2-bit counter where the higher bit is set if the buffer is
dirty and the lower bit is set whenever the buffer is used? The clock sweep
then decrements this counter and chooses a victim with a counter value of 0.
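
To make that concrete, here is a toy, self-contained sketch (not the real
bufmgr.c code; all names are made up for illustration) that combines such a
2-bit counter with the predecrement-for-clean / postdecrement-for-dirty
compromise quoted above:

/*
 * Toy sketch only: a clock sweep over a 2-bit per-buffer counter, where
 * the high bit tracks "dirty" and the low bit tracks "recently used".
 * Clean buffers are decremented before the zero test, dirty ones after.
 */
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
    unsigned char count;        /* bit 1 = dirty, bit 0 = recently used */
} FakeBufferDesc;

static FakeBufferDesc buffers[NBUFFERS];
static int  clock_hand = 0;

/* Mark a buffer as used (and optionally dirtied), as an unpin would. */
static void
touch_buffer(int buf, int dirty)
{
    buffers[buf].count |= 0x1;
    if (dirty)
        buffers[buf].count |= 0x2;
}

/* Run the clock sweep until a victim with count == 0 is found. */
static int
get_victim(void)
{
    for (;;)
    {
        FakeBufferDesc *buf = &buffers[clock_hand];
        int         victim = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;

        if ((buf->count & 0x2) == 0)
        {
            /* clean: decrement before the zero test (pre-decrement), so a
             * buffer used once is reusable on the very next sweep */
            if (buf->count > 0)
                buf->count--;
            if (buf->count == 0)
                return victim;
        }
        else
        {
            /* dirty: just decrement; with the dirty bit set the count is
             * at least 2, so the buffer survives this sweep (the
             * post-decrement behaviour), giving the bgwriter a chance
             * to clean it first */
            buf->count--;
        }
    }
}

int
main(void)
{
    int         i;

    /* simulate a seqscan that touched every buffer exactly once */
    for (i = 0; i < NBUFFERS; i++)
        touch_buffer(i, 0);

    /* with pre-decrement the very next sweep finds a victim at once */
    printf("victim = %d\n", get_victim());
    return 0;
}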
ISTM that we should optimize for the large shared buffer pool case,
because that will become more and more common. RAM is getting cheaper
every day.

Thanks,
Pavan

From: Tom Lane on
"Pavan Deolasee" <pavan(a)enterprisedb.com> writes:
> I am wondering whether a seqscan would set the usage_count to 1 or to a
> higher value. usage_count is incremented while unpinning the buffer. Even
> if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
> every time the seqscan returns a tuple? If that's the case, the overhead
> would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

No, it's only once per page. There's a good deal of PrivateRefCount
thrashing that goes on while examining the individual tuples, but the
shared state only changes when we leave the page, because we hold a pin
continuously on the current page of a seqscan. If you don't believe
me, insert some debug printouts for yourself.

> How about smaller value for BM_MAX_USAGE_COUNT ?

This is not relevant to the problem: we are concerned about usage count
1 versus 0, not the other end of the range.

regards, tom lane

From: Gregory Stark on

"Tom Lane" <tgl(a)sss.pgh.pa.us> writes:

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea. But at least with large
> shared_buffers it doesn't sound like such a bad idea.

I seem to recall the classic clock sweep algorithm was to just use a single
bit. Either a buffer was touched recently or it wasn't.

I also vaguely recall a refinement involving keeping a bitmask and shifting it
right each time the clock hand comes around, so that you have a precise history
of which recent clock sweeps the buffer was used during and which it wasn't.
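
If memory serves, it looked roughly like this (just a toy sketch of that
aging idea, with invented names; don't trust the details):

/*
 * Toy sketch of the bitmask refinement: each buffer keeps an 8-bit
 * history; every time the clock hand passes, the history is shifted
 * right and the "used since last sweep" bit is ORed into the top.
 * A buffer whose history has decayed to 0 is a replacement candidate.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint8_t     history;        /* one bit per recent clock sweep */
    int         used;           /* touched since the hand last passed? */
} AgedBuffer;

static void
sweep_one(AgedBuffer *buf)
{
    buf->history = (uint8_t) ((buf->history >> 1) | (buf->used ? 0x80 : 0));
    buf->used = 0;
}

int
main(void)
{
    AgedBuffer  buf = {0, 0};
    int         sweep;

    buf.used = 1;               /* touched once, e.g. by a seqscan */
    for (sweep = 1; sweep <= 4; sweep++)
    {
        sweep_one(&buf);
        printf("after sweep %d: history = 0x%02x\n", sweep, buf.history);
    }
    return 0;
}

A buffer touched just once decays 0x80, 0x40, ... toward 0 over successive
sweeps, so you keep the per-sweep history without the buffer being stuck for
long.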

I think the coarseness of not caring how heavily it was used is a key part of
the algorithm. By not caring whether a buffer was lightly or heavily used
during the clock sweep, just that it was used at least once, it avoids sticking
with incorrect deductions about things like sequential scans, even when
multiple sequential scans pass by. As soon as they stop touching the buffer it
immediately reacts and discards it.

I would check out my OS book from school but it's on another continent :(

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From: "Simon Riggs" on
On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
> Tom,
>
> > I seem to recall that we've previously discussed the idea of letting the
> > clock sweep decrement the usage_count before testing for 0, so that a
> > buffer could be reused on the first sweep after it was initially used,
> > but that we rejected it as being a bad idea. But at least with large
> > shared_buffers it doesn't sound like such a bad idea.

> Note, though, that the current algorithm is working very, very well for OLTP
> benchmarks, so we'd want to be careful not to gain performance in one area at
> the expense of another.

Agreed.

What we should also add to the analysis is that this effect only occurs
when a uniform workload such as SeqScan, VACUUM or COPY is running on its
own. When you have lots of indexed access, the scan workloads don't cause
as much cache pollution as we are seeing in these tests.

Itagaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32
buffers. When we cycle back round, if usage_count==1 then we assume that
we can reuse that buffer. This avoids cache swamping for read and write
workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it
with a hint StrategyHintCyclicBufferReuse() in a similar manner to the
hint VACUUM provides now.
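
To make the idea concrete, here's a toy, self-contained sketch (this is not
the real strategy code; the names and the fallback path are invented for
illustration):

/*
 * Toy sketch of cache-looping: a backend doing a bulk scan remembers
 * the last ~32 buffers it used in a private ring and recycles the
 * oldest one, unless somebody else has touched it since, in which
 * case it falls back to the normal clock sweep.
 */
#include <stdio.h>

#define NBUFFERS    1024        /* pretend shared_buffers */
#define RING_SIZE   32          /* size of the private reuse ring */

static int  usage_count[NBUFFERS];      /* stand-in for the buffer headers */
static int  ring[RING_SIZE];
static int  ring_next = 0;
static int  sweep_hand = 0;

/* Stand-in for the normal clock-sweep allocation. */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        int         buf = sweep_hand;

        sweep_hand = (sweep_hand + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;
        usage_count[buf]--;
    }
}

/* Allocate a buffer for a bulk scan, preferring to reuse our own ring. */
static int
ring_buffer_alloc(void)
{
    int         buf = ring[ring_next];

    if (buf < 0 || usage_count[buf] > 1)
    {
        /* slot empty, or someone else touched the buffer: normal victim */
        buf = clock_sweep_victim();
        ring[ring_next] = buf;
    }
    /* else: usage_count is still 1, nobody used it since our last pass,
     * so we reuse it in place */

    usage_count[buf] = 1;       /* as the eventual unpin would leave it */
    ring_next = (ring_next + 1) % RING_SIZE;
    return buf;
}

int
main(void)
{
    int         i;

    for (i = 0; i < RING_SIZE; i++)
        ring[i] = -1;

    /* a "seqscan" of 256 pages stays inside its own 32-buffer ring */
    for (i = 0; i < 256; i++)
        printf("page %3d -> buffer %d\n", i, ring_buffer_alloc());
    return 0;
}

Running it shows pages 32 onward landing back on buffers 0-31 rather than
marching through the whole pool.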

This would maintain the beneficial behaviour for OLTP, while keeping
data within the L2 cache for DSS and bulk workloads.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com