From: Tom Lane on
"Simon Riggs" <simon(a)2ndquadrant.com> writes:
> Itagaki-san and I were discussing in January the idea of cache-looping,
> whereby a process begins to reuse its own buffers in a ring of ~32
> buffers. When we cycle back round, if usage_count==1 then we assume that
> we can reuse that buffer. This avoids cache swamping for read and write
> workloads, plus avoids too-frequent WAL writing for VACUUM.

> This would maintain the beneficial behaviour for OLTP,

Justify that claim. It sounds to me like this would act very nearly the
same as having shared_buffers == 32 ...

regards, tom lane


From: "Luke Lonergan" on
This sounds like a good idea.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Simon Riggs [mailto:simon(a)2ndquadrant.com]
Sent: Monday, March 05, 2007 02:37 PM Eastern Standard Time
To: Josh Berkus; Tom Lane; Pavan Deolasee; Mark Kirkwood; Gavin Sherry; Luke Lonergan; PGSQL Hackers; Doug Rady; Sherry Moore
Cc: pgsql-hackers(a)postgresql.org
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
> Tom,
>
> > I seem to recall that we've previously discussed the idea of letting the
> > clock sweep decrement the usage_count before testing for 0, so that a
> > buffer could be reused on the first sweep after it was initially used,
> > but that we rejected it as being a bad idea. But at least with large
> > shared_buffers it doesn't sound like such a bad idea.

> Note, though, that the current algorithm is working very, very well for OLTP
> benchmarks, so we'd want to be careful not to gain performance in one area at
> the expense of another.

Agreed.

What we should also add to the analysis is that this effect only occurs
when the workload consists purely of uniform access patterns, such as
SeqScan, VACUUM or COPY. When there is a lot of indexed access, the scan
workloads don't cause as much cache pollution as we are seeing in these
tests.

Itagaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32
buffers. When we cycle back round, if usage_count==1 then we assume that
we can reuse that buffer. This avoids cache swamping for read and write
workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it
with a hint StrategyHintCyclicBufferReuse() in a similar manner to the
hint VACUUM provides now.

This would maintain the beneficial behaviour for OLTP, while keeping
data within the L2 cache for DSS and bulk workloads.
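
To illustrate, here is a very rough sketch of what the hint might look
like. None of these names exist in the code today; every identifier is a
placeholder, not a proposal for the actual interface:

    /*
     * Sketch only: a per-backend ring of ~32 buffers that bulk scans cycle
     * through, instead of spreading themselves over all of shared_buffers.
     * Every identifier here is hypothetical.
     */
    #include <stdbool.h>

    #define CYCLIC_RING_SIZE 32

    static bool cyclic_reuse_enabled = false;
    static int  ring_buffers[CYCLIC_RING_SIZE];   /* buffer ids this backend last used */
    static int  ring_next = 0;                    /* next ring slot to recycle */

    /*
     * Turn cyclic reuse on or off, much as VACUUM sets its strategy hint
     * today.  SeqScan, VACUUM and COPY would call this with true; index
     * scans and ordinary inserts never would.
     */
    void
    StrategyHintCyclicBufferReuse(bool enable)
    {
        int i;

        cyclic_reuse_enabled = enable;
        if (enable)
        {
            ring_next = 0;
            for (i = 0; i < CYCLIC_RING_SIZE; i++)
                ring_buffers[i] = -1;             /* -1 means "slot empty" */
        }
    }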

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



From: "Simon Riggs" on
On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(a)2ndquadrant.com> writes:
> > Itagaki-san and I were discussing in January the idea of cache-looping,
> > whereby a process begins to reuse its own buffers in a ring of ~32
> > buffers. When we cycle back round, if usage_count==1 then we assume that
> > we can reuse that buffer. This avoids cache swamping for read and write
> > workloads, plus avoids too-frequent WAL writing for VACUUM.
>
> > This would maintain the beneficial behaviour for OLTP,
>
> Justify that claim. It sounds to me like this would act very nearly the
> same as having shared_buffers == 32 ...

Sure. We wouldn't set the hint for IndexScans or Inserts, only for
SeqScans, VACUUM and COPY.

So OLTP-only workloads would be entirely unaffected. In the presence of
a mixed workload the scan tasks would have only a limited effect on the
cache, maintaining performance for the response-time-critical tasks. So
it's an OLTP benefit because of cache protection and WAL-flush reduction
during VACUUM.
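
To make that concrete, the call sites would look something like this;
the function names are placeholders for wherever the scan, VACUUM and
COPY paths begin and end their bulk access:

    #include <stdbool.h>

    extern void StrategyHintCyclicBufferReuse(bool enable);

    /* Hypothetical call sites: only the bulk, uniform-access paths opt in. */
    void
    begin_bulk_heap_access(void)
    {
        StrategyHintCyclicBufferReuse(true);    /* SeqScan, VACUUM, COPY */
    }

    void
    end_bulk_heap_access(void)
    {
        StrategyHintCyclicBufferReuse(false);   /* back to the normal clock sweep */
    }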

As we've seen, the scan tasks look like they'll go faster with this.

The assumption that we can reuse the buffer if usage_count <= 1 seems
valid. If another user had requested the block, its usage_count would
be > 1, unless that buffer had been used, unpinned, and then a full cycle
of the buffer cache had spun round, all within the time taken to process
32 blocks sequentially. We have to reuse one of the buffers anyway, so
cyclical reuse seems like a better bet most of the time than the more
arbitrary block reuse we'd see with a larger cache.

The best way is to prove it, though. It seems like not too much work to
have a private ring data structure when the hint is enabled. The extra
bookkeeping is easily going to be outweighed by the reduction in mem->L2
cache fetches. I'll do it tomorrow, if no other volunteers.
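
In the meantime, here is roughly the allocation path I have in mind.
It's a sketch building on the hypothetical names above; the helper
functions are stand-ins for whatever bufmgr actually exposes, not real
calls:

    #include <stdbool.h>

    #define CYCLIC_RING_SIZE 32

    /* Per-backend ring state, as sketched earlier (hypothetical). */
    extern int  ring_buffers[CYCLIC_RING_SIZE];
    extern int  ring_next;

    /* Stand-ins for real buffer-manager primitives. */
    extern bool buffer_is_unpinned(int buf);
    extern int  buffer_usage_count(int buf);
    extern int  StrategyGetBufferFromClockSweep(void);

    /*
     * Pick the next victim buffer while cyclic reuse is enabled.  Try to
     * recycle the buffer we used CYCLIC_RING_SIZE allocations ago; if any
     * other backend has touched it since (usage_count > 1) or it is still
     * pinned, fall back to the ordinary clock sweep and remember the new
     * buffer in the ring instead.
     */
    int
    GetVictimBufferCyclic(void)
    {
        int buf = ring_buffers[ring_next];

        if (buf < 0 || !buffer_is_unpinned(buf) || buffer_usage_count(buf) > 1)
        {
            /* Slot empty, or someone else wanted the buffer: don't steal it. */
            buf = StrategyGetBufferFromClockSweep();
            ring_buffers[ring_next] = buf;
        }

        ring_next = (ring_next + 1) % CYCLIC_RING_SIZE;
        return buf;
    }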

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com




From: Jeff Davis on
On Mon, 2007-03-05 at 03:51 -0500, Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.
>

I don't see why we should strictly limit sequential scans to one buffer
per scan. I assume that is what you mean, but it raises two questions:

(1) What happens when there are more seq scans than cold buffers
available?
(2) What happens when two sequential scans need the same page, do we
double-up?

Also, the first time we access any heap page of a big table, we are very
unsure whether we will access it again, regardless of whether it's part
of a seq scan or not.

In our current system of 4 LRU lists (depending on how many times a
buffer has been referenced), we could start "more likely" pages (e.g.
system catalogs, index pages) in a higher list, and heap reads from big
tables in the lowest possible list. Assuming, of course, that has any
benefit (frequently accessed cache pages are likely to move up the
lists very quickly anyway).

But I don't think we should eliminate caching of heap pages in big
tables altogether. A few buffers might be polluted during the scan,
but most of the time they will be replacing other low-priority pages
(perhaps from another seq scan) and will probably be replaced again
quickly. If that's not happening, and the scan is polluting
frequently-accessed pages, I agree that's a bug.

Regards,
Jeff Davis



From: Jeff Davis on
On Mon, 2007-03-05 at 11:10 +0200, Hannu Krosing wrote:
> > My proposal for a fix: ensure that when relations larger (much larger?)
> > than buffer cache are scanned, they are mapped to a single page in the
> > shared buffer cache.
>
> How will this approach play together with synchronized scan patches ?
>

Thanks for considering my patch in this discussion. I will test by
turning shared_buffers down as low as I can, to see if that makes a big
difference.

> Or should synchronized scan rely on systems cache only ?
>

I don't know what the performance impact of that will be; it would still
be good compared to reading from disk, but I assume it would cost much
more CPU time.

Regards,
Jeff Davis


