Bug: Buffer cache is not scan resistant [PgSql]

From: Jeff Davis on 6 Mar 2007 12:56

On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > Another approach I proposed back in December is to not have a
> > variable like that at all, but scan the buffer cache for pages
> > belonging to the table you're scanning to initialize the scan.
> > Scanning all the BufferDescs is a fairly CPU and lock heavy
> > operation, but it might be ok given that we're talking about large
> > I/O bound sequential scans. It would require no DBA tuning and
> > would work more robustly in varying conditions. I'm not sure where
> > you would continue after scanning the in-cache pages. At the
> > highest in-cache block number, perhaps.
>
> If there was some way to do that, it'd be what I'd vote for.
>

I still don't know how to make this take advantage of the OS buffer
cache.

However, no DBA tuning is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go. Can you guys give me some pointers about
how to walk through the shared buffers, reading the pages that I need,
while being sure not to read a page that's been evicted, and also not
potentially causing a performance regression somewhere else?

> Given the partitioning of the buffer lock that Tom did it might not
> be that horrible for many cases, either, since you'd only need to
> scan through one partition.
>
> We also don't need an exact count, either. Perhaps there's some way
> we could keep a counter or something...

Exact count of what? The pages already in cache?

Regards,
Jeff Davis

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

From: Tom Lane on 6 Mar 2007 12:59

Jeff Davis <pgsql(a)j-davis.com> writes:
> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go.

I think that's a good way to guarantee that you'll not finish in time
for 8.3. Heikki's idea is just at the handwaving stage at this point,
and I'm not even convinced that it will offer any win. (Pages in
cache will be picked up by a seqscan already.)

regards, tom lane

From: Jeff Davis on 6 Mar 2007 13:09

On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote:
> Jeff Davis <pgsql(a)j-davis.com> writes:
> > If I were to implement this idea, I think Heikki's bitmap of pages
> > already read is the way to go.
>
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3. Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win. (Pages in
> cache will be picked up by a seqscan already.)
>

I agree that it's a good idea to stick with the current implementation,
which is, as far as I can see, meeting all of my performance goals.

Regards,
Jeff Davis

From: Heikki Linnakangas on 6 Mar 2007 13:29

Jeff Davis wrote:
> On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
>> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
>>> Another approach I proposed back in December is to not have a
>>> variable like that at all, but scan the buffer cache for pages
>>> belonging to the table you're scanning to initialize the scan.
>>> Scanning all the BufferDescs is a fairly CPU and lock heavy
>>> operation, but it might be ok given that we're talking about large
>>> I/O bound sequential scans. It would require no DBA tuning and
>>> would work more robustly in varying conditions. I'm not sure where
>>> you would continue after scanning the in-cache pages. At the
>>> highest in-cache block number, perhaps.
>> If there was some way to do that, it'd be what I'd vote for.
>>
>
> I still don't know how to make this take advantage of the OS buffer
> cache.

Yep, I don't see any way to do that. I think we could live with that,
though. If we went with the sync_scan_offset approach, you'd have to
leave a lot of safety margin in that as well.

> However, no DBA tuning is a huge advantage, I agree with that.
>
> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go. Can you guys give me some pointers about
> how to walk through the shared buffers, reading the pages that I need,
> while being sure not to read a page that's been evicted, and also not
> potentially causing a performance regression somewhere else?

You could take a look at BufferSync, for example. It walks through the
buffer cache, syncing all dirty buffers.

FWIW, I've attached a function I wrote some time ago when I was playing
with the same idea for vacuums. A call to the new function loops through
the buffer cache and returns the next buffer that belongs to a certain
relation. I'm not sure that it's correct and safe, and there aren't many
comments, but it should work if you want to play with it...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
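
The attachment is not reproduced in this archive. As a rough illustration
of the idea described above (walk the buffer cache and return the next
buffer that belongs to a given relation), here is a minimal Python toy
model. It is not PostgreSQL code: BufferDesc, its fields, and both
function names are hypothetical, and the model ignores locking, pinning
and concurrent eviction, which are exactly the concerns raised in the
thread.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class BufferDesc:
    """Toy stand-in for a shared-buffer header (hypothetical fields)."""
    rel_id: int         # relation the cached page belongs to
    block_no: int       # block number of the page within that relation
    valid: bool = True  # False models an empty or invalidated slot


def next_buffer_for_relation(
    buffers: Sequence[BufferDesc], rel_id: int, start: int = 0
) -> Optional[Tuple[int, BufferDesc]]:
    """Return (slot, descriptor) for the first valid buffer at or after
    `start` that holds a page of `rel_id`, or None if there is none."""
    for slot in range(start, len(buffers)):
        desc = buffers[slot]
        if desc.valid and desc.rel_id == rel_id:
            return slot, desc
    return None


def cached_blocks(buffers: Sequence[BufferDesc], rel_id: int) -> Set[int]:
    """Collect the block numbers of `rel_id` found in the cache by
    calling next_buffer_for_relation repeatedly."""
    blocks = set()
    slot = 0
    while True:
        hit = next_buffer_for_relation(buffers, rel_id, slot)
        if hit is None:
            return blocks
        slot, desc = hit
        blocks.add(desc.block_no)
        slot += 1  # resume the walk just past the buffer we found
```

A scan could use the resulting set of block numbers to decide which pages
to visit first. In a real buffer manager each returned buffer would have
to be pinned and rechecked before use, since it can be evicted between the
lookup and the read.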

From: Heikki Linnakangas on 6 Mar 2007 13:47

Tom Lane wrote:
> Jeff Davis <pgsql(a)j-davis.com> writes:
>> If I were to implement this idea, I think Heikki's bitmap of pages
>> already read is the way to go.
>
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3. Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win. (Pages in
> cache will be picked up by a seqscan already.)

The scenario that I'm worried about is that you have a table that's
slightly larger than RAM. If you issue many seqscans on that table, one
at a time, every seqscan will have to read the whole table from disk,
even though say 90% of it is in cache when the scan starts.

This can be alleviated by using a large enough sync_scan_offset, but a
single setting like that is tricky to tune, especially if your workload
is not completely constant. Tune it too low and you don't get much
benefit; tune it too high and your scans diverge and you lose all benefit.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
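
The scenario above can be reproduced with a small simulation. This is a
sketch under a strong assumption: a strict LRU cache, which is a
simplification of how the PostgreSQL buffer manager and the OS cache
really behave. The class and function names are made up for the example.
It compares a second scan that always starts at block 0 with one that
reads the already-cached blocks first.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal strict-LRU page cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # least recently used entry first
        self.hits = 0
        self.misses = 0

    def read(self, block):
        if block in self.pages:
            self.pages.move_to_end(block)  # mark as most recently used
            self.hits += 1
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
            self.pages[block] = True
            self.misses += 1


def scan_in_order(cache, n_blocks):
    """Plain sequential scan: always start at block 0."""
    for block in range(n_blocks):
        cache.read(block)


def scan_cached_first(cache, n_blocks):
    """Read whatever is cached first, then the remaining blocks."""
    cached = list(cache.pages)  # snapshot before reads reorder the cache
    for block in cached:
        cache.read(block)
    seen = set(cached)
    for block in range(n_blocks):
        if block not in seen:
            cache.read(block)


def hits_in_second_scan(scan, n_blocks=100, capacity=90):
    """Warm the cache with one in-order scan, then count the hits of a
    second scan performed by `scan`."""
    cache = LRUCache(capacity)
    scan_in_order(cache, n_blocks)
    cache.hits = cache.misses = 0
    scan(cache, n_blocks)
    return cache.hits
```

With the defaults (a 100-block table and a 90-block cache),
hits_in_second_scan(scan_in_order) returns 0, because each block is
evicted shortly before the scan reaches it again, while
hits_in_second_scan(scan_cached_first) returns 90.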
