From: Hannu Krosing on
On one fine day, Mon, 2007-03-05 at 03:51, Luke Lonergan wrote:
> Hi Tom,
>
> > Even granting that your conclusions are accurate, we are not
> > in the business of optimizing Postgres for a single CPU architecture.
>
> I think you're missing my/our point:
>
> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.
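
For concreteness, that proposal might look something like the sketch below. This is purely illustrative, with invented names (ScanRing, scan_ring_next), not anything from the actual bufmgr: the scan keeps a small private ring of buffers and recycles them round-robin, so it never claims fresh pages from the shared pool.

/* Illustrative sketch only -- not PostgreSQL source.  All names are
 * invented for this example. */
#define SCAN_RING_SIZE 32        /* a handful of buffers, not the pool */

typedef struct ScanRing
{
    int     buffers[SCAN_RING_SIZE]; /* buffer ids owned by this scan */
    int     next;                    /* next slot to recycle */
} ScanRing;

/* Return the next buffer to reuse; the caller evicts its old contents
 * and reads the fresh page into it.  The scan cycles through its own
 * SCAN_RING_SIZE buffers forever, leaving the rest of the shared pool
 * untouched. */
static int
scan_ring_next(ScanRing *ring)
{
    int     buf = ring->buffers[ring->next];

    ring->next = (ring->next + 1) % SCAN_RING_SIZE;
    return buf;
}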

How will this approach play together with the synchronized scan patches?

Or should synchronized scans rely on the system's cache only?

> - Luke
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com



From: Tom Lane on
"Luke Lonergan" <LLonergan(a)greenplum.com> writes:
> I think you're missing my/our point:

> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".

No, this is not a bug; it is operating as designed. The point of the
current bufmgr algorithm is to replace the page least recently used,
and that's what it's doing.
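
In sketch form, that replacement policy is typically approximated with a clock sweep over usage counts. The illustration below is generic, with invented names, not the actual bufmgr code:

/* Generic clock-sweep sketch, the usual approximation of LRU. */
#define NBUFFERS 1024

typedef struct BufDesc
{
    int     usage_count;        /* bumped on each access */
    int     pinned;             /* nonzero while a backend uses it */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int     sweep_hand;

/* Sweep until we find an unpinned buffer whose usage count has decayed
 * to zero, decrementing counts as we pass; recently touched pages get
 * another lap before becoming eviction victims. */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        int     idx = sweep_hand;

        sweep_hand = (sweep_hand + 1) % NBUFFERS;
        if (buffers[idx].pinned)
            continue;
        if (buffers[idx].usage_count == 0)
            return idx;         /* coldest page found: evict this one */
        buffers[idx].usage_count--;
    }
}

Under such a scheme a sequential scan reads each page exactly once, and each freshly read page claims a victim, so a scan larger than the pool turns the whole pool over by the time it finishes.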

If you want to lobby for changing the algorithm, then you need to
explain why one test case on one platform justifies de-optimizing
for a lot of other cases. In almost any concurrent-access situation
I think that what you are suggesting would be a dead loss --- for
instance we might as well forget about Jeff Davis' synchronized-scan
work.

In any case, I'm still not convinced that you've identified the problem
correctly, because your explanation makes no sense to me. How can the
processor's L2 cache improve access to data that it hasn't got yet?

regards, tom lane


From: Florian Weimer on
* Tom Lane:

> That makes absolutely zero sense. The data coming from the disk was
> certainly not in processor cache to start with, and I hope you're not
> suggesting that it matters whether the *target* page of a memcpy was
> already in processor cache. If the latter, it is not our bug to fix.

Uhm, if it's not in the cache, you typically need to evict some cache
lines to make room for the data, so I'd expect an indirect performance
hit. I could be mistaken, though.
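
Back-of-envelope, assuming 64-byte cache lines: a memcpy of one 8 KB page touches 8192/64 = 128 lines on the source side and another 128 on the destination, roughly 16 KB of cache traffic per page. A 512 KB L2 holds only about 32 pages' worth of that traffic, so a scan streaming pages at, say, 120 MB/s turns the whole L2 over every couple of milliseconds, which is consistent with an indirect hit on everything else the cache was holding.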

--
Florian Weimer <fweimer(a)bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100                  tel: +49-721-96201-1
D-76133 Karlsruhe                 fax: +49-721-96201-99


From: Hannu Krosing on
On one fine day, Mon, 2007-03-05 at 04:15, Tom Lane wrote:
> "Luke Lonergan" <LLonergan(a)greenplum.com> writes:
> > I think you're missing my/our point:
>
> > The Postgres shared buffer cache algorithm appears to have a bug. When
> > there is a sequential scan the blocks are filling the entire shared
> > buffer cache. This should be "fixed".
>
> No, this is not a bug; it is operating as designed.

Maybe he means that there is an oversight (aka "bug") in the design ;)

> The point of the
> current bufmgr algorithm is to replace the page least recently used,
> and that's what it's doing.
>
> If you want to lobby for changing the algorithm, then you need to
> explain why one test case on one platform justifies de-optimizing
> for a lot of other cases.

If you know beforehand that you will definitely overflow the cache and not
reuse the data anytime soon, then it seems quite reasonable not to even
start polluting the cache. Especially if you get a noticeable boost in
performance while doing so.
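
In code, that advance knowledge could be a check as simple as the hypothetical one below (the names and the one-quarter threshold are invented for illustration):

/* Hypothetical heuristic, invented for illustration: decide at scan
 * start whether to bypass the main buffer pool. */
static int
scan_should_bypass_cache(long rel_pages, long pool_pages)
{
    /*
     * If the relation is a sizable fraction of the pool, its pages
     * cannot all stay resident anyway, so recycling a small private
     * ring avoids evicting everything else for no benefit.  The
     * one-quarter threshold is an arbitrary example.
     */
    return rel_pages >= pool_pages / 4;
}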

> In almost any concurrent-access situation
> I think that what you are suggesting would be a dead loss

Only if the concurrent access pattern is over data that mostly fits in
the buffer cache. If we can avoid polluting the buffer cache with data we
know we will use only once, more useful data will be available.

> --- for
> instance we might as well forget about Jeff Davis' synchronized-scan
> work.

Depends on the ratio of system cache to shared buffer cache. I don't think
Jeff's patch is anywhere near the point where it needs to start worrying
about data swapping between the system cache and shared buffers, or about
L2 cache usage.

> In any case, I'm still not convinced that you've identified the problem
> correctly, because your explanation makes no sense to me. How can the
> processor's L2 cache improve access to data that it hasn't got yet?
>
> regards, tom lane
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com




From: Mark Kirkwood on
Gavin Sherry wrote:
> On Mon, 5 Mar 2007, Mark Kirkwood wrote:
>
>> To add a little to this - forgetting the scan resistant point for the
>> moment... cranking down shared_buffers to be smaller than the L2 cache
>> seems to help *any* sequential scan immensely, even on quite modest HW:
>>
> (snipped)
>> When I've profiled this activity, I've seen a lot of time spent
>> searching for/allocating a new buffer for each page being fetched.
>> Obviously having less of them to search through will help, but having
>> less than the L2 cache-size worth of 'em seems to help a whole lot!
>
> Could you demonstrate that point by showing us timings for shared_buffers
> sizes from 512K up to, say, 2 MB? The two numbers you give there might
> just have to do with managing a large buffer.

Yeah - good point:

PIII 1.26 GHz, 512 KB L2 cache, 2 GB RAM

Test is elapsed time for: SELECT count(*) FROM lineitem

lineitem has 1535724 pages (11997 MB)

Shared Buffers   Elapsed   IO rate (from vmstat)
--------------   -------   ---------------------
400MB            101 s     122 MB/s

2MB              100 s
1MB               97 s
768KB             93 s
512KB             86 s
256KB             77 s
128KB             74 s     166 MB/s

I've added the observed IO rate for the two extreme cases (the rest can
be pretty much deduced via interpolation).

Note that the system will do about 220 MB/s with the now (in)famous dd
test, so we have a bit of headroom (not too bad for a PIII).
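
As a cross-check, the elapsed times are consistent with those rates:
11997 MB / 101 s is about 119 MB/s, and 11997 MB / 74 s is about 162 MB/s,
close to the 122 and 166 MB/s that vmstat reported.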

Cheers

Mark

