From: "Luke Lonergan"
Hi Tom,

> Now this may only prove that the disk subsystem on this
> machine is too cheap to let the system show any CPU-related
> issues.

Try it with a warm IO cache. As I posted before, we see double the
performance of a VACUUM of a table in the IO cache when the shared
buffer cache isn't being polluted: the speed with a large shared buffer
cache should be about 450 MB/s, and with a buffer cache smaller than L2
about 800 MB/s.

The real issue here isn't the L2 behavior, though that matters when
trying to reach very high IO speeds. The issue is that we're seeing the
buffer cache pollution in the first place: when we instrument the
blocks selected by the buffer page selection algorithm, we see that
they iterate sequentially, filling the entire shared buffer cache.
That's the source of the problem here.
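To make the pollution concrete, here is a toy simulation (not PostgreSQL's actual clock-sweep code; the capacities and block names are made up) of how a sequential scan of a relation larger than the cache evicts every previously hot page:

```python
from collections import OrderedDict

# Toy LRU buffer cache. Illustrative only: PostgreSQL uses a clock-sweep
# algorithm, but the pollution effect is the same in shape.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # block id -> present

    def access(self, block):
        if block in self.pages:
            self.pages.move_to_end(block)  # hit: mark most recently used
            return True
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
        self.pages[block] = True
        return False

cache = LRUCache(capacity=1000)

# Warm the cache with a hot working set (say, pages of small hot tables).
hot = [("hot", i) for i in range(100)]
for b in hot:
    cache.access(b)

# A sequential scan of a 10,000-block relation sweeps the whole cache...
for i in range(10_000):
    cache.access(("big", i))

survivors = sum(1 for b in hot if b in cache.pages)
print(f"hot pages still cached after seqscan: {survivors}")  # 0
```

Every hot page is gone after the scan; the cache holds only the tail of the big relation.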

Do we have a regression test somewhere for this?

- Luke


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

From: Grzegorz Jaskiewicz

On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:
> I'm also less than convinced that it'd be helpful for a big seqscan:
> won't reading a new disk page into memory via DMA cause that memory to
> get flushed from the processor cache anyway?

Nope. DMA writes directly into main memory. If the region was already
in the L1/L2 cache, those lines get invalidated; if it wasn't cached,
nothing in the cache changes.

--
Grzegorz Jaskiewicz
gj(a)pointblue.com.pl





From: Tom Lane
"Luke Lonergan" <LLonergan(a)greenplum.com> writes:
>> So either way, it isn't in processor cache after the read.
>> So how can there be any performance benefit?

> It's the copy from kernel IO cache to the buffer cache that is L2
> sensitive. When the shared buffer cache is polluted, it thrashes the L2
> cache. When the number of pages being written to in the kernel->user
> space writes fits in L2, then the L2 lines are "written through" (see
> the link below on page 264 for the write combining features of the
> opteron for example) and the writes to main memory are deferred.

That makes absolutely zero sense. The data coming from the disk was
certainly not in processor cache to start with, and I hope you're not
suggesting that it matters whether the *target* page of a memcpy was
already in processor cache. If the latter, it is not our bug to fix.

> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

Even granting that your conclusions are accurate, we are not in the
business of optimizing Postgres for a single CPU architecture.

regards, tom lane


From: "Luke Lonergan"

Hi Tom,

> Even granting that your conclusions are accurate, we are not
> in the business of optimizing Postgres for a single CPU architecture.

I think you're missing my/our point:

The Postgres shared buffer cache algorithm appears to have a bug:
during a sequential scan, the scanned blocks fill the entire shared
buffer cache. This should be "fixed".

My proposal for a fix: ensure that when relations larger (much larger?)
than the buffer cache are scanned, they are mapped to a single page in
the shared buffer cache.
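One way a fix along these lines could be sketched (a hypothetical model, not PostgreSQL code; the names `RingBuffer`, `choose_strategy`, and the sizes are all illustrative assumptions) is to cycle large scans through a small private ring of buffers instead of the main cache:

```python
from collections import OrderedDict

# Sketch: blocks read by a scan of a relation much larger than the
# shared buffer cache are recycled through a small private "ring",
# so they never compete with the rest of the cache.
class RingBuffer:
    def __init__(self, size):
        self.size = size
        self.slots = OrderedDict()

    def access(self, block):
        if block in self.slots:
            return True
        if len(self.slots) >= self.size:
            self.slots.popitem(last=False)  # recycle the oldest ring slot
        self.slots[block] = True
        return False

SHARED_BUFFERS = 1000  # main cache capacity, in pages (illustrative)

def choose_strategy(rel_blocks, ring_size=16):
    # Heuristic: relations larger than the cache go through the ring;
    # smaller relations use the normal shared buffer cache (None here).
    return RingBuffer(ring_size) if rel_blocks > SHARED_BUFFERS else None

scan_target = choose_strategy(rel_blocks=10_000)
for i in range(10_000):
    scan_target.access(("big", i))

# The scan touched 10,000 blocks but only ever held ring_size of them.
print(len(scan_target.slots))  # 16
```

The scan still reads every block, but its cache footprint stays bounded, so hot pages elsewhere in the shared cache survive.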

- Luke



From: Heikki Linnakangas
Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug:
> during a sequential scan, the scanned blocks fill the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than the buffer cache are scanned, they are mapped to a single page in
> the shared buffer cache.

It's not that simple. Using the whole buffer cache for a single seqscan
is fine if there's currently no better use for it. Running a single
SELECT will indeed use the whole cache, but if you run other, smaller
queries, the pages they need should stay in cache while the seqscan
loops through the remaining buffers.

In fact, the pages left in the cache after the seqscan finishes would
be useful for the next seqscan of the same table, if we were smart
enough to read those pages first. That'd make a big difference when
seqscanning a table that's, say, 1.5x your RAM size. Hmm, I wonder if
Jeff's sync seqscan patch addresses that.
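The observation above can be illustrated with a toy LRU model (again, real PostgreSQL uses clock-sweep, and the 1000/1500 sizes are invented): after scanning a table 1.5x the cache size, the *tail* of the table is what survives, so a second scan visiting those blocks first gets free hits:

```python
from collections import OrderedDict

CACHE_BLOCKS = 1000
TABLE_BLOCKS = 1500  # table is 1.5x the cache size

cache = OrderedDict()

def access(block):
    if block in cache:
        cache.move_to_end(block)  # hit
        return True
    if len(cache) >= CACHE_BLOCKS:
        cache.popitem(last=False)  # evict least recently used
    cache[block] = True
    return False

# First scan: blocks 0..1499 in order.
for b in range(TABLE_BLOCKS):
    access(b)

# What survives is the last CACHE_BLOCKS blocks of the table: 500..1499.
cached_tail = [b for b in range(TABLE_BLOCKS) if b in cache]
print(cached_tail[0], cached_tail[-1])  # 500 1499

# A second scan that starts at the cached tail hits on all of them,
# whereas a naive front-to-back rescan would evict each block just
# before needing it and hit on nothing.
hits_tail_first = sum(access(b) for b in range(500, TABLE_BLOCKS))
print(hits_tail_first)  # 1000
```

Starting the next scan where the cache already has data is essentially what a synchronized-scan approach can exploit.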

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

