From: Jeff Davis on
On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> Jeff Davis wrote:
> > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> >>> Another approach I proposed back in December is to not have a
> >>> variable like that at all, but scan the buffer cache for pages
> >>> belonging to the table you're scanning to initialize the scan.
> >>> Scanning all the BufferDescs is a fairly CPU and lock heavy
> >>> operation, but it might be ok given that we're talking about large
> >>> I/O bound sequential scans. It would require no DBA tuning and
> >>> would work more robustly in varying conditions. I'm not sure where
> >>> you would continue after scanning the in-cache pages. At the
> >>> highest in-cache block number, perhaps.
> >> If there was some way to do that, it'd be what I'd vote for.
> >>
> >
> > I still don't know how to make this take advantage of the OS buffer
> > cache.
>
> Yep, I don't see any way to do that. I think we could live with that,
> though. If we went with the sync_scan_offset approach, you'd have to
> leave a lot of safety margin in that as well.
>

Right, there would certainly have to be a safety margin with
sync_scan_offset. However, your plan only works when the shared buffers
are dominated by this sequential scan. Let's say you have 40% of
physical memory for shared buffers, and say that 50% are being used for
hot pages in other parts of the database. That means you have access to
only 20% of physical memory to optimize for this sequential scan, and
20% of the physical memory is basically unavailable (being used for
other parts of the database).

In my current implementation, you could set sync_scan_offset to 1.0
(meaning 1.0 x shared_buffers), giving you 40% of physical memory that
would be used for starting this sequential scan. In this case, that
should be a good margin of error, considering that as much as 80% of the
physical memory might actually be in cache (OS or PG cache).
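
To make the arithmetic concrete, the start point boils down to
something like this (an illustrative sketch only, not the actual
patch code; choose_start_block, hint and offset_blocks are made-up
names, and it assumes the offset is simply how far behind another
scan's reported position we join in):

#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * Illustrative only: pick where a new scan should join in.  "hint" is
 * the block some other backend recently reported scanning, and
 * "offset_blocks" is sync_scan_offset expressed in blocks
 * (e.g. 1.0 x shared_buffers).
 */
static BlockNumber
choose_start_block(BlockNumber hint, BlockNumber offset_blocks,
                   BlockNumber table_blocks)
{
    /* back off from the reported position by the safety margin */
    BlockNumber start = (hint > offset_blocks) ? hint - offset_blocks : 0;

    /* never start past the end of the table */
    if (start >= table_blocks)
        start = 0;

    return start;
}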

This all needs to be backed up by testing, of course. I'm just
extrapolating some numbers that look vaguely reasonable to me.

Regards,
Jeff Davis



From: "Luke Lonergan" on
Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns, and did not see a gain as substantial as the one from reducing the working set size.

My conclusion: Fixing the OS is not sufficient to alleviate the issue. We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this effect.

- Luke

Message is short because I'm on my Treo.

-----Original Message-----
From: Sherry Moore [mailto:sherry.moore(a)sun.com]
Sent: Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To: Simon Riggs
Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM it's still
> potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was determined by whether the size of the
  user buffer caused L2 thrashing.

- In general, an application should reduce the size of its working
  set to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym

From: Hannu Krosing on
On Tue, 2007-03-06 at 18:28, Jeff Davis wrote:
> On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> > Jeff Davis wrote:
> > > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > >>> Another approach I proposed back in December is to not have a
> > >>> variable like that at all, but scan the buffer cache for pages
> > >>> belonging to the table you're scanning to initialize the scan.
> > >>> Scanning all the BufferDescs is a fairly CPU and lock heavy
> > >>> operation, but it might be ok given that we're talking about large
> > >>> I/O bound sequential scans. It would require no DBA tuning and
> > >>> would work more robustly in varying conditions. I'm not sure where
> > >>> you would continue after scanning the in-cache pages. At the
> > >>> highest in-cache block number, perhaps.
> > >> If there was some way to do that, it'd be what I'd vote for.
> > >>
> > >
> > > I still don't know how to make this take advantage of the OS buffer
> > > cache.

Maybe it should not?

Mostly, the OS cache can be of use only if it is much bigger than the
shared buffer cache. It may make sense to forget about the OS cache
and just tell those who can make use of sync scans to set aside most
of their memory for shared buffers.

Then we could make better predictions/lookups of how much of a table
is actually in memory.

Dual caching is usually not very beneficial anyway, not to mention
the difficulty of predicting any dual-caching effects.

> > Yep, I don't see any way to do that. I think we could live with that,
> > though. If we went with the sync_scan_offset approach, you'd have to
> > leave a lot of safety margin in that as well.
> >
>
> Right, there would certainly have to be a safety margin with
> sync_scan_offset. However, your plan only works when the shared buffers
> are dominated by this sequential scan. Let's say you have 40% of
> physical memory for shared buffers, and say that 50% are being used for
> hot pages in other parts of the database. That means you have access to
> only 20% of physical memory to optimize for this sequential scan, and
> 20% of the physical memory is basically unavailable (being used for
> other parts of the database).

The simplest thing, in case the table is much bigger than the buffer
cache usable for it, is to start the second scan at the point the
first scan is traversing *now*, and hope that the scans will stay
together. Or start at some fixed lag, which makes the first scan
always the one issuing reads while the second just free-rides on
buffers already in cache. It may even be a good idea to throttle the
second scan to stay N pages behind, in case the OS readahead gets
confused when the same file is read from multiple processes.
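
In rough C terms, that throttle could be as simple as the following
(leader_block() and LAG_PAGES are made-up names; nothing like this
exists today, and wraparound at the end of the table is ignored):

#include <stdint.h>
#include <unistd.h>

/* assumed helper: the block the first (read-issuing) scan is at now */
extern uint32_t leader_block(void);

#define LAG_PAGES 256            /* stay this many pages behind */

static uint32_t
throttled_next_block(uint32_t my_block)
{
    /* wait until the first scan is at least LAG_PAGES ahead of us */
    while (leader_block() < my_block + LAG_PAGES)
        usleep(1000);
    return my_block;             /* safe to read this block now */
}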

If the table is smaller than the cache, then just scan it without
syncing.

Trying to read buffers in the same order, starting from near the
point where pages are still in the shared buffer cache, seems good
mostly for the case where the table is as big as, or just a little
larger than, the cache.

> In my current implementation, you could set sync_scan_offset to 1.0
> (meaning 1.0 x shared_buffers), giving you 40% of physical memory that
> would be used for starting this sequential scan. In this case, that
> should be a good margin of error, considering that as much as 80% of the
> physical memory might actually be in cache (OS or PG cache).
>
> This all needs to be backed up by testing, of course. I'm just
> extrapolating some numbers that look vaguely reasonable to me.

If there is an easy way to tell PG "give me this page only if it is
in shared cache already", then a good approach might be to start the
2nd scan at the point where the 1st is now, and move in both
directions simultaneously, like this:

First scan is at page N.

Second scan:

M = N - 1
WHILE NOT ALL PAGES ARE READ:
    IF PAGE N IS IN CACHE:                  -- FOLLOW FIRST READER
        READ PAGE N
        N++
    ELSE IF M >= 0 AND PAGE M IS IN CACHE:  -- READ OLDER CACHED PAGES
        READ PAGE M
        M--
    ELSE IF FIRST READER STILL GOING:       -- NO OLDER PAGES, WAIT FOR 1st
        WAIT FOR PAGE N TO BECOME AVAILABLE
        READ PAGE N
        N++
    ELSE:                                   -- BECOME 1st READER
        READ PAGE N
        N++
    PROCESS PAGE
    -- WRAP BOTH CURSORS AT THE TABLE BOUNDARIES
    IF N > PAGES_IN_TABLE: N = 0
    IF M < 0: M = PAGES_IN_TABLE


This should work reasonably well for LRU caches, and it may be made
to work with the clock sweep scheme if the sweep arranges the pages
to purge in file order.

If we could make the IF PAGE x IS IN CACHE part also know about the
OS cache, this could make use of the OS cache as well.
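
To make the control flow above a bit more concrete, here is roughly
the same loop in C; every helper it calls (page_in_cache, read_page,
wait_for_page, process_page, first_reader_active) is hypothetical,
so this only illustrates the visiting order, it is not backend code:

#include <stdbool.h>
#include <stdint.h>

/* assumed helpers, declared only so the sketch is self-contained */
extern bool  page_in_cache(uint32_t page);
extern bool  first_reader_active(void);
extern void  wait_for_page(uint32_t page);
extern void *read_page(uint32_t page);
extern void  process_page(void *buf);

static void
second_scan(uint32_t n, uint32_t pages_in_table)
{
    uint32_t m = (n == 0) ? pages_in_table - 1 : n - 1;
    uint32_t done = 0;

    while (done < pages_in_table)
    {
        uint32_t page;

        if (page_in_cache(n))               /* follow the first reader */
            page = n++;
        else if (page_in_cache(m))          /* read older cached pages */
            page = m--;                     /* m wraps below */
        else if (first_reader_active())     /* nothing cached: wait for 1st */
        {
            wait_for_page(n);
            page = n++;
        }
        else                                /* 1st is done: become the reader */
            page = n++;

        process_page(read_page(page));
        done++;

        /* wrap both cursors at the table boundaries */
        if (n >= pages_in_table)
            n = 0;
        if (m >= pages_in_table)            /* unsigned wrap of m-- */
            m = pages_in_table - 1;
    }
}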


Do any of you know of a way to READ PAGE ONLY IF IN CACHE on *nix
systems?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com




From: "Marko Kreen" on
On 3/7/07, Hannu Krosing <hannu(a)skype.net> wrote:
> Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
> systems ?

Supposedly you could mmap() a file and then do mincore() on the
area to see which pages are cached.
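
For example, something along these lines (an untested sketch; error
handling is minimal, and the exact mincore() vector type differs a
bit between platforms: unsigned char * on Linux, char * on Solaris):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* report how much of a file is resident in the OS page cache */
int
main(int argc, char **argv)
{
    int            fd;
    struct stat    st;
    void          *map;
    size_t         pagesize = (size_t) sysconf(_SC_PAGESIZE);
    size_t         npages, i, resident = 0;
    unsigned char *vec;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
        fstat(fd, &st) < 0 || st.st_size == 0)
    {
        fprintf(stderr, "usage: %s filename\n", argv[0]);
        return 1;
    }

    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    npages = (st.st_size + pagesize - 1) / pagesize;
    vec = malloc(npages);

    /* the low bit of each byte tells whether that page is resident */
    if (vec != NULL && mincore(map, st.st_size, vec) == 0)
    {
        for (i = 0; i < npages; i++)
            if (vec[i] & 1)
                resident++;
        printf("%lu of %lu pages resident\n",
               (unsigned long) resident, (unsigned long) npages);
    }

    munmap(map, st.st_size);
    free(vec);
    close(fd);
    return 0;
}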

But you were talking about the Postgres cache before; there it
should be easy to implement.

--
marko


From: Sherry Moore on
Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM it's still
> potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was determined by whether the size of the
  user buffer caused L2 thrashing.

- In general, an application should reduce the size of its working
  set to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym

