From: "Simon Riggs" on 6 Mar 2007 17:27 On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote: > - Based on a lot of the benchmarks and workloads I traced, the > target buffer of read operations are typically accessed again > shortly after the read, while writes are usually not. Therefore, > the default operation mode is to bypass L2 for writes, but not > for reads. Hi Sherry, I'm trying to relate what you've said to how we should proceed from here. My understanding of what you've said is: - Tom's assessment that the observed performance quirk could be fixed in the OS kernel is correct and you have the numbers to prove it - currently Solaris only does NTA for 128K reads, which we don't currently do. If we were to request 16 blocks at time, we would get this benefit on Solaris, at least. The copyout_max_cached parameter can be patched, but isn't a normal system tunable. - other workloads you've traced *do* reuse the same buffer again very soon afterwards when reading sequentially (not writes). Reducing the working set size is an effective technique in improving performance if we don't have a kernel that does NTA or we don't read in big enough chunks (we need both to get NTA to kick in). and what you haven't said - all of this is orthogonal to the issue of buffer cache spoiling in PostgreSQL itself. That issue does still exist as a non-OS issue, but we've been discussing in detail the specific case of L2 cache effects with specific kernel calls. All of the test results have been stand-alone, so we've not done any measurements in that area. I say this because you make the point that reducing the working set size of write workloads has no effect on the L2 cache issue, but ISTM its still potentially a cache spoiling issue. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
From: Jeff Davis on 6 Mar 2007 19:28

On Tue, 2007-03-06 at 18:47 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Jeff Davis <pgsql(a)j-davis.com> writes:
> >> If I were to implement this idea, I think Heikki's bitmap of pages
> >> already read is the way to go.
> >
> > I think that's a good way to guarantee that you'll not finish in time
> > for 8.3. Heikki's idea is just at the handwaving stage at this point,
> > and I'm not even convinced that it will offer any win. (Pages in
> > cache will be picked up by a seqscan already.)
>
> The scenario that I'm worried about is that you have a table that's
> slightly larger than RAM. If you issue many seqscans on that table, one
> at a time, every seqscan will have to read the whole table from disk,
> even though say 90% of it is in cache when the scan starts.

If you're issuing sequential scans one at a time, that 90% of the table
that was cached is probably not cached any more, unless the scans are
close together in time without overlapping (serial sequential scans).
And the problem you describe is no worse than the current behavior,
where you have exactly the same problem.

> This can be alleviated by using a large enough sync_scan_offset, but a
> single setting like that is tricky to tune, especially if your workload
> is not completely constant. Tune it too low, and you don't get much
> benefit; tune it too high and your scans diverge and you lose all
> benefit.

I see why you don't want to manually tune this setting; however, it's
really not that tricky. You can be quite conservative and still use a
good fraction of your physical memory. I will come up with some numbers
and see how much we have to gain.

Regards,
	Jeff Davis
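For reference, a minimal sketch of the sync_scan_offset idea being debated here. The names and the toy hint table are illustrative only, not the actual patch: a running seqscan periodically publishes its current block number in shared memory, and a new scan on the same relation starts sync_scan_offset blocks behind that hint, on the theory that those trailing blocks are still cached somewhere (shared buffers or the OS).

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;               /* stand-ins for the real typedefs */
typedef uint32_t BlockNumber;

#define SYNC_SCAN_NELEM 256         /* small fixed-size hint table */

typedef struct
{
    Oid         relid;              /* relation being scanned */
    BlockNumber location;           /* last block reported by a scan */
} ScanLocationHint;

/* In the real thing this would live in shared memory, under a lock. */
static ScanLocationHint scan_hints[SYNC_SCAN_NELEM];

/* Hypothetical GUC: how far behind the hint a new scan starts (blocks). */
static BlockNumber sync_scan_offset = 16384;

/* Called periodically by a running seqscan to publish its position. */
static void
report_scan_location(Oid relid, BlockNumber blkno)
{
    ScanLocationHint *hint = &scan_hints[relid % SYNC_SCAN_NELEM];

    hint->relid = relid;
    hint->location = blkno;
}

/*
 * Called when a new seqscan starts: begin sync_scan_offset blocks
 * behind the concurrent scan, or at block 0 if no hint is available.
 */
static BlockNumber
choose_start_block(Oid relid)
{
    ScanLocationHint *hint = &scan_hints[relid % SYNC_SCAN_NELEM];

    if (hint->relid != relid)
        return 0;

    if (hint->location >= sync_scan_offset)
        return hint->location - sync_scan_offset;

    return 0;
}

int
main(void)
{
    Oid     rel = 16384;            /* some table's OID (hypothetical) */

    /* A running scan reports that it has reached block 90000 ... */
    report_scan_location(rel, 90000);

    /* ... so a new scan on the same table starts well behind it. */
    printf("new scan on relation %u starts at block %u\n",
           (unsigned) rel, (unsigned) choose_start_block(rel));
    return 0;
}

In this framing, setting sync_scan_offset too low leaves cached trailing blocks unused, while setting it too high starts the new scan on blocks that have already been evicted, which is the divergence Heikki describes.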
From: Jim Nasby on 6 Mar 2007 19:38

On Mar 6, 2007, at 12:17 AM, Tom Lane wrote:
> Jim Nasby <decibel(a)decibel.org> writes:
>> An idea I've been thinking about would be to have the bgwriter or
>> some other background process actually try and keep the free list
>> populated,
>
> The bgwriter already tries to keep pages "just in front" of the clock
> sweep pointer clean.

True, but that still means that each backend has to run the clock
sweep. AFAICT that's something that backends will serialize on (due to
BufFreelistLock), so it would be best to make StrategyGetBuffer as fast
as possible. It certainly seems like grabbing a buffer off the free
list is going to be a lot faster than running the clock sweep.

That's why I think it'd be better to have the bgwriter run the clock
sweep and put enough buffers on the free list to try and keep up with
demand.

--
Jim Nasby                                            jim(a)nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
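A simplified model of the proposal (illustrative names, no locking or buffer pins, not the real bufmgr code): the bgwriter runs the clock sweep in the background and queues clean victim buffers on a free list, so that buffer allocation in the backends usually reduces to popping the list instead of sweeping under BufFreelistLock.

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS        16384
#define FREELIST_TARGET 128         /* buffers the bgwriter keeps queued */

typedef struct
{
    int     usage_count;            /* clock-sweep usage counter */
    bool    dirty;
    int     free_next;              /* next entry in the free list, -1 = none */
} BufferDescSketch;

static BufferDescSketch buffers[NBUFFERS];
static int  freelist_head = -1;
static int  freelist_len = 0;
static int  clock_hand = 0;

/* Backend path: prefer the free list; fall back to sweeping inline. */
static int
strategy_get_buffer(void)
{
    if (freelist_head >= 0)
    {
        int     buf = freelist_head;

        freelist_head = buffers[buf].free_next;
        freelist_len--;
        return buf;                 /* cheap: no sweep needed */
    }

    /* Fallback: run the clock sweep ourselves, as backends do today. */
    for (;;)
    {
        int     buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (buffers[buf].usage_count > 0)
            buffers[buf].usage_count--;
        else if (!buffers[buf].dirty)
            return buf;
    }
}

/* Bgwriter path: sweep ahead of demand and refill the free list. */
static void
bgwriter_fill_freelist(void)
{
    while (freelist_len < FREELIST_TARGET)
    {
        int     buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (buffers[buf].usage_count > 0)
        {
            buffers[buf].usage_count--;
            continue;
        }
        if (buffers[buf].dirty)
            continue;               /* the real bgwriter would write it out */

        buffers[buf].free_next = freelist_head;
        freelist_head = buf;
        freelist_len++;
    }
}

int
main(void)
{
    bgwriter_fill_freelist();       /* background refill */
    printf("backend got buffer %d from the free list\n",
           strategy_get_buffer());
    return 0;
}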
From: Jim Nasby on 6 Mar 2007 19:43

On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
>> We also don't need an exact count, either. Perhaps there's some way
>> we could keep a counter or something...
>
> Exact count of what? The pages already in cache?

Yes. The idea being that if you see there are 10k pages in cache, you
can likely start 9k pages behind the current scan point and still pick
everything up.

But this is nowhere near as useful as the bitmap idea, so I'd only look
at it if it's impossible to make the bitmaps work. And like others have
said, that should wait until there's at least a first-generation patch
that's going to make it into 8.3.

--
Jim Nasby                                            jim(a)nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
From: Jeff Davis on 6 Mar 2007 21:10
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote:
> On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
> >> We also don't need an exact count, either. Perhaps there's some way
> >> we could keep a counter or something...
> >
> > Exact count of what? The pages already in cache?
>
> Yes. The idea being if you see there's 10k pages in cache, you can
> likely start 9k pages behind the current scan point and still pick
> everything up.
>
> But this is nowhere near as useful as the bitmap idea, so I'd only
> look at it if it's impossible to make the bitmaps work. And like
> others have said, that should wait until there's at least a
> first-generation patch that's going to make it into 8.3.

You still haven't told me how we take advantage of the OS buffer cache
with the bitmap idea. What makes you think that my current
implementation is "nowhere near as useful as the bitmap idea"? My
current implementation is making use of OS buffers + shared memory;
the bitmap idea can only make use of shared memory, and is likely
throwing the OS buffers away completely.

I also suspect that the bitmap idea relies too much on the idea that
there's a contiguous cache trail in the shared buffers alone. Any
deviation from that -- which could be caused by PG's page replacement
algorithm, especially in combination with a varied load pattern --
would negate any benefit from the bitmap idea.

I feel much more confident that there will exist a trail of pages that
are cached in *either* the PG shared buffers *or* the OS buffer cache.
There may be holes/gaps in either one, but it's much more likely that
they combine into a contiguous series of cached pages. Do you have an
idea how I might test this claim?

Regards,
	Jeff Davis
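One possible way to test the OS-cache half of that claim, sketched under the assumption of an 8KB block size and a Unix-like system with mincore() (the pg_buffercache contrib module would cover the shared-buffer half): mmap a relation segment file and ask mincore() which of its blocks are resident in the OS page cache, then union that with the shared-buffer contents and look for gaps behind the scan position.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192                 /* PostgreSQL block size */

int
main(int argc, char **argv)
{
    struct stat st;
    unsigned char *vec;
    size_t      pagesize = (size_t) sysconf(_SC_PAGESIZE);
    size_t      npages;
    size_t      nblocks;
    void       *map;
    int         fd;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <relation-segment-file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0)
    {
        perror(argv[1]);
        return 1;
    }

    map = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    npages = ((size_t) st.st_size + pagesize - 1) / pagesize;
    vec = malloc(npages);

    /* One residency byte per OS page; bit 0 set means "in page cache". */
    if (mincore(map, (size_t) st.st_size, vec) < 0)
    {
        perror("mincore");
        return 1;
    }

    /* Report residency per 8KB PostgreSQL block. */
    nblocks = (size_t) st.st_size / BLCKSZ;
    for (size_t blk = 0; blk < nblocks; blk++)
    {
        size_t  first = blk * BLCKSZ / pagesize;
        size_t  last = ((blk + 1) * BLCKSZ - 1) / pagesize;
        int     cached = 1;

        for (size_t p = first; p <= last && p < npages; p++)
            if (!(vec[p] & 1))
                cached = 0;

        printf("block %zu: %s\n", blk, cached ? "cached" : "not cached");
    }

    munmap(map, (size_t) st.st_size);
    free(vec);
    close(fd);
    return 0;
}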