From: Jim Nasby on
On Mar 5, 2007, at 11:46 AM, Josh Berkus wrote:
> Tom,
>
>> I seem to recall that we've previously discussed the idea of letting
>> the clock sweep decrement the usage_count before testing for 0, so
>> that a buffer could be reused on the first sweep after it was
>> initially used, but that we rejected it as being a bad idea. But at
>> least with large shared_buffers it doesn't sound like such a bad idea.
>
> We did discuss a number of formulas for setting buffers with different
> clock-sweep numbers, including ones with higher usage_count for indexes
> and starting numbers of 0 for large seq scans as well as vacuums.
> However, we didn't have any way to prove that any of these complex
> algorithms would result in higher performance, so we went with the
> simplest formula, with the idea of tinkering with it when we had more
> data. So maybe now's the time.
>
> Note, though, that the current algorithm is working very, very well for
> OLTP benchmarks, so we'd want to be careful not to gain performance in
> one area at the expense of another. In TPCE testing, we've been able to
> increase shared_buffers to 10GB with beneficial performance effect
> (numbers posted when I have them) and even found that "taking over RAM"
> with the shared_buffers (ala Oracle) gave us equivalent performance to
> using the FS cache. (Yes, this means that with a little I/O management
> engineering we could contemplate discarding use of the FS cache for a
> net performance gain. Maybe for 8.4.)

An idea I've been thinking about would be to have the bgwriter or
some other background process actually try to keep the free list
populated, so that backends needing to grab a page would be much more
likely to find one there (and not have to scan through the entire
buffer pool, perhaps multiple times).

My thought is to keep track of how many page requests occurred during
a given interval, and to use that value (probably averaged over time)
to determine how many pages we'd like to see on the free list. The
background process would then run through the buffers, decrementing
usage counts, until it found enough for the free list. Before putting
a buffer on the free list, it would write the buffer out; I'm not sure
whether it would make sense to disassociate the buffer from whatever
it had been storing. If we don't do that, we could pull pages back off
the free list if we wanted to, which would be helpful if the
background process got a bit over-zealous.
--
Jim Nasby jim(a)nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

From: Jim Nasby on
On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a
> variable like that at all, but scan the buffer cache for pages
> belonging to the table you're scanning to initialize the scan.
> Scanning all the BufferDescs is a fairly CPU and lock heavy
> operation, but it might be ok given that we're talking about large
> I/O bound sequential scans. It would require no DBA tuning and
> would work more robustly in varying conditions. I'm not sure where
> you would continue after scanning the in-cache pages. At the
> highest in-cache block number, perhaps.

If there were some way to do that, it'd be what I'd vote for.

Given the partitioning of the buffer lock that Tom did, it might not
be that horrible in many cases, either, since you'd only need to scan
through one partition.

We don't need an exact count, either. Perhaps there's some way we
could keep a counter or something...
--
Jim Nasby jim(a)nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)




From: Tom Lane on
Jim Nasby <decibel(a)decibel.org> writes:
> An idea I've been thinking about would be to have the bgwriter or
> some other background process actually try and keep the free list
> populated,

The bgwriter already tries to keep pages "just in front" of the clock
sweep pointer clean.

regards, tom lane


From: "Simon Riggs" on
On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote:
> Simon Riggs wrote:

> But it would break the idea of letting a second seqscan follow in the
> first's hot cache trail, no?

No, but it would make it somewhat harder to achieve without direct
synchronization between scans. It could still work; let's see.

I'm not sure that's an argument against fixing the problem with the
buffer strategy, though. We really want both, not just one or the other.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com




From: Sherry Moore on
Hi Tom,

Sorry about the delay. I have been away from computers all day.

In the current Solaris release in development (code name Nevada,
available for download at http://opensolaris.org), I have implemented
non-temporal access (NTA), which bypasses L2 for most writes, and for
reads larger than copyout_max_cached (patchable, default 128K). The
block size used by Postgres is 8KB. If I patch copyout_max_cached to
4KB to trigger NTA for reads, the access times with a 16KB buffer and
a 128MB buffer are very close.

I wrote readtest to simulate the access pattern of VACUUM (attached).
tread is a 4-socket dual-core Opteron box.

<81 tread >./readtest -h
Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
-v: Verbose mode
-N: Normalize results by number of reads
-s <size>: Working set size (may specify K,M,G suffix)
-n iter: Number of test iterations
-f filename: Name of the file to read from
-d [+|-]delta: Distance between subsequent reads
-c count: Number of reads
-h: Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

<82 tread >./readtest -s 16k -f boot_archive
46445262
<83 tread >./readtest -s 128M -f boot_archive
118294230
<84 tread >./readtest -s 16k -f boot_archive -n 100
4230210856
<85 tread >./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

<89 tread >./readtest -s 16k -f boot_archive
43606882
<90 tread >./readtest -s 128M -f boot_archive
100547909
<91 tread >./readtest -s 16k -f boot_archive -n 100
4251823995
<92 tread >./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration count is 1 (the default), the timing difference
between the 16K buffer and the 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses. When
the iteration count is bigger, most of the page tables are already in
the Page Descriptor Cache for the later page accesses, so the overhead
of TLB misses becomes smaller. As you can see, when we do bypass L2,
the performance with either buffer size is comparable.

I am sure your next question is why the 128K limitation for reads.
Here are the main reasons:

- Based on a lot of the benchmarks and workloads I traced, the
target buffer of read operations are typically accessed again
shortly after the read, while writes are usually not. Therefore,
the default operation mode is to bypass L2 for writes, but not
for reads.

- The Opteron's L1 cache size is 64K. A read larger than 128KB
  would displacement-flush the L1 anyway, so for large reads I
  also bypass L2. I am working on dynamically setting
  copyout_max_cached based on the L1 D-cache size on the system.

The above heuristic should have worked well in Luke's test case.
However, because the read was done as 16,000 8K reads rather than one
128MB read, the NTA code was not triggered.

Since the OS code has to be general enough to handle most workloads,
we have to pick defaults that might not work best for some specific
operations. It is a calculated balance.

Thanks,
Sherry


On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> "Luke Lonergan" <LLonergan(a)greenplum.com> writes:
> > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > wrote it).
>
> Cool. Maybe Sherry can comment on the question whether it's possible
> for a large-scale-memcpy to not take a hit on filling a cache line
> that wasn't previously in cache?
>
> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler which is something I've never studied :-(.
>
> regards, tom lane

--
Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym