From: Jeff Davis
On Mon, 2007-03-05 at 21:03 +0000, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a variable
> like that at all, but scan the buffer cache for pages belonging to the
> table you're scanning to initialize the scan. Scanning all the
> BufferDescs is a fairly CPU and lock heavy operation, but it might be ok
> given that we're talking about large I/O bound sequential scans. It
> would require no DBA tuning and would work more robustly in varying
> conditions. I'm not sure where you would continue after scanning the
> in-cache pages. At the highest in-cache block number, perhaps.
>

I assume you're referring to this:

"each backend keeps a bitmap of pages it has processed during the scan,
and read the pages in the order they're available in cache."

which I think is a great idea. However, I was unable to devise a good
answer to all these questions at once:

* How do we attempt to maintain sequential reads on the underlying I/O
layer?

* My current implementation takes advantage of the OS buffer cache; how
could we maintain that advantage with PostgreSQL-specific cache logic?

* How do I test to see whether it actually helps in a realistic
scenario? It seems like it would help the most when scans are
progressing at different rates, but how often do people have CPU-bound
queries on tables that don't fit into physical memory (and how long
would it take for me to benchmark such a query)?

It seems like your idea is more analytical, and my current
implementation is more guesswork. I like the analytical approach, but I
don't know that we have enough information to pull it off, because we
can't see what's in the OS buffer cache. The OS buffer cache is crucial
to Synchronized Scanning: shared buffers are evicted under a more
complex set of circumstances, whereas the OS buffer cache is usually
LRU and so forms a nicer "cache trail" (upon which Synchronized
Scanning largely depends).

If you have some tests you'd like me to run, I'm planning to do some
benchmarks this week and next. I can see if my current patch holds up
under the scenarios you're worried about.

Regards,
Jeff Davis




---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo(a)postgresql.org so that your
message can get through to the mailing list cleanly

From: "Florian G. Pflug"
Simon Riggs wrote:
> On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon(a)2ndquadrant.com> writes:
>>> Itagaki-san and I were discussing in January the idea of cache-looping,
>>> whereby a process begins to reuse its own buffers in a ring of ~32
>>> buffers. When we cycle back round, if usage_count==1 then we assume that
>>> we can reuse that buffer. This avoids cache swamping for read and write
>>> workloads, plus avoids too-frequent WAL writing for VACUUM.
>>> This would maintain the beneficial behaviour for OLTP,
>> Justify that claim. It sounds to me like this would act very nearly the
>> same as having shared_buffers == 32 ...
>
> Sure. We wouldn't set the hint for IndexScans or Inserts, only for
> SeqScans, VACUUM and COPY.
>
> So OLTP-only workloads would be entirely unaffected. In the presence of
> a mixed workload the scan tasks would have only a limited effect on the
> cache, maintaining performance for the response time critical tasks. So
> it's an OLTP benefit because of cache protection and WAL-flush reduction
> during VACUUM.
>
> As we've seen, the scan tasks look like they'll go faster with this.

But it would break the idea of letting a second seqscan follow in the
first's hot cache trail, no?

greetings, Florian Pflug


From: Mark Kirkwood
Tom Lane wrote:
>
> But what I wanted to see was the curve of
> elapsed time vs shared_buffers?
>

Of course! (Let's just write that off to me being pre-coffee...)

With the patch applied:

Shared Buffers Elapsed vmstat IO rate
-------------- ------- --------------
400MB 101 s 122 MB/s
2MB 101 s
1MB 97 s
768KB 94 s
512KB 84 s
256KB 79 s
128KB 75 s 166 MB/s

Looks *very* similar.

Cheers

Mark


From: Tom Lane
Mark Kirkwood <markir(a)paradise.net.nz> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself. I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so). Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) has to do anything but seqscan as fast as it can. Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
shared_buffers
----------------
128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
count
----------
10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior. (I'm
keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
samples| %|
------------------
141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26368 13.7984 /home/tgl/testversion/bin/postgres
12765 6.6799 /libata
2238 1.1711 /lib64/libc-2.4.so
1112 0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
samples| %|
------------------
113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
26284 16.3070 /home/tgl/testversion/bin/postgres
12004 7.4475 /libata
2093 1.2985 /lib64/libc-2.4.so
996 0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples % symbol name
57779 40.9591 copy_user_generic
18175 12.8841 __delay
3994 2.8313 _raw_spin_lock
2388 1.6928 put_page
2184 1.5482 mwait_idle
2083 1.4766 _raw_write_unlock
1909 1.3533 _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples % symbol name
37437 33.0783 copy_user_generic
17891 15.8080 __delay
3372 2.9794 _raw_spin_lock
2218 1.9598 mwait_idle
2067 1.8263 _raw_write_unlock
1837 1.6231 _raw_write_lock
1531 1.3527 put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache. This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.

The other thing that was bothering me is why does the SELECT change
VACUUM's behavior? A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist. In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited. But immediately
after postmaster start, (nearly) all the buffers are in the freelist and
so they all cycle through VACUUM's usage. In any real-world situation,
of course, the freelist is going to be nearly empty most of the time and
so I don't think this part is worth changing.

So I now concede Luke's argument that this behavior is related to L2
cache usage. But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM. I'm very worried about
pessimizing other cases if we do. ISTM there's a fairly clear case that
this might be fixable at the kernel level. Moreover, the issue only
arises because of the way that the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway. Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.

regards, tom lane


From: Gregory Stark

"Tom Lane" <tgl(a)sss.pgh.pa.us> writes:

> I don't see any good reason why overwriting a whole cache line oughtn't be
> the same speed either way.

I can think of a couple of theories, but I don't know if they're
reasonable. The one that comes to mind is the inter-processor cache
coherency protocol: when writing to a cache line the processor already
owns, perhaps it can skip checking whether any other processor owns
that line?

What happens if VACUUM comes across buffers that *are* already in the
buffer cache? Does it throw those on the freelist too? That seems like
it would be dangerous if they were in the buffer cache for a reason.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
