From: "Luke Lonergan" on 4 Mar 2007 14:11

I'm putting this out there before we publish a fix so that we can discuss how best to fix it.

Doug and Sherry recently found the source of an important performance issue with the Postgres shared buffer cache. The issue is summarized like this: the buffer cache in PGSQL is not "scan resistant" as advertised. A sequential scan of a table larger than cache will pollute the buffer cache in almost all circumstances.

Here is the performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500 (thumper-3) with 4 cores, where "bigtable" is a table 2x the size of RAM and "memtable" is a table that fits into the I/O cache. With our default setting of shared_buffers (16MB):

Operation        memtable   bigtable
------------------------------------
SELECT COUNT(*)  1221 MB/s   973 MB/s
VACUUM           1709 MB/s  1206 MB/s

We had observed that VACUUM would perform better when done right after a SELECT. In the above example, the faster rate from disk was 1608 MB/s, compared to the normal rate of 1206 MB/s. We verified this behavior on Postgres 8.2 as well.

The buffer selection algorithm chooses buffer pages scattered throughout the buffer cache in almost all circumstances. Sherry traced the behavior to the processor repeatedly flushing the L2 cache. Doug found that we weren't using the Postgres buffer cache the way we expected: we were loading the scanned data from disk into the cache even though there was no possibility of reusing it. Besides pushing other, possibly useful pages out of the cache, this also invalidates the L2 cache for the remainder of the executor path that uses the data.
To prove that the buffer cache was the source of the problem, we dropped the shared buffer size to fit into L2 cache (1MB per Opteron core). With shared_buffers reduced to 680KB, this is what we saw:

Operation        memtable   bigtable
------------------------------------
SELECT COUNT(*)  1320 MB/s  1059 MB/s
VACUUM           3033 MB/s  1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm? Ideally, only one page would be used in the buffer cache in circumstances where the table to be scanned is (significantly?) larger than the size of the buffer cache.

- Luke

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not match
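[Editor's sketch] One possible shape for the fix Luke asks about — recycle a small fixed ring of buffers when a scan is known to be larger than the pool, instead of sweeping the whole cache. This is an illustrative Python model, not PostgreSQL code; the class, method names, and the ring size of 4 are all assumptions made for the example.

```python
# Illustrative model of a "scan-resistant" buffer pool: large scans
# recycle a tiny ring of slots; small tables populate the cache normally.
from collections import OrderedDict

class BufferPool:
    def __init__(self, nbuffers, ring_size=4):
        self.nbuffers = nbuffers
        self.ring_size = ring_size
        self.cache = OrderedDict()   # page -> slot, in LRU order

    def scan(self, npages):
        """Read pages 0..npages-1 sequentially; return the slots touched."""
        touched = set()
        if npages > self.nbuffers:
            # Scan-resistant path: reuse a tiny ring of slots, leaving the
            # rest of the cache (and the L2 lines backing it) alone.
            for page in range(npages):
                touched.add(page % self.ring_size)
        else:
            # Normal path: cache every page, evicting LRU when full.
            for page in range(npages):
                if page in self.cache:
                    self.cache.move_to_end(page)
                else:
                    if len(self.cache) >= self.nbuffers:
                        self.cache.popitem(last=False)   # evict LRU
                    self.cache[page] = None
                touched.add(page % self.nbuffers)
        return touched

pool = BufferPool(nbuffers=1000)
assert len(pool.scan(10_000)) == 4     # big scan touches only the ring
assert len(pool.scan(500)) == 500      # small table is cached normally
```

The key property is that a scan of any size has a bounded buffer footprint (here 4 pages, i.e. 32KB at 8KB per page), so it can stay L2-resident regardless of how large shared_buffers is.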
From: Tom Lane on 4 Mar 2007 20:36

"Luke Lonergan" <llonergan(a)greenplum.com> writes:
> The issue is summarized like this: the buffer cache in PGSQL is not
> "scan resistant" as advertised.

Sure it is. As near as I can tell, your real complaint is that the bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache; which is hardly surprising considering it doesn't know the size of L2 cache. That's not a consideration that we've ever taken into account.

I'm also less than convinced that it'd be helpful for a big seqscan: won't reading a new disk page into memory via DMA cause that memory to get flushed from the processor cache anyway? I wonder whether your numbers are explained by some other consideration than you think.

regards, tom lane
From: "Luke Lonergan" on 4 Mar 2007 21:32

When we instrument the page selections made within the buffer cache, they are sequential and span the entire address space of the cache.

With respect to whether it's L2: that's a conclusion based on the experimental results. It's not the TLB, as we also tested for the 512 entries for each L2.

One thing I left out of the previous post: the difference between the fast and slow behavior was that in the fast case, the buffer selection alternated between two buffer pages. This was the case only when the preceding statement was a SELECT and the current statement was VACUUM.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Tom Lane [mailto:tgl(a)sss.pgh.pa.us]
Sent: Sunday, March 04, 2007 08:36 PM Eastern Standard Time
To: Luke Lonergan
Cc: PGSQL Hackers; Doug Rady; Sherry Moore
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant
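[Editor's note] A back-of-envelope check of the fast case Luke describes: alternating between two 8KB buffer pages gives a 16KB working set, which sits comfortably inside the 1MB per-core Opteron L2 from the test rig, while sweeping the default 16MB shared_buffers evicts L2 sixteen times over per pass. The sizes come from the thread; the arithmetic itself is the editor's.

```python
# Buffer working set of the fast vs. slow VACUUM cases, using sizes
# quoted in the thread (8 KB pages, 1 MB Opteron L2, 16 MB shared_buffers).
PAGE = 8 * 1024           # PostgreSQL block size
L2 = 1 * 1024 * 1024      # per-core Opteron L2 on the X4500

fast_case = 2 * PAGE          # fast case: two alternating buffer pages
slow_case = 16 * 1024 * 1024  # slow case: the whole default shared_buffers

assert fast_case < L2             # 16 KB: stays L2-resident
assert slow_case // L2 == 16      # sweeps L2 sixteen times per pass
```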
From: "Luke Lonergan" on 4 Mar 2007 21:35

One more thing: the L2 is invalidated when re-written from the kernel I/O cache, but the pages addressed in L2 retain their values when 'written through', which allows the new data to be re-used up the executor chain.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Tom Lane [mailto:tgl(a)sss.pgh.pa.us]
Sent: Sunday, March 04, 2007 08:36 PM Eastern Standard Time
To: Luke Lonergan
Cc: PGSQL Hackers; Doug Rady; Sherry Moore
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant
From: Mark Kirkwood on 5 Mar 2007 00:03
Tom Lane wrote:
> "Luke Lonergan" <llonergan(a)greenplum.com> writes:
>> The issue is summarized like this: the buffer cache in PGSQL is not "scan
>> resistant" as advertised.
>
> Sure it is. As near as I can tell, your real complaint is that the
> bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
> which is hardly surprising considering it doesn't know the size of L2
> cache. That's not a consideration that we've ever taken into account.

To add a little to this - forgetting the scan resistant point for the moment... cranking down shared_buffers to be smaller than the L2 cache seems to help *any* sequential scan immensely, even on quite modest HW.

E.g. PIII 1.26GHz, 512KB L2 cache, 2G RAM; SELECT count(*) FROM lineitem (which is about 11GB) performance:

Shared_buffers  Elapsed
--------------  -------
400MB           101 s
128KB            74 s

When I've profiled this activity, I've seen a lot of time spent searching for/allocating a new buffer for each page being fetched. Obviously having fewer of them to search through will help, but having less than an L2-cache-size worth of 'em seems to help a whole lot!

Cheers

Mark
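[Editor's sketch] Mark's profile points at the free-buffer search walking the array of buffer headers. A toy clock-style sweep (an editor's model — names and details are not the PostgreSQL symbols, and 8.1's actual search differs) shows why that search's memory working set is the entire header array, so it only fits in L2 when shared_buffers is small:

```python
# Toy clock sweep: each buffer has a usage count; the hand decrements
# counts until it finds a victim. We track which headers the hand visits.
def sweep_working_set(nbuffers, npages_scanned):
    usage = [0] * nbuffers
    hand = 0
    visited = set()
    for _ in range(npages_scanned):
        while True:
            visited.add(hand)
            if usage[hand] == 0:
                usage[hand] = 1              # victim found: claim it
                hand = (hand + 1) % nbuffers
                break
            usage[hand] -= 1                 # give it a second chance
            hand = (hand + 1) % nbuffers
    return len(visited)

# A scan larger than the pool forces the hand around the whole array:
big_pool = sweep_working_set(nbuffers=50_000, npages_scanned=100_000)
small_pool = sweep_working_set(nbuffers=80, npages_scanned=100_000)
assert big_pool == 50_000    # touches every header: too big for L2
assert small_pool == 80      # 80 headers: easily L2-resident
```

This is consistent with both observations in the thread: the large pool's header sweep (plus the scattered data pages themselves) keeps flushing L2, while a pool smaller than L2 keeps both headers and pages cache-resident.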