From: "Luke Lonergan" on 5 Mar 2007 20:42 Hi Tom, Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it). - Luke Msg is shrt cuz m on ma treo -----Original Message----- From: Tom Lane [mailto:tgl(a)sss.pgh.pa.us] Sent: Monday, March 05, 2007 07:43 PM Eastern Standard Time To: Mark Kirkwood Cc: Pavan Deolasee; Gavin Sherry; Luke Lonergan; PGSQL Hackers; Doug Rady; Sherry Moore Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant Mark Kirkwood <markir(a)paradise.net.nz> writes: > Tom Lane wrote: >> But what I wanted to see was the curve of >> elapsed time vs shared_buffers? > ... > Looks *very* similar. Yup, thanks for checking. I've been poking into this myself. I find that I can reproduce the behavior to some extent even with a slow disk drive (this machine is a dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test shows the disk read speed as 43MB/sec or so). Test case is a several-gig table, no indexes, fully vacuumed so that neither VACUUM nor COUNT(*) have to do anything but seqscan as fast as they can. Given a *freshly started* postmaster, I see regression=# show shared_buffers; shared_buffers ---------------- 128MB (1 row) regression=# \timing Timing is on. regression=# vacuum lineitem; VACUUM Time: 63652.333 ms regression=# vacuum lineitem; VACUUM Time: 63562.303 ms regression=# select count(*) from lineitem; count ---------- 10240000 (1 row) Time: 63142.174 ms regression=# vacuum lineitem; VACUUM Time: 61638.421 ms regression=# vacuum lineitem; VACUUM Time: 61785.905 ms I didn't show it here, but you can repeat the VACUUM all you want before the SELECT, and its times are stable; and you can repeat all you want after the SELECT, and the times are stable but a couple seconds lower. Restart the postmaster and it goes back to the slower behavior. (I'm keeping autovac off so it doesn't change the results.) 
I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26368 13.7984 /home/tgl/testversion/bin/postgres
    12765  6.6799 /libata
     2238  1.1711 /lib64/libc-2.4.so
     1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26284 16.3070 /home/tgl/testversion/bin/postgres
    12004  7.4475 /libata
     2093  1.2985 /lib64/libc-2.4.so
      996  0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly
different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
57779    40.9591  copy_user_generic
18175    12.8841  __delay
3994      2.8313  _raw_spin_lock
2388      1.6928  put_page
2184      1.5482  mwait_idle
2083      1.4766  _raw_write_unlock
1909      1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
37437    33.0783  copy_user_generic
17891    15.8080  __delay
3372      2.9794  _raw_spin_lock
2218      1.9598  mwait_idle
2067      1.8263  _raw_write_unlock
1837      1.6231  _raw_write_lock
1531      1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache. This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.
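The cache effect is easy to reproduce in isolation. Below is a minimal
standalone C sketch (not from the thread; the buffer sizes, iteration
counts, and use of memcpy as a stand-in for copy_user_generic are all
illustrative assumptions) that copies the same total volume of data
into a small, reused destination versus one far larger than L2:

/* Sketch: time repeated 8KB copies into a cache-resident destination
 * versus a cache-busting one.  Sizes are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SRC_SIZE   (8 * 1024)          /* one "page" worth of data */
#define SMALL_DST  (2 * SRC_SIZE)      /* two reused buffers: fits in L2 */
#define LARGE_DST  (256 * 1024 * 1024) /* far larger than any L2 */
#define TOTAL_COPY (1ULL << 32)        /* 4 GB copied in each test */

static double time_copies(char *dst, size_t dst_size, const char *src)
{
    struct timespec t0, t1;
    unsigned long long copied;
    size_t off = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (copied = 0; copied < TOTAL_COPY; copied += SRC_SIZE)
    {
        memcpy(dst + off, src, SRC_SIZE);   /* stand-in for the kernel copy */
        off = (off + SRC_SIZE) % dst_size;  /* walk the destination cyclically */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    char *src = malloc(SRC_SIZE);
    char *small = malloc(SMALL_DST);
    char *large = malloc(LARGE_DST);

    if (!src || !small || !large)
        return 1;
    memset(src, 'x', SRC_SIZE);
    memset(small, 0, SMALL_DST);
    memset(large, 0, LARGE_DST);        /* fault all pages in before timing */

    printf("cache-resident dst: %.2f s\n", time_copies(small, SMALL_DST, src));
    printf("cache-busting dst:  %.2f s\n", time_copies(large, LARGE_DST, src));
    return 0;
}

If the effect described above is at work, the small reused destination
should come out measurably faster even though the bytes copied are
identical.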
The other thing that was bothering me is: why does the SELECT change
VACUUM's behavior? A debugging printout added to ReadBuffer gave the
answer. After postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist. In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited. But immediately
after postmaster start, (nearly) all the buffers are in the freelist
and so they all cycle through VACUUM's usage. In any real-world
situation, of course, the freelist is going to be nearly empty most of
the time, so I don't think this part is worth changing.

So I now concede Luke's argument that this behavior is related to L2
cache usage. But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM. I'm very worried about
pessimizing other cases if we do. ISTM there's a fairly clear case that
this might be fixable at the kernel level. Moreover, the issue only
arises because of the way the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway. Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.

regards, tom lane
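The 9403/9402 alternation in the trace falls directly out of that
freelist discipline. A rough standalone sketch (names and sizes are
illustrative, not the actual PostgreSQL bufmgr code) of head-pop
allocation plus tail-push put-back reproduces the pattern:

/* Sketch of the freelist discipline described above: allocation pops
 * from the head, a VACUUM-style scan pushes finished buffers back on
 * the tail.  With a short list, the same few buffers cycle forever. */
#include <stdio.h>

#define NBUFFERS 16

static int free_next[NBUFFERS];
static int head = -1, tail = -1;

static void freelist_push_tail(int b)
{
    free_next[b] = -1;
    if (tail >= 0)
        free_next[tail] = b;
    else
        head = b;
    tail = b;
}

static int freelist_pop_head(void)
{
    int b = head;
    head = free_next[b];
    if (head < 0)
        tail = -1;
    return b;
}

int main(void)
{
    /* Post-SELECT state: only two buffers left on the freelist. */
    freelist_push_tail(0);
    freelist_push_tail(1);

    /* Take a buffer, fill it, put it back on the tail: the output
     * alternates 0,1,0,1,... just like buffers 9403/9402 above. */
    for (int i = 0; i < 8; i++)
    {
        int b = freelist_pop_head();
        printf("read block %d into buffer %d\n", 1000 + i, b);
        freelist_push_tail(b);
    }
    return 0;
}

With only two entries on the list, every pop returns the buffer pushed
two steps earlier, so the scan's working set stays at two pages no
matter how large the table is.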
From: Tom Lane on 5 Mar 2007 21:39

Gregory Stark <stark(a)enterprisedb.com> writes:
> What happens if VACUUM comes across buffers that *are* already in the
> buffer cache. Does it throw those on the freelist too?

Not unless they have usage_count 0, in which case they'd be subject to
recycling by the next clock sweep anyway.

regards, tom lane
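For reference, the clock sweep mentioned here works roughly as follows
(a simplified sketch loosely modeled on
src/backend/storage/buffer/freelist.c; the names and structure are
illustrative, not the actual code). Each buffer's usage_count is bumped
when the buffer is accessed, and the sweep hand decrements counts until
it finds an unpinned buffer at zero, which becomes the replacement
victim -- so a usage_count-0 buffer is equally reclaimable whether or
not it also sits on the freelist:

/* Simplified clock-sweep sketch; illustrative, not PostgreSQL code. */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 128

typedef struct
{
    int  usage_count;   /* bumped on access, decremented by the sweep */
    bool pinned;        /* currently in active use */
} BufSketch;

static BufSketch buf[NBUFFERS];
static int hand = 0;    /* the clock hand */

static int clock_sweep_victim(void)
{
    for (;;)
    {
        int i = hand;
        hand = (hand + 1) % NBUFFERS;

        if (buf[i].pinned)
            continue;                 /* in use: skip entirely */
        if (buf[i].usage_count > 0)
            buf[i].usage_count--;     /* spare it for another revolution */
        else
            return i;                 /* usage_count 0: recycle this one */
    }
}

int main(void)
{
    /* Buffers 0 and 1 were recently used; the sweep passes them by. */
    buf[0].usage_count = 2;
    buf[1].usage_count = 1;
    printf("victim: %d\n", clock_sweep_victim());   /* prints 2 */
    return 0;
}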
From: Tom Lane on 5 Mar 2007 22:58

"Luke Lonergan" <LLonergan(a)greenplum.com> writes:
> Good info - it's the same in Solaris, the routine is uiomove (Sherry
> wrote it).

Cool. Maybe Sherry can comment on the question whether it's possible
for a large-scale memcpy to not take a hit on filling a cache line that
wasn't previously in cache?

I looked a bit at the Linux code that's being used here, but it's all
x86_64 assembler which is something I've never studied :-(.

regards, tom lane
From: "Luke Lonergan" on 5 Mar 2007 23:23 Tom, On 3/5/07 7:58 PM, "Tom Lane" <tgl(a)sss.pgh.pa.us> wrote: > I looked a bit at the Linux code that's being used here, but it's all > x86_64 assembler which is something I've never studied :-(. Here's the C wrapper routine in Solaris: http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/ move.c Here's the x86 assembler routine for Solaris: http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32 /ml/copy.s The actual uiomove routine is a simple wrapper that calls the assembler kcopy or xcopyout routines. There are two versions (for Opteron), one that uses the NTA instructions that bypass the L2 cache on writing to avoid L2 cache pollution, and the second writes normally - through the L2 cache. Which one is used depends on a parameter (global) based on the size of the I/O. It is tuned to identify operations that might pollute the L2 cache (sound familiar?) I think what we're seeing is a generic artifact of the write-through behavior of the cache. I wouldn't expect this to get any better with DIRECTIO to the shared_buffers in pgsql - if we iterate over a large number of user space buffers we'll still hit the increased L2 thrashing. I think we're best off with a hybrid approach - when we "detect" a seq scan larger (much larger?) than buffer cache, we can switch into the "cache bypass" behavior, much like the above code uses the NTA instruction when appropriate. We can handle syncscan using a small buffer space. - Luke ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster
From: Tom Lane on 6 Mar 2007 00:29
"Luke Lonergan" <llonergan(a)greenplum.com> writes: > Here's the x86 assembler routine for Solaris: > http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32 > /ml/copy.s > The actual uiomove routine is a simple wrapper that calls the assembler > kcopy or xcopyout routines. There are two versions (for Opteron), one that > uses the NTA instructions that bypass the L2 cache on writing to avoid L2 > cache pollution, and the second writes normally - through the L2 cache. Hm. If it were bypassing the L2 cache then the hit would be taken when PG later tries to read the data. This is clearly not what's happening in the Linux system, but I've not seen any oprofile-equivalent data for Solaris? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |