From: "Luke Lonergan" on 5 Mar 2007 05:04 Hi Mark, > lineitem has 1535724 pages (11997 MB) > > Shared Buffers Elapsed IO rate (from vmstat) > -------------- ------- --------------------- > 400MB 101 s 122 MB/s > > 2MB 100 s > 1MB 97 s > 768KB 93 s > 512KB 86 s > 256KB 77 s > 128KB 74 s 166 MB/s > > I've added the observed IO rate for the two extreme cases > (the rest can be pretty much deduced via interpolation). > > Note that the system will do about 220 MB/s with the now > (in)famous dd test, so we have a bit of headroom (not too bad > for a PIII). What's really interesting: try this with a table that fits into I/O cache (say half your system memory), and run VACUUM on the table. That way the effect will stand out more dramatically. - Luke ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
From: "Luke Lonergan" on 5 Mar 2007 04:58 > > The Postgres shared buffer cache algorithm appears to have a bug. > > When there is a sequential scan the blocks are filling the entire > > shared buffer cache. This should be "fixed". > > No, this is not a bug; it is operating as designed. The > point of the current bufmgr algorithm is to replace the page > least recently used, and that's what it's doing. At least we've established that for certain. > If you want to lobby for changing the algorithm, then you > need to explain why one test case on one platform justifies > de-optimizing for a lot of other cases. In almost any > concurrent-access situation I think that what you are > suggesting would be a dead loss --- for instance we might as > well forget about Jeff Davis' synchronized-scan work. Instead of forgetting about it, we'd need to change it. > In any case, I'm still not convinced that you've identified > the problem correctly, because your explanation makes no > sense to me. How can the processor's L2 cache improve access > to data that it hasn't got yet? The evidence seems to clearly indicate reduced memory writing due to an L2 related effect. The actual data shows a dramatic reduction in main memory writing when the destination of the written data fits in the L2 cache. I'll try to fit a hypothesis to explain it. Assume you've got a warm IO cache in the OS. The heapscan algorithm now works like this: 0) select a destination user buffer 1) uiomove->kcopy memory from the IO cache to the user buffer 1A) step 1: read from kernel space 1B) step 2: write to user space 2) the user buffer is accessed many times by the executor nodes above Repeat There are two situations we are evaluating: one where the addresses of the user buffer are scattered over a space larger than the size of L2 (caseA) and one where they are confined to the size of L2 (caseB). Note that we could also consider another situation where the addresses are scattered over a space smaller than the TLB entries mapped by the L2 cache (512 max) and larger than the size of L2, but we've tried that and it proved uninteresting. For both cases step 1A is the same: each block (8KB) write from (1) will read from IO cache into 128 L2 (64B each) lines, evicting the previous data there. In step 1B for caseA the destination for the writes is mostly an address not currently mapped into L2 cache, so 128 victim L2 lines are found (LRU), stored into, and writes are flushed to main memory. In step 1B for caseB, the destination for the writes is located in L2 already. The 128 L2 lines are stored into, and the write to main memory is delayed under the assumption that these lines are "hot" as they were already in L2. I don't know enough to be sure this is the right answer, but it does fit the experimental data. - Luke ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org
From: Gregory Stark on 5 Mar 2007 05:10

"Luke Lonergan" <LLonergan(a)greenplum.com> writes:

> The evidence seems to clearly indicate reduced memory writing due to an
> L2 related effect.

You might try valgrind's cachegrind tool, which I understand can emulate
various processors' caches to show how efficiently code uses them. I haven't
done much with it, though, so I don't know how applicable it would be to a
large-scale effect like this.

--
Gregory Stark
EnterpriseDB    http://www.enterprisedb.com
From: "Pavan Deolasee" on 5 Mar 2007 12:22 Tom Lane wrote: > Mark Kirkwood <markir(a)paradise.net.nz> writes: > >> Shared Buffers Elapsed IO rate (from vmstat) >> -------------- ------- --------------------- >> 400MB 101 s 122 MB/s >> 2MB 100 s >> 1MB 97 s >> 768KB 93 s >> 512KB 86 s >> 256KB 77 s >> 128KB 74 s 166 MB/s >> > > So I'm back to asking what we're really measuring here. Buffer manager > inefficiency of some sort, but what? Have you tried oprofile? > Isn't the size of the shared buffer pool itself acting as a performance penalty in this case ? May be StrategyGetBuffer() needs to make multiple passes over the buffers before the usage_count of any buffer is reduced to zero and the buffer is chosen as replacement victim. There is no real advantage of having larger shared buffer pool in this particular test. A heap buffer is hardly accessed again once the seqscan passes over it. Can we try with a smaller value for BM_MAX_USAGE_COUNT and see if that has any positive impact for large shared pool in this case ? Thanks, Pavan ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
From: "Luke Lonergan" on 5 Mar 2007 12:22
Hi Tom,

On 3/5/07 8:53 AM, "Tom Lane" <tgl(a)sss.pgh.pa.us> wrote:

> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
> water. If it were a cache effect then there should be a performance
> cliff at the point where the cache size is exceeded. I see no such
> cliff, in fact the middle part of the curve is darn near a straight
> line on a log scale ...
>
> So I'm back to asking what we're really measuring here. Buffer manager
> inefficiency of some sort, but what? Have you tried oprofile?

How about looking at the CPU performance counters directly using cpustat:

  cpustat -c BU_fill_into_L2,umask=0x1 1

This shows us how many L2 fills there are on all four cores (we use all
four). Below are the traces of L2 fills for both cases: without buffer
cache pollution we fill 27 million lines, and with pollution we fill 44
million lines.

VACUUM orders (no buffer pollution):

 51.006  1  tick  2754293
 51.006  2  tick  3159565
 51.006  3  tick  2971929
 51.007  0  tick  3577487
 52.006  1  tick  4214179
 52.006  3  tick  3650193
 52.006  2  tick  3905828
 52.007  0  tick  3465261
 53.006  1  tick  1818766
 53.006  3  tick  1546018
 53.006  2  tick  1709385
 53.007  0  tick  1483371

VACUUM orders (with buffer pollution):

 22.006  0  tick  1576114
 22.006  1  tick  1542604
 22.006  2  tick  1987366
 22.006  3  tick  1784567
 23.006  3  tick  2706059
 23.006  2  tick  2362048
 23.006  0  tick  2190719
 23.006  1  tick  2088827
 24.006  0  tick  2247473
 24.006  1  tick  2153850
 24.006  2  tick  2422730
 24.006  3  tick  2758795
 25.006  0  tick  2419436
 25.006  1  tick  2229602
 25.006  2  tick  2619333
 25.006  3  tick  2712332
 26.006  1  tick  1827923
 26.006  2  tick  1886556
 26.006  3  tick  2909746
 26.006  0  tick  1467164