From: "Luke Lonergan" on
Hi Mark,

> lineitem has 1535724 pages (11997 MB)
>
> Shared Buffers   Elapsed   IO rate (from vmstat)
> --------------   -------   ---------------------
> 400MB            101 s     122 MB/s
>
> 2MB              100 s
> 1MB               97 s
> 768KB             93 s
> 512KB             86 s
> 256KB             77 s
> 128KB             74 s     166 MB/s
>
> I've added the observed IO rate for the two extreme cases
> (the rest can be pretty much deduced via interpolation).
>
> Note that the system will do about 220 MB/s with the now
> (in)famous dd test, so we have a bit of headroom (not too bad
> for a PIII).

What's really interesting: try this with a table that fits into I/O
cache (say half your system memory), and run VACUUM on the table. That
way the effect will stand out more dramatically.

- Luke



From: "Luke Lonergan" on


> > The Postgres shared buffer cache algorithm appears to have a bug.
> > When there is a sequential scan the blocks are filling the entire
> > shared buffer cache. This should be "fixed".
>
> No, this is not a bug; it is operating as designed. The
> point of the current bufmgr algorithm is to replace the page
> least recently used, and that's what it's doing.

At least we've established that for certain.

> If you want to lobby for changing the algorithm, then you
> need to explain why one test case on one platform justifies
> de-optimizing for a lot of other cases. In almost any
> concurrent-access situation I think that what you are
> suggesting would be a dead loss --- for instance we might as
> well forget about Jeff Davis' synchronized-scan work.

Instead of forgetting about it, we'd need to change it.

> In any case, I'm still not convinced that you've identified
> the problem correctly, because your explanation makes no
> sense to me. How can the processor's L2 cache improve access
> to data that it hasn't got yet?

The evidence seems to clearly indicate reduced memory writing due to an
L2 related effect. The actual data shows a dramatic reduction in main
memory writing when the destination of the written data fits in the L2
cache.

I'll try to fit a hypothesis to explain it. Assume you've got a warm IO
cache in the OS.

The heapscan algorithm now works like this:
0) select a destination user buffer
1) uiomove->kcopy the block from the IO cache to the user buffer
   1A) read from kernel space (the OS IO cache)
   1B) write to user space (the user buffer)
2) the user buffer is accessed many times by the executor nodes above
Repeat

There are two situations we are evaluating: one where the addresses of
the user buffers are scattered over a space larger than the L2 cache
(case A), and one where they are confined to a space that fits within
L2 (case B). Note that we could also consider a third situation, where
the addresses are scattered over a space larger than L2 but still
smaller than what the TLB entries mapped by the L2 cache (512 max) can
cover; we've tried that and it proved uninteresting.

For both cases step 1A is the same: each 8KB block copied in step (1)
is read from the IO cache into 128 L2 lines (64B each), evicting the
previous data there.

In step 1B for case A, the destination of the writes is mostly at
addresses not currently resident in the L2 cache, so 128 victim L2
lines are found (LRU), stored into, and the writes are flushed to main
memory.

In step 1B for case B, the destination of the writes is already
resident in L2. The 128 L2 lines are stored into, and the write-back to
main memory is deferred on the assumption that these lines are "hot",
since they were already in L2.

I don't know enough to be sure this is the right answer, but it does fit
the experimental data.
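
Here's a quick standalone C sketch of the two cases (the pool and
source sizes are arbitrary test values, nothing taken from the backend)
that should show the effect if the hypothesis is right: it copies 8KB
blocks from a large warm source region into destination buffers that
either cycle within a small pool (case B, fits in L2) or walk across a
large pool (case A, exceeds L2).

  /* L2 write-back sketch: case A (large pool) vs case B (small pool) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  #define BLOCK_SIZE  8192                  /* 8KB, like a heap page */
  #define SOURCE_SIZE (256 * 1024 * 1024)   /* stands in for the warm IO cache */
  #define SMALL_POOL  (128 * 1024)          /* case B: well under a typical L2 */
  #define LARGE_POOL  (64 * 1024 * 1024)    /* case A: far larger than any L2 */

  static double copy_blocks(char *pool, size_t pool_size, const char *src)
  {
      size_t nblocks = SOURCE_SIZE / BLOCK_SIZE;
      size_t slots = pool_size / BLOCK_SIZE;
      struct timespec t0, t1;

      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t i = 0; i < nblocks; i++)
      {
          /* step 1A/1B: read a block from the "IO cache", write it to the pool */
          memcpy(pool + (i % slots) * BLOCK_SIZE,
                 src + i * BLOCK_SIZE, BLOCK_SIZE);
      }
      clock_gettime(CLOCK_MONOTONIC, &t1);
      return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void)
  {
      char *src   = malloc(SOURCE_SIZE);
      char *small = malloc(SMALL_POOL);
      char *large = malloc(LARGE_POOL);

      if (!src || !small || !large)
          return 1;
      memset(src, 'x', SOURCE_SIZE);        /* warm the "IO cache" */

      printf("case B (pool fits in L2):  %.2f s\n",
             copy_blocks(small, SMALL_POOL, src));
      printf("case A (pool exceeds L2):  %.2f s\n",
             copy_blocks(large, LARGE_POOL, src));

      free(src); free(small); free(large);
      return 0;
  }

If the hypothesis holds, the small-pool run should show noticeably less
main memory write traffic (vmstat again) and a lower elapsed time for
the same amount of data copied.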

- Luke



From: Gregory Stark on
"Luke Lonergan" <LLonergan(a)greenplum.com> writes:

> The evidence seems to clearly indicate reduced memory writing due to an
> L2 related effect.

You might try using valgrind's cachegrind tool, which I understand can
emulate various processors' caches to show how efficiently code uses
them. I haven't done much with it, though, so I don't know how
applicable it would be to a large-scale effect like this.
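
For what it's worth, a typical invocation looks something like this
(the cache options are from memory, so check the docs for your valgrind
version before relying on them):

  valgrind --tool=cachegrind --L2=2097152,8,64 postgres --single -D $PGDATA test
  cg_annotate cachegrind.out.<pid>

where --L2 gives the simulated cache's size, associativity and line
size. It slows execution down enormously, so a much smaller table would
be advisable.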

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" on
Tom Lane wrote:
> Mark Kirkwood <markir(a)paradise.net.nz> writes:
>
>> Shared Buffers   Elapsed   IO rate (from vmstat)
>> --------------   -------   ---------------------
>> 400MB            101 s     122 MB/s
>> 2MB              100 s
>> 1MB               97 s
>> 768KB             93 s
>> 512KB             86 s
>> 256KB             77 s
>> 128KB             74 s     166 MB/s
>>
>
> So I'm back to asking what we're really measuring here. Buffer manager
> inefficiency of some sort, but what? Have you tried oprofile?
>
Isn't the size of the shared buffer pool itself acting as a performance
penalty in this case? Maybe StrategyGetBuffer() needs to make multiple
passes over the buffers before the usage_count of any buffer drops to
zero and that buffer can be chosen as the replacement victim.

There is no real advantage to having a larger shared buffer pool in
this particular test: a heap buffer is hardly accessed again once the
seqscan has passed over it. Can we try a smaller value for
BM_MAX_USAGE_COUNT and see if that has any positive impact for a large
shared pool in this case?
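
Just to illustrate the concern, here is a simplified sketch of the
clock sweep (not the real StrategyGetBuffer() code; locking, pins and
the freelist are all omitted):

  #include <stdlib.h>

  #define BM_MAX_USAGE_COUNT 5

  typedef struct
  {
      int usage_count;
      /* ... buffer tag, pins, page contents etc. omitted ... */
  } BufferSketch;

  static BufferSketch *buffers;    /* pool of NBuffers entries */
  static int NBuffers;
  static int next_victim;          /* the clock hand */

  /* each access bumps the count, capped at BM_MAX_USAGE_COUNT */
  static void
  pin_buffer(BufferSketch *buf)
  {
      if (buf->usage_count < BM_MAX_USAGE_COUNT)
          buf->usage_count++;
  }

  /* sweep the clock hand until some buffer's usage_count hits zero */
  static BufferSketch *
  get_victim_buffer(void)
  {
      for (;;)
      {
          BufferSketch *buf = &buffers[next_victim];

          next_victim = (next_victim + 1) % NBuffers;

          if (buf->usage_count == 0)
              return buf;          /* replacement victim found */
          buf->usage_count--;      /* age it and keep sweeping */
      }
  }

  int main(void)
  {
      NBuffers = 50000;            /* roughly 400MB of 8KB buffers */
      buffers = calloc(NBuffers, sizeof(BufferSketch));

      for (int i = 0; i < NBuffers; i++)
          pin_buffer(&buffers[i]); /* a seqscan touches every buffer once */

      get_victim_buffer();         /* must sweep and age the whole pool first */
      return 0;
  }

With a large pool that a seqscan has just filled (every buffer's
usage_count freshly bumped), the hand has to sweep and decrement a lot
of buffers before any usage_count reaches zero, whereas a tiny pool is
swept almost for free. A smaller BM_MAX_USAGE_COUNT bounds how many
such passes are needed.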

Thanks,
Pavan



From: "Luke Lonergan" on
Hi Tom,

On 3/5/07 8:53 AM, "Tom Lane" <tgl(a)sss.pgh.pa.us> wrote:

> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
> water. If it were a cache effect then there should be a performance
> cliff at the point where the cache size is exceeded. I see no such
> cliff, in fact the middle part of the curve is darn near a straight
> line on a log scale ...
>
> So I'm back to asking what we're really measuring here. Buffer manager
> inefficiency of some sort, but what? Have you tried oprofile?

How about looking at the CPU performance counters directly using cpustat:
cpustat -c BU_fill_into_L2,umask=0x1 1

This shows us how many L2 fills there are on all four cores (we use all
four). In the case without buffer cache pollution, below is the trace of L2
fills. In the pollution case we fill 27 million lines, in the pollution
case we fill 44 million lines.

VACUUM orders (no buffer pollution):
51.006 1 tick 2754293
51.006 2 tick 3159565
51.006 3 tick 2971929
51.007 0 tick 3577487
52.006 1 tick 4214179
52.006 3 tick 3650193
52.006 2 tick 3905828
52.007 0 tick 3465261
53.006 1 tick 1818766
53.006 3 tick 1546018
53.006 2 tick 1709385
53.007 0 tick 1483371

And here is the case with buffer pollution:
VACUUM orders (with buffer pollution):
22.006 0 tick 1576114
22.006 1 tick 1542604
22.006 2 tick 1987366
22.006 3 tick 1784567
23.006 3 tick 2706059
23.006 2 tick 2362048
23.006 0 tick 2190719
23.006 1 tick 2088827
24.006 0 tick 2247473
24.006 1 tick 2153850
24.006 2 tick 2422730
24.006 3 tick 2758795
25.006 0 tick 2419436
25.006 1 tick 2229602
25.006 2 tick 2619333
25.006 3 tick 2712332
26.006 1 tick 1827923
26.006 2 tick 1886556
26.006 3 tick 2909746
26.006 0 tick 1467164


