From: Greg Smith on
Robert Haas wrote:
> Well, why can't they just hang out as dirty buffers in the OS cache,
> which is also designed to solve this problem?
>

If the OS were guaranteed to be as suitable for this purpose as the
approach taken in the database, this might work. But much like the
clock sweep approach should outperform a simpler OS caching
implementation in many common workloads, there are a few spots
where making dirty writes the OS's problem can fall down:

1) That presumes OS write coalescing will solve the problem for you by
merging repeat writes, which, depending on the implementation, it might not.

2) On some filesystems, such as ext3, any write with an fsync behind it
will flush the whole write cache out and defeat this optimization.
Since the spread checkpoint design has some such writes going to the
data disk in the middle of the checkpoint currently being processed, in
those situations that's likely to push the first write of that block to
disk before it can be combined with a second. Had you kept it in the
database's buffer cache instead, it might survive as much as a full
checkpoint cycle longer.

3) The "timeout" as it were for shared buffers is driven by the distance
between checkpoints, typically as long as 5 minutes. The longest a
filesystem will hold onto a write is probably less. On Linux it's
typically 30 seconds before the OS considers a write important to get
out to disk, longest case; if you've already filled a lot of RAM with
writes it can be substantially less.
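
For reference, the knobs that control this on Linux are the vm.dirty_*
sysctls. A rough sketch of typical stock values (the exact defaults
vary by kernel version, so check /proc/sys/vm on your own system):

  vm.dirty_expire_centisecs = 3000     # dirty pages become flushable after ~30s
  vm.dirty_writeback_centisecs = 500   # writeback daemon wakes up every ~5s
  vm.dirty_background_ratio = 10       # background flushing starts at 10% of RAM
  vm.dirty_ratio = 20                  # writers are forced to flush at this point

Once either ratio is crossed, dirty pages start going out well before
the 30 second expiry, which is the "substantially less" case above.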

> I guess the obvious question is whether Windows "doesn't need" more
> shared memory than that, or whether it "can't effectively use" more
> memory than that.
>

It's probably "can't effectively use". We know for a fact that
applications where blocks regularly accumulate high usage counts and
see repeated reads/writes to them, which includes pgbench, benefit in
several easy-to-measure ways from using larger amounts of database
buffer cache. There's just plain old less churn of buffers going in and
out of there. The alternate explanation, "Windows is just so much
better at read/write caching that you should give it most of the RAM
anyway", doesn't really sound as probable as the more commonly proposed
theory that "Windows doesn't handle large blocks of shared memory well".

Note that there's no discussion of the why behind this in the commit
you just did, just a description of what happens. The reasons why are
left undefined, which I feel is appropriate given that we really don't
know for sure. I'm still waiting for somebody to let loose the Visual
Studio profiler and measure what's causing the degradation at larger
sizes.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us



From: Robert Haas on
On Fri, Apr 16, 2010 at 9:47 PM, Greg Smith <greg(a)2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Well, why can't they just hang out as dirty buffers in the OS cache,
>> which is also designed to solve this problem?
>
> If the OS were guaranteed to be as suitable for this purpose as the approach
> taken in the database, this might work.  But much like the clock sweep
> approach should outperform a simpler OS caching implementation in many
> common workloads, there are a few spots where making dirty writes the
> OS's problem can fall down:
>
> 1) That presumes OS write coalescing will solve the problem for you by
> merging repeat writes, which, depending on the implementation, it might not.
>
> 2) On some filesystems, such as ext3, any write with an fsync behind it will
> flush the whole write cache out and defeat this optimization.  Since the
> spread checkpoint design has some such writes going to the data disk in the
> middle of the checkpoint currently being processed, in those situations
> that's likely to push the first write of that block to disk before it can
> be combined with a second.  Had you kept it in the database's buffer cache
> instead, it might survive as much as a full checkpoint cycle longer.
>
> 3) The "timeout" as it were for shared buffers is driven by the distance
> between checkpoints, typically as long as 5 minutes.  The longest a
> filesystem will hold onto a write is probably less.  On Linux it's typically
> 30 seconds before the OS considers a write important to get out to disk,
> longest case; if you've already filled a lot of RAM with writes it can be
> substantially less.

Thanks for the explanation. That makes sense. Does this imply that
the problems with shared_buffers being too small will be less severe
under a read-mostly load?

>> I guess the obvious question is whether Windows "doesn't need" more
>> shared memory than that, or whether it "can't effectively use" more
>> memory than that.
>
> It's probably "can't effectively use".  We know for a fact that applications
> where blocks regularly accumulate high usage counts and see repeated
> reads/writes to them, which includes pgbench, benefit in several
> easy-to-measure ways from using larger amounts of database buffer cache.
> There's just plain old less churn of buffers going in and out of there.  The
> alternate explanation, "Windows is just so much better at read/write
> caching that you should give it most of the RAM anyway", doesn't really
> sound as probable as the more commonly proposed theory that "Windows
> doesn't handle large blocks of shared memory well".
>
> Note that there's no discussion of the why behind this in the commit you
> just did, just a description of what happens.  The reasons why are left
> undefined, which I feel is appropriate given that we really don't know for
> sure.  I'm still waiting for somebody to let loose the Visual Studio
> profiler and measure what's causing the degradation at larger sizes.

Right - my purpose in wanting to revise the documentation was not to
give a complete tutorial, which is obviously not practical, but to
give people some guidelines that are better than our previous
suggestion to use "a few tens of megabytes", which I think we've
accomplished. The follow-up questions are mostly for my own benefit
rather than the docs...

....Robert


From: Jim Nasby on
On Apr 16, 2010, at 4:56 PM, Robert Haas wrote:
> From reading this and other threads, I think I generally understand
> the perils of setting shared_buffers too high: memory is needed
> for other things, like work_mem, a problem which is exacerbated by the
> fact that there is some double buffering going on. Also, if the
> buffer cache gets too large, checkpoints can involve writing out
> enormous amounts of dirty data, which can be bad.

I've also seen large shared_buffers settings perform poorly for reasons
unrelated to I/O, presumably due to some kind of internal lock
contention. I tried running 8.3 with 24GB for a while, but dropped it
back down to our default of 8GB after noticing some performance
problems. Unfortunately I don't remember the exact details, let alone
have a repeatable test case.
--
Jim C. Nasby, Database Architect jim(a)nasby.net
512.569.9461 (cell) http://jim.nasby.net




From: "Kevin Grittner" on
Robert Haas <robertmhaas(a)gmail.com> wrote:

> 2. Reading the section on checkpoint_segments reminds me, again,
> that the current value seems extremely conservative on modern
> hardware. It's quite easy to hit this when doing large bulk data
> loads or even a big ol' CTAS. I think we should consider raising
> this for 9.1.

Perhaps, but be aware that the current default benchmarked better
than a larger setting in bulk loads.

http://archives.postgresql.org/pgsql-hackers/2009-06/msg01382.php

The apparent reason is that when there were fewer WAL files, they
were re-used before the RAID controller flushed them from its BBU
cache, causing an overall reduction in disk writes. I have little
doubt that *without* a good BBU-backed controller cache a larger
setting is better, but it's not universally true that bigger is
better on this one.
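
To put rough numbers on that (assuming the usual 16MB segment size and
ignoring the checkpoint_completion_target refinement), the amount of
WAL kept around is approximately:

  (2 * checkpoint_segments + 1) * 16MB
  checkpoint_segments = 3   ->  ~112MB
  checkpoint_segments = 64  ->  ~2GB

112MB can plausibly keep getting recycled entirely inside a typical
256MB-512MB BBU cache; 2GB of segments has no chance of doing that.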

-Kevin


From: Robert Haas on
On Mon, Apr 19, 2010 at 10:21 AM, Kevin Grittner
<Kevin.Grittner(a)wicourts.gov> wrote:
> Robert Haas <robertmhaas(a)gmail.com> wrote:
>
>> 2. Reading the section on checkpoint_segments reminds me, again,
>> that the current value seems extremely conservative on modern
>> hardware.  It's quite easy to hit this when doing large bulk data
>> loads or even a big ol' CTAS.  I think we should consider raising
>> this for 9.1.
>
> Perhaps, but be aware that the current default benchmarked better
> than a larger setting in bulk loads.
>
> http://archives.postgresql.org/pgsql-hackers/2009-06/msg01382.php
>
> The apparent reason is that when there were fewer WAL files, they
> were re-used before the RAID controller flushed them from its BBU
> cache, causing an overall reduction in disk writes.  I have little
> doubt that *without* a good BBU-backed controller cache a larger
> setting is better, but it's not universally true that bigger is
> better on this one.

I don't actually know what's best. I'm just concerned that we have a
default in postgresql.conf and a tuning guide that says "don't do
that". Maybe the tuning guide needs to be more nuanced, or maybe
postgresql.conf needs to be changed, but it makes no sense to have
them saying contradictory things.

....Robert
