From: Greg Smith on 16 Apr 2010 21:47

Robert Haas wrote:
> Well, why can't they just hang out as dirty buffers in the OS cache,
> which is also designed to solve this problem?

If the OS were guaranteed to be as suitable for this purpose as the approach taken in the database, this might work. But much like the clock-sweep approach should outperform a simpler OS caching implementation in many common workloads, there are a couple of spots where making dirty writes the OS's problem can fall down:

1) That presumes that OS write coalescing will solve the problem for you by merging repeat writes, which, depending on the implementation, it might not.

2) On some filesystems, such as ext3, any write with an fsync behind it will flush the whole write cache out and defeat this optimization. Since the spread checkpoint design has some such writes going to the data disk in the middle of the currently processing checkpoint, in those situations that's likely to push the first write of that block to disk before it can be combined with a second. If you'd kept it in the buffer cache, it might survive as long as a full checkpoint cycle longer.

3) The "timeout", as it were, for shared buffers is driven by the distance between checkpoints, typically as long as 5 minutes. The longest a filesystem will hold onto a write is probably less. On Linux it's typically 30 seconds before the OS considers a write important to get out to disk, longest case; if you've already filled a lot of RAM with writes it can be substantially less.

> I guess the obvious question is whether Windows "doesn't need" more
> shared memory than that, or whether it "can't effectively use" more
> memory than that.

It's probably "can't effectively use". We know for a fact that applications where blocks regularly accumulate high usage counts and have repeat reads/writes to them, which includes pgbench, benefit in several easy-to-measure ways from using larger amounts of database buffer cache. There's just plain old less churn of buffers going in and out of there. The alternate explanation, "Windows is just so much better at read/write caching that you should give it most of the RAM anyway", doesn't really sound as probable as the more commonly proposed theory, "Windows doesn't handle large blocks of shared memory well".

Note that there's no discussion of the why behind this in the commit you just did, just the description of what happens. The reasons why are left undefined, which I feel is appropriate given we really don't know for sure. Still waiting for somebody to let loose the Visual Studio profiler and measure what's causing the degradation at larger sizes.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com   www.2ndQuadrant.us
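(Greg's 30-second figure lines up with the Linux defaults of the era: vm.dirty_expire_centisecs = 3000, with flusher wakeups every vm.dirty_writeback_centisecs = 500.) The usage-count behavior he describes is also directly observable: the contrib pg_buffercache module exposes each buffer's usage count and dirty flag. A minimal sketch, assuming the module is installed in the current database and the server is recent enough (8.3+) to expose usagecount:

    -- Distribution of buffer usage counts, and how many buffers at each
    -- level are dirty. A sizable population of high-usagecount dirty
    -- buffers is exactly the case where keeping blocks in shared_buffers,
    -- rather than handing writes to the OS, pays off.
    SELECT usagecount,
           count(*) AS buffers,
           sum(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;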
From: Robert Haas on 16 Apr 2010 22:08

On Fri, Apr 16, 2010 at 9:47 PM, Greg Smith <greg(a)2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Well, why can't they just hang out as dirty buffers in the OS cache,
>> which is also designed to solve this problem?
>
> If the OS were guaranteed to be as suitable for this purpose as the
> approach taken in the database, this might work. But much like the
> clock-sweep approach should outperform a simpler OS caching
> implementation in many common workloads, there are a couple of spots
> where making dirty writes the OS's problem can fall down:
>
> 1) That presumes that OS write coalescing will solve the problem for
> you by merging repeat writes, which, depending on the implementation,
> it might not.
>
> 2) On some filesystems, such as ext3, any write with an fsync behind
> it will flush the whole write cache out and defeat this optimization.
> Since the spread checkpoint design has some such writes going to the
> data disk in the middle of the currently processing checkpoint, in
> those situations that's likely to push the first write of that block
> to disk before it can be combined with a second. If you'd kept it in
> the buffer cache, it might survive as long as a full checkpoint cycle
> longer.
>
> 3) The "timeout", as it were, for shared buffers is driven by the
> distance between checkpoints, typically as long as 5 minutes. The
> longest a filesystem will hold onto a write is probably less. On
> Linux it's typically 30 seconds before the OS considers a write
> important to get out to disk, longest case; if you've already filled
> a lot of RAM with writes it can be substantially less.

Thanks for the explanation. That makes sense. Does this imply that the problems with shared_buffers being too small are going to be less with a read-mostly load?

>> I guess the obvious question is whether Windows "doesn't need" more
>> shared memory than that, or whether it "can't effectively use" more
>> memory than that.
>
> It's probably "can't effectively use". We know for a fact that
> applications where blocks regularly accumulate high usage counts and
> have repeat reads/writes to them, which includes pgbench, benefit in
> several easy-to-measure ways from using larger amounts of database
> buffer cache. There's just plain old less churn of buffers going in
> and out of there. The alternate explanation, "Windows is just so much
> better at read/write caching that you should give it most of the RAM
> anyway", doesn't really sound as probable as the more commonly
> proposed theory, "Windows doesn't handle large blocks of shared
> memory well".
>
> Note that there's no discussion of the why behind this in the commit
> you just did, just the description of what happens. The reasons why
> are left undefined, which I feel is appropriate given we really don't
> know for sure. Still waiting for somebody to let loose the Visual
> Studio profiler and measure what's causing the degradation at larger
> sizes.

Right - my purpose in wanting to revise the documentation was not to give a complete tutorial, which is obviously not practical, but to give people some guidelines that are better than our previous suggestion to use "a few tens of megabytes", which I think we've accomplished. The follow-up questions are mostly for my own benefit rather than the docs...

....Robert
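On Robert's read-mostly question, one coarse signal is the shared_buffers hit ratio from the cumulative statistics views, which long predate this thread. A sketch; note that blks_read counts reads that missed shared_buffers even when they were then served from the OS cache, so this measures only the database-side cache:

    -- Fraction of block requests satisfied from shared_buffers for the
    -- current database; a persistently low ratio under a read-mostly
    -- workload suggests shared_buffers is too small, independent of the
    -- checkpoint/write-coalescing issues above.
    SELECT blks_hit::numeric / nullif(blks_hit + blks_read, 0) AS hit_ratio
    FROM pg_stat_database
    WHERE datname = current_database();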
From: Jim Nasby on 20 Apr 2010 13:07

On Apr 16, 2010, at 4:56 PM, Robert Haas wrote:
> From reading this and other threads, I think I generally understand
> the perils of setting shared_buffers too high: memory is needed for
> other things, like work_mem, a problem which is exacerbated by the
> fact that there is some double buffering going on. Also, if the
> buffer cache gets too large, checkpoints can involve writing out
> enormous amounts of dirty data, which can be bad.

I've also seen large shared buffer settings perform poorly outside of I/O issues, presumably due to some kind of internal lock contention. I tried running 8.3 with 24G for a while, but dropped it back down to our default of 8G after noticing some performance problems. Unfortunately I don't remember the exact details, let alone having a repeatable test case.

--
Jim C. Nasby, Database Architect   jim(a)nasby.net
512.569.9461 (cell)                http://jim.nasby.net
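Contention of the sort Jim suspects was difficult to observe from SQL on 8.3. On versions from 9.6 onward (well after this thread), pg_stat_activity exposes wait events, so the same suspicion could be checked by sampling; a sketch for those later versions only:

    -- Sample what active backends are currently waiting on; repeated
    -- samples dominated by LWLock waits on buffer-related locks would
    -- support the internal-contention theory for oversized shared_buffers.
    SELECT wait_event_type, wait_event, count(*) AS backends
    FROM pg_stat_activity
    WHERE wait_event IS NOT NULL
    GROUP BY wait_event_type, wait_event
    ORDER BY backends DESC;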
From: "Kevin Grittner" on 19 Apr 2010 10:21 Robert Haas <robertmhaas(a)gmail.com> wrote: > 2. Reading the section on checkpoint_segments reminds me, again, > that the current value seems extremely conservative on modern > hardware. It's quite easy to hit this when doing large bulk data > loads or even a big ol' CTAS. I think we should consider raising > this for 9.1. Perhaps, but be aware the current default benchmarked better than a larger setting in bulk loads. http://archives.postgresql.org/pgsql-hackers/2009-06/msg01382.php The apparent reason is that when there were fewer of them the WAL files were re-used before the RAID controller flushed them from BBU cache, causing an overall reduction in disk writes. I have little doubt that *without* a good BBU cached controller a larger setting is better, but it's not universally true that bigger is better on this one. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
From: Robert Haas on 19 Apr 2010 10:26
On Mon, Apr 19, 2010 at 10:21 AM, Kevin Grittner <Kevin.Grittner(a)wicourts.gov> wrote:
> Robert Haas <robertmhaas(a)gmail.com> wrote:
>> 2. Reading the section on checkpoint_segments reminds me, again,
>> that the current value seems extremely conservative on modern
>> hardware. It's quite easy to hit this when doing large bulk data
>> loads or even a big ol' CTAS. I think we should consider raising
>> this for 9.1.
>
> Perhaps, but be aware that the current default benchmarked better
> than a larger setting in bulk loads:
>
> http://archives.postgresql.org/pgsql-hackers/2009-06/msg01382.php
>
> The apparent reason is that when there were fewer of them, the WAL
> files were re-used before the RAID controller flushed them from BBU
> cache, causing an overall reduction in disk writes. I have little
> doubt that *without* a good BBU-cached controller a larger setting
> is better, but it's not universally true that bigger is better on
> this one.

I don't actually know what's best. I'm just concerned that we have a default in postgresql.conf and a tuning guide that says "don't do that". Maybe the tuning guide needs to be more nuanced, or maybe postgresql.conf needs to be changed, but it makes no sense to have them saying contradictory things.

....Robert
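For anyone deciding which side of that contradiction applies to their own workload: from 8.3 on, pg_stat_bgwriter records how checkpoints are being triggered, which is a reasonable first check before raising checkpoint_segments. A sketch:

    -- Checkpoints forced by filling checkpoint_segments count as
    -- checkpoints_req; if they dominate checkpoints_timed during normal
    -- operation (not just during bulk loads), the default is likely too
    -- low for the workload.
    SELECT checkpoints_timed, checkpoints_req
    FROM pg_stat_bgwriter;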