From: "Simon Riggs" on 6 Mar 2007 17:27 On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote: > - Based on a lot of the benchmarks and workloads I traced, the > target buffer of read operations are typically accessed again > shortly after the read, while writes are usually not. Therefore, > the default operation mode is to bypass L2 for writes, but not > for reads. Hi Sherry, I'm trying to relate what you've said to how we should proceed from here. My understanding of what you've said is: - Tom's assessment that the observed performance quirk could be fixed in the OS kernel is correct and you have the numbers to prove it - currently Solaris only does NTA for 128K reads, which we don't currently do. If we were to request 16 blocks at time, we would get this benefit on Solaris, at least. The copyout_max_cached parameter can be patched, but isn't a normal system tunable. - other workloads you've traced *do* reuse the same buffer again very soon afterwards when reading sequentially (not writes). Reducing the working set size is an effective technique in improving performance if we don't have a kernel that does NTA or we don't read in big enough chunks (we need both to get NTA to kick in). and what you haven't said - all of this is orthogonal to the issue of buffer cache spoiling in PostgreSQL itself. That issue does still exist as a non-OS issue, but we've been discussing in detail the specific case of L2 cache effects with specific kernel calls. All of the test results have been stand-alone, so we've not done any measurements in that area. I say this because you make the point that reducing the working set size of write workloads has no effect on the L2 cache issue, but ISTM its still potentially a cache spoiling issue. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
From: Jeff Davis on 6 Mar 2007 19:28

On Tue, 2007-03-06 at 18:47 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Jeff Davis <pgsql(a)j-davis.com> writes:
> >> If I were to implement this idea, I think Heikki's bitmap of pages
> >> already read is the way to go.
> >
> > I think that's a good way to guarantee that you'll not finish in time
> > for 8.3. Heikki's idea is just at the handwaving stage at this point,
> > and I'm not even convinced that it will offer any win. (Pages in
> > cache will be picked up by a seqscan already.)
>
> The scenario that I'm worried about is that you have a table that's
> slightly larger than RAM. If you issue many seqscans on that table, one
> at a time, every seqscan will have to read the whole table from disk,
> even though say 90% of it is in cache when the scan starts.

If you're issuing sequential scans one at a time, that 90% of the table
that was cached is probably not cached any more, unless the scans are
close together in time without overlapping (serial sequential scans).
And the problem you describe is no worse than the current behavior,
where you have exactly the same problem.

> This can be alleviated by using a large enough sync_scan_offset, but a
> single setting like that is tricky to tune, especially if your workload
> is not completely constant. Tune it too low, and you don't get much
> benefit; tune it too high and your scans diverge and you lose all
> benefit.

I see why you don't want to manually tune this setting; however, it's
really not that tricky. You can be quite conservative and still use a
good fraction of your physical memory. I will come up with some numbers
and see how much we have to gain.

Regards,
	Jeff Davis
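For reference, a minimal sketch of the sync_scan_offset idea being debated here. The names and the toy hint table are illustrative only, not the actual patch: a running seqscan periodically publishes its current block number in shared memory, and a new scan on the same relation starts sync_scan_offset blocks behind that hint, on the theory that those trailing blocks are still cached somewhere (shared buffers or the OS).

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;               /* stand-ins for the real typedefs */
typedef uint32_t BlockNumber;

#define SYNC_SCAN_NELEM 256         /* small fixed-size hint table */

typedef struct
{
    Oid         relid;              /* relation being scanned */
    BlockNumber location;           /* last block reported by a scan */
} ScanLocationHint;

/* In the real thing this would live in shared memory, under a lock. */
static ScanLocationHint scan_hints[SYNC_SCAN_NELEM];

/* Hypothetical GUC: how far behind the hint a new scan starts (blocks). */
static BlockNumber sync_scan_offset = 16384;

/* Called periodically by a running seqscan to publish its position. */
static void
report_scan_location(Oid relid, BlockNumber blkno)
{
    ScanLocationHint *hint = &scan_hints[relid % SYNC_SCAN_NELEM];

    hint->relid = relid;
    hint->location = blkno;
}

/*
 * Called when a new seqscan starts: begin sync_scan_offset blocks
 * behind the concurrent scan, or at block 0 if no hint is available.
 */
static BlockNumber
choose_start_block(Oid relid)
{
    ScanLocationHint *hint = &scan_hints[relid % SYNC_SCAN_NELEM];

    if (hint->relid != relid)
        return 0;

    if (hint->location >= sync_scan_offset)
        return hint->location - sync_scan_offset;

    return 0;
}

int
main(void)
{
    Oid     rel = 16384;            /* some table's OID (hypothetical) */

    /* A running scan reports that it has reached block 90000 ... */
    report_scan_location(rel, 90000);

    /* ... so a new scan on the same table starts well behind it. */
    printf("new scan on relation %u starts at block %u\n",
           (unsigned) rel, (unsigned) choose_start_block(rel));
    return 0;
}

In this framing, setting sync_scan_offset too low leaves cached trailing blocks unused, while setting it too high starts the new scan on blocks that have already been evicted, which is the divergence Heikki describes.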
From: Jim Nasby on 6 Mar 2007 19:38

On Mar 6, 2007, at 12:17 AM, Tom Lane wrote:
> Jim Nasby <decibel(a)decibel.org> writes:
>> An idea I've been thinking about would be to have the bgwriter or
>> some other background process actually try and keep the free list
>> populated,
>
> The bgwriter already tries to keep pages "just in front" of the clock
> sweep pointer clean.

True, but that still means that each backend has to run the clock
sweep. AFAICT that's something that backends will serialize on (due to
BufFreelistLock), so it would be best to make StrategyGetBuffer as fast
as possible. It certainly seems like grabbing a buffer off the free
list is going to be a lot faster than running the clock sweep.

That's why I think it'd be better to have the bgwriter run the clock
sweep and put enough buffers on the free list to try and keep up with
demand.

--
Jim Nasby                                            jim(a)nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
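A simplified model of the proposal (illustrative names, no locking or buffer pins, not the real bufmgr code): the bgwriter runs the clock sweep in the background and queues clean victim buffers on a free list, so that buffer allocation in the backends usually reduces to popping the list instead of sweeping under BufFreelistLock.

#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS        16384
#define FREELIST_TARGET 128         /* buffers the bgwriter keeps queued */

typedef struct
{
    int     usage_count;            /* clock-sweep usage counter */
    bool    dirty;
    int     free_next;              /* next entry in the free list, -1 = none */
} BufferDescSketch;

static BufferDescSketch buffers[NBUFFERS];
static int  freelist_head = -1;
static int  freelist_len = 0;
static int  clock_hand = 0;

/* Backend path: prefer the free list; fall back to sweeping inline. */
static int
strategy_get_buffer(void)
{
    if (freelist_head >= 0)
    {
        int     buf = freelist_head;

        freelist_head = buffers[buf].free_next;
        freelist_len--;
        return buf;                 /* cheap: no sweep needed */
    }

    /* Fallback: run the clock sweep ourselves, as backends do today. */
    for (;;)
    {
        int     buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (buffers[buf].usage_count > 0)
            buffers[buf].usage_count--;
        else if (!buffers[buf].dirty)
            return buf;
    }
}

/* Bgwriter path: sweep ahead of demand and refill the free list. */
static void
bgwriter_fill_freelist(void)
{
    while (freelist_len < FREELIST_TARGET)
    {
        int     buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (buffers[buf].usage_count > 0)
        {
            buffers[buf].usage_count--;
            continue;
        }
        if (buffers[buf].dirty)
            continue;               /* the real bgwriter would write it out */

        buffers[buf].free_next = freelist_head;
        freelist_head = buf;
        freelist_len++;
    }
}

int
main(void)
{
    bgwriter_fill_freelist();       /* background refill */
    printf("backend got buffer %d from the free list\n",
           strategy_get_buffer());
    return 0;
}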
From: Jim Nasby on 6 Mar 2007 19:43

On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
>> We also don't need an exact count, either. Perhaps there's some way
>> we could keep a counter or something...
>
> Exact count of what? The pages already in cache?

Yes. The idea being that if you see there are 10k pages in cache, you
can likely start 9k pages behind the current scan point and still pick
everything up.

But this is nowhere near as useful as the bitmap idea, so I'd only look
at it if it's impossible to make the bitmaps work. And like others have
said, that should wait until there's at least a first-generation patch
that's going to make it into 8.3.

--
Jim Nasby                                            jim(a)nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
From: Jeff Davis on 6 Mar 2007 21:10
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote:
> On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
> >> We also don't need an exact count, either. Perhaps there's some way
> >> we could keep a counter or something...
> >
> > Exact count of what? The pages already in cache?
>
> Yes. The idea being if you see there's 10k pages in cache, you can
> likely start 9k pages behind the current scan point and still pick
> everything up.
>
> But this is nowhere near as useful as the bitmap idea, so I'd only
> look at it if it's impossible to make the bitmaps work. And like
> others have said, that should wait until there's at least a
> first-generation patch that's going to make it into 8.3.

You still haven't told me how we take advantage of the OS buffer cache
with the bitmap idea. What makes you think that my current
implementation is "nowhere near as useful as the bitmap idea"? My
current implementation is making use of OS buffers + shared memory;
the bitmap idea can only make use of shared memory, and is likely
throwing the OS buffers away completely.

I also suspect that the bitmap idea relies too much on the idea that
there's a contiguous cache trail in the shared buffers alone. Any
deviation from that -- which could be caused by PG's page replacement
algorithm, especially in combination with a varied load pattern --
would negate any benefit from the bitmap idea.

I feel much more confident that there will exist a trail of pages that
are cached in *either* the PG shared buffers *or* the OS buffer cache.
There may be holes/gaps in either one, but it's much more likely that
they combine into a contiguous series of cached pages. Do you have an
idea how I might test this claim?

Regards,
	Jeff Davis
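One possible way to test the OS-cache half of that claim, sketched under the assumption of an 8KB block size and a Unix-like system with mincore() (the pg_buffercache contrib module would cover the shared-buffer half): mmap a relation segment file and ask mincore() which of its blocks are resident in the OS page cache, then union that with the shared-buffer contents and look for gaps behind the scan position.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192                 /* PostgreSQL block size */

int
main(int argc, char **argv)
{
    struct stat st;
    unsigned char *vec;
    size_t      pagesize = (size_t) sysconf(_SC_PAGESIZE);
    size_t      npages;
    size_t      nblocks;
    void       *map;
    int         fd;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <relation-segment-file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0)
    {
        perror(argv[1]);
        return 1;
    }

    map = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    npages = ((size_t) st.st_size + pagesize - 1) / pagesize;
    vec = malloc(npages);

    /* One residency byte per OS page; bit 0 set means "in page cache". */
    if (mincore(map, (size_t) st.st_size, vec) < 0)
    {
        perror("mincore");
        return 1;
    }

    /* Report residency per 8KB PostgreSQL block. */
    nblocks = (size_t) st.st_size / BLCKSZ;
    for (size_t blk = 0; blk < nblocks; blk++)
    {
        size_t  first = blk * BLCKSZ / pagesize;
        size_t  last = ((blk + 1) * BLCKSZ - 1) / pagesize;
        int     cached = 1;

        for (size_t p = first; p <= last && p < npages; p++)
            if (!(vec[p] & 1))
                cached = 0;

        printf("block %zu: %s\n", blk, cached ? "cached" : "not cached");
    }

    munmap(map, (size_t) st.st_size);
    free(vec);
    close(fd);
    return 0;
}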