From: Jeff Davis on 6 Mar 2007 21:28

On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> Jeff Davis wrote:
> > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > > On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > > > Another approach I proposed back in December is to not have a
> > > > variable like that at all, but scan the buffer cache for pages
> > > > belonging to the table you're scanning to initialize the scan.
> > > > Scanning all the BufferDescs is a fairly CPU and lock heavy
> > > > operation, but it might be ok given that we're talking about large
> > > > I/O bound sequential scans. It would require no DBA tuning and
> > > > would work more robustly in varying conditions. I'm not sure where
> > > > you would continue after scanning the in-cache pages. At the
> > > > highest in-cache block number, perhaps.
> > > If there was some way to do that, it'd be what I'd vote for.
> >
> > I still don't know how to make this take advantage of the OS buffer
> > cache.
>
> Yep, I don't see any way to do that. I think we could live with that,
> though. If we went with the sync_scan_offset approach, you'd have to
> leave a lot of safety margin in that as well.

Right, there would certainly have to be a safety margin with
sync_scan_offset. However, your plan only works when the shared buffers
are dominated by this sequential scan. Let's say you have 40% of
physical memory for shared buffers, and say that 50% of those are being
used for hot pages in other parts of the database. That means you have
access to only 20% of physical memory to optimize for this sequential
scan, and 20% of physical memory is basically unavailable (being used
for other parts of the database).

In my current implementation, you could set sync_scan_offset to 1.0
(meaning 1.0 x shared_buffers), giving you 40% of physical memory that
would be used for starting this sequential scan. In this case, that
should be a good margin of error, considering that as much as 80% of
the physical memory might actually be in cache (OS or PG cache).

This all needs to be backed up by testing, of course. I'm just
extrapolating some numbers that look vaguely reasonable to me.

Regards,
	Jeff Davis
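To make the arithmetic above concrete, here is a rough sketch of how a
newly started scan might pick its starting block from a hint left by
another scan of the same table. This is illustrative only, not the
actual patch: choose_start_block() and its parameters are made-up
names, and the real code would presumably fetch the hint from shared
memory.

    /*
     * Illustrative sketch only -- not the real sync-scan code.  A new scan
     * starts sync_scan_offset * shared_buffers blocks behind the block most
     * recently reported by another scan of the same table, on the theory
     * that the trailing blocks are still cached somewhere (shared buffers
     * or the OS cache).
     */
    typedef unsigned int BlockNumber;

    static BlockNumber
    choose_start_block(BlockNumber hint_block,   /* block last reported by another scan */
                       BlockNumber table_blocks, /* total blocks in the table */
                       int shared_buffers,       /* shared_buffers, in blocks */
                       double sync_scan_offset)  /* e.g. 1.0 => 1.0 x shared_buffers */
    {
        BlockNumber backoff = (BlockNumber) (sync_scan_offset * shared_buffers);

        if (hint_block >= table_blocks)
            return 0;                        /* stale or missing hint: start at block 0 */
        if (hint_block > backoff)
            return hint_block - backoff;     /* trail the other scan by the offset */
        return 0;
    }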
From: "Luke Lonergan" on 6 Mar 2007 22:32

Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally
and in various patterns, and did not see a gain comparable to the one
from reducing the working set size. My conclusion: fixing the OS is not
sufficient to alleviate the issue.

We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates
due to this effect.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Sherry Moore [mailto:sherry.moore(a)sun.com]
Sent: Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To: Simon Riggs
Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say
> this because you make the point that reducing the working set size of
> write workloads has no effect on the L2 cache issue, but ISTM it's
> still potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting):

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was caused by whether the size of the user
  buffer caused L2 thrashing.

- In general, applications should reduce the size of the working set to
  reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange on
the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym
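For readers following the working-set discussion, the effect Sherry
describes can be illustrated with a toy program that reads the same
file once through a buffer that fits in L2 and once through one that
does not. This is not her test program, and the buffer sizes below are
arbitrary placeholders.

    /*
     * Toy illustration of the working-set effect discussed above.  A small
     * user buffer keeps the read/copy loop's working set inside L2, while a
     * multi-megabyte buffer pushes every pass out to main memory.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    static long long
    read_file(const char *path, size_t bufsize)
    {
        int         fd = open(path, O_RDONLY);
        char       *buf = malloc(bufsize);
        ssize_t     n;
        long long   total = 0;

        if (fd < 0 || buf == NULL)
        {
            if (fd >= 0)
                close(fd);
            free(buf);
            return -1;
        }
        while ((n = read(fd, buf, bufsize)) > 0)
            total += n;             /* each read() copies data through buf */
        free(buf);
        close(fd);
        return total;
    }

    int
    main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        /* e.g. 128 kB stays well inside a typical L2 cache; 16 MB does not */
        printf("small buffer: %lld bytes\n", read_file(argv[1], 128 * 1024));
        printf("large buffer: %lld bytes\n", read_file(argv[1], 16 * 1024 * 1024));
        return 0;
    }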
From: Hannu Krosing on 7 Mar 2007 02:22

On Tue, 2007-03-06 at 18:28, Jeff Davis wrote:
> On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> > Jeff Davis wrote:
> > > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > > > On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > > > > Another approach I proposed back in December is to not have a
> > > > > variable like that at all, but scan the buffer cache for pages
> > > > > belonging to the table you're scanning to initialize the scan.
> > > > > Scanning all the BufferDescs is a fairly CPU and lock heavy
> > > > > operation, but it might be ok given that we're talking about
> > > > > large I/O bound sequential scans. It would require no DBA
> > > > > tuning and would work more robustly in varying conditions. I'm
> > > > > not sure where you would continue after scanning the in-cache
> > > > > pages. At the highest in-cache block number, perhaps.
> > > > If there was some way to do that, it'd be what I'd vote for.
> > >
> > > I still don't know how to make this take advantage of the OS buffer
> > > cache.

Maybe it should not? There is generally a benefit from the OS cache
only if it is much bigger than the shared buffer cache. It may make
sense to forget about the OS cache and just tell those who can make use
of sync scans to set most of memory aside for shared buffers. Then we
can make better predictions/lookups of how much of a table is actually
in memory. Dual caching is usually not very beneficial anyway, not to
mention the difficulty of predicting any dual-caching effects.

> > Yep, I don't see any way to do that. I think we could live with that,
> > though. If we went with the sync_scan_offset approach, you'd have to
> > leave a lot of safety margin in that as well.
>
> Right, there would certainly have to be a safety margin with
> sync_scan_offset. However, your plan only works when the shared buffers
> are dominated by this sequential scan. Let's say you have 40% of
> physical memory for shared buffers, and say that 50% are being used for
> hot pages in other parts of the database. That means you have access to
> only 20% of physical memory to optimize for this sequential scan, and
> 20% of the physical memory is basically unavailable (being used for
> other parts of the database).

The simplest thing, in case the table is much bigger than the buffer
cache usable for it, is to start the second scan at the point the first
scan is traversing *now*, and hope that the scans will stay together.
Or start at some fixed lag, which makes the first scan always the one
issuing reads while the second just free-rides on buffers already in
cache. It may even be a good idea to throttle the second scan to stay N
pages behind, in case OS readahead gets confused when the same file is
read from multiple processes.

If the table is smaller than the cache, then just scan it without
syncing.

Trying to read buffers in the same order, starting from near the point
where pages are still in the shared buffer cache, seems good mostly for
the case where the table is as big as or just a little larger than the
cache.

> In my current implementation, you could set sync_scan_offset to 1.0
> (meaning 1.0 x shared_buffers), giving you 40% of physical memory that
> would be used for starting this sequential scan. In this case, that
> should be a good margin of error, considering that as much as 80% of
> the physical memory might actually be in cache (OS or PG cache).
>
> This all needs to be backed up by testing, of course.
> I'm just extrapolating some numbers that look vaguely reasonable to me.

If there is an easy way to tell PG "give me this page only if it is in
shared cache already", then a good approach might be to start the 2nd
scan at the point where the 1st is now, and move in both directions
simultaneously, like this:

First scan is at page N. Second scan:

    M = N - 1
    WHILE NOT ALL PAGES ARE READ:
        IF PAGE N IS IN CACHE:                  -- follow first reader
            READ PAGE N
            N++
        ELSE IF M >= 0 AND PAGE M IS IN CACHE:  -- read older cached pages
            READ PAGE M
            M--
        ELSE IF FIRST READER STILL GOING:       -- no older pages, wait for 1st
            WAIT FOR PAGE N TO BECOME AVAILABLE
            READ PAGE N
            N++
        ELSE:                                   -- become the 1st reader
            READ PAGE N
            N++
        PROCESS PAGE

        IF N > PAGES_IN_TABLE: N = 0
        IF M < 0: M = PAGES_IN_TABLE

This should work reasonably well for LRU caches, and it may be made to
work with the clock sweep scheme if the sweep arranges pages to purge in
file order.

If we could make the IF PAGE x IS IN CACHE part also know about the OS
cache, this could make use of the OS cache as well.

Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
systems?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
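For what it's worth, the loop above might look roughly like this in C.
None of the helpers (page_is_cached(), first_reader_running(),
wait_for_page(), read_page(), process_page()) are real PostgreSQL APIs;
they are hypothetical placeholders for whatever the real implementation
would provide.

    /*
     * Hypothetical sketch of the second reader described above.  The extern
     * functions are placeholders for "is this page cached", "is the first
     * scan still running", and the actual read/processing steps.
     */
    #include <stdbool.h>

    extern bool page_is_cached(long page);
    extern bool first_reader_running(void);
    extern void wait_for_page(long page);
    extern void read_page(long page);
    extern void process_page(long page);

    void
    follow_scan(long first_reader_pos, long pages_in_table)
    {
        long    n = first_reader_pos;       /* forward cursor, follows 1st reader */
        long    m = first_reader_pos - 1;   /* backward cursor, older cached pages */
        long    pages_read = 0;
        long    page;

        while (pages_read < pages_in_table)
        {
            if (page_is_cached(n))
                page = n++;                 /* follow the first reader */
            else if (m >= 0 && page_is_cached(m))
                page = m--;                 /* pick up older pages still in cache */
            else if (first_reader_running())
            {
                wait_for_page(n);           /* no older pages; wait for the 1st reader */
                page = n++;
            }
            else
                page = n++;                 /* 1st reader finished: become the reader */

            read_page(page);
            process_page(page);
            pages_read++;

            if (n >= pages_in_table)        /* wrap around, as in the pseudocode */
                n = 0;
            if (m < 0)
                m = pages_in_table - 1;
        }
    }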
From: "Marko Kreen" on 7 Mar 2007 03:49

On 3/7/07, Hannu Krosing <hannu(a)skype.net> wrote:
> Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
> systems?

Supposedly you could mmap() a file and then do mincore() on the area to
see which pages are cached.

But you were talking about the postgres cache before; there it should be
easily implementable.

--
marko
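A minimal sketch of the mmap()/mincore() idea Marko mentions, assuming
the Linux-style mincore() prototype (some systems declare the vector as
char *) and with error handling mostly omitted:

    /* Sketch: count how many pages of a file are resident in the OS page cache. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int
    main(int argc, char **argv)
    {
        int             fd;
        struct stat     st;
        long            pagesize = sysconf(_SC_PAGESIZE);
        size_t          npages, i, resident = 0;
        void           *map;
        unsigned char  *vec;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
            return 1;

        npages = (st.st_size + pagesize - 1) / pagesize;
        map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        vec = malloc(npages);
        if (map == MAP_FAILED || vec == NULL)
            return 1;

        if (mincore(map, st.st_size, vec) == 0)
        {
            for (i = 0; i < npages; i++)
                if (vec[i] & 1)             /* low bit set => page is resident */
                    resident++;
            printf("%zu of %zu pages resident in OS cache\n", resident, npages);
        }

        munmap(map, st.st_size);
        free(vec);
        close(fd);
        return 0;
    }

For the PostgreSQL shared buffer cache, the equivalent check would be a
lookup in the backend's own buffer mapping table rather than a kernel
call, which is presumably what "easily implementable" refers to.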
From: Sherry Moore on 6 Mar 2007 22:05
Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say
> this because you make the point that reducing the working set size of
> write workloads has no effect on the L2 cache issue, but ISTM it's
> still potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting):

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was caused by whether the size of the user
  buffer caused L2 thrashing.

- In general, applications should reduce the size of the working set to
  reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
  the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange on
the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym