From: Josh Berkus on 1 Mar 2010 14:21

> So I think the primary solution currently is to raise max_standby_age.
>
> However there is a concern with max_standby_age. If you set it to,
> say, 300s. Then run a 300s query on the slave which causes the slave
> to fall 299s behind. Now you start a new query on the slave -- it gets
> a snapshot based on the point in time that the slave is currently at.
> If it hits a conflict it will only have 1s to finish before the
> conflict causes the query to be cancelled.

Completely aside from that, how many users are going to be happy with a
slave server which is constantly 5 minutes behind?

--Josh Berkus
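[The staleness being complained about here is at least observable. As a
rough sketch, using the functions provided by the 9.0-era streaming
replication code (names and availability differ in other major versions),
comparing the last WAL location received with the last location replayed
shows whether replay is being held back on the standby:

    -- run on the standby: compare what has been received with what has
    -- already been replayed
    SELECT pg_last_xlog_receive_location() AS received,
           pg_last_xlog_replay_location()  AS replayed;

If the two values keep diverging while a long report runs, the standby is
accumulating exactly the kind of lag described in the quoted analysis.]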
From: Josh Berkus on 1 Mar 2010 15:04

On 3/1/10 11:43 AM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(a)kaltenbrunner.cc> writes:
>> Greg Stark wrote:
>>> For what it's worth Oracle has an option to have your standby
>>> intentionally hold back n minutes behind and I've seen that set to 5
>>> minutes.
>
>> yeah a lot of people are doing that intentionally...
>
> It's the old DBA screwup safety valve ... drop the main accounts table,
> you have five minutes to stop replication before it's dropped on the
> standby. Speaking of which, does the current HS+SR code have a
> provision to force the standby to stop tracking WAL and come up live,
> even when there's more WAL available? Because that's what you'd need
> in order for such a thing to be helpful in that scenario.

The "fast" recovery option should do this. You'd need some fast
reaction times, though.

However, this leaves aside Greg's point about snapshot age and
successive queries; does anyone dispute his analysis? Simon?

--Josh Berkus
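[For reference, the mechanism being alluded to in the 9.0-era setup is the
recovery trigger file: the standby stops waiting for new WAL and finishes
recovery once a designated file appears. A minimal sketch of the standby's
recovery.conf, with the path and connection string as placeholders:

    # recovery.conf on the standby (illustrative values)
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com port=5432 user=replication'
    # creating this file tells the standby to stop waiting for new WAL,
    # finish recovery, and come up live
    trigger_file = '/var/lib/pgsql/promote.trigger'

Whether promotion happens before the mistaken DROP in Tom's scenario has
already been streamed to and replayed on the standby still comes down to
how quickly the admin reacts, which is the point about reaction times.]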
From: Josh Berkus on 1 Mar 2010 17:32

On 2/28/10 7:12 PM, Robert Haas wrote:
>> However, I'd still like to hear from someone with the requisite
>> technical knowledge whether capturing and retrying the current query in
>> a query cancel is even possible.
>
> I'm not sure who you want to hear from here, but I think that's a dead end.

"dead end" as in "too hard to implement"? Or for some other reason?

It's undeniable that auto-retry would be better from a user's
perspective than a user-visible cancel. So if it's *reasonable* to
implement, I think we should be working on it. I'm also very puzzled as
to why nobody else wants to even discuss it; it's like some weird blackout.

--Josh Berkus
From: Greg Smith on 1 Mar 2010 20:34

Josh Berkus wrote:
> However, this leaves aside Greg's point about snapshot age and
> successive queries; does anyone dispute his analysis? Simon?

There's already a note on the Hot Standby TODO about unexpectedly bad
max_standby_delay behavior being possible on an idle system, with no
suggested resolution for it besides better SR integration. The issue
Greg Stark has noted is another variation on that theme. It's already
on my list of theorized but as yet undemonstrated pathological concerns
that Simon and I identified, the one I'm working through creating test
cases to prove/disprove. I'm past "it's possible..." talk at this
point, though, so as not to spook anyone unnecessarily, and am only
raising things I can show concrete examples of in action. White box
testing at some point does require pausing one's investigation of
what's in the box and getting on with the actual testing instead.

The only real spot where my opinion diverges here is that I have yet to
find any situation where 'max_standby_delay=-1' makes any sense to me.
When I try running my test cases with that setting, the whole system
just reacts far too strangely. My first patch here is probably going
to be adding more visibility into the situation when queries are
blocking replication forever, because the times I find myself asking
"why is the system hung right now?" are when that happens and it's not
obvious as an admin what's going on. Also, the idea that a
long-running query on the standby could cause an unbounded delay in
replication is so foreign to my sensibilities that I don't ever include
it in the list of useful solutions to the problems I'm worried about.
The option is there, and I'm not disputing that it makes sense for some
people because there seems to be some demand for it; I just can't see
how it fits into any of the use cases I'm concerned about.

I haven't said anything about query retry mainly because I can't
imagine any way it's possible to build it in time for this release, so
whether it's eventually feasible or not doesn't enter into what I'm
worried about right now. In any case, I would prioritize that behind
work on preventing the most common situations that cause cancellations
in the first place, until those are handled so well that retry is the
most effective improvement left to consider.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com   www.2ndQuadrant.us
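[Pending the kind of instrumentation Greg describes, a low-tech way to get
some of that visibility is simply to look at which standby sessions have
been running longest, since those are the most plausible candidates for
holding back replay. A rough sketch, using the column names as they
appeared in the 8.4/9.0-era pg_stat_activity (adjust for your version):

    -- run on the standby: longest-running sessions first
    SELECT procpid, usename, query_start,
           now() - query_start AS runtime,
           current_query
      FROM pg_stat_activity
     WHERE current_query <> '<IDLE>'
     ORDER BY query_start;

This doesn't prove that a given query is what recovery is waiting on, only
which sessions are the likely culprits; tying the two together explicitly
is exactly the sort of visibility being proposed above.]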
From: Greg Smith on 2 Mar 2010 00:50
Bruce Momjian wrote:
> Joachim Wieland wrote:
>> 1) With the current implementation they will see better performance on
>> the master and more aggressive vacuum (!), since they have less
>> long-running queries now on the master and autovacuum can kick in and
>> clean up with less delay than before. On the other hand their queries
>> on the standby might fail and they will start thinking that this HS+SR
>> feature is not as convincing as they thought it was...
>
> I assumed they would set max_standby_delay = -1 and be happy.

The admin in this situation might be happy until the first time the
primary fails and a failover is forced, at which point there is an
unbounded amount of recovery data to apply that was stuck waiting
behind whatever long-running queries were active. I don't know if
you've ever watched what happens to a pre-8.2 cold standby when you
start it up with hundreds or thousands of backed up WAL files to
process before the server can start, but it's not a fast process. I
watched a production 8.1 standby get >4000 files behind once due to an
archive_command bug, and it's not something I'd ever like to sit
chewing my nails through again. If your goal was HA and you're trying
to bring up the standby, the server is down the whole time that's going
on. This is why no admin who prioritizes HA would consider
'max_standby_delay = -1' a reasonable setting, and those are the sort
of users Joachim's example was discussing. It only takes one rogue
query that runs for a long time to make the standby so far behind it's
useless for HA purposes.

And you also have to ask yourself "if recovery is halted while waiting
for this query to run, how stale is the data on the standby getting?".
That's true for any large setting for this parameter, but using -1 for
the unlimited setting also gives the maximum possible potential for
such staleness.

'max_standby_delay = -1' is really only a reasonable idea if you are
absolutely certain all queries are going to be short; we can't dismiss
that as an unfounded use case, so the option has value. I would expect
you'd also have to combine it with a matching, reasonable
statement_timeout to enforce that expectation and make the situation
safer.

In any of the "offload batch queries to the failover standby"
situations, it's unlikely an unlimited value for this setting will be
practical. Perhaps you set max_standby_delay to some number of hours,
to match your expected worst-case query run time and reduce the chance
of cancellation. Not putting a limit on it at all is a situation no DBA
with healthy paranoia is going to be happy with in an HA environment,
given that both unbounded staleness and unbounded recovery time are
then possible. The potential of a failed long-running query is much
less risky than either of those.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com   www.2ndQuadrant.us
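[A sketch of the kind of bounded configuration being argued for on a
reporting standby. max_standby_delay is the development-era name used in
this thread (later releases split it into max_standby_archive_delay and
max_standby_streaming_delay), and the accepted units vary between
versions, so the values below are purely illustrative:

    # postgresql.conf on a reporting standby (illustrative values)
    max_standby_delay = '2h'        # let replay wait for queries, but not forever
    statement_timeout = '90min'     # and cap how long any single query may run

With settings along these lines a runaway query is cancelled by
statement_timeout before replay has been held up long enough to make
failover painful, which is the trade-off being weighed here against the
unbounded behavior of -1.]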