From: Josh Berkus on

> So I think the primary solution currently is to raise max_standby_delay.
>
> However there is a concern with max_standby_delay. If you set it to,
> say, 300s. Then run a 300s query on the slave which causes the slave
> to fall 299s behind. Now you start a new query on the slave -- it gets
> a snapshot based on the point in time that the slave is currently at.
> If it hits a conflict it will only have 1s to finish before the
> conflict causes the query to be cancelled.
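
To make that arithmetic concrete, here's a sketch of the timeline with two
hypothetical standby sessions (assuming max_standby_delay = 300s; the table
name is invented, and the error text is what a cancelled standby query
reports):

    -- Standby session 1, t=0: a ~300s report starts and holds back replay.
    SELECT count(*) FROM big_reporting_table;

    -- Standby session 2, t=299s: replay is now ~299s behind, so the 300s
    -- delay budget is nearly spent.  This query takes its snapshot here;
    -- if a conflicting cleanup record is already waiting, it gets only ~1s
    -- before:
    --   ERROR:  canceling statement due to conflict with recovery
    SELECT count(*) FROM big_reporting_table;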

Completely aside from that, how many users are going to be happy with a
slave server which is constantly 5 minutes behind?

--Josh Berkus


From: Josh Berkus on
On 3/1/10 11:43 AM, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan(a)kaltenbrunner.cc> writes:
>> Greg Stark wrote:
>>> For what it's worth Oracle has an option to have your standby
>>> intentionally hold back n minutes behind and I've seen that set to 5
>>> minutes.
>
>> yeah a lot of people are doing that intentionally...
>
> It's the old DBA screwup safety valve ... drop the main accounts table,
> you have five minutes to stop replication before it's dropped on the
> standby. Speaking of which, does the current HS+SR code have a
> provision to force the standby to stop tracking WAL and come up live,
> even when there's more WAL available? Because that's what you'd need
> in order for such a thing to be helpful in that scenario.

the "fast" recovery option should do this. You'd need some fast
reaction times, though.
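
For reference, with an archive-based standby driven by pg_standby that
looks roughly like this (a sketch only; the paths are made up, and it
assumes pg_standby's trigger-file handling, where writing "fast" ends
recovery without waiting to apply the remaining archived WAL):

    # recovery.conf on the standby (hypothetical paths)
    restore_command = 'pg_standby -t /tmp/pgsql.trigger.5442 /mnt/archive %f %p %r'

    # Force the standby up right away, even with more WAL still available:
    echo fast > /tmp/pgsql.trigger.5442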

However, this leaves aside Greg's point about snapshot age and
successive queries; does anyone dispute his analysis? Simon?

--Josh Berkus


From: Josh Berkus on
On 2/28/10 7:12 PM, Robert Haas wrote:
>> However, I'd still like to hear from someone with the requisite
>> technical knowledge whether capturing and retrying the current query in
>> a query cancel is even possible.
>
> I'm not sure who you want to hear from here, but I think that's a dead end.

"dead end" as in "too hard to implement"? Or for some other reason?

It's undeniable that auto-retry would be better from a user's
perspective than a user-visible cancel. So if it's *reasonable* to
implement, I think we should be working on it. I'm also very puzzled as
to why nobody else wants to even discuss it; it's like some weird blackout.

--Josh Berkus


From: Greg Smith on
Josh Berkus wrote:
> However, this leaves aside Greg's point about snapshot age and
> successive queries; does anyone dispute his analysis? Simon?
>

There's already a note on the Hot Standby TODO about unexpectedly bad
max_standby_delay behavior being possible on an idle system, with no
suggested resolution for it besides better SR integration. The issue
Greg Stark has noted is another variation on that theme. It's already
on my list of theorized, pathological, but as yet undemonstrated concerns
that Simon and I identified, and it's the one I'm working through building
test cases to prove or disprove. I'm holding off on the "it's possible..."
talk at this point, though, so as not to spook anyone unnecessarily, and
am only raising things I can show concrete examples of in action.
White-box testing at some point does require pausing one's investigation
of what's in the box and getting on with the actual testing instead.

The only real spot where my opinion diverges here is that I have yet to
find any situation where 'max_standby_delay = -1' makes sense to me.
When I try running my test cases with that setting, the whole system
just reacts far too strangely. My first patch here is probably going to
be adding more visibility into situations where queries are blocking
replication forever, because the times I find myself asking "why is the
system hung right now?" are when that happens, and it's not obvious to
an admin what's going on.
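
As a rough illustration of the kind of visibility I mean, this is the
sort of information an admin has to dig out by hand right now (a sketch
using the current function names and pg_stat_activity columns):

    -- Is replay stalled while new WAL keeps arriving?
    SELECT pg_last_xlog_receive_location() AS received,
           pg_last_xlog_replay_location()  AS replayed;

    -- Which standby queries have been running long enough to be the culprit?
    SELECT procpid, now() - query_start AS runtime, current_query
      FROM pg_stat_activity
     WHERE current_query <> '<IDLE>'
     ORDER BY runtime DESC;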

Also, the idea that a long-running query on the standby could cause an
unbounded delay in replication is so foreign to my sensibilities that I
don't ever include it in the list of useful solutions to the problems
I'm worried about. The option is there, and I'm not disputing that it
makes sense for some people, since there seems to be some demand for it;
I just can't see how it fits into any of the use cases I'm concerned
about.

I haven't said anything about query retry mainly because I can't imagine
any way it's possible to build it in time for this release, so whether
it's eventually feasible or not doesn't enter into what I'm worried
about right now. In any case, I would prioritize that behind work on
preventing the most common situations that cause cancellations in the
first place, until those are handled so well that retry is the most
effective improvement left to consider.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us



From: Greg Smith on
Bruce Momjian wrote:
> Joachim Wieland wrote:
>
>> 1) With the current implementation they will see better performance on
>> the master and more aggressive vacuum (!), since they have fewer
>> long-running queries now on the master and autovacuum can kick in and
>> clean up with less delay than before. On the other hand their queries
>> on the standby might fail and they will start thinking that this HS+SR
>> feature is not as convincing as they thought it was...
>>
>
> I assumed they would set max_standby_delay = -1 and be happy.
>

The admin in this situation might be happy until the first time the
primary fails and a failover is forced, at which point there is an
unbounded amount of recovery data to apply that was stuck waiting behind
whatever long-running queries were active. I don't know if you've ever
watched what happens to a pre-8.2 cold standby when you start it up with
hundreds or thousands of backed-up WAL files to process before the
server can start, but it's not a fast process. I watched a production
8.1 standby get >4000 files behind once due to an archive_command bug,
and it's not an experience I'd ever like to chew my nails through again.
If your goal is HA and you're trying to bring up the standby, the server
is down the whole time that's going on.

This is why no admin who prioritizes HA would consider
'max_standby_delay = -1' a reasonable setting, and those are the sort of
users Joachim's example was discussing. It only takes one rogue query
that runs for a long time to put the standby so far behind that it's useless for
HA purposes. And you also have to ask yourself "if recovery is halted
while waiting for this query to run, how stale is the data on the
standby getting?". That's true for any large setting for this
parameter, but the unlimited -1 setting maximizes the potential for such
staleness.

'max_standby_delay = -1' is really only a reasonable idea if you are
absolutely certain all queries are going to be short, which we can't
dismiss as an unfounded use case, so the setting has value. I would
expect you'd also have to combine it with a matching, reasonable
statement_timeout to enforce that expectation and make the situation
safer.
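
Something along these lines is the pairing I mean (illustrative values
only):

    # postgresql.conf on the standby
    max_standby_delay = -1       # never cancel standby queries for recovery's sake
    statement_timeout = '5min'   # ...but cap how long any one query can run,
                                 # and so how long it can hold replay back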

In any of the "offload batch queries to the failover standby"
situations, it's unlikely an unlimited value for this setting will be
practical. Perhaps you set max_standby_delay to some number of hours,
to match your expected worst-case query run time and reduce the chance
of cancellation. Not putting a limit on it at all is something no DBA
with healthy paranoia is going to be happy with in an HA environment,
given that unbounded staleness and unbounded recovery time are then
both possible. The potential for a failed long-running query is much
less risky than either of those.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us