max_standby_delay considered harmful [PgSql]


From: Andres Freund on 10 May 2010 08:11

On Monday 10 May 2010 14:00:45 Heikki Linnakangas wrote:
> Florian Pflug wrote:
> > On May 10, 2010, at 11:43 , Heikki Linnakangas wrote:
> >> If you're not going to apply any more WAL records before shutdown, you
> >> could also just release all the AccessExclusiveLocks held by the startup
> >> process. Whatever the transaction was doing with the locked relation, if
> >> we're not going to replay any more WAL records before shutdown, we will
> >> not see the transaction committing or doing anything else with the
> >> relation, so we should be safe. Whatever state the data on disk is in,
> >> it must be valid, or we would have a problem with crash recovery
> >> recovering up to this WAL record and then starting up too.
> >
> > Sounds plausible. But wouldn't this imply that HS could *always* postpone
> > the acquisition of an AccessExclusiveLock until right before the
> > corresponding commit record is replayed? I fail to see a case where
> > this would fail, yet recovery in case of an intermediate crash would be
> > correct.
>
> I guess it could in some situations, but for example the
> AccessExclusiveLock taken at the end of lazy vacuum to truncate the
> relation must be held during the truncation, or concurrent readers will
> get upset.
Actually, any lock that does not need to be taken on the slave would not
need to be ACCESS EXCLUSIVE on the master either; EXCLUSIVE would do, right?
That should be "fixed" on the master, not hacked up on the slave, and it is
far out of scope for 9.0.
That's an area where I would definitely like to improve pg in the future...

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Bruce Momjian on 10 May 2010 10:57

Simon Riggs wrote:
> Bruce has used the word crippleware for the current state. Raising a
> problem and then blocking solutions is the best way I know to cripple a
> release. It should be clear that I've done my best to avoid this

FYI, it was Robert Haas who used the term "crippleware" to describe a
boolean value for max_standby_delay; I was just repeating his term
and disputing that it would be crippleware.

--
Bruce Momjian <bruce(a)momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

From: Bruce Momjian on 10 May 2010 11:07

Robert Haas wrote:
> Wultsch (who doesn't ever want to kill queries and therefore would be
> happy with a boolean), Yeb Havinga (who never wants to stall recovery
> and therefore would also be happy with a boolean), and Florian Pflug
> (who points out that pause/resume is actually a nontrivial feature).
> Apologies if I've left anyone out or misrepresented their position.
>
> Overall I would say opinion is about evenly split between:
>
> - leave it as-is
> - make it a Boolean
> - change it in some way but to something more expressive than a Boolean
>
> I can't presume to extract a consensus from that; I don't think there
> is one. You could say "the majority of people want to change
> something" and that would be true; you could also say "the majority of
> people don't want a Boolean" and that would also be true.

Yep, this is where we are. Discussion had stopped, so it seemed like
time for a decision, and with no one agreeing on what to do, feature
removal seemed like the best approach. Suggesting we will fix it later
in beta is not a solution.

Now, if everyone agrees we should do X, and X is simple, let's do X, but
I am still not seeing that.

--
Bruce Momjian <bruce(a)momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

From: Mike Rylander on 10 May 2010 11:22

On Mon, May 10, 2010 at 6:03 AM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> Robert Haas wrote:
>> On Thu, May 6, 2010 at 2:47 PM, Josh Berkus <josh(a)agliodbs.com> wrote:
>>>> Now that I've realized what the real problem is with max_standby_delay
>>>> (namely, that inactivity on the master can use up the delay), I think
>>>> we should do what Tom originally suggested here. It's not as good as
>>>> a really working max_standby_delay, but we're not going to have that
>>>> for 9.0, and it's clearly better than a boolean.
>>> I guess I'm not clear on how what Tom proposed is fundamentally
>>> different from max_standby_delay = -1. If there's enough concurrent
>>> queries, recovery would never catch up.
>>
>> If your workload is that the standby server is getting pounded with
>> queries like crazy, then it's probably not that different: it will
>> fall progressively further behind. But I suspect many people will set
>> up standby servers where most of the activity happens on the primary,
>> but they run some reporting queries on the standby. If you expect
>> your reporting queries to finish in <10s, you could set the max delay
>> to say 60s. In the event that something gets wedged, recovery will
>> eventually kill it and move on rather than just getting stuck forever.
>> If the volume of queries is known not to be too high, it's reasonable
>> to expect that a few good whacks will be enough to get things back on
>> track.
>
> Yeah, I could live with that.
>
> A problem with using the name "max_standby_delay" for Tom's suggestion
> is that it sounds like a hard limit, which it isn't. But if we name it
> something like:
>
> # -1 = no timeout
> # 0 = kill conflicting queries immediately
> # > 0 = wait for N seconds, then kill query
> standby_conflict_timeout = -1
>
> it's more clear that the setting is a timeout for each *conflict*, and
> it's less surprising that the standby can fall indefinitely behind in
> the worst case. If we name the setting along those lines, I could live
> with that.

+1 from the peanut gallery.

--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker(a)esilibrary.com
| web: http://www.esilibrary.com
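
To illustrate the semantics Heikki proposes above for
standby_conflict_timeout, here is a minimal Python sketch. It is not
PostgreSQL code; the function names (should_cancel, worst_case_delay) are
hypothetical and exist only for this example. It assumes the timer starts
when a conflict is first detected, and it shows why the setting is a
per-conflict timeout rather than a hard limit on standby lag.

```python
from typing import Optional


def should_cancel(timeout_s: int, conflict_start: float, now: float) -> bool:
    """Decide whether a conflicting standby query should be cancelled.

    timeout_s follows the proposed setting:
      -1  = no timeout (recovery waits indefinitely)
       0  = kill conflicting queries immediately
      > 0 = wait for N seconds after the conflict began, then kill
    """
    if timeout_s < 0:
        return False
    return now - conflict_start >= timeout_s


def worst_case_delay(timeout_s: int, n_conflicts: int) -> Optional[int]:
    """Upper bound on replay delay if conflicts are hit one after another.

    Each conflict restarts the timer, so the total is not capped by
    timeout_s itself. Returns None when the delay is unbounded.
    """
    if timeout_s < 0:
        return None
    return timeout_s * n_conflicts


if __name__ == "__main__":
    print(should_cancel(60, conflict_start=100.0, now=159.0))
    print(should_cancel(60, conflict_start=100.0, now=160.0))
    print(worst_case_delay(60, 3))
```

With a 60 second timeout and three back-to-back conflicts, replay can be
held up for as long as 180 seconds, which is why the name "timeout" fits
better than "max delay".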

From: Stephen Frost on 10 May 2010 12:01

* Aidan Van Dyk (aidan(a)highrise.ca) wrote:
> * Heikki Linnakangas <heikki.linnakangas(a)enterprisedb.com> [100510 06:03]:
> > A problem with using the name "max_standby_delay" for Tom's suggestion
> > is that it sounds like a hard limit, which it isn't. But if we name it
> > something like:
>
> I'd still rather have an "if you're killing something, make sure you kill
> enough to get all the way current" behaviour, but that's just me....

I agree with that comment, and it's more like what max_standby_delay
was. That's what I had thought Tom was proposing initially,
since it makes a heck of a lot more sense to me than "just keep
waiting, just keep waiting..".

Now, if it's possible to have things queue up behind the recovery
process, such that the recovery process will only wait up to
timeout * # of locks held when recovery started, that might be alright,
but that's not the impression I've gotten about how this will work.

Of course, I also want to be able to have a Nagios hook that checks how
far behind the slave has gotten, and a way to tell the slave "oook,
you're too far behind, just forcibly catch up right *now*". If I could
use reload to change max_standby_delay (or whatever) and I can figure
out how long the delay is (even if I have to update a table on the
master and then see what it says on the slave..), I'd be happy.

That being said, I do think it makes more sense to wait until we've got
a conflict to start the timer, and I rather like avoiding the
uncertainty of time sync between master and slave by using WAL arrival
time on the slave.

Thanks,

Stephen
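
The monitoring idea Stephen describes (update a table on the master, then
see what it says on the slave) could be sketched roughly as follows. This
is only an illustration under stated assumptions: the heartbeat timestamp
is assumed to have been written on the master and read back on the slave
by some other means, the function names and the 60s/300s thresholds are
hypothetical, and the result depends on the two machines' clocks being in
sync, which is exactly the uncertainty mentioned above. The exit codes
follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
from datetime import datetime, timedelta, timezone

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def replay_lag(heartbeat_on_standby: datetime, now: datetime) -> timedelta:
    """Estimate how far the standby is behind.

    heartbeat_on_standby is the latest timestamp the master wrote into a
    heartbeat table, as currently visible on the standby. Negative values
    (clock skew) are clamped to zero.
    """
    return max(now - heartbeat_on_standby, timedelta(0))


def nagios_status(
    lag: timedelta,
    warn: timedelta = timedelta(seconds=60),   # assumed threshold
    crit: timedelta = timedelta(seconds=300),  # assumed threshold
) -> int:
    """Map a lag estimate onto a Nagios exit code."""
    if lag >= crit:
        return CRITICAL
    if lag >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    now = datetime(2010, 5, 10, 12, 0, 0, tzinfo=timezone.utc)
    lag = replay_lag(now - timedelta(seconds=90), now)
    print(lag.total_seconds(), nagios_status(lag))
```

A check like this only reports the lag; forcing the slave to catch up
would still need the reloadable setting Stephen asks for.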
