From: Josh Berkus on
Simon,

> My initial view was that the High Availability goal/role should be the
> default or most likely mode of operation. I would say that the current
> max_standby_delay favours the HA route since it specifically limits the
> amount by which server can fall behind.

I don't understand how Tom's approach would cause the slave to fall
further behind than the current max_standby_delay code does, and I can
see ways in which it would result in less delay. So, explain?

The main thing that struck me about Tom's list was that
max_standby_delay is linked to the system clock. HS is going to get
used by a lot of PG users who aren't running time sync on their servers,
or who let it get out of whack without fixing it. I'd thought that the
delay was somehow based on transaction timestamps coming from the
master. Keep in mind that there will be a *lot* of people using this
feature, including ones without competent & available sysadmins.

The lock method appeals to me simply because it would eliminate the
"mass cancel" issue that Greg Smith was reporting every time the timer
runs down. That is, it seems to me that only the oldest queries would
be cancelled, and not any new ones. The biggest drawback I can see to
Tom's approach is possible blocking on the slave due to the lock wait
from the recovery process. However, this could be managed with the new
lock-waits GUC, as well as statement_timeout.
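For illustration, a standby session might bound its own exposure to such
waits along these lines (a sketch only; log_lock_waits is assumed here
to be the "lock-waits GUC" in question):

    -- Cancel any statement, including one blocked behind the recovery
    -- process's lock request, after 30 seconds.
    SET statement_timeout = '30s';

    -- Log a message whenever this session waits longer than
    -- deadlock_timeout to acquire a lock, making replay-induced
    -- blocking visible in the server log.
    SET log_lock_waits = on;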

Overall, I think Tom's proposal gives me what I would prefer, which is
degraded performance on the slave, but in ways users are used to, rather
than a lot of query cancellations, which will interfere with porting
user applications.

Would the recovery lock show up in pg_locks? That would also be a good
diagnostic tool.
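If it did, something along these lines ought to show the replay
process's ungranted request while it waits (a sketch; whether the
startup process's locks are actually reported in pg_locks is exactly the
question above):

    -- List lock requests that are currently waiting rather than granted.
    SELECT locktype, relation::regclass AS relation, mode, pid, granted
      FROM pg_locks
     WHERE NOT granted;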

I am happy to test some of this on Amazon or GoGrid, which is what I was
planning on doing anyway.

P.S. can we avoid the "considered harmful" phrase? It carries a lot of
baggage ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane on
Robert Haas <robertmhaas(a)gmail.com> writes:
> On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>> I'm inclined to think that we should throw away all this logic and just
>> have the slave cancel competing queries if the replay process waits
>> more than max_standby_delay seconds to acquire a lock.

> What if we somehow get into a situation where the replay process is
> waiting for a lock over and over and over again, because it keeps
> killing conflicting processes but something restarts them and they
> take locks over again?

They won't be able to take locks "over again", because the lock manager
won't allow requests to pass a pending previous request, except in
very limited circumstances that shouldn't hold here. They'll queue
up behind the replay process's lock request, not in front of it.
(If that isn't the case, it needs to be fixed, quite independently
of this concern.)
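
To make the queueing concrete, here is a minimal sketch with ordinary
sessions standing in for the replay process (the table name and session
numbering are purely illustrative):

    -- Session 1: holds ACCESS SHARE on t via an open transaction.
    BEGIN;
    SELECT count(*) FROM t;

    -- Session 2 (stand-in for replay): requests a conflicting lock
    -- and queues behind session 1.
    LOCK TABLE t IN ACCESS EXCLUSIVE MODE;

    -- Session 3: this new ACCESS SHARE request queues *behind*
    -- session 2's pending ACCESS EXCLUSIVE request rather than
    -- passing it, so it waits even though session 1's lock alone
    -- would not conflict with it.
    SELECT count(*) FROM t;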

regards, tom lane


From: "Greg Sabino Mullane" on

> Based on that, I don't know that there's really much difference in
> user-seen behaviour between the two, except in 'oddball' situations,
> where there's a time skew between the servers, or a large lag, etc, in
> which case I think

Certainly that one particular case can be solved by making time sync
between the servers a prereq for HS working (in the traditional way).
And by "prereq" I mean a "user beware" documentation warning.

--
Greg Sabino Mullane greg(a)turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201005031539
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

From: Simon Riggs on
On Mon, 2010-05-03 at 15:32 -0400, Stephen Frost wrote:
> Simon,
>
> * Simon Riggs (simon(a)2ndQuadrant.com) wrote:
> > Tom's proposed behaviour (has also been proposed before) favours the
> > avoid query cancellation route though could lead to huge amounts of lag.
>
> My impression of Tom's suggestion was that it would also be a maximum
> amount of delay which would be allowed before killing off queries - not
> that it would be able to wait indefinitely until no one is blocking.
> Based on that, I don't know that there's really much difference in
> user-seen behaviour between the two, except in 'oddball' situations,
> where there's a time skew between the servers, or a large lag, etc, in
> which case I think Tom's proposal would be more likely what's
> 'expected', whereas what you would get with the existing implementation
> (zero time delay, or far too much) would be a 'gotcha'.

If recovery waits up to max_standby_delay every time something gets in
its way, it should be clear that if many things get in its way it will
progressively fall behind: with max_standby_delay = 30s, for example,
ten successive conflicts could add up to five minutes of lag. There is
no limit to this and it can always fall further behind. It does result
in fewer cancelled queries, and I do understand many may like that.

That is *significantly* different from how it works now. (Plus: if there
really were no difference, why not leave it as it is?)

The bottom line is that this is about conflict resolution. There is
simply no way to resolve conflicts without favouring one or other of the
protagonists. Whatever mechanism you come up with that favours one will
disfavour the other. I'm happy to give choices, but I'm not happy to
force just one kind of conflict resolution.

--
Simon Riggs www.2ndQuadrant.com



From: Simon Riggs on
On Mon, 2010-05-03 at 15:39 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(a)gmail.com> writes:
> > On Mon, May 3, 2010 at 11:37 AM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> >> I'm inclined to think that we should throw away all this logic and just
> >> have the slave cancel competing queries if the replay process waits
> >> more than max_standby_delay seconds to acquire a lock.
>
> > What if we somehow get into a situation where the replay process is
> > waiting for a lock over and over and over again, because it keeps
> > killing conflicting processes but something restarts them and they
> > take locks over again?
>
> They won't be able to take locks "over again", because the lock manager
> won't allow requests to pass a pending previous request, except in
> very limited circumstances that shouldn't hold here. They'll queue
> up behind the replay process's lock request, not in front of it.
> (If that isn't the case, it needs to be fixed, quite independently
> of this concern.)

Most conflicts aren't lock-manager locks; they are snapshot conflicts,
though clearly different workloads will have different characteristics.

Some conflicts are buffer conflicts, and the semantics of buffer cleanup
locks and many other internal locks are that shared lock requests
queue-jump past exclusive lock requests. Not something we should touch,
for now at least.
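
A snapshot conflict, for instance, surfaces on the standby not as a lock
wait but as a cancellation of the running query, roughly like this
(table name illustrative; message wording approximate):

    -- A long-running standby query whose snapshot still needs row
    -- versions that replay must remove gets cancelled:
    SELECT count(*) FROM big_table;
    ERROR:  canceling statement due to conflict with recovery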

I understand that you aren't impressed by everything about the current
patch, but rushed changes may not help either.

--
Simon Riggs www.2ndQuadrant.com

