max_standby_delay considered harmful [PgSql]

From: Simon Riggs on 12 May 2010 10:40

On Wed, 2010-05-12 at 16:03 +0200, Stefan Kaltenbrunner wrote:
> Simon Riggs wrote:
> > On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote:
> >> On Wed, May 12, 2010 at 7:26 AM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> >>> On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote:
> >>>
> >>>> I'm not sure what to make of this. Sometimes not shutting down
> >>>> doesn't sound like a feature to me.
> >>> It acts exactly the same in recovery as in normal running. It is not a
> >>> special feature of recovery at all, bug or otherwise.
> >> Simon, that doesn't make any sense. We are talking about a backend
> >> getting stuck forever on an exclusive lock that is held by the startup
> >> process and which will never be released (for example, because the
> >> master has shut down and no more WAL can be obtained for replay). The
> >> startup process does not hold locks in normal operation.
> >
> > When I test it, startup process holding a lock does not prevent shutdown
> > of a standby.
> >
> > I'd be happy to see your test case showing a bug exists and that the
> > behaviour differs from normal running.
>
> In my testing the postmaster simply does not shut down even with no
> clients connected any more once in a while - most of the time it works
> just fine but in like 1 out of 10 cases it gets stuck - my testcase (as
> detailed in the related thread) is simply doing an interval load on the
> master (pgbench -T 120 && sleep 30 && pgbench -T 120 - rinse and repeat
> as needed) and pgbench -S && pg_ctl restart && pgbench -S in a loop on
> the standby. once in a while the standby will simply not shut down
> (forever - not only by exceeding the default timeout of pg_ctl which seems
> to get triggered much more often on the standby than on the master -
> have not looked into that yet in detail)

If you could recreate that on a server in debug mode we can see what's
happening. If you can attach to the server and get a back trace that
would help. I've not seen that behaviour at all during testing, and if
the issue is sporadic it's not likely to help much for me to try to
recreate it myself.

This could be an issue with SR, or an issue with the shutdown code
itself.

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on 12 May 2010 11:28

On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote:
> On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote:
> > On Wed, May 12, 2010 at 7:26 AM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> > > On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote:
> > >
> > >> I'm not sure what to make of this. Sometimes not shutting down
> > >> doesn't sound like a feature to me.
> > >
> > > It acts exactly the same in recovery as in normal running. It is not a
> > > special feature of recovery at all, bug or otherwise.
> >
> > Simon, that doesn't make any sense. We are talking about a backend
> > getting stuck forever on an exclusive lock that is held by the startup
> > process and which will never be released (for example, because the
> > master has shut down and no more WAL can be obtained for replay). The
> > startup process does not hold locks in normal operation.
>
> When I test it, startup process holding a lock does not prevent shutdown
> of a standby.
>
> I'd be happy to see your test case showing a bug exists and that the
> behaviour differs from normal running.

Let me put this differently: I accept that Stefan has reported a
problem. Neither Tom nor myself can reproduce the problem. I've re-run
Stefan's test case and restarted the server more than 400 times now
without any issue.

I re-read your post where you gave what you yourself called "uninformed
speculation". There's no real polite way to say it, but yes your
speculation does appear to be uninformed, since it is incorrect. Reasons
would be not least that Stefan's tests don't actually send any locks to
the standby anyway (!), but even if they did your speculation as to the
cause is still all wrong, as explained.

There is no evidence to link this behaviour with HS, as yet, and you
should be considering the possibility the problem lies elsewhere,
especially since it could be code you committed that is at fault.

--
Simon Riggs www.2ndQuadrant.com

From: Robert Haas on 12 May 2010 12:04

On Wed, May 12, 2010 at 11:28 AM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote:
>> On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote:
>> > On Wed, May 12, 2010 at 7:26 AM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
>> > > On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote:
>> > >
>> > >> I'm not sure what to make of this. Sometimes not shutting down
>> > >> doesn't sound like a feature to me.
>> > >
>> > > It acts exactly the same in recovery as in normal running. It is not a
>> > > special feature of recovery at all, bug or otherwise.
>> >
>> > Simon, that doesn't make any sense. We are talking about a backend
>> > getting stuck forever on an exclusive lock that is held by the startup
>> > process and which will never be released (for example, because the
>> > master has shut down and no more WAL can be obtained for replay). The
>> > startup process does not hold locks in normal operation.
>>
>> When I test it, startup process holding a lock does not prevent shutdown
>> of a standby.
>>
>> I'd be happy to see your test case showing a bug exists and that the
>> behaviour differs from normal running.
>
> Let me put this differently: I accept that Stefan has reported a
> problem. Neither Tom nor myself can reproduce the problem. I've re-run
> Stefan's test case and restarted the server more than 400 times now
> without any issue.

OK, I'm glad to hear you've been testing this. I wasn't aware of that.

> I re-read your post where you gave what you yourself called "uninformed
> speculation". There's no real polite way to say it, but yes your
> speculation does appear to be uninformed, since it is incorrect. Reasons
> would be not least that Stefan's tests don't actually send any locks to
> the standby anyway (!),

Hmm. Well, assuming you're correct, that does seem to be a, uh,
slight problem with my theory.

> but even if they did your speculation as to the
> cause is still all wrong, as explained.

You lost me. I don't understand why the problem that I'm referring to
couldn't happen, even if it's not what's happening here.

> There is no evidence to link this behaviour with HS, as yet, and you
> should be considering the possibility the problem lies elsewhere,
> especially since it could be code you committed that is at fault.

Huh?? The evidence that this bug is linked with HS is that it occurs
on a server running in HS mode, and not otherwise. As for whether the
bug is code I committed, that's certainly possible, but keep in mind
it didn't work at all before IN HOT STANDBY MODE - and that will be
code you committed.

I'm going to go test this and see if I can figure out what's going on.
I hope you will keep at it also - as you point out, your knowledge of
this code far exceeds mine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From: Simon Riggs on 12 May 2010 12:49

On Wed, 2010-05-12 at 12:04 -0400, Robert Haas wrote:

> Huh?? The evidence that this bug is linked with HS is that it occurs
> on a server running in HS mode, and not otherwise. As for whether the
> bug is code I committed, that's certainly possible, but keep in mind
> it didn't work at all before IN HOT STANDBY MODE - and that will be
> code you committed.

I'll say it now, so it's plain. I'm not going to investigate every bug
that occurs on Postgres, just because someone was in HS when they found
it. Any more than all bugs on Postgres in normal running are MVCC bugs.
There needs to be reasonable evidence or a conjecture by someone that
knows something about the code. If HS were the only thing changed in
recovery in this release, that might not seem reasonable, but since we
have much new code and I am not the only developer, it is.

Normal shutdown didn't work on a standby before HS was committed and it
didn't work afterwards either. Use all the capitals you like but if you
use poor arguments and combine that with no evidence then we'll not get
very far, either in working together or in solving the actual bugs.
Please don't continue to make wild speculations about things related to
HS and recovery, so that issues do not become confused; there is no need
to comment on every thread.

--
Simon Riggs www.2ndQuadrant.com

From: Greg Stark on 12 May 2010 13:05

On Wed, May 12, 2010 at 5:49 PM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> On Wed, 2010-05-12 at 12:04 -0400, Robert Haas wrote:
>
>> Huh?? The evidence that this bug is linked with HS is that it occurs
>> on a server running in HS mode, and not otherwise. As for whether the
>> bug is code I committed, that's certainly possible, but keep in mind
>> it didn't work at all before IN HOT STANDBY MODE - and that will be
>> code you committed.
>
> I'll say it now, so it's plain. I'm not going to investigate every bug
> that occurs on Postgres, just because someone was in HS when they found
> it.

Fair enough, though your help debugging is always appreciated
regardless of whether a problem is HS related or not. Nobody's
obligated to work on anything in Postgres after all.

I'm not sure who to blame for the shouting match over whose commit
introduced the bug -- it doesn't seem like a relevant or useful thing
to argue about. Please both stop.

> there is no need
> to comment on every thread.

This is out of line.

--
greg
