Stefan's bug (was: max_standby_delay considered harmful) [PgSql]

Prev: comment needs to be updated for HS?
Next: pg_upgrade code questions

From: Alvaro Herrera on 12 May 2010 22:36

Excerpts from Robert Haas's message of mié may 12 20:48:41 -0400 2010:
> On Wed, May 12, 2010 at 3:55 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> > I am wondering if we are not correctly handling the case where we get
> > a shutdown request while we are still in the PM_STARTUP state. It
> > looks like we might go ahead and switch to PM_RECOVERY and then
> > PM_RECOVERY_CONSISTENT without noticing the shutdown. There is some
> > logic to handle the shutdown when the startup process exits, but if
> > the startup process never exits it looks like we might get stuck.
>
> I can reproduce the behavior Stefan is seeing consistently using the
> attached patch.
>
> W1: postgres -D ~/pgslave
> W2: pg_ctl -D ~/pgslave stop

If there's anything to learn from this patch, is that sleep is
uninterruptible on some platforms. This is why sleeps elsewhere are
broken down in loops, sleeping in small increments and checking
interrupts each time. Maybe some of the sleeps in the new HS code need
to be handled this way?
--

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 12 May 2010 22:41

On Wed, May 12, 2010 at 10:36 PM, Alvaro Herrera
<alvherre(a)alvh.no-ip.org> wrote:
> Excerpts from Robert Haas's message of mié may 12 20:48:41 -0400 2010:
>> On Wed, May 12, 2010 at 3:55 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> > I am wondering if we are not correctly handling the case where we get
>> > a shutdown request while we are still in the PM_STARTUP state. It
>> > looks like we might go ahead and switch to PM_RECOVERY and then
>> > PM_RECOVERY_CONSISTENT without noticing the shutdown. There is some
>> > logic to handle the shutdown when the startup process exits, but if
>> > the startup process never exits it looks like we might get stuck.
>>
>> I can reproduce the behavior Stefan is seeing consistently using the
>> attached patch.
>>
>> W1: postgres -D ~/pgslave
>> W2: pg_ctl -D ~/pgslave stop
>
> If there's anything to learn from this patch, is that sleep is
> uninterruptible on some platforms. This is why sleeps elsewhere are
> broken down in loops, sleeping in small increments and checking
> interrupts each time. Maybe some of the sleeps in the new HS code need
> to be handled this way?

I don't think the problem is that the sleep is uninterruptible. I
think the problem is that a smart shutdown request received while in
the PM_STARTUP state does not acted upon until we enter the PM_RUN
state. That is, there's a race condition between the SIGUSR1 that the
startup process sends to the postmaster to signal that recovery has
begun and the SIGTERM being sent by pg_ctl.

However, since I haven't succeeded in producing a fix yet, take that
with a grain of salt...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 16 May 2010 21:25

On Wed, May 12, 2010 at 10:41 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Wed, May 12, 2010 at 10:36 PM, Alvaro Herrera
> <alvherre(a)alvh.no-ip.org> wrote:
>> Excerpts from Robert Haas's message of mié may 12 20:48:41 -0400 2010:
>>> On Wed, May 12, 2010 at 3:55 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>> > I am wondering if we are not correctly handling the case where we get
>>> > a shutdown request while we are still in the PM_STARTUP state. It
>>> > looks like we might go ahead and switch to PM_RECOVERY and then
>>> > PM_RECOVERY_CONSISTENT without noticing the shutdown. There is some
>>> > logic to handle the shutdown when the startup process exits, but if
>>> > the startup process never exits it looks like we might get stuck.
>>>
>>> I can reproduce the behavior Stefan is seeing consistently using the
>>> attached patch.
>>>
>>> W1: postgres -D ~/pgslave
>>> W2: pg_ctl -D ~/pgslave stop
>>
>> If there's anything to learn from this patch, is that sleep is
>> uninterruptible on some platforms. This is why sleeps elsewhere are
>> broken down in loops, sleeping in small increments and checking
>> interrupts each time. Maybe some of the sleeps in the new HS code need
>> to be handled this way?
>
> I don't think the problem is that the sleep is uninterruptible. I
> think the problem is that a smart shutdown request received while in
> the PM_STARTUP state does not acted upon until we enter the PM_RUN
> state. That is, there's a race condition between the SIGUSR1 that the
> startup process sends to the postmaster to signal that recovery has
> begun and the SIGTERM being sent by pg_ctl.
>
> However, since I haven't succeeded in producing a fix yet, take that
> with a grain of salt...

I'm replying to this thread rather than the max_standby_delay thread
because this line of discussion is not actually related to
max_standby_delay.

Fujii Masao previously reported this problem and proposed this fix:
http://archives.postgresql.org/pgsql-hackers/2010-04/msg00592.php

I responded by proposing that we instead simply adjust things so that
a smart shutdown is always handled at the end of recovery, just prior
to entering normal running. On further reflection, though, I've
concluded that was a dumb idea. Currently, when a smart shutdown
occurs during recovery, we handle it immediately UNLESS it happens
before we receive the signal that tells us it's OK to start the
background writer, in which case we get confused and lock everyone out
of the database until a fast or immediate shutdown request arrives.
So this is just a race condition, plain and simple. Therefore I think
Fujii Masao's original idea was the best, but I have what I believe is
an equivalent but simpler implementation, which is attached.

Thoughts? Should we try to fix this in 8.4 also, or just in HEAD?
8.3 and 8.2 never handle a smart shutdown prior to entering normal
running, and while that seems pretty useless, doing something
different would be a behavior change, so that seems like a
non-starter. 8.4 has the same behavior as HEAD, though it's not
documented in the release notes, so it's not clear how intentional the
change was.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From: Simon Riggs on 17 May 2010 03:13

On Sun, 2010-05-16 at 21:25 -0400, Robert Haas wrote:

> I have what I believe is
> an equivalent but simpler implementation, which is attached.

There's no code comments to explain this, so without in-depth analysis
of the problem, Masao's patch and this one its not possible to say
anything.

Please explain in detail why its the right approach and put that in a
comment, so we'll understand now and in the future.

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on 17 May 2010 03:38

On Mon, May 17, 2010 at 10:25 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> Therefore I think
> Fujii Masao's original idea was the best, but I have what I believe is
> an equivalent but simpler implementation, which is attached.

Seems good.

I found another two problems related to shutdown in PM_STARTUP state:

(1)
Smart or fast shutdown requested in PM_STARTUP state always removes
the backup_label file if it exists. But it might be still required
for subsequent recovery. I changed your patch so that additionally
the postmaster skips deleting the backup_label in that case.

(2)
pg_ctl -ms stop emits the following warning whenever there is the
backup_label file in $PGDATA.

WARNING: online backup mode is active
Shutdown will not complete until pg_stop_backup() is called.

This warning doesn't fit in with the shutdown during recovery case.
Since smart shutdown might be requested by other than pg_ctl, the
warning should be emitted in server side rather than client, I think.
How about moving the warning to the server side?

> Thoughts? Should we try to fix this in 8.4 also, or just in HEAD?
> 8.3 and 8.2 never handle a smart shutdown prior to entering normal
> running, and while that seems pretty useless, doing something
> different would be a behavior change, so that seems like a
> non-starter. 8.4 has the same behavior as HEAD, though it's not
> documented in the release notes, so it's not clear how intentional the
> change was.

In 8.4, smart shutdown during recovery waits until the startup process
has exited. So the backporting to 8.4 doesn't improve any situation,
I think.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

| Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10
Prev: comment needs to be updated for HS?
Next: pg_upgrade code questions