Stefan's bug (was: max_standby_delay considered harmful) [PgSql]

Prev: comment needs to be updated for HS?
Next: pg_upgrade code questions

From: Fujii Masao on 26 May 2010 21:06

On Wed, May 26, 2010 at 9:40 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Mon, May 24, 2010 at 10:35 AM, Fujii Masao <masao.fujii(a)gmail.com> wrote:
>> On Mon, May 24, 2010 at 10:26 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>> This looks pretty reasonable to me, but I guess I feel like it would
>>> be better to drive the CancelBackup() decision off of whether we've
>>> ever reached PM_RUN rather than consulting XLogCtl. It just feels
>>> cleaner to me to drive all of the postmaster decisions off of the same
>>> signalling mechanism rather than having a separate one (that only
>>> works because it's used very late in shutdown when we theoretically
>>> don't need a lock) just for this one case.
>>
>> Okay, how about the attached patch? It uses the postmaster-local flag
>> "ReachedEndOfRecovery" (better name?) instead of XLogCtl one.
>
> I've committed part of this patch, with the naming change that Tom
> suggested.

Thanks!

> The parts I haven't committed are:
>
> 1. I don't see why we need to reset ReachedEndOfRecovery starting over
> from PM_NO_CHILDREN. It seems to me that once we reach PM_RUN, we
> CAN'T go back to needing the backup label file, even if we have a
> subsequent backend crash. If I'm wrong, please let me know why and
> I'll go put this back (with an appropriate comment).

That reset has nothing to do with cancellation of the backup mode.
I just added it since postmaster might use ReachedNormalRunning for
another purpose in the future. For now, I have no objection not to
add it.

> 2. The changes to avoid launching WALReceiver except during certain
> PM_* states. It seems fairly sensible, but what is the case where
> adding this logic prevents a problem?

The problem is that shutdown would get stuck when walreceiver is
invoked in PM_WAIT_BACKEND state. Imagine the following scenario:

1. Fast shutdown is requested just before the startup process calls
RequestXLogStreaming() which is the function to request postmaster
to invoke walreceiver.

2. pmdie() sends SIGTERM to the startup process, but not walreceiver
because it's not been started yet. Then, pmdie() switches the
state into PM_WAIT_BACKENDS.

3. The startup process goes through RequestXLogStreaming() and requests
postmaster to start walreceiver before processing SIGTERM sent from
pmdie(). Then the startup process exits, and postmaster invokes
walreceiver in PM_WAIT_BACKENDS state.

4. Once postmaster has reached PM_WAIT_BACKENDS, there is no way to send
SIGTERM to walreceiver. OTOH, postmaster cannot advance the state from
PM_WAIT_BACKENDS until walreceiver has gone away. So shutdown gets
stuck.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on 26 May 2010 21:10

On Wed, May 26, 2010 at 9:54 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Tue, May 25, 2010 at 6:19 AM, Fujii Masao <masao.fujii(a)gmail.com> wrote:
>> On Tue, May 18, 2010 at 3:09 PM, Fujii Masao <masao.fujii(a)gmail.com> wrote:
>>>>> (2)
>>>>> pg_ctl -ms stop emits the following warning whenever there is the
>>>>> backup_label file in $PGDATA.
>>>>>
>>>>> WARNING: online backup mode is active
>>>>> Shutdown will not complete until pg_stop_backup() is called.
>>>>>
>>>>> This warning doesn't fit in with the shutdown during recovery case.
>>>>> Since smart shutdown might be requested by other than pg_ctl, the
>>>>> warning should be emitted in server side rather than client, I think.
>>>>> How about moving the warning to the server side?
>>>
>>> Though I'm not sure if this should be fixed for 9.0, I attached the
>>> patch (move_bkp_cancel_warning_v1.patch).
>>
>> This patch is worth applying for 9.0? If not, I'll add it into
>> the next CF for 9.1.
>
> I'm not convinced that this is a good idea, because ISTM it will make
> the error message to be less likely to be seen by the person running
> pg_ctl. In any case, it's a behavior change, so I think that means
> it's a no-go for 9.0.
>
> In terms of 9.1, it might make sense to log something to both places.
> But maybe we shouldn't just do it once - maybe it should happen every
> 30 s or so until we actually manage to shut down, with a list of
> what's still blocking shutdown a la errdetail_busy_db.

Fair enough. I'll think of the issue in detail again for 9.1.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 26 May 2010 22:03

On Wed, May 26, 2010 at 9:06 PM, Fujii Masao <masao.fujii(a)gmail.com> wrote:
> On Wed, May 26, 2010 at 9:40 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> On Mon, May 24, 2010 at 10:35 AM, Fujii Masao <masao.fujii(a)gmail.com> wrote:
>>> On Mon, May 24, 2010 at 10:26 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>>> This looks pretty reasonable to me, but I guess I feel like it would
>>>> be better to drive the CancelBackup() decision off of whether we've
>>>> ever reached PM_RUN rather than consulting XLogCtl. It just feels
>>>> cleaner to me to drive all of the postmaster decisions off of the same
>>>> signalling mechanism rather than having a separate one (that only
>>>> works because it's used very late in shutdown when we theoretically
>>>> don't need a lock) just for this one case.
>>>
>>> Okay, how about the attached patch? It uses the postmaster-local flag
>>> "ReachedEndOfRecovery" (better name?) instead of XLogCtl one.
>>
>> I've committed part of this patch, with the naming change that Tom
>> suggested.
>
> Thanks!
>
>> The parts I haven't committed are:
>>
>> 1. I don't see why we need to reset ReachedEndOfRecovery starting over
>> from PM_NO_CHILDREN. It seems to me that once we reach PM_RUN, we
>> CAN'T go back to needing the backup label file, even if we have a
>> subsequent backend crash. If I'm wrong, please let me know why and
>> I'll go put this back (with an appropriate comment).
>
> That reset has nothing to do with cancellation of the backup mode.
> I just added it since postmaster might use ReachedNormalRunning for
> another purpose in the future. For now, I have no objection not to
> add it.

OK, good. I'm not sure it would be right to add it any way - reached
normal running sounds to me like it ought to mean "reached normal
running, ever" rather than "reached normal running since the last
backend crash". In any event, it's moot for now.

>> 2. The changes to avoid launching WALReceiver except during certain
>> PM_* states. It seems fairly sensible, but what is the case where
>> adding this logic prevents a problem?
>
> The problem is that shutdown would get stuck when walreceiver is
> invoked in PM_WAIT_BACKEND state. Imagine the following scenario:
>
> 1. Fast shutdown is requested just before the startup process calls
> RequestXLogStreaming() which is the function to request postmaster
> to invoke walreceiver.
>
> 2. pmdie() sends SIGTERM to the startup process, but not walreceiver
> because it's not been started yet. Then, pmdie() switches the
> state into PM_WAIT_BACKENDS.
>
> 3. The startup process goes through RequestXLogStreaming() and requests
> postmaster to start walreceiver before processing SIGTERM sent from
> pmdie(). Then the startup process exits, and postmaster invokes
> walreceiver in PM_WAIT_BACKENDS state.
>
> 4. Once postmaster has reached PM_WAIT_BACKENDS, there is no way to send
> SIGTERM to walreceiver. OTOH, postmaster cannot advance the state from
> PM_WAIT_BACKENDS until walreceiver has gone away. So shutdown gets
> stuck.

OK, makes sense. I have committed this part also.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

First | Prev |
Pages: 1 2 3 4 5 6 7 8 9 10
Prev: comment needs to be updated for HS?
Next: pg_upgrade code questions