master in standby mode croaks [PgSql]

From: Simon Riggs on 1 Apr 2010 19:06

On Tue, 2010-03-30 at 22:40 -0400, Robert Haas wrote:
> I discovered tonight that if you shut down a server, create
> recovery.conf with standby_mode = 'on', and start it back up again,
> you get this:
>
> LOG: database system was shut down at 2010-03-30 22:34:09 EDT
> LOG: entering standby mode
> FATAL: recovery connections cannot start because the
> recovery_connections parameter is disabled on the WAL source server
> LOG: startup process (PID 22980) exited with exit code 1
> LOG: aborting startup due to startup process failure
>
> Now, you might certainly argue that this is a stupid thing to do (my
> motivation was to test some stuff) but certainly it's fair to say that
> error message is darn misleading, since in fact recovery_connections
> was NOT disabled. I believe this is the same "start up from a shut
> down checkpoint" problem that's been discussed previously so I won't
> belabor the point other than to say that

I don't think it is the same thing at all. This is a separate error and
should be rejected as such.

> I still think we need to fix this.

Agreed, as a separate issue.

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 2 Apr 2010 04:51

On Apr 1, 2010, at 7:06 PM, Simon Riggs <simon(a)2ndQuadrant.com> wrote:
> On Tue, 2010-03-30 at 22:40 -0400, Robert Haas wrote:
>> I discovered tonight that if you shut down a server, create
>> recovery.conf with standby_mode = 'on', and start it back up again,
>> you get this:
>>
>> LOG: database system was shut down at 2010-03-30 22:34:09 EDT
>> LOG: entering standby mode
>> FATAL: recovery connections cannot start because the
>> recovery_connections parameter is disabled on the WAL source server
>> LOG: startup process (PID 22980) exited with exit code 1
>> LOG: aborting startup due to startup process failure
>>
>> Now, you might certainly argue that this is a stupid thing to do (my
>> motivation was to test some stuff) but certainly it's fair to say that
>> error message is darn misleading, since in fact recovery_connections
>> was NOT disabled. I believe this is the same "start up from a shut
>> down checkpoint" problem that's been discussed previously so I won't
>> belabor the point other than to say that
>
> I don't think it is the same thing at all. This is a separate error and
> should be rejected as such.
>
>> I still think we need to fix this.
>
> Agreed, as a separate issue.

OK, fair enough. I admit I didn't investigate what was causing this.

....Robert

From: Simon Riggs on 2 Apr 2010 05:36

On Fri, 2010-04-02 at 04:51 -0400, Robert Haas wrote:
> On Apr 1, 2010, at 7:06 PM, Simon Riggs <simon(a)2ndQuadrant.com> wrote:
> > On Tue, 2010-03-30 at 22:40 -0400, Robert Haas wrote:
> >> I discovered tonight that if you shut down a server, create
> >> recovery.conf with standby_mode = 'on', and start it back up again,
> >> you get this:
> >>
> >> LOG: database system was shut down at 2010-03-30 22:34:09 EDT
> >> LOG: entering standby mode
> >> FATAL: recovery connections cannot start because the
> >> recovery_connections parameter is disabled on the WAL source server
> >> LOG: startup process (PID 22980) exited with exit code 1
> >> LOG: aborting startup due to startup process failure
> >>
> >> Now, you might certainly argue that this is a stupid thing to do (my
> >> motivation was to test some stuff) but certainly it's fair to say that
> >> error message is darn misleading, since in fact recovery_connections
> >> was NOT disabled. I believe this is the same "start up from a shut
> >> down checkpoint" problem that's been discussed previously so I won't
> >> belabor the point other than to say that
> >
> > I don't think it is the same thing at all. This is a separate error and
> > should be rejected as such.

I can't duplicate this error based upon what you have said.

With just standby_mode = 'on' the standby just waits forever, with a ps
message set to

    postgres: startup process waiting for 000000010000000000000000

That's not very good, but it isn't the error you describe.

--
Simon Riggs www.2ndQuadrant.com

From: Robert Haas on 10 Apr 2010 09:02

On Fri, Apr 2, 2010 at 5:36 AM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> I can't duplicate this error based upon what you have said.

I fooled around with this some more and I think I know what's going
on. The error message I received was:

recovery connections cannot start because the recovery_connections
parameter is disabled on the WAL source server

This is generated when !checkPoint.XLogStandbyInfoMode. That, in
turn, is set on the master to the results of XLogStandbyInfoActive(),
which is defined as XLogRequestRecoveryConnections && XLogIsNeeded().
XLogIsNeeded() is defined as
XLogArchivingActive() || (max_wal_senders > 0), and
XLogArchivingActive() is defined as XLogArchiveMode. So when you
expand it all out, this error message gets triggered when the
following condition does not hold on the master:

XLogRequestRecoveryConnections && (XLogArchiveMode || (max_wal_senders > 0))

So this can fail in either of two ways: (1)
XLogRequestRecoveryConnections (aka recovery_connections) might be
false, which is the situation described in the error message, or (2)
XLogArchiveMode (archive_mode) might be false and at the same time
max_wal_senders might be zero. As it happens, the default
configuration of the system is recovery_connections = true,
archive_mode = false, max_wal_senders = 0, so with an out-of-the-box
config it fails for the reason that isn't the one described in the
error message.
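
To make the two failure paths concrete, here is a minimal C sketch of the
expanded condition. It is not the PostgreSQL source: the MasterSettings
struct, the FailureReason enum and the function names are hypothetical
stand-ins for the GUCs and macros named above, and the default values are
the ones quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the three settings on the master. */
typedef struct {
    bool recovery_connections;  /* models XLogRequestRecoveryConnections */
    bool archive_mode;          /* models XLogArchiveMode */
    int  max_wal_senders;
} MasterSettings;

/* Illustrative labels for why the standby check passes or fails. */
typedef enum {
    STANDBY_INFO_OK,           /* condition holds, no error */
    RECOVERY_CONNECTIONS_OFF,  /* case (1): what the message says */
    WAL_NOT_NEEDED             /* case (2): archive_mode off, no WAL senders */
} FailureReason;

/* Models XLogIsNeeded(): archiving active or at least one WAL sender. */
static bool xlog_is_needed(const MasterSettings *s)
{
    return s->archive_mode || (s->max_wal_senders > 0);
}

/* Models XLogStandbyInfoActive(): the fully expanded condition. */
static bool xlog_standby_info_active(const MasterSettings *s)
{
    return s->recovery_connections && xlog_is_needed(s);
}

/* Reports which of the two cases makes the condition fail. */
static FailureReason standby_info_failure(const MasterSettings *s)
{
    if (xlog_standby_info_active(s))
        return STANDBY_INFO_OK;
    if (!s->recovery_connections)
        return RECOVERY_CONNECTIONS_OFF;
    return WAL_NOT_NEEDED;
}

/* Out-of-the-box config: fails for case (2), not the case in the message. */
static void demo_default_config(void)
{
    MasterSettings defaults = { true, false, 0 };

    assert(!xlog_standby_info_active(&defaults));
    assert(standby_info_failure(&defaults) == WAL_NOT_NEEDED);
}
```

Calling demo_default_config() exercises the default configuration: the
check fails even though recovery_connections is true, which is the
mismatch with the error text.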

One possible approach here is to improve the error message, but it
seems to me that having the ability of Hot Standby to run on the slave
partially controlled by three different GUCs is awfully complicated.
I think the root of the problem here is that recovery_connections
controls one behavior on the primary (whether or not we WAL-log
certain information needed for HS) and a completely unrelated behavior
on the standby (whether or not we try to allow read-only backends into
the system). In 8.4 and prior, it was always the job of archive_mode
to decide whether WAL-logging was needed. Maybe we should go back to
that and make it an enum:

wal_mode = {standby | archive | off}
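
As a rough illustration of how a single setting could replace the
three-way condition, here is a hedged C sketch. Everything in it is
hypothetical: the WalMode enum and both helper functions are invented
for this example, and it assumes each level logs everything the level
below it does, which the proposal above does not spell out.

```c
#include <stdbool.h>

/* Hypothetical enum for the proposed wal_mode setting; names are invented. */
typedef enum {
    WAL_MODE_OFF,      /* no extra WAL-logging for archiving or standby */
    WAL_MODE_ARCHIVE,  /* WAL-log what archiving needs */
    WAL_MODE_STANDBY   /* assumed: archive level plus the info HS needs */
} WalMode;

/* Assumption: standby implies archive, so a simple ordering test works. */
static bool wal_is_needed(WalMode mode)
{
    return mode >= WAL_MODE_ARCHIVE;
}

/* Only the standby level would WAL-log the extra Hot Standby information. */
static bool wal_standby_info_active(WalMode mode)
{
    return mode == WAL_MODE_STANDBY;
}
```

Under these assumptions the master-side check depends on one value, so
an error message about it could name a single parameter.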

....Robert

From: Simon Riggs on 14 Apr 2010 04:21

On Sat, 2010-04-10 at 09:02 -0400, Robert Haas wrote:

> So this can fail in either of two ways

If I understand this correctly, it is unconvincing as a failure mode
since it doesn't follow any of the documented procedures for creating a
standby. There are many ways to screw up that ignore the manual, which
is why the manual exists.

If you can show a full test case, with failure, then I'll follow it
through.

--
Simon Riggs www.2ndQuadrant.com
