recovery_connections cannot start (was Re: [HACKERS] master in standby mode croaks) [PgSql]

Prev: [HACKERS] CIText and pattern_ops
Next: pgsql: Add missing optimizer hooks for functioncost and number of rows.

From: Robert Haas on 22 Apr 2010 12:04

On Sat, Apr 17, 2010 at 6:52 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Sat, Apr 17, 2010 at 6:41 PM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
>> On Sat, 2010-04-17 at 17:44 -0400, Robert Haas wrote:
>>
>>> > I will change the error message.
>>>
>>> I gave a good deal of thought to trying to figure out a cleaner
>>> solution to this problem than just changing the error message and
>>> failed. So let's change the error message. Of course I'm not quite
>>> sure what we should change it TO, given that the situation is the
>>> result of an interaction between three different GUCs and we have no
>>> way to distinguish which one(s) are the problem.
>>
>> "You need all three" covers it.
>
> Actually you need standby_connections and either archive_mode=on or
> max_wal_senders>0, I think.

One way we could fix this is use 2 bits rather than 1 for
XLogStandbyInfoMode. One bit could indicate that either
archive_mode=on or max_wal_senders>0, and the second bit could
indicate that recovery_connections=on. If the second bit is unset, we
could emit the existing complaint:

recovery connections cannot start because the recovery_connections
parameter is disabled on the WAL source server

If the other bit is unset, then we could instead complain:

recovery connections cannot start because archive_mode=off and
max_wal_senders=0 on the WAL source server

If we don't want to use two bits there, it's hard to really describe
all the possibilities in a reasonable number of characters. The only
thing I can think of is to print a message and a hint:

recovery_connections cannot start due to incorrect settings on the WAL
source server
HINT: make sure recovery_connections=on and either archive_mode=on or
max_wal_senders>0

I haven't checked whether the hint would be displayed in the log on
the standby, but presumably we could make that be the case if it's not
already.

I think the first way is better because it gives the user more
specific information about what they need to fix. Thinking about how
each case might happen, since the default for recovery_connections is
'on', it seems that recovery_connections=off will likely only be an
issue if the user has explicitly turned it off. The other case, where
archive_mode=off and max_wal_senders=0, will likely only occur if
someone takes a snapshot of the master without first setting up
archiving or SR. Both of these will probably happen relatively
rarely, but since we're burning a whole byte for XLogStandbyInfoMode
(plus 3 more bytes of padding?), it seems like we might as well snag
one more bit for clarity.

Thoughts?

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on 22 Apr 2010 20:03

On Fri, Apr 23, 2010 at 1:04 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> One way we could fix this is use 2 bits rather than 1 for
> XLogStandbyInfoMode. One bit could indicate that either
> archive_mode=on or max_wal_senders>0, and the second bit could
> indicate that recovery_connections=on. If the second bit is unset, we
> could emit the existing complaint:
>
> recovery connections cannot start because the recovery_connections
> parameter is disabled on the WAL source server
>
> If the other bit is unset, then we could instead complain:
>
> recovery connections cannot start because archive_mode=off and
> max_wal_senders=0 on the WAL source server
>
> If we don't want to use two bits there, it's hard to really describe
> all the possibilities in a reasonable number of characters. The only
> thing I can think of is to print a message and a hint:
>
> recovery_connections cannot start due to incorrect settings on the WAL
> source server
> HINT: make sure recovery_connections=on and either archive_mode=on or
> max_wal_senders>0
>
> I haven't checked whether the hint would be displayed in the log on
> the standby, but presumably we could make that be the case if it's not
> already.
>
> I think the first way is better because it gives the user more
> specific information about what they need to fix. Thinking about how
> each case might happen, since the default for recovery_connections is
> 'on', it seems that recovery_connections=off will likely only be an
> issue if the user has explicitly turned it off. The other case, where
> archive_mode=off and max_wal_senders=0, will likely only occur if
> someone takes a snapshot of the master without first setting up
> archiving or SR. Both of these will probably happen relatively
> rarely, but since we're burning a whole byte for XLogStandbyInfoMode
> (plus 3 more bytes of padding?), it seems like we might as well snag
> one more bit for clarity.
>
> Thoughts?

I like the second choice since it's simpler and enough for me.
But I have no objection to the first.

When we encounter the error, we would need to not only change
those parameter values but also take a fresh base backup and
restart the standby using it. The description of this required
procedure needs to be in the document or error message, I think.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 23 Apr 2010 06:44

On Fri, Apr 23, 2010 at 5:24 AM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> Fujii Masao wrote:
>> On Fri, Apr 23, 2010 at 1:04 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>> One way we could fix this is use 2 bits rather than 1 for
>>> XLogStandbyInfoMode. One bit could indicate that either
>>> archive_mode=on or max_wal_senders>0, and the second bit could
>>> indicate that recovery_connections=on. If the second bit is unset, we
>>> could emit the existing complaint:
>>>
>>> recovery connections cannot start because the recovery_connections
>>> parameter is disabled on the WAL source server
>>>
>>> If the other bit is unset, then we could instead complain:
>>>
>>> recovery connections cannot start because archive_mode=off and
>>> max_wal_senders=0 on the WAL source server
>>>
>>> If we don't want to use two bits there, it's hard to really describe
>>> all the possibilities in a reasonable number of characters. The only
>>> thing I can think of is to print a message and a hint:
>>>
>>> recovery_connections cannot start due to incorrect settings on the WAL
>>> source server
>>> HINT: make sure recovery_connections=on and either archive_mode=on or
>>> max_wal_senders>0
>>>
>>> I haven't checked whether the hint would be displayed in the log on
>>> the standby, but presumably we could make that be the case if it's not
>>> already.
>>>
>>> I think the first way is better because it gives the user more
>>> specific information about what they need to fix. Thinking about how
>>> each case might happen, since the default for recovery_connections is
>>> 'on', it seems that recovery_connections=off will likely only be an
>>> issue if the user has explicitly turned it off. The other case, where
>>> archive_mode=off and max_wal_senders=0, will likely only occur if
>>> someone takes a snapshot of the master without first setting up
>>> archiving or SR. Both of these will probably happen relatively
>>> rarely, but since we're burning a whole byte for XLogStandbyInfoMode
>>> (plus 3 more bytes of padding?), it seems like we might as well snag
>>> one more bit for clarity.
>>>
>>> Thoughts?
>>
>> I like the second choice since it's simpler and enough for me.
>> But I have no objection to the first.
>>
>> When we encounter the error, we would need to not only change
>> those parameter values but also take a fresh base backup and
>> restart the standby using it. The description of this required
>> procedure needs to be in the document or error message, I think.
>
> I quite liked Robert's proposal to add an explicit GUC to control what
> extra information is logged
> (http://archives.postgresql.org/pgsql-hackers/2010-04/msg00509.php). It
> is quite difficult to explain the current behavior, a simple explicit
> wal_mode GUC would be a lot simpler. It wouldn't add any extra steps to
> setting the system up, you currently need to set archive_mode='on'
> anyway to enable archiving. You would just set wal_mode='archive' or
> wal_mode='standby' instead, depending on what you want to do with the WAL.

I liked it, too, but I sort of decided it didn't buy much. There are
three separate sets of things that need to be controlled:

1. What WAL to emit - (a) just enough for crash recovery, (b) enough
for log shipping, (c) enough for log shipping with recovery
connections.

2. Whether to run the archiver.

3. Whether to allow streaming replication connections (and if so, how many).

If the answer to (1) is "just enough for crash recovery", then (2) and
(3) must be "no". But if (1) is either of the other two options, then
any combination of answers for (2) and (3) is seemingly sensible,
though having both (2) and (3) as no is probably of limited utility.
But at a mimium, you could certainly have:

crash recovery/no archiver/no SR
log shipping/archiver/no SR
log shipping/no archiver/SR
log shipping/archiver/SR
recovery connections/archiver/no SR
recovery connections/no archiver/SR
recovery connections/archiver/SR

I don't see any reasonable way to package all of that up in a single
GUC. Thoughts?

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 23 Apr 2010 07:21

On Fri, Apr 23, 2010 at 7:12 AM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> Robert Haas wrote:
>> On Fri, Apr 23, 2010 at 5:24 AM, Heikki Linnakangas
>> <heikki.linnakangas(a)enterprisedb.com> wrote:
>>> I quite liked Robert's proposal to add an explicit GUC to control what
>>> extra information is logged
>>> (http://archives.postgresql.org/pgsql-hackers/2010-04/msg00509.php). It
>>> is quite difficult to explain the current behavior, a simple explicit
>>> wal_mode GUC would be a lot simpler. It wouldn't add any extra steps to
>>> setting the system up, you currently need to set archive_mode='on'
>>> anyway to enable archiving. You would just set wal_mode='archive' or
>>> wal_mode='standby' instead, depending on what you want to do with the WAL.
>>
>> I liked it, too, but I sort of decided it didn't buy much. There are
>> three separate sets of things that need to be controlled:
>>
>> 1. What WAL to emit - (a) just enough for crash recovery, (b) enough
>> for log shipping, (c) enough for log shipping with recovery
>> connections.
>>
>> 2. Whether to run the archiver.
>>
>> 3. Whether to allow streaming replication connections (and if so, how many).
>
> Streaming replication needs the same information in the WAL as archiving
> does,

True.

> there's no difference between 2 and 3. (the "how many" aspect of 3
> is controlled by max_wal_senders).

False.

I thought what you think too, but discovered otherwise when I read the
code. Some uses of archive_mode are used to control what WAL is
generated, but others control a *process* called the archiver.

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 23 Apr 2010 07:54

On Fri, Apr 23, 2010 at 7:40 AM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> Ok, that brings us back to square one. We could still add the wal_mode
> GUC to explicitly control how much WAL is written (replacing
> recovery_connections in the primary), I think it would still make the
> system easier to explain. But it would add an extra hurdle to enabling
> archiving, you'd have to set wal_mode='archive', archive_mode='on', and
> archive_command. I'm not sure if that would be better or worse than the
> current situation.

I wasn't either, that's why I gave up. It didn't seem worth doing a
major GUC reorganization on the eve of beta unless there was a clear
win. I think there may be a way to improve this but I don't think
it's we should take the time now to figure out what it is. Let's
revisit it for 9.1, and just improve the error reporting for now.

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

| Next | Last
Pages: 1 2 3 4
Prev: [HACKERS] CIText and pattern_ops
Next: pgsql: Add missing optimizer hooks for functioncost and number of rows.