Streaming replication and triggering failover [PgSql]

From: Magnus Hagander on 8 Jan 2010 05:04

On Fri, Jan 8, 2010 at 10:58, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> The trigger file logic feels a bit backwards. As the patch stands, when
> the standby starts up, it retries connecting to the master server
> indefinitely, until a connection is successfully established. Then it
> streams until the connection breaks. If the connection is dropped
> abruptly, because of a network problem or crash in the master, standby
> retries indefinitely.
>
> If master is shut down cleanly, standby gets out of recovery mode, and
> starts up. Unless the trigger file is present; if it is, standby waits
> for it to go away before finishing recovery.
>
> So the trigger file is really a "holdoff file", like a safety catch on a
> gun. At the very least it should be renamed, but I don't think that's a
> very useful behavior anyway.
>
> It doesn't seem wise to consider a clean shutdown of the master as a
> signal to trigger failover. If you're setting up a HA system, that by
> itself is not robust enough; you also need to trigger failover if the
> master goes down unexpectedly, or if the standby was disconnected for
> some reason when the master was shut down. Secondly, what if you want to
> restart the master server, without initiating failover? You'll have to
> restart the standby too, to have it reconnect.
>
> Let's have a default of no failover, and retry connecting to the master
> indefinitely. When you *do* want to fail over, create the trigger file.
> When the standby sees the trigger file, it should stop streaming, finish
> up replaying what it had streamed up to that point, and start up as new
> master.

+1.

The default should be to "maintain the replication cluster", if
nothing else then by principle of least surprise.

It would also agree with a well-established procedure, which is what
pg_standby does. Keeping the same basic behavior around something like
this can only be a good thing.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Heikki Linnakangas on 8 Jan 2010 05:41

Magnus Hagander wrote:
> On Fri, Jan 8, 2010 at 10:58, Heikki Linnakangas
> <heikki.linnakangas(a)enterprisedb.com> wrote:
>> So the trigger file is really a "holdoff file", like a safety catch on a
>> gun. At the very least it should be renamed, but I don't think that's a
>> very useful behavior anyway.
>>
>> It doesn't seem wise to consider a clean shutdown of the master as a
>> signal to trigger failover. If you're setting up a HA system, that by
>> itself is not robust enough; you also need to trigger failover if the
>> master goes down unexpectedly, or if the standby was disconnected for
>> some reason when the master was shut down. Secondly, what if you want to
>> restart the master server, without initiating failover? You'll have to
>> restart the standby too, to have it reconnect.
>>
>> Let's have a default of no failover, and retry connecting to the master
>> indefinitely. When you *do* want to fail over, create the trigger file.
>> When the standby sees the trigger file, it should stop streaming, finish
>> up replaying what it had streamed up to that point, and start up as new
>> master.
>
> +1.
>
> The default should be to "maintain the replication cluster", if
> nothing else then by principle of least surprise.
>
> It would also agree with a well-established procedure, which is what
> pg_standby does. Keeping the same basic behavior around something like
> this can only be a good thing.

Thinking more clearly, my comment above about the trigger file logic
being backwards was bollocks; if the master is shut down, standby waits
for the trigger file to appear, not to go away. And creating the trigger
file during replication causes it to finish, and failover to happen.

Nevertheless, let's make the default "no failover" if no trigger file
location is configured, and remove the notion that normal shutdown of
master stops recovery.
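
The behavior agreed on above can be sketched as a small simulation. This is an illustrative model, not code from the patch: `run_standby`, `trigger_present`, and the list-of-records representation of a connection are all hypothetical names made up for the example, and a real standby would retry forever rather than stop when the simulated connections run out.

```python
import os
from typing import Iterable, List, Optional, Tuple


def trigger_present(trigger_file: Optional[str]) -> bool:
    """True only if a trigger file location is configured and the file exists."""
    return trigger_file is not None and os.path.exists(trigger_file)


def run_standby(connections: Iterable[List[str]],
                trigger_file: Optional[str]) -> Tuple[str, List[str]]:
    """Simulate a standby (hypothetical model, not the patch's code).

    Each item of `connections` models one connection to the master as the
    list of WAL records streamed before that connection ended. An empty
    list models a failed connection attempt.
    """
    replayed: List[str] = []
    for records in connections:
        for record in records:
            # The trigger file is the only thing that ends recovery: stop
            # streaming, keep what was already replayed, and promote.
            if trigger_present(trigger_file):
                return "promoted", replayed
            replayed.append(record)
        # The connection ended. A clean master shutdown and a network
        # failure are treated the same way: go around and reconnect.
    state = "promoted" if trigger_present(trigger_file) else "standby"
    return state, replayed
```

With `trigger_file=None` (no location configured) the function can never return `"promoted"`, which corresponds to the "no failover" default.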

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Fujii Masao on 8 Jan 2010 08:02

On Fri, Jan 8, 2010 at 7:41 PM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> Thinking more clearly, my comment above about the trigger file logic
> being backwards was bollocks; if the master is shut down, standby waits
> for the trigger file to appear, not to go away. And creating the trigger
> file during replication causes it to finish, and failover to happen.
>
> Nevertheless, let's make the default "no failover" if no trigger file
> location is configured, and remove the notion that normal shutdown of
> master stops recovery.

You dropped the CheckForStandbyTrigger() call at the end of recovery.
I think this would be a problem when an invalid record is found before
we reach the streaming recovery state. The standby would be brought up
outside the control of the clusterware, which might cause a split-brain
syndrome. Shouldn't we have something to prevent such unexpected
activation?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From: Heikki Linnakangas on 8 Jan 2010 08:31

Fujii Masao wrote:
> You dropped the CheckForStandbyTrigger() call at the end of recovery.
> I think this would be a problem when an invalid record is found before
> we reach the streaming recovery state. The standby would be brought up
> outside the control of the clusterware, which might cause a split-brain
> syndrome. Shouldn't we have something to prevent such unexpected
> activation?

I modified ReadRecord to PANIC if an invalid record is found during
streaming recovery.
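
A minimal sketch of that idea, under stated assumptions: the record layout, `RecoveryPanic`, `make_record`, and `read_record` below are hypothetical stand-ins invented for illustration and do not reflect the actual ReadRecord code. The point is only the branch: outside streaming recovery an invalid record is taken as the end of WAL, while during streaming recovery it is a hard error, so the standby is never brought up as a side effect.

```python
import zlib
from typing import NamedTuple, Optional


class WalRecord(NamedTuple):
    """Hypothetical, simplified record: a payload plus its checksum."""
    payload: bytes
    crc: int


class RecoveryPanic(RuntimeError):
    """Stands in for a PANIC: recovery aborts instead of finishing."""


def make_record(payload: bytes) -> WalRecord:
    return WalRecord(payload, zlib.crc32(payload))


def read_record(record: WalRecord, streaming: bool) -> Optional[WalRecord]:
    """Return the record if valid; otherwise react according to the mode."""
    if zlib.crc32(record.payload) == record.crc:
        return record
    if streaming:
        # Never treat a bad record as "end of WAL" while streaming,
        # otherwise the standby could activate without a trigger.
        raise RecoveryPanic("invalid WAL record during streaming recovery")
    # Outside streaming recovery an invalid record marks the end of WAL.
    return None
```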

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Fujii Masao on 8 Jan 2010 08:41

On Fri, Jan 8, 2010 at 10:31 PM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> Fujii Masao wrote:
>> You dropped the CheckForStandbyTrigger() call at the end of recovery.
>> I think this would be a problem when an invalid record is found before
>> we reach the streaming recovery state. The standby would be brought up
>> outside the control of the clusterware, which might cause a split-brain
>> syndrome. Shouldn't we have something to prevent such unexpected
>> activation?
>
> I modified ReadRecord to PANIC if an invalid record is found during
> streaming recovery.

Oh, sorry. It was my misunderstanding :(

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
