Streaming replication, retrying from archive [PgSql]


From: Robert Haas on 14 Jan 2010 09:36

On Thu, Jan 14, 2010 at 9:15 AM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> Imagine this scenario:
>
> 1. Master is up and running, standby is connected and streaming happily
> 2. Network goes down, connection is broken.
> 3. Standby falls behind a lot. Old WAL files that the standby needs are
> archived, and deleted from master.
> 4. Network is restored. Standby reconnects.
> 5. Standby will get an error because the WAL file it needs is no longer
> on the master.
>
> What will currently happen is:
>
> 6. Standby keeps retrying the connection and failing indefinitely, until
> the admin restarts it.
>
> What we would *like* to happen is:
>
> 6. Standby fetches the missing WAL files from archive, then reconnects
> and continues streaming.
>
> Can we fix that?

Just MHO here, but this seems like a bigger project than we should be
starting at this stage of the game.

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Magnus Hagander on 14 Jan 2010 09:39

On Thu, Jan 14, 2010 at 15:36, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Thu, Jan 14, 2010 at 9:15 AM, Heikki Linnakangas
> <heikki.linnakangas(a)enterprisedb.com> wrote:
>> Imagine this scenario:
>>
>> 1. Master is up and running, standby is connected and streaming happily
>> 2. Network goes down, connection is broken.
>> 3. Standby falls behind a lot. Old WAL files that the standby needs are
>> archived, and deleted from master.
>> 4. Network is restored. Standby reconnects.
>> 5. Standby will get an error because the WAL file it needs is no longer
>> on the master.
>>
>> What will currently happen is:
>>
>> 6. Standby keeps retrying the connection and failing indefinitely, until
>> the admin restarts it.
>>
>> What we would *like* to happen is:
>>
>> 6. Standby fetches the missing WAL files from archive, then reconnects
>> and continues streaming.
>>
>> Can we fix that?
>
> Just MHO here, but this seems like a bigger project than we should be
> starting at this stage of the game.

+1.

We want this eventually (heck, it'd be awesome!), but let's get what
we have now stable first.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

From: Heikki Linnakangas on 14 Jan 2010 10:23

Magnus Hagander wrote:
> On Thu, Jan 14, 2010 at 15:36, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> On Thu, Jan 14, 2010 at 9:15 AM, Heikki Linnakangas
>> <heikki.linnakangas(a)enterprisedb.com> wrote:
>>> Imagine this scenario:
>>>
>>> 1. Master is up and running, standby is connected and streaming happily
>>> 2. Network goes down, connection is broken.
>>> 3. Standby falls behind a lot. Old WAL files that the standby needs are
>>> archived, and deleted from master.
>>> 4. Network is restored. Standby reconnects.
>>> 5. Standby will get an error because the WAL file it needs is no longer
>>> on the master.
>>>
>>> What will currently happen is:
>>>
>>> 6. Standby keeps retrying the connection and failing indefinitely, until
>>> the admin restarts it.
>>>
>>> What we would *like* to happen is:
>>>
>>> 6. Standby fetches the missing WAL files from archive, then reconnects
>>> and continues streaming.
>>>
>>> Can we fix that?
>> Just MHO here, but this seems like a bigger project than we should be
>> starting at this stage of the game.
>
> +1.
>
> We want this eventually (heck, it'd be awesome!), but let's get what
> we have now stable first.

If we don't fix that within the server, we will need to document that
caveat and every installation will need to work around that one way or
another. Maybe with some monitoring software and an automatic restart. Ugh.

I wasn't really asking if it's possible to fix, I meant "Let's think
about *how* to fix that".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Dimitri Fontaine on 14 Jan 2010 11:06

Heikki Linnakangas <heikki.linnakangas(a)enterprisedb.com> writes:
> If we don't fix that within the server, we will need to document that
> caveat and every installation will need to work around that one way or
> another. Maybe with some monitoring software and an automatic restart. Ugh.
>
> I wasn't really asking if it's possible to fix, I meant "Let's think
> about *how* to fix that".

Did I mention my viewpoint on that already?
http://archives.postgresql.org/pgsql-hackers/2009-07/msg00943.php

It could well be I'm talking about things that have no relation at all
to what is in the patch currently, and that make no sense for where we
want the patch to go. But I'd like to know about that so that I'm not
banging my head on the nearest wall each time the topic surfaces.

Regards,
--
dim

From: Fujii Masao on 14 Jan 2010 11:28

On Fri, Jan 15, 2010 at 12:23 AM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> If we don't fix that within the server, we will need to document that
> caveat and every installation will need to work around that one way or
> another. Maybe with some monitoring software and an automatic restart. Ugh.
>
> I wasn't really asking if it's possible to fix, I meant "Let's think
> about *how* to fix that".

OK. How about the following (though it's a rough design)?

(1) If walsender cannot read the WAL file because of ENOENT, it sends a
    special message indicating that error to walreceiver. This message is
    shipped over the COPY protocol.

(2-a) If the message arrives, walreceiver exits by calling proc_exit().
(3-a) If the startup process detects the exit of walreceiver in
      WaitNextXLogAvailable(), it switches back to normal archive recovery
      mode, closes the currently opened WAL file, resets some variables
      (readId, readSeg, etc.), and calls FetchRecord() again. It then tries
      to restore the WAL file from the archive if restore_command is
      supplied, and switches to streaming recovery mode again if invalid
      WAL is found.
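
For illustration only, the control flow of approach (a) might be modelled
roughly as below. This is a toy Python sketch, not PostgreSQL code: WAL
segments are reduced to plain integers, and the names replay, Mode,
primary_segs and archive_segs are invented for the example. The "stuck" check
is also an addition of the sketch, so that it stops instead of looping when a
segment is in neither place.

```python
from enum import Enum


class Mode(Enum):
    STREAMING = "streaming"
    ARCHIVE = "archive"


def replay(primary_segs, archive_segs, next_seg):
    """Toy model of approach (a): return [(segment, source), ...] replayed.

    primary_segs: segment numbers still present in the primary's pg_xlog
    archive_segs: segment numbers retrievable via restore_command
    next_seg:     first segment the standby still needs
    """
    mode = Mode.STREAMING
    replayed = []
    stuck = False  # True right after a fallback that has restored nothing yet
    while next_seg <= max(primary_segs):
        if mode is Mode.STREAMING:
            if next_seg in primary_segs:
                replayed.append((next_seg, "stream"))
                next_seg += 1
                stuck = False
            elif stuck:
                raise RuntimeError(f"segment {next_seg} is in neither place")
            else:
                # (1) + (2-a): walsender reports ENOENT, walreceiver exits.
                # (3-a): the startup process falls back to archive recovery.
                mode = Mode.ARCHIVE
                stuck = True
        else:
            if next_seg in archive_segs:
                replayed.append((next_seg, "archive"))
                next_seg += 1
                stuck = False
            else:
                # Nothing more to restore: switch back to streaming.
                mode = Mode.STREAMING
    return replayed
```

With primary_segs={5, 6, 7}, archive_segs={1, 2, 3, 4} and next_seg=3, the
model restores segments 3 and 4 from the archive and then streams 5 to 7.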

Or

(2-b) If the message arrives, walreceiver executes restore_command, and
      then sets receivedUpto to the end location of the restored WAL file.
      The restored file is expected to be full, because it no longer exists
      in the primary's pg_xlog, so that update of receivedUpto is OK.
(3-b) After one WAL file is restored, walreceiver tries to connect to the
      primary and starts replication again. If the ENOENT error occurs
      again, we go back to (1).
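
Again purely as an illustration, approach (b) could be modelled as a single
loop inside walreceiver. This is a toy Python sketch under simplifying
assumptions: segments are plain integers, positions are byte offsets with the
default 16 MB segment size, and stream_from, restore_command (as a Python
function), SegmentMissing and walreceiver are hypothetical stand-ins, not
real PostgreSQL interfaces.

```python
SEG_SIZE = 16 * 1024 * 1024  # default WAL segment size (16 MB)


class SegmentMissing(Exception):
    """Stands in for the special ENOENT message sent by walsender."""


def stream_from(primary_segs, seg):
    """Hypothetical walsender: stream one whole segment or report ENOENT."""
    if seg not in primary_segs:
        raise SegmentMissing(seg)
    return (seg + 1) * SEG_SIZE  # position just past the streamed segment


def restore_command(archive_segs, seg):
    """Hypothetical wrapper: True if the segment was restored from archive."""
    return seg in archive_segs


def walreceiver(primary_segs, archive_segs, received_upto):
    """Toy model of approach (b): return (received_upto, restored segments)."""
    restored = []
    end = (max(primary_segs) + 1) * SEG_SIZE
    while received_upto < end:
        seg = received_upto // SEG_SIZE
        try:
            # (3-b): always try the primary first.
            received_upto = stream_from(primary_segs, seg)
        except SegmentMissing:
            # (2-b): restore the file, then jump to its end, since an
            # archived segment is expected to be full.
            if not restore_command(archive_segs, seg):
                raise RuntimeError(f"segment {seg} is in neither place")
            restored.append(seg)
            received_upto = (seg + 1) * SEG_SIZE
    return received_upto, restored
```

Compared with the model of (a), there is no mode switch and no second
process involved, which is the sense in which (b) looks simpler here.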

I like the latter approach since it's simpler. Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center