Streaming replication, and walsender during recovery [PgSql]

From: Heikki Linnakangas on 28 Jan 2010 02:47

Fujii Masao wrote:
> OK. Here is the patch which supports a walsender process during recovery;
>
> * Change walsender so as to send the WAL written by the walreceiver
> if it has been started during recovery.
> * Kill the walsenders started during recovery at the end of recovery
> because replication cannot survive the change of timeline ID.

I think there's a race condition at the end of recovery. When the
shutdown checkpoint is written, with new TLI, doesn't a cascading
walsender try to send that to the standby as soon as it's flushed to
disk? But it won't find it in the WAL segment with the old TLI that it's
reading.
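A small illustration of why the record can't be found. In this era, a WAL segment file name is 24 hex digits: 8 each for the timeline ID, the log file number and the segment number. The same log/segment position therefore gets a different file name once the TLI changes. `wal_file_name` below is a hypothetical helper that assumes that layout. It is not PostgreSQL's own code, and the position values are made up.

```python
def wal_file_name(tli: int, log: int, seg: int) -> str:
    """Hypothetical helper: 8 uppercase hex digits each for the timeline ID,
    log file number and segment number (assumed 9.0-era layout)."""
    return f"{tli:08X}{log:08X}{seg:08X}"

# Illustrative position; the walsender is still reading the old-TLI file.
old_name = wal_file_name(1, 0, 0x2A)
# The same log/segment position after the timeline switch.
new_name = wal_file_name(2, 0, 0x2A)
print(old_name)  # 00000001000000000000002A
print(new_name)  # 00000002000000000000002A
```

A walsender that keeps reading `old_name` will never see a record written into `new_name`, even though both cover the same position.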

Also, when segments are restored from the archive, using
restore_command, the cascading walsender won't find them because they're
not written in pg_xlog like normal WAL segments.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on 28 Jan 2010 05:22

On Thu, Jan 28, 2010 at 4:47 PM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> I think there's a race condition at the end of recovery. When the
> shutdown checkpoint is written, with new TLI, doesn't a cascading
> walsender try to send that to the standby as soon as it's flushed to
> disk? But it won't find it in the WAL segment with the old TLI that it's
> reading.

Right. But I don't think such a shutdown checkpoint record is worth
sending from a cascading walsender. I think such a walsender only has
to exit, without regard to the WAL segment with the new TLI.

> Also, when segments are restored from the archive, using
> restore_command, the cascading walsender won't find them because they're
> not written in pg_xlog like normal WAL segments.

Yeah, I need to adjust my approach to the recent 'xlog-refactor' change.
The archived file needs to be restored without a name change, and must
remain in pg_xlog until the bgwriter has recycled it.

But that change would make xlog.c even more complicated. Should we
postpone the 'cascading walsender' feature to v9.1 and, in v9.0, just
forbid walsenders from being started during recovery?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From: Heikki Linnakangas on 28 Jan 2010 05:43

Fujii Masao wrote:
> On Thu, Jan 28, 2010 at 4:47 PM, Heikki Linnakangas
> <heikki.linnakangas(a)enterprisedb.com> wrote:
>> I think there's a race condition at the end of recovery. When the
>> shutdown checkpoint is written, with new TLI, doesn't a cascading
>> walsender try to send that to the standby as soon as it's flushed to
>> disk? But it won't find it in the WAL segment with the old TLI that it's
>> reading.
>
> Right. But I don't think such a shutdown checkpoint record is worth
> sending from a cascading walsender. I think such a walsender only has
> to exit, without regard to the WAL segment with the new TLI.
>
>> Also, when segments are restored from the archive, using
>> restore_command, the cascading walsender won't find them because they're
>> not written in pg_xlog like normal WAL segments.
>
> Yeah, I need to adjust my approach to the recent 'xlog-refactor' change.
> The archived file needs to be restored without a name change, and must
> remain in pg_xlog until the bgwriter has recycled it.

I guess you could just say that it's working as designed, and that WAL
files restored from the archive can't be streamed. Presumably the
cascaded slave can find them in the archive too. But it is pretty weird;
it doesn't feel right.

This reminds me of something I've been pondering anyway. Currently,
restore_command copies the restored WAL segment as pg_xlog/RECOVERYXLOG
instead of the usual 00000... filename. That avoids overwriting any
pre-existing WAL segments in pg_xlog, which may still contain useful
data. Using the same filename over and over also means that we don't
need to worry about deleting old log files during archive recovery.

The downside in standby mode is that once the standby has restored
segment X from the archive and is then restarted, it must find X in the
archive again, or it won't be able to start up. The archive had better
be a directory on the same host.

Streaming Replication, however, took another approach. It does overwrite
any existing files in pg_xlog, so we do need to worry about deleting old
files; but if the master goes down, we can always find files we've
already streamed in pg_xlog, so the standby can recover even if the
master can't be contacted anymore.

That's a bit inconsistent, and causes the problem that a cascading
walsender won't find the files restored from archive.
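The inconsistency can be modelled in a few lines. This is a toy sketch that represents pg_xlog as a dict keyed by file name. The function names are hypothetical, and the model only mirrors the behaviour described above, not the actual xlog.c code. Archive restores land in the fixed name RECOVERYXLOG, while streamed segments keep their real names. A walsender that looks segments up by real name therefore finds only the streamed ones.

```python
# Toy model of pg_xlog as a dict mapping file name -> segment contents.
# Hypothetical sketch: ignores timelines, partial segments and recycling.

def restore_from_archive(pg_xlog, archive, segname):
    """restore_command path: always copied to the fixed name RECOVERYXLOG,
    overwriting whatever segment was restored before."""
    pg_xlog["RECOVERYXLOG"] = archive[segname]

def receive_streamed(pg_xlog, segname, data):
    """Streaming path: written under the real segment name, overwriting
    any existing file of that name."""
    pg_xlog[segname] = data

def cascading_walsender_finds(pg_xlog, segname):
    """A walsender looks a segment up by its real file name."""
    return segname in pg_xlog

SEG1 = "000000010000000000000001"
SEG2 = "000000010000000000000002"
SEG3 = "000000010000000000000003"

archive = {SEG1: b"seg1", SEG2: b"seg2"}
pg_xlog = {}
restore_from_archive(pg_xlog, archive, SEG1)
restore_from_archive(pg_xlog, archive, SEG2)
receive_streamed(pg_xlog, SEG3, b"seg3")

for name in (SEG1, SEG2, SEG3):
    print(name, cascading_walsender_finds(pg_xlog, name))
# 000000010000000000000001 False
# 000000010000000000000002 False
# 000000010000000000000003 True
```

SEG1 has also disappeared from pg_xlog altogether, which is the restart problem described above. After a restart, it has to be fetched from the archive again.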

How about restoring/streaming files to a new directory, say
pg_xlog/restored/, with the real filenames? At least in standby_mode;
it's probably best to keep the current behavior in PITR. That would feel
cleaner: you could easily tell apart files originating from the server
itself and those copied from the master.

> But that change would make xlog.c even more complicated. Should we
> postpone the 'cascading walsender' feature to v9.1 and, in v9.0, just
> forbid walsenders from being started during recovery?

That's certainly the simplest solution...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Fujii Masao on 28 Jan 2010 06:40

On Thu, Jan 28, 2010 at 7:43 PM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> How about restoring/streaming files to a new directory, say
> pg_xlog/restored/, with the real filenames? At least in standby_mode;
> it's probably best to keep the current behavior in PITR. That would feel
> cleaner: you could easily tell apart files originating from the server
> itself and those copied from the master.

When a WAL file with the same name exists in the archive, in pg_xlog
and in pg_xlog/restored/, which directory should we recover it from?
I'm not sure that we can always make the right decision about that.

How about just making restore_command copy the WAL files under their
normal names (e.g., 0000...) instead of as pg_xlog/RECOVERYXLOG?
Though we would then need to worry about deleting them, we could easily
leave that task to the bgwriter.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From: Tom Lane on 28 Jan 2010 10:48

Fujii Masao <masao.fujii(a)gmail.com> writes:
> How about just making restore_command copy the WAL files under their
> normal names (e.g., 0000...) instead of as pg_xlog/RECOVERYXLOG?
> Though we would then need to worry about deleting them, we could easily
> leave that task to the bgwriter.

The reason for doing it that way was to limit disk space usage during
a long restore. I'm not convinced we can leave the task to the bgwriter
--- it shouldn't be deleting anything at that point.
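A back-of-the-envelope sketch of the disk-space concern follows. It assumes the usual 16 MB segment size and assumes that nothing deletes real-named restored segments until the restore finishes. The segment count is made up for illustration. With the fixed RECOVERYXLOG name, restored WAL never occupies more than one segment. With real names and no deletion, it grows linearly with the length of the restore.

```python
SEGMENT_MB = 16  # assumed WAL segment size (the usual default)

def peak_restored_mb(segments_restored, fixed_name):
    """Peak pg_xlog space taken by restored segments, assuming nothing
    deletes real-named segments until the restore is finished."""
    if segments_restored == 0:
        return 0
    if fixed_name:
        return SEGMENT_MB  # RECOVERYXLOG is overwritten each time
    return segments_restored * SEGMENT_MB

n = 10_000  # hypothetical length of a long restore, in segments
print(peak_restored_mb(n, fixed_name=True))   # 16
print(peak_restored_mb(n, fixed_name=False))  # 160000, i.e. about 156 GB
```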

regards, tom lane
