From: Heikki Linnakangas on
Fujii Masao wrote:
> I found another missing feature in the new file-based log shipping
> (i.e., standby_mode is enabled and 'cp' is used as restore_command).
>
> After the trigger file is found, the startup process with pg_standby
> tries to replay all of the WAL files in both pg_xlog and the archive.
> So, when the primary fails, if the latest WAL file in pg_xlog of the
> primary can be read, we can prevent data loss by copying it to
> pg_xlog of the standby before creating the trigger file.
>
> On the other hand, the startup process in standby mode doesn't
> replay the WAL files in pg_xlog after the trigger file is found. So
> failover always causes data loss, even if the latest WAL file can
> be read from the primary. And if the latest WAL file is copied to the
> archive instead, it can be replayed, but a PANIC error would happen
> because the file is not completely filled.
>
> Should we remove this restriction?

Looking into this, I realized that we have a bigger, related problem.
Although streaming replication stores the streamed WAL files in
pg_xlog, so that they can be replayed again after a standby restart
without connecting to the master, we don't try to replay those either.
So if you restart the standby, it will fail to start up if the WAL it
needs can't be found in the archive or by connecting to the master.
That must be fixed.

I'd imagine that the ability to restore WAL files manually copied to
pg_xlog will fall out of that fix too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Heikki Linnakangas on
Fujii Masao wrote:
>> sources &= ~failedSources;
>> failedSources |= readSource;
>
> The above lines in XLogPageRead() do not seem to be required in the
> normal recovery case (i.e., standby_mode = off). So how about the
> attached patch?
>
> *** 9050,9056 **** next_record_is_invalid:
> --- 9047,9056 ----
>       readSource = 0;
>
>       if (StandbyMode)
> +     {
> +         failedSources |= readSource;
>           goto retry;
> +     }
>       else
>           return false;

That doesn't work because readSource is cleared above. But yeah,
failedSources is not needed in archive recovery, so that line can be
removed.
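
To illustrate why the ordering matters, here is a minimal sketch, not
the actual xlog.c code. The bit values of the XLOG_FROM_* flags and both
helper functions are assumptions made up for this example. It shows that
OR-ing readSource into failedSources after readSource has been set to 0
records nothing, whereas doing it before the clear records the failure.

```c
#include <assert.h>

/* Assumed bit values for illustration; the real definitions may differ. */
#define XLOG_FROM_ARCHIVE (1 << 0)
#define XLOG_FROM_PG_XLOG (1 << 1)
#define XLOG_FROM_STREAM  (1 << 2)

/*
 * Hypothetical helper mirroring the order in the quoted patch:
 * readSource is cleared first, so the OR that follows adds no bits.
 */
int
mark_failed_after_clear(int failedSources, int *readSource)
{
    *readSource = 0;
    return failedSources | *readSource;     /* same as failedSources */
}

/*
 * Hypothetical helper with the order swapped: remember the failed
 * source first, then clear readSource.
 */
int
mark_failed_before_clear(int failedSources, int *readSource)
{
    assert(*readSource != 0);               /* a source must be set */
    failedSources |= *readSource;
    *readSource = 0;
    return failedSources;
}
```

With readSource set to XLOG_FROM_ARCHIVE and an empty failedSources, the
first helper returns 0 and the second returns XLOG_FROM_ARCHIVE.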

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas on
Fujii Masao wrote:
> On second thought, the following lines seem to be necessary just after
> calling XLogPageRead(), since it reads a new WAL file from another
> source.
>
>> if (readSource == XLOG_FROM_STREAM || readSource == XLOG_FROM_ARCHIVE)
>>     emode = PANIC;
>> else
>>     emode = emode_arg;

Yep.

Here's an updated patch, with these changes since the last patch:

* Fix a spurious PANIC in archive recovery when the WAL ends in the
middle of a WAL record that continues over a WAL segment boundary.

* If a corrupt WAL record is found in the archive or streamed from the
master in standby mode, throw WARNING instead of PANIC, and keep trying.
In archive recovery (i.e., standby_mode=off) it's still a PANIC. We can
make it a WARNING too, which gives the pre-9.0 behavior of starting up
the server on corruption. I prefer PANIC, but the discussion is still
going on.

* Small code changes to handling of failedSources, inspired by your
comment. No change in functionality.
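
The error-level rule from the second item can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
patch itself: the enum values, the names FROM_ARCHIVE, FROM_PG_XLOG,
FROM_STREAM, LVL_WARNING and LVL_PANIC, and the function choose_emode()
are hypothetical stand-ins for the real source flags and elog levels.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the WAL source flags and error levels. */
enum wal_source { FROM_ARCHIVE = 1 << 0, FROM_PG_XLOG = 1 << 1, FROM_STREAM = 1 << 2 };
enum log_level { LVL_WARNING, LVL_PANIC };

/*
 * Pick the error level for a corrupt record read from readSource.
 * Records from the archive or the stream are reported as WARNING in
 * standby mode (the caller keeps retrying) and as PANIC in archive
 * recovery. Anything else keeps the level the caller passed in.
 */
int
choose_emode(int readSource, bool standbyMode, int emode_arg)
{
    /* Exactly one source is expected to be set. */
    assert(readSource == FROM_ARCHIVE ||
           readSource == FROM_PG_XLOG ||
           readSource == FROM_STREAM);

    if (readSource == FROM_STREAM || readSource == FROM_ARCHIVE)
        return standbyMode ? LVL_WARNING : LVL_PANIC;
    return emode_arg;
}
```

For example, choose_emode(FROM_ARCHIVE, true, LVL_PANIC) yields
LVL_WARNING, while the same call with standbyMode set to false yields
LVL_PANIC.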

This is also available in my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges"

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From: Heikki Linnakangas on
Fujii Masao wrote:
>> * Small code changes to handling of failedSources, inspired by your
>> comment. No change in functionality.
>>
>> This is also available in my git repository at
>> git://git.postgresql.org/git/users/heikki/postgres.git, branch "xlogchanges"
>
> I looked at the patch and have not found any big problems so far.
> The attached small patch fixes the typo.

Thanks. Committed with that typo-fix, and I also added a comment
explaining how failedSources and retrying XLogPageRead() works.

I'm now happy with the standby mode logic. It was a bigger struggle than
I anticipated back in January/February, but I hope others now find it
intuitive as well. I'm going to work on the documentation of this, along
the lines of the draft I posted last week.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
