From: Robert Haas on
On Wed, Jun 16, 2010 at 4:00 PM, Kevin Grittner
<Kevin.Grittner(a)wicourts.gov> wrote:
> Robert Haas <robertmhaas(a)gmail.com> wrote:
>> So, obviously at this point my slave database is corrupted beyond
>> repair due to nothing more than an unexpected crash on the master.
>
> Certainly that's true for resuming replication. From your
> description it sounds as though the slave would be usable for
> purposes of taking over for an unrecoverable master. Or am I
> misunderstanding?

It depends on what you mean. If you can prevent the slave from ever
reconnecting to the master, then it's still safe to promote it. But
if the master comes up and starts generating WAL again, and the slave
ever sees any of that WAL (either via SR or via the archive) then
you're toast.

In my case, the slave was irrecoverably out of sync with the master as
soon as the crash happened, but it still could have been promoted at
that point if you killed the old master. It became corrupted as soon
as it replayed the first WAL record starting beyond 1/87000000. At
that point it's potentially got arbitrary corruption; you need a new
base backup (but this may not be immediately obvious; it may look OK
even if it isn't).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Robert Haas on
On Wed, Jun 16, 2010 at 4:14 PM, Josh Berkus <josh(a)agliodbs.com> wrote:
>> The first problem I noticed is that the slave never seems to realize
>> that the master has gone away. Every time I crashed the master, I had
>> to kill the wal receiver process on the slave to get it to reconnect;
>> otherwise it just sat there waiting, either forever or at least for
>> longer than I was willing to wait.
>
> Yes, I've noticed this. That was the reason for forcing walreceiver to
> shut down on a restart, per prior discussion and patches. This needs to
> be on the open items list ... possibly it'll be fixed by Simon's
> keepalive patch? Or is it just a tcp_keepalive issue?

I think a TCP keepalive might be enough, but I have not tried to code
or test it.
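
For the archives, here is roughly what I mean, as an untested sketch:
enabling keepalive on the walreceiver's socket. The option names are
Linux-specific, and the wrapper function and timeout values are made up
for illustration, not a proposal for the actual code.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /*
     * Turn on TCP keepalive on an already-connected socket so that a dead
     * peer is eventually noticed even if we never send anything ourselves.
     * The timeout values below are arbitrary illustration values.
     */
    static int
    enable_keepalive(int sockfd)
    {
        int on = 1;
        int idle = 60;      /* seconds idle before the first probe */
        int interval = 10;  /* seconds between probes */
        int count = 5;      /* unanswered probes before giving up */

        if (setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
            return -1;
        if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
            return -1;
        if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) < 0)
            return -1;
        if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
            return -1;
        return 0;
    }

With something like that on the connection, a crashed master would show
up as a connection error within roughly idle + interval * count seconds
instead of the walreceiver waiting forever.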

>> More seriously, I was able to demonstrate that the problem linked in
>> the thread above is real: if the master crashes after streaming WAL
>> that it hasn't yet fsync'd, then on recovery the slave's xlog position
>> is ahead of the master. So far I've only been able to reproduce this
>> with fsync=off, but I believe it's possible anyway,
>
> ... and some users will turn fsync off. This is, in fact, one of the
> primary uses for streaming replication: Durability via replicas.

Yep.

>> and this just
>> makes it more likely. After the most recent crash, the master thought
>> pg_current_xlog_location() was 1/86CD4000; the slave thought
>> pg_last_xlog_receive_location() was 1/8733C000. After reconnecting to
>> the master, the slave then thought that
>> pg_last_xlog_receive_location() was 1/87000000.
>
> So, *in this case*, detecting out-of-sequence xlogs (and PANICing) would
> have actually prevented the slave from being corrupted.
>
> My question, though, is detecting out-of-sequence xlogs *enough*? Are
> there any crash conditions on the master which would cause the master to
> reuse the same locations for different records, for example? I don't
> think so, but I'd like to be certain.

The real problem here is that we're sending records to the slave which
might cease to exist on the master if it unexpectedly reboots. I
believe that what we need to do is make sure that the master only
sends WAL it has already fsync'd (Tom suggested on another thread that
this might be necessary, and I think it's now clear that it is 100%
necessary). But I'm not sure how this will play with fsync=off - if
we never fsync, then we can't ever really send any WAL without risking
this failure mode. Similarly with synchronous_commit=off, I believe
that the next checkpoint will still fsync WAL, but the lag might be
long.
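
To make that concrete, the walsender change I'm imagining boils down to
something like this standalone sketch. I'm treating an LSN as a plain
64-bit number here, and the function and variable names are illustrative
rather than the real walsender ones.

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real LSN type */

    /*
     * Clamp the WAL position we are willing to stream to the position the
     * master has already fsync'd, so that a crash on the master can never
     * make bytes the standby has already received disappear.
     */
    static XLogRecPtr
    clamp_send_request(XLogRecPtr requested_upto, XLogRecPtr flushed_upto)
    {
        if (requested_upto > flushed_upto)
            return flushed_upto;    /* never stream past the last fsync'd byte */
        return requested_upto;
    }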

I think we should also change the slave to panic and shut down
immediately if its xlog position is ahead of the master. That can
never be a watertight solution because you can always advance the xlog
position on the master and mask the problem. But I think we should
do it anyway, so that we at least have a chance of noticing that we're
hosed. I wish I could think of something a little more watertight...
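
The check itself could be as simple as this standalone sketch, again
with LSNs reduced to plain integers and a bare abort() standing in for
a real PANIC:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef uint64_t XLogRecPtr;    /* simplified stand-in for the real LSN type */

    /*
     * At reconnect time, compare the standby's last-received WAL position
     * with the master's current end of WAL.  If the standby is ahead, the
     * servers have diverged and continuing to replay is unsafe.
     */
    static void
    check_standby_not_ahead(XLogRecPtr standby_received_upto,
                            XLogRecPtr master_current_end)
    {
        if (master_current_end < standby_received_upto)
        {
            fprintf(stderr,
                    "PANIC: standby has WAL up to %llX but master is only at %llX\n",
                    (unsigned long long) standby_received_upto,
                    (unsigned long long) master_current_end);
            abort();
        }
    }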

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: "Kevin Grittner" on
Robert Haas <robertmhaas(a)gmail.com> wrote:
> Kevin Grittner <Kevin.Grittner(a)wicourts.gov> wrote:
>> Robert Haas <robertmhaas(a)gmail.com> wrote:
>>> So, obviously at this point my slave database is corrupted
>>> beyond repair due to nothing more than an unexpected crash on
>>> the master.
>>
>> Certainly that's true for resuming replication. From your
>> description it sounds as though the slave would be usable for
>> purposes of taking over for an unrecoverable master. Or am I
>> misunderstanding?
>
> It depends on what you mean. If you can prevent the slave from
> ever reconnecting to the master, then it's still safe to promote
> it.

Yeah, that's what I meant.

> But if the master comes up and starts generating WAL again, and
> the slave ever sees any of that WAL (either via SR or via the
> archive) then you're toast.

Well, if it *applies* what it sees, yes. Effectively you've got
transactions from two alternative timelines applied in the same
database, which is not going to work. At a minimum we need some
way to reliably detect that the incoming WAL stream is starting
before some applied WAL record and isn't a match.
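
As a rough illustration of the sort of check I mean (a standalone
sketch, not something lifted from the recovery code): when an incoming
record covers bytes the standby has already applied, the two copies
could be compared directly, rather than only comparing positions.

    #include <stdbool.h>
    #include <string.h>

    /*
     * Sketch of a content check: an incoming WAL record that covers bytes
     * the standby has already applied must be identical to what was
     * applied; otherwise the stream comes from a diverged master and must
     * not be replayed.
     */
    static bool
    incoming_matches_applied(const void *incoming_rec,
                             const void *applied_rec,
                             size_t rec_len)
    {
        return memcmp(incoming_rec, applied_rec, rec_len) == 0;
    }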

-Kevin


From: Magnus Hagander on
On Wed, Jun 16, 2010 at 22:26, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>> and this just
>>> makes it more likely. After the most recent crash, the master thought
>>> pg_current_xlog_location() was 1/86CD4000; the slave thought
>>> pg_last_xlog_receive_location() was 1/8733C000. After reconnecting to
>>> the master, the slave then thought that
>>> pg_last_xlog_receive_location() was 1/87000000.
>>
>> So, *in this case*, detecting out-of-sequence xlogs (and PANICing) would
>> have actually prevented the slave from being corrupted.
>>
>> My question, though, is detecting out-of-sequence xlogs *enough*? Are
>> there any crash conditions on the master which would cause the master to
>> reuse the same locations for different records, for example? I don't
>> think so, but I'd like to be certain.
>
> The real problem here is that we're sending records to the slave which
> might cease to exist on the master if it unexpectedly reboots. I
> believe that what we need to do is make sure that the master only
> sends WAL it has already fsync'd (Tom suggested on another thread that
> this might be necessary, and I think it's now clear that it is 100%
> necessary). But I'm not sure how this will play with fsync=off - if
> we never fsync, then we can't ever really send any WAL without risking

Well, at this point we can just prevent streaming replication with
fsync=off if we can't think of an easy fix, and then design a "proper
fix" for 9.1, given how late we are in the cycle.


--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Rafael Martinez on

Robert Haas wrote:

>
> The first problem I noticed is that the slave never seems to realize
> that the master has gone away. Every time I crashed the master, I had
> to kill the wal receiver process on the slave to get it to reconnect;
> otherwise it just sat there waiting, either forever or at least for
> longer than I was willing to wait.
>

Hi Robert

I have seen two different behaviors in my tests.

a) If I crash the server, the wal receiver process will wait forever,
and the only way to get it working again is to restart postgres on the
slave after the master is back online. I have not been able to get the
slave database corrupted (I am running with fsync=on).

b) If I kill all postgres processes on the master with kill -9, the wal
receiver will start trying to reconnect automatically, and it will
succeed the moment postgres gets started on the master.

The only difference I can see at the OS level is that in a) the
connection keeps the status ESTABLISHED forever, while in b) it goes to
TIME_WAIT the moment postgres is down on the master.

regards,
--
Rafael Martinez, <r.m.guerrero(a)usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/
