From: "Kevin Grittner" on
Robert Haas <robertmhaas(a)gmail.com> wrote:

> So, obviously at this point my slave database is corrupted beyond
> repair due to nothing more than an unexpected crash on the master.

Certainly that's true for resuming replication. From your
description it sounds as though the slave would be usable for
purposes of taking over for an unrecoverable master. Or am I
misunderstanding?

> had no trouble getting back in sync with the master - but it would
> have done this after having replayed WAL that, from the master's
> point of view, doesn't exist. In other words, the database on the
> slave would be silently corrupted.
>
> I don't know what to do about this, but I'm pretty sure we can't
> ship it as-is.

I'm sure we can't.

-Kevin

From: Stefan Kaltenbrunner
On 06/16/2010 09:47 PM, Robert Haas wrote:
> On Mon, Jun 14, 2010 at 7:55 AM, Simon Riggs<simon(a)2ndquadrant.com> wrote:
>>> But that change would cause the problem that Robert pointed out.
>>> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php
>>
>> Presumably this means that if synchronous_commit = off on primary that
>> SR in 9.0 will no longer work correctly if the primary crashes?
>
> I spent some time investigating this today and have come to the
> conclusion that streaming replication is really, really broken in the
> face of potential crashes on the master. Using a copy of VMware
> parallels provided by $EMPLOYER, I set up two Fedora 12 virtual
> machines on my MacBook in a master/slave configuration. Then I
> crashed the master repeatedly using 'echo b > /proc/sysrq-trigger',
> which causes an immediate reboot (without syncing the disks, closing
> network connections, etc.) while running pgbench or other stuff
> against it.
>
> The first problem I noticed is that the slave never seems to realize
> that the master has gone away. Every time I crashed the master, I had
> to kill the wal receiver process on the slave to get it to reconnect;
> otherwise it just sat there waiting, either forever or at least for
> longer than I was willing to wait.

Well, this is likely caused by the OS not noticing that the connection
went away (Linux has really long timeouts here) - maybe we should
unconditionally enable keepalive for replication connections on systems
that support it (if that is possible in the current design anyway).
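For illustration only, here is a minimal sketch of what per-socket
keepalive on the replication connection could look like. The function
name, the way the socket descriptor is obtained, and the timeout values
are all assumptions, not the actual walreceiver code:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: enable TCP keepalive on a replication socket so
 * that a crashed peer is noticed after roughly idle + interval * count
 * seconds, rather than after the kernel's default keepalive timer (two
 * hours on Linux).  The values are illustrative, not proposed defaults.
 */
static int
enable_replication_keepalive(int sockfd)
{
    int     on = 1;
    int     idle = 60;      /* seconds of idle before the first probe */
    int     interval = 10;  /* seconds between probes */
    int     count = 5;      /* unanswered probes before giving up */

    if (setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
#ifdef TCP_KEEPIDLE
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle, sizeof(idle)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPINTVL
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval, sizeof(interval)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPCNT
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPCNT,
                   &count, sizeof(count)) < 0)
        return -1;
#endif
    return 0;
}

With those illustrative values a dead peer would be declared after
roughly 60 + 10 * 5 = 110 seconds of silence, instead of the two-plus
hours the Linux defaults allow.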


>
> More seriously, I was able to demonstrate that the problem linked in
> the thread above is real: if the master crashes after streaming WAL
> that it hasn't yet fsync'd, then on recovery the slave's xlog position
> is ahead of the master. So far I've only been able to reproduce this
> with fsync=off, but I believe it's possible anyway, and this just
> makes it more likely. After the most recent crash, the master thought
> pg_current_xlog_location() was 1/86CD4000; the slave thought
> pg_last_xlog_receive_location() was 1/8733C000. After reconnecting to
> the master, the slave then thought that
> pg_last_xlog_receive_location() was 1/87000000. The slave didn't
> think this was a problem yet, though. When I then restarted a pgbench
> run against the master, the slave pretty quickly started spewing an
> endless stream of messages complaining of "LOG: invalid record length
> at 1/8733A828".

This is obviously bad, but with fsync=off (or synchronous_commit=off?)
it is probably impossible to prevent...



Stefan

From: "Kevin Grittner" on
Robert Haas <robertmhaas(a)gmail.com> wrote:

> I don't know what to do about this

This is probably out of the question for 9.0 based on the scale of the
change, and maybe forever based on the impact on WAL volume, but -- if
we logged "before" images along with the "after" images, we could undo
the work of the "over-eager" transactions on the slave upon reconnect.

-Kevin

From: "Kevin Grittner" on
Stefan Kaltenbrunner <stefan(a)kaltenbrunner.cc> wrote:

> well this is likely caused by the OS not noticing that the
> connections went away (linux has really long timeouts here) -
> maybe we should unconditionally enable keepalive on systems that
> support that for replication connections (if that is possible in
> the current design anyway)

Yeah, in similar situations I've had good results with a keepalive
timeout of a minute or two.

-Kevin

From: Josh Berkus

> The first problem I noticed is that the slave never seems to realize
> that the master has gone away. Every time I crashed the master, I had
> to kill the wal receiver process on the slave to get it to reconnect;
> otherwise it just sat there waiting, either forever or at least for
> longer than I was willing to wait.

Yes, I've noticed this. That was the reason for forcing walreceiver to
shut down on a restart, per prior discussion and patches. This needs to
be on the open items list ... possibly it'll be fixed by Simon's
keepalive patch? Or is it just a tcp_keepalive issue?

> More seriously, I was able to demonstrate that the problem linked in
> the thread above is real: if the master crashes after streaming WAL
> that it hasn't yet fsync'd, then on recovery the slave's xlog position
> is ahead of the master. So far I've only been able to reproduce this
> with fsync=off, but I believe it's possible anyway,

... and some users will turn fsync off. This is, in fact, one of the
primary uses for streaming replication: durability via replicas.

> and this just
> makes it more likely. After the most recent crash, the master thought
> pg_current_xlog_location() was 1/86CD4000; the slave thought
> pg_last_xlog_receive_location() was 1/8733C000. After reconnecting to
> the master, the slave then thought that
> pg_last_xlog_receive_location() was 1/87000000.

So, *in this case*, detecting out-of-sequence xlogs (and PANICing) would
have actually prevented the slave from being corrupted.
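For illustration, a rough sketch of the kind of reconnect-time sanity
check being described. The function and variable names are assumptions,
and the real server would use its own WAL pointer types and
ereport(PANIC, ...) rather than this standalone code:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* WAL positions as plain 64-bit byte offsets, for illustration only. */
typedef uint64_t WalPosition;

/*
 * Hypothetical check run by the slave when it reconnects: if the slave
 * has already received (and possibly replayed) WAL beyond the end of
 * WAL the master now reports, the master lost that WAL in its crash,
 * and continuing to stream would silently corrupt the slave.
 */
static void
check_slave_not_ahead(WalPosition master_end_of_wal,
                      WalPosition slave_last_received)
{
    if (slave_last_received > master_end_of_wal)
    {
        fprintf(stderr,
                "PANIC: slave received WAL up to %llX, but the master's "
                "WAL now ends at %llX\n",
                (unsigned long long) slave_last_received,
                (unsigned long long) master_end_of_wal);
        abort();
    }
}
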

My question, though, is detecting out-of-sequence xlogs *enough*? Are
there any crash conditions on the master which would cause the master to
reuse the same locations for different records, for example? I don't
think so, but I'd like to be certain.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
