streaming replication breaks horribly if master crashes [PgSql]

From: Greg Stark on 16 Jun 2010 19:32

On Thu, Jun 17, 2010 at 12:22 AM, Kevin Grittner
<Kevin.Grittner(a)wicourts.gov> wrote:
> "Kevin Grittner" <Kevin.Grittner(a)wicourts.gov> wrote:
>
>> It sounds like it behaves just fine except for not detecting a
>> broken connection.
>
> Of course I meant in terms of the slave's attempts at retrieving
> more WAL, not in terms of it applying a second time line.  TCP
> keepalive timeouts don't help with that part of it, just the failure
> to recognize the broken connection.  I suppose someone could argue
> that's a *feature*, since it gives you two hours to manually
> intervene before it does something stupid, but that hardly seems
> like a solution....

It's certainly a design goal of TCP that you should be able to
disconnect the network and reconnect it, and everything should recover. If
no data was sent it should be able to withstand arbitrarily long
disconnections. TCP Keepalives break that but they should only break
it in the case where the network connection has definitely exceeded
the retry timeouts, not when it merely hasn't responded fast enough
for the application requirements.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Greg Stark on 16 Jun 2010 19:40

On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner
<Kevin.Grittner(a)wicourts.gov> wrote:
> Greg Stark <gsstark(a)mit.edu> wrote:
>
>> TCP keepalives are for detecting broken network connections
>
> Yeah.  That seems like what we have here.  If you shoot the OS in
> the head, the network connection is broken rather abruptly, without
> the normal packets exchanged to close the TCP connection.  It sounds
> like it behaves just fine except for not detecting a broken
> connection.

So I think there are two things happening here. If you shut down the
master and don't replace it then you'll get no network errors until
TCP gives up entirely. Similarly if you pull the network cable or your
switch powers off or your routing table becomes messed up, or anything
else occurs which prevents packets from getting through then you'll
see similar breakage. You wouldn't want your database to suddenly come
up as master in such circumstances, though. You'll have to fix the
problem anyway, so doing so won't solve any problems; it would just
create a second problem.

But there's a second case. The Postgres master just stops responding
-- perhaps it starts seeing disk errors and becomes stuck in disk-wait
or the machine just becomes very heavily loaded and Postgres can't
get any cycles, or someone attaches to it with gdb, or any one of a
number of things happens which causes it to stop sending data. In that
case replication will not see any data from the master but TCP will
never time out because the network is just fine. That's why there
needs to be an application level health check if you want to have
timeouts. You can't depend on the network layer to detect problems
between the applications.
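
A minimal sketch of such an application level health check, written in Python for illustration only. The class and method names are hypothetical and are not part of PostgreSQL or of any patch discussed in this thread. The idea is simply to record when the peer last sent anything and to treat prolonged silence as a failure, regardless of what TCP reports:

```python
import time


class HeartbeatMonitor:
    """Tracks when the peer was last heard from at the application level."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = clock()

    def message_received(self):
        # Call on every message from the peer, including empty heartbeats.
        self._last_seen = self._clock()

    def peer_is_silent(self):
        # True once nothing has arrived for longer than the timeout, even
        # though the TCP connection itself may still look healthy.
        return self._clock() - self._last_seen > self.timeout_s
```

What to do once peer_is_silent() returns true (reconnect, alert, or just keep waiting) is a separate policy decision, and this sketch deliberately leaves it out.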

--
greg

From: Fujii Masao on 17 Jun 2010 01:57

On Thu, Jun 17, 2010 at 5:26 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Wed, Jun 16, 2010 at 4:14 PM, Josh Berkus <josh(a)agliodbs.com> wrote:
>>> The first problem I noticed is that the slave never seems to realize
>>> that the master has gone away. Every time I crashed the master, I had
>>> to kill the wal receiver process on the slave to get it to reconnect;
>>> otherwise it just sat there waiting, either forever or at least for
>>> longer than I was willing to wait.
>>
>> Yes, I've noticed this. That was the reason for forcing walreceiver to
>> shut down on a restart per prior discussion and patches. This needs to
>> be on the open items list ... possibly it'll be fixed by Simon's
>> keepalive patch? Or is it just a tcp_keeplalive issue?
>
> I think a TCP keepalive might be enough, but I have not tried to code
> or test it.

The "keepalive on libpq" patch would help.
https://commitfest.postgresql.org/action/patch_view?id=281

>>> and this just
>>> makes it more likely. After the most recent crash, the master thought
>>> pg_current_xlog_location() was 1/86CD4000; the slave thought
>>> pg_last_xlog_receive_location() was 1/8733C000. After reconnecting to
>>> the master, the slave then thought that
>>> pg_last_xlog_receive_location() was 1/87000000.
>>
>> So, *in this case*, detecting out-of-sequence xlogs (and PANICing) would
>> have actually prevented the slave from being corrupted.
>>
>> My question, though, is detecting out-of-sequence xlogs *enough*? Are
>> there any crash conditions on the master which would cause the master to
>> reuse the same locations for different records, for example? I don't
>> think so, but I'd like to be certain.
>
> The real problem here is that we're sending records to the slave which
> might cease to exist on the master if it unexpectedly reboots. I
> believe that what we need to do is make sure that the master only
> sends WAL it has already fsync'd (Tom suggested on another thread that
> this might be necessary, and I think it's now clear that it is 100%
> necessary).

The attached patch changes walsender so that it always sends WAL up to
LogwrtResult.Flush instead of LogwrtResult.Write.
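
To make the symptom concrete, here is a small Python sketch (not part of the patch) that compares the WAL locations quoted above. The helper names are hypothetical. It assumes a location of the form X/Y is a pair of hexadecimal numbers, and it only subtracts offsets because both values share the same first component:

```python
def parse_lsn(text):
    """Split an 'X/Y' WAL location into a (log id, offset) pair of integers."""
    hi, lo = text.split("/")
    return int(hi, 16), int(lo, 16)


def standby_is_ahead(master_lsn, standby_lsn):
    # Tuples compare element by element, so the log id is compared first.
    return parse_lsn(standby_lsn) > parse_lsn(master_lsn)


master = "1/86CD4000"    # pg_current_xlog_location() on the master after the crash
standby = "1/8733C000"   # pg_last_xlog_receive_location() on the slave

print(standby_is_ahead(master, standby))   # True
# Both values share log id 1, so the offsets can be subtracted directly.
print(hex(parse_lsn(standby)[1] - parse_lsn(master)[1]))   # 0x668000
```

In other words, the slave had received about 6.4 MB of WAL that the restarted master no longer had. Sending only up to the flush position is meant to prevent the slave from getting ahead in this way.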

> But I'm not sure how this will play with fsync=off - if
> we never fsync, then we can't ever really send any WAL without risking
> this failure mode. Similarly with synchronous_commit=off, I believe
> that the next checkpoint will still fsync WAL, but the lag might be
> long.

First of all, we should not restart the master after a crash in the
fsync=off case. That would cause corruption of the master database
itself.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From: Heikki Linnakangas on 17 Jun 2010 02:09

On 17/06/10 02:40, Greg Stark wrote:
> On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner
> <Kevin.Grittner(a)wicourts.gov> wrote:
>> Greg Stark <gsstark(a)mit.edu> wrote:
>>
>>> TCP keepalives are for detecting broken network connections
>>
>> Yeah. That seems like what we have here. If you shoot the OS in
>> the head, the network connection is broken rather abruptly, without
>> the normal packets exchanged to close the TCP connection. It sounds
>> like it behaves just fine except for not detecting a broken
>> connection.
>
> So I think there are two things happening here. If you shut down the
> master and don't replace it then you'll get no network errors until
> TCP gives up entirely. Similarly if you pull the network cable or your
> switch powers off or your routing table becomes messed up, or anything
> else occurs which prevents packets from getting through then you'll
> see similar breakage. You wouldn't want your database to suddenly come
> up as master in such circumstances, though. You'll have to fix the
> problem anyway, so doing so won't solve any problems; it would just
> create a second problem.

We're not talking about a timeout for promoting standby to master. The
problem is that the standby doesn't notice that from the master's point
of view, the connection has been broken. Whether it's because of a
network error or because the master server crashed doesn't matter, the
standby should reconnect in any case. TCP keepalives are a perfect fit,
as long as you can tune the keepalive time short enough, where "short
enough" is up to the admin to decide depending on the application.

Having said that, it would probably make life easier if we implemented
an application level heartbeat anyway. Not all OS's allow tuning keepalives.
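
For illustration, a hedged Python sketch of per-connection keepalive tuning. The function name is hypothetical and this is not how walreceiver or libpq is implemented. TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the Linux option names; the sketch skips any that the platform's socket module does not expose, which is exactly the portability problem mentioned above:

```python
import socket


def enable_keepalive(sock, idle_s, interval_s, probes):
    """Turn on TCP keepalive and apply per-socket timings where supported."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {}
    # These option names exist on Linux; other platforms may lack some or all.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    return applied
```

The returned dictionary shows which timings were actually applied, so a caller can tell when it is falling back to the system-wide defaults.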

> But there's a second case. The Postgres master just stops responding
> -- perhaps it starts seeing disk errors and becomes stuck in disk-wait
> or the machine just becomes very heavily loaded and Postgres can't
> get any cycles, or someone attaches to it with gdb, or any one of a
> number of things happens which causes it to stop sending data. In that
> case replication will not see any data from the master but TCP will
> never time out because the network is just fine. That's why there
> needs to be an application level health check if you want to have
> timeouts. You can't depend on the network layer to detect problems
> between the applications.

If the PostgreSQL master stops responding, it's OK for the slave to sit
and wait for the master to recover. Reconnecting wouldn't help.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Rafael Martinez on 17 Jun 2010 03:02

Heikki Linnakangas wrote:

>
> We're not talking about a timeout for promoting standby to master. The
> problem is that the standby doesn't notice that from the master's point
> of view, the connection has been broken. Whether it's because of a
> network error or because the master server crashed doesn't matter, the
> standby should reconnect in any case. TCP keepalives are a perfect fit,
> as long as you can tune the keepalive time short enough, where "short
> enough" is up to the admin to decide depending on the application.
>
>

I tested this yesterday and could not get any reaction from the wal
receiver, even with values far below the defaults.

The default values in Linux for tcp_keepalive_time, tcp_keepalive_intvl
and tcp_keepalive_probes are 7200, 75 and 9. I reduced these values to
60, 3 and 3 and nothing happened: the connection was still in status
ESTABLISHED after 60+3*3 seconds.
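
The arithmetic behind that expectation, as a small Python sketch. The function name is hypothetical and the formula is the usual approximation: the idle time before the first probe, plus one interval for each unanswered probe. Note that on Linux these kernel-wide settings only apply to sockets that have SO_KEEPALIVE enabled, so a socket without that option would never be probed whatever the values are.

```python
def keepalive_detection_seconds(idle_s, interval_s, probes):
    # Idle time before the first probe, plus one interval per unanswered probe.
    return idle_s + interval_s * probes


print(keepalive_detection_seconds(7200, 75, 9))  # Linux defaults: 7875
print(keepalive_detection_seconds(60, 3, 3))     # reduced values: 69
```

With the defaults a dead peer goes unnoticed for a little over two hours, which matches the "two hours" mentioned earlier in the thread.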

I did not restart the network after I changed these values on the fly
via /proc. I wonder if this is the reason the connection didn't die,
even with the new keepalive values, after the connection was broken. I
will check this later today.

regards,
--
Rafael Martinez, <r.m.guerrero(a)usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/