From: Greg Stark on
On Mon, Jun 21, 2010 at 10:40 AM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> I guess, but you have to be very careful to correctly refrain from applying
> the WAL. For example, a naive implementation might write the WAL to disk in
> walreceiver immediately, but refrain from telling the startup process about
> it. If walreceiver is then killed because the connection is broken (and it
> will be because the master just crashed), the startup process will read the
> streamed WAL from the file in pg_xlog, and go ahead to apply it anyway.

So the goal is that when you *do* failover to the standby it replays
these additional records. So whether the startup process obeys this
limit would have to be conditional on whether it's still in standby
mode.
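
A minimal sketch of that conditional limit, purely for illustration. The function name and the use of plain integers for WAL positions are assumptions made here, not anything in the PostgreSQL source:

```python
def replay_limit(received_lsn: int, master_flush_lsn: int, in_standby_mode: bool) -> int:
    """Return the highest WAL position the startup process may replay up to.

    Hypothetical helper: WAL positions are modelled as plain integers.
    """
    if in_standby_mode:
        # Still following the master: never replay past what it has fsync'd.
        return min(received_lsn, master_flush_lsn)
    # Failing over: the limit no longer applies, so replay everything received.
    return received_lsn
```

While in standby mode the limit is the smaller of the two positions; once failover starts, the extra records received from the master become eligible for replay.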

> So maybe there's some room for optimization there, but given the round-trip
> required for the acknowledgment anyway it might not buy you much, and the
> implementation is not very straightforward. This is clearly 9.1 material, if
> worth optimizing at all.

I don't see any need for a round-trip acknowledgement -- no more than
currently. The master just includes the flush location in every
response. It might have to send additional responses, though, when
fsyncs happen, to update the flush location even if no additional
records are sent. Otherwise a hot standby might spend a long time with
out-dated data even though on failover it would be up to date. That
seems non-ideal for hot standby users.

I think this would be a good improvement for databases processing
large batch updates so the standby doesn't have an increased risk of
losing a large amount of data if there's a crash after processing such
a large query. I agree it's 9.1 material.

Earlier we made a change to the WAL streaming protocol on the basis
that we wanted to get the protocol right even if we don't use the
change right away. I'm not sure I understand that -- it's not like
we're going to stream WAL from 9.0 to 9.1. But if that was true then
perhaps we need to add the WAL flush location to the protocol now even
if we're not going to use it yet?

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on
On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:

> The problem is not that the master streams non-fsync'd WAL, but that the
> standby can replay that. So I'm thinking that we can send non-fsync'd WAL
> safely if the standby makes the recovery wait until the master has fsync'd
> WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
> location to walreceiver, and the standby applies only the WAL which the
> master has already fsync'd. Thought?

Yes, good thought. The patch just applied seems too much.

I had the same thought, though it would mean you'd need to send two xlog
end locations, one for write, one for fsync. Though not really clear why
we send the "current end of WAL on the server" anyway, so maybe we can
just alter that.
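
A rough sketch of a message that carries both end locations. The layout here (one type byte followed by two 64-bit positions) is a hypothetical format chosen for illustration and is not the actual streaming protocol:

```python
import struct

# Hypothetical header: 1 type byte, then the write end and the fsync (flush)
# end of WAL on the master, each as an unsigned 64-bit big-endian integer.
HEADER = struct.Struct("!cQQ")


def pack_wal_message(write_end: int, flush_end: int, payload: bytes) -> bytes:
    """Build a WAL data message carrying both end locations."""
    if flush_end > write_end:
        raise ValueError("flush location cannot be ahead of write location")
    return HEADER.pack(b"w", write_end, flush_end) + payload


def unpack_wal_message(message: bytes):
    """Return (write_end, flush_end, payload) from a packed message."""
    _msg_type, write_end, flush_end = HEADER.unpack_from(message)
    return write_end, flush_end, message[HEADER.size:]
```

The receiver could then store the payload immediately while treating only the part up to the flush location as safe to apply.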

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services



From: Robert Haas on
On Wed, Jun 30, 2010 at 5:36 AM, Fujii Masao <masao.fujii(a)gmail.com> wrote:
>> Before we get too busy frobnicating this gonkulator, I'd like to see a
>> little more discussion of what kind of performance people are
>> expecting from sync rep. Sounds to me like the best we can expect
>> here is, on every commit: (a) wait for master fsync to complete, (b)
>> send message to standby, (c) wait for reply from standby
>> indicating that fsync is complete on standby. Even assuming that the
>> network overhead is minimal, that halves the commit rate. Are the
>> people who want sync rep OK with that? Is there any way to do better?
>
> (c) would depend on the synchronization mode the user chooses:
>
> #1 Wait for WAL to be received by the standby
> #2 Wait for WAL to be received and flushed by the standby
> #3 Wait for WAL to be received, flushed and replayed by the standby
>
> (a) would depend on synchronous_commit. Personally I'm interested in
> disabling synchronous_commit on the master and choosing #1 as the sync
> mode. Though this may be a very optimistic configuration :)
>
> The key to sync rep performance is to parallelize (a) and (b)+(c),
> I think. If they are performed serially, the performance
> overhead on the master would become high.

Right. So we need to try to come up with a design that permits that, which
must be robust in the face of any number of crashes on the two
machines, in any order. Until we have that, we're just going around
in circles.

One thought that occurred to me is that if the master and standby were
more tightly coupled, you could recover after a crash by making the
one with the further-advanced WAL position the master, and the other
one the standby. That would get around this problem, though at the
cost of considerable additional complexity. But then if one of the
servers comes up and can't talk to the other, you need some mechanism
for preventing split-brain syndrome.
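
A minimal sketch of that election rule, under stated assumptions: each node is described by a hypothetical (name, WAL position) pair, positions use the textual "hi/lo" hexadecimal form, and a node that cannot reach its peer declines to decide rather than promoting itself:

```python
from typing import Optional, Tuple

Node = Tuple[str, str]  # (node name, WAL position as "hi/lo" hex text)


def parse_lsn(text: str) -> int:
    """Convert a textual WAL position such as '0/3000060' to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def choose_master(local: Node, peer: Optional[Node]) -> Optional[str]:
    """Return the name of the node that should become master.

    Returns None when the peer is unreachable: deciding alone would risk
    split-brain, so some outside mechanism has to break the deadlock.
    """
    if peer is None:
        return None
    # Compare by WAL position; ties are broken by name so that both nodes
    # arrive at the same answer independently.
    return max(local, peer, key=lambda node: (parse_lsn(node[1]), node[0]))[0]
```

Positions are compared as integers rather than as strings, since a plain text comparison would order "0/A0" and "0/1000" incorrectly.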

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Fujii Masao on
On Wed, Jun 30, 2010 at 11:26 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> Maybe. As Heikki pointed out upthread, the standby can't even write
> the WAL back to the OS until it's been fsync'd on the master
> without risking the problem under discussion.

If we change the startup process so that it doesn't go ahead of the
master's fsync location even after the walreceiver is terminated,
we would have no need to worry about that risk. For further robustness,
the walreceiver could, on termination, zero the WAL records which have
not been fsync'd on the master yet.
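
A simplified sketch of the zeroing idea. It works on raw bytes of an in-memory segment and ignores WAL record boundaries; the function name and integer positions are assumptions for illustration only:

```python
def zero_unflushed_tail(segment: bytearray, segment_start_lsn: int,
                        master_flush_lsn: int) -> int:
    """Zero every byte of `segment` past the master's fsync location.

    Hypothetical helper: returns the number of bytes cleared.
    """
    # Number of leading bytes known to be fsync'd on the master,
    # clamped to the bounds of this segment.
    keep = max(0, min(len(segment), master_flush_lsn - segment_start_lsn))
    cleared = len(segment) - keep
    segment[keep:] = bytes(cleared)  # overwrite the tail with zero bytes
    return cleared
```

After the tail is cleared, a later read of the segment finds no valid data past the master's fsync location.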

But if the standby crashes after the master crashes, a restart of the
standby might wrongly replay that non-fsync'd WAL, because the standby
no longer remembers the master's fsync location. In this case, if we
promote the standby to the master, we still don't have to worry about
that risk. But if, instead of performing a failover, we restart the
master and make the standby connect to it again, the database on the
standby would get corrupted.

For now, I don't have a good idea for avoiding the database corruption
caused by a double failure (crash of both master and standby)...

> So we can stream the
> WAL from master to standby as long as the standby just buffers it in
> memory (or somewhere other than the usual location in pg_xlog).

Yeah, I was just thinking the same thing. But the problem is that the
buffer size might become too big (might be bigger than 16MB). For
example, synchronous_commit = off and wal_writer_delay = 10000ms on
the master would delay the fsync significantly and increase the buffer
size on the standby.
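
A back-of-the-envelope estimate of that buffer growth. The WAL generation rate of 5 MB/s is an assumed figure chosen only for illustration; the 10 second delay corresponds to the wal_writer_delay setting mentioned above:

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # size of one WAL segment (16MB)


def standby_buffer_bytes(wal_bytes_per_sec: float, fsync_delay_sec: float) -> int:
    """Rough amount of WAL the standby must hold back while waiting for
    the master's fsync, assuming a steady WAL generation rate."""
    return int(wal_bytes_per_sec * fsync_delay_sec)


assumed_rate = 5 * 1024 * 1024  # assumption: master generates 5 MB/s of WAL
needed = standby_buffer_bytes(assumed_rate, 10.0)
print(needed / SEGMENT_BYTES)  # buffer size expressed in 16MB segments
```

Under these assumptions the standby would have to hold back about 50MB, roughly three segments' worth, before the master's fsync catches up.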

> Before we get too busy frobnicating this gonkulator, I'd like to see a
> little more discussion of what kind of performance people are
> expecting from sync rep. Sounds to me like the best we can expect
> here is, on every commit: (a) wait for master fsync to complete, (b)
> send message to standby, (c) wait for reply from standby
> indicating that fsync is complete on standby. Even assuming that the
> network overhead is minimal, that halves the commit rate. Are the
> people who want sync rep OK with that? Is there any way to do better?

(c) would depend on the synchronization mode the user chooses:

#1 Wait for WAL to be received by the standby
#2 Wait for WAL to be received and flushed by the standby
#3 Wait for WAL to be received, flushed and replayed by the standby

(a) would depend on synchronous_commit. Personally I'm interested in
disabling synchronous_commit on the master and choosing #1 as the sync
mode. Though this may be a very optimistic configuration :)

The key to sync rep performance is to parallelize (a) and (b)+(c),
I think. If they are performed serially, the performance
overhead on the master would become high.
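
A toy latency model of the serial and parallel cases for the three modes above. All timings are made-up whole milliseconds and the function is a hypothetical illustration, not a measurement:

```python
def commit_wait_ms(master_fsync: int, round_trip: int, standby_flush: int,
                   standby_replay: int, mode: int, overlap: bool) -> int:
    """Milliseconds one commit waits under sync mode #1, #2 or #3.

    overlap=False: (a) master fsync runs first, then (b)+(c).
    overlap=True:  (a) runs concurrently with (b)+(c).
    """
    # Extra work the standby performs before replying, per mode.
    standby_work = {
        1: 0,                               # received only
        2: standby_flush,                   # received and flushed
        3: standby_flush + standby_replay,  # received, flushed and replayed
    }[mode]
    remote = round_trip + standby_work
    if overlap:
        return max(master_fsync, remote)
    return master_fsync + remote
```

With an assumed 5 ms fsync on each side, a 1 ms round trip and 2 ms of replay, mode #2 waits 11 ms when serial but only 6 ms when overlapped.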

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Robert Haas on
On Tue, Jun 29, 2010 at 10:06 PM, Bruce Momjian <bruce(a)momjian.us> wrote:
> Simon Riggs wrote:
>> On Mon, 2010-06-21 at 18:08 +0900, Fujii Masao wrote:
>>
>> > The problem is not that the master streams non-fsync'd WAL, but that the
>> > standby can replay that. So I'm thinking that we can send non-fsync'd WAL
>> > safely if the standby makes the recovery wait until the master has fsync'd
>> > WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
>> > location to walreceiver, and the standby applies only the WAL which the
>> > master has already fsync'd. Thought?
>>
>> Yes, good thought. The patch just applied seems too much.
>>
>> I had the same thought, though it would mean you'd need to send two xlog
>> end locations, one for write, one for fsync. Though not really clear why
>> we send the "current end of WAL on the server" anyway, so maybe we can
>> just alter that.
>
> Is this a TODO?

Maybe. As Heikki pointed out upthread, the standby can't even write
the WAL back to the OS until it's been fsync'd on the master
without risking the problem under discussion. So we can stream the
WAL from master to standby as long as the standby just buffers it in
memory (or somewhere other than the usual location in pg_xlog).

Before we get too busy frobnicating this gonkulator, I'd like to see a
little more discussion of what kind of performance people are
expecting from sync rep. Sounds to me like the best we can expect
here is, on every commit: (a) wait for master fsync to complete, (b)
send message to standby, (c) wait for reply from standby
indicating that fsync is complete on standby. Even assuming that the
network overhead is minimal, that halves the commit rate. Are the
people who want sync rep OK with that? Is there any way to do better?
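
The arithmetic behind "halves the commit rate" can be sketched as follows, for a single session committing one transaction after another. The 5 ms fsync time is an assumed figure, and the function name is hypothetical:

```python
def commits_per_sec(master_fsync_ms: float, network_ms: float,
                    standby_fsync_ms: float) -> float:
    """Commit rate when steps (a), (b) and (c) run strictly one after another."""
    return 1000.0 / (master_fsync_ms + network_ms + standby_fsync_ms)


# Assumption: an fsync takes 5 ms on both machines; network overhead ignored.
master_only = commits_per_sec(5.0, 0.0, 0.0)  # no standby to wait for
with_standby = commits_per_sec(5.0, 0.0, 5.0)  # wait for the standby fsync too
print(master_only, with_standby)
```

When both fsyncs cost the same and the network is free, the serial rate is exactly half the master-only rate; any real network latency makes it lower still.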

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
