From: Josh Berkus on

> I have yet to convince myself of how likely this is to occur. I tried
> to reproduce this issue by crashing the database, but I think in 9.0
> you need an actual operating system crash to cause this problem, and I
> haven't yet set up an environment in which I can repeatedly crash the
> OS. I believe, though, that in 9.1, we're going to want to stream
> from WAL buffers as proposed in the patch that started out this
> thread, and then I think this issue can be triggered with just a
> database crash.

Yes, but it still requires:

a) the master must crash with at least one transaction transmitted to
the slave an not yet fsync'd
b) the slave must not crash as well
c) the master must come back up without the slave ever having been
promoted to master

Note that (a) is fairly improbable to begin with due to both our
batching transactions into bundles for transmission, and network latency
vs. disk latency.

So, is it possible? Yes. Will it happen anywhere but the
highest-txn-rate sites one in 10,000 times? No.

This means that we should look for a solution which does not penalize
the common case in order to close a very improbable hole, if such a
solution exists.

> In 9.0, I think we can fix this problem by (1) only streaming WAL that
> has been fsync'd and

I don't think this is the best solution; it would be a noticeable
performance penalty on replication. It also would potentially result in
data loss for the user; if the user fails over to the slave in the
corner case, they can "rescue" the in-flight transaction. At the least,
this would need to become Yet Another Configuration Option.

>(2) PANIC-ing if the problem occurs anyway.

The question is, is detecting out-of-order WAL records *sufficient* to
detect a failure? I'm thinking there are possible sequences where there
would be no out-of-sequence, but the slave would still have a
transaction the master doesn't, which the user wouldn't know until a
page update corrupts their data.

> But
> in 9.1, with sync rep and the performance demands that entails, I
> think that we're going to need to rethink it.

All the more reason to avoid dealing with it now, if we can.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Josh Berkus on
On 6/15/10 5:09 PM, Josh Berkus wrote:
>> > In 9.0, I think we can fix this problem by (1) only streaming WAL that
>> > has been fsync'd and
>
> I don't think this is the best solution; it would be a noticeable
> performance penalty on replication.

Actually, there's an even bigger reason not to mandate waiting for
fsync: what if the user turns fsync off?

One can certainly imagine users choosing to rely on their replication
slaves for crash recovery instead of fsync.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on
On Tue, Jun 15, 2010 at 8:09 PM, Josh Berkus <josh(a)agliodbs.com> wrote:
>
>> I have yet to convince myself of how likely this is to occur. �I tried
>> to reproduce this issue by crashing the database, but I think in 9.0
>> you need an actual operating system crash to cause this problem, and I
>> haven't yet set up an environment in which I can repeatedly crash the
>> OS. �I believe, though, that in 9.1, we're going to want to stream
>> from WAL buffers as proposed in the patch that started out this
>> thread, and then I think this issue can be triggered with just a
>> database crash.
>
> Yes, but it still requires:
>
> a) the master must crash with at least one transaction transmitted to
> the slave an not yet fsync'd

Bzzzzt. Stop right there. It only requires the master to crash with
at least one *WAL record* written but not transmitted, not one
transaction. And most WAL record types are not fsync'd immediately.
So in theory I think that, for example, an OS crash in the middle of a
big bulk insert operation should be sufficient to trigger this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on
On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus <josh(a)agliodbs.com> wrote:
>>> I wonder if it would be possible to jigger things so that we send the
>>> WAL to the standby as soon as it is generated, but somehow arrange
>>> things so that the standby knows the last location that the master has
>>> fsync'd and never applies beyond that point.
>>
>> I can't think of any way which would not require major engineering. �And
>> you'd be slowing down replication *in general* to deal with a fairly
>> unlikely corner case.
>>
>> I think the panic is the way to go.
>
> I have yet to convince myself of how likely this is to occur. �I tried
> to reproduce this issue by crashing the database, but I think in 9.0
> you need an actual operating system crash to cause this problem, and I
> haven't yet set up an environment in which I can repeatedly crash the
> OS. �I believe, though, that in 9.1, we're going to want to stream
> from WAL buffers as proposed in the patch that started out this
> thread, and then I think this issue can be triggered with just a
> database crash.
>
> In 9.0, I think we can fix this problem by (1) only streaming WAL that
> has been fsync'd and (2) PANIC-ing if the problem occurs anyway. �But
> in 9.1, with sync rep and the performance demands that entails, I
> think that we're going to need to rethink it.

The problem is not that the master streams non-fsync'd WAL, but that the
standby can replay that. So I'm thinking that we can send non-fsync'd WAL
safely if the standby makes the recovery wait until the master has fsync'd
WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
location to walreceiver, and the standby applies only the WAL which the
master has already fsync'd. Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Heikki Linnakangas on
On 21/06/10 12:08, Fujii Masao wrote:
> On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas<robertmhaas(a)gmail.com> wrote:
>> In 9.0, I think we can fix this problem by (1) only streaming WAL that
>> has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But
>> in 9.1, with sync rep and the performance demands that entails, I
>> think that we're going to need to rethink it.
>
> The problem is not that the master streams non-fsync'd WAL, but that the
> standby can replay that. So I'm thinking that we can send non-fsync'd WAL
> safely if the standby makes the recovery wait until the master has fsync'd
> WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
> location to walreceiver, and the standby applies only the WAL which the
> master has already fsync'd. Thought?

I guess, but you have to be very careful to correctly refrain from
applying the WAL. For example, a naive implementation might write the
WAL to disk in walreceiver immediately, but refrain from telling the
startup process about it. If walreceiver is then killed because the
connection is broken (and it will be because the master just crashed),
the startup process will read the streamed WAL from the file in pg_xlog,
and go ahead to apply it anyway.

So maybe there's some room for optimization there, but given the
round-trip required for the acknowledgment anyway it might not buy you
much, and the implementation is not very straightforward. This is
clearly 9.1 material, if worth optimizing at all.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers