Proposal for 9.1: WAL streaming from WAL buffers [PgSql]

From: Fujii Masao on 15 Jun 2010 04:45

On Tue, Jun 15, 2010 at 2:16 PM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> On 15/06/10 07:47, Fujii Masao wrote:
>>
>> On Tue, Jun 15, 2010 at 12:02 AM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>>>
>>> Fujii Masao <masao.fujii(a)gmail.com> writes:
>>>>
>>>> Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
>>>> xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.
>>>
>>> Wrong.  LogwrtResult.Write tracks how far we've written out data,
>>> but it is only (known to be) fsync'd as far as LogwrtResult.Flush.
>>
>> Hmm.. I agree that xlogctl->LogwrtResult.Write indicates the byte position
>> we've written. But in the current XLogWrite() code, it's updated after
>> XLogWrite() calls issue_xlog_fsync(). No?
>
> issue_xlog_fsync() is only called if the caller requested a flush by
> advancing WriteRqst.Flush.

True. The scenario that I'm concerned about is:

1. A transaction commit causes XLogFlush() to write *and* fsync WAL up to
the commit record.
2. XLogFlush() calls XLogWrite(), and xlogctl->LogwrtResult.Write is
updated to an LSN greater than or equal to that of the commit record
only after XLogWrite() calls issue_xlog_fsync().
3. Then walsender can send WAL up to the commit record.

A transaction commit would need to wait for local fsync and replication
in a serial manner, in synchronous replication. IOW, walsender cannot
send the commit record until it's fsync'd in XLogWrite().

Can this scenario not happen? Am I missing something?
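
The ordering in the scenario above can be illustrated with a small model. This is only a sketch under assumptions: `Lsn` is a plain 64-bit integer rather than PostgreSQL's XLogRecPtr, and `model_xlog_write`, `model_xlog_flush`, `model_fsync` and `walsender_send_limit` are hypothetical names, not functions from the PostgreSQL source. It shows that if the shared write position is published only after the fsync returns, a sender bounded by that position cannot see the commit record while the fsync is still in progress.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* assumption: simplified stand-in for XLogRecPtr */

typedef struct {
    Lsn write;                   /* models xlogctl->LogwrtResult.Write */
    Lsn flush;                   /* models xlogctl->LogwrtResult.Flush */
} LogResult;

static int fsync_calls = 0;
static Lsn visible_write_during_fsync = 0;

/* Hypothetical fsync hook: records what other processes could see in
 * shared state at the moment the fsync is running. */
static void model_fsync(const LogResult *shared)
{
    fsync_calls++;
    visible_write_during_fsync = shared->write;
}

/* Models the ordering described in the scenario: work on a local copy,
 * fsync only if a flush was requested, and publish to shared state last. */
static void model_xlog_write(LogResult *shared, Lsn write_rqst, Lsn flush_rqst)
{
    LogResult local = *shared;

    if (write_rqst > local.write)
        local.write = write_rqst;

    if (flush_rqst > local.flush) {
        assert(flush_rqst <= local.write);   /* cannot flush unwritten data */
        model_fsync(shared);
        local.flush = local.write;
    }

    *shared = local;             /* shared Write advances only here */
}

/* A commit asks for both write and flush up to its commit record. */
static void model_xlog_flush(LogResult *shared, Lsn commit_lsn)
{
    model_xlog_write(shared, commit_lsn, commit_lsn);
}

/* A sender that streams up to the shared write position. */
static Lsn walsender_send_limit(const LogResult *shared)
{
    return shared->write;
}
```

In this model, a write-only request advances `write` without any fsync, which matches the point that issue_xlog_fsync() is only called when a flush is requested. For a commit, however, the sender's limit stays at the old position for the whole duration of the fsync.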

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 15 Jun 2010 06:53

On Tue, Jun 15, 2010 at 12:46 AM, Fujii Masao <masao.fujii(a)gmail.com> wrote:
> On Mon, Jun 14, 2010 at 10:13 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> On Mon, Jun 14, 2010 at 8:41 AM, Fujii Masao <masao.fujii(a)gmail.com> wrote:
>>> On Mon, Jun 14, 2010 at 8:10 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>>> Maybe.  That sounds like a pretty enormous foot-gun to me, considering
>>>> that we have no way of recovering from the situation where the standby
>>>> gets ahead of the master.
>>>
>>> No, we can do that by reconstructing the standby from the backup.
>>>
>>> And, that situation is not a problem for users including me who prefer to
>>> perform a failover when the master goes down.
>>
>> You don't get to pick - if a backend crashes on the master, it will
>> restart right away and come up, but the slave will now be hosed...
>
> You are concerned about the case where postmaster automatically restarts
> the crash recovery, in particular? Yes, this case is more problematic.
> If the standby is ahead of the master, the standby might find an invalid
> record and run into the infinite retry loop, or keep working without
> noticing the inconsistency between the database and the WAL.
>
> I'm thinking that walreceiver should throw a PANIC when it receives a
> record whose LSN is older than the last WAL receive location, except
> at the beginning of streaming (because the standby always requests
> streaming from the start of a WAL file at first, even if some records
> have already been received previously). Thoughts?

Yeah, that seems like it would be a good safety check.
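
A minimal sketch of such a check, under assumptions: `Lsn` is a plain 64-bit integer instead of the real XLogRecPtr, `check_incoming_wal` and `RecvCheck` are hypothetical names, and the function returns a verdict rather than raising a PANIC itself.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* assumption: simplified stand-in for XLogRecPtr */

typedef enum {
    RECV_OK,                     /* data may be written to the standby's WAL */
    RECV_PANIC                   /* master has gone backwards; standby is ahead */
} RecvCheck;

/*
 * last_recv_end:  end of the WAL already received by the standby
 * incoming_start: start position of the newly received WAL data
 * first_message:  true for the first data after (re)starting streaming,
 *                 which legitimately begins at the start of a WAL file
 */
static RecvCheck check_incoming_wal(Lsn last_recv_end, Lsn incoming_start,
                                    bool first_message)
{
    if (first_message)
        return RECV_OK;          /* re-sending old WAL is expected here */

    if (incoming_start < last_recv_end)
        return RECV_PANIC;       /* older than what we already have */

    return RECV_OK;
}
```

The exemption for the first message is what keeps a normal restart of streaming from being mistaken for a master that has fallen behind the standby.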

I wonder if it would be possible to jigger things so that we send the
WAL to the standby as soon as it is generated, but somehow arrange
things so that the standby knows the last location that the master has
fsync'd and never applies beyond that point.
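
As a rough illustration of that idea, and not an existing mechanism: suppose the master reported its flush position alongside the streamed WAL. The names `standby_apply_limit` and `can_apply_record` are hypothetical, and `Lsn` is a simplified 64-bit stand-in for XLogRecPtr.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* assumption: simplified stand-in for XLogRecPtr */

/* The standby may hold WAL beyond what the master has fsync'd, but it
 * applies only up to the smaller of the two positions. */
static Lsn standby_apply_limit(Lsn received_upto, Lsn master_flushed_upto)
{
    return received_upto < master_flushed_upto ? received_upto
                                               : master_flushed_upto;
}

/* A record is applied only if it ends at or before the apply limit. */
static bool can_apply_record(Lsn record_end, Lsn received_upto,
                             Lsn master_flushed_upto)
{
    return record_end <= standby_apply_limit(received_upto,
                                             master_flushed_upto);
}
```

With this rule, WAL that was received but never fsync'd on the master stays unapplied on the standby after a master crash.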

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From: Florian Pflug on 15 Jun 2010 08:05

On Jun 15, 2010, at 10:45 , Fujii Masao wrote:
> A transaction commit would need to wait for local fsync and replication
> in a serial manner, in synchronous replication. IOW, walsender cannot
> send the commit record until it's fsync'd in XLogWrite().

Hm, but since 9.0 won't do synchronous replication anyway, the right thing to do for 9.0 is still to send only fsync'ed WAL, no? Without synchronous replication the overhead seems negligible.

For synchronous replication (and hence for 9.1) I think there are two basic options:

a) Stream only fsync'ed WAL, like in the asynchronous case. Depending on policy, additionally wait for one or more slaves to fsync before reporting success.

b) Stream non-fsync'ed WAL. On COMMIT, wait for at least one node (not necessarily the master, exact count depends on policy) to fsync before reporting success. During recovery of the master, recover up to the latest LSN found on any one of the nodes.

Option (b) requires some additional thought, though. Controlled removal of slave nodes and concurrent crashes of more than one node are the most difficult areas to handle gracefully, it seems.
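
The two rules in option (b) can be sketched as follows. This is an illustration under assumptions only: each node is represented by the LSN up to which it has fsync'ed, `Lsn` is a plain 64-bit integer, and `nodes_flushed_past`, `commit_can_report_success` and `recovery_target` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* assumption: simplified stand-in for XLogRecPtr */

/* Count the nodes (master or slaves) that have fsync'ed the commit record. */
static size_t nodes_flushed_past(const Lsn *flush_lsn, size_t n_nodes,
                                 Lsn commit_lsn)
{
    size_t count = 0;
    for (size_t i = 0; i < n_nodes; i++)
        if (flush_lsn[i] >= commit_lsn)
            count++;
    return count;
}

/* COMMIT may report success once `required` nodes have fsync'ed it;
 * `required` is the policy-dependent count (at least one). */
static bool commit_can_report_success(const Lsn *flush_lsn, size_t n_nodes,
                                      Lsn commit_lsn, size_t required)
{
    return nodes_flushed_past(flush_lsn, n_nodes, commit_lsn) >= required;
}

/* During recovery of the master, recover up to the latest LSN that any
 * one node has on disk. */
static Lsn recovery_target(const Lsn *flush_lsn, size_t n_nodes)
{
    assert(n_nodes > 0);
    Lsn target = flush_lsn[0];
    for (size_t i = 1; i < n_nodes; i++)
        if (flush_lsn[i] > target)
            target = flush_lsn[i];
    return target;
}
```

The sketch ignores node removal and concurrent crashes, which are exactly the cases where deciding which nodes' flush positions still count becomes hard.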

best regards,
Florian Pflug

From: Josh Berkus on 15 Jun 2010 15:57

> I wonder if it would be possible to jigger things so that we send the
> WAL to the standby as soon as it is generated, but somehow arrange
> things so that the standby knows the last location that the master has
> fsync'd and never applies beyond that point.

I can't think of any way which would not require major engineering. And
you'd be slowing down replication *in general* to deal with a fairly
unlikely corner case.

I think the panic is the way to go.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

From: Robert Haas on 15 Jun 2010 16:06

On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus <josh(a)agliodbs.com> wrote:
>> I wonder if it would be possible to jigger things so that we send the
>> WAL to the standby as soon as it is generated, but somehow arrange
>> things so that the standby knows the last location that the master has
>> fsync'd and never applies beyond that point.
>
> I can't think of any way which would not require major engineering.  And
> you'd be slowing down replication *in general* to deal with a fairly
> unlikely corner case.
>
> I think the panic is the way to go.

I have yet to convince myself of how likely this is to occur. I tried
to reproduce this issue by crashing the database, but I think in 9.0
you need an actual operating system crash to cause this problem, and I
haven't yet set up an environment in which I can repeatedly crash the
OS. I believe, though, that in 9.1, we're going to want to stream
from WAL buffers as proposed in the patch that started out this
thread, and then I think this issue can be triggered with just a
database crash.

In 9.0, I think we can fix this problem by (1) only streaming WAL that
has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But
in 9.1, with sync rep and the performance demands that entails, I
think that we're going to need to rethink it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
