Proposal for 9.1: WAL streaming from WAL buffers [PgSql]

Prev: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers
Next: [HACKERS] pg_upgrade output directory

From: Florian Pflug on 12 Jun 2010 07:50

On Jun 12, 2010, at 3:10 , Josh Berkus wrote:
>> Hm, but then Robert's failure case is real, and streaming replication might break due to an OS-level crash of the master. Or am I missing something?
>
> 1) Master goes out
> 2) "floating" transaction applied to standby.
> 3) Standby goes out
> 4) Power back on
> 5) master comes up
> 6) standby comes up
>
> It seems like, in that sequence, the standby would have one transaction
> which the master doesn't have, yet the standby thinks it can continue
> getting WAL from the master. Or did I miss something which makes this
> impossible?

I did indeed miss something - with wal_sync_method set to either open_datasync or open_sync, all written WAL is also synced. Since open_datasync is the preferred setting according to http://www.postgresql.org/docs/9.0/static/runtime-config-wal.html#GUC-WAL-SYNC-METHOD, systems supporting open_datasync should be safe.

My Ubuntu 10.04 box running postgres 8.4.4 doesn't support open_datasync though, and hence defaults to fdatasync. Probably because of this fragment in xlogdefs.h
#if O_DSYNC != BARE_OPEN_SYNC_FLAG
#define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif

glibc defines O_DSYNC as an alias for O_SYNC and warrants that with
"Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."

If that is true, I believe we should default to open_sync, not fdatasync if open_datasync isn't available, at least on linux.

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Heikki Linnakangas on 13 Jun 2010 01:24

On 12/06/10 01:16, Josh Berkus wrote:
>
>> Well, we're already not waiting for fsync, which is the slowest part.
>> If there's a performance problem, it may be because FADVISE_DONTNEED
>> disables kernel buffering so that we're forced to actually read the data
>> back from disk before sending it on down the wire.
>
> Well, that's fairly direct to solve, no? Just disable FADVISE_DONTNEED
> if walsenders> 0.

We already do that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Greg Smith on 13 Jun 2010 03:36

Florian Pflug wrote:
> glibc defines O_DSYNC as an alias for O_SYNC and warrants that with
> "Most Linux filesystems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns."
>
> If that is true, I believe we should default to open_sync, not fdatasync if open_datasync isn't available, at least on linux.
>

It's not true, because Linux O_SYNC semantics are basically that it's
never worked reliably on ext3. See
http://archives.postgresql.org/pgsql-hackers/2007-10/msg01310.php for
example of how terrible the situation would be if O_SYNC were the
default on Linux.

We just got a report that a better O_DSYNC is now properly exposed
starting on kernel 2.6.33+glibc 2.12:
http://archives.postgresql.org/message-id/201006041539.03868.cousinmarc(a)gmail.com
and it's possible they may have finally fixed it so it work like it's
supposed to. PostgreSQL versions compiled against the right
prerequisites will default to O_DSYNC by themselves. Whether or not
this is a good thing has yet to be determined. The last thing we'd want
to do at this point is make the old and usually broken O_SYNC behavior
suddenly preferred, when the new and possibly fixed O_DSYNC one will be
automatically selected when available without any code changes on the
database side.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on 14 Jun 2010 04:14

On Fri, Jun 11, 2010 at 11:24 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> I think the failover case might be OK. �But if the master crashes and
> restarts, the slave might be left thinking its xlog position is ahead
> of the xlog position on the master.

Right. Unless we perform a failover in this case, the standby might go down
because of inconsistency of WAL after restarting the master. To avoid this
problem, walsender must wait for WAL to be not only written but also *fsynced*
on the master before sending it as 9.0 does. Though this would degrade the
performance, this might be useful for some cases. We should provide the knob
to specify whether to allow the standby to go ahead of the master or not?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on 14 Jun 2010 04:39

On Fri, Jun 11, 2010 at 11:47 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Stefan Kaltenbrunner <stefan(a)kaltenbrunner.cc> writes:
>> hmm not sure that is what fujii tried to say - I think his point was
>> that in the original case we would have serialized all the operations
>> (first write+sync on the master, network afterwards and write+sync on
>> the slave) and now we could try parallelizing by sending the wal before
>> we have synced locally.
>
> Well, we're already not waiting for fsync, which is the slowest part.

No, currently walsender waits for fsync.

Walsender tries to send WAL up to xlogctl->LogwrtResult.Write. OTOH,
xlogctl->LogwrtResult.Write is updated after XLogWrite() performs fsync.
As the result, walsender cannot send WAL not fsynced yet. We should
update xlogctl->LogwrtResult.Write before XLogWrite() performs fsync
for 9.0?

But that change would cause the problem that Robert pointed out.
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00670.php

> If there's a performance problem, it may be because FADVISE_DONTNEED
> disables kernel buffering so that we're forced to actually read the data
> back from disk before sending it on down the wire.

Currently, if max_wal_senders > 0, POSIX_FADV_DONTNEED is not used for
WAL files at all.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10
Prev: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers
Next: [HACKERS] pg_upgrade output directory