Thread: [HACKERS] Proposal for 9.1: WAL streaming from WAL buffers
From: Josh Berkus on 15 Jun 2010 20:09

> I have yet to convince myself of how likely this is to occur. I tried
> to reproduce this issue by crashing the database, but I think in 9.0
> you need an actual operating system crash to cause this problem, and I
> haven't yet set up an environment in which I can repeatedly crash the
> OS. I believe, though, that in 9.1, we're going to want to stream
> from WAL buffers as proposed in the patch that started out this
> thread, and then I think this issue can be triggered with just a
> database crash.

Yes, but it still requires:

a) the master must crash with at least one transaction transmitted to
the slave and not yet fsync'd

b) the slave must not crash as well

c) the master must come back up without the slave ever having been
promoted to master

Note that (a) is fairly improbable to begin with, both because we batch
transactions into bundles for transmission and because network latency
dwarfs disk latency.

So, is it possible? Yes. Will it happen anywhere but the
highest-txn-rate sites one in 10,000 times? No. This means that we
should look for a solution which does not penalize the common case in
order to close a very improbable hole, if such a solution exists.

> In 9.0, I think we can fix this problem by (1) only streaming WAL that
> has been fsync'd and

I don't think this is the best solution; it would be a noticeable
performance penalty on replication. It also would potentially cost the
user data: as things stand, if the user fails over to the slave in the
corner case, they can "rescue" the in-flight transaction. At the least,
this would need to become Yet Another Configuration Option.

> (2) PANIC-ing if the problem occurs anyway.

The question is, is detecting out-of-order WAL records *sufficient* to
detect a failure? I'm thinking there are possible sequences where
nothing would arrive out of sequence, but the slave would still have a
transaction the master doesn't, which the user wouldn't know about
until a page update corrupts their data.

> But
> in 9.1, with sync rep and the performance demands that entails, I
> think that we're going to need to rethink it.

All the more reason to avoid dealing with it now, if we can.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
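Josh's worry about sequence checks can be made concrete with a small sketch. This is purely illustrative (the `out_of_order` function and the integer LSNs are invented, not PostgreSQL code): a detector that only looks for non-increasing WAL positions stays silent in exactly the divergence scenario he describes.

```python
# Hypothetical sketch of why an out-of-order check may not be sufficient.
# A naive detector only flags WAL positions (LSNs) that fail to increase:
def out_of_order(lsns):
    return any(b <= a for a, b in zip(lsns, lsns[1:]))

# Divergence scenario: the slave replayed an un-fsync'd record at LSN 30,
# the master crashed before fsync'ing it, restarted from its last durable
# point, and generated *different* WAL from LSN 30 onward. On reconnect,
# the slave requests WAL after 30 and sees a strictly increasing stream,
# so the naive check reports nothing -- even though the slave's record at
# LSN 30 is one the master no longer has.
silently_diverged = [10, 20, 30, 40, 50]   # passes the check, but is wrong
genuinely_broken = [10, 20, 15]            # the only kind of case it catches
```

The point being that an ordering check catches corruption of the stream itself, not a slave that is semantically ahead of a restarted master.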
From: Josh Berkus on 15 Jun 2010 20:32

On 6/15/10 5:09 PM, Josh Berkus wrote:
>>> In 9.0, I think we can fix this problem by (1) only streaming WAL that
>>> has been fsync'd and
>
> I don't think this is the best solution; it would be a noticeable
> performance penalty on replication.

Actually, there's an even bigger reason not to mandate waiting for
fsync: what if the user turns fsync off? One can certainly imagine
users choosing to rely on their replication slaves for crash recovery
instead of fsync.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
From: Robert Haas on 15 Jun 2010 22:01

On Tue, Jun 15, 2010 at 8:09 PM, Josh Berkus <josh(a)agliodbs.com> wrote:
>> I have yet to convince myself of how likely this is to occur. I tried
>> to reproduce this issue by crashing the database, but I think in 9.0
>> you need an actual operating system crash to cause this problem, and I
>> haven't yet set up an environment in which I can repeatedly crash the
>> OS. I believe, though, that in 9.1, we're going to want to stream
>> from WAL buffers as proposed in the patch that started out this
>> thread, and then I think this issue can be triggered with just a
>> database crash.
>
> Yes, but it still requires:
>
> a) the master must crash with at least one transaction transmitted to
> the slave and not yet fsync'd

Bzzzzt. Stop right there. It only requires the master to crash with at
least one *WAL record* transmitted to the slave and not yet fsync'd,
not one *transaction*. And most WAL record types are not fsync'd
immediately. So in theory I think that, for example, an OS crash in
the middle of a big bulk insert operation should be sufficient to
trigger this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
From: Fujii Masao on 21 Jun 2010 05:08

On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Tue, Jun 15, 2010 at 3:57 PM, Josh Berkus <josh(a)agliodbs.com> wrote:
>>> I wonder if it would be possible to jigger things so that we send the
>>> WAL to the standby as soon as it is generated, but somehow arrange
>>> things so that the standby knows the last location that the master has
>>> fsync'd and never applies beyond that point.
>>
>> I can't think of any way which would not require major engineering. And
>> you'd be slowing down replication *in general* to deal with a fairly
>> unlikely corner case.
>>
>> I think the panic is the way to go.
>
> I have yet to convince myself of how likely this is to occur. I tried
> to reproduce this issue by crashing the database, but I think in 9.0
> you need an actual operating system crash to cause this problem, and I
> haven't yet set up an environment in which I can repeatedly crash the
> OS. I believe, though, that in 9.1, we're going to want to stream
> from WAL buffers as proposed in the patch that started out this
> thread, and then I think this issue can be triggered with just a
> database crash.
>
> In 9.0, I think we can fix this problem by (1) only streaming WAL that
> has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But
> in 9.1, with sync rep and the performance demands that entails, I
> think that we're going to need to rethink it.

The problem is not that the master streams non-fsync'd WAL, but that
the standby can replay it. So I'm thinking that we can send non-fsync'd
WAL safely if the standby makes recovery wait until the master has
fsync'd the WAL. That is, walsender sends not only non-fsync'd WAL but
also the WAL flush location to walreceiver, and the standby applies
only the WAL which the master has already fsync'd. Thoughts?
Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
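A minimal sketch of the scheme Fujii describes, in illustrative Python rather than walreceiver code (the `Standby` class, its methods, and the integer LSNs are all invented for the example): the sender ships each record together with the master's current flush location, and the standby buffers anything beyond that durable horizon instead of replaying it.

```python
class Standby:
    """Hypothetical standby that buffers streamed WAL records and
    replays only those the master reports as already fsync'd."""

    def __init__(self):
        self.buffered = []        # (lsn, record) received, not yet replayed
        self.applied = []         # (lsn, record) actually replayed
        self.master_flush_lsn = 0 # last flush location reported by master

    def receive(self, lsn, record, master_flush_lsn):
        # walsender sends the record *and* the master's flush location
        self.buffered.append((lsn, record))
        self.master_flush_lsn = max(self.master_flush_lsn, master_flush_lsn)
        self._apply_ready()

    def _apply_ready(self):
        # replay only WAL at or below the master's durable flush point;
        # records past it stay buffered until the horizon advances
        still_waiting = []
        for lsn, record in self.buffered:
            if lsn <= self.master_flush_lsn:
                self.applied.append((lsn, record))
            else:
                still_waiting.append((lsn, record))
        self.buffered = still_waiting
```

Under this sketch, if the master crashes before fsync'ing a record the standby has received, that record is still sitting in the buffer rather than applied, so the standby never gets ahead of the master's durable state.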
From: Heikki Linnakangas on 21 Jun 2010 05:40
On 21/06/10 12:08, Fujii Masao wrote:
> On Wed, Jun 16, 2010 at 5:06 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> In 9.0, I think we can fix this problem by (1) only streaming WAL that
>> has been fsync'd and (2) PANIC-ing if the problem occurs anyway. But
>> in 9.1, with sync rep and the performance demands that entails, I
>> think that we're going to need to rethink it.
>
> The problem is not that the master streams non-fsync'd WAL, but that the
> standby can replay that. So I'm thinking that we can send non-fsync'd WAL
> safely if the standby makes the recovery wait until the master has fsync'd
> WAL. That is, walsender sends not only non-fsync'd WAL but also WAL flush
> location to walreceiver, and the standby applies only the WAL which the
> master has already fsync'd. Thought?

I guess, but you have to be very careful to correctly refrain from
applying the WAL. For example, a naive implementation might write the
WAL to disk in walreceiver immediately, but refrain from telling the
startup process about it. If walreceiver is then killed because the
connection is broken (and it will be, because the master just crashed),
the startup process will read the streamed WAL from the file in
pg_xlog and go ahead and apply it anyway.

So maybe there's some room for optimization there, but given the
round-trip required for the acknowledgment anyway it might not buy you
much, and the implementation is not very straightforward. This is
clearly 9.1 material, if worth optimizing at all.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
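The failure mode Heikki points out can be sketched the same way (hypothetical code; the function names and the idea of a persisted flush horizon are invented for illustration, not a description of any actual patch): an in-memory gate between walreceiver and the startup process dies with walreceiver, while whatever is on disk in pg_xlog survives to be replayed.

```python
# Naive design: streamed WAL goes straight to disk; the "don't apply
# past the master's flush point" rule lives only in walreceiver's
# memory. When walreceiver dies, crash recovery reads pg_xlog directly
# and replays everything it finds -- including WAL the master never
# fsync'd:
def naive_recovery(wal_on_disk):
    return [rec for _, rec in wal_on_disk]

# A safe variant would have to persist the master's flush horizon
# alongside the WAL, so recovery can stop at it even with walreceiver
# gone:
def gated_recovery(wal_on_disk, persisted_flush_lsn):
    return [rec for lsn, rec in wal_on_disk if lsn <= persisted_flush_lsn]
```

Which is exactly why the gate cannot live only in the walreceiver-to-startup-process signalling path: the constraint has to survive the very crash it is protecting against.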