From: Simon Riggs on 24 Mar 2010 19:23

On Wed, 2010-03-24 at 14:31 +0200, Heikki Linnakangas wrote:
> Fujii Masao wrote:
> > But in the current (v8.4 or before) behavior, recovery ends normally
> > when an invalid record is found in an archived WAL file. Otherwise,
> > the server would never be able to start normal processing when there
> > is a corrupted archived file for some reasons. So, that invalid record
> > should not be treated as a PANIC if the server is not in standby mode
> > or the trigger file has been created. Thought?
>
> Hmm, true, this changes behavior over previous releases. I tend to think
> that it's always an error if there's a corrupt file in the archive,
> though, and PANIC is appropriate. If the administrator wants to start up
> the database anyway, he can remove the corrupt file from the archive and
> place it directly in pg_xlog instead.

I don't agree with changing the behaviour from previous releases.

PANICing won't change the situation, so it just destroys server
availability. If we had 1 master and 42 slaves then this behaviour would
take down almost the whole server farm at once. Very uncool.

You might have reason to prevent the server starting up at that point,
when in standby mode, but that is not a reason to PANIC. We don't really
want all of the standbys thinking they can be the master all at once
either. Better to throw a serious ERROR and have the server still up and
available for reads.

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on 24 Mar 2010 22:08

On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> PANICing won't change the situation, so it just destroys server
> availability. If we had 1 master and 42 slaves then this behaviour would
> take down almost the whole server farm at once. Very uncool.
>
> You might have reason to prevent the server starting up at that point,
> when in standby mode, but that is not a reason to PANIC. We don't really
> want all of the standbys thinking they can be the master all at once
> either. Better to throw a serious ERROR and have the server still up and
> available for reads.

OK. How about making the startup process emit WARNING, stop WAL replay and
wait for the presence of trigger file, when an invalid record is found?
Which keeps the server up for readonly queries. And if the trigger file is
found, I think that the startup process should emit a FATAL, i.e., the
server should exit immediately, to prevent the server from becoming the
primary in a half-finished state. Also to allow such a halfway failover,
we should provide fast failover mode as pg_standby does?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From: Tom Lane on 24 Mar 2010 22:14

Fujii Masao <masao.fujii(a)gmail.com> writes:
> OK. How about making the startup process emit WARNING, stop WAL replay and
> wait for the presence of trigger file, when an invalid record is found?
> Which keeps the server up for readonly queries. And if the trigger file is
> found, I think that the startup process should emit a FATAL, i.e., the
> server should exit immediately, to prevent the server from becoming the
> primary in a half-finished state. Also to allow such a halfway failover,
> we should provide fast failover mode as pg_standby does?

I find it extremely scary to read this sort of blue-sky design
discussion going on now, two months after we were supposedly
feature-frozen for 9.0. We need to be looking for the *rock bottom
minimum* amount of work to do to get 9.0 out the door in a usable
state; not what would be nice to have later on.

regards, tom lane

From: Simon Riggs on 25 Mar 2010 04:08

On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
> On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> > PANICing won't change the situation, so it just destroys server
> > availability. If we had 1 master and 42 slaves then this behaviour would
> > take down almost the whole server farm at once. Very uncool.
> >
> > You might have reason to prevent the server starting up at that point,
> > when in standby mode, but that is not a reason to PANIC. We don't really
> > want all of the standbys thinking they can be the master all at once
> > either. Better to throw a serious ERROR and have the server still up and
> > available for reads.
>
> OK. How about making the startup process emit WARNING, stop WAL replay and
> wait for the presence of trigger file, when an invalid record is found?
> Which keeps the server up for readonly queries. And if the trigger file is
> found, I think that the startup process should emit a FATAL, i.e., the
> server should exit immediately, to prevent the server from becoming the
> primary in a half-finished state. Also to allow such a halfway failover,
> we should provide fast failover mode as pg_standby does?

The lack of docs begins to show a lack of coherent high-level design
here. By now, I've forgotten what this thread was even about. The major
design decision in this that keeps showing up is "remove pg_standby, at
all costs" but no reason has ever been given for that. I do believe
there is a "better way", but we won't find it by trial and error, even
if we had time to do so.

Please work on some clear docs for the failure modes in this system.
That way we can all read them and understand them, or point out further
issues. Moving straight to code is not a solution to this, since what we
need now is to all agree on the way forwards. If we ignore this, then
there is considerable risk that streaming rep will have a fatal
operational flaw.

Please just document/diagram how it works now, highlighting the problems
that still remain to be solved. We're all behind you and I'm helping
wherever I can.

--
Simon Riggs www.2ndQuadrant.com

From: Simon Riggs on 25 Mar 2010 04:22

On Thu, 2010-03-25 at 11:08 +0900, Fujii Masao wrote:
> On Thu, Mar 25, 2010 at 8:23 AM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> > PANICing won't change the situation, so it just destroys server
> > availability. If we had 1 master and 42 slaves then this behaviour would
> > take down almost the whole server farm at once. Very uncool.
> >
> > You might have reason to prevent the server starting up at that point,
> > when in standby mode, but that is not a reason to PANIC. We don't really
> > want all of the standbys thinking they can be the master all at once
> > either. Better to throw a serious ERROR and have the server still up and
> > available for reads.
>
> OK. How about making the startup process emit WARNING, stop WAL replay and
> wait for the presence of trigger file, when an invalid record is found?
> Which keeps the server up for readonly queries.

Yes. Receiving new WAL records is a completely separate activity from
running the rest of the server (in this release...).

> And if the trigger file is
> found, I think that the startup process should emit a FATAL, i.e., the
> server should exit immediately, to prevent the server from becoming the
> primary in a half-finished state.

Please remember that "half-finished" is your judgment on what has
happened in the particular scenario you are considering. In many cases,
an invalid WAL record clearly and simply indicates the end of WAL and we
should start up normally.

"State" is a good word here. I'd like to see the server have a clear
state model with well documented transitions between them. The state
should also be externally queryable, so we can work out what it's doing
and how long we can expect it to keep doing it for. I don't want to be
in a position where we are waiting for the server to sort itself out
from a complex set of retries.

> Also to allow such a halfway failover,
> we should provide fast failover mode as pg_standby does?

Yes, we definitely need a JFDI solution for immediate failover.

--
Simon Riggs www.2ndQuadrant.com
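
Editorial sketch: the "clear state model with well documented transitions" asked for in the last message could look roughly like the table-driven machine below. This is only an illustration of the proposals made in this thread (WARNING and pause on an invalid record, FATAL if the trigger file then appears, a separate fast-failover request). It is not PostgreSQL code, and every name in it (StandbyState, Event, TRANSITIONS, StandbyStateMachine) is hypothetical. The row for a trigger file during normal replay is an assumption added to make the table complete.

```python
from enum import Enum


class StandbyState(Enum):
    REPLAYING = "replaying"          # applying WAL, read-only queries allowed
    WAITING = "waiting_for_trigger"  # replay stopped on an invalid record
    PROMOTED = "promoted"            # recovery ended, running as primary
    STOPPED = "stopped"              # server exited


class Event(Enum):
    INVALID_RECORD = "invalid_record"  # invalid WAL record found during replay
    TRIGGER_FILE = "trigger_file"      # trigger file detected
    FAST_FAILOVER = "fast_failover"    # hypothetical explicit "promote now" request


# (current state, event) -> (next state, log level). Hypothetical table
# encoding the behaviour proposed in the thread, not actual server behaviour.
TRANSITIONS = {
    (StandbyState.REPLAYING, Event.INVALID_RECORD): (StandbyState.WAITING, "WARNING"),
    (StandbyState.REPLAYING, Event.TRIGGER_FILE): (StandbyState.PROMOTED, "LOG"),
    (StandbyState.WAITING, Event.TRIGGER_FILE): (StandbyState.STOPPED, "FATAL"),
    (StandbyState.WAITING, Event.FAST_FAILOVER): (StandbyState.PROMOTED, "LOG"),
}


class StandbyStateMachine:
    """Tracks the standby's state so it can be queried from outside."""

    def __init__(self):
        self.state = StandbyState.REPLAYING
        self.history = []  # (log level, old state, new state) tuples

    def handle(self, event):
        """Apply one event and return the log level the transition emits."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition for {event} in state {self.state}")
        next_state, level = TRANSITIONS[key]
        self.history.append((level, self.state, next_state))
        self.state = next_state
        return level

    def accepts_read_queries(self):
        """Every state except STOPPED keeps the server up for reads."""
        return self.state is not StandbyState.STOPPED
```

Because the whole policy lives in one table, changing it (for example, treating an invalid record as a normal end of WAL) means editing a single row, and the current state can be reported to an administrator at any time.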