From: Josh Berkus on

> This issue is 100% reproducible.

Oh, btw, this is on Alpha4.

--Josh Berkus

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: "Joshua D. Drake" on
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:
> Simon, Fujii, All:
>
> While demoing HS/SR at SCALE, I ran into a problem which is likely to be
> a commonly encountered bug when people first set up HS/SR. Here's the
> sequence:
>
> 1) Set up a brand new master with an archive_command and archive_mode=on.
>
> 2) Start the master
>
> 3) Do a pg_start_backup()
>
> 4) Realize, based on log error messages, that I've misconfigured the
> archive_command.
>
> 5) Attempt to shut down the master. Master tells me that pg_stop_backup
> must be run in order to shut down.

If I issue a shutdown, PostgreSQL should do whatever it needs to do to
shut down, including issuing a pg_stop_backup.

Joshua D. Drake




--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use of Mr. or Sir.



From: "Kevin Grittner" on
"Joshua D. Drake" <jd(a)commandprompt.com> wrote:

> If I issue a shutdown, PostgreSQL should do whatever it needs to
> do to shutdown; including issuing a pg_stop_backup.

Should we have a pg_fail_backup function, so that it doesn't put out
a file which suggests that we have a complete backup?

-Kevin


From: Simon Riggs on
On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:

> 1) Set up a brand new master with an archive_command and archive_mode=on.
>
> 2) Start the master
>
> 3) Do a pg_start_backup()
>
> 4) Realize, based on log error messages, that I've misconfigured the
> archive_command.

> 5) Attempt to shut down the master. Master tells me that pg_stop_backup
> must be run in order to shut down.
>
> 6) Execute pg_stop_backup.
>
> 7) pg_stop_backup waits forever without ever stopping backup. Every 60
> seconds, it gives me a helpful "still waiting" message, but at least in
> the amount of time I was willing to wait (5 minutes), it never completed.
>
> 8) Do an immediate shutdown, as it's the only way I can get the database
> unstuck.
>
> With some experimentation, the problem seems to occur when you have a
> failing archive_command and a master which currently has no database
> traffic; for example, if I did some database write activity (a createdb)
> then pg_stop_backup would complete after about 60 seconds (which, btw,
> is extremely annoying, but at least tolerable).
>
> This issue is 100% reproducible.

IMHO there is no problem with that behaviour. If somebody requests a
backup then we should wait for it to complete. Kevin's suggestion of
pg_fail_backup() is the only sensible conclusion there because it gives
an explicit way out of deadlock.

ISTM the problem is that you didn't test. Steps 3 and 4 should have been
reversed. Perhaps we should put something in the docs to say "and test".
The correct resolution is to put in an archive_command that works.
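
The "and test" step suggested above could look something like the
following minimal Python sketch. It assumes a POSIX shell and the
documented archive_command placeholders (%p for the path of the file to
archive, %f for its file name, %% for a literal percent sign). The helper
names expand_archive_command and archive_command_works, and the dummy
segment name, are made up for illustration; they are not part of
PostgreSQL. The real archiver runs the command from the data directory
with a relative %p, whereas this sketch passes an absolute path to a
throwaway file, so it only approximates what the server does.

```python
import os
import subprocess
import tempfile


def expand_archive_command(template, path, filename):
    """Substitute %p, %f and %% in an archive_command-style template."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "p":
                out.append(path)        # full path of the file to archive
            elif nxt == "f":
                out.append(filename)    # file name only
            elif nxt == "%":
                out.append("%")         # literal percent sign
            else:
                out.append("%" + nxt)   # unknown escape: leave untouched
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def archive_command_works(template):
    """Run the command once on a throwaway file; True if it exits with 0."""
    with tempfile.TemporaryDirectory() as scratch:
        name = "000000010000000000000000.test"  # hypothetical dummy name
        path = os.path.join(scratch, name)
        with open(path, "wb") as fh:
            fh.write(b"\0" * 1024)
        command = expand_archive_command(template, path, name)
        # Like the server, hand the expanded string to the shell and
        # treat a zero exit status as success.
        result = subprocess.run(command, shell=True)
        return result.returncode == 0
```

For example, archive_command_works("cp %p /mnt/server/archivedir/%f")
(the directory is a placeholder) would return False if the target
directory is missing or not writable. On success the dummy file is left
in the archive and would need removing by hand.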

We can put in an extra step to prevent a pg_start_backup() if there are
a significant number of outstanding files to be archived. Doing that
seems like closing the door after the horse has bolted, since we just
introduced streaming replication that doesn't rely on archived files. In
any case, I don't see many people working on a production system hitting
a problem on an archive_command and then deciding to shut down.
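
As a rough illustration of the "outstanding files" check mentioned
above: the archiver marks each WAL segment that still awaits archiving
with a <segment>.ready file under pg_xlog/archive_status, so an external
pre-check could simply count those markers. This is a sketch only. The
function names and the max_pending threshold are hypothetical, and
nothing here is an existing or proposed server-side check.

```python
import os


def pending_archive_segments(archive_status_dir):
    """Return sorted names of WAL segments that have a .ready marker."""
    suffix = ".ready"
    return sorted(
        entry[: -len(suffix)]
        for entry in os.listdir(archive_status_dir)
        if entry.endswith(suffix)
    )


def safe_to_start_backup(archive_status_dir, max_pending=2):
    """Hypothetical pre-check: refuse when too many segments are waiting.

    max_pending is an arbitrary illustrative threshold; the thread does
    not say what a "significant number" would be.
    """
    return len(pending_archive_segments(archive_status_dir)) <= max_pending
```

A caller would pass the archive_status directory of the cluster it is
about to back up and skip pg_start_backup() when the function returns
False.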

So I don't see this as something that needs fixing for 9.0. There is
already too much non-essential code there, all of which needs to be
tested. I don't think adding in new corner cases to "help" people makes
any sense until we have automated testing that allows us to rerun the
regression tests to check all this stuff still works.

--
Simon Riggs www.2ndQuadrant.com



From: "Joshua D. Drake" on
On Tue, 2010-02-23 at 18:58 +0000, Simon Riggs wrote:
> On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:

> > This issue is 100% reproducible.
>
> IMHO there is no problem with that behaviour. If somebody requests a
> backup then we should wait for it to complete. Kevin's suggestion of
> pg_fail_backup() is the only sensible conclusion there because it gives
> an explicit way out of deadlock.
>
> ISTM the problem is that you didn't test. Steps 3 and 4 should have been
> reversed. Perhaps we should put something in the docs to say "and test".
> The correct resolution is to put in an archive_command that works.

The problem isn't the bad archive_command; it is that PostgreSQL has no
way to deal with one gracefully. Yes, people should test, but are we
dealing with the real world or not?

>
> So I don't see this as something that needs fixing for 9.0. There is
> already too much non-essential code there, all of which needs to be
> tested. I don't think adding in new corner cases to "help" people makes
> any sense until we have automated testing that allows us to rerun the
> regression tests to check all this stuff still works.

This will bite us if we release like this.

Joshua D. Drake




