From: Tom Lane on
Josh Berkus <josh(a)agliodbs.com> writes:
>> pg_stop_backup() doesn't complete until all the WAL segments needed to
>> restore from the backup are archived. If archive_command is failing,
>> that never happens.

> OK, so we need a way out of that cycle if the user is issuing
> pg_stop_backup because they *already know* that archive_command is
> failing. Right now, there's no way out other than a fast shutdown,
> which is a bit user-hostile.

The pg_abort_backup() operation previously proposed seems like the only
workable compromise. Simon is quite right to not want pg_stop_backup()
to behave in a way that could contribute to data loss; but on the other
hand there needs to be some clear way to get the system out of that
state at need.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: "David E. Wheeler" on
On Feb 24, 2010, at 12:47 PM, Tom Lane wrote:

>> OK, so we need a way out of that cycle if the user is issuing
>> pg_stop_backup because they *already know* that archive_command is
>> failing. Right now, there's no way out other than a fast shutdown,
>> which is a bit user-hostile.
>
> The pg_abort_backup() operation previously proposed seems like the only
> workable compromise. Simon is quite right to not want pg_stop_backup()
> to behave in a way that could contribute to data loss; but on the other
> hand there needs to be some clear way to get the system out of that
> state at need.

+1 makes sense.

David



From: Greg Smith on
Josh Berkus wrote:
>> pg_stop_backup() doesn't complete until all the WAL segments needed to
>> restore from the backup are archived. If archive_command is failing,
>> that never happens.
>>
>
> OK, so we need a way out of that cycle if the user is issuing
> pg_stop_backup because they *already know* that archive_command is
> failing. Right now, there's no way out other than a fast shutdown,
> which is a bit user-hostile.
>

gsmith=# select name,context from pg_settings where name like 'archive%';
      name       |  context
-----------------+------------
 archive_command | sighup
 archive_mode    | postmaster
 archive_timeout | sighup

I expect for your particular bad situation, you can replace the
archive_command with a corrected one, use "pg_ctl reload" to send a
SIGHUP to make that fix active, and escape from this. That's the only
right way out of this situation. You can't just abort a backup someone
has asked for just because archives are failing and allow the server to
shut down cleanly in this situation. That's the wrong thing to do for
production setups; the last thing you want for a system with archiving
issues is to be stopped normally if it's interfering with an explicit
admin-requested backup.

Not necessarily any reason that backup even needs to fail, and no reason
for the server to get restarted in this situation at all. If the
archive_command never returned false information, and in fact just
returned a valid error code, all of the segments needed to make the
backup consistent will be queued up waiting for the problem to be
fixed. Put the fixed archive_command in place, and you're off and
running again. If that's impossible, because the archive_command was
really screwed up, we can just tell people to swap to an archive_command
that just returns success, and let the queued-up segments waiting to be
archived all get tossed away. That backup will be bad; they then fix the
archive_command, send SIGHUP, and start over with a new backup.
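
To make the queueing behavior described above concrete, here is a toy
model in Python. It is only a sketch under stated assumptions, not
PostgreSQL code: run_archiver, the three command functions, and the
segment names are all hypothetical, and a Python callable stands in for
the shell command that archive_command would really run.

```python
from collections import deque
from typing import Callable, Deque, List


def run_archiver(pending: Deque[str], archive_command: Callable[[str], bool]) -> List[str]:
    """Toy archiver pass: offer queued segments to archive_command in order.

    A segment leaves the queue only when the command reports success;
    on failure the pass stops and the segment stays queued for a retry.
    """
    handled = []
    while pending:
        if not archive_command(pending[0]):
            break
        handled.append(pending.popleft())
    return handled


# Hypothetical segment names, for illustration only.
pending = deque(["000000010000000000000001", "000000010000000000000002"])
archive: List[str] = []


def failing(segment: str) -> bool:
    return False  # honest failure: reports an error, nothing is copied


def corrected(segment: str) -> bool:
    archive.append(segment)  # stands in for a real copy to the archive
    return True


def always_success(segment: str) -> bool:
    return True  # claims success without copying: the segment is thrown away


run_archiver(pending, failing)    # both segments remain queued
run_archiver(pending, corrected)  # the queue drains into the archive
```

With the failing command nothing is lost, since the segments simply wait
in the queue. Swapping in always_success instead of corrected would
empty the queue too, but the archive would stay empty, which is why a
backup taken that way has to be redone.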

There are some doc patches that could guide how to handle this situation
better for sure, but I don't see any code changes needed. Everything is
working as designed, optimized for production use at the expense of some
confusion on how to recover if you configure things badly.

I suggested a patch a few weeks ago to make "what is the archiver
doing?" behavior easier to monitor, but got the impression people felt
it was redundant given SR was the preferred path moving forward and
eventually this whole archive_command bit would be going away. I could
revive that work if you feel this is such a bad issue that we need a
better way to watch what the archiver is doing.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us



From: Tom Lane on
Heikki Linnakangas <heikki.linnakangas(a)enterprisedb.com> writes:
> Josh Berkus wrote:
>> OK, so we need a way out of that cycle if the user is issuing
>> pg_stop_backup because they *already know* that archive_command is
>> failing. Right now, there's no way out other than a fast shutdown,

> Sure there is. Just kill the session, Ctrl-c or similar.
> pg_stop_backup() isn't actually doing anything at that point anymore;
> it's just waiting for the files to be archived before returning.

One objection to this is that it's not very clear to the user when
pg_stop_backup has finished with actual work and is just waiting for the
archiver, i.e., when is it safe to hit Ctrl-C? Maybe we should emit a
"backup done, waiting for archiver to complete" notice before entering
the sleep loop.
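
As a rough illustration of that suggestion, here is a Python sketch, not
the PostgreSQL source. The function wait_for_archiver and its
parameters (is_archived, notify, poll_interval, sleep) are hypothetical
names chosen for the example; the only point is the ordering, with the
notice emitted once before the wait begins.

```python
import time
from typing import Callable


def wait_for_archiver(
    is_archived: Callable[[], bool],
    notify: Callable[[str], None],
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Tell the user the real work is finished, then poll until archiving is done."""
    # Emitted exactly once, before the first wait, so the user knows that
    # anything after this point is only waiting on the archiver.
    notify("backup done, waiting for archiver to complete")
    while not is_archived():
        sleep(poll_interval)
```

A caller could pass print as notify; the sleep parameter exists only so
the loop can be exercised without real delays.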

regards, tom lane


From: Josh Berkus on
Greg,

> I expect for your particular bad situation, you can replace the
> archive_command with a corrected one, use "pg_ctl reload" to send a
> SIGHUP to make that fix active, and escape from this. That's the only
> right way out of this situation. You can't just abort a backup someone
> has asked for just because archives are failing and allow the server to
> shut down cleanly in this situation. That's the wrong thing to do for
> production setups; the last thing you want for a system with archiving
> issues is to be stopped normally if it's interfering with an explicit
> admin-requested backup.

Yeah, I can see that for large production setups with multiple staff.
We also need something newbie-friendly (and friendly to the large number
of users we have where the DBA/Sysadmin is just the most skilled web
developer) though. The above procedure is far too complex for someone
who is "just trying out" PostgreSQL as a replacement for MySQL, and if
recent conferences are anything to go by, we're about to have several
thousand such users.

BTW, please stop treating this issue as something which happens "only to
Josh". I wouldn't be raising it if it weren't a natural circumstance
which anyone who is trying PostgreSQL with HS/SR for the first time,
with no experience with Warm Standby, would get into. Such new users
are *likely* to get archive_command wrong, and likely to want to start
over when they do. If we make that painful for them, they'll just
switch to MySQL or CouchDB instead.

Thing is, if archive_command is failing, then the backup is useless
regardless until it's fixed. And sending the archives to /dev/null (the
fix you're essentially recommending above) doesn't make the backup any
more useful. So I'm seeing pg_abort_backup(), which also produces
markers that prevent the backup from loading, as an improvement on the
current UI.

--Josh Berkus
