From: Greg Smith on
Josh Berkus wrote:
> Thing is, if archive_command is failing, then the backup is useless
> regardless until it's fixed. And sending the archives to /dev/null (the
> fix you're essentially recommending above) doesn't make the backup any
> more useful.

That's not what I said to do first. If it's possible to fix your
archive_command, and it never returned bad "I'm saying success but I
didn't really do the right thing" information to the server--it just
failed--this situation is completely recoverable with no damage to the
backup. Just fix the archive_command, reload the configuration, and the
queue of archived files will flow and eventually your consistent backup
completes. This is the only behavior someone who is trying to recover
from a mistake in production is likely to find acceptable, and as Simon
has pointed out that is what the current situation is optimized for.

Only in the situation where the archive_command was so bad that it
returned the wrong data to the server--saying the segment was saved but
it really wasn't--did I suggest that you might as well change
archive_command to go nowhere, because in that case your backup is
already screwed: you lost an essential piece of it.
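
The distinction above, between a command that fails outright and one that
reports success without saving the segment, can be sketched in C. This is
only an illustration under stated assumptions, not code from PostgreSQL:
archive_segment() is a hypothetical helper name, and the only server
behavior relied on is that a zero exit status from archive_command means
"archived" while a nonzero status leaves the segment queued for retry.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical archive helper (not part of PostgreSQL): copy one WAL
 * segment from src to dst.  Returns 0 only when every byte was written
 * and the output stream closed cleanly; every other outcome returns
 * nonzero, so the caller never reports a false success.
 */
int
archive_segment(const char *src, const char *dst)
{
    FILE   *in;
    FILE   *out;
    char    buf[8192];
    size_t  n;
    int     failed = 0;

    assert(src != NULL && dst != NULL);

    /* Refuse to overwrite a file that is already in the archive. */
    out = fopen(dst, "rb");
    if (out != NULL)
    {
        fclose(out);
        return 1;
    }

    in = fopen(src, "rb");
    if (in == NULL)
        return 1;               /* source missing: fail honestly */

    out = fopen(dst, "wb");
    if (out == NULL)
    {
        fclose(in);
        return 1;
    }

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
    {
        if (fwrite(buf, 1, n, out) != n)
        {
            failed = 1;         /* short write, e.g. disk full */
            break;
        }
    }
    if (ferror(in))
        failed = 1;

    fclose(in);
    /* fclose flushes stdio buffers to the OS (it does not fsync). */
    if (fclose(out) != 0)
        failed = 1;

    if (failed)
    {
        remove(dst);            /* do not leave a partial segment behind */
        return 1;
    }
    return 0;
}
```

A small main() wrapper would hand this return value back as the process
exit status, for example from
archive_command = '/path/to/helper %p /mnt/archive/%f' (both paths are
placeholders).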

As for your comment about treating this like it's a problem specific to
you, did you miss the part where I pointed out I was just expressing
concerns about poor visibility into this area ("what is the archiver
doing?") recently? I'm well aware this path is full of holes that are
difficult to escape from. We just need to be careful not to do something
that screws over production users in the name of reducing the learning
curve.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Josh Berkus on

> That's not what I said to do first. If it's possible to fix your
> archive_command, and it never returned bad "I'm saying success but I
> didn't really do the right thing" information to the server--it just
> failed--this situation is completely recoverable with no damage to the
> backup. Just fix the archive_command, reload the configuration, and the
> queue of archived files will flow and eventually your consistent backup
> completes. This is the only behavior someone who is trying to recover
> from a mistake in production is likely to find acceptable, and as Simon
> has pointed out that is what the current situation is optimized for.

Right. I'm pointing out that production and "trying out 9.0 for the
first time" are actually different circumstances, and we need to be able
to handle both gracefully: if people have a bad experience trying it out
for the first time, we'll never *get* to production.

> As for your comment about treating this like it's a problem specific to
> you, did you miss the part where I pointed out I was just expressing
> concerns about poor visibility into this area ("what is the archiver
> doing?") recently? I'm well aware this path is full of holes that are
> difficult to escape from. We just need to be careful not to do something
> that screws over production users in the name of reducing the learning
> curve.

I think Tom's idea is minimally intrusive, and deals with the central
problem, which is one of UI and visibility as you assessed.

--Josh Berkus



From: Tom Lane on
Josh Berkus <josh(a)agliodbs.com> writes:
> Tom, Simon,
>> * emit a NOTICE as soon as pg_stop_backup's actual work is done and
>> it's starting to wait for the archiver (or maybe after it's waited
>> for a few seconds, but much less than the present 60).
>>
>> * extend the existing WARNING (and the NOTICE too if we elect to have
>> one) with a HINT message explicitly saying that you can cancel the
>> wait but thus-and-such consequences might ensue.
>>
>> Both of these things would only be helpful when using client software
>> that shows you received notices promptly. psql is okay, but maybe
>> pgAdmin and other tools would need some further work. There is not
>> much we can do about that in the core project though.

> Well, the client software could be fixed in time for 9.0, I'd think. I
> think that implementing both of the above would probably do the trick
> for user-friendliness, enough for 9.0. If it's obvious to the user on
> the console what to do, then they won't panic.

If you like the concept, then the next question is exactly how to phrase
the messages. All we have at the moment is the inside-the-delay-loop
warning:

    ereport(WARNING,
            (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
                    waits)));

which now that I look at it could use some wordsmithing itself.
Suggestions?

regards, tom lane


From: Josh Berkus on

> If you like the concept, then the next question is exactly how to phrase
> the messages. All we have at the moment is the inside-the-delay-loop
> warning:
>
>     ereport(WARNING,
>             (errmsg("pg_stop_backup still waiting for archive to complete (%d seconds elapsed)",
>                     waits)));

Well, we'll want this message first, as soon as pg_stop_backup finishes
checkpointing:

WARNING: Stop backup work complete. Now awaiting completion of WAL
archiving.

Then after 60s:

WARNING: pg_stop_backup is still waiting for WAL archiving to complete
(%d seconds elapsed).
HINT: Check if your WAL archive_command is failing. You may abort
pg_stop_backup at this point, but you will not be able to use the
resulting clone.
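
The message sequence proposed above could be sketched as a small
stand-alone C helper. This is an illustration only, not the backend's
wait loop: stop_backup_wait_message() and WARN_INTERVAL_SECS are
hypothetical names, plain strings stand in for ereport(), the first
message is labelled NOTICE as Tom suggested, and repeating the warning at
every multiple of 60 seconds is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed interval; the thread mentions 60 seconds for the current warning. */
#define WARN_INTERVAL_SECS 60

/*
 * Hypothetical helper: write into buf the message, if any, that a wait
 * loop would emit after `waited` seconds.  Returns 1 if a message was
 * written, 0 if the loop should stay silent on this iteration.
 */
int
stop_backup_wait_message(int waited, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    if (waited == 0)
    {
        /* Emitted once, as soon as the wait for the archiver begins. */
        snprintf(buf, buflen,
                 "NOTICE: Stop backup work complete. "
                 "Now awaiting completion of WAL archiving.");
        return 1;
    }

    if (waited % WARN_INTERVAL_SECS == 0)
    {
        /* Periodic warning, extended with the proposed hint. */
        snprintf(buf, buflen,
                 "WARNING: pg_stop_backup is still waiting for WAL "
                 "archiving to complete (%d seconds elapsed).\n"
                 "HINT: Check if your WAL archive_command is failing. "
                 "You may abort pg_stop_backup at this point, but you "
                 "will not be able to use the resulting clone.",
                 waited);
        return 1;
    }

    buf[0] = '\0';              /* nothing to report this second */
    return 0;
}
```

A caller would invoke this once per second of waiting and print the
buffer whenever the function returns 1.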


From: Simon Riggs on
On Wed, 2010-02-24 at 16:52 -0500, Tom Lane wrote:

> Before you could enter pg_abort_backup you'd have to control-C out of
> the pg_stop_backup call, and that action already accomplishes the only
> thing pg_abort_backup could do for you.

Agreed. I was responding to perceived user need.

> So what I am thinking is that this is really just a minor bit of user
> unfriendliness in pg_stop_backup. We should address it with one or
> both of these changes:
>
> * emit a NOTICE as soon as pg_stop_backup's actual work is done and
> it's starting to wait for the archiver (or maybe after it's waited
> for a few seconds, but much less than the present 60).

Pointless really. Nobody runs backups in production by typing
pg_stop_backup() except in a demo. Nobody will see this.

> * extend the existing WARNING (and the NOTICE too if we elect to have
> one) with a HINT message explicitly saying that you can cancel the
> wait but thus-and-such consequences might ensue.

If you can see the HINT, you can also see the WARNING. If you can see
the WARNING and do nothing, I don't think we need an "objects in the
mirror may be closer than they appear" message. If people can't work out
that, when a) they are running something and b) that something is
waiting, they should cancel it, then we aren't going to have much luck
with them.

--
Simon Riggs www.2ndQuadrant.com

