Streaming replication status [PgSql]

From: Stefan Kaltenbrunner on 15 Jan 2010 01:53

Greg Smith wrote:
> Fujii Masao wrote:
>>> "I'm thinking something like pg_standbys_xlog_location() [on the primary] which returns
>>> one row per standby server, showing pid of walsender, host name/
>>> port number/user OID of the standby, the location where the standby
>>> has written/flushed WAL. DBA can measure the gap from the
>>> combination of pg_current_xlog_location() and pg_standbys_xlog_location()
>>> via one query on the primary."
>>>
>>
>> This function is useful but not essential for troubleshooting, I think.
>> So I'd like to postpone it.
>>
>
> Sure; in a functional system where primary and secondary are both up,
> you can assemble the info using the new functions you just added, so
> this other one is certainly optional. I just took a brief look at the
> code of the features you added, and it looks like it exposes the minimum
> necessary to make this whole thing possible to manage. I think it's OK
> if you postpone this other bit, more important stuff for you to work on.

agreed
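
For readers who want to see what "measuring the gap" between two reported locations amounts to, here is a minimal sketch. It assumes a WAL location is printed as two hexadecimal numbers separated by a slash (as pg_current_xlog_location() does) and treats it as a plain 64-bit position; that treatment is an assumption and may be slightly off across log-id boundaries on releases of this era. The helper names are hypothetical and not part of PostgreSQL.

```python
def parse_xlog_location(location: str) -> int:
    """Convert a textual WAL location such as '0/3000158' to an integer.

    Assumption: the part before the slash is the high 32 bits and the
    part after it is the low 32 bits of a 64-bit byte position.
    """
    high, low = location.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def xlog_gap_bytes(primary_location: str, standby_location: str) -> int:
    """Return how many bytes the standby is behind the primary."""
    return parse_xlog_location(primary_location) - parse_xlog_location(standby_location)


if __name__ == "__main__":
    # Hypothetical values: primary's current location vs. what a standby has flushed.
    print(xlog_gap_bytes("0/3000158", "0/3000000"))
```

The two input strings would come from one query on each server (or, with the proposed function, from a single query on the primary).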

>
> So: the one piece of information I thought was most important to expose
> here at an absolute minimum is there now. Good progress. The other
> popular request that keeps popping up here is providing an easy way to
> see how backlogged the archive_command is, to make it easier to monitor
> for out of disk errors that might prove catastrophic to replication.

I tend to disagree - in any reasonable production setup basic stuff
like disk space usage is monitored by non-application-specific means.
While monitoring backlog might be interesting for other reasons, citing
disk space usage/exhaustion seems just wrong.

[...]
>
> I'd find this extremely handy as a hook for monitoring scripts that want
> to watch the server but don't have access to the filesystem directly,
> even given those limitations. I'd prefer to have the "tried to"
> version, because it will populate with the name of the troublesome file
> it's stuck on even if archiving never gets its first segment delivered.

While fancy and all, I think this goes way too far for the first cut at
SR (or say this release); monitoring disk usage and tracking log files
for errors are SOLVED issues in established production setups. If you
are in an environment that does neither for each and every server,
independent of what you have running on it, or a setup where the
sysadmins are clueless and the poor DBA has to hack around that fact, you
have way bigger issues anyway.

Stefan

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: "Kevin Grittner" on 15 Jan 2010 08:27

Greg Smith wrote:

> to make it easier to monitor for out of disk errors that might
> prove catastrophic to replication.

We handle that with the fsutil functions (in pgfoundry). These can
actually measure free space on each volume. They weren't portable
enough to include in core, but maybe they could be made more
portable?

-Kevin


From: Greg Smith on 15 Jan 2010 11:50

Stefan Kaltenbrunner wrote:
> Greg Smith wrote:
>>
>> The other popular request that keeps popping up here is providing an
>> easy way to see how backlogged the archive_command is, to make it
>> easier to monitor for out of disk errors that might prove
>> catastrophic to replication.
>
> I tend to disagree - in any reasonable production setup basic stuff
> like disk space usage is monitored by non-application-specific means.
> While monitoring backlog might be interesting for other reasons,
> citing disk space usage/exhaustion seems just wrong.

I was just mentioning that one use of the data, but there are others.
Let's say that your archive_command works by copying things over to an
NFS mount, and the mount goes down. It could be a long time before you
noticed this via disk space monitoring. But if you were monitoring "how
long has it been since the last time pg_last_archived_xlogfile()
changed?", this would jump right out at you.
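
A minimal sketch of that "how long since it last changed?" check follows. pg_last_archived_xlogfile() is only the function proposed in this thread, so the sketch does not call it; it just takes whatever segment name the monitoring script managed to fetch. The class name, the threshold, and the injectable clock are hypothetical choices made for illustration.

```python
import time
from typing import Callable, Optional


class ArchiveStallDetector:
    """Flags when the last-archived segment name has not changed for too long."""

    def __init__(self, max_idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_idle_seconds = max_idle_seconds
        self._clock = clock
        self._last_value: Optional[str] = None
        self._last_change: Optional[float] = None

    def observe(self, archived_file: str) -> bool:
        """Record one sample; return True if archiving looks stalled."""
        now = self._clock()
        if archived_file != self._last_value:
            # The archiver made progress (or this is the first sample).
            self._last_value = archived_file
            self._last_change = now
            return False
        return (now - self._last_change) > self.max_idle_seconds
```

A script would call observe() on each polling cycle with the freshly fetched value. Note that an idle primary produces no new segments unless archive_timeout is set, so an unchanged value is a hint to investigate rather than proof of a failure.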

Another popular question is "how far behind real-time is the archiver
process?" You can do this right now by duplicating the same xlog file
name scanning and sorting that the archiver does in your own code,
looking for .ready files. It would be simpler if you could call
pg_last_archived_xlogfile() and then just grab that file's timestamp.
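
For comparison, the do-it-yourself scan described above might look roughly like the sketch below. It assumes the script can read the archive_status directory under pg_xlog, where a segment waiting to be archived has a matching ".ready" file, and it orders segments by plain name sorting, which only approximates what the archiver itself does. The function names are hypothetical.

```python
import os
import time
from typing import List, Optional, Tuple


def pending_segments(archive_status_dir: str) -> List[str]:
    """Return the names of segments still waiting to be archived, oldest first."""
    names = [
        name[: -len(".ready")]
        for name in os.listdir(archive_status_dir)
        if name.endswith(".ready")
    ]
    return sorted(names)


def archiver_lag(archive_status_dir: str,
                 now: Optional[float] = None) -> Tuple[int, Optional[float]]:
    """Return (backlog size, age in seconds of the oldest pending segment)."""
    segments = pending_segments(archive_status_dir)
    if not segments:
        return 0, None
    oldest = os.path.join(archive_status_dir, segments[0] + ".ready")
    now = time.time() if now is None else now
    # The .ready file's mtime approximates when the segment became archivable.
    return len(segments), now - os.path.getmtime(oldest)
```

This needs filesystem access on the database server, which is exactly the requirement a SQL-level function would remove.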

I think it's also important to consider the fact that diagnostic
internals exposed via the database are far more useful to some people
than things you have to set up outside of it. You talk about reasonable
configurations above, but some production setups are not so reasonable.
In many of the more secure environments I've worked in (finance,
defense), there is *no* access to the database server beyond what comes
out of port 5432 without getting a whole separate team of people
involved. If the DBA can write a simple monitoring program themselves
that presents data via the one port that is exposed, that makes life
easier for them. This same issue pops up sometimes when we consider the
shared hosting case too, where the user may not have the option of
running a full-fledged monitoring script.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.com


From: "Kevin Grittner" on 15 Jan 2010 11:55

Greg Smith <greg(a)2ndquadrant.com> wrote:

> In many of the more secure environments I've worked in (finance,
> defense), there is *no* access to the database server beyond what
> comes out of port 5432 without getting a whole separate team of
> people involved. If the DBA can write a simple monitoring program
> themselves that presents data via the one port that is exposed,
> that makes life easier for them.

Right, we don't want to give the monitoring software an OS login for
the database servers, for security reasons.

-Kevin


From: Stefan Kaltenbrunner on 15 Jan 2010 12:24

Greg Smith wrote:
> Stefan Kaltenbrunner wrote:
>> Greg Smith wrote:
>>>
>>> The other popular request that keeps popping up here is providing an
>>> easy way to see how backlogged the archive_command is, to make it
>>> easier to monitor for out of disk errors that might prove
>>> catastrophic to replication.
>>
>> I tend to disagree - in any reasonable production setup basic stuff
>> like disk space usage is monitored by non-application-specific means.
>> While monitoring backlog might be interesting for other reasons,
>> citing disk space usage/exhaustion seems just wrong.
>
> I was just mentioning that one use of the data, but there are others.
> Let's say that your archive_command works by copying things over to an
> NFS mount, and the mount goes down. It could be a long time before you
> noticed this via disk space monitoring. But if you were monitoring "how
> long has it been since the last time pg_last_archived_xlogfile()
> changed?", this would jump right out at you.

Well, from a sysadmin perspective you have to monitor the NFS mount
anyway - so why do you need the database to do it too (and not in a sane
way, because there is no way the database can even figure out what the
real problem is, or whether there is one)?

>
> Another popular question is "how far behind real-time is the archiver
> process?" You can do this right now by duplicating the same xlog file
> name scanning and sorting that the archiver does in your own code,
> looking for .ready files. It would be simpler if you could call
> pg_last_archived_xlogfile() and then just grab that file's timestamp.

Well, that one seems like more reasonable reasoning to me; however, I'm
not so sure that the proposed implementation feels right - though I can't
come up with a better suggestion for now.

>
> I think it's also important to consider the fact that diagnostic
> internals exposed via the database are far more useful to some people
> than things you have to set up outside of it. You talk about reasonable
> configurations above, but some production setups are not so reasonable.
> In many of the more secure environments I've worked in (finance,
> defense), there is *no* access to the database server beyond what comes
> out of port 5432 without getting a whole separate team of people
> involved. If the DBA can write a simple monitoring program themselves
> that presents data via the one port that is exposed, that makes life
> easier for them. This same issue pops up sometimes when we consider the
> shared hosting case too, where the user may not have the option of
> running a full-fledged monitoring script.

Well, again, I consider stuff like "available diskspace" or "NFS mount
available" completely in the realm of OS-level management. The
database side should focus on the stuff that concerns the internal state
and operation of the database app itself.
If you continue your line of thought you will have to add all kinds of
stuff to the database, like CPU usage tracking, getting information
about running processes, storage health.
As soon as you are done you have reimplemented nagios-plugins over SQL
on port 5432 instead of NRPE (or SNMP or whatnot).
Again, I fully understand and know that there are environments where the
DBA does not have OS-level access (be it root or any shell at all),
but even if you had that "archiving is hanging" function you
would still have to go back to that "completely different group" and
have them diagnose it again.
So my point is that even if you have disparate groups of people being
responsible for different parts of a system solution, you can't really
work around incompetence (or slowness or whatever) of the group
responsible for the lower layer by adding partial and inexact
functionality at the upper part that can only guess what the real issue is.

Stefan

