Streaming replication and a disk full in primary [PgSql]

From: Heikki Linnakangas on 7 Apr 2010 06:02

This task has been languishing for a long time, so I took a shot at it.
I took the approach I suggested before, keeping a variable in shared
memory to track the latest removed WAL segment. After walsender has read
a bunch of WAL records from a WAL file, it checks that what it read is
after the latest removed WAL segment, otherwise the data it read might
have come from a file that was already recycled and overwritten with new
data, and an error is thrown.
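
[Editor's sketch] The check described above can be illustrated roughly as
follows. This is a simplified model, not the patch itself: the struct, the
function names and the single 64-bit segment number are all hypothetical,
and real code would have to protect the shared variable with a lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the shared-memory state. Segment numbers
 * start at 1 in this sketch, so 0 means "nothing removed yet".
 */
typedef struct
{
    uint64_t last_removed_seg;
} WalShared;

/* Called by whoever removes or recycles an old segment. */
static void
note_segment_removed(WalShared *shared, uint64_t segno)
{
    /* Only ever move forward; removals may be reported out of order. */
    if (segno > shared->last_removed_seg)
        shared->last_removed_seg = segno;
}

/*
 * Called by the sender AFTER read() has returned. The data is usable
 * only if the segment it came from is newer than the latest removed one;
 * otherwise the file may have been overwritten while it was being read.
 */
static bool
read_is_still_valid(const WalShared *shared, uint64_t read_segno)
{
    assert(read_segno > 0);
    return read_segno > shared->last_removed_seg;
}
```

The point of checking after the read rather than before it is that a file
that was fine when it was opened can still be recycled during the read.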

This changes the behavior so that if a standby server doing streaming
replication falls behind too much, the primary will remove/recycle a WAL
segment needed by the standby server. The previous behavior was that WAL
segments still needed by any connected standby server were never
removed, at the risk of filling the disk in the primary if a standby
server behaves badly.

In your version of this patch, the default was still the current
behavior where the primary retains WAL files that are still needed by
connected standby servers indefinitely. I think that's a dangerous
default, so I changed it so that if you don't set standby_keep_segments,
the primary doesn't retain any extra segments; the number of WAL
segments available for standby servers is determined only by the
location of the previous checkpoint, and the status of WAL archiving.
That makes the code a bit simpler too, as we never care how far the
walsenders are. In fact, the GetOldestWALSenderPointer() function is now
dead code.
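
[Editor's sketch] One way to picture the new default is as a cutoff
computation. The function below is an illustration under assumptions, not
code from the patch: the name and the exact way the extra segments are
counted are hypothetical, and the archiving status mentioned above is left
out entirely.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the oldest segment number that must be kept.
 *
 * checkpoint_seg : segment holding the previous checkpoint (always kept)
 * current_seg    : segment currently being written
 * keep_segments  : value of standby_keep_segments; 0 means "no extras"
 */
static uint64_t
oldest_segment_to_keep(uint64_t checkpoint_seg, uint64_t current_seg,
                       uint64_t keep_segments)
{
    uint64_t keep_from;

    assert(checkpoint_seg <= current_seg);

    /* Default: only the checkpoint location decides what is kept. */
    if (keep_segments == 0)
        return checkpoint_seg;

    /* Count back from the current segment, without underflowing. */
    keep_from = (current_seg > keep_segments)
        ? current_seg - keep_segments
        : 0;

    /* Never remove anything the checkpoint still needs. */
    return (keep_from < checkpoint_seg) ? keep_from : checkpoint_seg;
}
```

Note that nothing here looks at how far behind any walsender is, which is
why a lookup like GetOldestWALSenderPointer() is no longer needed.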

Fujii Masao wrote:
> Thanks for the review! And, sorry for the delay.
>
> On Thu, Jan 21, 2010 at 11:10 PM, Heikki Linnakangas
> <heikki.linnakangas(a)enterprisedb.com> wrote:
>> I don't think we should do the check in XLogWrite(). There's really no
>> reason to kill the standby connections before the next checkpoint, when
>> the old WAL files are recycled. XLogWrite() is in the critical path of
>> normal operations, too.
>
> OK. I'll remove that check from XLogWrite().
>
>> There's another important reason for that: If archiving is not working
>> for some reason, the standby can't obtain the old segments from the
>> archive either. If we refuse to stream such old segments, and they're
>> not getting archived, the standby has no way to catch up until archiving
>> is fixed. Allowing streaming of such old segments is free wrt. disk
>> space, because we're keeping the files around anyway.
>
> OK. We should terminate the walsender whose currently-opened WAL file
> has been already archived, isn't required for crash recovery AND is
> 'max-lag' older than the currently-written one. I'll change so.
>
>> Walreceiver will get an error if it tries to open a segment that's been
>> deleted or recycled already. The dangerous situation we need to avoid is
>> when walreceiver holds a file open while bgwriter recycles it.
>> Walreceiver will merrily continue streaming data from it, even though
>> it's been overwritten by new data already.
>
> s/walreceiver/walsender ?
>
> Yes, that's the problem that I'll have to fix.
>
>> A straightforward fix is to keep a "newest recycled XLogRecPtr" in
>> shared memory that RemoveOldXlogFiles() updates. Walreceiver checks it
>> right after read()ing from a file, before sending it to the client, and
>> throws an error if the data it read() was already recycled.
>
> I prefer this. But I don't think such an aggressive check of a "newest
> recycled XLogRecPtr" is required if the bgwriter never deletes a WAL
> file that is newer than or equal to the walsenders' oldest WAL
> file. In other words, the WAL files which the walsender is reading (or
> will read) are not removed at the moment.
>
> Regards,
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Robert Haas on 7 Apr 2010 13:11

On Wed, Apr 7, 2010 at 6:02 AM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> This task has been languishing for a long time, so I took a shot at it.
> I took the approach I suggested before, keeping a variable in shared
> memory to track the latest removed WAL segment. After walsender has read
> a bunch of WAL records from a WAL file, it checks that what it read is
> after the latest removed WAL segment, otherwise the data it read might
> have come from a file that was already recycled and overwritten with new
> data, and an error is thrown.
>
> This changes the behavior so that if a standby server doing streaming
> replication falls behind too much, the primary will remove/recycle a WAL
> segment needed by the standby server. The previous behavior was that WAL
> segments still needed by any connected standby server were never
> removed, at the risk of filling the disk in the primary if a standby
> server behaves badly.
>
> In your version of this patch, the default was still the current
> behavior where the primary retains WAL files that are still needed by
> connected standby servers indefinitely. I think that's a dangerous
> default, so I changed it so that if you don't set standby_keep_segments,
> the primary doesn't retain any extra segments; the number of WAL
> segments available for standby servers is determined only by the
> location of the previous checkpoint, and the status of WAL archiving.
> That makes the code a bit simpler too, as we never care how far the
> walsenders are. In fact, the GetOldestWALSenderPointer() function is now
> dead code.

This seems like a very useful feature, but I can't speak to the code
quality without a good deal more study.

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Fujii Masao on 8 Apr 2010 02:33

Thanks for the great patch! I apologize for leaving the issue
half-finished for a long time :(

On Wed, Apr 7, 2010 at 7:02 PM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
> In your version of this patch, the default was still the current
> behavior where the primary retains WAL files that are still needed by
> connected standby servers indefinitely. I think that's a dangerous
> default, so I changed it so that if you don't set standby_keep_segments,
> the primary doesn't retain any extra segments; the number of WAL
> segments available for standby servers is determined only by the
> location of the previous checkpoint, and the status of WAL archiving.
> That makes the code a bit simpler too, as we never care how far the
> walsenders are. In fact, the GetOldestWALSenderPointer() function is now
> dead code.

I'm fine with changing the default behavior. We can remove
the GetOldestWALSenderPointer() function.

doc/src/sgml/config.sgml
- archival or to recover from a checkpoint. If standby_keep_segments
+ archival or to recover from a checkpoint. If <varname>standby_keep_segments</>

The word "standby_keep_segments" always needs the <varname> tag, I think.

Should we remove the documentation section "25.2.5.2. Monitoring"?

Why is standby_keep_segments used even if max_wal_senders is zero?
In that case, ISTM we don't need to keep any WAL files in pg_xlog
for the standby.

When XLogRead() reads from two WAL files and only the older of them is
recycled while it is being read, the check of whether the read data is
valid might fail. This is because the variable "recptr" can advance to the
newer WAL file before the check.
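
[Editor's sketch] The problem can be reproduced with a toy model. The
names, the 16 MB segment size and the 1-based segment numbering below are
assumptions made for the illustration; this is not the XLogRead() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed segment size for this sketch. */
#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Map a byte position to a 1-based segment number (0 = "none"). */
static uint64_t
segment_of(uint64_t pos)
{
    return pos / WAL_SEG_SIZE + 1;
}

/*
 * Validate a completed read against the latest removed segment.
 * The caller must pass the position where the read STARTED; passing a
 * pointer that has already advanced past the data would test the wrong
 * segment when the read crossed a segment boundary.
 */
static bool
read_is_valid(uint64_t start_pos, uint64_t last_removed_seg)
{
    assert(last_removed_seg <= segment_of(UINT64_MAX - WAL_SEG_SIZE));
    return segment_of(start_pos) > last_removed_seg;
}
```

For a read that starts 100 bytes before the end of segment 1 and ends in
segment 2, with segment 1 already recycled, the check must fail. It does
so only when it is given the start position, not the advanced one.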

If walreceiver gets stuck for some reason, walsender would be unable to
return from the send() system call, and would get stuck too. With the
patch, such a walsender can never exit, because it never reaches
XLogRead(). So I think that the bgwriter needs to send an exit signal to
a walsender that has fallen too far behind. Thoughts?

The shared-memory variable for the latest recycled WAL file is updated
before checking whether that file has already been archived. If archiving
is not working for some reason, the WAL file that the variable points to
might not actually have been recycled yet. In this case, the standby can
obtain the WAL file neither from the primary, because it's been marked as
"latest recycled", nor from the archive, because it hasn't been archived
yet. This seems to be a big problem. How about moving the update of the
shared-memory variable to after the call to XLogArchiveCheckDone() in
RemoveOldXlogFiles()?
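
[Editor's sketch] The proposed ordering can be modelled like this. The
Segment struct and the function are hypothetical simplifications: the
"archived" flag stands in for whatever XLogArchiveCheckDone() reports, and
the sketch is not the real RemoveOldXlogFiles().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one WAL segment file. */
typedef struct
{
    uint64_t segno;     /* 1-based segment number */
    bool     archived;  /* stand-in for the archive-done check */
} Segment;

/*
 * Remove every segment older than 'cutoff' that has been archived.
 * *last_removed is advanced only AFTER the archive check has passed, so
 * a segment that is still waiting for the archiver is never advertised
 * as gone. Returns the number of segments removed.
 */
static size_t
remove_old_segments(const Segment *segs, size_t n, uint64_t cutoff,
                    uint64_t *last_removed)
{
    size_t removed = 0;

    assert(last_removed != NULL);

    for (size_t i = 0; i < n; i++)
    {
        if (segs[i].segno >= cutoff)
            continue;           /* still needed, keep it */
        if (!segs[i].archived)
            continue;           /* not archived yet: keep, don't record */

        if (segs[i].segno > *last_removed)
            *last_removed = segs[i].segno;
        removed++;              /* the real code would recycle the file here */
    }
    return removed;
}
```

With the update placed before the archive check instead, the last segment
in a run of unarchived files would be reported as removed while still
sitting on disk.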

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From: Heikki Linnakangas on 12 Apr 2010 06:41

Fujii Masao wrote:
> doc/src/sgml/config.sgml
> - archival or to recover from a checkpoint. If standby_keep_segments
> + archival or to recover from a checkpoint. If <varname>standby_keep_segments</>
>
> The word "standby_keep_segments" always needs the <varname> tag, I think.

Thanks, fixed.

> Should we remove the documentation section "25.2.5.2. Monitoring"?

I updated it to no longer claim that the primary can run out of disk
space because of a hung WAL sender. The information about calculating
the lag between primary and standby still seems valuable, so I didn't
remove the whole section.

> Why is standby_keep_segments used even if max_wal_senders is zero?
> In that case, ISTM we don't need to keep any WAL files in pg_xlog
> for the standby.

True. I don't think we should second-guess the admin on that, though.
Perhaps he only set max_wal_senders=0 temporarily, and will be
disappointed if the logs are no longer there when he sets it back to
non-zero and restarts the server.

> When XLogRead() reads from two WAL files and only the older of them is
> recycled while it is being read, the check of whether the read data is
> valid might fail. This is because the variable "recptr" can advance to the
> newer WAL file before the check.

Thanks, fixed.

> If walreceiver gets stuck for some reason, walsender would be unable to
> return from the send() system call, and would get stuck too. With the
> patch, such a walsender can never exit, because it never reaches
> XLogRead(). So I think that the bgwriter needs to send an exit signal to
> a walsender that has fallen too far behind. Thoughts?

Any backend can get stuck like that.

> The shared-memory variable for the latest recycled WAL file is updated
> before checking whether that file has already been archived. If archiving
> is not working for some reason, the WAL file that the variable points to
> might not actually have been recycled yet. In this case, the standby can
> obtain the WAL file neither from the primary, because it's been marked as
> "latest recycled", nor from the archive, because it hasn't been archived
> yet. This seems to be a big problem. How about moving the update of the
> shared-memory variable to after the call to XLogArchiveCheckDone() in
> RemoveOldXlogFiles()?

Good point. It's particularly important considering that if a segment
hasn't been archived yet, it's not available to the standby from the
archive either. I changed that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Fujii Masao on 12 Apr 2010 08:39

On Mon, Apr 12, 2010 at 7:41 PM, Heikki Linnakangas
<heikki.linnakangas(a)enterprisedb.com> wrote:
>> Should we remove the documentation section "25.2.5.2. Monitoring"?
>
> I updated it to no longer claim that the primary can run out of disk
> space because of a hung WAL sender. The information about calculating
> the lag between primary and standby still seems valuable, so I didn't
> remove the whole section.

Yes.

> ! An important health indicator of streaming replication is the amount
> ! of WAL records generated in the primary, but not yet applied in the
> ! standby.

Since pg_last_xlog_receive_location doesn't tell us which WAL location
has not yet been applied, we should use pg_last_xlog_replay_location
instead. How about this?:

----------------
An important health indicator of streaming replication is the amount
of WAL records generated in the primary, but not yet applied in the
standby. You can calculate this lag by comparing the current WAL write
- location on the primary with the last WAL location received by the
+ location on the primary with the last WAL location replayed by the
standby. They can be retrieved using
<function>pg_current_xlog_location</> on the primary and the
- <function>pg_last_xlog_receive_location</> on the standby,
+ <function>pg_last_xlog_replay_location</> on the standby,
respectively (see <xref linkend="functions-admin-backup-table"> and
<xref linkend="functions-recovery-info-table"> for details).
- The last WAL receive location in the standby is also displayed in the
- process status of the WAL receiver process, displayed using the
- <command>ps</> command (see <xref linkend="monitoring-ps"> for details).
</para>
</sect3>
----------------
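
[Editor's sketch] The comparison described in the proposed text could be
done by a small monitoring helper like the one below. It assumes the two
functions return locations in the "logid/offset" hexadecimal text form,
and that each log id covers 0xFF000000 bytes, which is my understanding of
releases from this period; check both assumptions against the server
version in use. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed number of bytes covered by one log id (see lead-in). */
#define BYTES_PER_LOG_ID UINT64_C(0xFF000000)

/* Parse a "logid/offset" hex location into a linear byte position. */
static bool
parse_location(const char *text, uint64_t *pos)
{
    unsigned int logid;
    unsigned int offset;

    assert(text != NULL && pos != NULL);

    if (sscanf(text, "%X/%X", &logid, &offset) != 2)
        return false;

    *pos = (uint64_t) logid * BYTES_PER_LOG_ID + offset;
    return true;
}

/*
 * Compute how many bytes of WAL the standby still has to replay.
 * primary_loc: output of pg_current_xlog_location() on the primary
 * standby_loc: output of pg_last_xlog_replay_location() on the standby
 * Returns false if either string cannot be parsed.
 */
static bool
replication_lag_bytes(const char *primary_loc, const char *standby_loc,
                      int64_t *lag)
{
    uint64_t primary_pos;
    uint64_t standby_pos;

    if (!parse_location(primary_loc, &primary_pos) ||
        !parse_location(standby_loc, &standby_pos))
        return false;

    *lag = (int64_t) primary_pos - (int64_t) standby_pos;
    return true;
}
```

Using the replay location rather than the receive location means the
result reflects WAL not yet applied, which is what the paragraph claims to
measure.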

>> Why is standby_keep_segments used even if max_wal_senders is zero?
>> In that case, ISTM we don't need to keep any WAL files in pg_xlog
>> for the standby.
>
> True. I don't think we should second-guess the admin on that, though.
> Perhaps he only set max_wal_senders=0 temporarily, and will be
> disappointed if the logs are no longer there when he sets it back to
> non-zero and restarts the server.

OK. Since the behavior is not intuitive to me, I'd like to add a note at
the end of the description of "standby_keep_segments". How about this?:

----------------
This setting has effect even if max_wal_senders is zero.
----------------

>> If walreceiver gets stuck for some reason, walsender would be unable to
>> return from the send() system call, and would get stuck too. With the
>> patch, such a walsender can never exit, because it never reaches
>> XLogRead(). So I think that the bgwriter needs to send an exit signal to
>> a walsender that has fallen too far behind. Thoughts?
>
> Any backend can get stuck like that.

OK.

> + },
> +
> + {
> + {"standby_keep_segments", PGC_SIGHUP, WAL_CHECKPOINTS,
> + gettext_noop("Sets the number of WAL files held for standby servers"),
> + NULL
> + },
> + &StandbySegments,
> + 0, 0, INT_MAX, NULL, NULL

We should s/WAL_CHECKPOINTS/WAL_REPLICATION ?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
