Keepalive for max_standby

From: Simon Riggs on 18 May 2010 17:17

On Tue, 2010-05-18 at 17:06 -0400, Heikki Linnakangas wrote:
> On 17/05/10 04:40, Simon Riggs wrote:
> > On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
> >>>
> >>> Attached patch rearranges the walsender loops slightly to fix the above.
> >>> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
> >>> 2) in one round and returns to the main loop after that even if there's
> >>> unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
> >>> That way the main loop gets to respond to signals quickly, and we also
> >>> get to update the shared memory status and PS display more often when
> >>> there's a lot of catching up to do.
> >>>
> >>> Comments
> >>
> >> 8MB at a time still seems like a large batch to me.
> >>
> >> libpq is going to send it in smaller chunks anyway, so I don't see the
> >> importance of trying to keep the batch too large. It just introduces
> >> delay into the sending process. We should be sending chunks that match
> >> libpq better.
> >
> > More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE
> > because it will send partial pages.
>
> I don't see a failure. We rely on not splitting WAL records across
> messages, but we're talking about libpq-level CopyData messages, not TCP
> messages.

OK

> > Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently
> > stands.
>
> Well, it does affect the size of the read() in walsender, and I'm sure
> there's some overhead in setting the ps display and the other small
> stuff we do once per message. But you're probably right that we could
> easily make MAX_SEND_SIZE much smaller with no noticeable effect on
> performance, while making walsender more responsive to signals. I'll
> decrease it to, say, 512 kB.

I'm pretty certain we don't need to set the ps display once per message.
ps doesn't need an update 5 times per second on average.

There's no reason that the buffer size we use for XLogRead() should be
the same as the send buffer, if you're worried about that. My point is
that pq_putmessage contains internal flushes so at the libpq level you
gain nothing by big batches. The read() will be buffered anyway with
readahead so not sure what the issue is. We'll have to do this for sync
rep anyway, so what's the big deal? Just do it now, once. Do we really
want 9.1 code to differ here?

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on 18 May 2010 17:21

On Tue, 2010-05-18 at 17:08 -0400, Heikki Linnakangas wrote:
> On 17/05/10 12:36, Jim Nasby wrote:
> > On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
> >> What exactly is the user trying to monitor? If it's "how far behind is
> >> the standby", the difference between pg_current_xlog_insert_location()
> >> in the master and pg_last_xlog_replay_location() in the standby seems
> >> more robust and well-defined to me. It's a measure of XLOG location (ie.
> >> bytes) instead of time, but time is a complicated concept.
> >
> > I can tell you that end users *will* want a time-based indication of how far behind we are. DBAs will understand "we're this many transactions behind", but managers and end users won't. Unless it's unreasonable to provide that info, we should do so.
>
> No doubt about that, the problem is that it's hard to provide a reliable
> time-based indication.

I think I have one now.
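
For the byte-based measure quoted above, a minimal sketch of turning two XLOG location strings (as returned by pg_current_xlog_insert_location() on the master and pg_last_xlog_replay_location() on the standby) into a byte difference. The function names here are hypothetical, and the default of 0xFF000000 usable bytes per xlogid is an assumption that applies to releases of this era with 16 MB segments; adjust it if your build differs.

```python
def parse_xlog_location(text):
    """Split an 'XLOGID/XRECOFF' string (both parts hex) into two ints."""
    xlogid, xrecoff = text.split("/")
    return int(xlogid, 16), int(xrecoff, 16)


def lag_bytes(master_loc, standby_loc, bytes_per_xlogid=0xFF000000):
    """Approximate number of WAL bytes the standby is behind the master.

    bytes_per_xlogid is an assumption (see the note above), not a value
    taken from this thread.
    """
    m_id, m_off = parse_xlog_location(master_loc)
    s_id, s_off = parse_xlog_location(standby_loc)
    return (m_id - s_id) * bytes_per_xlogid + (m_off - s_off)


if __name__ == "__main__":
    # Example with made-up locations: standby is one 16 MB segment behind.
    print(lag_bytes("0/2000000", "0/1000000"))
```

This gives a lag in bytes only; it says nothing about time, which is the harder problem being discussed here.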

--
Simon Riggs www.2ndQuadrant.com

From: Heikki Linnakangas on 18 May 2010 17:25

On 18/05/10 17:17, Simon Riggs wrote:
> There's no reason that the buffer size we use for XLogRead() should be
> the same as the send buffer, if you're worried about that. My point is
> that pq_putmessage contains internal flushes so at the libpq level you
> gain nothing by big batches. The read() will be buffered anyway with
> readahead so not sure what the issue is. We'll have to do this for sync
> rep anyway, so what's the big deal? Just do it now, once. Do we really
> want 9.1 code to differ here?

Do what? What exactly is it that you want instead, then?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Simon Riggs on 18 May 2010 17:37

On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote:
> On 18/05/10 17:17, Simon Riggs wrote:
> > There's no reason that the buffer size we use for XLogRead() should be
> > the same as the send buffer, if you're worried about that. My point is
> > that pq_putmessage contains internal flushes so at the libpq level you
> > gain nothing by big batches. The read() will be buffered anyway with
> > readahead so not sure what the issue is. We'll have to do this for sync
> > rep anyway, so what's the big deal? Just do it now, once. Do we really
> > want 9.1 code to differ here?
>
> Do what? What exactly is it that you want instead, then?

Read and write smaller messages, so the latency is minimised. Libpq will
send in 8192 byte packets, so writing anything larger gains nothing when
the WAL data is also chunked at exactly the same size.

--
Simon Riggs www.2ndQuadrant.com

From: Heikki Linnakangas on 26 May 2010 18:34

On 19/05/10 00:37, Simon Riggs wrote:
> On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote:
>> On 18/05/10 17:17, Simon Riggs wrote:
>>> There's no reason that the buffer size we use for XLogRead() should be
>>> the same as the send buffer, if you're worried about that. My point is
>>> that pq_putmessage contains internal flushes so at the libpq level you
>>> gain nothing by big batches. The read() will be buffered anyway with
>>> readahead so not sure what the issue is. We'll have to do this for sync
>>> rep anyway, so what's the big deal? Just do it now, once. Do we really
>>> want 9.1 code to differ here?
>>
>> Do what? What exactly is it that you want instead, then?
>
> Read and write smaller messages, so the latency is minimised. Libpq will
> send in 8192 byte packets, so writing anything larger gains nothing when
> the WAL data is also chunked at exactly the same size.

Committed with chunk size of 128 kB. I hope that's a reasonable
compromise, in the absence of any performance test data either way.

I'm wary of setting it as low as 8k, as there is some per-message
overhead. Some of that could be avoided by rearranging the loops so that
the ps display is not updated at every message as you suggested, but I
don't feel like doing any extra rearrangements at this point. It would not
be hard, but it also certainly wouldn't make the code simpler.

I believe in practice 128kB is just as good as 8k from the
responsiveness point of view. If a standby is not responding, walsender
will be stuck trying to send no matter what the block size is. If it is
responding, no matter how slowly, 128kB should get transferred pretty
quickly.
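
To make the trade-off concrete, here is a rough simulation of the loop structure described in this thread: each round sends at most one chunk and then returns to the main loop, which is where signals would be checked. This is only a sketch, not the walsender code; send_round() and rounds_to_catch_up() are hypothetical names, and the 8192-byte packet size is the figure cited earlier in the thread.

```python
MAX_SEND_SIZE = 128 * 1024  # chunk size from this thread (128 kB)
PACKET_SIZE = 8192          # libpq packet size cited earlier in the thread


def send_round(sent_upto, flushed_upto, max_send_size=MAX_SEND_SIZE):
    """Hypothetical stand-in for one send round.

    Sends at most max_send_size bytes and reports whether the sender has
    caught up. Returns (new_sent_upto, caught_up).
    """
    remaining = flushed_upto - sent_upto
    nbytes = min(remaining, max_send_size)
    return sent_upto + nbytes, nbytes == remaining


def rounds_to_catch_up(backlog, max_send_size=MAX_SEND_SIZE):
    """Count how often control returns to the main loop for a given backlog."""
    sent, rounds, caught_up = 0, 0, backlog == 0
    while not caught_up:
        # A real main loop would check for signals and update status here,
        # and would not sleep while unsent data remains.
        sent, caught_up = send_round(sent, backlog, max_send_size)
        rounds += 1
    return rounds


if __name__ == "__main__":
    backlog = 16 * 1024 * 1024  # one 16 MB segment of unsent data
    print(rounds_to_catch_up(backlog))                   # 128 kB chunks
    print(rounds_to_catch_up(backlog, 8 * 1024 * 1024))  # 8 MB chunks
    print(MAX_SEND_SIZE // PACKET_SIZE)                  # packets per chunk
```

With these numbers a 16 MB backlog returns to the main loop 128 times at 128 kB per round, versus twice at 8 MB per round, and each 128 kB chunk spans 16 packets of 8192 bytes.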

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
