From: Simon Riggs on 18 May 2010 17:17

On Tue, 2010-05-18 at 17:06 -0400, Heikki Linnakangas wrote:
> On 17/05/10 04:40, Simon Riggs wrote:
> > On Sun, 2010-05-16 at 16:53 +0100, Simon Riggs wrote:
> >>>
> >>> Attached patch rearranges the walsender loops slightly to fix the above.
> >>> XLogSend() now only sends up to MAX_SEND_SIZE bytes (== XLOG_SEG_SIZE /
> >>> 2) in one round and returns to the main loop after that even if there's
> >>> unsent WAL, and the main loop no longer sleeps if there's unsent WAL.
> >>> That way the main loop gets to respond to signals quickly, and we also
> >>> get to update the shared memory status and PS display more often when
> >>> there's a lot of catching up to do.
> >>>
> >>> Comments
> >>
> >> 8MB at a time still seems like a large batch to me.
> >>
> >> libpq is going to send it in smaller chunks anyway, so I don't see the
> >> importance of trying to keep the batch too large. It just introduces
> >> delay into the sending process. We should be sending chunks that match
> >> libpq better.
> >
> > More to the point the logic will fail if XLOG_BLCKSZ > PQ_BUFFER_SIZE
> > because it will send partial pages.
>
> I don't see a failure. We rely on not splitting WAL records across
> messages, but we're talking about libpq-level CopyData messages, not TCP
> messages.

OK

> > Having MAX_SEND_SIZE > PQ_BUFFER_SIZE is pointless, as libpq currently
> > stands.
>
> Well, it does affect the size of the read() in walsender, and I'm sure
> there's some overhead in setting the ps display and the other small
> stuff we do once per message. But you're probably right that we could
> easily make MAX_SEND_SIZE much smaller with no noticeable effect on
> performance, while making walsender more responsive to signals. I'll
> decrease it to, say, 512 kB.

I'm pretty certain we don't need to set the ps display once per message.
ps doesn't need an update 5 times per second on average.

There's no reason that the buffer size we use for XLogRead() should be
the same as the send buffer, if you're worried about that. My point is
that pq_putmessage contains internal flushes so at the libpq level you
gain nothing by big batches. The read() will be buffered anyway with
readahead so not sure what the issue is. We'll have to do this for sync
rep anyway, so what's the big deal? Just do it now, once. Do we really
want 9.1 code to differ here?

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
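The loop rearrangement described in the quoted patch summary can be illustrated with a small simulation. This is only a sketch in Python, not the walsender code: `xlog_send`, `run_walsender`, and the byte positions are hypothetical stand-ins, and the 16 MB segment size is an assumption taken from the "XLOG_SEG_SIZE / 2 == 8MB" figures in the thread. It shows the one property under discussion: each round sends at most MAX_SEND_SIZE bytes and then hands control back to the main loop, which does not sleep while WAL remains unsent.

```python
from typing import List

XLOG_SEG_SIZE = 16 * 1024 * 1024    # assumed default WAL segment size
MAX_SEND_SIZE = XLOG_SEG_SIZE // 2  # per-round cap mentioned in the thread (8 MB)


def xlog_send(sent_upto: int, write_upto: int, max_send: int) -> int:
    """Hypothetical stand-in for one send round.

    Sends at most max_send bytes and returns the new sent position,
    even if more WAL is still waiting.
    """
    return min(write_upto, sent_upto + max_send)


def run_walsender(write_upto: int, max_send: int = MAX_SEND_SIZE) -> List[int]:
    """Simulate the main loop until the sender has caught up.

    Returns the sent position seen at the top of each iteration, i.e. the
    points where a real loop could react to signals and update its status.
    """
    sent = 0
    checkpoints = []
    while True:
        checkpoints.append(sent)  # signals handled, status updated here
        if sent >= write_upto:
            break                 # caught up: a real loop would sleep now
        sent = xlog_send(sent, write_upto, max_send)
    return checkpoints


if __name__ == "__main__":
    # 20 MB of backlog with an 8 MB cap: control returns to the loop
    # after every round instead of after the whole backlog.
    print(run_walsender(20 * 1024 * 1024))
```

Lowering `max_send` only changes how often the loop regains control; the total number of bytes sent is the same.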
From: Simon Riggs on 18 May 2010 17:21

On Tue, 2010-05-18 at 17:08 -0400, Heikki Linnakangas wrote:
> On 17/05/10 12:36, Jim Nasby wrote:
> > On May 15, 2010, at 12:05 PM, Heikki Linnakangas wrote:
> >> What exactly is the user trying to monitor? If it's "how far behind is
> >> the standby", the difference between pg_current_xlog_insert_location()
> >> in the master and pg_last_xlog_replay_location() in the standby seems
> >> more robust and well-defined to me. It's a measure of XLOG location (i.e.
> >> bytes) instead of time, but time is a complicated concept.
> >
> > I can tell you that end users *will* want a time-based indication of
> > how far behind we are. DBAs will understand "we're this many
> > transactions behind", but managers and end users won't. Unless it's
> > unreasonable to provide that info, we should do so.
>
> No doubt about that, the problem is that it's hard to provide a reliable
> time-based indication.

I think I have one now.

--
Simon Riggs www.2ndQuadrant.com
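The byte-based lag measure quoted above (master insert location minus standby replay location) can be computed from the textual "X/Y" values those two functions return. The following is a rough sketch, not something from the thread: `parse_xlog_location` and `replay_lag_bytes` are hypothetical helper names, and the code assumes the two hexadecimal parts form one linear 64-bit byte position. If that assumption does not hold for a given server version, the result is only approximate whenever the part before the slash differs between the two values.

```python
def parse_xlog_location(text: str) -> int:
    """Convert an 'X/Y' location string (both parts hex) to a byte position.

    Assumption: the high part counts units of 2**32 bytes, so the two
    halves combine into a single linear 64-bit offset.
    """
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(master_insert: str, standby_replay: str) -> int:
    """Bytes of WAL the standby still has to replay (hypothetical helper)."""
    return parse_xlog_location(master_insert) - parse_xlog_location(standby_replay)


if __name__ == "__main__":
    # Values as they might be returned by pg_current_xlog_insert_location()
    # on the master and pg_last_xlog_replay_location() on the standby.
    print(replay_lag_bytes("0/3000000", "0/2000000"))
```

This gives a distance in bytes only; it says nothing about how long replay of those bytes will take, which is the time-based question raised in the reply.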
From: Heikki Linnakangas on 18 May 2010 17:25

On 18/05/10 17:17, Simon Riggs wrote:
> There's no reason that the buffer size we use for XLogRead() should be
> the same as the send buffer, if you're worried about that. My point is
> that pq_putmessage contains internal flushes so at the libpq level you
> gain nothing by big batches. The read() will be buffered anyway with
> readahead so not sure what the issue is. We'll have to do this for sync
> rep anyway, so what's the big deal? Just do it now, once. Do we really
> want 9.1 code to differ here?

Do what? What exactly is it that you want instead, then?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From: Simon Riggs on 18 May 2010 17:37

On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote:
> On 18/05/10 17:17, Simon Riggs wrote:
> > There's no reason that the buffer size we use for XLogRead() should be
> > the same as the send buffer, if you're worried about that. My point is
> > that pq_putmessage contains internal flushes so at the libpq level you
> > gain nothing by big batches. The read() will be buffered anyway with
> > readahead so not sure what the issue is. We'll have to do this for sync
> > rep anyway, so what's the big deal? Just do it now, once. Do we really
> > want 9.1 code to differ here?
>
> Do what? What exactly is it that you want instead, then?

Read and write smaller messages, so the latency is minimised. Libpq will
send in 8192 byte packets, so writing anything larger gains nothing when
the WAL data is also chunked at exactly the same size.

--
Simon Riggs www.2ndQuadrant.com
From: Heikki Linnakangas on 26 May 2010 18:34

On 19/05/10 00:37, Simon Riggs wrote:
> On Tue, 2010-05-18 at 17:25 -0400, Heikki Linnakangas wrote:
>> On 18/05/10 17:17, Simon Riggs wrote:
>>> There's no reason that the buffer size we use for XLogRead() should be
>>> the same as the send buffer, if you're worried about that. My point is
>>> that pq_putmessage contains internal flushes so at the libpq level you
>>> gain nothing by big batches. The read() will be buffered anyway with
>>> readahead so not sure what the issue is. We'll have to do this for sync
>>> rep anyway, so what's the big deal? Just do it now, once. Do we really
>>> want 9.1 code to differ here?
>>
>> Do what? What exactly is it that you want instead, then?
>
> Read and write smaller messages, so the latency is minimised. Libpq will
> send in 8192 byte packets, so writing anything larger gains nothing when
> the WAL data is also chunked at exactly the same size.

Committed with chunk size of 128 kB. I hope that's a reasonable
compromise, in the absence of any performance test data either way.

I'm wary of setting it as low as 8k, as there is some per-message
overhead. Some of that could be avoided by rearranging the loops so that
the ps display is not updated at every message as you suggested, but I
don't feel like doing any extra rearrangements at this point. It would
not be hard, but it also certainly wouldn't make the code simpler.

I believe in practice 128kB is just as good as 8k from the
responsiveness point of view. If a standby is not responding, walsender
will be stuck trying to send no matter what the block size is. If it is
responding, no matter how slowly, 128kB should get transferred pretty
quickly.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
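To put the chunk sizes floated in this thread side by side, the arithmetic below counts how many messages it takes to ship one WAL segment at each size. It is a sketch under one assumption: a 16 MB segment, which is what the thread's "XLOG_SEG_SIZE / 2 == 8MB" implies. `messages_per_segment` is a hypothetical helper, and the count is only a proxy for the per-message overhead mentioned above (ps display updates and similar), not a measurement.

```python
XLOG_SEG_SIZE = 16 * 1024 * 1024  # assumed default WAL segment size

# Chunk sizes mentioned in the thread, in bytes.
CANDIDATES = {
    "8 MB (original patch)": 8 * 1024 * 1024,
    "512 kB (first revision)": 512 * 1024,
    "128 kB (as committed)": 128 * 1024,
    "8 kB (libpq packet size)": 8192,
}


def messages_per_segment(chunk_size: int, seg_size: int = XLOG_SEG_SIZE) -> int:
    """Number of messages needed to send one segment, rounding up."""
    return -(-seg_size // chunk_size)  # ceiling division


if __name__ == "__main__":
    for label, size in CANDIDATES.items():
        print(f"{label:26s} {messages_per_segment(size):5d} messages per segment")
    # 8 MB -> 2, 512 kB -> 32, 128 kB -> 128, 8 kB -> 2048
```

Going from 128 kB to 8 kB multiplies the number of messages per segment by 16, which is the per-message cost the committed value tries to limit.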