[HACKERS] Proposal for 9.1: WAL streaming from WAL buffers [PgSql]


From: Fujii Masao on 11 Jun 2010 09:14

Hi,

In 9.0, walsender always reads WAL from the disk and sends it to the standby.
That is, we cannot send WAL until it has been written (and flushed) to the disk.
This significantly degrades the performance of synchronous replication, since a
transaction commit must wait for the WAL write time *plus* the replication time.

The attached patch enables walsender to read data from WAL buffers in addition
to the disk. Since we can then write and send WAL simultaneously, in synchronous
replication a transaction commit no longer has to wait for the two one after the
other; the write and the send overlap. So performance should increase
significantly.
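
To make the arithmetic concrete, here is a rough sketch of the commit wait in
both cases. It assumes a commit still has to wait for both the local write and
the replication to finish, so the overlapped wait is the larger of the two.
The function names and the millisecond figures are made up for illustration
and are not part of the patch.

```c
/* Hypothetical model of commit wait time; not code from the patch. */

/* 9.0 behaviour: WAL is sent only after it has been written and flushed. */
static double commit_wait_serial(double write_ms, double repl_ms)
{
    return write_ms + repl_ms;
}

/*
 * Proposed behaviour: write and send run at the same time. Assumption: the
 * commit still needs both to complete, so it waits for the slower one.
 */
static double commit_wait_overlapped(double write_ms, double repl_ms)
{
    return (write_ms > repl_ms) ? write_ms : repl_ms;
}
```

For example, with a 2 ms write and a 3 ms replication round trip, this model
gives 5 ms for the serial case and 3 ms for the overlapped case.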

Now three hackers (Zoltan, Simon and I) are planning to develop a synchronous
replication feature. I'm not sure whose patch will be committed in the end. But
since the attached patch provides just an infrastructure to optimize SR, it
should work fine together with any of them and have a good effect.

I'll add the patch to the next CF. AFAIK the ReviewFest will start on Jun 15.
During that time, if you are interested in the patch, please feel free to review
it. Also, you can get the code changes from my git repository:

git://git.postgresql.org/git/users/fujii/postgres.git
branch: read-wal-buffers

From here on I describe the change in detail. At first, walsender reads WAL
from the disk. If it has reached the current write location (i.e., there is no
unsent WAL on the disk), then it attempts to read from WAL buffers. This buffer
reading continues until the WAL to send has been purged from WAL buffers. IOW,
if WAL buffers are large enough and walsender keeps up with the insertion of
WAL, it can read WAL from the buffers forever.

Then, if the WAL to send has been purged from the buffers, walsender backs off
and tries to read it from the disk. If it finds no WAL to send on the disk,
walsender attempts to read WAL from the buffers again. Walsender repeats these
operations.
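
The switching between disk and buffers described above can be sketched as a
small state transition. This is only an illustration of the rule, not the
patch itself: WalPos, WalState, WalSource and next_source() are hypothetical
names, and a WAL location is simplified to a single 64-bit integer.

```c
#include <stdint.h>

typedef uint64_t WalPos;        /* simplified stand-in for a WAL location */

typedef enum { READ_FROM_DISK, READ_FROM_BUFFERS } WalSource;

typedef struct {
    WalPos oldest_in_buffers;   /* oldest location still held in WAL buffers */
    WalPos write_pos;           /* WAL before this has been written to disk */
} WalState;

/*
 * Decide where walsender should read from next, given the location it has
 * sent up to and the source it is currently using.
 */
static WalSource next_source(const WalState *st, WalPos sent, WalSource cur)
{
    if (cur == READ_FROM_DISK)
    {
        /* No unsent WAL left on disk: try the buffers. */
        return (sent >= st->write_pos) ? READ_FROM_BUFFERS : READ_FROM_DISK;
    }

    /* Reading buffers: back off to disk once our WAL has been purged. */
    return (sent < st->oldest_in_buffers) ? READ_FROM_DISK : READ_FROM_BUFFERS;
}
```

As long as the sent location stays at or ahead of oldest_in_buffers, the
function keeps returning READ_FROM_BUFFERS, which corresponds to the "read
from the buffers forever" case.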

The location of the oldest record in the buffers is saved in shared memory.
This location is used to calculate whether a particular piece of WAL is in the
buffers or not.
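
As a rough sketch of that calculation: a WAL range is in the buffers if it
starts at or after the oldest buffered location and ends at or before the
current insert position. The struct and function names below are
hypothetical, and the insert_pos field is an assumption added to bound the
range from above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t WalPos;        /* simplified stand-in for a WAL location */

typedef struct {
    WalPos oldest_in_buffers;   /* the location saved in shared memory */
    WalPos insert_pos;          /* assumed: end of WAL inserted so far */
} WalBufferBounds;

/* True if the non-empty range [start, end) lies entirely in WAL buffers. */
static bool wal_range_in_buffers(const WalBufferBounds *b,
                                 WalPos start, WalPos end)
{
    return start < end &&
           start >= b->oldest_in_buffers &&
           end <= b->insert_pos;
}
```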

To avoid lock contention, walsender reads WAL buffers and XLogCtl->xlblocks
without holding either WALInsertLock or WALWriteLock. Of course, they might be
changed by buffer replacement while being read. So after reading them, we check
that what we read is valid by comparing the location of the read WAL with the
location of the oldest record in the buffers. This logic is similar to what
XLogRead() does at the end.
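
The "read without locks, then validate" idea can be illustrated with a toy
ring buffer. This is a single-threaded sketch with made-up names (WalRing,
ring_insert_page and so on) and tiny page sizes; it is not the patch code. A
real implementation would also need memory barriers between the copy and the
check, which are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPAGES    4             /* toy values, far smaller than real WAL pages */
#define PAGE_SIZE 8

typedef uint64_t WalPos;        /* simplified stand-in for a WAL location */

typedef struct {
    char   pages[NPAGES][PAGE_SIZE];
    WalPos next;                /* start location of the next page to insert */
    WalPos oldest;              /* start location of the oldest page kept */
} WalRing;

/* Insert one page, recycling the oldest slot when the ring is full. */
static void ring_insert_page(WalRing *r, const char *data)
{
    size_t slot = (size_t) ((r->next / PAGE_SIZE) % NPAGES);

    /* Advance 'oldest' BEFORE overwriting, so a reader's check can fail. */
    if (r->next - r->oldest == (WalPos) NPAGES * PAGE_SIZE)
        r->oldest += PAGE_SIZE;

    memcpy(r->pages[slot], data, PAGE_SIZE);
    r->next += PAGE_SIZE;
}

/* Step 1: copy the page at 'pos' without taking any lock. */
static void ring_copy_unlocked(const WalRing *r, WalPos pos, char *out)
{
    size_t slot = (size_t) ((pos / PAGE_SIZE) % NPAGES);

    memcpy(out, r->pages[slot], PAGE_SIZE);
}

/* Step 2: the copy is usable only if 'pos' has not been purged meanwhile. */
static bool ring_still_valid(const WalRing *r, WalPos pos)
{
    return pos >= r->oldest;
}
```

If ring_still_valid() returns false after a copy, the copied bytes may belong
to a newer page and must be discarded; the reader then falls back to reading
that WAL from the disk.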

This feature is required to prevent the performance of synchronous replication
from dropping significantly. It can also cut the time that a transaction
committed on the master takes to become visible on the standby, so it's useful
for asynchronous replication as well.

Thoughts? Comments? Objections?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
