From: Jan Wieck on
In some systems (data warehousing, replication), the order of commits is
important, since that is the order in which changes have become visible.
This information could theoretically be extracted from the WAL, but
scanning the entire WAL just to extract this tidbit of information would
be excruciatingly painful.

The following is based on ideas that emerged during last weeks PGCon.
Consider it an implementation proposal, if you like.

We introduce a new set of files. The files represent segments of an
infinite array of structures. The present segments are the available
"window" of data. Similar to CLOG files, the individual file name will
represent the high bits of a "serial" number, the offset of the record
inside the file represents the low bits of the "serial".

The system will have postgresql.conf options for enabling/disabling the
whole shebang, how many shared buffers to allocate for managing access
to the data and to define the retention period of the data based on data
volume and/or age of the commit records.

Each record of the Transaction Commit Info consists of

txid xci_transaction_id
timestamptz xci_begin_timestamp
timestamptz xci_commit_timestamp
int64 xci_total_rowcount

32 bytes total.

CommitTransaction() inside of xact.c will call a function, that inserts
a new record into this array. The operation will for most of the time be
nothing than taking a spinlock and adding the record to shared memory.
All the data for the record is readily available, does not require
further locking and can be collected locally before taking the spinlock.
The begin_timestamp is the transactions idea of CURRENT_TIMESTAMP, the
commit_timestamp is what CommitTransaction() just decided to write into
the WAL commit record and the total_rowcount is the sum of inserted,
updated and deleted heap tuples during the transaction, which should be
easily available from the statistics collector, unless row stats are
disabled, in which case the datum would be zero.

The function will return the "sequence" number which CommitTransaction()
in turn will record in the WAL commit record together with the
begin_timestamp. While both, the begin as well as the commit timestamp
are crucial to determine what data a particular transaction should have
seen, the row count is not and will not be recorded in WAL.

Checkpoint handling will call a function to flush the shared buffers.
Together with this, the information from WAL records will be sufficient
to recover this data (except for row counts) during crash recovery.

Exposing the data will be done via a set returning function. The SRF
takes two arguments. The maximum number of rows to return and the last
serial number processed by the reader. The advantage of such SRF is that
the result can be used in a query that right away delivers audit or
replication log information in transaction commit order. The SRF can
return an empty set if no further transactions have committed since, or
an error if data segments needed to answer the request have already been
purged.

Purging of the data will be possible in several different ways.
Autovacuum will call a function that drops segments of the data that are
outside the postgresql.conf configuration with respect to maximum age
or data volume. There will also be a function reserved for superusers to
explicitly purge the data up to a certain serial number.


Comments, suggestions?


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers