From: Heikki Linnakangas on
Pavan Deolasee wrote:
> Another source of I/O is perhaps the CLOG read/writes for checking
> transaction status. If we are talking about large tables like accounts in
> pgbench or customer/stock in DBT2, the tables are vacuumed much later than
> the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
> that CLOG pages holding the status of many of the transactions might have
> been already flushed out of the cache and require an I/O. Since the default
> CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
> during VACUUM as the transaction ids will be all random in a heap page.

8 clog pages hold 8*8192*4 = 262144 transactions. If the active set of
transactions is larger than that, the OS cache will probably hold more
clog pages. I guess you could end up doing some I/O on clog during a
vacuum of a big table, if you have a high transaction rate and vacuum
infrequently...
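
(For reference, the figure comes from clog packing the status of each
transaction into 2 bits. A quick sanity check of the arithmetic, using the
constant names as I remember them from clog.c and clog.h, so double-check
against the source:)

/* Sanity check of the clog-cache arithmetic; the constant names mirror
 * what I recall from clog.c/clog.h and may be slightly off. */
#include <stdio.h>

#define BLCKSZ              8192                            /* default page size */
#define CLOG_BITS_PER_XACT  2                               /* 2 status bits per xact */
#define CLOG_XACTS_PER_BYTE (8 / CLOG_BITS_PER_XACT)        /* = 4 */
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)  /* = 32768 */
#define NUM_CLOG_BUFFERS    8                               /* clog SLRU buffers */

int
main(void)
{
    /* transactions whose status fits in the in-memory clog cache */
    printf("%d\n", NUM_CLOG_BUFFERS * CLOG_XACTS_PER_PAGE); /* prints 262144 */
    return 0;
}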

> Would it help to set the status of the XMIN/XMAX of tuples early enough
> such that the heap page is still in the buffer cache, but late enough
> such that the XMIN/XMAX transactions are finished ? How about doing it
> when the bgwriter is about to write the page to disk ? Assuming few
> seconds of life of a heap page in the buffer cache, hopefully most of
> the XMIN/XMAX transactions should have completed and bgwriter can set
> XMIN(XMAX)_COMMITTED or XMIN(XMAX)_INVALID for most of the tuples in
> the page. This would save us CLOG I/Os later, either during subsequent
> access to the tuple and/or vacuum.

Yeah, we could do that. First I'd like to see some more evidence that
clog thrashing is a problem, though.
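
(If we did go that way, I imagine the bgwriter-side change would look
roughly like the sketch below: walk the page just before it is written out
and set whatever hint bits can be determined cheaply. This is completely
untested, the function name is made up, and it glosses over buffer locking
and re-dirtying the buffer.)

/*
 * Hypothetical, untested sketch: set hint bits for tuples on a heap page
 * whose xmin is already known committed or aborted, so that later readers
 * (and VACUUM) don't have to consult clog for them.
 */
#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"
#include "storage/bufpage.h"
#include "storage/procarray.h"

static void
set_hint_bits_before_write(Page page)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber offnum;

    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId          itemid = PageGetItemId(page, offnum);
        HeapTupleHeader htup;
        TransactionId   xmin;

        if (!ItemIdIsUsed(itemid))
            continue;

        htup = (HeapTupleHeader) PageGetItem(page, itemid);
        xmin = HeapTupleHeaderGetXmin(htup);

        /* skip tuples whose xmin status is already hinted */
        if (htup->t_infomask & (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID))
            continue;

        /* a still-running transaction can't be hinted yet */
        if (TransactionIdIsInProgress(xmin))
            continue;

        if (TransactionIdDidCommit(xmin))   /* this is the clog lookup */
            htup->t_infomask |= HEAP_XMIN_COMMITTED;
        else
            htup->t_infomask |= HEAP_XMIN_INVALID;

        /* ... and similarly for xmax ... */
    }
}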

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Tom Lane on
"Pavan Deolasee" <pavan.deolasee(a)gmail.com> writes:
> Would it help to set the status of the XMIN/XMAX of tuples early enough such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ? How about doing it when the
> bgwriter is about to write the page to disk ?

No. The bgwriter would then become subject to deadlocks because it
would be needing to read in clog pages before it could flush out
dirty pages. In any case, if the table is in active use then some
passing backend has probably updated the bits already ...

regards, tom lane

From: "Pavan Deolasee" on
On 1/23/07, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>
> "Pavan Deolasee" <pavan.deolasee(a)gmail.com> writes:
> > Would it help to set the status of the XMIN/XMAX of tuples early enough
> > such that the heap page is still in the buffer cache, but late enough
> > such that the XMIN/XMAX transactions are finished ? How about doing it
> > when the bgwriter is about to write the page to disk ?
>
> No. The bgwriter would then become subject to deadlocks because it
> would be needing to read in clog pages before it could flush out
> dirty pages. In any case, if the table is in active use then some
> passing backend has probably updated the bits already ...


Well, let me collect some evidence. If we figure out that there is indeed
CLOG buffer thrashing at VACUUM time, I am sure we would be able to solve
the problem one way or the other.

IMHO this case is most relevant for very large tables where the UPDATEd
rows are not accessed again for a long time, and hence the hint bits might
not have been set by the time VACUUM comes along.
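
To spell out where the extra clog reads would come from: when the hint bit
was never set, the tuple visibility check has to fall back to a real clog
lookup, roughly along this call chain (from memory, so the exact function
names may be slightly off):

    TransactionIdDidCommit(xmin)          /* access/transam/transam.c */
      -> TransactionLogFetch(xmin)
        -> TransactionIdGetStatus(xmin)   /* access/transam/clog.c */
          -> SimpleLruReadPage(...)       /* may evict a buffer and read pg_clog */

With only 8 SLRU buffers and essentially random XIDs across the heap pages,
that last step can easily turn into real I/O.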

Thanks,
Pavan

--
EnterpriseDB http://www.enterprisedb.com

From: Tom Lane on
"Pavan Deolasee" <pavan.deolasee(a)gmail.com> writes:
> On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> clocking more than 1500 tps.

Only if you had fsync off, or equivalently a disk drive that lies about
write-complete. You could possibly achieve such rates in a non-broken
configuration with a battery-backed write cache, but that's not "typical
desktop" kit.

In any case, you ignored Heikki's point that the PG shared memory pages
holding CLOG are unlikely to be the sole level of caching, if the update
rate is that high. The kernel will have some pages too. And even if we
thought not, wouldn't bumping the size of the clog cache be a far
simpler solution offering benefit for more things than just this?

regards, tom lane

From: "Pavan Deolasee" on
On 1/24/07, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>
> "Pavan Deolasee" <pavan.deolasee(a)gmail.com> writes:
> > On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> > clocking more than 1500 tps.
>
> Only if you had fsync off, or equivalently a disk drive that lies about
> write-complete. You could possibly achieve such rates in a non-broken
> configuration with a battery-backed write cache, but that's not "typical
> desktop" kit.


Maybe I was too vague about the machine/test. It's probably not a
"typical desktop" machine since it has better storage: a two-disk
RAID 0 array for data and a dedicated disk for xlog. I remember
running with 50 clients, a scaling factor of 50, 1 GB of shared buffers,
autovacuum turned on with default parameters, and the rest of the
configuration at its defaults. I don't think I had explicitly turned
fsync off.


> In any case, you ignored Heikki's point that the PG shared memory pages
> holding CLOG are unlikely to be the sole level of caching, if the update
> rate is that high. The kernel will have some pages too. And even if we
> thought not, wouldn't bumping the size of the clog cache be a far
> simpler solution offering benefit for more things than just this?


Yes, maybe what Heikki said is true, but we don't know for sure.
Wouldn't bumping the cache size just delay the problem a bit,
especially with an even larger table and a very high-end machine/storage
setup that can clock very high transaction rates?

Anyway, if we agree that there is a problem, the solution could be
as simple as increasing the cache size, as you suggested.
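
(IIRC the clog cache size is currently a compile-time constant rather than
a GUC, so "increasing the cache size" would mean a change along these lines
plus a rebuild; please correct me if it has moved:

    /* src/include/access/clog.h */
    #define NUM_CLOG_BUFFERS    8       /* bump to, say, 16 or 32 */

Making it configurable at run time would of course be nicer if we decide to
go that route.)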

Thanks,
Pavan

--
EnterpriseDB http://www.enterprisedb.com