From: Heikki Linnakangas
On 12/06/10 17:18, Pavel Baros wrote:
> I am curious how I could solve the following problem:
>
> During a refresh I would like to know whether the MV is stale or
> fresh, and I had an idea:
>
> In fact, the MV needs to know whether its last refresh (transaction
> id) is older than any INSERT, UPDATE, or DELETE transaction run
> against its source tables. So if the MV had information about the
> last (highest) xmin in the source tables, it could simply compare its
> own xmin to the xmins (xmax for deleted rows) from the source tables
> and decide whether it is stale or fresh.
>
> The whole implementation could look like this:
> 1. Add a new column to pg_class (or somewhere in pg_stat* ?):
> pg_class.rellastxid (of type xid)
>
> 2. After each INSERT, UPDATE, or DELETE statement (transaction),
> pg_class.rellastxid would be updated. That should not be very time-
> or memory-consuming, since pg_class is cached, I guess.
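
For concreteness, the test being described boils down to an xid
comparison along these lines. This is only a hypothetical sketch:
rellastxid and the lookups that would feed these values don't exist;
only TransactionIdPrecedes() is real.

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch: is the MV stale?  mv_refresh_xid would be the xid of the
 * MV's last refresh; source_last_xids[] would hold the newest
 * xmin/xmax seen in each source table (both inputs hypothetical).
 */
static bool
matview_is_stale(TransactionId mv_refresh_xid,
                 const TransactionId *source_last_xids, int nsources)
{
    int         i;

    for (i = 0; i < nsources; i++)
    {
        /* A source written after the last refresh makes the MV stale. */
        if (TransactionIdPrecedes(mv_refresh_xid, source_last_xids[i]))
            return true;
    }
    return false;
}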

rellastxid would have to be updated at every insert/update/delete. It
would become a big bottleneck. That's not going to work.

Why do you need to know if an MV is stale?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith
Pavel Baros wrote:
> After each INSERT, UPDATE, or DELETE statement (transaction),
> pg_class.rellastxid would be updated. That should not be very time-
> or memory-consuming, since pg_class is cached, I guess.

An update in PostgreSQL is essentially an INSERT followed by a later
DELETE, with VACUUM eventually reclaiming the dead row once it is no
longer visible. The problem with this approach is that the heavy
update traffic would leave behind so many dead rows in pg_class that
the whole database could grind to a halt, as every operation would
have to sort through all that garbage. It could potentially double the
total write volume on the system, and it would completely kill people
who don't have autovacuum running during some periods of the day.

The basic idea of saving the last update time for each relation is not
unreasonable, but you can't store the results by updating pg_class. My
first thought would be to send this information as a message to the
statistics collector. The collector is already sent updates, at
exactly the point you're interested in, for the counters of how many
INSERT/UPDATE/DELETE operations execute against each table. You might
bundle your last-update information into that existing message with
minimal overhead.
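
As a rough illustration of what bundling into the existing message
might look like (a sketch only: the struct below is invented, and the
real message layout in src/include/pgstat.h is more involved):

#include "postgres.h"
#include "pgstat.h"
#include "access/xact.h"

/*
 * Invented per-table payload carrying one extra field.  Only Oid,
 * PgStat_Counter, and TransactionId are real types here; the struct
 * itself is not part of PostgreSQL.
 */
typedef struct PgStat_TableEntrySketch
{
    Oid             t_id;               /* table OID */
    PgStat_Counter  t_tuples_inserted;
    PgStat_Counter  t_tuples_updated;
    PgStat_Counter  t_tuples_deleted;
    TransactionId   t_last_write_xid;   /* newest writer: the new field */
} PgStat_TableEntrySketch;

/*
 * At DML-counting time (pgstat_count_heap_insert() and friends would
 * be the natural spot), the backend would also note its own top-level
 * xid, e.g.:
 *
 *     entry->t_last_write_xid = GetTopTransactionId();
 *
 * and the value would ride along when the tabstat message is sent.
 */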

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us



From: Magnus Hagander
2010/6/14 Greg Smith <greg(a)2ndquadrant.com>:
> Pavel Baros wrote:
>>
>> After each INSERT, UPDATE, or DELETE statement (transaction),
>> pg_class.rellastxid would be updated. That should not be very time-
>> or memory-consuming, since pg_class is cached, I guess.
>
> An update in PostgreSQL is essentially an INSERT followed by a later
> DELETE, with VACUUM eventually reclaiming the dead row once it is no
> longer visible. The problem with this approach is that the heavy
> update traffic would leave behind so many dead rows in pg_class that
> the whole database could grind to a halt, as every operation would
> have to sort through all that garbage. It could potentially double
> the total write volume on the system, and it would completely kill
> people who don't have autovacuum running during some periods of the
> day.
>
> The basic idea of saving the last update time for each relation is
> not unreasonable, but you can't store the results by updating
> pg_class. My first thought would be to send this information as a
> message to the statistics collector. The collector is already sent
> updates, at exactly the point you're interested in, for the counters
> of how many INSERT/UPDATE/DELETE operations execute against each
> table. You might bundle your last-update information into that
> existing message with minimal overhead.

Right. Do remember that the stats collector is designed to be lossy,
though, so you're not guaranteed that the information reaches the
other end. In practice it usually does, but there needs to be some
sort of recovery path for the case when it doesn't.
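
One conservative shape for that recovery path, sketched under the
assumption that the invented last_write_xid field made it through to
the collector's per-table entry (pgstat_fetch_stat_tabentry() and the
TransactionId macros are real; the field is not):

#include "postgres.h"
#include "pgstat.h"
#include "access/transam.h"

/*
 * Sketch: does this source table make the MV stale?  If the lossy
 * stats give no usable answer, err on the side of refreshing.
 */
static bool
source_makes_mv_stale(Oid source_relid, TransactionId mv_refresh_xid)
{
    PgStat_StatTabEntry *tabentry = pgstat_fetch_stat_tabentry(source_relid);

    /* No entry, or no recorded writer: assume stale. */
    if (tabentry == NULL || !TransactionIdIsValid(tabentry->last_write_xid))
        return true;

    return TransactionIdPrecedes(mv_refresh_xid, tabentry->last_write_xid);
}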

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/


From: Robert Haas
On Mon, Jun 14, 2010 at 5:00 AM, Magnus Hagander <magnus(a)hagander.net> wrote:
> 2010/6/14 Greg Smith <greg(a)2ndquadrant.com>:
>> Pavel Baros wrote:
>>>
>>> After each INSERT, UPDATE, or DELETE statement (transaction),
>>> pg_class.rellastxid would be updated. That should not be very time-
>>> or memory-consuming, since pg_class is cached, I guess.
>>
>> An update in PostgreSQL is essentially an INSERT followed by a later
>> DELETE, with VACUUM eventually reclaiming the dead row once it is no
>> longer visible. The problem with this approach is that the heavy
>> update traffic would leave behind so many dead rows in pg_class that
>> the whole database could grind to a halt, as every operation would
>> have to sort through all that garbage. It could potentially double
>> the total write volume on the system, and it would completely kill
>> people who don't have autovacuum running during some periods of the
>> day.
>>
>> The basic idea of saving the last update time for each relation is
>> not unreasonable, but you can't store the results by updating
>> pg_class. My first thought would be to send this information as a
>> message to the statistics collector. The collector is already sent
>> updates, at exactly the point you're interested in, for the counters
>> of how many INSERT/UPDATE/DELETE operations execute against each
>> table. You might bundle your last-update information into that
>> existing message with minimal overhead.
>
> Right. Do remember that the stats collector is designed to be lossy,
> though, so you're not guaranteed that the information reaches the
> other end. In practice it usually does, but there needs to be some
> sort of recovery path for the case when it doesn't.

What Pavel's trying to do here is be smart about figuring out when an
MV needs to be refreshed. I'm pretty sure this is the wrong way to go
about it, but it seems entirely premature considering that we don't
have a working implementation of a *manually* refreshed MV.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: "Kevin Grittner" on
Robert Haas <robertmhaas(a)gmail.com> wrote:

> What Pavel's trying to do here is be smart about figuring out when
> an MV needs to be refreshed. I'm pretty sure this is the wrong
> way to go about it, but it seems entirely premature considering
> that we don't have a working implementation of a *manually*
> refreshed MV.

Agreed all around.

At the risk of sounding obsessed, this is an area where predicate
locks might be usefully extended, if and when the serializable patch
makes it in.

-Kevin

