From: Heikki Linnakangas on
Tom Lane wrote:
> That seems too fragile to me, as I don't find it a stretch at all to
> think that writing the map file might fail --- just think Windows
> antivirus code :-(. Now, once we have written the WAL record for
> the mapfile change, we can't really afford a failure in my approach
> either. But I think a rename() after successfully creating/writing/
> fsync'ing a temp file is a whole lot safer than writing from a standing
> start.

My gut feeling is exactly opposite. Creating and renaming a file
involves operations (and permissions) on the directory, while
overwriting a small file is just a simple write(). Especially if you
open() the file before doing anything irreversible.

> The other problem with what you sketch is that it'd require holding the
> mapfile write lock across commit, because we still have to have strict
> serialization of updates.

Why is the strict serialization of updates needed? To avoid overwriting
the file with stale contents in a race condition?

I was thinking that we only store the modified part in the WAL record.
Right after writing commit record, take the lock, read() the map file,
modify it in memory, write() it back, and release lock.

That means that there's no full images of the file in WAL records, which
makes me slightly uncomfortable from a disaster recovery point-of-view,
but we could keep a backup copy of the file in the data directory or
something if that's too scary otherwise.

> Maybe we should forget the
> rename() trick and overwrite the map file in place. I still think it
> needs to be a separate WAL record though. I'm thinking
>
> * obtain lock
> * open file for read/write
> * read current contents
> * construct modified contents
> * write and sync WAL record
> * write back file through already-opened descriptor
> * fsync
> * release lock
>
> Not totally clear if this is more or less safe than the rename method;
> but given the assumption that the file is less than one disk block,
> it should be just as atomic as pg_control updates are.

That doesn't solve the problem I was trying to solve, which is that if
the map file is rewritten, but the transaction later aborts, the map
file update has hit the disk already. That's why I wanted to stash it
into the commit record.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
Heikki Linnakangas <heikki.linnakangas(a)enterprisedb.com> writes:
> Tom Lane wrote:
>> The other problem with what you sketch is that it'd require holding the
>> mapfile write lock across commit, because we still have to have strict
>> serialization of updates.

> Why is the strict serialization of updates needed? To avoid overwriting
> the file with stale contents in a race condition?

Exactly.

> I was thinking that we only store the modified part in the WAL record.
> Right after writing commit record, take the lock, read() the map file,
> modify it in memory, write() it back, and release lock.

> That means that there's no full images of the file in WAL records, which
> makes me slightly uncomfortable from a disaster recovery point-of-view,

Yeah, me too, which is another advantage of using a separate WAL entry.

> That doesn't solve the problem I was trying to solve, which is that if
> the map file is rewritten, but the transaction later aborts, the map
> file update has hit the disk already. That's why I wanted to stash it
> into the commit record.

The design I sketched doesn't require such an assumption anyway. Once
the map file is written, the relocation is effective, commit or no.
As long as we restrict relocations to maintenance operations such as
VACUUM FULL, which have no transactionally significant results, this
doesn't seem like a problem. What we do need is that after a CLUSTER
or V.F., which is going to relocate not only the rel but its indexes,
the relocations of the rel and its indexes have to all "commit"
atomically. But saving up the transaction's map changes and applying
them in one write takes care of that.

(What I believe this means is that pg_class and its indexes have to all
be mapped, but I'm thinking right now that no other non-shared catalogs
need the treatment.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on
On Mon, 2010-02-01 at 10:27 -0500, Tom Lane wrote:
> Simon Riggs <simon(a)2ndQuadrant.com> writes:
> > On Mon, 2010-02-01 at 10:06 -0500, Tom Lane wrote:
> >> the assumption that the file is less than one disk block,
> >> it should be just as atomic as pg_control updates are.
>
> > IIRC there were 173 relations affected by this. 4 bytes each we would
> > have more than 512 bytes.
>
> Where in the world did you get that number?
>
> There are currently 29 shared relations (counting indexes), and 13
> nailed local relations, which would go into a different map file.
> I'm not sure if the set of local catalogs requiring the map treatment
> is exactly the same as what's presently nailed, but that's probably
> a good approximation.

I was suggesting that we only do shared and nailed relations. Sounds
like you agree.

--
Simon Riggs www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
I wrote:
> The design I sketched doesn't require such an assumption anyway. Once
> the map file is written, the relocation is effective, commit or no.
> As long as we restrict relocations to maintenance operations such as
> VACUUM FULL, which have no transactionally significant results, this
> doesn't seem like a problem. What we do need is that after a CLUSTER
> or V.F., which is going to relocate not only the rel but its indexes,
> the relocations of the rel and its indexes have to all "commit"
> atomically. But saving up the transaction's map changes and applying
> them in one write takes care of that.

BTW, I noticed a couple of other issues that need to be dealt with to
make that safe. During CLUSTER/V.F. we typically try to update the
relation's relfilenode, relpages, reltuples, relfrozenxid (in
setNewRelfilenode) as well as its toastrelid (in swap_relation_files).
These are regular transactional updates to the pg_class tuple that will
fail to commit if the outer transaction rolls back. However:

* For a mapped relation, both the old and new relfilenode will be zero,
so it doesn't matter.

* Possibly losing the updates of relpages and reltuples is not critical.

* For relfrozenxid, we can simply force the new and old values to be the
same rather than hoping to advance the value, if we're dealing with a
mapped relation. Or just let it be; I think that losing an advance
of relfrozenxid wouldn't be critical either.

* We can not change the toast rel OID of a shared catalog -- there's no
way to propagate that into the other copies of pg_class. So we need to
rejigger the logic for heap rewriting a little bit. Toast rel swapping
has to be handled by swapping their relfilenodes not their OIDs. This
is no big deal as far as cluster.c itself is concerned, but the tricky
part is that when we write new toasted values into the new toast rel,
the TOAST pointers going into the new heap have to be written with the
original toast-table OID value not the one that the transient target
toast rel has got. This is doable but it would uglify the TOAST API a
bit I think. Another possibility is to treat the toast rel OID of a
catalog as something that can be supplied by the map file. Not sure
which way to jump.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
Simon Riggs <simon(a)2ndQuadrant.com> writes:
> On Wed, 2010-02-03 at 11:50 -0500, Tom Lane wrote:
>> I've concluded that that's too large a change to undertake for 9.0

> The purpose of this was to make the big changes in 9.0. If we aren't
> going to do that it seems like we shouldn't bother at all.

No, the purpose of this was to get rid of VACUUM FULL INPLACE in 9.0.
I'm not interested in destabilizing the code (even more) just to avoid
one small internal kluge. The proposed magic field in struct Relation
is the only part of this that I'd foresee reverting later.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers