From: Robert Haas
On Tue, Apr 27, 2010 at 9:59 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> [storage.c,xact.c,twophase.c] smgrGetPendingDeletes returns via an out
> parameter (its second argument) a list of RelFileNodes pending delete,
> which we then write to WAL or to the two-phase state file.

It appears that we are playing a little bit fast and loose with this.
I think that the two-phase code path is solid because we prohibit
PREPARE TRANSACTION if the transaction has referenced any temporary
tables, so when we read the two-phase state file it's safe to assume
that all the tables mentioned are non-temporary. But the ordinary
one-phase commit writes permanent and temporary relfilenodes to WAL
without distinction, and then, in xact_redo_commit() and
xact_redo_abort(), does this:

XLogDropRelation(xlrec->xnodes[i], fork);
smgrdounlink(srel, fork, false, true);

The third argument to smgrdounlink() is "isTemp", which we're here
passing as false, but might really be true. I don't think it
technically matters at present because the only effect of that
parameter right now is that we pass it through to
DropRelFileNodeBuffers(), which will drop shared buffers rather than
local buffers as a result of the incorrect setting. But that won't
matter because the WAL replay process shouldn't have any local buffers
anyway, since temp relations are not otherwise WAL-logged. For the
same reason, I don't think the call to XLogDropRelation() is an issue
because its only purpose is to remove entries from invalid_page_tab,
and there won't be any temporary pages in there anyway.

Of course if we're going to do $SUBJECT this will need to be changed
anyway, but assuming the above analysis is correct I think the
existing coding at least deserves a comment... then again, maybe I'm
all mixed up?

....Robert

From: Jim Nasby
On May 6, 2010, at 10:24 PM, Robert Haas wrote:
> On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera
> <alvherre(a)commandprompt.com> wrote:
>>>>> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
>>>>> relations? I think the only backend that can have an smgr reference
>>>>> to a temprel other than the owning backend is bgwriter, and AFAICS
>>>>> bgwriter will only have such a reference if it's responding to a
>>>>> request by the owning backend to unlink the associated files, in which
>>>>> case (I think) the owning backend will have no reference.
>
> This turns out to be wrong, I think. It seems that what we do is
> prevent backends other than the owning backend from reading pages
> from non-local temp rels into private or shared buffers, but we don't
> actually prevent them from having smgr references. This allows
> autovacuum to drop them, for example, in an anti-wraparound situation.
> (Thanks to Tom for helping me get my head around this better.)
>
>>>> Hmm, wasn't there a proposal to have the owning backend delete the files
>>>> instead of asking the bgwriter to?
>>>
>>> I did propose that upthread; it may have been proposed previously
>>> also. This might be worth doing independently of the rest of the patch
>>> (which I'm starting to fear is doomed, cue ominous soundtrack) since
>>> it would reduce the chance of orphaning data files and possibly
>>> simplify the logic also.
>>
>> +1 for doing it separately, but hopefully that doesn't mean the rest of
>> this patch is doomed ...
>
> Doom has been averted. Proposed patch attached. It passes regression
> tests and seems to work, but could use additional testing and, of
> course, some code-reading also.
>
> Some notes on this patch:
>
> It seems pretty clear that it isn't desirable to simply add backend ID
> to RelFileNode, because there are too many places using RelFileNode
> already for purposes where the backend ID can be inferred from
> context, such as buffer headers and most of xlog. Instead, I
> introduced BackendRelFileNode, which consists of an ordinary
> RelFileNode augmented with a backend ID, and use that only where
> needed. In particular, the smgr layer must use BackendRelFileNode
> throughout, since it operates on both permanent and temporary
> relations. smgr invalidations must also include the backend ID. xlog
> generally happens only for non-temporary relations and can thus
> continue to use an ordinary RelFileNode; however, commit/abort records
> must use BackendRelFileNode as there may be physical storage
> associated with temporary relations that must be unlinked.
> Communication with the bgwriter must use BackendRelFileNode for
> similar reasons. The relcache now stores rd_backend rather than
> rd_islocaltemp so that it remains straightforward to call smgropen()
> based on a relcache entry. Some smgr functions no longer require an
> isTemp argument, because they can infer the necessary information from
> their BackendRelFileNode. smgrwrite() and smgrextend() now take a
> skipFsync argument rather than an isTemp argument.
>
> I'm not totally sure whether it makes sense to do what we were talking
> about above, viz, transfer unlink responsibility for temp rels from
> the bgwriter to the owning backend. I haven't done that here. Nor
> have I implemented any kind of improved temporary file cleanup
> strategy, though I hope such a thing is possible.
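
(For reference, my reading of the above is that the new struct is just
a RelFileNode with a backend ID bolted on, something like the sketch
below. The field names here are my guess, not necessarily the patch's.)

typedef struct BackendRelFileNode
{
    RelFileNode node;       /* spcNode / dbNode / relNode, as before */
    BackendId   backend;    /* owning backend, or InvalidBackendId
                             * for permanent rels */
} BackendRelFileNode;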

Any particular reason not to use directories to help organize things? IE:

base/database_oid/pg_temp_rels/backend_pid/relfilenode

Perhaps relfilenode should be something else.

This seems to have several advantages:

1: It's more organized. If you want to see all the files for a single backend, you have just one place to look. Finding everything is still easy via a filesystem find.
2: Cleanup becomes easier. When a backend exits, its entire directory goes away. On server start, everything under pg_temp_rels goes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run its own cleanup, though I think that anytime that happens we're going to restart everything anyway.
3: It separates all the temporary stuff from the real files.

The only downside I see is some extra code to create the backend_pid directory.
--
Jim C. Nasby, Database Architect jim(a)nasby.net
512.569.9461 (cell) http://jim.nasby.net

From: Robert Haas
On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel(a)decibel.org> wrote:
>> It seems pretty clear that it isn't desirable to simply add backend ID
>> to RelFileNode, because there are too many places using RelFileNode
>> already for purposes where the backend ID can be inferred from
>> context, such as buffer headers and most of xlog.  Instead, I
>> introduced BackendRelFileNode, which consists of an ordinary
>> RelFileNode augmented with a backend ID, and use that only where
>> needed.  In particular, the smgr layer must use BackendRelFileNode
>> throughout, since it operates on both permanent and temporary
>> relations. smgr invalidations must also include the backend ID.  xlog
>> generally happens only for non-temporary relations and can thus
>> continue to use an ordinary RelFileNode; however, commit/abort records
>> must use BackendRelFileNode as there may be physical storage
>> associated with temporary relations that must be unlinked.
>> Communication with the bgwriter must use BackendRelFileNode for
>> similar reasons. The relcache now stores rd_backend rather than
>> rd_islocaltemp so that it remains straightforward to call smgropen()
>> based on a relcache entry. Some smgr functions no longer require an
>> isTemp argument, because they can infer the necessary information from
>> their BackendRelFileNode.  smgrwrite() and smgrextend() now take a
>> skipFsync argument rather than an isTemp argument.
>>
>> I'm not totally sure whether it makes sense to do what we were talking
>> about above, viz, transfer unlink responsibility for temp rels from
>> the bgwriter to the owning backend.  I haven't done that here.  Nor
>> have I implemented any kind of improved temporary file cleanup
>> strategy, though I hope such a thing is possible.
>
> Any particular reason not to use directories to help organize things? IE:
>
> base/database_oid/pg_temp_rels/backend_pid/relfilenode
>
> Perhaps relfilenode should be something else.
>
> This seems to have several advantages:
>
> 1: It's more organized. If you want to see all the files for a single backend, you have just one place to look. Finding everything is still easy via a filesystem find.
> 2: Cleanup becomes easier. When a backend exits, its entire directory goes away. On server start, everything under pg_temp_rels goes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run its own cleanup, though I think that anytime that happens we're going to restart everything anyway.
> 3: It separates all the temporary stuff from the real files.
>
> The only downside I see is some extra code to create the backend_pid directory.

I like the idea of using directories to organize things better and I
completely agree with points #1 and #3. Point #2 is a little more
complicated, I think, and something I've been struggling with. We
need to make sure that we clean up not only the temporary files but
also the catalog entries that point to them, if any. The current code
only blows away temporary tables that are "old" in terms of
transaction IDs, is driven off the catalog entries, and essentially
does DROP TABLE <whatever>. So it can't clean up orphaned temporary
files that don't have any catalog entries associated with them (which
is what we want to fix), but on the other hand, whatever it does clean
up is cleaned up completely.

We might be able to do something like this:

1. Scan pg_temp_rels. For each subdirectory found (whose name looks
like a backend ID), if the corresponding backend is not currently
running, add the backend ID to a list of backend IDs needing cleanup.
2. For each backend ID derived in step 1:
2A. Scan the subdirectory and add all the files you find (whose names
are in the right format) to a list of files to be deleted.
2B. Check again whether the backend in question is running. If it is,
don't do anything further for this backend and go on to the next
backend ID (i.e. continue with step 2).
2C. For each file found in step 2A, look for a pg_class entry in the
temp tablespace for that backend ID with a matching relfilenode
number. If one is found, drop the rel. If not, unlink the file if it
still exists.
2D. Attempt to remove the directory, ignoring failures.
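
In pseudo-C, the outer scan might look something like the sketch
below. This is illustration only: the pg_temp_rels layout is from
Jim's proposal, and backend_is_running() and the per-file work are
stubbed out.

#include <dirent.h>
#include <stdlib.h>

/* Stub for illustration: is a backend with this ID running? */
static int
backend_is_running(long backend_id)
{
    (void) backend_id;
    return 0;
}

static void
cleanup_temp_rels(const char *tempdir)
{
    DIR        *dir = opendir(tempdir);
    struct dirent *de;

    if (dir == NULL)
        return;
    while ((de = readdir(dir)) != NULL)
    {
        char   *end;
        long    backend_id = strtol(de->d_name, &end, 10);

        /* step 1: skip names that aren't all digits (".", "..", etc.) */
        if (end == de->d_name || *end != '\0')
            continue;
        /* ... and skip backends that are still running */
        if (backend_is_running(backend_id))
            continue;
        /* step 2A: list the files in this subdirectory */
        /* step 2B: recheck backend_is_running() before deleting anything */
        /* step 2C: drop the rel if a pg_class entry matches, else unlink */
        /* step 2D: rmdir() the subdirectory, ignoring failures */
    }
    closedir(dir);
}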

I think step 2B is sufficient to prevent a race condition where we end
up mistaking a newly created file for an orphaned one. Am I right?

One possible problem with this is that it would need to be repeated
for every database/tablespace combination, but maybe that wouldn't be
too expensive. autovacuum already needs to process every database,
but I don't know how you'd decide how often to check for stray temp
files. Certainly you'd want to check after a restart... after that I
get fuzzy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From: Greg Stark
On Tue, May 4, 2010 at 5:17 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> FWIW, that's not the case, any more than it is for blocks in shared
> buffer cache for regular rels.  smgrextend() results in an observable
> extension of the file EOF immediately, whether or not you can see
> up-to-date data for those pages.
>
> Now people have often complained about the extra I/O involved in that,
> and it'd be nice to have a solution, but it's not clear to me that
> fixing it would be harder for temprels than regular rels.

On systems that have it, and filesystems that optimize it, I think
posix_fallocate() handles this case. We can extend a file without
actually doing any I/O, yet still get a guarantee from the filesystem
that the space is available and that writing to those blocks won't
fail due to lack of space. Allocating the blocks and marking them as
virtually filled with zeros triggers only metadata I/O, and even that
isn't synced unless there's an fsync, so there's no extra physical
I/O.

This should be the case for ext4, but I'm not sure which other
filesystems implement this.
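
For what it's worth, the call itself is trivial; here's a throwaway
sketch (file name and sizes are arbitrary). Note that
posix_fallocate() returns the error number directly instead of
setting errno:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("scratch.tmp", O_RDWR | O_CREAT, 0600);
    int     rc;

    if (fd < 0)
        return 1;
    /* reserve 16 8K blocks up front without writing any data */
    rc = posix_fallocate(fd, 0, 16 * 8192);
    if (rc != 0)
        fprintf(stderr, "posix_fallocate: error %d\n", rc);
    close(fd);
    return rc != 0;
}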


--
greg

From: Robert Haas
On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel(a)decibel.org> wrote:
> Any particular reason not to use directories to help organize things? IE:
>
> base/database_oid/pg_temp_rels/backend_pid/relfilenode
>
> Perhaps relfilenode should be something else.
>
> This seems to have several advantages:
>
> 1: It's more organized. If you want to see all the files for a single backend, you have just one place to look. Finding everything is still easy via a filesystem find.
> 2: Cleanup becomes easier. When a backend exits, its entire directory goes away. On server start, everything under pg_temp_rels goes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run its own cleanup, though I think that anytime that happens we're going to restart everything anyway.
> 3: It separates all the temporary stuff from the real files.
>
> The only downside I see is some extra code to create the backend_pid directory.

I thought this was a good idea when you first proposed it, but on
further review I've changed my mind. There are several places in the
code that rely on checking whether the database directory within any
given tablespace is empty to determine whether that database is using
that tablespace. While I could rewrite all of that logic to do the
right thing, I think it's unnecessary pain.

I talked with Tom Lane about this a little bit at PGcon and opined
that we probably only need to clean out stray temporary files at
startup. So what I'm tempted to do is just write a function that goes
through all tablespace/database combinations and scans each directory
for files with a name like t<digits>_<digits> and blows them away.
This will leave the catalog entries pointing at nothing, but we
already have working code in autovacuum.c to clean up the catalog
entries, and I believe that will work just fine even if the underlying
files have been removed earlier.
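
The name test itself ought to be trivial; something like this
(untested, and the real thing may also need to allow fork suffixes
like _fsm or _vm):

#include <ctype.h>

/* Does "name" match t<digits>_<digits>? */
static int
looks_like_temp_rel_name(const char *name)
{
    const char *p = name;

    if (*p++ != 't')
        return 0;
    if (!isdigit((unsigned char) *p))
        return 0;
    while (isdigit((unsigned char) *p))
        p++;
    if (*p++ != '_')
        return 0;
    if (!isdigit((unsigned char) *p))
        return 0;
    while (isdigit((unsigned char) *p))
        p++;
    return *p == '\0';
}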

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
