Subject: [HACKERS] including PID or backend ID in relpath of temp rels
From: Robert Haas on 4 May 2010 23:53

On Tue, Apr 27, 2010 at 9:59 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> [storage.c,xact.c,twophase.c] smgrGetPendingDeletes returns via an out
> parameter (its second argument) a list of RelFileNodes pending delete,
> which we then write to WAL or to the two-phase state file.

It appears that we are playing a little bit fast and loose with this. I
think that the two-phase code path is solid, because we prohibit PREPARE
TRANSACTION if the transaction has referenced any temporary tables, so
when we read the two-phase state file it's safe to assume that all the
tables mentioned are non-temporary. But the ordinary one-phase commit
writes permanent and temporary relfilenodes to WAL without distinction,
and then, in xact_redo_commit() and xact_redo_abort(), does this:

    XLogDropRelation(xlrec->xnodes[i], fork);
    smgrdounlink(srel, fork, false, true);

The third argument to smgrdounlink() is "isTemp", which we're here
passing as false, but which might really be true. I don't think it
technically matters at present, because the only effect of that
parameter right now is that we pass it through to
DropRelFileNodeBuffers(), which will drop shared buffers rather than
local buffers as a result of the incorrect setting. But that won't
matter, because the WAL replay process shouldn't have any local buffers
anyway, since temp relations are not otherwise WAL-logged. For the same
reason, I don't think the call to XLogDropRelation() is an issue,
because its only purpose is to remove entries from invalid_page_tab, and
there won't be any temporary pages in there anyway.

Of course, if we're going to do $SUBJECT this will need to be changed
anyway, but assuming the above analysis is correct I think the existing
coding at least deserves a comment... then again, maybe I'm all mixed
up?

....Robert
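A sketch of the comment being suggested here, set against an
approximation of the replay loop in question; the surrounding loop
structure is paraphrased for illustration, not quoted verbatim from
xact.c:

    /* Approximate shape of the relation-drop loop during redo. */
    for (fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        XLogDropRelation(xlrec->xnodes[i], fork);

        /*
         * Pass isTemp = false even though this relfilenode might belong
         * to a temp rel.  That only affects DropRelFileNodeBuffers(),
         * which will scan shared rather than local buffers; since temp
         * rels are not otherwise WAL-logged, the replay process holds
         * no local buffers for them, so nothing can be missed.
         */
        smgrdounlink(srel, fork, false, true);
    }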
From: Jim Nasby on 17 May 2010 14:10

On May 6, 2010, at 10:24 PM, Robert Haas wrote:
> On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera
> <alvherre(a)commandprompt.com> wrote:
>>>>> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
>>>>> relations? I think the only backend that can have an smgr reference
>>>>> to a temprel other than the owning backend is bgwriter, and AFAICS
>>>>> bgwriter will only have such a reference if it's responding to a
>>>>> request by the owning backend to unlink the associated files, in which
>>>>> case (I think) the owning backend will have no reference.
>
> This turns out to be wrong, I think. It seems that what we do is
> prevent backends other than the owning backend from reading pages
> from non-local temp rels into private or shared buffers, but we don't
> actually prevent them from having smgr references. This allows
> autovacuum to drop them, for example, in an anti-wraparound situation.
> (Thanks to Tom for helping me get my head around this better.)
>
>>>> Hmm, wasn't there a proposal to have the owning backend delete the files
>>>> instead of asking the bgwriter to?
>>>
>>> I did propose that upthread; it may have been proposed previously
>>> also. This might be worth doing independently of the rest of the patch
>>> (which I'm starting to fear is doomed, cue ominous soundtrack), since
>>> it would reduce the chance of orphaning data files and possibly
>>> simplify the logic also.
>>
>> +1 for doing it separately, but hopefully that doesn't mean the rest of
>> this patch is doomed ...
>
> Doom has been averted. Proposed patch attached. It passes regression
> tests and seems to work, but could use additional testing and, of
> course, some code-reading also.
>
> Some notes on this patch:
>
> It seems pretty clear that it isn't desirable to simply add backend ID
> to RelFileNode, because there are too many places using RelFileNode
> already for purposes where the backend ID can be inferred from
> context, such as buffer headers and most of xlog. Instead, I
> introduced BackendRelFileNode, which consists of an ordinary
> RelFileNode augmented with a backend ID, and use that only where
> needed. In particular, the smgr layer must use BackendRelFileNode
> throughout, since it operates on both permanent and temporary
> relations. smgr invalidations must also include the backend ID. xlog
> generally happens only for non-temporary relations and can thus
> continue to use an ordinary RelFileNode; however, commit/abort records
> must use BackendRelFileNode, as there may be physical storage
> associated with temporary relations that must be unlinked.
> Communication with the bgwriter must use BackendRelFileNode for
> similar reasons. The relcache now stores rd_backend rather than
> rd_islocaltemp so that it remains straightforward to call smgropen()
> based on a relcache entry. Some smgr functions no longer require an
> isTemp argument, because they can infer the necessary information from
> their BackendRelFileNode. smgrwrite() and smgrextend() now take a
> skipFsync argument rather than an isTemp argument.
>
> I'm not totally sure whether it makes sense to do what we were talking
> about above, viz, transfer unlink responsibility for temp rels from
> the bgwriter to the owning backend. I haven't done that here. Nor
> have I implemented any kind of improved temporary file cleanup
> strategy, though I hope such a thing is possible.

Any particular reason not to use directories to help organize things? IE:

base/database_oid/pg_temp_rels/backend_pid/relfilenode

Perhaps relfilenode should be something else.

This seems to have several advantages:

1: It's more organized. If you want to see all the files for a single
backend, you have just one place to look. Finding everything is still
easy via filesystem find.
2: Cleanup becomes easier. When a backend exits, its entire directory
goes away. On server start, everything under pg_temp_rels goes away.
Unfortunately we still have a race condition with cleaning up if a
backend dies and can't run its own cleanup, though I think that anytime
that happens we're going to restart everything anyway.
3: It separates all the temporary stuff away from real files.

The only downside I see is some extra code to create the backend_pid
directory.
--
Jim C. Nasby, Database Architect   jim(a)nasby.net
512.569.9461 (cell)                http://jim.nasby.net
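A trivial sketch of how a path under the proposed layout might be
built; the pg_temp_rels directory and this helper are part of the
proposal above, not PostgreSQL's actual on-disk layout:

    #include <stdio.h>

    /* Build base/<database_oid>/pg_temp_rels/<backend_pid>/<relfilenode>. */
    static void
    temp_rel_path(char *buf, size_t len, unsigned int db_oid,
                  int backend_pid, unsigned int relfilenode)
    {
        snprintf(buf, len, "base/%u/pg_temp_rels/%d/%u",
                 db_oid, backend_pid, relfilenode);
    }

    int
    main(void)
    {
        char path[256];

        temp_rel_path(path, sizeof(path), 16384, 4242, 16385);
        printf("%s\n", path);   /* base/16384/pg_temp_rels/4242/16385 */
        return 0;
    }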
From: Robert Haas on 17 May 2010 21:01

On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel(a)decibel.org> wrote:
>> It seems pretty clear that it isn't desirable to simply add backend ID
>> to RelFileNode, because there are too many places using RelFileNode
>> already for purposes where the backend ID can be inferred from
>> context, such as buffer headers and most of xlog. Instead, I
>> introduced BackendRelFileNode, which consists of an ordinary
>> RelFileNode augmented with a backend ID, and use that only where
>> needed. In particular, the smgr layer must use BackendRelFileNode
>> throughout, since it operates on both permanent and temporary
>> relations. smgr invalidations must also include the backend ID. xlog
>> generally happens only for non-temporary relations and can thus
>> continue to use an ordinary RelFileNode; however, commit/abort records
>> must use BackendRelFileNode, as there may be physical storage
>> associated with temporary relations that must be unlinked.
>> Communication with the bgwriter must use BackendRelFileNode for
>> similar reasons. The relcache now stores rd_backend rather than
>> rd_islocaltemp so that it remains straightforward to call smgropen()
>> based on a relcache entry. Some smgr functions no longer require an
>> isTemp argument, because they can infer the necessary information from
>> their BackendRelFileNode. smgrwrite() and smgrextend() now take a
>> skipFsync argument rather than an isTemp argument.
>>
>> I'm not totally sure whether it makes sense to do what we were talking
>> about above, viz, transfer unlink responsibility for temp rels from
>> the bgwriter to the owning backend. I haven't done that here. Nor
>> have I implemented any kind of improved temporary file cleanup
>> strategy, though I hope such a thing is possible.
>
> Any particular reason not to use directories to help organize things? IE:
>
> base/database_oid/pg_temp_rels/backend_pid/relfilenode
>
> Perhaps relfilenode should be something else.
>
> This seems to have several advantages:
>
> 1: It's more organized. If you want to see all the files for a single
> backend, you have just one place to look. Finding everything is still
> easy via filesystem find.
> 2: Cleanup becomes easier. When a backend exits, its entire directory
> goes away. On server start, everything under pg_temp_rels goes away.
> Unfortunately we still have a race condition with cleaning up if a
> backend dies and can't run its own cleanup, though I think that anytime
> that happens we're going to restart everything anyway.
> 3: It separates all the temporary stuff away from real files.
>
> The only downside I see is some extra code to create the backend_pid
> directory.

I like the idea of using directories to organize things better, and I
completely agree with points #1 and #3. Point #2 is a little more
complicated, I think, and something I've been struggling with. We need
to make sure that we clean up not only the temporary files but also the
catalog entries that point to them, if any. The current code only blows
away temporary tables that are "old" in terms of transaction IDs, is
driven off the catalog entries, and essentially does DROP TABLE
<whatever>. So it can't clean up orphaned temporary files that don't
have any catalog entries associated with them (which is what we want to
fix), but on the other hand whatever it does clean up is cleaned up
completely. We might be able to do something like this:

1. Scan pg_temp_rels. For each subdirectory found (whose name looks
like a backend ID), if the corresponding backend is not currently
running, add the backend ID to a list of backend IDs needing cleanup.
2. For each backend ID derived in step 1:
2A. Scan the subdirectory and add all the files you find (whose names
are in the right format) to a list of files to be deleted.
2B. Check again whether the backend in question is running. If it is,
don't do anything further for this backend and go on to the next
backend ID (i.e. continue with step 2).
2C. For each file found in step 2A, look for a pg_class entry in the
temp tablespace for that backend ID with a matching relfilenode number.
If one is found, drop the rel. If not, unlink the file if it still
exists.
2D. Attempt to remove the directory, ignoring failures.

I think step 2B is sufficient to prevent a race condition where we end
up mistaking a newly created file for an orphaned one. Am I right?

One possible problem with this is that it would need to be repeated for
every database/tablespace combination, but maybe that wouldn't be too
expensive. autovacuum already needs to process every database, but I
don't know how you'd decide how often to check for stray temp files.
Certainly you'd want to check after a restart... after that I get
fuzzy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
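A rough sketch of the per-backend part of that procedure, steps 2A
through 2D. Everything here is hypothetical: BackendIdIsActive() and
DropRelIfCatalogued() stand in for the procarray and catalog operations
a real patch would need, and the fixed-size file list is a
simplification of "add the files to a list":

    #include <ctype.h>
    #include <dirent.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Stubs for illustration only; not real PostgreSQL APIs. */
    static bool
    BackendIdIsActive(int backend_id)
    {
        (void) backend_id;
        return false;
    }

    static bool
    DropRelIfCatalogued(int backend_id, unsigned long relfilenode)
    {
        (void) backend_id;
        (void) relfilenode;
        return false;
    }

    static void
    cleanup_backend_dir(const char *tempdir, int backend_id)
    {
        char    subpath[512];
        DIR    *dir;
        struct dirent *de;
        char    files[256][64];     /* step 2A: list of candidate files */
        int     nfiles = 0;

        snprintf(subpath, sizeof(subpath), "%s/%d", tempdir, backend_id);
        if ((dir = opendir(subpath)) == NULL)
            return;
        while ((de = readdir(dir)) != NULL && nfiles < 256)
        {
            if (!isdigit((unsigned char) de->d_name[0]))
                continue;           /* crude "right format" check */
            snprintf(files[nfiles++], sizeof(files[0]), "%s", de->d_name);
        }
        closedir(dir);

        /* Step 2B: re-check liveness to avoid eating a new backend's files. */
        if (BackendIdIsActive(backend_id))
            return;

        /* Step 2C: drop the rel if catalogued, else just unlink the file. */
        for (int i = 0; i < nfiles; i++)
        {
            char filepath[640];
            unsigned long relfilenode = strtoul(files[i], NULL, 10);

            snprintf(filepath, sizeof(filepath), "%s/%s", subpath, files[i]);
            if (!DropRelIfCatalogued(backend_id, relfilenode))
                unlink(filepath);   /* may already be gone; that's fine */
        }

        rmdir(subpath);             /* step 2D: ignore failures */
    }

    int
    main(void)
    {
        /* Step 1 (finding dead-backend subdirectories) omitted; assume ID 7. */
        cleanup_backend_dir("pg_temp_rels", 7);
        return 0;
    }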
From: Greg Stark on 17 May 2010 23:25

On Tue, May 4, 2010 at 5:17 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> FWIW, that's not the case, any more than it is for blocks in shared
> buffer cache for regular rels. smgrextend() results in an observable
> extension of the file EOF immediately, whether or not you can see
> up-to-date data for those pages.
>
> Now people have often complained about the extra I/O involved in that,
> and it'd be nice to have a solution, but it's not clear to me that
> fixing it would be harder for temprels than regular rels.

For systems that have it, and filesystems that optimize it, I think
posix_fallocate() handles this case. We can extend files without
actually doing any I/O, and we get the guarantee from the filesystem
that it has the space available and that writing to those blocks won't
fail for lack of space. Only metadata I/O is triggered, to allocate the
blocks and mark them as virtually filled with nulls, and it isn't
synced unless there's an fsync, so there's no extra physical I/O. This
should be the case for ext4, but I'm not sure which other filesystems
implement this.

--
greg
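A minimal sketch of extending a file's EOF this way; the segment file
name and the 8 kB block size are assumptions for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        const off_t blocksize = 8192;             /* assumed 8 kB block */
        int         fd = open("16385", O_RDWR);   /* hypothetical segment file */
        off_t       old_eof;
        int         err;

        if (fd < 0)
            return 1;

        old_eof = lseek(fd, 0, SEEK_END);

        /*
         * Reserve one new block past EOF.  The filesystem guarantees the
         * space (a later write can't fail with ENOSPC), but on filesystems
         * that support it (e.g. ext4) no data blocks are written; only
         * metadata is updated.
         */
        err = posix_fallocate(fd, old_eof, blocksize);
        if (err != 0)   /* note: returns an errno value, not -1 */
            fprintf(stderr, "posix_fallocate: error %d\n", err);

        close(fd);
        return err != 0;
    }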
From: Robert Haas on 4 Jun 2010 19:16
On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel(a)decibel.org> wrote:
> Any particular reason not to use directories to help organize things? IE:
>
> base/database_oid/pg_temp_rels/backend_pid/relfilenode
>
> Perhaps relfilenode should be something else.
>
> This seems to have several advantages:
>
> 1: It's more organized. If you want to see all the files for a single
> backend, you have just one place to look. Finding everything is still
> easy via filesystem find.
> 2: Cleanup becomes easier. When a backend exits, its entire directory
> goes away. On server start, everything under pg_temp_rels goes away.
> Unfortunately we still have a race condition with cleaning up if a
> backend dies and can't run its own cleanup, though I think that anytime
> that happens we're going to restart everything anyway.
> 3: It separates all the temporary stuff away from real files.
>
> The only downside I see is some extra code to create the backend_pid
> directory.

I thought this was a good idea when you first proposed it, but on
further review I've changed my mind. There are several places in the
code that rely on checking whether the database directory within any
given tablespace is empty to determine whether that database is using
that tablespace. While I could rewrite all of that logic to do the
right thing, I think it's unnecessary pain.

I talked with Tom Lane about this a little bit at PGCon and opined that
we probably only need to clean out stray temporary files at startup. So
what I'm tempted to do is just write a function that goes through all
tablespace/database combinations and scans each directory for files
with a name like t<digits>_<digits> and blows them away. This will
leave the catalog entries pointing at nothing, but we already have
working code in autovacuum.c to clean up the catalog entries, and I
believe that will work just fine even if the underlying files have been
removed earlier.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
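A sketch of the name test for the startup sweep described above:
accept exactly t<digits>_<digits>, e.g. "t3_16422". The helper is
hypothetical, not the actual patch code:

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Return true if name matches t<digits>_<digits> exactly. */
    static bool
    looks_like_temp_rel_file(const char *name)
    {
        const char *p = name;

        if (*p++ != 't')
            return false;
        if (!isdigit((unsigned char) *p))
            return false;
        while (isdigit((unsigned char) *p))
            p++;
        if (*p++ != '_')
            return false;
        if (!isdigit((unsigned char) *p))
            return false;
        while (isdigit((unsigned char) *p))
            p++;
        return *p == '\0';
    }

    int
    main(void)
    {
        printf("%d\n", looks_like_temp_rel_file("t3_16422"));   /* 1 */
        printf("%d\n", looks_like_temp_rel_file("16422"));      /* 0 */
        return 0;
    }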