From: Simon Riggs on 31 Jan 2010 15:40

On Sun, 2010-01-31 at 15:14 -0500, Tom Lane wrote:
> If the only benefit of getting rid of VACUUM FULL were simplifying
> Hot Standby, I'd agree with you. But there are numerous other benefits.
> The double-commit hack you mention is something we need to get rid of
> for general system stability (because of the risk of PANIC if the vacuum
> fails after the first commit). Getting rid of REINDEX-in-place on
> shared catalog indexes is another thing that's really safety critical.
> Removing V-F related hacks in other places would just be a bonus.

I should've agreed with this in my last post, cos I do. I want very, very
much to get rid of VACUUM FULL just because it's such a sump of ugly,
complex code. But there is a limit to how and when to perform what I now
see is a more major surgical operation.

--
Simon Riggs
www.2ndQuadrant.com
From: Tom Lane on 31 Jan 2010 22:49

Simon Riggs <simon(a)2ndQuadrant.com> writes:
> I'll do a little work towards step (1) just so we can take a more
> informed view once you've had a better look at just what (2) involves.

I spent a couple of hours reading code and believe that I've identified
all the pain points for allowing relocation of system catalogs (ie,
assigning them new relfilenodes during cluster-style VACUUM FULL, REINDEX,
etc). It's definitely not a trivial project but it's not out of reach
either --- I estimate I could get it done in a couple or three days of
full-time effort. Given the potential benefits I think it's worth doing.
Rough notes are attached --- comments?

			regards, tom lane


Change pg_class.relfilenode to be 0 for all rels for which a map file should
be used instead. Define RelationGetFilenode() to look to the physaddr info
instead, and make all internal refs go to that instead of reading
rd_rel->relfilenode directly. Define pg_relation_filenode(regclass) and fix
external refs (oid2name, pg_dump) to use that instead.

In zeroth cut, just have relcache.c substitute the OID in place of reading
a map file. Possibly it should always do that during bootstrap.

How should bootstrap decide which rels to insert zero for, anyway?
Shared definitely, pg_class, ... maybe that's enough? Or do we need
it for all nailed local catalogs?

local state contains
* shared map list
* this database's map list
* list of local overrides to each on-disk map list

At commit: if modified, write out new version; do this as late as possible
before xact commit
At abort: just discard local-override list

"Write" must include generating a WAL entry as well as sending a shared-inval
signal
* write temp file, fsync it
* emit WAL record containing copy of new file contents, fsync it
* atomically rename temp file into place
* send sinval

During relation open, check for sinval before relying on current cached value
of map contents

Hm, what about concurrent updates of map? Probably instantiate a new
LWLock or maybe better a HW lock. So write involves taking lock, check
for updates, finally write --- which means that local modifications
have to be stored in a way that allows re-reading somebody else's mods.
Hence above data structure with separate storage of local modifications.
We assume rel-level locking protects us from concurrent update attempts
on the same map entry, but we don't want to forbid concurrent relocations
of different catalogs.

LWLock would work if we do an unconditional read of the file contents after
getting lock rather than relying on sinval signaling, which seems a small
price considering map updates should be infrequent. Without that, writers
have to hold the lock till commit which rules out using LWLock.

Hm ... do we want an LWLock per map file, or is one lock to rule them all
sufficient? LWLock per database seems problematic. With an HW lock there
wouldn't be a problem with that. HW lock would allow concurrent updates of
the map files of different DBs, but is that worth the extra cycles?

In a case where a transaction relocates pg_class (or another mapped catalog)
and then makes updates in that catalog, all is well in the sense that the
updates will be preserved iff the xact commits. HOWEVER we would have
problems during WAL replay? No, because the WAL entries will command updates
to the catalog's new relfilenode, which will be interesting to anyone else if
and only if the xact gets to commit.
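A minimal standalone C sketch of the per-backend state and lookup rule
described in the notes above (on-disk shared map, on-disk per-database map,
plus uncommitted local overrides written out at commit and discarded at
abort). The struct layout and function names are illustrative assumptions,
not the actual backend API:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;

typedef struct MapEntry
{
    Oid         relid;          /* pg_class OID of the mapped catalog */
    Oid         filenode;       /* current on-disk relfilenode */
} MapEntry;

typedef struct MapState
{
    MapEntry   *shared;         /* contents of the cluster-wide map file */
    size_t      nshared;
    MapEntry   *local;          /* contents of this database's map file */
    size_t      nlocal;
    MapEntry   *pending;        /* overrides made by the current transaction;
                                 * written out just before commit, simply
                                 * discarded at abort */
    size_t      npending;
} MapState;

/*
 * Resolve a relation's filenode: a zero relfilenode in pg_class means
 * "consult the map", checking uncommitted local overrides first.
 */
static Oid
relation_get_filenode(const MapState *state, Oid relid, Oid rd_relfilenode)
{
    if (rd_relfilenode != 0)
        return rd_relfilenode;  /* ordinary relation: pg_class is authoritative */

    for (size_t i = 0; i < state->npending; i++)
        if (state->pending[i].relid == relid)
            return state->pending[i].filenode;
    for (size_t i = 0; i < state->nlocal; i++)
        if (state->local[i].relid == relid)
            return state->local[i].filenode;
    for (size_t i = 0; i < state->nshared; i++)
        if (state->shared[i].relid == relid)
            return state->shared[i].filenode;

    return relid;               /* bootstrap-style fallback: filenode == OID */
}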
We would need to be careful about the case of relocating pg_class itself
though --- make sure local relcache references the new pg_class before any
such updates happen. Probably a CCI is sufficient.

Another issue for clustering a catalog is that sinval catcache signalling
could get confused, since that depends on catalog entry TIDs which would
change --- we'd likely need to have relocation send an sinval signal saying
"flush *everything* from that catalog". (relcache inval on the catalog itself
doesn't do that.) This action could/should be transactional so no
additional code is needed to propagate the notice to HS standbys.

Once the updated map file is moved into place, the relocation is effectively
committed even if we subsequently abort the transaction. We can make that
window pretty narrow but not remove it completely. Risk factors:
* abort would try to remove relation files created by xact, in particular
  the new version of the catalog. Ooops. Can fix this by telling smgr to
  forget about removing those files before installing the new map file ---
  better to leak some disk space than destroy catalogs.
* on abort, we'd not send sinval notices, leaving other backends at risk
  of using stale info (maybe our own too). We could fix this by sending
  the sinval notice BEFORE renaming into place (if we send it and then fail
  to rename, no real harm done, just useless cache flushes). This requires
  keeping a nontransactional-inval path in inval.c, but at least it's much
  more localized than before. No problem for HS since we have the WAL record
  for map update to drive issuing sinvals on the slave side. Note this
  means that readers need to take the mapfile lock in shared mode, since they
  could conceivably get the sinval notice before we complete the rename.

For our own backend we need cache flushes at CCI and again at xact
commit/abort. This could be handled by regular transactional inval
path but we end up with a lot of redundant flushing. Maybe not worth
worrying about though.

Disallow catalog relocation inside subtransactions, to avoid having
to handle subxact abort effects on the local-map-changes state.
This could be implemented if desired, but doesn't seem worth it
at least in first pass.
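A sketch of the update sequence from the notes above (write a temp file and
fsync it, WAL-log a full copy of the new contents, send sinval, atomically
rename into place), with the invalidation deliberately issued before the
rename per the abort-safety argument. xlog_map_update() and send_map_sinval()
are placeholders for backend facilities, not real functions:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

extern void xlog_map_update(const void *contents, size_t len);  /* placeholder */
extern void send_map_sinval(void);                              /* placeholder */

static int
install_new_map(const char *mappath, const void *contents, size_t len)
{
    char    tmppath[1024];
    int     fd;

    snprintf(tmppath, sizeof(tmppath), "%s.tmp", mappath);

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, contents, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    close(fd);

    /* WAL record carries a full copy of the new map and is flushed first */
    xlog_map_update(contents, len);

    /* tell other backends to re-read; harmless if the rename then fails */
    send_map_sinval();

    /* the rename is the atomic "commit" of the new map contents */
    if (rename(tmppath, mappath) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    return 0;
}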
From: Bruce Momjian on 31 Jan 2010 22:58

FYI, getting rid of VACUUM FULL in a .0 release does make more sense than
doing it in a .1 release.

---------------------------------------------------------------------------

Tom Lane wrote:
> I spent a couple of hours reading code and believe that I've identified
> all the pain points for allowing relocation of system catalogs (ie,
> assigning them new relfilenodes during cluster-style VACUUM FULL,
> REINDEX, etc). It's definitely not a trivial project but it's not out
> of reach either --- I estimate I could get it done in a couple or three
> days of full-time effort. Given the potential benefits I think it's
> worth doing. Rough notes are attached --- comments?

--
Bruce Momjian <bruce(a)momjian.us>       http://momjian.us
EnterpriseDB                           http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
From: Heikki Linnakangas on 1 Feb 2010 03:02

Tom Lane wrote:
> Hm ... do we want an LWLock per map file, or is one lock to rule them all
> sufficient? LWLock per database seems problematic. With an HW lock there
> wouldn't be a problem with that. HW lock would allow concurrent updates of
> the map files of different DBs, but is that worth the extra cycles?

A single LWLock should be enough.

> Once the updated map file is moved into place, the relocation is effectively
> committed even if we subsequently abort the transaction. We can make that
> window pretty narrow but not remove it completely.

We could include the instructions to update the map file in the commit
record, instead of introducing a new record type, and update the map file
only *after* writing the commit record. The map file doesn't grow, so we
can be pretty confident that updating it doesn't fail (failure would lead
to PANIC).

I'm assuming the map file is fixed size, with a fixed location for each
relation, so that we can just overwrite the old file without the
create+rename dance, and not worry about torn pages.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
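A sketch of the fixed-size layout Heikki assumes: one fixed slot per mapped
relation plus a checksum, kept small enough to overwrite in place without
the create+rename dance. The field names are invented for illustration, and
the slot count is an assumption chosen so the whole file stays within a
single 512-byte sector (4 + 4 + 62*8 + 4 = 508 bytes), which is why a torn
write is not a concern:

#include <stdint.h>

#define MAX_MAPPED_RELS 62          /* assumption: enough for all mapped catalogs */

typedef uint32_t Oid;

typedef struct FixedMapFile
{
    uint32_t    magic;              /* identifies file format/version */
    uint32_t    num_mappings;       /* how many slots are in use */
    struct
    {
        Oid     relid;              /* pg_class OID of the mapped catalog */
        Oid     filenode;           /* its current relfilenode */
    }           mappings[MAX_MAPPED_RELS];
    uint32_t    crc;                /* detects a partial or corrupt write */
} FixedMapFile;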
From: Simon Riggs on 1 Feb 2010 03:54
On Sun, 2010-01-31 at 22:49 -0500, Tom Lane wrote:
> Simon Riggs <simon(a)2ndQuadrant.com> writes:
> > I'll do a little work towards step (1) just so we can take a more
> > informed view once you've had a better look at just what (2) involves.
>
> I spent a couple of hours reading code and believe that I've identified
> all the pain points for allowing relocation of system catalogs (ie,
> assigning them new relfilenodes during cluster-style VACUUM FULL,
> REINDEX, etc). It's definitely not a trivial project but it's not out
> of reach either --- I estimate I could get it done in a couple or three
> days of full-time effort. Given the potential benefits I think it's
> worth doing. Rough notes are attached --- comments?

Comments inline.

> Change pg_class.relfilenode to be 0 for all rels for which a map file should
> be used instead. Define RelationGetFilenode() to look to the physaddr info
> instead, and make all internal refs go to that instead of reading
> rd_rel->relfilenode directly.

Yes

> Define pg_relation_filenode(regclass) and fix
> external refs (oid2name, pg_dump) to use that instead.

Not sure why?

> In zeroth cut, just have relcache.c substitute the OID in place of reading
> a map file. Possibly it should always do that during bootstrap.

Yes

> How should bootstrap decide which rels to insert zero for, anyway?
> Shared definitely, pg_class, ... maybe that's enough? Or do we need
> it for all nailed local catalogs?

Only for nailed || shared. My submitted patch covers the "normal"
relations. Put 0 for rel at boot, write files at boot also.

> local state contains
> * shared map list
> * this database's map list
> * list of local overrides to each on-disk map list
>
> At commit: if modified, write out new version; do this as late as possible
> before xact commit
> At abort: just discard local-override list

Yep

> "Write" must include generating a WAL entry as well as sending a shared-inval
> signal
> * write temp file, fsync it
> * emit WAL record containing copy of new file contents, fsync it
> * atomically rename temp file into place
> * send sinval

Yes

> During relation open, check for sinval before relying on current cached value
> of map contents

Yes

> Hm, what about concurrent updates of map?

Why not have one file per relation that has a map file? No concurrency
issues at all then.

> Another issue for clustering a catalog is that sinval catcache signalling
> could get confused, since that depends on catalog entry TIDs which would
> change --- we'd likely need to have relocation send an sinval signal saying
> "flush *everything* from that catalog". (relcache inval on the catalog itself
> doesn't do that.)

Yes

> This action could/should be transactional so no
> additional code is needed to propagate the notice to HS standbys.

Agreed

> Once the updated map file is moved into place, the relocation is effectively
> committed even if we subsequently abort the transaction. We can make that
> window pretty narrow but not remove it completely. Risk factors:
> * abort would try to remove relation files created by xact, in particular
> the new version of the catalog. Ooops. Can fix this by telling smgr to
> forget about removing those files before installing the new map file ---
> better to leak some disk space than destroy catalogs.
> * on abort, we'd not send sinval notices, leaving other backends at risk
> of using stale info (maybe our own too).
> We could fix this by sending
> the sinval notice BEFORE renaming into place (if we send it and then fail
> to rename, no real harm done, just useless cache flushes). This requires
> keeping a nontransactional-inval path in inval.c, but at least it's much
> more localized than before. No problem for HS since we have the WAL record
> for map update to drive issuing sinvals on the slave side. Note this
> means that readers need to take the mapfile lock in shared mode, since they
> could conceivably get the sinval notice before we complete the rename.
>
> For our own backend we need cache flushes at CCI and again at xact
> commit/abort. This could be handled by regular transactional inval
> path but we end up with a lot of redundant flushing. Maybe not worth
> worrying about though.

!

> Disallow catalog relocation inside subtransactions, to avoid having
> to handle subxact abort effects on the local-map-changes state.
> This could be implemented if desired, but doesn't seem worth it
> at least in first pass.

Agreed, not needed for emergency maintenance actions like this.

There are issues associated with performing actions on critical tables,
since in some cases I got "relation xxxx is already locked". ISTM we'd
need some special handling of locking in those cases. Toast tables are
also a problem area because there are dependencies between tables that
need to be mangled.

Please bear in mind that I looked at the code also and came to a similar
original assessment. It was only when I started working with it that I
came to different conclusions.

--
Simon Riggs
www.2ndQuadrant.com
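Simon's "one file per relation that has a map file" idea above would trade
the single map (and its lock) for many tiny files, one per mapped catalog,
so that concurrent relocations of different catalogs never touch the same
file. A hypothetical sketch, with the path layout and file format invented
purely for illustration:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;

/*
 * One small file per mapped catalog, e.g. <dir>/pg_relmap.<reloid>, where
 * <dir> is the global directory for shared catalogs or the database
 * directory for local ones.  Each file holds just the current filenode.
 */
static int
read_per_relation_map(const char *dir, Oid relid, Oid *filenode)
{
    char    path[1024];
    FILE   *f;

    snprintf(path, sizeof(path), "%s/pg_relmap.%u", dir, relid);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;              /* relation not mapped, or map not yet written */
    if (fread(filenode, sizeof(Oid), 1, f) != 1)
    {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}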