From: Joachim Wieland on 29 Mar 2010 10:46

People have been talking about a parallel version of pg_dump a few times already. I have been working on some proof-of-concept code for this feature every now and then, and I am planning to contribute it for 9.1.

There are two main issues with a parallel version of pg_dump: the first is that it requires a consistent snapshot among multiple pg_dump clients, and the second is that the output currently goes to a single file, and it is unclear what to do about multiple processes writing into a single file.

- There are ideas on how to solve the consistent-snapshot issue, but in the end you can always solve it by stopping your application(s). I actually assume that whenever people are interested in a very fast dump, it is because they are doing some maintenance task (like migrating to a different server) that involves pg_dump. In those cases they would stop their system anyway. Even if we had consistent snapshots in a future version, would we forbid people from running parallel dumps against older server versions? What I suggest is to just display a big warning when run against a server without consistent snapshot support (which currently is every version).

- Regarding the output of pg_dump, I am proposing two solutions. The first is to introduce a new archive type "directory" where each table and each blob is a file in a directory, similar to the experimental "files" archive type. The idea has also come up that you should be able to specify multiple directories in order to make use of several physical disk drives. Taking this further, in order to manage all the mess that you can create with this, every file of the same backup needs to have a unique identifier, and pg_restore should have a check parameter that tells you whether your backup directory is in a sane and complete state (think of moving a file from one backup directory to another, or trying to restore from two directories that belong to different backup sets...). A rough sketch of such a check is appended below.

The second solution to the single-file problem is to generate no output at all, i.e. whatever you export from your source database you import directly into your target database, which in the end turns out to be a parallel form of "pg_dump | psql". In fact, technically this is rather a parallel pg_restore than a pg_dump, as you need to respect the dependencies between objects. The good news is that with the parallel pg_restore of the custom archive format we already have everything in place for this dependency checking. The addition is a new archive type that dumps (just-in-time) whatever the dependency algorithm decides to restore next. This is probably the fastest way we can copy or upgrade a database when pg_migrator cannot be used (for example when you migrate to a different hardware architecture). A minimal sketch of this direct-transfer idea is also appended below.

As said, I have some working code for the features described (Unix only). If anybody would like to give it a try already now, just let me know; I'd be happy to get some early test reports, and you could check what speedup to expect. But before I continue, I'd like to have a discussion about what people actually want and what the best way forward is. I am currently not planning to make parallel dumps work with the custom format, even though this would be possible if we changed the format to a certain degree.

Comments?

Joachim
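Appended sketch 1: a minimal illustration, not the proposed on-disk format, of the sanity check described for the "directory" archive type. A table of contents records which data files belong to one backup set, tagged with a unique backup identifier, and a check pass reports files that are missing or were mixed in from a different backup. In the proposal every file would carry the identifier itself; the toc.json layout, the ".dat" suffix, and the function names here are made up for the example, and Python is used only as pseudocode for the idea.

```python
# Hypothetical sketch of a "directory" archive TOC plus a sanity check.
# The file layout and names are assumptions for illustration only.
import json
import os
import uuid


def write_toc(directory, table_files):
    """Write a table of contents tagging all data files with one backup id."""
    toc = {
        "backup_id": str(uuid.uuid4()),   # unique identifier for this dump
        "files": sorted(table_files),     # e.g. ["public.orders.dat", ...]
    }
    with open(os.path.join(directory, "toc.json"), "w") as f:
        json.dump(toc, f, indent=2)


def check_directory(directory):
    """Report missing or foreign files before a restore is attempted."""
    with open(os.path.join(directory, "toc.json")) as f:
        toc = json.load(f)
    present = {name for name in os.listdir(directory) if name.endswith(".dat")}
    expected = set(toc["files"])
    missing = expected - present
    unknown = present - expected
    if missing or unknown:
        raise RuntimeError(
            "backup %s is not sane/complete: missing=%s unknown=%s"
            % (toc["backup_id"], sorted(missing), sorted(unknown))
        )
    return toc["backup_id"]
```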
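Appended sketch 2: a minimal sketch of the second solution, the parallel "pg_dump | psql" style direct transfer. A handful of worker processes each stream one table from the source server straight into the target server with COPY, so nothing is written to disk in between. Dependency ordering between objects, which the real proposal would take from the parallel pg_restore code, is ignored here; the connection strings, table list, worker count, and the use of the psycopg2 driver are all assumptions made for the example.

```python
# Hypothetical driver for a parallel, file-less table transfer: each worker
# pipes "COPY ... TO STDOUT" on the source directly into "COPY ... FROM STDIN"
# on the target. Connection strings and table names below are placeholders.
import multiprocessing
import os
import threading

import psycopg2  # assumed client driver; any libpq wrapper would do

SOURCE_DSN = "dbname=src host=old-server"   # placeholder connection strings
TARGET_DSN = "dbname=dst host=new-server"
TABLES = ["public.orders", "public.order_lines", "public.customers"]


def transfer_table(table):
    """Stream one table from source to target through an OS pipe."""
    src = psycopg2.connect(SOURCE_DSN)
    dst = psycopg2.connect(TARGET_DSN)
    rfd, wfd = os.pipe()
    reader = os.fdopen(rfd, "rb")
    writer = os.fdopen(wfd, "wb")

    def produce():
        # Push the table's rows into the pipe (table names are trusted here).
        try:
            with src.cursor() as cur:
                cur.copy_expert("COPY %s TO STDOUT" % table, writer)
        finally:
            writer.close()  # EOF for the consumer side

    t = threading.Thread(target=produce)
    t.start()
    # Pull the rows out of the pipe and load them on the target side.
    with dst.cursor() as cur:
        cur.copy_expert("COPY %s FROM STDIN" % table, reader)
    dst.commit()
    t.join()
    reader.close()
    src.close()
    dst.close()


if __name__ == "__main__":
    # One worker process per table, at most four tables in flight at a time.
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(transfer_table, TABLES)
```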