From: Greg Stark
On Fri, Apr 9, 2010 at 12:17 AM, Joachim Wieland <joe(a)mcknight.de> wrote:
> One question that I do not yet see answered is, do we risk violating a
> patent even if we just link against a compression library, for example
> liblzf, without shipping the actual code?
>

Generally, patents are infringed when the patented process is used, so
whether we link against the code or ship it isn't really relevant. The
user using the software would need a patent license either way. We want
Postgres to be usable without depending on any copyright or patent
licenses.

Linking against it as an option isn't nearly as bad since the user
compiling it can choose whether to include the restricted feature or
not. That's what we do with readline. However, it's not nearly as
attractive when it restricts what file formats Postgres supports -- it
means someone might generate backup dump files that they later
discover they don't have a legal right to read and restore :(
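
To make the analogy concrete (the --with-lzf switch below is purely
hypothetical and does not exist in the tree; --without-readline does):

    ./configure --without-readline   # existing: build without the optional readline support
    ./configure --with-lzf           # hypothetical: opt in to liblzf-based compression

so the decision, and with it any legal exposure, would sit with whoever
builds the binaries rather than with the project.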

--
greg

From: Joachim Wieland
On Fri, Apr 9, 2010 at 5:51 AM, Greg Stark <gsstark(a)mit.edu> wrote:
> Linking against it as an option isn't nearly as bad since the user
> compiling it can choose whether to include the restricted feature or
> not. That's what we do with readline. However, it's not nearly as
> attractive when it restricts what file formats Postgres supports -- it
> means someone might generate backup dump files that they later
> discover they don't have a legal right to read and restore :(

If we only linked against it, we'd leave it up to the user to weigh
the risk as long as we are not aware of any such violation.

Our top priority is to make sure that the project would not be harmed
if one day such a patent showed up. If I understood you correctly, this
is not an issue even if we included lzf, and even less of one if we
only link against it. The rest is about user education: if we use lzf
only in pg_dump and not for toasting, we could show a message in
pg_dump when lzf is chosen to make the user aware of the possible
issues.

If we still cannot do this, then what I am asking is: What does the
project need to be able to at least link against such a compression
algorithm? Is it a list of 10, 20, 50 or more other projects using it,
or is it a lawyer saying "There is no patent"? But then, how can we be
sure that the lawyer is right? Or could we still not include it even if
we had both, because again, we couldn't be sure...?


Joachim

From: Tom Lane
Joachim Wieland <joe(a)mcknight.de> writes:
> If we still cannot do this, then what I am asking is: What does the
> project need to be able to at least link against such a compression
> algorithm?

Well, what we *really* need is a convincing argument that it's worth
taking some risk for. I find that not obvious. You can pipe the output
of pg_dump into your-choice-of-compressor, for example, and that gets
you the ability to spread the work across multiple CPUs in addition to
eliminating legal risk to the PG project. And in any case the general
impression seems to be that the main dump-speed bottleneck is on the
backend side not in pg_dump's compression.
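
For example, something along these lines (the database name is made up,
and pigz is just one multi-threaded gzip implementation; any external
compressor works the same way):

    pg_dump mydb | pigz -p 8 > mydb.sql.gz        # compress on all cores
    gunzip -c mydb.sql.gz | psql mydb_restored    # plain-format restore

and the compression code never has to live in the Postgres tree at all.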

regards, tom lane

From: Stefan Kaltenbrunner
Tom Lane wrote:
> Joachim Wieland <joe(a)mcknight.de> writes:
>> If we still cannot do this, then what I am asking is: What does the
>> project need to be able to at least link against such a compression
>> algorithm?
>
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for. I find that not obvious. You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project. And in any case the general
> impression seems to be that the main dump-speed bottleneck is on the
> backend side not in pg_dump's compression.

Legal risks aside (I'm not a lawyer, so I cannot comment on that), the
current situation imho is:

* for a plain pg_dump, the backend is the bottleneck
* for pg_dump -Fc with compression, compression is a huge bottleneck
* for pg_dump | gzip, it is usually compression (or bytea and some other
datatypes in <9.0)
* for a parallel dump you can dump uncompressed and compress afterwards,
which increases disk space requirements (and if you need parallel dump
you usually have a large database) and complexity (because you have to
think about how to parallelize the compression manually; see the sketch
after this list)
* for a parallel dump that compresses inline, you are limited by the
compression algorithm on a per-core basis, and given that the current
inline compression overhead is huge you lose a lot of the benefits of
parallel dump
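
A rough sketch of the dump-then-compress variant (names made up, pigz is
just one example of a parallel compressor):

    pg_dump mydb > mydb.sql    # uncompressed, needs the full disk space temporarily
    pigz -p 8 mydb.sql         # compress afterwards on all cores -> mydb.sql.gz

i.e. you pay with temporary disk space and an extra pass over the data
to get the compression parallelized.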


Stefan

From: Dimitri Fontaine
Tom Lane <tgl(a)sss.pgh.pa.us> writes:
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for. I find that not obvious. You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project.

Well, I like -Fc and playing with the catalog to restore only the
"interesting" data in staging environments. I even automated all the
catalog mangling in pg_staging, so that I just have to set up which
schemas I want, with only the DDL or with the DATA too.
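
For anyone who hasn't played with it, the usual trick with a -Fc archive
is to dump the table of contents, edit it, and feed it back to
pg_restore -- presumably that is the kind of editing pg_staging
automates (file and database names below are made up):

    pg_restore -l prod.dump > toc.list
    # comment out or delete the entries you don't want restored
    pg_restore -L toc.list -d staging prod.dump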

The fun part is when you want to exclude functions that are used in
triggers based on the schema where the function lives rather than the
trigger's schema, BTW, but that's another story.

So yes, having both -Fc and a compression facility other than plain gzip
would be good news. And benefiting from better compression in TOAST
would be good too, I guess (a small hit on compressed size, but a lot
faster, which would fit).

Summary: my convincing argument is using the dumps to efficiently
prepare development and testing environments from production data,
thanks to -Fc. That includes skipping some of the data to restore.

Regards,
--
dim
