cross column correlation revisted [PgSql]

Prev: [HACKERS] cross column correlation revisted
Next: five-key syscaches

From: Heikki Linnakangas on 14 Jul 2010 06:40

On 14/07/10 13:12, PostgreSQL - Hans-J�rgen Sch�nig wrote:
> hello everybody,
>
> we are currently facing some serious issues with cross correlation issue.
> consider: 10% of all people have breast cancer. we have 2 genders (50:50).
> if i select all the men with breast cancer, i will get basically nobody - the planner will overestimate the output.
> this is the commonly known problem ...
>
> this cross correlation problem can be quite nasty in many many cases.
> underestimated nested loops can turn joins into a never ending nightmare and so on and so on.
>
> my ideas is the following:
> what if we allow users to specifiy cross-column combinations where we keep separate stats?
> maybe somehow like this ...
>
> ALTER TABLE x SET CORRELATION STATISTICS FOR (id = id2 AND id3=id4)
>
> or ...
>
> ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id = y.id AND x.id2 = y.id2)
>
> clearly we cannot store correlation for all combinations of all columns so we somehow have to limit it.
>
> what is the general feeling about something like that?

+1 is my general feeling, it's good if you can tell the system to
collect additional statistics where needed. And once you have that, you
can write an agent or something to detect automatically which extra
statistics might be useful.

However, the problem is how to represent and store the
cross-correlation. For fields with low cardinality, like "gender" and
boolean "breast-cancer-or-not" you can count the prevalence of all the
different combinations, but that doesn't scale. Another often cited
example is zip code + street address. There's clearly a strong
correlation between them, but how do you represent that?

For scalar values we currently store a histogram. I suppose we could
create a 2D histogram for two columns, but that doesn't actually help
with the zip code + street address problem.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: =?iso-8859-1?Q?PostgreSQL_-_Hans-J=FCrgen_Sch=F6nig?= on 14 Jul 2010 06:52

On Jul 14, 2010, at 12:40 PM, Heikki Linnakangas wrote:

> On 14/07/10 13:12, PostgreSQL - Hans-J�rgen Sch�nig wrote:
>> hello everybody,
>>
>> we are currently facing some serious issues with cross correlation issue.
>> consider: 10% of all people have breast cancer. we have 2 genders (50:50).
>> if i select all the men with breast cancer, i will get basically nobody - the planner will overestimate the output.
>> this is the commonly known problem ...
>>
>> this cross correlation problem can be quite nasty in many many cases.
>> underestimated nested loops can turn joins into a never ending nightmare and so on and so on.
>>
>> my ideas is the following:
>> what if we allow users to specifiy cross-column combinations where we keep separate stats?
>> maybe somehow like this ...
>>
>> ALTER TABLE x SET CORRELATION STATISTICS FOR (id = id2 AND id3=id4)
>>
>> or ...
>>
>> ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id = y.id AND x.id2 = y.id2)
>>
>> clearly we cannot store correlation for all combinations of all columns so we somehow have to limit it.
>>
>> what is the general feeling about something like that?
>
> +1 is my general feeling, it's good if you can tell the system to collect additional statistics where needed. And once you have that, you can write an agent or something to detect automatically which extra statistics might be useful.
>

it seems i can leave my bunker where i was hiding for cover when i was waiting for a reply ;).
yes, my idea was to have an agent as well - but this is just some follow up problem.

> However, the problem is how to represent and store the cross-correlation. For fields with low cardinality, like "gender" and boolean "breast-cancer-or-not" you can count the prevalence of all the different combinations, but that doesn't scale. Another often cited example is zip code + street address. There's clearly a strong correlation between them, but how do you represent that?

we could play the same story with a table storing people including their home country and the color of their skin.
obviously we will have more black people in african countries..

>
> For scalar values we currently store a histogram. I suppose we could create a 2D histogram for two columns, but that doesn't actually help with the zip code + street address problem.
>

i think we might go for a second relation here specifically for this issue and a boolean flag in the current stats table indicating that additional correlation stats exist (to avoid an additional lookup unless really necessary).
do you have a useful syntax in mind? the thing is: this issue can be isolated inside a table (e.g. WHERE a.id = a.id2 AND a.id3 = a.id4) or it might span two tables with an arbitrary number of fields.

many thanks,

hans

--
Cybertec Sch�nig & Sch�nig GmbH
Gr�hrm�hlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Yeb Havinga on 14 Jul 2010 07:21

Heikki Linnakangas wrote:
> However, the problem is how to represent and store the
> cross-correlation. For fields with low cardinality, like "gender" and
> boolean "breast-cancer-or-not" you can count the prevalence of all the
> different combinations, but that doesn't scale. Another often cited
> example is zip code + street address. There's clearly a strong
> correlation between them, but how do you represent that?
>
> For scalar values we currently store a histogram. I suppose we could
> create a 2D histogram for two columns, but that doesn't actually help
> with the zip code + street address problem.
In my head the neuron for 'principle component analysis' went on while
reading this. Back in college it was used to prepare input data before
feeding it into a neural network. Maybe ideas from PCA could be helpful?

regards,
Yeb Havinga

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on 14 Jul 2010 10:01

Heikki Linnakangas <heikki.linnakangas(a)enterprisedb.com> writes:
> On 14/07/10 13:12, PostgreSQL - Hans-J�rgen Sch�nig wrote:
>> maybe somehow like this ...
>> ALTER TABLE x SET CORRELATION STATISTICS FOR (id = id2 AND id3=id4)

> +1 is my general feeling, it's good if you can tell the system to
> collect additional statistics where needed.

The previous discussions about this went in the direction of
"automatically collect stats if there is an index on that combination of
columns". Do we really need a command?

> However, the problem is how to represent and store the
> cross-correlation.

Yes, whatever the triggering mechanism is for collecting cross-column
stats, actually doing something useful is the hard part.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Joshua Tolley on 14 Jul 2010 09:54

On Wed, Jul 14, 2010 at 01:21:19PM +0200, Yeb Havinga wrote:
> Heikki Linnakangas wrote:
>> However, the problem is how to represent and store the
>> cross-correlation. For fields with low cardinality, like "gender" and
>> boolean "breast-cancer-or-not" you can count the prevalence of all the
>> different combinations, but that doesn't scale. Another often cited
>> example is zip code + street address. There's clearly a strong
>> correlation between them, but how do you represent that?
>>
>> For scalar values we currently store a histogram. I suppose we could
>> create a 2D histogram for two columns, but that doesn't actually help
>> with the zip code + street address problem.
> In my head the neuron for 'principle component analysis' went on while
> reading this. Back in college it was used to prepare input data before
> feeding it into a neural network. Maybe ideas from PCA could be helpful?

I've been playing off and on with an idea along these lines, which builds an
empirical copula[1] to represent correlations between columns where there
exists a multi-column index containing those columns. This copula gets stored
in pg_statistic. There are plenty of unresolved questions (and a crash I
introduced and haven't had time to track down), but the code I've been working
on is here[2] in the multicolstat branch. Most of the changes are in
analyze.c; no user-visible changes have been introduced. For that matter,
there aren't any changes yet actually to use the values once calculated (more
unresolved questions get in the way there), but it's a start.

[1] http://en.wikipedia.org/wiki/Copula_(statistics)
[2] http://git.postgresql.org/gitweb?p=users/eggyknap/postgres.git

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

| Next | Last
Pages: 1 2 3 4
Prev: [HACKERS] cross column correlation revisted
Next: five-key syscaches