cross column correlation revisted [PgSql]

Prev: [HACKERS] cross column correlation revisted
Next: five-key syscaches

From: =?iso-8859-1?Q?PostgreSQL_-_Hans-J=FCrgen_Sch=F6nig?= on 14 Jul 2010 10:13

hello tom,

i think that having stats on an index is a problem by itself for 2 reasons - for cross column correlation at least:

a.) joins cannot be covered by an index on two tables - we would fix "inside a table correlation problems" but not joins.
b.) who says that there is actually an index in place? assume you are doing some big seq scan to do analytics. you don't want it to be indexed for 10 different types of queries.

i think i is pretty hard to determine automatically what to collect because we cannot know which permutations of cross-column magic people will use.
i was thinking along the line of having it automatic as well but i could not figure out how to do it.
i think we can suggest addition stats to the user and we can write tools to figure our somehow what would be useful but personally i cannot see anything which is better than a command here.

many thanks,

hans

On Jul 14, 2010, at 4:01 PM, Tom Lane wrote:

> Heikki Linnakangas <heikki.linnakangas(a)enterprisedb.com> writes:
>> On 14/07/10 13:12, PostgreSQL - Hans-J�rgen Sch�nig wrote:
>>> maybe somehow like this ...
>>> ALTER TABLE x SET CORRELATION STATISTICS FOR (id = id2 AND id3=id4)
>
>> +1 is my general feeling, it's good if you can tell the system to
>> collect additional statistics where needed.
>
> The previous discussions about this went in the direction of
> "automatically collect stats if there is an index on that combination of
> columns". Do we really need a command?
>
>> However, the problem is how to represent and store the
>> cross-correlation.
>
> Yes, whatever the triggering mechanism is for collecting cross-column
> stats, actually doing something useful is the hard part.
>
> regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Cybertec Sch�nig & Sch�nig GmbH
Gr�hrm�hlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on 14 Jul 2010 10:22

=?iso-8859-1?Q?PostgreSQL_-_Hans-J=FCrgen_Sch=F6nig?= <postgres(a)cybertec.at> writes:
> i think that having stats on an index is a problem by itself for 2 reasons - for cross column correlation at least:

> a.) joins cannot be covered by an index on two tables - we would fix "inside a table correlation problems" but not joins.

Your proposed command didn't cover the two-table case either, and anyway
what the heck do you mean by cross-correlation across tables?
Cross-correlation is about the correlation between values in the same
row.

> b.) who says that there is actually an index in place?

If the combination of columns is actually interesting, there might well
be an index in place, or the DBA might be willing to create it. For
that matter, have you considered the idea of examining the index
contents to derive the statistics? Might work better than trying to get
numbers via ANALYZE.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andrew Dunstan on 14 Jul 2010 10:34

Tom Lane wrote:
> If the combination of columns is actually interesting, there might well
> be an index in place, or the DBA might be willing to create it.

I'm having a hard time imagining an interesting case where that wouldn't
be so.

> For
> that matter, have you considered the idea of examining the index
> contents to derive the statistics? Might work better than trying to get
> numbers via ANALYZE.
>
>

Sounds like a good idea.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: =?iso-8859-1?Q?PostgreSQL_-_Hans-J=FCrgen_Sch=F6nig?= on 14 Jul 2010 10:41

hello ...

look at the syntax i posted in more detail:

>> ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id = y.id AND x.id2 = y.id2)
>

it says X and Y ...
the selectivity of joins are what i am most interested in. cross correlation of columns within the same table are just a byproduct.
the core thing is: how can i estimate the number of rows returned from a join?

an example would be: you have a email accounts + messages. you know that each row will match in a join as you can assume that every account will have a message.
what we need is a syntax which covers the join case and the case where columns inside the same table correlate.
and the fact that an index cannot cover two tables leads me to the conclusion that stats on an index are not the solution to the join problem.

many thanks,

hans

On Jul 14, 2010, at 4:22 PM, Tom Lane wrote:

> =?iso-8859-1?Q?PostgreSQL_-_Hans-J=FCrgen_Sch=F6nig?= <postgres(a)cybertec.at> writes:
>> i think that having stats on an index is a problem by itself for 2 reasons - for cross column correlation at least:
>
>> a.) joins cannot be covered by an index on two tables - we would fix "inside a table correlation problems" but not joins.
>
> Your proposed command didn't cover the two-table case either, and anyway
> what the heck do you mean by cross-correlation across tables?
> Cross-correlation is about the correlation between values in the same
> row.
>
>> b.) who says that there is actually an index in place?
>
> If the combination of columns is actually interesting, there might well
> be an index in place, or the DBA might be willing to create it. For
> that matter, have you considered the idea of examining the index
> contents to derive the statistics? Might work better than trying to get
> numbers via ANALYZE.
>
> regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Cybertec Sch�nig & Sch�nig GmbH
Gr�hrm�hlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Joshua Tolley on 14 Jul 2010 11:09

On Wed, Jul 14, 2010 at 04:41:01PM +0200, PostgreSQL - Hans-Jürgen Schönig wrote:
> hello ...
>
> look at the syntax i posted in more detail:
>
> >> ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id = y.id AND x.id2 = y.id2)
> >
> it says X and Y ...
> the selectivity of joins are what i am most interested in. cross correlation of columns within the same table are just a byproduct.
> the core thing is: how can i estimate the number of rows returned from a join?

All the discussion of this topic that I've seen has been limited to the single
table case. The hard problem in that case is coming up with something you can
precalculate that will actually be useful during query planning, without
taking too much disk, memory, CPU, or something else. Expanding the discussion
to include join relations certainly still has valid use cases, but is even
harder, because you've also got to keep track of precisely how the underlying
relations are joined, so you know in what context the statistics remain valid.
So it makes sense to tackle the single table version first. Once it's
implemented somehow, and has been proven sufficiently effective to merit the
increased code size and complexity, we can consider expanding it to joined
relations.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: [HACKERS] cross column correlation revisted
Next: five-key syscaches