cross column correlation revisted [PgSql]

Prev: [HACKERS] cross column correlation revisted
Next: five-key syscaches

From: Robert Haas on 14 Jul 2010 11:13

2010/7/14 Tom Lane <tgl(a)sss.pgh.pa.us>:
> If the combination of columns is actually interesting, there might well
> be an index in place, or the DBA might be willing to create it.

Indexes aren't free, though, nor even close to it.

Still, I think we should figure out the underlying mechanism first and
then design the interface afterwards. One idea I had was a way to say
"compute the MCVs and histogram buckets for this table WHERE
<predicate>". If you can prove predicate for a particular query, you
can use the more refined statistics in place of the full-table
statistics. This is fine for the breast cancer case, but not so
useful for the zip code/street name case (which seems to be the really
tough one).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: marcin mank on 14 Jul 2010 17:57

On Wed, Jul 14, 2010 at 5:13 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> 2010/7/14 Tom Lane <tgl(a)sss.pgh.pa.us>:
>> If the combination of columns is actually interesting, there might well
>> be an index in place, or the DBA might be willing to create it.
>
> Indexes aren't free, though, nor even close to it.
>
> Still, I think we should figure out the underlying mechanism first and
> then design the interface afterwards. One idea I had was a way to say
> "compute the MCVs and histogram buckets for this table WHERE
> <predicate>". If you can prove predicate for a particular query, you
> can use the more refined statistics in place of the full-table
> statistics. This is fine for the breast cancer case, but not so
> useful for the zip code/street name case (which seems to be the really
> tough one).
>

One way of dealing with the zipcode problem is estimating NDST =
count(distinct row(zipcode, street)) - i.e. multi-column ndistinct.

Then the planner doesn`t have to assume that the selectivity of a
equality condition involving both zipcode and city is a multiple of
the respective selectivities. As a first cut it can assume that it
will get count(*) / NDST rows, but there are ways to improve it.

Greetings
Marcin Mańk

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Dimitri Fontaine on 14 Jul 2010 19:33

Joshua Tolley <eggyknap(a)gmail.com> writes:
>> >> ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id =3D y.id AND x.id=
2 =3D y.id2)
>> >=20
>> it says X and Y ... the selectivity of joins are what i am most
>> interested in. cross correlation of columns within the same table are
>> just a byproduct. the core thing is: how can i estimate the number
>> of rows returned from a join?
>
> All the discussion of this topic that I've seen has been limited to the s=
ingle
> table case. The hard problem in that case is coming up with something you=
can
> precalculate that will actually be useful during query planning, without
> taking too much disk, memory, CPU, or something else. Expanding the discu=
ssion
> to include join relations certainly still has valid use cases, but is even
> harder, because you've also got to keep track of precisely how the underl=
ying
> relations are joined, so you know in what context the statistics remain v=
alid.

Well I've been proposing to handle the correlation problem in another
way in some past mails here, and I've been trying to write it down too:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00118.php
http://tapoueh.org/char10.html#sec13

What I propose is to extend ANALYZE to be able to work on a VIEW too,
rather than just a table. The hard parts seems to be:

a. what stats to record, exploiting the view definition the best we can
b. how to match a user query against the view definitions we have in
order to actually use the stats

If you have answers or good ideas=C2=A0:)

Regards,
--=20
dim

--
dim

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: =?iso-8859-1?Q?Hans-J=FCrgen_Sch=F6nig?= on 15 Jul 2010 06:04

hello ...

a view is already nice but i think it is still too narrow.
the problem is: you don't want a view for every potential join.
in addition to that - ideally there is not much left of a view when it comes to checking for costs.
so, i think, this is not the kind of approach leading to total success here.

one side question: does anybody happen to know how this is one in oracle or db2?

many thanks,

hans

On Jul 15, 2010, at 1:33 AM, Dimitri Fontaine wrote:

> Joshua Tolley <eggyknap(a)gmail.com> writes:
>>>>> ALTER TABLE x SET CORRELATION STATISTICS FOR (x.id =3D y.id AND x.id=
> 2 =3D y.id2)
>>>> =20
>>> it says X and Y ... the selectivity of joins are what i am most
>>> interested in. cross correlation of columns within the same table are
>>> just a byproduct. the core thing is: how can i estimate the number
>>> of rows returned from a join?
>>
>> All the discussion of this topic that I've seen has been limited to the s=
> ingle
>> table case. The hard problem in that case is coming up with something you=
> can
>> precalculate that will actually be useful during query planning, without
>> taking too much disk, memory, CPU, or something else. Expanding the discu=
> ssion
>> to include join relations certainly still has valid use cases, but is even
>> harder, because you've also got to keep track of precisely how the underl=
> ying
>> relations are joined, so you know in what context the statistics remain v=
> alid.
>
> Well I've been proposing to handle the correlation problem in another
> way in some past mails here, and I've been trying to write it down too:
>
> http://archives.postgresql.org/pgsql-performance/2009-06/msg00118.php
> http://tapoueh.org/char10.html#sec13
>
> What I propose is to extend ANALYZE to be able to work on a VIEW too,
> rather than just a table. The hard parts seems to be:
>
> a. what stats to record, exploiting the view definition the best we can
> b. how to match a user query against the view definitions we have in
> order to actually use the stats
>
> If you have answers or good ideas=C2=A0:)
>
> Regards,
> --=20
> dim
>
>
> --
> dim
>

--
Cybertec Sch�nig & Sch�nig GmbH
Gr�hrm�hlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: David Fetter on 15 Jul 2010 10:14

On Thu, Jul 15, 2010 at 12:04:21PM +0200, Hans-J�rgen Sch�nig wrote:
> hello ...
>
> a view is already nice but i think it is still too narrow.

One sure way to fail is to take on a problem in chunks too large. If
we get even one of the cross-column issues solved by statistics, we'll
be ahead of all our competition, both free and proprietary.

Cheers,
David.
--
David Fetter <david(a)fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter(a)gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: [HACKERS] cross column correlation revisted
Next: five-key syscaches