Prev: [HACKERS] Streaming replication and wal skipping
Next: [HACKERS] Bloom filters bloom filters bloom filters
From: Greg Stark on 18 Jan 2010 07:11 So the other thread on bloom filter indexes got a discussion going in real space about other neat things you can do with them and there were two that seemed useful for Postgres. We could efficiently get an fairly accurate estimate of the number of distinct values when doing analyze if we scan the entire table. I'm not sure what happens now if you do vacuum analyze -- the visibility map means we don't need to scan the whole table to vacuum it which means I'm not sure if there's a way to force analyze to see every tuple. But if we did our current method is bunk and no method based on a sample of tuples can ever be very precise. So assuming we provide a way to force analyze to scan every tuple we could build a bloom filter for each column and count the number of times we find a value which the bloom filter says is definitely not seen. That would give a lower bound for ndistinct. Every time the bloom filter gives a positive we can calculate the false positive rate given the number of distinct values stored so far and add in a fractional distinct value to calculate an accurate expected value (or someone could try to do the calculus necessary to avoid doing the floating point arithmetic if that's a bottleneck). This actually gives the expected value for the number of distinct hash values. That would be a very good estimate for sets << 2^32 but gets increasingly imprecise for sets that approach that size and larger. Also this has the problem that the false positive rate depends on the size of the bloom filter and the set size. I'm assuming we just divide maintenance_work_mem by natts and allocate a bloom filter for each column of that size. That means columns with high cardinality might be much more inaccurate and unless maintenance_work_mem is very large might be very inaccurate. I suppose we can print a warning suggesting raising maintenance_work_mem to get better estimates for that column. But I think it would be much better than our current algorithm. And I can't see this being patented given that bloom filters were invented in 1970 and counting distinct values is a pretty obvious application of them. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers |