From: Jesper Krogh on 19 May 2010 15:01

Hi.

I am working on getting full-text search to work and have come across
something that looks a bit strange to me. The document base is around
350,000 documents, and I have set the statistics target on the tsvector
column to 1000, since the default of 100 seemed way off.

# ANALYZE verbose reference (document_tsvector);
INFO:  analyzing "reference"
INFO:  "reference": scanned 14486 of 14486 pages, containing 350174 live rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
ANALYZE

Ok, so analyze examined almost all rows. Looking into
"most_common_freqs" I find:

# select count(unnest) from (select unnest(most_common_freqs)
    from pg_stats where attname = 'document_tsvector') as foo;
 count
-------
  2810
(1 row)

But the distribution is very "flat" at the end: the last 128 values are
exactly 1.00189e-05, which means that any term sitting outside the array
gets an estimate of 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows.
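(For reference, the flat tail can be inspected directly with something
along these lines, grouping the per-element frequencies from the same
pg_stats view as above; this is a sketch of what I ran, not verbatim
output:)

# select freq, count(*)
    from (select unnest(most_common_freqs) as freq
            from pg_stats
           where tablename = 'reference'
             and attname = 'document_tsvector') as f
   group by freq order by freq limit 5;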
So far I have no idea if this is bad or good, so here are a couple of
sample runs on terms that sit outside the "most_common_vals" array:

# explain analyze select id from efam.reference where document_tsvector @@ to_tsquery('searchterm') order by id limit 2000;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.99..35.99 rows=2 width=4) (actual time=20.717..28.135 rows=1612 loops=1)
   ->  Sort  (cost=35.99..35.99 rows=2 width=4) (actual time=20.709..23.190 rows=1612 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 124kB
         ->  Bitmap Heap Scan on reference  (cost=28.02..35.98 rows=2 width=4) (actual time=3.522..17.238 rows=1612 loops=1)
               Recheck Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..28.02 rows=2 width=0) (actual time=3.378..3.378 rows=1613 loops=1)
                     Index Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
 Total runtime: 30.743 ms
(9 rows)

Ok, the query planner estimates that there are 2 rows, exactly as
predicted, so that mechanism works as expected. But in fact there are
1612 matching rows. So analyze has sampled 6 of 7 rows in the table,
and this term exists in 1612/350174 rows ~ freq 0.0046, which is way
higher than the lower bound of 1.00189e-05; a frequency that high
should have placed the term somewhere around the middle of the 2810
collected values, not outside the array. So "most_common_vals" seems
to contain a lot of values that should never have been kept in favor
of other, more common values.

In practice, just cranking the statistics target up high enough seems
to solve the problem, but doesn't there seem to be something wrong in
how the statistics are collected?
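(To cross-check the planner's guess against reality for any given term,
ts_stat can compute the true document counts; a sketch, using the same
'searchterm' placeholder as above, and note that ts_stat scans the whole
table, so it is slow on a base this size:)

# select word, ndoc, ndoc::float8 / 350174 as actual_freq
    from ts_stat('select document_tsvector from efam.reference')
   where word = 'searchterm';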
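(And the "cranking up" workaround spelled out, in case anyone wants to
reproduce it; a sketch, where 3000 is just an example value and 10000
is the hard upper limit for the per-column target:)

# alter table efam.reference
    alter column document_tsvector set statistics 3000;
# analyze verbose efam.reference (document_tsvector);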
# select version();
                                                           version
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0beta1 on x86_64-unknown-linux-gnu, compiled by GCC gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit
(1 row)

-- 
Jesper