From: Alvaro Herrera on 25 May 2010 14:44

Excerpts from Jesper Krogh's message of Wed May 19 15:01:18 -0400 2010:

> But the distribution is very "flat" at the end, the last 128 values are
> exactly
> 1.00189e-05
> which means that any term sitting outside the array would get an estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

I don't know if this is related, but tsvector stats are computed and stored
per term, not per datum. This is different from all other datatypes. Maybe
there's code somewhere that's assuming per-datum and coming up with the wrong
estimates? Or maybe the tsvector-specific code contains a bug somewhere;
maybe a rounding error?

-- 
Álvaro Herrera <alvherre(a)alvh.no-ip.org>

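For reference, the estimate Jesper quotes works out as follows (a plain C
restatement of the figures in the report, not the actual selectivity code):
a term missing from most_common_vals is estimated at half the frequency of
the least common entry that is present.

#include <stdio.h>

int
main(void)
{
    double  min_mcv_freq = 1.00189e-5;  /* last value in most_common_freqs */
    double  reltuples = 350174.0;       /* estimated total rows in the table */

    /* terms outside the MCV list get half the minimum MCV frequency */
    double  est_rows = min_mcv_freq / 2.0 * reltuples;

    printf("estimated rows: %.2f\n", est_rows);     /* ~1.75, i.e. about 2 rows */
    return 0;
}
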
From: Jan Urbański on 25 May 2010 19:16

On 19/05/10 21:01, Jesper Krogh wrote:
> The document base is around 350.000 documents and
> I have set the statistics target on the tsvector column
> to 1000 since the 100 seems way off.

So for tsvectors the statistics target means more or less "at any time
track at most 10 * <target> lexemes simultaneously", where "track" means
keeping them in memory while going through the tuples being analysed.

Remember that the measure is in lexemes, not whole tsvectors, and the 10
factor is meant to approximate the average number of unique lexemes in a
tsvector. If your documents are very large, this might not be a good
approximation.

> # ANALYZE verbose reference (document_tsvector);
> INFO:  analyzing "reference"
> INFO:  "reference": scanned 14486 of 14486 pages, containing 350174 live
> rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
> ANALYZE
>
> Ok, so analyze almost examined all rows. Looking into
> "most_common_freqs" I find
> # select count(unnest) from (select unnest(most_common_freqs) from
> pg_stats where attname = 'document_tsvector') as foo;
>  count
> -------
>   2810
> (1 row)

So the size of the most_common_freqs and most_common_vals rows in
pg_statistics for tsvectors has an upper bound of <stats-target> * 10 (for
the same reasons as mentioned before) and holds lexemes (not whole
tsvectors). What also happens is that lexemes that were seen only once
while going through the analysed set are discarded, so you can actually
get fewer entries in these arrays, even if your document set is big.

> But the distribution is very "flat" at the end, the last 128 values are
> exactly
> 1.00189e-05
> which means that any term sitting outside the array would get an
> estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

Yeah, this might mean that you could try cranking up the stats target a
lot, to make the set of simultaneously tracked lexemes larger (it will
cost time and memory during analyse though). If the documents have
completely different contents, what can happen is that almost all lexemes
are only seen a few times and get removed during the pruning of the
working set. I have seen similar behaviour while working on the
typanalyze function for tsvectors.

> So far I have no idea if this is bad or good, so a couple of sample runs
> of stuff that
> is sitting outside the "most_common_vals" array:
>
> [gathered statistics suck]
>
> So the "most_common_vals" seems to contain a lot of values that should
> never have been kept in favor
> of other values that are more common.
>
> In practice, just cranking the statistics estimate up high enough seems
> to solve the problem, but doesn't
> there seem to be something wrong in how the statistics are collected?

The algorithm to determine most common vals does not do it accurately.
That would require keeping all lexemes from the analysed tsvectors in
memory, which would be impractical. If you want to learn more about the
algorithm being used, try reading
http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments in
ts_typanalyze.c

It would be interesting to know the average size of a tsvector in your
document set (i.e. how many unique lexemes a tsvector has on average).

In general, the tsvector typanalyze function is designed to suck less than
the constant factor that was used previously, but it only works really
well for the most common lexemes (thus preventing most gross misestimates).

I'm not very surprised it misses the difference between 1612/350174 and
4/350174, and I'm quite happy that it gets it right if you set the stats
target really high :o)

There's always the possibility that there's some stupid bug there, but I
think you just set your expectations too high for the tsvector typanalyze
function. If you could come up with a better way of doing tsvector stats,
that would be awesome - currently it's just doing its best to prevent the
most outrageous errors.

Cheers,
Jan

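For concreteness, the counting scheme Jan describes (the Manku/Motwani
lossy counting that ts_typanalyze.c is based on) boils down to something
like the following toy C sketch. This is not the PostgreSQL code: a
linear-scan table stands in for the real hash of lexemes, MAX_TRACKED
stands in for the 10 * statistics-target limit, and the output-filtering
step argued about later in the thread is left out.

#include <stdio.h>
#include <string.h>

#define MAX_TRACKED 1024        /* stand-in for 10 * statistics target */

typedef struct
{
    char    lexeme[64];
    int     f;                  /* observed count */
    int     delta;              /* maximum possible undercount */
} TrackedLexeme;

static TrackedLexeme table[MAX_TRACKED];
static int  ntracked = 0;
static int  bucket = 1;         /* current bucket number, b_current in the paper */

/* At every bucket boundary, drop entries whose f + delta <= b_current. */
static void
prune(void)
{
    int     i;
    int     keep = 0;

    for (i = 0; i < ntracked; i++)
    {
        if (table[i].f + table[i].delta > bucket)
            table[keep++] = table[i];
    }
    ntracked = keep;
    bucket++;
}

/* Count one lexeme: bump its counter, or start tracking it with delta = b_current - 1. */
static void
add_lexeme(const char *lex)
{
    int     i;

    for (i = 0; i < ntracked; i++)
    {
        if (strcmp(table[i].lexeme, lex) == 0)
        {
            table[i].f++;
            return;
        }
    }
    if (ntracked < MAX_TRACKED)
    {
        strncpy(table[ntracked].lexeme, lex, sizeof(table[ntracked].lexeme) - 1);
        table[ntracked].lexeme[sizeof(table[ntracked].lexeme) - 1] = '\0';
        table[ntracked].f = 1;
        table[ntracked].delta = bucket - 1;
        ntracked++;
    }
}

/* Feed a stream of n lexemes; w = ceil(1/e) is the bucket width. */
static void
process_stream(const char **lexemes, int n, int w)
{
    int     i;

    for (i = 0; i < n; i++)
    {
        add_lexeme(lexemes[i]);
        if ((i + 1) % w == 0)
            prune();
    }
}

int
main(void)
{
    const char *stream[] = {"cat", "dog", "cat", "fish", "cat", "dog"};
    int     i;

    process_stream(stream, 6, 3);   /* w = 3, i.e. e = 1/3: absurdly coarse, just a demo */
    for (i = 0; i < ntracked; i++)
        printf("%s: f=%d delta=%d\n", table[i].lexeme, table[i].f, table[i].delta);
    return 0;
}

Lexemes that arrive late or rarely pick up a large delta and get swept out
at the next pruning, which is exactly the behaviour behind the "almost all
lexemes are only seen a few times and get removed" remark above.
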
From: Jesper Krogh on 26 May 2010 00:19

On 26/05/2010, at 01.16, Jan Urbański <wulczer(a)wulczer.org> wrote:

> On 19/05/10 21:01, Jesper Krogh wrote:
>> The document base is around 350.000 documents and
>> I have set the statistics target on the tsvector column
>> to 1000 since the 100 seems way off.
>
> So for tsvectors the statistics target means more or less "at any time
> track at most 10 * <target> lexemes simultaneously", where "track" means
> keeping them in memory while going through the tuples being analysed.
>
> Remember that the measure is in lexemes, not whole tsvectors, and the 10
> factor is meant to approximate the average number of unique lexemes in a
> tsvector. If your documents are very large, this might not be a good
> approximation.

I just did an avg(length(document_tsvector)), which is 154.
Doc count is 1.3m now in my sample set.

>> But the distribution is very "flat" at the end, the last 128 values are
>> exactly
>> 1.00189e-05
>> which means that any term sitting outside the array would get an
>> estimate of
>> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
>
> Yeah, this might mean that you could try cranking up the stats target a
> lot, to make the set of simultaneously tracked lexemes larger (it will
> cost time and memory during analyse though). If the documents have
> completely different contents, what can happen is that almost all
> lexemes are only seen a few times and get removed during the pruning of
> the working set. I have seen similar behaviour while working on the
> typanalyze function for tsvectors.

I think I would prefer something less "magic". I can increase the
statistics target and get more reliable data, but that also increases the
number of tuples picked out for analysis, which is really time consuming.

But that also means that what gets stored as the lower bound of the
histogram isn't anywhere near the true lower bound; it is more like the
lower bound of the "artificial" histogram left over after the last pruning.

I would suggest that the pruning at the end should be aware of this.
Perhaps by keeping track of the least frequent value that never got pruned
and using that as the last pruning and lower bound?

Thanks a lot for the explanation; it fits fairly well with why I couldn't
construct a simple test set that had the problem.

>> So far I have no idea if this is bad or good, so a couple of sample runs
>> of stuff that
>> is sitting outside the "most_common_vals" array:
>>
>> [gathered statistics suck]
>>
>> So the "most_common_vals" seems to contain a lot of values that should
>> never have been kept in favor
>> of other values that are more common.
>>
>> In practice, just cranking the statistics estimate up high enough seems
>> to solve the problem, but doesn't
>> there seem to be something wrong in how the statistics are collected?
>
> The algorithm to determine most common vals does not do it accurately.
> That would require keeping all lexemes from the analysed tsvectors in
> memory, which would be impractical. If you want to learn more about the
> algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments
> in ts_typanalyze.c

I'll do some reading.

Jesper

From: Jan Urbański on 28 May 2010 16:04

On 28/05/10 04:47, Tom Lane wrote:
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it. The paper says (last sentence in
> the definition of the algorithm, section 4.2):
>
>   When a user requests a list of items with threshold s, we output
>   those entries in D where f >= (s - e) * N.
>
> What we are actually doing is emitting every entry with f >= 2. Since
> e is fixed at 1/w, this effectively means that we are setting s to be
> only fractionally greater than e, which means very large relative errors
> in the estimates.

I gave it a thought and reread the paper, but since I already blundered
once, please verify me on this.

We follow the algorithm as written; the trouble starts when we want to
output the result. The paper says which items from the D structure should
be returned when the user asks for items with frequencies higher than a
threshold s. What we want to put in the statistics table are items
accompanied by their frequencies, so we need to do some reasoning before
we can construct the result.

Say we have an item I with count f (taken from our D structure). The
total number of entries is N. The question would be: what is the minimum
frequency the user could specify that would still make us output this
element. From

  f >= (s - e) * N

we can say it's

  s <= (f / N) + e

So the user would have to ask for items that occur with frequency
(f / N) + e or less. This would mean that the corresponding value in the
statistics entry should be

  < I, (f / N) + e >

The thing is, this doesn't change much, because currently we are putting
(f / N) there, and e is set to 1 / (stats_target * 10), so the difference
would not be dramatic.

> Or, if you want it explained another way: we are emitting words whose f
> is very small and whose delta is very large, representing items that got
> added to the scan very late. These really shouldn't be there because
> their true frequency is probably a lot less than the intended threshold.

Well, the idea of the algorithm is that if their frequency had been
bigger, they would have appeared earlier and survived the pruning, as I
understand it.

> The net effect of this is first that there are a lot of rather useless
> entries in the MCV list whose claimed frequency is quite small, like as
> little as two occurrences. Their true frequency could be quite a bit
> more. What's even worse is that we believe that the minimum of these
> claimed frequencies is a reliable upper bound for the frequencies of
> items not listed in the MCV list.

Per the algorithm it *is* the upper bound, if I got my maths right. The
last item in the list would not be returned if the requested frequency
were higher than the one associated with that item.

> So I think we have to fix this. The right thing is to select s and e
> values that are actually meaningful in the terms of the paper, then
> compute w from that, and be sure to enforce the minimum f value
> specified in the algorithm ... ie, don't be afraid to throw away values
> in the final D table.

I think we should definitely prune the table one last time, in the very
probable case that the loop ended in between two of the regularly
scheduled prunes.

As for selecting the algorithm parameters: we don't get to select s. We
do get to select e, but that's it.

I have a feeling that our problems are caused by the fact that the
algorithm tries to answer the question "which elements occur more
frequently than X", while we actually want the answer to "what's the
frequency of each element?". I've almost convinced myself that the
transformation between the answers to these questions exists, but this
trouble report keeps giving me doubts.

Cheers,
Jan

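To make the output step concrete, the rule Tom quotes from the paper,
together with Jan's bound, looks roughly like this (an illustrative sketch
with made-up names and plain arrays, not the committed PostgreSQL code):

#include <stdio.h>

/*
 * Given the tracked counts, the stream length N and chosen s and e, mark
 * which entries would be emitted per the paper's rule and compute the
 * frequency to store. Returns the number of emitted entries.
 */
static int
emit_mcv(const int *counts, int ntracked, double N,
         double s, double e, int *emit, double *stored_freq)
{
    int     i;
    int     nout = 0;

    for (i = 0; i < ntracked; i++)
    {
        /* the paper's rule: output entries in D where f >= (s - e) * N */
        if ((double) counts[i] >= (s - e) * N)
        {
            emit[i] = 1;
            /* what gets stored is f/N; the true frequency is at most f/N + e */
            stored_freq[i] = (double) counts[i] / N;
            nout++;
        }
        else
            emit[i] = 0;
    }
    return nout;
}

int
main(void)
{
    int     counts[] = {1612, 40, 4, 2};    /* counts in the spirit of the report */
    int     emit[4];
    double  freq[4];
    double  N = 350174.0;
    double  s = 1e-4;                       /* made-up threshold */
    double  e = 1e-5;                       /* made-up error bound */
    int     i;

    emit_mcv(counts, 4, N, s, e, emit, freq);
    for (i = 0; i < 4; i++)
        printf("count=%d emitted=%d stored_freq=%g\n",
               counts[i], emit[i], emit[i] ? freq[i] : 0.0);
    return 0;
}

With these example values (s - e) * N is about 31.5, so the entries with
counts 4 and 2 are thrown away rather than kept in the MCV list, which is
the behaviour Tom is arguing for.
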
From: Tom Lane on 28 May 2010 16:22
Jan Urbański <wulczer(a)wulczer.org> writes:
> We follow the algorithm as written; the trouble starts when we want to
> output the result. The paper says which items from the D structure
> should be returned when the user asks for items with frequencies
> higher than a threshold s. What we want to put in the statistics table
> are items accompanied by their frequencies, so we need to do some
> reasoning before we can construct the result.

Well, the estimated frequency is still just f/N. The point is that we
must filter out items with small f values because they're probably
inaccurate --- in particular, anything with f < eN is completely
untrustworthy. I agree that we currently aren't bothering to determine a
specific s value, but we probably need to do that in order to have a
clear understanding of what we are getting.

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate what
the frequency of the K'th element will be, where K is the target number
of MCV entries or perhaps a bit more. Then use that estimate as the "s"
value, and set e = s/10 or so, and then w = 1/e and continue as per the
paper. If the eventual filtering results in a lot less than the target
number of MCV entries (because the input wasn't so Zipfian), we lose, but
at least we have accurate numbers for the entries we kept.

> I think we should definitely prune the table one last time, in the very
> probable case that the loop ended in between two of the regularly
> scheduled prunes.

The paper doesn't say that you need to do that. I suspect if you work
through the math, you'll find out that the minimum-f filter skips
anything that would have been pruned anyway.

			regards, tom lane
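
A rough sketch of the parameter selection being floated here, assuming a
rank-1 Zipfian distribution with a harmonic-number normalizer (the
function name, the Euler constant approximation and the n_distinct input
are all assumptions for illustration, not committed code):

#include <math.h>
#include <stdio.h>

/*
 * Estimate the frequency of the K'th most common lexeme under a Zipf(1)
 * assumption, use it as the threshold s, then derive e = s/10 and the
 * bucket width w = 1/e, as proposed above.  n_distinct is whatever
 * estimate of the number of distinct lexemes is available.
 */
static void
choose_lc_params(int K, double n_distinct, double *s, double *e, int *w)
{
    /* harmonic number H_n ~ ln(n) + Euler's gamma, the Zipf(1) normalizer */
    double  H = log(n_distinct) + 0.5772156649;

    *s = 1.0 / ((double) K * H);    /* expected frequency of the K'th item */
    *e = *s / 10.0;
    *w = (int) ceil(1.0 / *e);
}

int
main(void)
{
    double  s;
    double  e;
    int     w;

    /* e.g. target 1000 MCV entries out of roughly 1e6 distinct lexemes */
    choose_lc_params(1000, 1e6, &s, &e, &w);
    printf("s = %g, e = %g, w = %d\n", s, e, w);
    return 0;
}

With these illustrative inputs s comes out around 7e-5 and w around
144000, i.e. a far larger bucket width (and a much stricter final filter)
than setting s only fractionally above e, which is the problem being
discussed.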