From: Jan Urbański on 29 May 2010 09:56

On 29/05/10 12:34, Jesper Krogh wrote:
> On 2010-05-28 23:47, Jan Urbański wrote:
>> On 28/05/10 22:22, Tom Lane wrote:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>>
> I think you should skip the assumption about stop-words; users may
> use something where they are needed in the index, or have a language
> other than the typical one. (And they don't seem to influence the math
> that much.)

Turns out it has a nearly linear influence on the bucket width and on the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

> Isn't it the same "type" of logic that is used for collecting statistics
> for "array-types", say integer-arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell;
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).

>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>>
>> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
>
> I'm not sure I get this one... does this mean that we prune every time
> we have collected 167 new datapoints? That would seem too often for me,
> since that would roughly be once per "row".

Hm, if we pick s to be 0.06, we say that the K'th word in the English
language will have a frequency of 0.06, so if we want to have statistics
with an error of s/10, we can prune every 167 lexemes (K is the
statistics target, possibly + top_stopwords).

Hm, I am now thinking that maybe this theory is flawed, because tsvectors
contain only *unique* words, and Zipf's law is talking about words in
documents in general. Normally a word like "the" would appear lots of
times in a document, but (even ignoring the fact that it's a stopword and
so won't appear at all) in a tsvector it will be present only once. This
may or may not be a problem; I'm not sure whether such "squashing" of
occurrences as tsvectors do skews the distribution away from Zipfian or
not. Anyway, figuring that out would require some more math and thinking,
and to fix the problem at hand we can say Zipf is good enough.

>> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>>
>> So assuming that on average a tsvector has 154 elements and that we went
>> through 35017 rows (as it would be in Jesper's case, before he raised
>> the stats target from 100 to 1000), we will remove lexemes with
>> f < 0.054 * 35017 * 154, that is f < 291201.37
>>
>> I wonder what would happen in Jesper's case if we did that... And I
>> wonder how sound that maths is.
>
> If it means that I would get an accurate MCE-histogram for all
> things that have an occurrence of more than 5.4% of the rows
> (given the samples chosen), then I think that would be really
> reasonable.

Here's what the spreadsheet spat out. The variables are:

 * the statistics target
 * top stopwords
 * error factor

Where top stopwords is the number of top words in the English language
that would be stopwords. You can also think about it as the smudge factor
determining how well we trust that the distribution is Zipfian.
Theoretically, if you want to keep X values in the MCE array, you should
discard inputs with frequency lower than the frequency of the X'th value
in a Zipfian distribution. If you were to write out all English words and
their frequencies (according to Zipf's law), the top Y of them would be
stopwords. So we want to discard words whose frequency is lower than that
of the word at rank X + Y, and then we probably want to have some
breathing space as well. That cutoff frequency is called s in the LC
algorithm.

The error factor determines the relation between s and e, since
apparently we want e to be proportional to s (e is the error from the LC
algorithm). It directly determines the bucket width, since the larger the
bucket, the more accurate the results will be, as there will be less
pruning going on.

There are also constants: H(len(eng)) is the harmonic number from Zipf's
law, which assuming 1e6 words in English is 6.5. The tsvector length and
the number of rows in the sample are just some values to get concrete
numbers out. They influence the final pruning frequency, because the rule
is f < (s - e) * N, where N is the total number of lexemes seen.

The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose top_stopwords and error_factor to both
be 100. With your dataset this would mean pruning every 3076 lexemes and
discarding from the result all lexemes with < 173507 occurrences. With
the statistics target set to 1000 it would change to 16923 and 31546,
respectively.

> I can "fairly easy" try out patches or do other kind of testing.

I'll try to come up with a patch for you to try and fiddle with these
values before Monday.

Cheers,
Jan
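To make the arithmetic concrete, here is a minimal Python sketch (not the
actual spreadsheet; the function name and layout are made up) that
appears to reproduce the numbers quoted above from the quantities named
in this mail. Note that it uses the s estimate from before the
correction discussed elsewhere in the thread, i.e.
s = H(W) / (target + top_stopwords).

    # Rough sketch, not the real spreadsheet or the ts_typanalyze code.
    H_W = 6.5            # assumed harmonic number for ~1e6 English words
    TSVECTOR_LEN = 154   # average lexemes per tsvector (Jesper's data)
    ROWS = 35017         # rows in the statistics sample

    def lc_parameters(stats_target, top_stopwords, error_factor):
        s = H_W / (stats_target + top_stopwords)  # cutoff frequency
        e = s / error_factor                      # LC error bound
        bucket_width = int(1 / e)                 # prune every w = 1/e lexemes
        n = TSVECTOR_LEN * ROWS                   # total lexemes seen
        final_prune = int((s - e) * n)            # drop lexemes seen fewer times
        return bucket_width, final_prune

    print(lc_parameters(100, 100, 100))    # (3076, 173507), as quoted above
    print(lc_parameters(1000, 100, 100))   # (16923, 31546)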
From: Jan Urbański on 29 May 2010 12:14

On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański <wulczer(a)wulczer.org> writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22 which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1; the frequency of the word with rank K is

    freq(K) = 1 / (K * H(W))

where H(x) = 1 + 1/2 + 1/3 + ... + 1/x, and W is the number of words in
English.

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

    H(n) = ln(n) + 0.5772156649 + 1/(2n) - 1/(12n^2) + 1/(120n^4) + O(n^-6)

Assuming 1 million words in English (the correction terms of the
expansion are negligible at that size), we get H(1e6) = 14.3927, which
would make the frequency of the K'th word 1/(14.3927 * K), that is
0.06948 / K (let's say 0.07 / K). Which brings me to the same result as
yours, which in turn reassures me a lot ;) My previous result was wrong
because I used the wrong logarithm base, go figure.

So with this, for a statistics target of 100 we would predict the
frequency of the 100th word to be 0.0007. Assuming 154 * 35017 lexemes in
the input, the bucket width and the final pruning value depend only on
the epsilon that we choose for the LC algorithm.

So, if we want e to be equal to s, we'd prune every 1/s = 1/0.0007 = 1428
lexemes and would not discard anything from the result. If we want e to
be s/2, we'd prune every 2857 lexemes and discard lexemes with counts
< 1887. For s/3, s/4 etc. the numbers (bucket width, final cutoff) look
like this:

    s/1   1428     0
    s/2   2857  1887
    s/3   4285  2516
    s/4   5714  2831
    s/5   7142  3019
    s/6   8571  3145
    s/7  10000  3235
    s/8  11428  3302
    s/9  12857  3355

s/2 or s/3 look reasonable.

So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan
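For reference, a small Python sketch (not the proposed patch) that
recomputes the table above from the assumptions stated in this mail:
s = 0.07 / statistics_target and N = 154 * 35017 lexemes in the sample.
It uses exact rational arithmetic so the truncation matches the quoted
numbers.

    from fractions import Fraction

    s = Fraction(7, 100) / 100   # 0.07 / statistics target of 100 = 0.0007
    N = 154 * 35017              # total lexemes seen in the sample

    for k in range(1, 10):
        e = s / k                     # LC error bound, epsilon = s/k
        bucket_width = int(1 / e)     # prune the D structure every 1/e lexemes
        cutoff = int((s - e) * N)     # final pruning: drop counts below this
        print(f"s/{k}  {bucket_width:6d}  {cutoff:5d}")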
From: Tom Lane on 29 May 2010 12:38

Jan Urbański <wulczer(a)wulczer.org> writes:
> [ e of ] s/2 or s/3 look reasonable.

The examples in the LC paper seem to all use e = s/10. Note the stated
assumption e << s.

> So, should I just write a patch that sets the bucket width and pruning
> count using 0.07 as the assumed frequency of the most common word and
> epsilon equal to s/2 or s/3?

I'd go with s = 0.07 / desired-MCE-count and e = s / 10, at least for a
first cut to experiment with.

            regards, tom lane
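Plugged into the same sample sizes used earlier in the thread (154
lexemes per row, 35017 rows), that first-cut suggestion works out roughly
as follows. This is an illustrative sketch only; the resulting numbers
are not from the thread itself.

    def first_cut(mce_count, lexemes_per_row=154, rows=35017):
        s = 0.07 / mce_count   # assumed frequency of the mce_count'th word
        e = s / 10             # e << s, as in the LC paper's examples
        bucket_width = int(1 / e)
        cutoff = int((s - e) * lexemes_per_row * rows)
        return bucket_width, cutoff

    print(first_cut(100))      # roughly (14285, 3397) under these assumptions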
From: Tom Lane on 29 May 2010 11:09

Jan Urbański <wulczer(a)wulczer.org> writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.

For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law. The interesting property is the rule
"the k'th most common element occurs 1/k as often as the most common
one". So if you suppose the most common lexeme has frequency 0.1, the
100'th most common should have frequency around 0.001. That's pretty
crude of course, but it seems like the right ballpark.

            regards, tom lane
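A quick numeric check of the two points above (illustrative only):

    # Under the crude 1/k rule, the 100th most common word has 1/100 the
    # frequency of the most common one.
    top_freq = 0.1
    print(top_freq / 100)   # 0.001 for the 100th most common word

    # And a frequency of 0.06 at rank 100 is impossible: ranks 1..100 would
    # each have at least that much, so their total would already exceed 1.0.
    print(100 * 0.06)       # 6.0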
From: Tom Lane on 29 May 2010 11:34

Jan Urbański <wulczer(a)wulczer.org> writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there. It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which makes the total quite a lot more than 1.0.

> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07
to make the total come out to 1.0 for a reasonable number of words.
So we could use s = 0.07 / K when we wanted a final list of K words.
Some people (including the LC paper) prefer a higher exponent, ie
1/k^S with S around 1.25. That makes the F1 value around 0.22 which
seems awfully high for the type of data we're working with, so I think
the 1/k rule is probably what we want here.

            regards, tom lane
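The two F1 figures above can be checked with a short sketch (Python,
illustrative only), taking the first-rank frequency as the reciprocal of
the generalized harmonic number sum(1/k^S) over an assumed vocabulary of
1e6 words, the figure used earlier in the thread:

    W = 1_000_000   # assumed vocabulary size

    for S in (1.0, 1.25):
        harmonic = sum(1.0 / k**S for k in range(1, W + 1))
        print(f"S = {S}: F1 = {1.0 / harmonic:.3f}")

    # Prints roughly F1 = 0.069 for S = 1 and F1 = 0.224 for S = 1.25,
    # in line with the 0.07 and 0.22 quoted above.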