From: Jan Urbański on
On 29/05/10 12:34, Jesper Krogh wrote:
> On 2010-05-28 23:47, Jan Urbański wrote:
>> On 28/05/10 22:22, Tom Lane wrote:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>>
> I think you should skip the assumption about stop-words; users may
> use something where they are needed in the index, or have a language
> other than the typical one (and they don't seem to influence the math
> that much).

Turns out it has nearly linear influence on the bucket width and the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

> Isn't it the same "type" of logic that is used for collecting
> statistics for "array-types", say integer-arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on
whole column values. ts_typanalyze is the only typanalyze function that
takes the trouble of looping over the actual contents of each cell; all
the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).

>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>>
>> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
>>
>
> I'm not sure I get this one... does this mean that we prune every
> time we have collected 167 new data points? That would seem too often
> to me, since that would roughly be once per "row".

Hm, if we pick s to be 0.06, we say that the K'th word in the English
language will have a frequency of 0.06, so if we want to have statistics
with an error of s/10, we can prune every 167 lexemes (K is the
statistics target, possibly +top_stopwords).
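
Just to make the mechanics explicit, a minimal Python sketch of how the
bucket width falls out of s and e in Lossy Counting (the variable names
are mine; s = 0.06 is the value assumed above):

import math

s = 0.06               # assumed frequency of the K'th most common word (from above)
e = s / 10             # error bound, one tenth of s
w = math.ceil(1 / e)   # LC bucket width: prune the D structure every w lexemes
print(w)               # 167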

Hm, I am now thinking that maybe this theory is flawed, because
tsvectors contain only *unique* words, and Zipf's law is talking about
words in documents in general. Normally a word like "the" would appear
lots of times in a document, but (even ignoring the fact that it's a
stopword and so won't appear at all) in a tsvector it will be present
only once. This may or may not be a problem; I'm not sure whether the
"squashing" of occurrences that tsvectors do skews the distribution away
from Zipfian or not.

Anyway, figuring that out would require some more math and thinking, and
to fix the problem at hand we can say Zipf is good enough.

>> After that, we remove lexemes with f< 0.9 * 0.06 * N = 0.054*N
>>
>> So assuming that on average a tsvector has 154 elements and that we went
>> through 35017 rows (as it would be in Jesper's case, before he raised
>> the stats target from 100 to 1000), we will remove lexemes with f<
>> 0.054 * 35017 * 154 that is f< 291201.37
>>
>> I wonder what would happen in Jesper's case if we did that... And I
>> wonder how sound that maths is
>>
>
> If it means that I would get an accurate MCE-histogram for all
> things that occur in more than 5.4% of the rows
> (given the samples chosen), then I think that would be really
> reasonable.

Here's what the spreadsheet spat out.

The variables are:
* the statistics target
* top stopwords
* error factor

Where top stopwords is the number of top words in the English language
that would be stopwords. You can also think of it as a smudge factor
determining how much we trust that the distribution is Zipfian.
Theoretically, if you want to keep X values in the MCE array, you should
discard inputs with frequency lower than the frequency of the X'th value
in a Zipfian distribution. If you wrote out all English words and their
frequencies (according to Zipf's law), the top Y of them would be
stopwords, so we actually want to discard words whose frequency is lower
than that of the (X + Y)'th word, and then we probably want to have some
breathing space as well. That cutoff frequency is called s in the LC
algorithm.

Error factor determines the relation between s and e, since apparently
we want e to be proportional to s (e is the error bound from the LC
algorithm). It directly determines the bucket width: the larger the
bucket, the more accurate the results, since there will be less pruning
going on.

There are also constants: H(len(eng)) is the harmonic number from
Zipf's law, which, assuming 1e6 words in English, is 6.5. The tsvector
length and rows in sample are just some values to get concrete numbers
out. They influence the final pruning frequency, because the rule is
f < (s - e)N, where N is the total number of lexemes seen.
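
In code terms, this is roughly how I think of the knobs (a sketch; the
s = 1/((target + stopwords) * H(W)) and e = s / error_factor formulas
are my reading of the description above, and the spreadsheet may apply
further fudge factors, so its concrete numbers can differ):

import math

def lc_parameters(stats_target, top_stopwords, error_factor,
                  harmonic_w, tsvector_len, rows_in_sample):
    # Cutoff: the Zipfian frequency of the (target + stopwords)'th word.
    s = 1.0 / ((stats_target + top_stopwords) * harmonic_w)
    # Error bound proportional to s.
    e = s / error_factor
    # LC bucket width, and the final pruning rule f < (s - e) * N.
    w = math.ceil(1.0 / e)
    n = tsvector_len * rows_in_sample
    return s, e, w, (s - e) * n

# With the numbers from the first message: s comes out around 0.0014.
print(lc_parameters(100, 10, 10, 6.5, 154, 35017))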

The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose setting both top_stopwords and
error_factor to 100.

With your dataset this would mean pruning every 3076 lexemes and
discarding from the result all lexemes with < 173507 occurrences. With
statistics target set to 1000 it would change to 16923 and 31546,
respectively.

> I can "fairly easy" try out patches or do other kind of testing.

I'll try to come up with a patch for you to try and fiddle with these
values before Monday.

Cheers,
Jan
From: Jan Urbański on
On 29/05/10 17:34, Tom Lane wrote:
> Jan Urbański <wulczer(a)wulczer.org> writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22 which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

freq(K) = 1 / (K * H(W)), where H(x) = 1 + 1/2 + 1/3 + ... + 1/x and W
is the number of words in English

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

H(n) = ln(n) + 0.5772156649 + 1/(2n) - 1/(12n^2) + 1/(120n^4) + O(n^-6)

Assuming 1 million words in English and treating the remaining terms of
the harmonic expansion as negligible, we get H(1e6) = 14.3927, which
would make the frequency of the K'th word 1/(14.3927 * K), that is
0.06948 / K (let's say 0.07 / K).
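
As a quick sanity check (a sketch; the 1e6-word vocabulary is the
assumption above):

import math

W = 10**6                   # assumed number of words in English
gamma = 0.5772156649        # Euler-Mascheroni constant
H_W = math.log(W) + gamma   # remaining expansion terms are negligible here
print(H_W)                  # 14.392726...
print(1 / H_W)              # 0.069479... -> frequency of the most common word
print(1 / (100 * H_W))      # 0.000694... -> frequency of the 100th word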

Which brings me to the same result as yours, which in turn reassures me
a lot ;) My previous result was wrong because I used the wrong logarithm
base, go figure.

So with this, for a statistics target of 100 we would predict the
frequency of the 100th word to be 0.0007. Assuming 154 * 35017 lexemes
in the input, the bucket width and the final pruning value depend only
on the epsilon that we choose for the LC algorithm.

So, if we want e to be equal to s, we'd prune every 1/s = 1/0.0007 =
1428 lexemes and would not discard anything from the result. If we want
e to be s/2 we'd prune every 2857 lexemes and discard lexemes with
counts < 1887. For s/3, s/4 etc the numbers look like this:

e      bucket width   final cutoff
s/1    1428           0
s/2    2857           1887
s/3    4285           2516
s/4    5714           2831
s/5    7142           3019
s/6    8571           3145
s/7    10000          3235
s/8    11428          3302
s/9    12857          3355

s/2 or s/3 look reasonable.
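
The table can be reproduced directly from the LC rules w = 1/e and
discard-below (s - e) * N; a sketch, using the 154 * 35017 sample size
assumed earlier and integer truncation as in the table:

s = 0.0007            # predicted frequency of the 100th word
N = 154 * 35017       # lexemes in the sample

for k in range(1, 10):
    e = s / k                   # epsilon for the LC algorithm
    w = int(1 / e)              # bucket width: prune every w lexemes
    cutoff = int((s - e) * N)   # discard lexemes seen fewer times than this
    print("s/%d %6d %6d" % (k, w, cutoff))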

So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan

From: Tom Lane on
Jan Urbański <wulczer(a)wulczer.org> writes:
> [ e of ] s/2 or s/3 look reasonable.

The examples in the LC paper seem to all use e = s/10. Note the stated
assumption e << s.

> So, should I just write a patch that sets the bucket width and pruning
> count using 0.07 as the assumed frequency of the most common word and
> epsilon equal to s/2 or s/3?

I'd go with s = 0.07 / desired-MCE-count and e = s / 10, at least for
a first cut to experiment with.
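
For a rough feel of what that first cut would mean with the sample
discussed upthread (a sketch; the 154 * 35017 lexeme count is the
earlier assumption, and the integer truncation is mine):

K = 100               # desired MCE count (statistics target)
s = 0.07 / K          # 0.0007
e = s / 10            # 0.00007
N = 154 * 35017

print(int(1 / e))         # 14285 -> bucket width, i.e. prune every ~14k lexemes
print(int((s - e) * N))   # 3397  -> discard lexemes with fewer occurrences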

regards, tom lane

From: Tom Lane on
Jan Urbański <wulczer(a)wulczer.org> writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.

For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law. The interesting property is the rule
"the k'th most common element occurs 1/k as often as the most common one".
So if you suppose the most common lexeme has frequency 0.1, the 100'th
most common should have frequency around 0.0001. That's pretty crude
of course but it seems like the right ballpark.

regards, tom lane

From: Tom Lane on
Jan Urbański <wulczer(a)wulczer.org> writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there. It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which makes the total quite a lot more than 1.0.

> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07
to make the total come out to 1.0 for a reasonable number of words.
So we could use s = 0.07 / K when we wanted a final list of K words.
Some people (including the LC paper) prefer a higher exponent, ie
1/k^S with S around 1.25. That makes the F1 value around 0.22 which
seems awfully high for the type of data we're working with, so I think
the 1/k rule is probably what we want here.
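
Both first-frequency figures are easy to check numerically (a sketch;
the 1e6-word vocabulary is just an assumed cutoff for the sum):

W = 10**6   # assumed vocabulary size

for S in (1.0, 1.25):
    total = sum(k ** -S for k in range(1, W + 1))
    print(S, 1.0 / total)   # F1, assuming the frequencies sum to 1
# 1.0  -> ~0.069 (about 0.07)
# 1.25 -> ~0.22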

regards, tom lane
