From: =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= on
On 28/05/10 22:22, Tom Lane wrote:
> The idea that I was toying with is to assume a Zipfian distribution of
> the input (with some reasonable parameter), and use that to estimate
> what the frequency of the K'th element will be, where K is the target
> number of MCV entries or perhaps a bit more. Then use that estimate as
> the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
> per the paper. If the eventual filtering results in a lot less than the
> target number of MCV entries (because the input wasn't so Zipfian), we
> lose, but at least we have accurate numbers for the entries we kept.

I see what you mean, so the idea would be:

* assume some value of W as the number of all words in the language
* estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
number and st is the statistics target, using Zipf's law
* set e = s/10 and w = 1/e, that is 10/s
* perform LC using that value of w
* remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
is the total number of lexemes processed
* create the MCELEM entries as (item, f/N)

Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.

Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes

After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

So assuming that on average a tsvector has 154 elements and that we went
through 35017 rows (as it would be in Jesper's case, before he raised
the stats target from 100 to 1000), we will remove lexemes with f <
0.054 * 35017 * 154 that is f < 291201.37

I wonder what would happen if Jasper's case if we did that... And I
wonder how sound that maths is.

>> I we should definitely prune the table one last time in the very
>> probable case that the loop ended in the middle of two regularly
>> happening prunes.
>
> The paper doesn't say that you need to do that. I suspect if you work
> through the math, you'll find out that the minimum-f filter skips
> anything that would have been pruned anyway.

Possible.

Cheers,
Jan

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= <wulczer(a)wulczer.org> writes:
> Hm, I am now thinking that maybe this theory is flawed, because tsvecors
> contain only *unique* words, and Zipf's law is talking about words in
> documents in general. Normally a word like "the" would appear lots of
> times in a document, but (even ignoring the fact that it's a stopword
> and so won't appear at all) in a tsvector it will be present only once.
> This may or may not be a problem, not sure if such "squashing" of
> occurences as tsvectors do skewes the distribution away from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents. In any case we are not really depending on Zipf's
law heavily with this approach. The worst-case result if it's wrong
is that we end up with an MCE list shorter than our original target.
I suggest we could try this and see if we notice that happening a lot.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Jesper Krogh on
On 2010-05-28 23:47, Jan UrbaƄski wrote:
> On 28/05/10 22:22, Tom Lane wrote:
>
>> The idea that I was toying with is to assume a Zipfian distribution of
>> the input (with some reasonable parameter), and use that to estimate
>> what the frequency of the K'th element will be, where K is the target
>> number of MCV entries or perhaps a bit more. Then use that estimate as
>> the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
>> per the paper. If the eventual filtering results in a lot less than the
>> target number of MCV entries (because the input wasn't so Zipfian), we
>> lose, but at least we have accurate numbers for the entries we kept.
>>
> I see what you mean, so the idea would be:
>
> * assume some value of W as the number of all words in the language
> * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
> number and st is the statistics target, using Zipf's law
> * set e = s/10 and w = 1/e, that is 10/s
> * perform LC using that value of w
> * remove all elements for which f< (s-e)N, that is f< 0.9*sN, where N
> is the total number of lexemes processed
> * create the MCELEM entries as (item, f/N)
>
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.
>
> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.
>

I think you should skip the assumption about stop-words, users may
use something where they are needed in the index or have a language
than the typical. (and they dont seem to influcence the math that much).

Isn't it the same "type" of logic that is used for collecting statistics
for
"array-types", say integer-arrays and text arrays?
> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
>

Im not sure I get this one.. does this mean that we prune everytime
we have collected 167 new datapoints .. that would seem too often
for me since that would roughly be once per "row".

> After that, we remove lexemes with f< 0.9 * 0.06 * N = 0.054*N
>
> So assuming that on average a tsvector has 154 elements and that we went
> through 35017 rows (as it would be in Jesper's case, before he raised
> the stats target from 100 to 1000), we will remove lexemes with f<
> 0.054 * 35017 * 154 that is f< 291201.37
>
> I wonder what would happen if Jasper's case if we did that... And I
> wonder how sound that maths is
>

If it means that I would get an accurate MCE-histogram for all
things that have an occourance of more than 5.4% of the rows
(given the samples chosen), then I think that would be really
reasonable.

I can "fairly easy" try out patches or do other kind of testing.

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= on
On 29/05/10 17:09, Tom Lane wrote:
> =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= <wulczer(a)wulczer.org> writes:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>
>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> There is definitely something wrong with your math there. It's not
> possible for the 100'th most common word to have a frequency as high
> as 0.06 --- the ones above it presumably have larger frequencies,
> which makes the total quite a lot more than 1.0.

Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

With regards to my other mail this means that top_stopwords = 10 and
error_factor = 10 would mean bucket_width = 7150 and final prune value
of 6787.

Jan

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Jesper Krogh on
On 2010-05-28 04:47, Tom Lane wrote:
> Cranking up the stats target actually makes it worse not better, since
> low-frequency items are then more likely to get into the MCV list
>

I should have been more precise in the wording. Cranking up the
stats target gave me overall a "better plan", but that is due to that
the range in the MCE histogram where the query-plan for my sample
query tipped from a "Bitmap Index Scan" on the gin-index to
"Index Scan" on a btree index actually became reliable.

This is more due to the nature of my application and test queries
than has anything to do with the correctness of the MCE histogram.

So cranking up the statistics target made the problem move
to somewhere, where it didnt matter that much to me.

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers