From: Alvaro Herrera on
Excerpts from Jesper Krogh's message of Wed May 19 15:01:18 -0400 2010:

> But the distribution is very "flat" at the end, the last 128 values are
> exactly
> 1.00189e-05
> which means that any term sitting outside the array would get an estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

I don't know if this is related, but tsvector stats are computed and
stored per term, not per datum. This is different from all other
datatypes. Maybe there's code somewhere that's assuming per-datum and
coming up with the wrong estimates? Or maybe the tsvector-specific code
contains a bug somewhere; maybe a rounding error?

--
Álvaro Herrera <alvherre(a)alvh.no-ip.org>


From: Jan Urbański on
On 19/05/10 21:01, Jesper Krogh wrote:
> The document base is around 350,000 documents and
> I have set the statistics target on the tsvector column
> to 1000 since 100 seems way off.

So for tsvectors the statistics target means more or less "at any time
track at most 10 * <target> lexemes simultaneously" where "track" means
keeping them in memory while going through the tuples being analysed.

Remember that the measure is in lexemes, not whole tsvectors and the 10
factor is meant to approximate the average number of unique lexemes in a
tsvector. If your documents are very large, this might not be a good
approximation.

> # ANALYZE verbose reference (document_tsvector);
> INFO: analyzing "reference"
> INFO: "reference": scanned 14486 of 14486 pages, containing 350174 live
> rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
> ANALYZE
>
> Ok, so analyze examined almost all rows. Looking into
> "most_common_freqs" I find
> # select count(unnest) from (select unnest(most_common_freqs) from
> pg_stats where attname = 'document_tsvector') as foo;
> count
> -------
> 2810
> (1 row)

So the most_common_freqs and most_common_vals arrays in pg_statistic for
tsvectors have an upper bound of <stats-target> * 10 entries (for the
same reasons as mentioned before) and hold lexemes (not whole
tsvectors). Also, lexemes that were seen only once while going through
the analysed set are discarded, which is why you can actually get fewer
entries in these arrays, even if your document set is big.


> But the distribution is very "flat" at the end, the last 128 values are
> exactly
> 1.00189e-05
> which means that any term sitting outside the array would get an
> estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

Yeah, this might mean that you could try cranking up the stats target a
lot, to make the set of simultaneously tracked lexemes larger (it will
cost time and memory during ANALYZE though). If the documents have
completely different contents, what can happen is that almost all
lexemes are only seen a few times and get removed during the pruning of
the working set. I have seen similar behaviour while working on the
typanalyze function for tsvectors.

> So far I have no idea if this is bad or good, so a couple of sample runs
> of stuff that
> is sitting outside the "most_common_vals" array:
>
> [gathered statistics suck]

> So the "most_common_vals" seems to contain a lot of values that should
> never have been kept in favor
> of other values that are more common.

> In practice, just cranking the statistics estimate up high enough seems
> to solve the problem, but doesn't
> there seem to be something wrong in how the statistics are collected?

The algorithm used to determine the most common vals does not do it
exactly. That would require keeping all lexemes from the analysed
tsvectors in memory, which would be impractical. If you want to learn
more about the algorithm being used, try reading
http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments
in ts_typanalyze.c.
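
For what it's worth, here's a minimal standalone sketch of that lossy
counting scheme. The fixed-size array, the toy stream and all the names
are mine, purely for illustration -- the real code in ts_typanalyze.c
uses a hash table and feeds in lexemes from the sampled tsvectors:

#include <stdio.h>
#include <string.h>

#define MAX_TRACKED 1024        /* illustrative bound only */

typedef struct
{
    const char *lexeme;         /* the tracked item */
    int         f;              /* occurrences counted since insertion */
    int         delta;          /* maximum possible undercount */
} TrackedLexeme;

static TrackedLexeme D[MAX_TRACKED];
static int ntracked = 0;

/* Count one incoming lexeme; b_current is ceil(N / w). */
static void
lc_insert(const char *lexeme, int b_current)
{
    int         i;

    for (i = 0; i < ntracked; i++)
    {
        if (strcmp(D[i].lexeme, lexeme) == 0)
        {
            D[i].f++;
            return;
        }
    }
    if (ntracked < MAX_TRACKED)
    {
        D[ntracked].lexeme = lexeme;
        D[ntracked].f = 1;
        D[ntracked].delta = b_current - 1;
        ntracked++;
    }
}

/* At every bucket boundary, drop entries with f + delta <= b_current. */
static void
lc_prune(int b_current)
{
    int         i,
                kept = 0;

    for (i = 0; i < ntracked; i++)
    {
        if (D[i].f + D[i].delta > b_current)
            D[kept++] = D[i];
    }
    ntracked = kept;
}

int
main(void)
{
    const char *stream[] = {"cat", "dog", "cat", "fish",
                            "cat", "dog", "cat", "bird"};
    int         w = 4;          /* bucket width, i.e. 1/e */
    int         N = 0;
    int         i;

    for (i = 0; i < 8; i++)
    {
        N++;
        lc_insert(stream[i], (N + w - 1) / w);
        if (N % w == 0)
            lc_prune(N / w);
    }

    /* Output step from the paper: given a threshold s, emit entries with
     * f >= (s - e) * N.  Here we just print everything that survived. */
    for (i = 0; i < ntracked; i++)
        printf("%s: f = %d, delta = %d, f/N = %.3f\n",
               D[i].lexeme, D[i].f, D[i].delta, (double) D[i].f / N);
    return 0;
}

With w = 4 (i.e. e = 0.25) only "cat" survives both prunes, with f = 4
and delta = 0; the paper's output rule then only reports entries whose f
clears the (s - e) * N bar.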

It would be interesting to know the average size of a tsvector in your
document set (i.e. how many unique lexemes a tsvector has on average).
In general, the tsvector typanalyze function is designed to suck less
than the constant factor that was used previously, but it only works
really well for the most common lexemes (thus preventing the most gross
misestimates). I'm not very surprised it misses the difference between
1612/350174 and 4/350174, and I'm quite happy that it gets it right if
you set the stats target really high :o)

There's always the possibility that there's some stupid bug there, but I
think you just set your expectations too high for the tsvector typanalyze
function. If you could come up with a better way of doing tsvector
stats, that would be awesome - currently it's just doing its best to
prevent the most outrageous errors.

Cheers,
Jan


From: Jesper Krogh on

On 26/05/2010, at 01.16, Jan Urbański <wulczer(a)wulczer.org> wrote:

> On 19/05/10 21:01, Jesper Krogh wrote:
>> The document base is around 350,000 documents and
>> I have set the statistics target on the tsvector column
>> to 1000 since 100 seems way off.
>
> So for tsvectors the statistics target means more or less "at any time
> track at most 10 * <target> lexemes simultaneously" where "track"
> means
> keeping them in memory while going through the tuples being analysed.
>
> Remember that the measure is in lexemes, not whole tsvectors and the
> 10
> factor is meant to approximate the average number of unique lexemes
> in a
> tsvector. If your documents are very large, this might not be a good
> approximation.

I just did an avg(length(document_tsvector)), which is 154.
Doc count is 1.3M now in my sample set.
>
>> But the distribution is very "flat" at the end, the last 128 values
>> are
>> exactly
>> 1.00189e-05
>> which means that any term sitting outside the array would get an
>> estimate of
>> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
>
> Yeah, this might mean that you could try cranking up the stats target a
> lot, to make the set of simultaneously tracked lexemes larger (it will
> cost time and memory during ANALYZE though). If the documents have
> completely different contents, what can happen is that almost all
> lexemes are only seen a few times and get removed during the pruning of
> the working set. I have seen similar behaviour while working on the
> typanalyze function for tsvectors.

I think I would prefer something less "magic". I can increase the
statistics target and get more reliable data, but that also increases
the number of tuples being picked out for analysis, which is really
time consuming.

But that also means that what gets stored as the lower bound of the
histogram isn't anywhere near the real lower bound; it is more like the
lower bound of the "artificial" histogram that was left after the last
pruning.

I would suggest that the final pruning should be aware of this, perhaps
by keeping track of the least frequent value that never got pruned and
using that for the last pruning and as the lower bound?

Thanks a lot for the explanation; it fits fairly well with why I
couldn't construct a simple test set that had the problem.

>
>> So far I have no idea if this is bad or good, so a couple of sample
>> runs
>> of stuff that
>> is sitting outside the "most_common_vals" array:
>>
>> [gathered statistics suck]
>
>> So the "most_common_vals" seems to contain a lot of values that
>> should
>> never have been kept in favor
>> of other values that are more common.
>
>> In practice, just cranking the statistics estimate up high enough
>> seems
>> to solve the problem, but doesn't
>> there seem to be something wrong in how the statistics are collected?
>
> The algorithm used to determine the most common vals does not do it
> exactly. That would require keeping all lexemes from the analysed
> tsvectors in memory, which would be impractical. If you want to learn
> more about the algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and the corresponding comments
> in ts_typanalyze.c.

I'll do some reading.

Jesper

From: Jan Urbański on
On 28/05/10 04:47, Tom Lane wrote:
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it. The paper says (last sentence in
> the definition of the algorithm, section 4.2):
>
> When a user requests a list of items with threshold s, we output
> those entries in D where f >= (s-e)N.
>
> What we are actually doing is emitting every entry with f >= 2. Since
> e is fixed at 1/w, this effectively means that we are setting s to be
> only fractionally greater than e, which means very large relative errors
> in the estimates.

I gave it a thought and reread the paper, but since I already blundered
once, please double-check my reasoning on this.

We follow the algorithm as written; the trouble starts when we want to
output the result. The paper says which items from the D structure
should be returned when the user asks for items that have frequencies
higher than a threshold s. What we want to put in the statistics table
are items accompanied by their frequencies, so we need to do some
reasoning before we can construct the result.

Say we have an item I with count f (taken from our D structure). The
total number of lexemes seen is N. The question is: what is the maximum
frequency threshold the user could specify that would still make us
output this element? From

f >= (s - e) * N

we can say it's

s <= (f / N) + e

So the item will be output whenever the user asks for items that occur
with frequency (f / N) + e or less. This means that the corresponding
entry in the statistics table should be <I, (f / N) + e>.

The thing is, this doesn't change much, because currently we are putting
(f / N) there, and e is set to 1 / (stats_target * 10), so the difference
would not be dramatic.

> Or, if you want it explained another way: we are emitting words whose f
> is very small and whose delta is very large, representing items that got
> added to the scan very late. These really shouldn't be there because
> their true frequency is probably a lot less than the intended threshold.

Well, the idea of the algorithm is that if their true frequency had been
bigger, they would have appeared earlier and survived the pruning, as I
understand it.

> The net effect of this is first that there are a lot of rather useless
> entries in the MCV list whose claimed frequency is quite small, like as
> little as two occurrences. Their true frequency could be quite a bit
> more. What's even worse is that we believe that the minimum of these
> claimed frequencies is a reliable upper bound for the frequencies of
> items not listed in the MCV list.

Per the algorithm it *is* the upper bound, if I got my maths right.
The last item in the list would not be returned if the requested
frequency was higher than the one associated with that item.

> So I think we have to fix this. The right thing is to select s and e
> values that are actually meaningful in the terms of the paper, then
> compute w from that, and be sure to enforce the minimum f value
> specified in the algorithm ... ie, don't be afraid to throw away values
> in the final D table.

I think we should definitely prune the table one last time in the very
probable case that the loop ended somewhere between two regularly
scheduled prunes.

As for selecting the algorithm parameters: we don't get to select s. We
do get to select e, but that's it. I have a feeling that our problems
are caused by the fact that the algorithm tries to answer the question
"which elements occur more frequently than X", while we actually want an
answer to "what's the frequency of each element?". I've almost
convinced myself that the transformation between the answers to these
questions exists, but this trouble report keeps giving me doubts.

Cheers,
Jan


From: Tom Lane on
Jan Urbański <wulczer(a)wulczer.org> writes:
> We follow the algorithm as written; the trouble starts when we want to
> output the result. The paper says which items from the D structure
> should be returned when the user asks for items that have frequencies
> higher than a threshold s. What we want to put in the statistics table
> are items accompanied by their frequencies, so we need to do some
> reasoning before we can construct the result.

Well, the estimated frequency is still just f/N. The point is that we
must filter out items with small f values because they're probably
inaccurate --- in particular, anything with f < eN is completely
untrustworthy.

I agree that we currently aren't bothering to determine a specific s
value, but we probably need to do that in order to have a clear
understanding of what we are getting.

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more. Then use that estimate as
the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
per the paper. If the eventual filtering results in a lot fewer than the
target number of MCV entries (because the input wasn't so Zipfian), we
lose, but at least we have accurate numbers for the entries we kept.
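
To make that concrete, here's a rough back-of-the-envelope sketch of the
parameter selection; the distinct-lexeme count and the plain 1/(k * H_n)
Zipfian form are assumptions of mine, just to show the mechanics:

#include <stdio.h>

int
main(void)
{
    int         num_mcelem = 1000 * 10; /* target MCV entries, e.g. stats
                                         * target 1000 times 10 */
    int         n_distinct = 1000000;   /* assumed distinct lexemes */
    double      h_n = 0.0;              /* harmonic number H_n */
    double      s, e, w;
    int         k;

    for (k = 1; k <= n_distinct; k++)
        h_n += 1.0 / k;

    /* Zipf with parameter 1: frequency of the K'th most common element
     * is roughly 1 / (K * H_n). */
    s = 1.0 / (num_mcelem * h_n);
    e = s / 10.0;
    w = 1.0 / e;                        /* bucket width for lossy counting */

    printf("s = %g, e = %g, w = %.0f\n", s, e, w);
    return 0;
}

If I have the current value of e right (1 / (stats_target * 10)), numbers
like these give a much larger w, i.e. far gentler pruning while scanning,
with the final f >= (s - e) * N filter doing the real cutting afterwards.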

> I think we should definitely prune the table one last time in the very
> probable case that the loop ended somewhere between two regularly
> scheduled prunes.

The paper doesn't say that you need to do that. I suspect if you work
through the math, you'll find out that the minimum-f filter skips
anything that would have been pruned anyway.

regards, tom lane
