From: Jesper Krogh on 30 May 2010 03:08

On 2010-05-29 15:56, Jan Urbański wrote:
> On 29/05/10 12:34, Jesper Krogh wrote:
>> On 2010-05-28 23:47, Jan Urbański wrote:
>>> On 28/05/10 22:22, Tom Lane wrote:
>>> Now I tried to substitute some numbers there, and so assuming the
>>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>>> statistics target to be 100.
>>>
>>> I chose s as 1/((st + 10) * H(W)) because the top 10 English words will
>>> most probably be stopwords, so we will never see them in the input.
>>>
>> I think you should skip the assumption about stop-words; users may use
>> something where they are needed in the index, or have a language other
>> than the typical one (and they don't seem to influence the math that
>> much).
>>
> Turns out it has nearly linear influence on the bucket width and the
> frequency necessary to survive the final pruning. I put some data in a
> spreadsheet, results below.

How about setting it to "some default" in the first analyze round, but to
the count of MCEs with a frequency of 1 in the subsequent analyze rounds?

>> Isn't it the same "type" of logic that is used for collecting statistics
>> for array types, say integer arrays and text arrays?
>>
> AFAIK statistics for everything other than tsvectors are built based on
> the values of whole rows. ts_typanalyze is the only typanalyze function
> that takes the trouble of looping over the actual contents of each cell;
> all the others just compare whole arrays (which means that for a text[]
> field you will probably get a quite useless MCV entry).

In another area I was thinking about modelling a complete tree structure,
where I would like to extract complete sub-branches as int[] of the node
ids in the set and then index them using GIN. That seems like a "really
bad idea" based on the above information. Wouldn't it make sense to treat
array types like tsvectors?

> The results are attached in a text (CSV) file, to preserve formatting.
> Based on them I'd like to propose top_stopwords and error_factor to be 100.

I know it is not perceived as the correct way to do things, but I would
really like to keep the "stop words" in the dataset and have something
that is robust to that. There are two issues behind that wish. One is that
the application becomes more general: I really cannot stop my users from
searching for stop words, and they would expect the "full set" and not the
"empty set" as we get now. The other is that the list of stop words is by
no means finite and depends very much on the input data set.

I would try to add the stop words to the dictionary, so they still work
but don't occupy that much space in the actual index. That seems to solve
the same task, but with fewer issues for the users and more generalized
code around it.

>> I can "fairly easily" try out patches or do other kinds of testing.
>>
> I'll try to come up with a patch for you to try and fiddle with these
> values before Monday.

Excellent.
testdb=# explain select id from testdb.reference where document_tsvector
         @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference where document_tsvector
         @@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 id
----
(0 rows)

testdb=#

I would indeed have expected the first 50 rows ordered by id, trivial to
extract.

--
Jesper
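As a quick sketch of where that NOTICE comes from (assuming the column was
built with the default english configuration): the word is discarded by
the configuration's stopword list before planning even starts, which is
why the built-in simple configuration, which has no stopword list, keeps
it:

    testdb=# SELECT plainto_tsquery('english', 'where');
    NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
     plainto_tsquery
    -----------------
     
    (1 row)

    testdb=# SELECT plainto_tsquery('simple', 'where');
     plainto_tsquery
    -----------------
     'where'
    (1 row)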
From: Jan Urbański on 30 May 2010 10:41

> Jesper Krogh <jesper(a)krogh.cc> writes:
>> On 2010-05-29 15:56, Jan Urbański wrote:
>>> AFAIK statistics for everything other than tsvectors are built based
>>> on the values of whole rows.
>> Wouldn't it make sense to treat array types like tsvectors?
> Yeah, I have a personal TODO item to look into that in the future.

There were plans to generalise the functions in ts_typanalyze and use LC
for array types as well. If one day I find myself with a lot of free time,
I'll take a stab at that.

>>> The results are attached in a text (CSV) file, to preserve
>>> formatting. Based on them I'd like to propose top_stopwords and
>>> error_factor to be 100.
>> I know it is not perceived as the correct way to do things, but I
>> would really like to keep the "stop words" in the dataset and have
>> something that is robust to that.
> Any stop words would already have been eliminated in the transformation
> to tsvector (or not, if none were configured in the dictionary setup).
> We should not assume that there are any in what ts_typanalyze is seeing.

Yes, and as a side note: if you want to be indexing stopwords, just don't
pass a stopword file when creating the text search dictionary (or pass a
custom one).

> I think the only relevance of stopwords to the current problem is that
> *if* stopwords have been removed, we would see a Zipfian distribution
> with the first few entries removed, and I'm not sure if it's still
> really Zipfian afterwards.  However, we only need the assumption of
> Zipfianness to compute a target frequency cutoff, so it's not like
> things will be completely broken if the distribution isn't quite
> Zipfian.

That's why I was proposing to take s = 0.07 / (MCE-count + 10). But that
probably doesn't matter much.

Jan
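For example, a configuration that keeps stopwords could look roughly like
the sketch below (the english_nostop names are made up for illustration,
and snowball is just one template choice):

    -- A snowball dictionary created without a StopWords parameter
    -- discards nothing, so stopwords end up in the tsvector.
    CREATE TEXT SEARCH DICTIONARY english_nostop (
        TEMPLATE = snowball,
        Language = english
    );

    CREATE TEXT SEARCH CONFIGURATION english_nostop (COPY = pg_catalog.english);

    ALTER TEXT SEARCH CONFIGURATION english_nostop
        ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                          word, hword, hword_part
        WITH english_nostop;

The column and the queries would then use
to_tsvector('english_nostop', ...) and plainto_tsquery('english_nostop', ...).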
From: Tom Lane on 30 May 2010 10:46

Jan Urbański <wulczer(a)wulczer.org> writes:
>> I think the only relevance of stopwords to the current problem is that
>> *if* stopwords have been removed, we would see a Zipfian distribution
>> with the first few entries removed, and I'm not sure if it's still
>> really Zipfian afterwards.

> That's why I was proposing to take s = 0.07 / (MCE-count + 10). But
> that probably doesn't matter much.

Oh, now I get the point of that. Yeah, it is probably a good idea. If
the input doesn't have stopwords removed, the worst that will happen is
we'll collect stats for an extra 10 or so lexemes, which will then get
thrown away when they don't fit into the MCE list. +1.

			regards, tom lane
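Rough numbers for that proposal, as a sketch only: the relations used
below (epsilon = s / error_factor, bucket width = ceil(1 / epsilon)) are
the usual Lossy Counting bookkeeping and are assumed here rather than
taken from the patch:

    -- Back-of-the-envelope values, assuming a statistics target of 100
    -- and the proposed error_factor of 100.
    SELECT s,
           s / 100       AS epsilon,        -- epsilon = s / error_factor
           ceil(100 / s) AS bucket_width    -- bucket width = ceil(1 / epsilon)
    FROM (SELECT 0.07 / (100 + 10) AS s) AS params;

which lines up with the note below about the patch using a much bigger
bucket size.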
From: Tom Lane on 30 May 2010 10:24

Jesper Krogh <jesper(a)krogh.cc> writes:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> AFAIK statistics for everything other than tsvectors are built based on
>> the values of whole rows.

> Wouldn't it make sense to treat array types like tsvectors?

Yeah, I have a personal TODO item to look into that in the future.

>> The results are attached in a text (CSV) file, to preserve formatting.
>> Based on them I'd like to propose top_stopwords and error_factor to be 100.

> I know it is not perceived as the correct way to do things, but I would
> really like to keep the "stop words" in the dataset and have something
> that is robust to that.

Any stop words would already have been eliminated in the transformation
to tsvector (or not, if none were configured in the dictionary setup).
We should not assume that there are any in what ts_typanalyze is seeing.

I think the only relevance of stopwords to the current problem is that
*if* stopwords have been removed, we would see a Zipfian distribution
with the first few entries removed, and I'm not sure if it's still
really Zipfian afterwards.  However, we only need the assumption of
Zipfianness to compute a target frequency cutoff, so it's not like
things will be completely broken if the distribution isn't quite
Zipfian.

			regards, tom lane
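To put a number on that cutoff (a sketch only: it assumes the pure Zipfian
form in which the k-th most common lexeme has frequency 1/(k * H(W)),
roughly 1e6 distinct lexemes, and a statistics target of 100, so the
cutoff is the frequency of the 110th entry):

    -- Zipfian frequency of the (st + 10)-th most common lexeme,
    -- f = 1 / ((st + 10) * H(W)), with H(W) the W-th harmonic number.
    SELECT 1.0 / ((100 + 10) * sum(1.0 / k)) AS cutoff_frequency
    FROM generate_series(1, 1000000) AS k;

The harmonic sum here comes out around 14, so 1/H(W) is roughly 0.07 and
the result is close to the 0.07 / (MCE-count + 10) figure discussed above.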
From: Jan Urbański on 30 May 2010 14:02
On 30/05/10 09:08, Jesper Krogh wrote:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> On 29/05/10 12:34, Jesper Krogh wrote:
>>> I can "fairly easily" try out patches or do other kinds of testing.
>>>
>> I'll try to come up with a patch for you to try and fiddle with these
>> values before Monday.

Here's a patch against recent git; it should apply to 8.4 sources as well.
It would be interesting to measure the memory and time needed to analyse
the table after applying it, because we will now be using a much bigger
bucket size and I haven't done any performance impact testing on it. I
updated the initial comment block in compute_tsvector_stats, but the prose
could probably be improved.

> testdb=# explain select id from testdb.reference where document_tsvector
> @@ plainto_tsquery('where') order by id limit 50;
> NOTICE:  text-search query contains only stop words or doesn't contain
> lexemes, ignored

That's orthogonal to the issue with the statistics collection; you just
need to modify your stopwords list (for instance, make it empty).

Cheers,
Jan
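For the timing side of that measurement, something along these lines
should do (a sketch; testdb.reference is the table from the example
above, and 100 is just an example statistics target):

    -- Re-run ANALYZE under \timing to compare before/after the patch.
    \timing on
    SET default_statistics_target = 100;   -- example value
    ANALYZE VERBOSE testdb.reference;

    -- The collected most-common-element slot is stored in pg_statistic;
    -- STATISTIC_KIND_MCELEM shows up as a stakindN value of 4.
    SELECT staattnum, stakind1, stakind2, stakind3, stakind4
    FROM pg_statistic
    WHERE starelid = 'testdb.reference'::regclass;

Memory would still have to be watched from outside, e.g. the backend's
resident size while ANALYZE runs.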