From: Tom Lane on
Jan Urbański <wulczer(a)wulczer.org> writes:
> Here's a patch against recent git, but it should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will now be using a much
> bigger bucket size and I haven't done any performance impact testing on
> it.
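
(For context, a minimal standalone sketch of the Lossy Counting
bookkeeping that the tsvector analyzer is modelled on, which is where
the "bucket size" above comes in: each tracked element carries a count
and a delta, its maximum possible undercount, and at every bucket
boundary the entries whose count + delta has fallen behind the current
bucket number are pruned. The fixed-size table, key length and helper
names below are illustrative assumptions only, not the committed code.)

#include <string.h>

#define MAX_TRACKED 1024
#define KEY_LEN     32

typedef struct
{
    char key[KEY_LEN];          /* tracked element */
    int  count;                 /* occurrences since insertion */
    int  delta;                 /* bucket number at insertion, minus one */
    int  used;
} TrackItem;

static TrackItem track[MAX_TRACKED];

/* Count one occurrence of key; b_current is the current bucket number. */
static void
lc_add(const char *key, int b_current)
{
    int  i, freeslot = -1;

    for (i = 0; i < MAX_TRACKED; i++)
    {
        if (track[i].used && strcmp(track[i].key, key) == 0)
        {
            track[i].count++;
            return;
        }
        if (!track[i].used && freeslot < 0)
            freeslot = i;
    }
    if (freeslot >= 0)
    {
        track[freeslot].used = 1;
        strncpy(track[freeslot].key, key, KEY_LEN - 1);
        track[freeslot].key[KEY_LEN - 1] = '\0';
        track[freeslot].count = 1;
        track[freeslot].delta = b_current - 1;
    }
}

/* At each bucket boundary, drop entries that can no longer be frequent. */
static void
lc_prune(int b_current)
{
    int  i;

    for (i = 0; i < MAX_TRACKED; i++)
    {
        if (track[i].used && track[i].count + track[i].delta <= b_current)
            track[i].used = 0;
    }
}

A larger bucket width means these prunes happen less often, so more
entries survive between prunes; that is the memory/time cost (and
accuracy gain) the quoted paragraph alludes to.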

I did a little bit of testing using a dataset I had handy (a couple
hundred thousand publication titles) and found that ANALYZE seems to be
noticeably but far from intolerably slower --- it's almost the same
speed at statistics targets up to 100, and even at the max setting of
10000 it's only maybe 25% slower. However I'm not sure if this result
will scale to very large document sets, so more testing would be a good
idea.

I committed the attached revised version of the patch. Revisions are
mostly minor but I did make two substantive changes:

* The patch changed the target number of mcelems from 10 *
statistics_target to just statistics_target. I reverted that since
I don't think it was intended; at least we hadn't discussed it.

* I modified the final processing to avoid one qsort step if there are
fewer than num_mcelems hashtable entries that pass the cutoff frequency
filter, and in any case to sort only those entries that pass it rather
than all of them. With the significantly larger number of hashtable
entries that will now be used, it seemed like a good thing to try to
cut the qsort overhead.
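
For concreteness, here is a hedged, standalone sketch of that
final-processing idea (not the committed compute_tsvector_stats code;
the Entry type and the cutoff_freq/num_mcelems names are illustrative
assumptions): entries are filtered against the cutoff first, and the
frequency qsort is done only when more entries survive than we intend
to keep.

#include <stdlib.h>

typedef struct
{
    int  count;                 /* occurrence count from the hashtable */
    /* the element value itself would live here as well */
} Entry;

/* qsort comparator: descending by count */
static int
compare_count_desc(const void *a, const void *b)
{
    const Entry *ea = (const Entry *) a;
    const Entry *eb = (const Entry *) b;

    return eb->count - ea->count;
}

/* Returns how many of entries[] should be emitted as most common elements. */
static int
select_mcelems(Entry *entries, int nentries, int cutoff_freq, int num_mcelems)
{
    int  npassing = 0;
    int  i;

    /* First pass: keep only entries above the cutoff frequency. */
    for (i = 0; i < nentries; i++)
    {
        if (entries[i].count > cutoff_freq)
            entries[npassing++] = entries[i];
    }

    /* Sort by descending count only if we must pick the top num_mcelems. */
    if (npassing > num_mcelems)
    {
        qsort(entries, npassing, sizeof(Entry), compare_count_desc);
        npassing = num_mcelems;
    }

    return npassing;
}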

regards, tom lane

From: Jan Urbański on
On 31/05/10 00:07, Tom Lane wrote:
> Jan Urbański <wulczer(a)wulczer.org> writes:
> I committed the attached revised version of the patch. Revisions are
> mostly minor but I did make two substantive changes:
>
> * The patch changed the target number of mcelems from 10 *
> statistics_target to just statistics_target. I reverted that since
> I don't think it was intended; at least we hadn't discussed it.

Yeah, that was accidental.

> * I modified the final processing to avoid one qsort step if there are
> fewer than num_mcelems hashtable entries that pass the cutoff frequency
> filter, and in any case to sort only those entries that pass it rather
> than all of them. With the significantly larger number of hashtable
> entries that will now be used, it seemed like a good thing to try to
> cut the qsort overhead.

Makes sense.

Thanks,
Jan

From: Jesper Krogh on
On 2010-05-30 20:02, Jan Urbański wrote:
> Here's a patch against recent git, but it should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will now be using a much
> bigger bucket size and I haven't done any performance impact testing on
> it. I updated the initial comment block in compute_tsvector_stats, but
> the prose could probably be improved.
>
Just a small follow-up. I tried out the patch (or actually a fresh git
checkout) and it now gives very accurate results for both the upper and
lower ends of the MCE histogram with a lower cutoff that doesn't
approach 2.

Thanks a lot.

--
Jesper

From: Tom Lane on
Jesper Krogh <jesper(a)krogh.cc> writes:
> Just a small follow-up. I tried out the patch (or actually a fresh git
> checkout) and it now gives very accurate results for both the upper and
> lower ends of the MCE histogram with a lower cutoff that doesn't
> approach 2.

Good. How much did the ANALYZE time change for your table?

regards, tom lane

From: Jesper Krogh on
On 2010-05-31 20:38, Tom Lane wrote:
Jesper Krogh <jesper(a)krogh.cc> writes:
>
>> Just a small follow-up. I tried out the patch (or actually a fresh git
>> checkout) and it now gives very accurate results for both the upper and
>> lower ends of the MCE histogram with a lower cutoff that doesn't
>> approach 2.
>>
> Good. How much did the ANALYZE time change for your table?
>
1.3m documents.

New code (3 runs):
statistics target 1000 => 155s/124s/110s
statistics target 100  => 86s/55s/61s
Old code (3 runs):
statistics target 1000 => 158s/101s/99s
statistics target 100  => 90s/29s/33s

Somehow I think that the first run is the relevant one; it's pretty much
a "dead disk" test, and I wouldn't expect random sampling of tuples to
have any sane caching effect in a production system. But it looks like
the algorithm is "a bit" slower.

Thanks again.

Jesper
