From: =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= on
On 02/07/10 14:33, Teodor Sigaev wrote:
> Patch implements much more accuracy estimation of cost for GIN index
> scan than generic cost estimation function.

Hi,

I'm reviewing this patch, and to begin with it I tried to reproduce the
problem that originally came up on -performance in
http://archives.postgresql.org/pgsql-performance/2009-10/msg00393.php

The links from that mail are now dead, so I set up my own test environment:
* one table testfts(id serial, body text, body_fts tsvector)
* 50000 rows, each with 1000 random words taken from
/usr/share/dict/british-english-insane (the wbritish-insane Debian
package) separated by a single space
* each row also had the word "commonterm" at the end, 80% had
commonterm80, 60% had commonterm60 etc (using the same methodology as
Jesper, that commonterm60 can appear only if commonterm80 is in the row)
* a GIN index on the tsvectors

I was able to reproduce his issue, that is: select id from ftstest where
body_fts @@ to_tsquery('commonterm80'); was choosing a sequential scan,
which was resulting in much longer execution than the bitmap index plan
that I got after disabling seqscans.

I then applied the patch, recompiled PG and tried again... and nothing
changed. I first tried running ANALYSE and then dropping and recreating
the GIN index, but the planner still chooses the seq scan.

Full explains below (the NOTICE is a debugging aid from the patch, which
I temporarily enabled to see if it's picking up the code).

I'll continue reading the code and trying to understand what it does,
but in the meantime: am I doing something wrong that I don't see the
planner switching to the bitmap index plan? I see that the difference in
costs is small, so maybe I just need to tweak the planner knobs a bit?
Is the output below expected?

Cheers,
Jan


wulczer=# explain analyse select id from ftstest where body_fts @@
to_tsquery('commonterm80');
NOTICE: GIN stats: nEntryPages: 49297.000000 nDataPages: 16951.000000
nPendingPages :0.000000 nEntries: 277521.000000
QUERY PLAN

------------------------------------------------------------------------------------------------------------------
Seq Scan on ftstest (cost=0.00..1567.00 rows=39890 width=4) (actual
time=221.893..33179.794 rows=39923 loops=1)
Filter: (body_fts @@ to_tsquery('commonterm80'::text))
Total runtime: 33256.661 ms
(3 rows)

wulczer=# set enable_seqscan to false;
SET
Time: 0.257 ms
wulczer=# explain analyse select id from ftstest where body_fts @@
to_tsquery('commonterm80');
NOTICE: GIN stats: nEntryPages: 49297.000000 nDataPages: 16951.000000
nPendingPages :0.000000 nEntries: 277521.000000
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ftstest (cost=449.15..1864.50 rows=39890 width=4)
(actual time=107.421..181.284 rows=39923 loops=1)
Recheck Cond: (body_fts @@ to_tsquery('commonterm80'::text))
-> Bitmap Index Scan on ftstest_gin_idx (cost=0.00..439.18
rows=39890 width=0) (actual time=97.057..97.057 rows=39923 loops=1)
Index Cond: (body_fts @@ to_tsquery('commonterm80'::text))
Total runtime: 237.218 ms
(5 rows)

Time: 237.999 ms

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= on
On 26/07/10 12:58, Oleg Bartunov wrote:
> Jan,
>
> On Sun, 25 Jul 2010, Jan Urbaski wrote:
>
>> On 02/07/10 14:33, Teodor Sigaev wrote:
>>> Patch implements much more accuracy estimation of cost for GIN index
>>> scan than generic cost estimation function.

>> I was able to reproduce his issue, that is: select id from ftstest where
>> body_fts @@ to_tsquery('commonterm80'); was choosing a sequential scan,
>> which was resulting in much longer execution than the bitmap index plan
>> that I got after disabling seqscans.
>>
>> I then applied the patch, recompiled PG and tried again... and nothing
>> changed. I first tried running ANALYSE and then dropping and recreating
>> the GIN index, but the planner still chooses the seq scan.
>
> read thread
> http://archives.postgresql.org/pgsql-hackers/2010-04/msg01407.php
> There is always a fuzz factor, as Tom said, about 1% in path cost
> comparisons.
> You may compare plans for 'commonterm60', 'commonterm40'.

OK, I thought this might be the case, as with the patch the sequential
scan is
winning only be a small margin.

Thanks,
Jan

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on
On Fri, Jul 30, 2010 at 1:19 PM, Jan UrbaƄski <wulczer(a)wulczer.org> wrote:
> The patch has lots of statements like if ( GinPageIsLeaf(page) ), that is
> with extra space between the outer parenthesis and the condition, which
> AFAIK is not the project style. I guess pgindent fixes that, so it's no big
> deal.

It's better if these get cleaned up. pgindent will fix it eventually,
but the less stuff pgindent has to touch, the less likelihood there is
of breaking outstanding patches when it's run.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= on
OK, here's a review, as much as I was able to do it without
understanding deeply how GIN works.

The patch is context, applies cleanly to HEAD, compiles without warnings
and passes regression tests.

Using the script from
http://archives.postgresql.org/pgsql-performance/2009-10/msg00393.php I
was able to get an index scan with commonterm40, while with the
unpatched source I was getting an index scan only for commonterm20, so
it indeed improves the situation as far as cost estimation is concerned.

Codewise I have one question: the patch changes a loop in
ginvacuumcleanup from

for (blkno = GIN_ROOT_BLKNO + 1; blkno < npages; blkno++)

to

for (blkno = GIN_ROOT_BLKNO; blkno < npages; blkno++)

why should it now go through all blocks? I couldn't immediately see why
was it not OK to do it before and why is it OK to do it now, but I don't
really get how GIN works internally. I guess a comment would be good to
have there in any case.

The patch has lots of statements like if ( GinPageIsLeaf(page) ), that
is with extra space between the outer parenthesis and the condition,
which AFAIK is not the project style. I guess pgindent fixes that, so
it's no big deal.

There are lines with elog(NOTICE) that are #if 0, they probably should
either become elog(DEBUGX) or get removed.

As for performance, I tried running the attached script a couple of
times. I used the standard config file, only changed checkpoint_segments
to 30 and shared_buffers to 512MB. The timings were:

HEAD

INSERT 0 500000
Time: 13487.450 ms
VACUUM
Time: 337.673 ms

INSERT 0 500000
Time: 13751.110 ms
VACUUM
Time: 315.812 ms

INSERT 0 500000
Time: 12691.259 ms
VACUUM
Time: 312.320 ms

HEAD + gincostestimate

INSERT 0 500000
Time: 13961.894 ms
VACUUM
Time: 355.798 ms

INSERT 0 500000
Time: 14114.975 ms
VACUUM
Time: 341.822 ms

INSERT 0 500000
Time: 13679.871 ms
VACUUM
Time: 340.576 ms

so there is no immediate slowdown for a quick test with one client.

Since there was no stability or performance issues and it solves the
problem, I am marking this as ready for committer, although it might be
beneficial if someone more acquianted with GIN takes another look at it
before the committer review.

I will be travelling during the whole August and will only have
intermittent email access, so in case of any questions with regards to
review the respionse time can be a few days.

Cheers,
Jan