english parser in text search: support for multiple words in the same position [PgSql]

From: Robert Haas on 2 Aug 2010 10:26

On Mon, Aug 2, 2010 at 10:21 AM, Kevin Grittner
<Kevin.Grittner(a)wicourts.gov> wrote:
> Sushant Sinha <sushant354(a)gmail.com> wrote:
>
>> Yes, that's what I am planning to do. I just wanted to see if anyone
>> can help me estimate whether this is doable in the current parser
>> or whether I need to write a new one. If it is doable, any ideas on
>> how to go about implementing it?
>
> The current tsearch parser is a state machine which does clunky mode
> switches to handle special cases like you describe. If you're
> looking at doing very much in there, you might want to consider a
> rewrite to something based on regular expressions. See discussion
> in these threads:
>
> http://archives.postgresql.org/message-id/200912102005.16560.andres@anarazel.de
>
> http://archives.postgresql.org/message-id/4B210D9E020000250002D344@gw.wicourts.gov
>
> That was actually at the top of my personal PostgreSQL TODO list
> (after my current project is wrapped up), but I wouldn't complain if
> someone else wanted to take it. :-)

If you end up rewriting it, it may be a good idea, in the initial
rewrite, to mimic the current results as closely as possible - and
then submit a separate patch to change the results. Changing two
things at the same time exponentially increases the chance of your
patch getting rejected.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on 2 Aug 2010 10:20

Sushant Sinha <sushant354(a)gmail.com> writes:
>> This would needlessly increase the number of tokens. Instead you'd
>> better make it work like compound word support, having just "wikipedia"
>> and "org" as tokens.

> The current text parser already returns url and url_path. That already
> increases the number of unique tokens. I am only asking for adding of
> normal english words as well so that if someone types only "wikipedia"
> he gets a match.

The suggestion to make it work like compound words is still a good one,
ie given wikipedia.org you'd get back

host wikipedia.org
host-part wikipedia
host-part org

not just the "host" token as at present.

Then the user could decide whether he needed to index hostname
components or not, by choosing whether to forward hostname-part
tokens to a dictionary or just discard them.
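
To make the proposal concrete, here is a minimal sketch in Python of the token stream it describes. It is only an illustration, not the tsearch parser's code: the function names host_tokens and index_terms are hypothetical, "host-part" is the proposed token type rather than one the parser emits today, and giving every part the same position as the full host is an assumption taken from the subject of this thread.

```python
from typing import Iterable, List, Set, Tuple

# (token type, token text, position in the document)
Token = Tuple[str, str, int]


def host_tokens(host: str, position: int = 1) -> List[Token]:
    """Emit the full hostname followed by its dot-separated parts.

    All tokens share one position (an assumption of this sketch), so a
    match on a part lines up with a match on the whole hostname.
    """
    tokens: List[Token] = [("host", host, position)]
    parts = [p for p in host.split(".") if p]
    # A single-label host has no separate parts worth emitting.
    if len(parts) > 1:
        tokens.extend(("host-part", p, position) for p in parts)
    return tokens


def index_terms(tokens: Iterable[Token], keep_types: Set[str]) -> List[str]:
    """Keep only tokens whose type the configuration chose to index.

    This stands in for mapping a token type to a dictionary (kept)
    or leaving it unmapped (discarded).
    """
    return [text for (typ, text, _pos) in tokens if typ in keep_types]
```

With this sketch, host_tokens("wikipedia.org") yields the three tokens listed above. A configuration that keeps only "host" indexes just "wikipedia.org", which is the present behaviour, while one that also keeps "host-part" makes a search for "wikipedia" match.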

If you submit a patch that tries to force the issue by classifying
hostname parts as plain words, it'll probably get rejected out of
hand on backwards-compatibility grounds.

regards, tom lane

From: Robert Haas on 2 Aug 2010 09:32

On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha <sushant354(a)gmail.com> wrote:
> The current text parser already returns url and url_path. That already
> increases the number of unique tokens. I am only asking for adding of
> normal english words as well so that if someone types only "wikipedia"
> he gets a match.
[...]
> Postgres english parser already emits urls as tokens. Only thing I am
> asking is on improving the tokenization and positioning.

Can you write a patch to implement your idea?

--
Robert Haas

From: Sushant Sinha on 2 Aug 2010 09:59

On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote:
> On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha <sushant354(a)gmail.com> wrote:
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
> [...]
> > Postgres english parser already emits urls as tokens. Only thing I am
> > asking is on improving the tokenization and positioning.
>
> Can you write a patch to implement your idea?
>

Yes, that's what I am planning to do. I just wanted to see if anyone can
help me estimate whether this is doable in the current parser or whether
I need to write a new one. If it is doable, any ideas on how to go about
implementing it?

-Sushant.
