From: Robert Haas on 2 Aug 2010 10:26

On Mon, Aug 2, 2010 at 10:21 AM, Kevin Grittner <Kevin.Grittner(a)wicourts.gov> wrote:
> Sushant Sinha <sushant354(a)gmail.com> wrote:
>
>> Yes thats what I am planning to do. I just wanted to see if anyone
>> can help me in estimating whether this is doable in the current
>> parser or I need to write a new one. If possible, then some idea
>> on how to go about implementing?
>
> The current tsearch parser is a state machine which does clunky mode
> switches to handle special cases like you describe. If you're
> looking at doing very much in there, you might want to consider a
> rewrite to something based on regular expressions. See discussion
> in these threads:
>
> http://archives.postgresql.org/message-id/200912102005.16560.andres(a)anarazel.de
>
> http://archives.postgresql.org/message-id/4B210D9E020000250002D344(a)gw.wicourts.gov
>
> That was actually at the top of my personal PostgreSQL TODO list
> (after my current project is wrapped up), but I wouldn't complain if
> someone else wanted to take it. :-)

If you end up rewriting it, it may be a good idea, in the initial
rewrite, to mimic the current results as closely as possible - and
then submit a separate patch to change the results. Changing two
things at the same time exponentially increases the chance of your
patch getting rejected.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on 2 Aug 2010 10:20

Sushant Sinha <sushant354(a)gmail.com> writes:
>> This would needlessly increase the number of tokens. Instead you'd
>> better make it work like compound word support, having just "wikipedia"
>> and "org" as tokens.

> The current text parser already returns url and url_path. That already
> increases the number of unique tokens. I am only asking for adding of
> normal english words as well so that if someone types only "wikipedia"
> he gets a match.

The suggestion to make it work like compound words is still a good one,
ie given wikipedia.org you'd get back

	host		wikipedia.org
	host-part	wikipedia
	host-part	org

not just the "host" token as at present.

Then the user could decide whether he needed to index hostname
components or not, by choosing whether to forward hostname-part tokens
to a dictionary or just discard them.

If you submit a patch that tries to force the issue by classifying
hostname parts as plain words, it'll probably get rejected out of hand
on backwards-compatibility grounds.

			regards, tom lane
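
The compound-word style output suggested above could look roughly like the following. This is only a sketch in Python, not the tsearch parser's code: the "host-part" token type is the proposal from this message and is not an existing parser token type, and the names host_tokens and index_terms are hypothetical helpers made up for illustration.

```python
from typing import Iterable, List, Set, Tuple

# (token type, token text); "host-part" is the proposed type, assumed here.
Token = Tuple[str, str]


def host_tokens(hostname: str) -> List[Token]:
    """Emit the full host token, then one host-part token per dotted label."""
    tokens: List[Token] = [("host", hostname)]
    parts = [p for p in hostname.split(".") if p]
    # A single-label host has no components worth emitting separately.
    if len(parts) > 1:
        tokens.extend(("host-part", p) for p in parts)
    return tokens


def index_terms(tokens: Iterable[Token], keep_types: Set[str]) -> List[str]:
    """Stand-in for the configuration mapping: token types listed in
    keep_types are kept (lowercased), all other tokens are discarded."""
    return [text.lower() for kind, text in tokens if kind in keep_types]


tokens = host_tokens("wikipedia.org")
# Existing behaviour is preserved by ignoring the new token type.
print(index_terms(tokens, {"host"}))
# Opting in to hostname components lets a search for "wikipedia" match.
print(index_terms(tokens, {"host", "host-part"}))
```

Because the parts arrive under their own token type, a configuration that ignores that type still indexes only the host token, which is the backwards-compatibility point made above.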

From: Robert Haas on 2 Aug 2010 09:32

On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha <sushant354(a)gmail.com> wrote:
> The current text parser already returns url and url_path. That already
> increases the number of unique tokens. I am only asking for adding of
> normal english words as well so that if someone types only "wikipedia"
> he gets a match.
[...]
> Postgres english parser already emits urls as tokens. Only thing I am
> asking is on improving the tokenization and positioning.

Can you write a patch to implement your idea?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From: Sushant Sinha on 2 Aug 2010 09:59

On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote:
> On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha <sushant354(a)gmail.com> wrote:
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
> [...]
> > Postgres english parser already emits urls as tokens. Only thing I am
> > asking is on improving the tokenization and positioning.
>
> Can you write a patch to implement your idea?
>

Yes thats what I am planning to do. I just wanted to see if anyone can
help me in estimating whether this is doable in the current parser or I
need to write a new one. If possible, then some idea on how to go about
implementing?

-Sushant.