From: Sushant Sinha on 1 Aug 2010 14:04

Currently the English parser in text search does not support multiple words at the same position. Consider the word "wikipedia.org". Text search returns a single token, "wikipedia.org". However, if someone searches for "wikipedia org", there will not be a match. There are two problems here:

1. We do not have separate tokens "wikipedia" and "org".
2. If we had the two tokens, they should be at adjacent positions so that a phrase search for "wikipedia org" would work.

It would be nice to have the following tokenization and positioning for "wikipedia.org":

position 0: WORD(wikipedia), URL(wikipedia.org)
position 1: WORD(org)

Take the example of "wikipedia.org/search?q=sushant".

Here is the TSVECTOR:

select to_tsvector('english', 'wikipedia.org/search?q=sushant');
                                to_tsvector
----------------------------------------------------------------------------
 '/search?q=sushant':3 'wikipedia.org':2 'wikipedia.org/search?q=sushant':1

And here are the tokens:

select ts_debug('english', 'wikipedia.org/search?q=sushant');
                                    ts_debug
--------------------------------------------------------------------------------
 (url,URL,wikipedia.org/search?q=sushant,{simple},simple,{wikipedia.org/search?q=sushant})
 (host,Host,wikipedia.org,{simple},simple,{wikipedia.org})
 (url_path,"URL path",/search?q=sushant,{simple},simple,{/search?q=sushant})

The tokenization I would like to see is:

position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
position 1: WORD(org)
position 2: WORD(search), URL_PATH(search/?q=sushant)
position 3: WORD(q), URL_QUERY(q=search)
position 4: WORD(sushant)

So what we need is support for multiple tokens at the same position, and I need help in understanding how to realize this. Currently, position assignment happens in make_tsvector, which works on the parsed lexemes. Each lexeme is obtained through prsd_nexttoken. However, prsd_nexttoken returns only a single token.

Would it be possible to store some tokens and return them together? Or can we put a flag on certain tokens that says the position should not be increased?

-Sushant.
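
The second idea in the message, a flag on certain tokens saying that the position should not be increased, can be illustrated outside PostgreSQL. The following is only a sketch in Python, not the parser's C code: the Token class, its same_position flag, and the assign_positions function are hypothetical names invented for this illustration, and positions start at 0 only to match the numbering used in the message.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str                    # token type, e.g. "WORD" or "URL"
    text: str                    # token text
    same_position: bool = False  # hypothetical flag: share the previous token's position


def assign_positions(tokens, start=0):
    """Assign a position to each token, in order.

    A token with same_position=True reuses the position of the token
    before it; every other token advances the position by one.
    The first token always receives `start`, whatever its flag says.
    """
    result = []
    pos = start - 1
    for tok in tokens:
        if not tok.same_position or pos < start:
            pos += 1
        result.append((pos, tok.kind, tok.text))
    return result


# Token stream a modified parser might emit for "wikipedia.org" (assumed, not actual output).
tokens = [
    Token("WORD", "wikipedia"),
    Token("URL", "wikipedia.org", same_position=True),
    Token("WORD", "org"),
]

for pos, kind, text in assign_positions(tokens):
    print(f"position {pos}: {kind}({text})")
```

With such a flag the parser could keep returning one token per call, and only the code that assigns positions would need to look at the flag. The alternative raised in the message, buffering several tokens and returning them together, would instead change what a single parser call returns.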