From: Tatsuo Ishii on
> So I think a GUC is broken because pg_tgrm has a index opclasses and
> any indexes built using one setting will be broken if the GUC is
> changed.
>
> Perhaps we need two sets of functions (which presumably call the same
> implementation with a flag to indicate which definition to use). Then
> you can define an index using one or the other and the meaning would
> be stable.

It's worse. pg_trgm has another compile option "IGNORECASE" which
might affect index opclasses.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Peter Eisentraut on
On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
> > I don't know about Japanese, but the locale approach works just fine for
> > other agglutinative languages. I would rather suspect that it is the
> > trigram approach that might be rather useless for such languages,
> > because you are going to get a lot of similarity hits for the affixes.
>
> I'm not sure what you mean by "affixes". But I will explain...
>
> A Japanese sentence consists of words. Problem is, each word is not
> separated by space (agglutinative). So most text tools such as text
> search need preprocess which finds word boundaries by looking up
> dictionaries (and smart grammer analysis routine). In the process
> "affixes" can be determined and perhaps removed from the target word
> group to be used for text search (note that removing affixes is no
> relevant to locale). Once we get space separated sentence, it can be
> processed by text search or by pg_trgm just same as Engligh. (Note
> that these preprocessing are done outside PostgreSQL world). The
> difference is just the "word" can be consists of non ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues. You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use. E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
\u043D\u044B');
similarity
────────────
NaN
(1 row)



--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on
On Thu, May 27, 2010 at 2:01 PM, Peter Eisentraut <peter_e(a)gmx.net> wrote:
> On fre, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
>> > I don't know about Japanese, but the locale approach works just fine for
>> > other agglutinative languages. šI would rather suspect that it is the
>> > trigram approach that might be rather useless for such languages,
>> > because you are going to get a lot of similarity hits for the affixes.
>>
>> I'm not sure what you mean by "affixes". šBut I will explain...
>>
>> A Japanese sentence consists of words. Problem is, each word is not
>> separated by space (agglutinative). So most text tools such as text
>> search need preprocess which finds word boundaries by looking up
>> dictionaries (and smart grammer analysis routine). In the process
>> "affixes" can be determined and perhaps removed from the target word
>> group to be used for text search (note that removing affixes is no
>> relevant to locale). Once we get space separated sentence, it can be
>> processed by text search or by pg_trgm just same as Engligh. (Note
>> that these preprocessing are done outside PostgreSQL world). The
>> difference is just the "word" can be consists of non ASCII letters.
>
> I think the problem at hand has nothing at all to do with agglutination
> or CJK-specific issues. šYou will get the same problem with other
> languages *if* you set a locale that does not adequately support the
> characters in use. šE.g., Russian with locale C and encoding UTF8:
>
> select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E
> \u043D\u044B');
> šsimilarity
> €€€€€€€€€€€€
> š š š šNaN
> (1 row)

What I can't help wondering as I'm reading this discussion is -
Tatsuo-san said upthread that he has a problem with pg_trgm that he
does not have with full text search. So what is full text search
doing differently than pg_trgm?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tatsuo Ishii on
> > It's not a practical solution for people working with prebuilt Postgres
> > versions, which is most people. I don't object to finding a way to
> > provide a "not-space" behavior instead of an "is-alnum" behavior,
> > but as noted upthread a GUC isn't the right way. How do you feel
> > about a new set of functions with an additional flag argument of
> > some sort?
>
> Let me see how many functions we need to create...

After thinking a little bit more, I think following patch would not
break existing behavior and also adopts mutibyte + C locale case. What
do you think?

*** trgm_op.c~ 2009-06-11 23:48:51.000000000 +0900
--- trgm_op.c 2010-05-29 17:07:28.000000000 +0900
***************
*** 59,65 ****
}

#ifdef KEEPONLYALNUM
! #define iswordchr(c) (t_isalpha(c) || t_isdigit(c))
#else
#define iswordchr(c) (!t_isspace(c))
#endif
--- 59,67 ----
}

#ifdef KEEPONLYALNUM
! #define iswordchr(c) (lc_ctype_is_c()? \
! ((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \
! (t_isalpha(c) || t_isdigit(c)))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
Tatsuo Ishii <ishii(a)postgresql.org> writes:
> After thinking a little bit more, I think following patch would not
> break existing behavior and also adopts mutibyte + C locale case. What
> do you think?

This is still ignoring the point: arbitrarily changing the module's
longstanding standard behavior isn't acceptable. You need to provide
a way for the user to control the behavior. (Once you've done that,
I think it can be just either "alnum" or "!isspace", but maybe some
other behaviors would be interesting.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers