From: Tatsuo Ishii on 27 May 2010 11:51

> So I think a GUC is broken because pg_trgm has index opclasses and
> any indexes built using one setting will be broken if the GUC is
> changed.
>
> Perhaps we need two sets of functions (which presumably call the same
> implementation with a flag to indicate which definition to use). Then
> you can define an index using one or the other and the meaning would
> be stable.

It's worse. pg_trgm has another compile option "IGNORECASE" which
might affect index opclasses.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
From: Peter Eisentraut on 27 May 2010 14:01

On Fri, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
> > I don't know about Japanese, but the locale approach works just fine for
> > other agglutinative languages. I would rather suspect that it is the
> > trigram approach that might be rather useless for such languages,
> > because you are going to get a lot of similarity hits for the affixes.
>
> I'm not sure what you mean by "affixes". But I will explain...
>
> A Japanese sentence consists of words. The problem is, the words are not
> separated by spaces (agglutinative). So most text tools such as text
> search need a preprocessing step which finds word boundaries by looking up
> dictionaries (and a smart grammar analysis routine). In the process
> "affixes" can be determined and perhaps removed from the target word
> group to be used for text search (note that removing affixes is not
> relevant to locale). Once we get a space-separated sentence, it can be
> processed by text search or by pg_trgm just the same as English. (Note
> that this preprocessing is done outside the PostgreSQL world.) The
> difference is just that a "word" can consist of non-ASCII letters.

I think the problem at hand has nothing at all to do with agglutination
or CJK-specific issues. You will get the same problem with other
languages *if* you set a locale that does not adequately support the
characters in use. E.g., Russian with locale C and encoding UTF8:

select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E\u043D\u044B');
 similarity
────────────
        NaN
(1 row)
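The NaN in the query above is consistent with a 0/0 division: under the C locale the byte-wise ctype tests accept only ASCII letters and digits, so a UTF-8 Cyrillic word contributes no word characters and therefore no trigrams. The C sketch below illustrates that reasoning outside PostgreSQL. The helper names (use_c_locale, count_alnum_bytes, similarity_ratio) and the ratio formula are assumptions made for illustration, not pg_trgm's actual code.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Force the "C" ctype rules, mirroring a database created with locale C. */
static void use_c_locale(void)
{
    setlocale(LC_CTYPE, "C");
}

/* Count the bytes of s that the current locale classifies as alphanumeric. */
static size_t count_alnum_bytes(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if (isalnum((unsigned char) *s))
            n++;
    }
    return n;
}

/*
 * Assumed similarity formula: shared trigrams divided by distinct trigrams.
 * When neither string yields any trigram, this evaluates 0/0.
 */
static float similarity_ratio(int shared, int len1, int len2)
{
    return (float) shared / (float) (len1 + len2 - shared);
}
```

After use_c_locale(), count_alnum_bytes() returns 0 for the UTF-8 bytes of the Russian word in the query, and similarity_ratio(0, 0, 0) is NaN on IEEE 754 platforms, which would match the result shown.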
From: Robert Haas on 27 May 2010 15:00

On Thu, May 27, 2010 at 2:01 PM, Peter Eisentraut <peter_e(a)gmx.net> wrote:
> On Fri, 2010-05-28 at 00:46 +0900, Tatsuo Ishii wrote:
>> > I don't know about Japanese, but the locale approach works just fine for
>> > other agglutinative languages. I would rather suspect that it is the
>> > trigram approach that might be rather useless for such languages,
>> > because you are going to get a lot of similarity hits for the affixes.
>>
>> I'm not sure what you mean by "affixes". But I will explain...
>>
>> A Japanese sentence consists of words. The problem is, the words are not
>> separated by spaces (agglutinative). So most text tools such as text
>> search need a preprocessing step which finds word boundaries by looking up
>> dictionaries (and a smart grammar analysis routine). In the process
>> "affixes" can be determined and perhaps removed from the target word
>> group to be used for text search (note that removing affixes is not
>> relevant to locale). Once we get a space-separated sentence, it can be
>> processed by text search or by pg_trgm just the same as English. (Note
>> that this preprocessing is done outside the PostgreSQL world.) The
>> difference is just that a "word" can consist of non-ASCII letters.
>
> I think the problem at hand has nothing at all to do with agglutination
> or CJK-specific issues. You will get the same problem with other
> languages *if* you set a locale that does not adequately support the
> characters in use. E.g., Russian with locale C and encoding UTF8:
>
> select similarity(E'\u0441\u043B\u043E\u043D', E'\u0441\u043B\u043E\u043D\u044B');
>  similarity
> ────────────
>         NaN
> (1 row)

What I can't help wondering as I'm reading this discussion is this:
Tatsuo-san said upthread that he has a problem with pg_trgm that he
does not have with full text search. So what is full text search
doing differently from pg_trgm?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
From: Tatsuo Ishii on 29 May 2010 04:13

> > It's not a practical solution for people working with prebuilt Postgres
> > versions, which is most people. I don't object to finding a way to
> > provide a "not-space" behavior instead of an "is-alnum" behavior,
> > but as noted upthread a GUC isn't the right way. How do you feel
> > about a new set of functions with an additional flag argument of
> > some sort?
>
> Let me see how many functions we need to create...

After thinking a little bit more, I think the following patch would not
break existing behavior and also handles the multibyte + C locale case.
What do you think?

*** trgm_op.c~  2009-06-11 23:48:51.000000000 +0900
--- trgm_op.c   2010-05-29 17:07:28.000000000 +0900
***************
*** 59,65 ****
  }
  
  #ifdef KEEPONLYALNUM
! #define iswordchr(c)  (t_isalpha(c) || t_isdigit(c))
  #else
  #define iswordchr(c)  (!t_isspace(c))
  #endif
--- 59,67 ----
  }
  
  #ifdef KEEPONLYALNUM
! #define iswordchr(c)  (lc_ctype_is_c()? \
!               ((*(c) & 0x80)? !t_isspace(c) : (t_isalpha(c) || t_isdigit(c))) : \
!               (t_isalpha(c) || t_isdigit(c)))
  #else
  #define iswordchr(c)  (!t_isspace(c))
  #endif
From: Tom Lane on 29 May 2010 10:31
Tatsuo Ishii <ishii(a)postgresql.org> writes:
> After thinking a little bit more, I think the following patch would not
> break existing behavior and also handles the multibyte + C locale case.
> What do you think?

This is still ignoring the point: arbitrarily changing the module's
longstanding standard behavior isn't acceptable. You need to provide
a way for the user to control the behavior. (Once you've done that,
I think it can be just either "alnum" or "!isspace", but maybe some
other behaviors would be interesting.)

regards, tom lane
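One way to picture a user-controllable choice between the "alnum" and "!isspace" rules is a single implementation that takes a mode flag, with thin entry points on top, along the lines of the "two sets of functions" idea raised earlier in the thread. All names below (WordCharMode, count_word_bytes and its two wrappers) are hypothetical and exist only for this sketch; nothing here is part of pg_trgm.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag selecting the definition of a "word character". */
typedef enum
{
    WORDCHR_ALNUM,      /* existing rule: letters and digits only */
    WORDCHR_NOT_SPACE   /* alternative rule: anything but whitespace */
} WordCharMode;

static bool is_word_byte(unsigned char b, WordCharMode mode)
{
    if (mode == WORDCHR_NOT_SPACE)
        return !isspace(b);
    return isalnum(b) != 0;
}

/* Shared implementation: count the bytes accepted under the given mode. */
static size_t count_word_bytes(const char *s, WordCharMode mode)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if (is_word_byte((unsigned char) *s, mode))
            n++;
    }
    return n;
}

/* Two stable entry points; each could back its own set of SQL functions. */
static size_t count_word_bytes_alnum(const char *s)
{
    return count_word_bytes(s, WORDCHR_ALNUM);
}

static size_t count_word_bytes_notspace(const char *s)
{
    return count_word_bytes(s, WORDCHR_NOT_SPACE);
}
```

Because each entry point fixes its mode at the call site, anything built on top of one of them would keep the same meaning regardless of a session setting, which is the stability concern raised against the GUC approach.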