From: Tatsuo Ishii on
> Well, that doesn't mean that the answer is to use C locale ;-)

Of course it's up to user whether to use C locale or not. I just want
pg_trgm work with C locale as well.

> However, you could possibly think about making this bit of code
> more flexible:
>
> #ifdef KEEPONLYALNUM
> #define iswordchr(c) (t_isalpha(c) || t_isdigit(c))
> #else
> #define iswordchr(c) (!t_isspace(c))
> #endif
>
> Currently it seems to be hard-wired to the first case in standard
> builds.

Yup. Here is the patch in my mind:

*** trgm_op.c~ 2009-06-11 23:48:51.000000000 +0900
--- trgm_op.c 2010-05-27 23:38:20.000000000 +0900
***************
*** 59,65 ****
}

#ifdef KEEPONLYALNUM
! #define iswordchr(c) (t_isalpha(c) || t_isdigit(c))
#else
#define iswordchr(c) (!t_isspace(c))
#endif
--- 59,65 ----
}

#ifdef KEEPONLYALNUM
! #define iswordchr(c) (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && !t_isspace(c)))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
Tatsuo Ishii <ishii(a)postgresql.org> writes:
> ! #define iswordchr(c) (t_isalpha(c) || t_isdigit(c) || (lc_ctype_is_c() && !t_isspace(c)))

This seems entirely arbitrary. It might "fix" things in your view
but it will break the longstanding behavior for other people.

I think a more appropriate type of fix would be to expose the
KEEPONLYALNUM option as a GUC, or some other way of letting the
user decide what he wants.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Greg Stark on
On Thu, May 27, 2010 at 3:52 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> I think a more appropriate type of fix would be to expose the
> KEEPONLYALNUM option as a GUC, or some other way of letting the
> user decide what he wants.
>

So I think a GUC is broken because pg_tgrm has a index opclasses and
any indexes built using one setting will be broken if the GUC is
changed.

Perhaps we need two sets of functions (which presumably call the same
implementation with a flag to indicate which definition to use). Then
you can define an index using one or the other and the meaning would
be stable.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Peter Eisentraut on
On tor, 2010-05-27 at 23:20 +0900, Tatsuo Ishii wrote:
> Anyway locale is completely usesless for finding word vs non-character
> an agglutinative language such as Japanese.

I don't know about Japanese, but the locale approach works just fine for
other agglutinative languages. I would rather suspect that it is the
trigram approach that might be rather useless for such languages,
because you are going to get a lot of similarity hits for the affixes.


--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tatsuo Ishii on
> I don't know about Japanese, but the locale approach works just fine for
> other agglutinative languages. I would rather suspect that it is the
> trigram approach that might be rather useless for such languages,
> because you are going to get a lot of similarity hits for the affixes.

I'm not sure what you mean by "affixes". But I will explain...

A Japanese sentence consists of words. Problem is, each word is not
separated by space (agglutinative). So most text tools such as text
search need preprocess which finds word boundaries by looking up
dictionaries (and smart grammer analysis routine). In the process
"affixes" can be determined and perhaps removed from the target word
group to be used for text search (note that removing affixes is no
relevant to locale). Once we get space separated sentence, it can be
processed by text search or by pg_trgm just same as Engligh. (Note
that these preprocessing are done outside PostgreSQL world). The
difference is just the "word" can be consists of non ASCII letters.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers