From: Tom Lane on
Tatsuo Ishii <ishii(a)postgresql.org> writes:
> What is your locale?
>> It was en_EN.UTF-8. Interesting. With C it fails...

> Yes, pg_trgm seems to have problems with multibyte + C locale.

It's not a problem, it's just pilot error, or possibly inadequate
documentation. pg_trgm uses the locale's definition of "alpha",
"digit", etc. In C locale only basic ASCII letters and digits will be
recognized as word constituents.
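
As a rough illustration of that behavior (a minimal sketch, not pg_trgm's
actual code): the standard C classification functions, when LC_CTYPE is
"C", accept only ASCII letters and digits, so every byte of a multibyte
UTF-8 character is rejected. The helper names below are hypothetical, and
the sketch works byte by byte with plain <ctype.h> rather than PostgreSQL's
own multibyte-aware wrappers.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the bytes of s that the current LC_CTYPE
 * locale classifies as letters or digits.
 */
static size_t
count_word_bytes(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
    {
        /* Cast to unsigned char: passing a negative char is undefined. */
        if (isalnum((unsigned char) *s))
            n++;
    }
    return n;
}

/*
 * Hypothetical helper: switch LC_CTYPE to "C" and count word bytes.
 * Returns (size_t) -1 if the locale cannot be set.
 */
static size_t
count_word_bytes_in_c_locale(const char *s)
{
    if (setlocale(LC_CTYPE, "C") == NULL)
        return (size_t) -1;
    return count_word_bytes(s);
}
```

Called on a pure-ASCII string such as "abc123", the second helper counts
every byte; called on a UTF-8 encoded Japanese string it counts none, which
is the situation in which no word constituents are found.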

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tatsuo Ishii on
> > Yes, pg_trgm seems to have problems with multibyte + C locale.
>
> It's not a problem, it's just pilot error, or possibly inadequate
> documentation. pg_trgm uses the locale's definition of "alpha",
> "digit", etc. In C locale only basic ASCII letters and digits will be
> recognized as word constituents.

That means there is no chance to make pg_trgm work with multibyte + C
locale? If so, I will leave pg_trgm as it is and provide private
patches for those who need the functionality.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


From: Tom Lane on
Tatsuo Ishii <ishii(a)postgresql.org> writes:
>> It's not a problem, it's just pilot error, or possibly inadequate
>> documentation. pg_trgm uses the locale's definition of "alpha",
>> "digit", etc. In C locale only basic ASCII letters and digits will be
>> recognized as word constituents.

> That means there is no chance to make pg_trgm work with multibyte + C
> locale? If so, I will leave pg_trgm as it is and provide private
> patches for those who need the functionality.

Exactly what do you consider to be the missing functionality?
You need a notion of word vs non-word character from somewhere,
and the locale setting is the standard place to get that. The
core text search functionality behaves the same way.

regards, tom lane


From: Tatsuo Ishii on
> Exactly what do you consider to be the missing functionality?
> You need a notion of word vs non-word character from somewhere,
> and the locale setting is the standard place to get that. The
> core text search functionality behaves the same way.

No. Text search works fine with multibyte + C locale.

Anyway, locale is completely useless for distinguishing word vs. non-word
characters in an agglutinative language such as Japanese.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


From: Tom Lane on
Tatsuo Ishii <ishii(a)sraoss.co.jp> writes:
> Anyway, locale is completely useless for distinguishing word vs. non-word
> characters in an agglutinative language such as Japanese.

Well, that doesn't mean that the answer is to use C locale ;-)

However, you could possibly think about making this bit of code
more flexible:

#ifdef KEEPONLYALNUM
#define iswordchr(c) (t_isalpha(c) || t_isdigit(c))
#else
#define iswordchr(c) (!t_isspace(c))
#endif

Currently it seems to be hard-wired to the first case in standard
builds.
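
One way the choice might be made at run time instead of compile time is
sketched below. This is only an assumption about what "more flexible" could
look like, not a proposed patch: the enum WordChrMode and the function
iswordchr_mode are hypothetical names, and plain <ctype.h> calls on single
bytes stand in for the t_isalpha/t_isdigit/t_isspace wrappers used in the
macro above.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical setting that replaces the KEEPONLYALNUM compile-time switch. */
typedef enum
{
    WORDCHR_ALNUM_ONLY,         /* like the KEEPONLYALNUM branch */
    WORDCHR_NON_SPACE           /* like the #else branch */
} WordChrMode;

/*
 * Hypothetical run-time counterpart of the iswordchr() macro.
 * Classification still follows the current LC_CTYPE locale.
 */
static bool
iswordchr_mode(unsigned char c, WordChrMode mode)
{
    if (mode == WORDCHR_ALNUM_ONLY)
        return isalpha(c) || isdigit(c);
    return !isspace(c);
}
```

With WORDCHR_NON_SPACE, a non-ASCII byte counts as a word character even in
C locale, while WORDCHR_ALNUM_ONLY keeps the behavior of standard builds.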

regards, tom lane
