From: Tatsuo Ishii on
> It's already multibyte safe since 8.4

No, it isn't.

$ psql test
Pager usage is off.
psql (8.4.4)
Type "help" for help.

test=# select similarity('abc', 'abd'); -- OK
similarity
------------
0.333333
(1 row)

test=# select similarity('日本語', '日本後'); -- NG
similarity
------------
NaN
(1 row)

test=# select show_trgm('abc'); -- OK
show_trgm
-------------------------
{"  a"," ab",abc,"bc "}
(1 row)

test=# select show_trgm('日本語'); -- NG
show_trgm
-----------
{}
(1 row)

Encoding is EUC_JP, locale is C. Included is the script to reproduce
the problem.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
From: Andres Freund on
Hi,

On Thursday 27 May 2010 13:53:37 Tatsuo Ishii wrote:
> > It's already multibyte safe since 8.4
>
> No, it isn't.
> Encoding is EUC_JP, locale is C. Included is the script to reproduce
> the problem.
test=# select show_trgm('日本語');
show_trgm
---------------------------------------
{0x8194c0,0x836e53,0x1dc363,0x1e22e9}
(1 row)

Time: 0.443 ms
test=# select similarity('日本語', '日本後');
similarity
------------
0.333333
(1 row)

Time: 0.426 ms


Encoding is UTF-8...

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tatsuo Ishii on
> > No, it isn't.
> > Encoding is EUC_JP, locale is C. Included is the script to reproduce
> > the problem.
> test=# select show_trgm('日本語');
> show_trgm
> ---------------------------------------
> {0x8194c0,0x836e53,0x1dc363,0x1e22e9}
> (1 row)
>
> Time: 0.443 ms
> test=# select similarity('日本語', '日本後');
> similarity
> ------------
> 0.333333
> (1 row)
>
> Time: 0.426 ms
>
>
> Encoding is UTF-8...

What is your locale?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From: Andres Freund on
On Thursday 27 May 2010 14:40:41 Tatsuo Ishii wrote:
> > > No, it isn't.
> > > Encoding is EUC_JP, locale is C. Included is the script to reproduce
> > > the problem.
> >
> > test=# select show_trgm('日本語');
> > show_trgm
> > ---------------------------------------
> > {0x8194c0,0x836e53,0x1dc363,0x1e22e9}
> > (1 row)
> >
> > Time: 0.443 ms
> > test=# select similarity('日本語', '日本後');
> > similarity
> > ------------
> > 0.333333
> > (1 row)
> >
> > Time: 0.426 ms
> >
> >
> > Encoding is UTF-8...
>
> What is your locale?
It was en_EN.UTF-8. Interesting. With C it fails...

Andres

From: Tatsuo Ishii on
> > What is your locale?
> It was en_EN.UTF-8. Interesting. With C it fails...

Yes, pg_trgm seems to have problems with multibyte + C locale.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
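
The results reported in this thread can be reproduced with a small model. The following Python sketch is not pg_trgm's code; it assumes that each word is padded with two leading spaces and one trailing space before trigrams are taken, that similarity is the number of shared trigrams divided by the number of distinct trigrams in either string, and that under the C locale only ASCII letters and digits are classified as word characters. The names trigrams, similarity and ascii_only are hypothetical. Unlike the real show_trgm, which printed hex values for the multibyte trigrams above, the sketch keeps the raw characters.

```python
import math
import re


def trigrams(text, ascii_only):
    """Return the set of padded trigrams for every word in text.

    ascii_only=True models the assumed C-locale behaviour: only ASCII
    letters and digits count as word characters, so multibyte
    characters act as separators and contribute no words.
    """
    text = text.lower()
    if ascii_only:
        words = re.findall(r"[a-z0-9]+", text)
    else:
        # Any Unicode letter or digit is a word character.
        words = re.findall(r"[^\W_]+", text)
    result = set()
    for word in words:
        padded = "  " + word + " "  # two leading spaces, one trailing
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b, ascii_only=False):
    """Shared trigrams divided by distinct trigrams in either string."""
    ta = trigrams(a, ascii_only)
    tb = trigrams(b, ascii_only)
    total = len(ta | tb)
    if total == 0:
        # Models 0.0 / 0.0 in C float arithmetic, which yields NaN.
        return math.nan
    return len(ta & tb) / total
```

Under these assumptions, similarity("abc", "abd") is 2/6, matching the 0.333333 shown above. With ascii_only=True, trigrams("日本語", True) is empty and similarity("日本語", "日本後", ascii_only=True) is NaN, which matches the EUC_JP + C locale output. With ascii_only=False the same pair gives 2/6 again, matching the UTF-8 result.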
