From: Tatsuo Ishii on
> > > This is in 9.0, because 8.4 doesn't recognize the \u escape syntax. If
> > > you run this in 8.4, you're just comparing a sequence of ASCII letters
> > > and digits.
> >
> > Hum. Still I prefer 8.4's behavior since anything is better than
> > returning NaN. It seems 9.0 does not have any escape route for
> > multibyte+C locale users.
>
> I think you are confusing some things here. The \u escape syntax is for
> string literals in general. The behavior of pg_trgm is still the same
> in 8.4 and in 9.0. It's just easier in 9.0 to write out examples
> relevant to the current problem.

I just wanted to point this out from the users' point of view. I do not
object to the new \u escape syntax. I think pg_trgm has a problem, but
Tom thinks that it's not a problem. That's the point.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Greg Stark on
On Sun, May 30, 2010 at 3:41 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> I don't think it's unreasonable to insist that behavioral changes be
> made in an upward compatible fashion ... especially ones that seem as
> at least as likely to break some current usages as to enable new usages.

FWIW, I don't think we've traditionally been so tense about contrib
modules. With the advent of extensions that users can easily install
with a single command, that might be about to change, though.

There seem to be three behaviours on the table here:

1) Status quo -- only alpha and digit characters for the current
locale are considered word elements

2) All characters aside from space characters for the current locale
are considered word elements

3) Alpha and digit characters for the current locale, and for C locale
any non-ascii (high bit set) character is considered a word element
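
To make the difference between the three behaviours concrete, here is a
rough Python sketch. It is not pg_trgm's actual C code: the names
is_word_char and split_words are hypothetical, it works on Unicode code
points rather than bytes, and it approximates a non-C locale's
alpha/digit test with Python's str.isalnum().

```python
from itertools import groupby


def is_word_char(ch, behaviour, c_locale):
    """Decide whether ch is a word element under behaviour 1, 2 or 3.

    Assumption: a non-C locale is approximated by str.isalnum();
    the C locale only recognises ASCII letters and digits.
    """
    if behaviour == 2:
        # 2) anything that is not a space character
        return not ch.isspace()
    if c_locale:
        ascii_alnum = ch.isascii() and ch.isalnum()
        if behaviour == 3:
            # 3) in C locale, also accept any non-ASCII character
            return ascii_alnum or ord(ch) > 127
        # 1) status quo in C locale: ASCII alphanumerics only
        return ascii_alnum
    # 1) and 3) agree outside the C locale
    return ch.isalnum()


def split_words(text, behaviour, c_locale=True):
    """Split text into runs of word elements, dropping everything else."""
    key = lambda ch: is_word_char(ch, behaviour, c_locale)
    return ["".join(run) for keep, run in groupby(text, key=key) if keep]
```

Under these assumptions, splitting "日本語 foo-bar" in C locale drops the
Japanese word entirely with behaviour 1, keeps "foo-bar" as one word with
behaviour 2, and keeps the Japanese word while still splitting on "-"
with behaviour 3.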

1 -> 3 seems like a pretty safe change considering that anyone using
non-ascii characters in C locale probably isn't using pg_trgm or they
would be complaining about it already. How big a user-base do we think
pg_trgm has anyways? How many of those are using it on non-ascii
characters in C locale? And of those how many expect the non-ascii
characters to be considered non-word characters? It doesn't sound like
terribly useful behaviour to me.

Behaviour 2 also seems like it would be useful so providing it as well
is also a perfectly reasonable option. But I agree that 1->2 would be
a user-visible change for basically all users so it would have to be
an additional option.

--
greg


From: Tom Lane on
Greg Stark <gsstark(a)mit.edu> writes:
> There seem to be three behaviours on the table here:

You're neglecting

4) Let the user decide whether he wants pg_trgm to consider word
elements to be "alphanumerics" or "any non-space".

The main problem I have with Tatsuo's patch is that it forecloses any
reasonably upward-compatible extension to a user-selected behavior like
(4). The current behavior can be extended and is simple to document
(though we've neglected to do so). But once you've put in this
arbitrary warping of the behavior of C locale, you're going to be at
a dead end for improving it later.
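
As an illustration only, a user-selected behaviour like (4) might look
like the Python sketch below. Nothing here is an existing pg_trgm
setting: WordElements and extract_words are hypothetical names, and
str.isalnum() stands in for the locale's alphanumeric test. The point
is just that the default preserves the current behaviour, so the
option is an upward-compatible extension.

```python
from enum import Enum
from itertools import groupby


class WordElements(Enum):
    """Hypothetical user-selectable setting, not a real pg_trgm option."""
    ALNUM = "alphanumerics"
    NON_SPACE = "any non-space"


def extract_words(text, word_elements=WordElements.ALNUM):
    """Split text into words; the default keeps the existing behaviour."""
    if word_elements is WordElements.NON_SPACE:
        is_word_char = lambda ch: not ch.isspace()
    else:
        # Assumption: str.isalnum() approximates the locale's test.
        is_word_char = str.isalnum
    return ["".join(run) for keep, run in groupby(text, key=is_word_char) if keep]
```

Callers that never pass word_elements see no change, while
extract_words("foo-bar baz", WordElements.NON_SPACE) treats "foo-bar"
as a single word.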

regards, tom lane
