From: Tom Lane on
Peter Eisentraut <peter_e(a)gmx.net> writes:
> Well, the comparison function varstr_cmp() contains this comment:

> /*
> * In some locales strcoll() can claim that nonidentical strings are
> * equal. Believing that would be bad news for a number of reasons,
> * so we follow Perl's lead and sort "equal" strings according to
> * strcmp().
> */

> This might not be strictly necessary, seeing that citext obviously
> doesn't work that way, but resolving this is really an orthogonal issue.

The problem with not doing that is it breaks hashing --- hash joins and
hash aggregation being the real pain points.
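For illustration, the tie-break that comment describes amounts to roughly
this (a minimal sketch, not the actual varstr_cmp() code):

    #include <string.h>

    static int
    collate_cmp(const char *a, const char *b)
    {
        int result = strcoll(a, b);

        /* strcoll() may report "equal" for nonidentical strings; fall
         * back to a bytewise comparison so the ordering stays total
         * and consistent. */
        if (result == 0)
            result = strcmp(a, b);
        return result;
    }

With the fallback, two strings compare equal only if they are bitwise
identical, so an ordinary hash of the bytes stays consistent with the
equality operator.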

citext works around this in a rather kludgy fashion by decreeing that two
strings are equal iff their str_tolower() conversions are bitwise equal.
So it can hash the str_tolower() representation. But that's kinda slow
and it fails in the general case anyhow, I think.
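
Roughly, the idea is to hash the case-folded form rather than the original
bytes, so anything the equality operator treats as equal hashes the same.
A hand-rolled sketch, not the citext source (the function name and the
ASCII-only folding are just for illustration):

    #include <ctype.h>
    #include <stdint.h>

    static uint32_t
    citext_hash_sketch(const char *s)
    {
        uint32_t h = 5381;      /* djb2-style accumulator, illustration only */

        /* Fold each byte before mixing it in.  ASCII-only here; the real
         * thing has to lowercase according to the locale, which is part
         * of why the approach is slow. */
        for (; *s; s++)
            h = h * 33 + (unsigned char) tolower((unsigned char) *s);
        return h;
    }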

regards, tom lane


From: Greg Stark on
On Thu, Jul 15, 2010 at 4:24 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> The problem with not doing that is it breaks hashing --- hash joins and
> hash aggregation being the real pain points.
>
> citext works around this in a rather kludgy fashion by decreeing that two
> strings are equal iff their str_tolower() conversions are bitwise equal.
> So it can hash the str_tolower() representation.  But that's kinda slow
> and it fails in the general case anyhow, I think.

I think the general equivalent would be to call strxfrm and hash the
result of that.
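
That is, build the sort key and hash its bytes instead of the string's.
A minimal sketch (the fixed-size buffer and the toy hash are
simplifications):

    #include <stdint.h>
    #include <string.h>

    static uint32_t
    hash_with_strxfrm(const char *s)
    {
        char        key[1024];  /* real code must retry with a larger buffer
                                 * when len >= sizeof(key) */
        size_t      len = strxfrm(key, s, sizeof(key));
        uint32_t    h = 5381;

        /* strcmp() on strxfrm() keys orders like strcoll() on the
         * originals, so strings strcoll() calls equal get identical
         * keys and therefore identical hashes. */
        for (size_t i = 0; i < len && i < sizeof(key); i++)
            h = h * 33 + (unsigned char) key[i];
        return h;
    }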



--
greg


From: Jaime Casanova on
On Tue, Jul 13, 2010 at 1:25 PM, Peter Eisentraut <peter_e(a)gmx.net> wrote:
> Here is a proof of concept for per-column collation support.
>

Hi,

I was looking at this.

Nowadays, CREATE DATABASE has an lc_collate clause. Is the new COLLATE
clause similar to lc_collate? I mean, is lc_collate what we will use as
the default?

If yes, then we probably need to use pg_collation there too, because
lc_collate and the new COLLATE clause use different collation names.
"""
postgres=# create database test with lc_collate 'en_US.UTF-8';
CREATE DATABASE
test=# create table t1 (col1 text collate "en_US.UTF-8");
ERROR: collation "en_US.UTF-8" does not exist
test=# create table t1 (col1 text collate "en_US.utf8");
CREATE TABLE
"""

Also, I got errors from the regression tests when MULTIBYTE=UTF8
(attached). It seems I was trying to create locales that weren't
defined in locales.txt (where is that file generated from?). I added a
line to that file (for es_EC.utf8), then created a table with a column
using that collation and executed "select * from t2 where col1 > 'n';",
and I got this error: "ERROR: could not create locale "es_EC.utf8""
(of course, that last part was me messing things up, but it shows we
shouldn't be using a locales.txt file, I think).

I can attach a collation to a domain, but I can't see where we are
storing that info (actually it says it's not collatable):

--
Jaime Casanova         www.2ndQuadrant.com
PostgreSQL support and training
From: Peter Eisentraut on
On Mon, 2010-08-02 at 01:43 -0500, Jaime Casanova wrote:
> Nowadays, CREATE DATABASE has an lc_collate clause. Is the new COLLATE
> clause similar to lc_collate? I mean, is lc_collate what we will use as
> the default?

Yes, if you do not specify anything per column, the database default is
used.

How to integrate the per-database or per-cluster configuration with the
new system is something to figure out in the future.

> If yes, then we probably need to use pg_collation there too, because
> lc_collate and the new COLLATE clause use different collation names.
> """
> postgres=# create database test with lc_collate 'en_US.UTF-8';
> CREATE DATABASE
> test=# create table t1 (col1 text collate "en_US.UTF-8");
> ERROR: collation "en_US.UTF-8" does not exist
> test=# create table t1 (col1 text collate "en_US.utf8");
> CREATE TABLE
> """

This is something that libc does for you. The locale as listed by
locale -a is called "en_US.utf8", but apparently libc takes
"en_US.UTF-8" as well.

> Also, I got errors from the regression tests when MULTIBYTE=UTF8
> (attached). It seems I was trying to create locales that weren't
> defined in locales.txt (where is that file generated from?). I added a
> line to that file (for es_EC.utf8), then created a table with a column
> using that collation and executed "select * from t2 where col1 > 'n';",
> and I got this error: "ERROR: could not create locale "es_EC.utf8""
> (of course, that last part was me messing things up, but it shows we
> shouldn't be using a locales.txt file, I think).

It might be that you don't have those locales installed in your system.
locales.txt is created by using locale -a. Check what that gives you.

> I can attach a collation to a domain, but I can't see where we are
> storing that info (actually it says it's not collatable):

Domain support is not done yet.


