From: Tom Lane on
Peter Eisentraut <peter_e(a)gmx.net> writes:
> Well, the comparison function varstr_cmp() contains this comment:

> /*
> * In some locales strcoll() can claim that nonidentical strings are
> * equal. Believing that would be bad news for a number of reasons,
> * so we follow Perl's lead and sort "equal" strings according to
> * strcmp().
> */

> This might not be strictly necessary, seeing that citext obviously
> doesn't work that way, but resolving this is really an orthogonal issue.

The problem with not doing that is it breaks hashing --- hash joins and
hash aggregation being the real pain points.
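For illustration, the tie-break that comment describes amounts to roughly
this (a minimal sketch, not the actual varstr_cmp() code):

    #include <string.h>

    static int
    collate_cmp(const char *a, const char *b)
    {
        int result = strcoll(a, b);

        /* strcoll() may report "equal" for nonidentical strings; fall
         * back to a bytewise comparison so the ordering stays total
         * and consistent. */
        if (result == 0)
            result = strcmp(a, b);
        return result;
    }

With the fallback, two strings compare equal only if they are bitwise
identical, so an ordinary hash of the bytes stays consistent with the
equality operator.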

citext works around this in a rather kludgy fashion by decreeing that two
strings are equal iff their str_tolower() conversions are bitwise equal.
So it can hash the str_tolower() representation. But that's kinda slow
and it fails in the general case anyhow, I think.
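
Roughly, the idea is to hash the case-folded form rather than the original
bytes, so anything the equality operator treats as equal hashes the same.
A hand-rolled sketch, not the citext source (the function name and the
ASCII-only folding are just for illustration):

    #include <ctype.h>
    #include <stdint.h>

    static uint32_t
    citext_hash_sketch(const char *s)
    {
        uint32_t h = 5381;      /* djb2-style accumulator, illustration only */

        /* Fold each byte before mixing it in.  ASCII-only here; the real
         * thing has to lowercase according to the locale, which is part
         * of why the approach is slow. */
        for (; *s; s++)
            h = h * 33 + (unsigned char) tolower((unsigned char) *s);
        return h;
    }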

regards, tom lane


From: Greg Stark on
On Thu, Jul 15, 2010 at 4:24 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> The problem with not doing that is it breaks hashing --- hash joins and
> hash aggregation being the real pain points.
>
> citext works around this in a rather kludgy fashion by decreeing that two
> strings are equal iff their str_tolower() conversions are bitwise equal.
> So it can hash the str_tolower() representation.  But that's kinda slow
> and it fails in the general case anyhow, I think.

I think the general equivalent would be to call strxfrm and hash the
result of that.
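
That is, build the sort key and hash its bytes instead of the string's.
A minimal sketch (the fixed-size buffer and the toy hash are
simplifications):

    #include <stdint.h>
    #include <string.h>

    static uint32_t
    hash_with_strxfrm(const char *s)
    {
        char        key[1024];  /* real code must retry with a larger buffer
                                 * when len >= sizeof(key) */
        size_t      len = strxfrm(key, s, sizeof(key));
        uint32_t    h = 5381;

        /* strcmp() on strxfrm() keys orders like strcoll() on the
         * originals, so strings strcoll() calls equal get identical
         * keys and therefore identical hashes. */
        for (size_t i = 0; i < len && i < sizeof(key); i++)
            h = h * 33 + (unsigned char) key[i];
        return h;
    }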



--
greg


From: Jaime Casanova on
On Tue, Jul 13, 2010 at 1:25 PM, Peter Eisentraut <peter_e(a)gmx.net> wrote:
> Here is a proof of concept for per-column collation support.
>

Hi,

I was looking at this.

Nowadays, CREATE DATABASE has an lc_collate clause. Is the new COLLATE
clause similar to lc_collate? I mean, is lc_collate what we will use as
the default?

If yes, then we probably need to use pg_collation there too, because
lc_collate and the new COLLATE clause use different collation names.
"""
postgres=# create database test with lc_collate 'en_US.UTF-8';
CREATE DATABASE
test=# create table t1 (col1 text collate "en_US.UTF-8");
ERROR: collation "en_US.UTF-8" does not exist
test=# create table t1 (col1 text collate "en_US.utf8");
CREATE TABLE
"""

Also, I got errors from the regression tests when MULTIBYTE=UTF8
(attached). It seems I was trying to create locales that weren't
defined in locales.txt (where is that file generated from?). I added a
line to that file (for es_EC.utf8), then created a table with a column
using that collation and executed "select * from t2 where col1 > 'n';",
and I got this error: "ERROR: could not create locale "es_EC.utf8""
(of course, that last part was me messing things up, but it shows we
shouldn't be using a locales.txt file, I think).

I can attach a collation to a domain, but I can't see where we are
storing that info (actually it says it's not collatable):

--
Jaime Casanova         www.2ndQuadrant.com
PostgreSQL support and training
From: Peter Eisentraut on
On Mon, 2010-08-02 at 01:43 -0500, Jaime Casanova wrote:
> Nowadays, CREATE DATABASE has an lc_collate clause. Is the new COLLATE
> clause similar to lc_collate? I mean, is lc_collate what we will use as
> the default?

Yes, if you do not specify anything per column, the database default is
used.

How to integrate the per-database or per-cluster configuration with the
new system is something to figure out in the future.

> If yes, then we probably need to use pg_collation there too, because
> lc_collate and the new COLLATE clause use different collation names.
> """
> postgres=# create database test with lc_collate 'en_US.UTF-8';
> CREATE DATABASE
> test=# create table t1 (col1 text collate "en_US.UTF-8");
> ERROR: collation "en_US.UTF-8" does not exist
> test=# create table t1 (col1 text collate "en_US.utf8");
> CREATE TABLE
> """

This is something that libc does for you. The locale as listed by
locale -a is called "en_US.utf8", but apparently libc takes
"en_US.UTF-8" as well.

> Also, I got errors from the regression tests when MULTIBYTE=UTF8
> (attached). It seems I was trying to create locales that weren't
> defined in locales.txt (where is that file generated from?). I added a
> line to that file (for es_EC.utf8), then created a table with a column
> using that collation and executed "select * from t2 where col1 > 'n';",
> and I got this error: "ERROR: could not create locale "es_EC.utf8""
> (of course, that last part was me messing things up, but it shows we
> shouldn't be using a locales.txt file, I think).

It might be that you don't have those locales installed in your system.
locales.txt is created by using locale -a. Check what that gives you.

> I can attach a collation to a domain, but I can't see where we are
> storing that info (actually it says it's not collatable):

Domain support is not done yet.


