How to pass around collation information [PgSql]

From: alvherre on 28 May 2010 13:05

Excerpts from Peter Eisentraut's message of Fri May 28 12:27:52 -0400 2010:

> Option 2, invent some new mechanism that accompanies a datum or a type
> wherever it goes. Kind of like typmod, but not really. Then the
> collation information would presumably be made available to functions
> through the fmgr interface. The binary representation of data values
> stays the same.

Is the collation a property of the datum, or of the comparison?
If the latter, should it really be made a sidecar of a datum, or
would it make more sense to attach it to the operation being performed?

I wonder if instead of trying to pass it down multiple layers till
bttextcmp and further down, it would make more sense to set a global
variable somewhere in the high levels, and only have it checked in
varstr_cmp.
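
A minimal sketch of the global-variable idea in C, under loud assumptions:
the CollationId enum, the toy case-insensitive ordering, and the toy_*
functions are all hypothetical stand-ins invented for illustration, not
PostgreSQL code. It only shows the shape: the upper level sets a global,
and the lowest-level comparison is the only place that reads it.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical collation identifiers, invented for this sketch. */
typedef enum { COLL_BYTEWISE, COLL_CASE_INSENSITIVE } CollationId;

/* Set by high-level code before any comparison runs. */
CollationId current_collation = COLL_BYTEWISE;

/* Toy case-insensitive ordering standing in for a real locale. */
static int cmp_case_insensitive(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb || ca == '\0')
            return ca - cb;
    }
}

/* Stand-in for varstr_cmp: the only function that reads the global. */
int toy_varstr_cmp(const char *a, const char *b)
{
    if (current_collation == COLL_CASE_INSENSITIVE)
        return cmp_case_insensitive(a, b);
    return strcmp(a, b);
}

/* Stand-in for bttextcmp: its signature stays untouched. */
int toy_bttextcmp(const char *a, const char *b)
{
    return toy_varstr_cmp(a, b);
}

/* Self-check: the same call gives different answers as the global changes. */
void demo_global_collation(void)
{
    current_collation = COLL_BYTEWISE;
    assert(toy_bttextcmp("abc", "ABC") != 0);

    current_collation = COLL_CASE_INSENSITIVE;
    assert(toy_bttextcmp("abc", "ABC") == 0);
}
```

In this sketch, an operation that needs two collations (say a WHERE
clause on one column and an ORDER BY on another) has to switch the
global between comparisons, since no call carries its own collation.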

--
Álvaro Herrera <alvherre(a)commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Heikki Linnakangas on 28 May 2010 13:22

On 28/05/10 19:27, Peter Eisentraut wrote:
> I have been thinking about this collation support business a bit.
> Ignoring for the moment where we would get the actual collation routines
> from, I wonder how we are going to pass this information around in the
> system. Someone declares a collation on a column in a table, and
> somehow this information needs to arrive in bttextcmp() and friends.

Yes. Comparison operators need it, as do functions like isalpha().

> Also, functions that take in a string and return one (e.g., substring),
> need to take in this information and return it back out. How should
> this work?

Hmm, I don't see what substring would need collation for. And it
certainly shouldn't be returning it. Collation is a property of the
comparison operators (and isalpha etc.), and the planner needs to deduce
the right collation for each such operation in the query. That involves
looking at the tables and columns involved, as well as per-user
information and any explicit COLLATE clauses in the query, but all that
happens at plan-time.

> Option 1, make it part of the datum. That way it will pass through the
> system just fine, but it would waste a lot of storage and break just
> about everything that operates on string types now, as well as
> pg_upgrade. So that's probably out.

It's also fundamentally wrong, collation is not a property of a datum
but of the operation.

> Option 2, invent some new mechanism that accompanies a datum or a type
> wherever it goes. Kind of like typmod, but not really. Then the
> collation information would presumably be made available to functions
> through the fmgr interface. The binary representation of data values
> stays the same.

Something like that. I'm thinking that bttextcmp() and friends will
simply take an extra argument indicating the collation, and we'll teach
the operator / operator class infrastructure about that too.
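
As a rough illustration of the extra-argument idea, here is a sketch in
C. Everything in it is an assumption made for the example: the
CollationId enum, the toy case-insensitive ordering, and the *_coll
function names are hypothetical and do not describe the real
bttextcmp() or varstr_cmp() signatures.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical collation identifiers, invented for this sketch. */
typedef enum { COLL_BYTEWISE, COLL_CASE_INSENSITIVE } CollationId;

/* Toy case-insensitive ordering standing in for a real locale. */
static int cmp_case_insensitive(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb || ca == '\0')
            return ca - cb;
    }
}

/* Stand-in for varstr_cmp with the collation as an explicit parameter. */
int toy_varstr_cmp_coll(const char *a, const char *b, CollationId coll)
{
    if (coll == COLL_CASE_INSENSITIVE)
        return cmp_case_insensitive(a, b);
    return strcmp(a, b);
}

/* Stand-in for bttextcmp: the collation travels as a third argument. */
int toy_bttextcmp_coll(const char *a, const char *b, CollationId coll)
{
    return toy_varstr_cmp_coll(a, b, coll);
}

/* Self-check: two collations used side by side, with no shared state. */
void demo_collation_argument(void)
{
    assert(toy_bttextcmp_coll("abc", "ABC", COLL_BYTEWISE) != 0);
    assert(toy_bttextcmp_coll("abc", "ABC", COLL_CASE_INSENSITIVE) == 0);
}
```

The caller (in the real system, presumably whatever the planner hands to
the executor) decides the third argument; the comparison itself stays a
pure function of its inputs.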

One way to approach this is to realize that it's already possible to use
multiple collations in a database. You just have to define separate <, =,
and > operators and operator classes for every collation, and change all
your queries to use the right operator for the desired collation
everywhere you use <, =, or > (including ORDER BYs, with the
USING <operator> syntax). The behavior is exactly what we want; it's
just completely impractical, so we need something that does the same in a
less cumbersome way.
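
A loose C analogy for the one-operator-per-collation workaround, offered
as a sketch only: each toy "collation" gets its own qsort comparator, and
every caller has to name the right one, much as every query would have to
name the right operator. The comparator names and the toy
case-insensitive ordering are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* qsort hands us pointers to array elements; each element is a string. */
static const char *as_string(const void *elem)
{
    return *(const char *const *) elem;
}

/* "Operator" for the bytewise collation. */
int cmp_bytewise(const void *pa, const void *pb)
{
    return strcmp(as_string(pa), as_string(pb));
}

/* "Operator" for a toy case-insensitive collation. */
int cmp_case_insensitive(const void *pa, const void *pb)
{
    const char *a = as_string(pa);
    const char *b = as_string(pb);

    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb || ca == '\0')
            return ca - cb;
    }
}

/* The caller must pick the comparator, like ORDER BY ... USING <operator>. */
void sort_names(const char **names, size_t n,
                int (*cmp) (const void *, const void *))
{
    qsort(names, n, sizeof names[0], cmp);
}

/* Self-check: the same data sorts differently under each comparator. */
void demo_per_collation_operators(void)
{
    const char *names[] = {"banana", "Cherry", "apple"};

    sort_names(names, 3, cmp_bytewise);
    assert(strcmp(names[0], "Cherry") == 0);

    sort_names(names, 3, cmp_case_insensitive);
    assert(strcmp(names[0], "apple") == 0);
}
```

The behavior is right, but nothing ties a comparator to the data it is
meant for, which is the impractical part.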

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From: Peter Eisentraut on 28 May 2010 13:59

On Fri, 2010-05-28 at 20:22 +0300, Heikki Linnakangas wrote:
> It's also fundamentally wrong, collation is not a property of a datum
> but of the operation.

> One way to approach this is to realize that it's already possible to use
> multiple collations in a database. You just have to define separate < = >
> operators and operator classes for every collation, and change all
> your queries to use the right operator depending on the desired
> collation everywhere where you use < = > (including ORDER BYs, with the
> USING <operator> syntax). The behavior is exactly what we want, it's
> just completely impractical, so we need something to do the same in a
> less cumbersome way.

Well, maybe we should step back a little and work out what sort of
feature we actually want, if any. The feature I'm thinking of is what
people might call "per-column locale", and the SQL standard defines
that. It would look like this:

CREATE TABLE test (
    a varchar COLLATE de,
    b varchar COLLATE fr
);

SELECT * FROM test WHERE a > 'baz' ORDER BY b;

So while it's true that the collation is used by the operations (> and
ORDER BY), the information which collation to use comes with the data
values. It's basically saying, a is in language "de", so sort it like
that unless told otherwise. There is also an override syntax available,
like this:

SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;

But here again the collation is attached to a data value, and only from
there it is passed to the operator. What is actually happening is

SELECT * FROM test WHERE (a COLLATE en) > 'baz' ORDER BY (b COLLATE sv);

What you appear to be describing is a "per-operation locale", which also
sounds valid, but it would be a different thing. It might be thought of
as this:

SELECT * FROM test WHERE a (> COLLATE en) 'baz' ORDER BY COLLATE sv b;

with some suitable global default.

So which one of these should it be?

From: Tom Lane on 28 May 2010 14:48

Peter Eisentraut <peter_e(a)gmx.net> writes:
> So while it's true that the collation is used by the operations (> and
> ORDER BY), the information which collation to use comes with the data
> values. It's basically saying, a is in language "de", so sort it like
> that unless told otherwise. There is also an override syntax available,
> like this:

> SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;

That seems fairly bizarre. What does this mean:

WHERE a COLLATE en > b COLLATE de

? If it's an error, why is this not an error

WHERE a COLLATE en > b

if b is marked as COLLATE de in its table?

I guess the more general question is whether the spec expects that
collation settings can be derived statically (like type information)
or whether they might sometimes only be known at runtime.

We also need to think about whether we're okay with only applying
collation to built-in types (text, varchar, char) or whether we need
the feature to work for add-on types as well. In particular, is citext
still a meaningful feature if we have this, or is it superseded by
COLLATE? In the abstract I'd prefer to let it work for user-defined
types, but if we can have a much simpler implementation by not doing
so, it might be better to give that up.

Is COLLATE a property that can be attached to a domain over text?

regards, tom lane

From: Robert Haas on 28 May 2010 15:03

On Fri, May 28, 2010 at 2:48 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Peter Eisentraut <peter_e(a)gmx.net> writes:
>> So while it's true that the collation is used by the operations (> and
>> ORDER BY), the information which collation to use comes with the data
>> values. It's basically saying, a is in language "de", so sort it like
>> that unless told otherwise. There is also an override syntax available,
>> like this:
>
>> SELECT * FROM test WHERE a COLLATE en > 'baz' ORDER BY b COLLATE sv;
>
> That seems fairly bizarre. What does this mean:
>
> WHERE a COLLATE en > b COLLATE de
>
> ? If it's an error, why is this not an error
>
> WHERE a COLLATE en > b
>
> if b is marked as COLLATE de in its table?

I think we need to think of the comparison operators as ternary, and
the COLLATE syntax applied to columns or present in queries as various
ways of setting defaults or explicit overrides for what the third
argument will end up being.
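
To make the "third argument" view concrete, here is a small C sketch of
how that argument could be resolved statically from the two operands.
The rules it encodes are assumptions chosen for illustration, not
something this thread has settled: an explicit COLLATE beats a column's
declared collation, two different collations of equal strength are a
conflict, and operands with no collation fall back to a default. The
type and function names, and the "C" default used in the self-check,
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* How an operand came by its collation, weakest first. */
typedef enum { DERIV_NONE, DERIV_IMPLICIT, DERIV_EXPLICIT } Derivation;

typedef struct
{
    const char *collation;      /* e.g. "de"; NULL when derivation is NONE */
    Derivation  derivation;
} OperandCollation;

/*
 * Pick the collation to hand to a comparison as its third argument.
 * Returns NULL and sets *conflict to 1 when the operands disagree
 * at the same strength; otherwise sets *conflict to 0.
 */
const char *resolve_collation(OperandCollation left, OperandCollation right,
                              const char *fallback, int *conflict)
{
    *conflict = 0;

    if (left.derivation == DERIV_NONE && right.derivation == DERIV_NONE)
        return fallback;

    /* Different strengths: the stronger derivation wins. */
    if (left.derivation != right.derivation)
        return left.derivation > right.derivation ? left.collation
                                                  : right.collation;

    /* Same strength: the names must agree. */
    if (strcmp(left.collation, right.collation) != 0)
    {
        *conflict = 1;
        return NULL;
    }
    return left.collation;
}

/* Self-check using the two expressions questioned earlier in the thread. */
void demo_resolve_collation(void)
{
    int conflict;
    OperandCollation a_en = {"en", DERIV_EXPLICIT};        /* a COLLATE en */
    OperandCollation b_de = {"de", DERIV_EXPLICIT};        /* b COLLATE de */
    OperandCollation b_column = {"de", DERIV_IMPLICIT};    /* b declared de */

    /* a COLLATE en > b COLLATE de: conflict under the assumed rules. */
    assert(resolve_collation(a_en, b_de, "C", &conflict) == NULL);
    assert(conflict == 1);

    /* a COLLATE en > b: the explicit override wins. */
    assert(strcmp(resolve_collation(a_en, b_column, "C", &conflict), "en") == 0);
    assert(conflict == 0);
}
```

Under these assumed rules the resolution depends only on the expression
tree and the column definitions, so it could be done once at plan time.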

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
