[GENERAL] trouble with to_char('L') [PgSql]

Prev: Performance Patches Was: [HACKERS] Lock Wait Statistics (next commitfest)
Next: contrib/xml2 pfree bug

From: Magnus Hagander on 16 Apr 2010 06:52

On Mon, Mar 22, 2010 at 9:14 PM, Bruce Momjian <bruce(a)momjian.us> wrote:
> Takahiro Itagaki wrote:
>>
>> Bruce Momjian <bruce(a)momjian.us> wrote:
>>
>> > Takahiro Itagaki wrote:
>> > > Since 9.0 has GetPlatformEncoding() for the purpose, we could simplify
>> > > db_encoding_strdup() with the function. Like this:
>> >
>> > OK, I don't have any Win32 people testing this patch so if we want this
>> > fixed for 9.0 someone is going to have to test my patch to see that it
>> > works. Can you make the adjustments suggested above to my patch and
>> > test it to see that it works so we can apply it for 9.0?
>>
>> Here is a full patch that can be applied cleanly to HEAD.
>> Can anyone test it on Windows?
>>
>> I'm not sure why temporary changes of lc_ctype was required in the
>> original patch. The codes are not included in my patch, but please
>> notice me it is still needed.
>
> Sorry for the delay in replying to you.
>
> I considered your idea of using the existing Postgres encoding
> conversion routines to do the conversion of localenv() strings, but
> found two problems.
>
> First, GetPlatformEncoding() caches its result, so it assumes the
> LC_CTYPE never changes for the server, while fixing this issue actually
> requires us to change LC_CTYPE. We could avoid the caching but that
> then involves complex table lookups, etc, which seems overly complex:
>
> + /* convert the string to the database encoding */
> + pstr = (char *) pg_do_encoding_conversion(
> + (unsigned char *) str, strlen(str),
> + GetPlatformEncoding(), GetDatabaseEncoding());
>
> Second, having our backend routines do the conversion seems wrong
> because it is possible for someone to set LC_MONETARY to an encoding
> that our database does not understand, e.g. UTF16, but one that WIN32
> can convert to a valid encoding.
>
> The reason we are doing all this is because of this updated comment in
> my patch:
>
> ftp://momjian.us/pub/postgresql/mypatches/pg_locale
>
> + * Ideally, monetary and numeric local symbols could be returned in
> + * any server encoding. Unfortunately, the WIN32 API does not allow
> + * setlocale() to return values in a codepage/CTYPE that uses more
> + * than two bytes per character, like UTF-8:
> + *
> + * http://msdn.microsoft.com/en-us/library/x99tb11d.aspx
> + *
> + * Evidently, LC_CTYPE allows us to control the encoding used
> + * for strings returned by localeconv(). The Open Group
> + * standard, mentioned at the top of this C file, doesn't
> + * explicitly state this.
> + *
> + * Therefore, we set LC_CTYPE to match LC_NUMERIC and
> + * LC_MONETARY, call localeconv(), and use mbstowcs() to
> + * convert the locale-aware string, e.g. Euro symbol (which
> + * is not in UTF-8), to the server encoding.
>
> One new idea would be to set LC_CTYPE to UTF16/widechars unconditionally
> on Win32 and then just convert that always to the server encoding with
> win32_wchar_to_db_encoding(), instead of using the encoding from
> LC_MONETARY to set LC_CTYPE and having to do double-conversion.

So, hugely late, reviving this thread.

Ideally, we should definitely consider doing that. Internally, Windows
will do it in UTF16 anyway. So we're basically doing
UTF16->db->UTF16->UTF8->db or something like that with this patch.

But I'm unsure how that would work. We're talking about the output of
localeconv(), right? I don't see a version of localeconv() that does
wide chars anywhere. (You can't just set LC_CTYPE and use the regular
function - Windows has a separate set of functions for dealing with
UTF16).

Looking at the patch, you're passing "item" to db_encoding_strdup()
but it doesn't seem to be used anywhere. Leftover from previous
experiments, or forgot to use it? Perhaps you intended for it to be in
the error messages?

Also, won't this need special-casing for UTF8? Per comment in
mbutils.c, wcstombs() doesn't work for UTF8 encodings - you need to
use MultiByteToWideChar().

I also note that we have char2wchar() already - we should perhaps just
call that? Or will that use the wrong locale?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Bruce Momjian on 20 Apr 2010 09:10

Magnus Hagander wrote:
> > One new idea would be to set LC_CTYPE to UTF16/widechars unconditionally
> > on Win32 and then just convert that always to the server encoding with
> > win32_wchar_to_db_encoding(), instead of using the encoding from
> > LC_MONETARY to set LC_CTYPE and having to do double-conversion.
>
> So, hugely late, reviving this thread.
>
> Ideally, we should definitely consider doing that. Internally, Windows
> will do it in UTF16 anyway. So we're basically doing
> UTF16->db->UTF16->UTF8->db or something like that with this patch.
>
> But I'm unsure how that would work. We're talking about the output of
> localeconv(), right? I don't see a version of localeconv() that does
> wide chars anywhere. (You can't just set LC_CTYPE and use the regular
> function - Windows has a separate set of functions for dealing with
> UTF16).

I thought there was an LC_CTYPE for UTF16 that we could use without a
wide version of that function. If not, forget that idea.

> Looking at the patch, you're passing "item" to db_encoding_strdup()
> but it doesn't seem to be used anywhere. Leftover from previous
> experiments, or forgot to use it? Perhaps you intended for it to be in
> the error messages?

It originally was in the error message but can be removed. I have now
removed 'item' from my version of the patch.

> Also, won't this need special-casing for UTF8? Per comment in
> mbutils.c, wcstombs() doesn't work for UTF8 encodings - you need to
> use MultiByteToWideChar().

Well, we don't support UTF8 for any of the non-encoding locales, e.g.
monetary, numeric, so I never considered that we would support it. If
we did support it, we would have to _pick_ a locale that is <= 2 bytes
per character and use that, and then convert to UTF8, but what locale
would we pick? They could use a LC_TYPE that is <= 2 bytes and a
numeric that is UTF8, but I never suspected we would want to support
that, and we would need some logic to detect that case.

> I also note that we have char2wchar() already - we should perhaps just
> call that? Or will that use the wrong locale?

I see char2wchar() calling GetDatabaseEncoding() right away, which does
use the cached value for the server encoding, so I don't think it will
work. We can't use our existing routines to convert _from_ the current
encoding to wide characters (because our numeric encoding might not
match the server encoding). However, we can use existing code that
converts from wide to the server encoding, perhaps replacing
win32_wchar_to_db_encoding().

--
Bruce Momjian <bruce(a)momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Bruce Momjian on 20 Apr 2010 09:38

Magnus Hagander wrote:
> > Another idea is to use GetLocaleInfoW() [1], that is win32 native locale
> > functions, instead of the libc one. It returns locale characters in wide
> > chars, so we can safely convert them as UTF16->UTF8->db. But it requires
> > an additional branch in our locale codes only for Windows.
>
> If we can go UTF16->db directly, it might be a good idea. If we're
> going via UTF8 anyway, I doubt it's going to be worth it.
>
> Let's work off what we have now to start with at least. Bruce, can you
> comment on that thing about the extra parameter? And UTF8?

I do like the idea of using UTF16 directly because that would eliminate
our need to even set LC_CTYPE for Win32 in this routine. That would
also eliminate any need to refer to the encoding for numeric/monetary,
so we could get rid of the odd case where their encoding is UTF8 but
their numeric/monetary locale settings have to use a non-UTF8 encoding.
For example, the original bug report has these locale settings:

http://archives.postgresql.org/pgsql-general/2009-04/msg00829.php

psql (PostgreSQL) 8.3.7

server_version 8.3.7
server_encoding UTF8
client_encoding win1252
lc_numeric Finnish, Finland
lc_monetary Finnish, Finland

but really needed to use "Finnish_Finland.1252":

http://archives.postgresql.org/pgsql-general/2009-04/msg00859.php

However, I noticed that both lc_collate and lc_ctype are set to
Finnish_Finland.1252 by the installer. Should I have just run initdb
with --locale fi_FI.UTF8 at the very start? The to_char('L') works
fine with a database with win1252 encoding.

Of course, that still does not work with our current CVS code if the
database encoding is UTF8, which is what we are trying to fix now.

I am not even sure how users set these things properly but I assume the
installer does all that magic. And, of course, if someone manually runs
initdb on Windows, they can easily set things wrong.

Magnus, if I remember correctly, all our non-UTF8 to UTF8 conversion
already has to pass through UTF16 as an intermediary case, so going to
UTF16 directly seems fine.

--
Bruce Momjian <bruce(a)momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Takahiro Itagaki on 20 Apr 2010 21:50

Hiroshi Inoue <inoue(a)tpf.co.jp> wrote:

> 1. How does it work when LC_MONETARY and LC_NUMERIC are different?

I think it is rarely used, but possible. Fixed.

> 2. Calling db_encoding_strdup() for lconv->grouping is appropriate?

Ah, we didn't need it. Removed.

Revised patch attached. Please test it.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

From: Takahiro Itagaki on 21 Apr 2010 22:00

Takahiro Itagaki <itagaki.takahiro(a)oss.ntt.co.jp> wrote:

> Revised patch attached. Please test it.

I applied this version of the patch.
Please check wheter the bug is fixed and any buildfarm failures.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: Performance Patches Was: [HACKERS] Lock Wait Statistics (next commitfest)
Next: contrib/xml2 pfree bug