From: Pete Becker on 18 Jun 2006 01:31

Alf P. Steinbach wrote:
> As a practical matter consider wide character literals.
>
>   FooW( L"Hello, hello?" );
>
> where FooW is some API function.
>
> If wchar_t isn't what the API level expects, this won't compile.

Why should it compile? The idiomatic way to write this is

  Foo(_T("Hello, hello?"));

where _T is the MS macro that turns the quoted string into the appropriate type for the API. Once you abandon that indirection layer you're locked into a specific choice of type. As a result, every Win32 application that does wide character handling has to use 16-bit wide characters, or use a non-standard library to take advantage of the simpler code for wider characters. There are many places other than OS calls where wide characters can be used, and the OS should be only one factor in choosing an appropriate type.

--
Pete Becker
Roundhouse Consulting, Ltd.
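[For readers unfamiliar with the idiom Pete is describing, here is a minimal sketch. FooA and FooW are hypothetical stand-ins for a Win32-style A/W function pair; the UNICODE/_T plumbing mirrors what the real SDK headers do:]

    // Sketch of the TCHAR idiom. FooA/FooW are hypothetical stand-ins for
    // a Win32-style function pair; _T and the UNICODE switch are real.
    #include <cstdio>
    #include <cwchar>
    #include <tchar.h>   // MS header providing _T (and TCHAR)

    void FooA(const char* s)    { std::printf("narrow: %s\n", s); }
    void FooW(const wchar_t* s) { std::wprintf(L"wide: %ls\n", s); }

    #ifdef UNICODE
        #define Foo FooW  // Unicode build: Foo resolves to the wide variant
    #else
        #define Foo FooA  // ANSI build: Foo resolves to the narrow variant
    #endif

    int main() {
        // _T gives the literal whichever type matches the chosen variant,
        // so this one line compiles in both builds.
        Foo(_T("Hello, hello?"));
        return 0;
    }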
From: Pete Becker on 18 Jun 2006 09:34

Eugene Gershnik wrote:
> Eugene Gershnik wrote:
>
>> Well here is AFAIK correctly formed Unicode sequence that means é:
>>
>> U+0065 U+0301
>>
>> All my editors seem to agree with that.
>
> And unintentionally this was also a good demonstration of how broken
> modern software is with regards to Unicode. My NNTP client (Microsoft
> Outlook Express) had correctly shown the character as é while editing
> but transmitted it as e followed by the bare accent, as you can see
> above. This is despite being a Windows-only application that
> presumably uses UTF-16 wchar_t internally.

That seems like how it ought to work. U+0065 is LATIN SMALL LETTER E, and U+0301 is COMBINING ACUTE ACCENT. They're two distinct characters, which is why they're written that way and transmitted that way. For display, they combine to represent the single glyph that the editor shows. If you want that glyph to be represented by a single character you have to canonicalize the character sequence, and replace U+0065 U+0301 with U+00E9, LATIN SMALL LETTER E WITH ACUTE. That is, there are two different representations for that glyph; one that consists of two code points, and one that consists of one.

--
Pete Becker
Roundhouse Consulting, Ltd.
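[A minimal illustration of the two representations Pete names: they denote the same glyph, but a plain code-unit comparison treats them as different strings, and only canonicalization (e.g. to NFC) would make them compare equal:]

    #include <cwchar>
    #include <cstdio>

    int main() {
        const wchar_t decomposed[]  = L"e\u0301"; // U+0065 U+0301: e + combining acute
        const wchar_t precomposed[] = L"\u00E9";  // U+00E9: precomposed é

        // Same glyph on screen, different strings to the program.
        std::printf("equal as code units? %s\n",
                    std::wcscmp(decomposed, precomposed) == 0 ? "yes" : "no"); // "no"
        return 0;
    }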
From: Kirit Sælensminde on 18 Jun 2006 09:41

Pete Becker wrote:
> Why should it compile? The idiomatic way to write this is
>
>   Foo(_T("Hello, hello?"));
>
> where _T is the MS macro that turns the quoted string into the
> appropriate type for the API. Once you abandon that indirection layer
> you're locked into a specific choice of type. As a result, every Win32
> application that does wide character handling has to use 16-bit wide
> characters, or use a non-standard library to take advantage of the
> simpler code for wider characters. There are many places other than OS
> calls where wide characters can be used, and the OS should be only one
> factor in choosing an appropriate type.

Microsoft has long been pushing this as a way of being able to compile something with or without Unicode, but the problem is that it doesn't quite work. It's fine so long as all of your strings are US ASCII, but what if you wanted to put an Æ ligature in there? It may work or it may not, depending on a whole raft of other things.

To my mind there is absolutely no reason why anybody should be writing new code that takes narrow strings other than where protocol issues force them to (at least on Windows). In those few circumstances it'll force you to confront the encoding issues, and you're much more likely to get them right.

If you're using narrow character sequences in your code then I, as the person who _runs_ the software, get to decide how your string was encoded. Not you, as the person who _wrote_ it. Using wide character strings is the only choice you have if you actually want to control what is in the string. A sketch of the difference follows below.

It's a wonder to me that any of it works at all.

K
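[A small probe of Kirit's point, as a sketch: the bytes a narrow literal produces depend on whatever source and execution character sets the compiler happens to use, while a wide literal pins down the code point. The output varies between toolchains, and that variability is exactly the problem being described:]

    #include <cstdio>

    int main() {
        const char narrow[] = "Æ";        // byte value(s) depend on the execution charset
        const wchar_t wide[] = L"\u00C6"; // LATIN CAPITAL LETTER AE, a fixed code point

        // e.g. one byte 0xC6 under Latin-1/CP1252, two bytes 0xC3 0x86 under UTF-8
        std::printf("narrow literal is %u byte(s):", (unsigned)(sizeof narrow - 1));
        for (unsigned i = 0; i + 1 < sizeof narrow; ++i)
            std::printf(" 0x%02X", (unsigned char)narrow[i]);
        std::printf("\nwide literal code point: U+%04X\n", (unsigned)wide[0]);
        return 0;
    }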
From: Alf P. Steinbach on 19 Jun 2006 07:05

{This thread is drifting too far off topic, follow-ups are likely to be rejected unless they include Standard C++ relevant content. -mod/fwg}

* Pete Becker:
> Alf P. Steinbach wrote:
>> As a practical matter consider wide character literals.
>>
>>   FooW( L"Hello, hello?" );
>>
>> where FooW is some API function.
>>
>> If wchar_t isn't what the API level expects, this won't compile.
>
> Why should it compile? The idiomatic way to write this is
>
>   Foo(_T("Hello, hello?"));

I'm sorry, no. I'm not sure how topical this is, but we're talking about a library (the Windows API) where most functions come in two versions: a char-based one, and a wchar_t-based one. The _T macro prepends an L prefix to the string literal, or not, which is a way of supporting use of only char-based or only wchar_t-based functions. When the function in question, such as FooW, is only available in the wchar_t-based variant, using _T will yield a compilation error when _T is defined as not prepending L to the literal.

[snip]

> Once you abandon that indirection layer you're locked into a specific
> choice of type.

I'm sorry, no. There is no indirection layer (an indirection layer would be what many higher level libraries provide through smart string classes). There is however a C/C++ "choice" layer, if a set of macros might be called a layer, choosing between only wchar_t or only char, which does not work for functions that come in only one variant, and which in modern versions of the OS has no benefits; in particular, the choice layer does not remove the coupling between wchar_t and a particular OS-defined size.

[snip]

> simpler code for wider characters. There are many places other than OS
> calls where wide characters can be used, and the OS should be only one
> factor in choosing an appropriate type.

Yes.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
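[A simplified sketch of the failure mode Alf describes; the real <tchar.h> goes through an extra __T indirection, but the effect is the same, and FooW is again a hypothetical wide-only function:]

    #ifdef _UNICODE
        #define _T(x) L##x   // prepend L: _T("abc") becomes L"abc"
    #else
        #define _T(x) x      // no prefix: _T("abc") stays "abc"
    #endif

    void FooW(const wchar_t*) {}  // hypothetical function with no narrow variant

    int main() {
        FooW(_T("Hello"));  // compiles with _UNICODE defined; without it the
                            // argument is a const char[6] and the call is an error
        return 0;
    }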
From: Eugene Gershnik on 19 Jun 2006 07:12
Pete Becker wrote:
> Eugene Gershnik wrote:
>> Eugene Gershnik wrote:
>>
>>> Well here is AFAIK correctly formed Unicode sequence that means é:
>>>
>>> U+0065 U+0301
>>>
>>> All my editors seem to agree with that.
>>
>> And unintentionally this was also a good demonstration of how broken
>> modern software is with regards to Unicode. My NNTP client
>> (Microsoft Outlook Express) had correctly shown the character as é
>> while editing but transmitted it as e followed by the bare accent,
>> as you can see above. This is despite being a Windows-only
>> application that presumably uses UTF-16 wchar_t internally.
>
> That seems like how it ought to work. U+0065 is LATIN SMALL LETTER E,
> and U+0301 is COMBINING ACUTE ACCENT. They're two distinct characters,
> which is why they're written that way and transmitted that way.

Not at all. Outlook Express informs me that my message was transmitted in what it calls "Western European (ISO)" encoding (presumably ISO 8859-1). How the Unicode sequence in its editor is converted to this encoding is up to the application, but a reasonable user expectation is that what looks like é should be transmitted as é. Instead OE transmitted the sequence as two distinct characters, e and the accent. This is *not* how it is supposed to work. What is supposed to happen is that an application canonicalizes the string prior to doing encoding conversions. Which it obviously didn't.

> For display, they combine to represent the single glyph that the
> editor shows. If you want that glyph to be represented by a single
> character you have to canonicalize the character sequence,

I in this context am the *user* of the application. I type characters in my WYSIWYG editor and press the "Send" button. I am not supposed to know what canonicalization is, much less to do it manually. It is the application which is supposed to do it transparently for me. If it doesn't, it is broken.

--
Eugene
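[A sketch of the canonicalization step Eugene says the application should perform before converting to a legacy encoding: compose U+0065 U+0301 into the single code point U+00E9 (NFC). This uses the Win32 NormalizeString API, which only shipped later, with Windows Vista (link with Normaliz.lib); ICU's normalization functions are the portable alternative:]

    #include <windows.h>
    #include <cstdio>

    int main() {
        const wchar_t decomposed[] = L"e\u0301"; // e + COMBINING ACUTE ACCENT
        wchar_t nfc[8] = {0};

        int len = NormalizeString(NormalizationC, decomposed, -1,
                                  nfc, 8);       // canonical composition (NFC)
        if (len > 0)
            std::printf("first code unit after NFC: U+%04X\n",
                        (unsigned)nfc[0]);       // expect U+00E9, i.e. é
        return 0;
    }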