From: Pete Becker on 18 Jun 2006 01:31

Alf P. Steinbach wrote:
> As a practical matter consider wide character literals.
>
>   FooW( L"Hello, hello?" );
>
> where FooW is some API function.
>
> If wchar_t isn't what the API level expects, this won't compile.

Why should it compile? The idiomatic way to write this is

  Foo(_T("Hello, hello?"));

where _T is the MS macro that turns the quoted string into the appropriate type for the API. Once you abandon that indirection layer you're locked into a specific choice of type. As a result, every Win32 application that does wide character handling has to use 16-bit wide characters, or use a non-standard library to take advantage of the simpler code for wider characters. There are many places other than OS calls where wide characters can be used, and the OS should be only one factor in choosing an appropriate type.

--
Pete Becker
Roundhouse Consulting, Ltd.
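[For readers unfamiliar with the idiom Pete is describing, here is a minimal sketch. FooA and FooW are hypothetical stand-ins for a Win32-style A/W function pair; the UNICODE/_T plumbing mirrors what the real SDK headers do:]

    // Sketch of the TCHAR idiom. FooA/FooW are hypothetical stand-ins for
    // a Win32-style function pair; _T and the UNICODE switch are real.
    #include <cstdio>
    #include <cwchar>
    #include <tchar.h>   // MS header providing _T (and TCHAR)

    void FooA(const char* s)    { std::printf("narrow: %s\n", s); }
    void FooW(const wchar_t* s) { std::wprintf(L"wide: %ls\n", s); }

    #ifdef UNICODE
        #define Foo FooW  // Unicode build: Foo resolves to the wide variant
    #else
        #define Foo FooA  // ANSI build: Foo resolves to the narrow variant
    #endif

    int main() {
        // _T gives the literal whichever type matches the chosen variant,
        // so this one line compiles in both builds.
        Foo(_T("Hello, hello?"));
        return 0;
    }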
From: Pete Becker on 18 Jun 2006 09:34

Eugene Gershnik wrote:
> Eugene Gershnik wrote:
>
>> Well here is AFAIK correctly formed Unicode sequence that means é:
>>
>> U+0065 U+0301
>>
>> All my editors seem to agree with that.
>
> And unintentionally this was also a good demonstration of how broken
> modern software is with regards to Unicode. My NNTP client (Microsoft
> Outlook Express) had correctly shown the character as é while editing
> but transmitted it as e followed by the bare accent, as you can see
> above. This is despite being a Windows-only application that
> presumably uses UTF-16 wchar_t internally.

That seems like how it ought to work. U+0065 is LATIN SMALL LETTER E, and U+0301 is COMBINING ACUTE ACCENT. They're two distinct characters, which is why they're written that way and transmitted that way. For display, they combine to represent the single glyph that the editor shows. If you want that glyph to be represented by a single character you have to canonicalize the character sequence, and replace U+0065 U+0301 with U+00E9, LATIN SMALL LETTER E WITH ACUTE. That is, there are two different representations for that glyph; one that consists of two code points, and one that consists of one.

--
Pete Becker
Roundhouse Consulting, Ltd.
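[A minimal illustration of the two representations Pete names: they denote the same glyph, but a plain code-unit comparison treats them as different strings, and only canonicalization (e.g. to NFC) would make them compare equal:]

    #include <cwchar>
    #include <cstdio>

    int main() {
        const wchar_t decomposed[]  = L"e\u0301"; // U+0065 U+0301: e + combining acute
        const wchar_t precomposed[] = L"\u00E9";  // U+00E9: precomposed é

        // Same glyph on screen, different strings to the program.
        std::printf("equal as code units? %s\n",
                    std::wcscmp(decomposed, precomposed) == 0 ? "yes" : "no"); // "no"
        return 0;
    }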
From: Kirit Sælensminde on 18 Jun 2006 09:41

Pete Becker wrote:
> Why should it compile? The idiomatic way to write this is
>
>   Foo(_T("Hello, hello?"));
>
> where _T is the MS macro that turns the quoted string into the
> appropriate type for the API. Once you abandon that indirection layer
> you're locked into a specific choice of type. As a result, every Win32
> application that does wide character handling has to use 16-bit wide
> characters, or use a non-standard library to take advantage of the
> simpler code for wider characters. There are many places other than OS
> calls where wide characters can be used, and the OS should be only one
> factor in choosing an appropriate type.

Microsoft has long been pushing this as a way of being able to compile something with or without Unicode, but the problem is that it doesn't quite work. It's fine so long as all of your strings are US ASCII, but what if you wanted to put an Æ ligature in there? It may work or it may not, depending on a whole raft of other things.

To my mind there is absolutely no reason why anybody should be writing new code that takes narrow strings other than where protocol issues force them to (at least on Windows). In those few circumstances it'll force you to confront the encoding issues, and you're much more likely to get them right.

If you're using narrow character sequences in your code then I, as the person who _runs_ the software, get to decide how your string was encoded. Not you, as the person who _wrote_ it. Using wide character strings is the only choice you have if you actually want to control what is in the string. A sketch of the difference follows below.

It's a wonder to me that any of it works at all.

K
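[A small probe of Kirit's point, as a sketch: the bytes a narrow literal produces depend on whatever source and execution character sets the compiler happens to use, while a wide literal pins down the code point. The output varies between toolchains, and that variability is exactly the problem being described:]

    #include <cstdio>

    int main() {
        const char narrow[] = "Æ";        // byte value(s) depend on the execution charset
        const wchar_t wide[] = L"\u00C6"; // LATIN CAPITAL LETTER AE, a fixed code point

        // e.g. one byte 0xC6 under Latin-1/CP1252, two bytes 0xC3 0x86 under UTF-8
        std::printf("narrow literal is %u byte(s):", (unsigned)(sizeof narrow - 1));
        for (unsigned i = 0; i + 1 < sizeof narrow; ++i)
            std::printf(" 0x%02X", (unsigned char)narrow[i]);
        std::printf("\nwide literal code point: U+%04X\n", (unsigned)wide[0]);
        return 0;
    }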
From: Alf P. Steinbach on 19 Jun 2006 07:05

{This thread is drifting too far off topic, follow-ups are likely to be rejected unless they include Standard C++ relevant content. -mod/fwg}

* Pete Becker:
> Alf P. Steinbach wrote:
>> As a practical matter consider wide character literals.
>>
>>   FooW( L"Hello, hello?" );
>>
>> where FooW is some API function.
>>
>> If wchar_t isn't what the API level expects, this won't compile.
>
> Why should it compile? The idiomatic way to write this is
>
>   Foo(_T("Hello, hello?"));

I'm sorry, no. I'm not sure how topical this is, but we're talking about a library (the Windows API) where most functions come in two versions: a char-based one, and a wchar_t-based one. The _T macro prepends an L prefix to the string literal, or not, which is a way of supporting use of only char-based or only wchar_t-based functions. When the function in question, such as FooW, is only available in the wchar_t-based variant, using _T will yield a compilation error when _T is defined as not prepending L to the literal.

[snip]

> Once you abandon that indirection layer you're locked into a specific
> choice of type.

I'm sorry, no. There is no indirection layer (an indirection layer would be what many higher level libraries provide through smart string classes). There is however a C/C++ "choice" layer, if a set of macros might be called a layer, choosing between only wchar_t or only char, which does not work for functions that come in only one variant, and which in modern versions of the OS has no benefits; in particular, the choice layer does not remove the coupling between wchar_t and a particular OS-defined size.

[snip]

> simpler code for wider characters. There are many places other than OS
> calls where wide characters can be used, and the OS should be only one
> factor in choosing an appropriate type.

Yes.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
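[A simplified sketch of the failure mode Alf describes; the real <tchar.h> goes through an extra __T indirection, but the effect is the same, and FooW is again a hypothetical wide-only function:]

    #ifdef _UNICODE
        #define _T(x) L##x   // prepend L: _T("abc") becomes L"abc"
    #else
        #define _T(x) x      // no prefix: _T("abc") stays "abc"
    #endif

    void FooW(const wchar_t*) {}  // hypothetical function with no narrow variant

    int main() {
        FooW(_T("Hello"));  // compiles with _UNICODE defined; without it the
                            // argument is a const char[6] and the call is an error
        return 0;
    }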
From: Eugene Gershnik on 19 Jun 2006 07:12
Pete Becker wrote:
> Eugene Gershnik wrote:
>> Eugene Gershnik wrote:
>>
>>> Well here is AFAIK correctly formed Unicode sequence that means é:
>>>
>>> U+0065 U+0301
>>>
>>> All my editors seem to agree with that.
>>
>> And unintentionally this was also a good demonstration of how broken
>> modern software is with regards to Unicode. My NNTP client
>> (Microsoft Outlook Express) had correctly shown the character as é
>> while editing but transmitted it as e followed by the bare accent,
>> as you can see above. This is despite being a Windows-only
>> application that presumably uses UTF-16 wchar_t internally.
>
> That seems like how it ought to work. U+0065 is LATIN SMALL LETTER E,
> and U+0301 is COMBINING ACUTE ACCENT. They're two distinct characters,
> which is why they're written that way and transmitted that way.

Not at all. Outlook Express informs me that my message was transmitted in what it calls "Western European (ISO)" encoding (presumably ISO 8859-1). How the Unicode sequence in its editor is converted to this encoding is up to the application, but a reasonable user expectation is that what looks like é should be transmitted as é. Instead OE transmitted the sequence as two distinct characters, e and the accent. This is *not* how it is supposed to work. What is supposed to happen is that an application canonicalizes the string prior to doing encoding conversions. Which it obviously didn't.

> For display, they combine to represent the single glyph that the
> editor shows. If you want that glyph to be represented by a single
> character you have to canonicalize the character sequence,

I in this context am the *user* of the application. I type characters in my WYSIWYG editor and press the "Send" button. I am not supposed to know what canonicalization is, much less to do it manually. It is the application which is supposed to do it transparently for me. If it doesn't, it is broken.

--
Eugene
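[A sketch of the canonicalization step Eugene says the application should perform before converting to a legacy encoding: compose U+0065 U+0301 into the single code point U+00E9 (NFC). This uses the Win32 NormalizeString API, which only shipped later, with Windows Vista (link with Normaliz.lib); ICU's normalization functions are the portable alternative:]

    #include <windows.h>
    #include <cstdio>

    int main() {
        const wchar_t decomposed[] = L"e\u0301"; // e + COMBINING ACUTE ACCENT
        wchar_t nfc[8] = {0};

        int len = NormalizeString(NormalizationC, decomposed, -1,
                                  nfc, 8);       // canonical composition (NFC)
        if (len > 0)
            std::printf("first code unit after NFC: U+%04X\n",
                        (unsigned)nfc[0]);       // expect U+00E9, i.e. é
        return 0;
    }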