From: PRMARJORAM on 9 Sep 2009 16:26

Joe, in my journey to uncover the mystery of Unicode I have come across quite a
few of your examples, and they have helped a lot.

But what I'm stating here dates from when I understood less of what I'm trying
to do than I do now.

These are simple extended ASCII codes. I assume that when I convert them to
Unicode using the code page parameter, they will be the correct codes, as you
have suggested, for displaying in my CListCtrl.

What I originally assumed about a webpage with this charset was that two
characters were combined into one Unicode value (2:1). In fact it's still 1:1,
just with a code page parameter.

"Joseph M. Newcomer" wrote:

> CC B3 is not a recognizable encoding. The Russian symbol that displays as "M" is code
> U041C, and it does not encode into CC B3. CCB3 does not decode into anything recognizably
> Unicode, nor does B3CC. For more details and the ability to experiment, I suggest
> downloading my Locale Explorer from my MVP Tips site.
>
> You need to know the encoding. (Note that I tried using Windows-1251 as well).
> joe
>
> On Wed, 9 Sep 2009 07:42:01 -0700, PRMARJORAM <PRMARJORAM(a)discussions.microsoft.com>
> wrote:
>
> >Giovanni, I must have explained the problem pretty well as you pretty much
> >have understood it. Yes the webpage in this particular instance im
> >downloading is as you specified.
> >
> ><meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
> >
> >Ok using a Binary Viewer on the first cyrillic code in the <title> tag is
> >
> >CC B3
> >
> >Which 'should' be a cyrillic capital M?
> >
> >I hope this helps. Thanks again.
> >
> >"Giovanni Dicanio" wrote:
> >
> >> PRMARJORAM wrote:
> >> > My application is compiled in UNICODE. I am downloading webpages using
> >> > cyrillic characters for their content. Although these files themselves are
> >> > ASCII.
> >> [...]
> >> > My problem is my CString containing this content is WCHAR and so I need to
> >> > convert 2 consecutive WCHAR to a single WCHAR to then get the correct
> >> > cyrillic code to display.
> >>
> >> I think that what I previously wrote may not be the right answer to your
> >> question.
> >>
> >> Could it be possible for you to clarify a little better the format of
> >> the input string?
> >>
> >> For example, in the Cyrillic code page 1251 I read here:
> >>
> >> http://www.fingertipsoft.com/ref/cyrillic/cp1251.html
> >>
> >> there is a character like an upper-case "K" (code: 202 dec, 0xCA hex).
> >>
> >> How is this character stored in your input string?
> >> What are the values of the two WCHAR's that you want to convert to one
> >> single WCHAR, in this particular case?
> >>
> >> Thanks,
> >> Giovanni
> >>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Giovanni Dicanio on 9 Sep 2009 17:31

PRMARJORAM wrote:

> Plus when you compile your app to UNICODE all your CStrings change to WCHAR
> and you call Wide versions of everything.

I don't know which version of VC++ you are using. If you are using VC++ >= 7.1
(e.g. VC++ 7.1 in VS.NET 2003, VC8 in VS2005, VC9 in VS2008...), then you can
have both CStringA (CHAR-based) and CStringW (WCHAR-based) in the same project.

Moreover, if you need to store an ANSI/MBCS string in a robust C++ class and
you use VC6 (so in a Unicode app you only have CString based on WCHAR), you
could use the STL class std::string. In fact, std::string stores chars in both
ANSI/MBCS and Unicode builds.

In particular, considering your problem, if the web pages that you get use an
ANSI/MBCS encoding (not Unicode), then I would suggest that you use std::string
or CStringA (instead of a WCHAR-based CString) to store them.

You can then call MultiByteToWideChar (or use the CA2WEX class) to convert from
the specific code page to Unicode, store the resulting Unicode string in a
CString (or an explicit CStringW) in your Unicode app, and show the Unicode
strings in list views or wherever you want.

Giovanni
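To make the 1:1 nature of that conversion concrete, here is a minimal portable sketch of what a Windows-1251 to UTF-16 conversion does to each byte. It is not the Win32 route recommended above (MultiByteToWideChar with code page 1251, or CA2WEX); it is a hand-written stand-in so that it can be compiled anywhere. The function name DecodeCp1251Sketch is hypothetical, and the mapping is deliberately partial: only ASCII, the contiguous block 0xC0..0xFF, and four extra letters are covered. Every other byte is replaced with U+FFFD.

```cpp
#include <string>

// Decode Windows-1251 bytes into UTF-16 code units (one unit per byte).
// Assumption: this is an illustrative, PARTIAL table, not a full code page.
std::u16string DecodeCp1251Sketch(const std::string& bytes)
{
    std::u16string out;
    out.reserve(bytes.size());  // CP1251 is single-byte: 1 byte -> 1 code unit

    for (unsigned char b : bytes) {
        char16_t cu;
        if (b < 0x80) {
            cu = b;                                        // plain ASCII
        } else if (b >= 0xC0) {
            // 0xC0..0xFF map linearly onto U+0410..U+044F
            // (the basic Russian alphabet, capitals then small letters)
            cu = static_cast<char16_t>(0x0410 + (b - 0xC0));
        } else if (b == 0xA8) {
            cu = 0x0401;  // CYRILLIC CAPITAL LETTER IO
        } else if (b == 0xB8) {
            cu = 0x0451;  // CYRILLIC SMALL LETTER IO
        } else if (b == 0xB2) {
            cu = 0x0406;  // CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
        } else if (b == 0xB3) {
            cu = 0x0456;  // CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
        } else {
            cu = 0xFFFD;  // byte not covered by this sketch
        }
        out.push_back(cu);
    }
    return out;
}
```

With this table the byte 0xCA from the code page chart comes out as U+041A, and each input byte always yields exactly one UTF-16 code unit. In real code the Win32 call is the better choice, since it knows the whole code page.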
From: Giovanni Dicanio on 9 Sep 2009 17:43

PRMARJORAM wrote:

> Giovanni, I must have explained the problem pretty well as you pretty much
> have understood it. Yes the webpage in this particular instance im
> downloading is as you specified.
>
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

This text explicitly states that the code page is Windows-1251, so it is an
ANSI/MBCS string. I think that you should store this string in a CStringA or
in a std::string (i.e. in a string class based on chars, not on WCHARs).

Then you can use MultiByteToWideChar or CA2WEX to convert this ANSI/MBCS
string to a Unicode string, and store the result in a CStringW or std::wstring
(or just in a CString if you use a Unicode build, where CStrings are based on
WCHARs).

That is, the original memory layout of your string should be something like
this (bytes expressed in hex):

   <meta ...

   3C   6D   65   74   61  ...
   '<'  'm'  'e'  't'  'a' ...

It makes sense to store this in a std::string or CStringA, but *not* in a
CStringW.

If instead the memory layout of your text is something like this:

   3C 00  6D 00  65 00  74 00  61 00  ...
   L'<'   L'm'   L'e'   L't'   L'a'   ...

then it might make sense to store this in a CStringW. However, this is kind of
a "lie", a false statement, because you are using a Unicode string while the
'charset' attribute is set to 'windows-1251'.

In this "strange" case, I would strip the 00 bytes from the input string to
convert it to the first form, i.e.

   3C 6D 65 74 61 ...

store it in a std::string or CStringA, and then call MultiByteToWideChar or
CA2WEX with the Windows-1251 code page identifier to get the proper Unicode
UTF-16 string.

HTH,
Giovanni
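The "strip the 00 bytes" step for the strange case could look roughly like the sketch below. It assumes the downloaded bytes were widened one byte per wide character, so every element of the wide string holds a value in 0x00..0xFF. The function name NarrowWidenedBytes is hypothetical, and std::wstring stands in for CStringW so that the example does not depend on MFC. If any element is above 0xFF, the input was not a byte-per-WCHAR widening, and the function throws rather than silently losing data.

```cpp
#include <stdexcept>
#include <string>

// Undo a byte-per-wide-character widening: 3C 00 6D 00 ... -> 3C 6D ...
// Throws std::invalid_argument if an element cannot be a widened byte.
std::string NarrowWidenedBytes(const std::wstring& widened)
{
    std::string bytes;
    bytes.reserve(widened.size());

    for (wchar_t unit : widened) {
        // wchar_t may be signed on some platforms, so compare as unsigned
        const unsigned long value = static_cast<unsigned long>(unit);
        if (value > 0xFF) {
            throw std::invalid_argument("element is not a widened byte");
        }
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    return bytes;
}
```

The resulting std::string holds the original Windows-1251 bytes and can then go through the code page conversion described above.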
From: Joseph M. Newcomer on 9 Sep 2009 21:05

I thought of that, but the problem is that there are three ways to look at the
sequence CCB3 (or B3CC):

As two 8-bit characters: Ì³ (that's capital I with grave accent followed by a
superscript 3)
As a UTF-8 encoding: It doesn't decode into anything sensible
As a Unicode character: Neither U+CCB3 nor U+B3CC is a Cyrillic letter (both
are Hangul syllables).

But I agree: it has to be stored as a CStringA or other 8-bit representation.

So the question is, what could this encoding mean? I tried all kinds of
encodings in the Locale Explorer, and nothing worked out.
joe

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
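The UTF-8 and UTF-16 readings of the two bytes can be checked mechanically. The sketch below is only an illustration, with hypothetical helper names (DecodeUtf8Pair, AsUtf16BigEndian). It decodes a two-byte UTF-8 sequence by hand and glues two bytes into one 16-bit value. For CC B3 the UTF-8 reading gives U+0333 (COMBINING DOUBLE LOW LINE), which is well-formed but not a letter. B3 CC is not valid UTF-8 at all, because B3 is a continuation byte and cannot start a sequence.

```cpp
#include <cstdint>

// Decode a two-byte UTF-8 sequence. Returns the code point, or -1 if the
// pair is not a well-formed two-byte sequence (wrong lead byte, wrong
// continuation byte, or an overlong encoding of a value below 0x80).
long DecodeUtf8Pair(std::uint8_t lead, std::uint8_t cont)
{
    if ((lead & 0xE0) != 0xC0) return -1;  // lead must look like 110xxxxx
    if ((cont & 0xC0) != 0x80) return -1;  // continuation must be 10xxxxxx
    const long cp = ((lead & 0x1F) << 6) | (cont & 0x3F);
    return cp >= 0x80 ? cp : -1;           // reject overlong forms
}

// Read two bytes as a single UTF-16 code unit, first byte most significant.
std::uint16_t AsUtf16BigEndian(std::uint8_t first, std::uint8_t second)
{
    return static_cast<std::uint16_t>((first << 8) | second);
}
```

Calling AsUtf16BigEndian(0xCC, 0xB3) and AsUtf16BigEndian(0xB3, 0xCC) gives 0xCCB3 and 0xB3CC, the two 16-bit readings discussed above. Both lie between U+AC00 and U+D7A3, the Hangul Syllables block, so neither is a plausible reading of a Cyrillic page.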
From: Alexander Grigoriev on 9 Sep 2009 22:20

Well, CC is indeed Cyrillic M in CP1251, though B3 maps to Ukrainian 'i'.