From: Mihai N. on 10 Sep 2009 00:32

> You would like to have a CString with Unicode UTF-16 representation of
> your Cyrillic characters.

No.
Most likely he has some junk, because the characters are in some Cyrillic code page (cp1251 or KOI8-R) and were converted to UTF-16 as if they were cp1252.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
From: Mihai N. on 10 Sep 2009 00:34

> My application is compiled in UNICODE. I am downloading webpages using
> cyrillic characters for their content. Although these files themselves are
> ASCII.

Then the raw content does not belong in a CString.
- download the stuff into a char buffer
- detect the encoding (from the HTTP header or the meta tag in the buffer)
- convert to Unicode using MultiByteToWideChar (and store the result in a CString)
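
The three steps above can be sketched in C++ roughly as follows. This is an illustration only, not code from the thread: `charsetFromMeta`, `codePageFor` and `toUtf16` are hypothetical helper names, the meta-tag scan is a naive substring search rather than a real HTML parser, and the name-to-code-page table covers only a handful of charsets. The conversion function is guarded with `_WIN32` because MultiByteToWideChar exists only on Windows. When the server sends a charset in the HTTP Content-Type header, that should be checked before the meta tag.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: pull the charset name out of a raw page buffer by
// searching for "charset=" (naive scan, not a real HTML parser).
// Returns the name in lower case, or an empty string if none is found.
std::string charsetFromMeta(const std::string& page)
{
    std::string lower(page);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos)
        return std::string();
    pos += key.size();
    // Skip an optional opening quote, as in <meta charset="...">.
    if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\''))
        ++pos;
    std::string::size_type end = pos;
    while (end < lower.size() &&
           (std::isalnum(static_cast<unsigned char>(lower[end])) ||
            lower[end] == '-' || lower[end] == '_'))
        ++end;
    return lower.substr(pos, end - pos);
}

// Hypothetical helper: map a few charset names to Windows code page
// identifiers. Returns 0 for names this small table does not know.
unsigned codePageFor(const std::string& charset)
{
    if (charset == "windows-1251") return 1251;   // Cyrillic
    if (charset == "koi8-r")       return 20866;  // Cyrillic (KOI8-R)
    if (charset == "windows-1256") return 1256;   // Arabic
    if (charset == "windows-1252") return 1252;   // Western European
    if (charset == "utf-8")        return 65001;  // CP_UTF8
    return 0;
}

#ifdef _WIN32
// Windows only: convert the raw bytes to UTF-16 with MultiByteToWideChar.
// Returns an empty string on failure or for empty input.
std::wstring toUtf16(const std::string& bytes, unsigned codePage)
{
    if (bytes.empty())
        return std::wstring();
    const int srcLen = static_cast<int>(bytes.size());
    // First call: ask how many wide characters the result needs.
    const int wideLen = MultiByteToWideChar(codePage, 0, bytes.data(), srcLen, NULL, 0);
    if (wideLen <= 0)
        return std::wstring();
    std::wstring wide(static_cast<std::wstring::size_type>(wideLen), L'\0');
    // Second call: perform the conversion into the allocated buffer.
    MultiByteToWideChar(codePage, 0, bytes.data(), srcLen, &wide[0], wideLen);
    return wide;
}
#endif
```

In a Unicode MFC build, the resulting `std::wstring` could then be handed to a CString, for example `CString text(wide.c_str());`. A code page of 0 from `codePageFor` means the charset was not recognised and should be handled separately rather than passed to the conversion.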
From: Mihai N. on 10 Sep 2009 00:37

> CC B3
>
> Which 'should' be a cyrillic capital M?

CC is Cyrillic capital M in cp1251
B3 is Cyrillic lowercase i in cp1251

You have junk in your CString.
See my previous post.
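
To make the difference between the two interpretations concrete, here is a small portable sketch that decodes a single byte under each code page. `decodeCp1251` and `decodeCp1252` are hypothetical helpers, not a library API, and each covers only part of its code page: ASCII, the range 0xC0-0xFF and a few Ukrainian letters for cp1251, and ASCII plus 0xA0-0xFF for cp1252. Bytes outside those ranges return 0.

```cpp
#include <cstdint>

// Partial cp1251 decoder (hypothetical helper): returns the Unicode code
// point for one byte, or 0 for bytes this sketch does not cover.
char32_t decodeCp1251(std::uint8_t b)
{
    if (b < 0x80) return b;                     // ASCII
    if (b >= 0xC0) return 0x0410 + (b - 0xC0);  // 0xC0-0xFF map to U+0410-U+044F
    switch (b) {
        case 0xA8: return 0x0401;  // CYRILLIC CAPITAL LETTER IO
        case 0xB8: return 0x0451;  // CYRILLIC SMALL LETTER IO
        case 0xAA: return 0x0404;  // CYRILLIC CAPITAL LETTER UKRAINIAN IE
        case 0xBA: return 0x0454;  // CYRILLIC SMALL LETTER UKRAINIAN IE
        case 0xAF: return 0x0407;  // CYRILLIC CAPITAL LETTER YI
        case 0xBF: return 0x0457;  // CYRILLIC SMALL LETTER YI
        case 0xB2: return 0x0406;  // CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
        case 0xB3: return 0x0456;  // CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
        default:   return 0;       // not covered by this sketch
    }
}

// Partial cp1252 decoder (hypothetical helper): ASCII and 0xA0-0xFF have
// the same values as their Unicode code points; 0x80-0x9F is left out.
char32_t decodeCp1252(std::uint8_t b)
{
    if (b < 0x80 || b >= 0xA0) return b;
    return 0;
}
```

Under these mappings the bytes CC B3 give U+041C U+0456 when read as cp1251 (Cyrillic capital M followed by the Ukrainian i), but U+00CC U+00B3 when read as cp1252 (capital I with grave accent followed by superscript 3). The second pair is the kind of junk that ends up in a wide string when the wrong code page is assumed.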
From: PRMARJORAM on 10 Sep 2009 04:01

It is a Ukrainian webpage.

Thanks everyone for your input, I've got a lot to work on now. I will try all this out. I'm hoping to have it working sometime today.

Again, in a nutshell: I'm downloading webpages from foreign websites that do not necessarily use our charset, and I need to display a subset of the textual content within a CListCtrl. I understand I also need to use specific fonts to achieve this once I have the correct string representation.

After the Cyrillic it will also need to work for other charsets such as Arabic etc.

Thanks again. I shall post my results.

"Alexander Grigoriev" wrote:

> Well, CC is indeed cyrillic M in CP1251, though B3 maps to ukrainian 'i'
>
> "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
> news:uvjga5p7jm31771h0o7n4v7rvbomrh27mr(a)4ax.com...
> >I thought of that, but the problem is that there are three ways to look at
> >the sequence
> > CCB3 (or B3CC)
> >
> > As two 8-bit characters: Ì³ (that's capital I with grave accent followed
> > by a superscript 3)
> > As a UTF-8 encoding: It doesn't decode into anything sensible
> > As a Unicode character: Neither UCCB3 nor UB3CC are valid characters.
> >
> > But I agree: it has to be stored as a CStringA or other 8-bit
> > representation.
> >
> > So the question is, what could this encoding mean. I tried all kinds of
> > encoding in the Locale Explorer, and nothing worked out.
> > joe
> >
> > On Wed, 09 Sep 2009 23:43:54 +0200, Giovanni Dicanio
> > <giovanniDOTdicanio(a)REMOVEMEgmail.com> wrote:
> >
> >>PRMARJORAM ha scritto:
> >>
> >>> Giovanni, I must have explained the problem pretty well as you pretty
> >>> much have understood it. Yes the webpage in this particular instance im
> >>> downloading is as you specified.
> >>>
> >>> <meta http-equiv="Content-Type" content="text/html;
> >>> charset=windows-1251">
> >>
> >>This text is explicitly stating that the code page is a Windows-1251, so
> >>it is an ANSI/MBCS string. I think that you should store this string in
> >>a CStringA, or in a std::string (i.e. in a string class based on char's,
> >>not on WCHAR's).
> >>
> >>Then you can use MultiByteToWideChar or CA2WEX to convert from this
> >>ANSI/MBCS string to Unicode string, and store the resulting Unicode
> >>string in a CStringW or std::wstring class (or just in a CString class
> >>if you use Unicode build, where CString's are based on WCHAR's).
> >>
> >>i.e. the original memory layout of your string should be something like
> >>this (bytes expressed in hex):
> >>
> >>   <meta ...
> >>
> >>   3C  6D  65  74  61  ...
> >>   '<' 'm' 'e' 't' 'a' ...
> >>
> >>It makes sense to store this in a std::string or CStringA, but *not* in
> >>a CStringW.
> >>
> >>Instead, if the memory layout of your text is something like this:
> >>
> >>   3C 00  6D 00  65 00  74 00  61 00  ...
> >>   L'<'   L'm'   L'e'   L't'   L'a'   ...
> >>
> >>then it might make sense to store this in a CStringW.
> >>However, this is kind of a "lie", a false statement, because you are
> >>using a Unicode string, but the 'charset' attribute is set to
> >>'windows-1251'.
> >>In this "strange" case, I would strip the 00 bytes from the input
> >>string, and convert it in the first form, i.e.
> >>
> >>   3C 6D 65 74 61 ...
> >>
> >>store it in a std::string or CStringA, and then call MultiByteToWideChar
> >>or CA2WEX using Windows-1251 code page identifier to get the proper
> >>Unicode UTF-16 string.
> >>
> >>HTH,
> >>Giovanni
> >>
> > Joseph M. Newcomer [MVP]
> > email: newcomer(a)flounder.com
> > Web: http://www.flounder.com
> > MVP Tips: http://www.flounder.com/mvp_tips.htm
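
The "strip the 00 bytes" step from Giovanni's quoted reply can be sketched as below. `narrowZeroExtended` is a hypothetical helper name, and `std::u16string` stands in for the wide string so that the sketch stays portable. It assumes every UTF-16 code unit in the input is a byte that was merely zero-extended, and it reports failure as soon as a code unit above 0xFF shows up, since that means the text was not produced that way.

```cpp
#include <string>

// Hypothetical helper: undo a byte-by-byte widening (3C 00 6D 00 ... back
// to 3C 6D ...). Returns false, leaving 'bytes' incomplete, if any code
// unit does not fit in one byte.
bool narrowZeroExtended(const std::u16string& wide, std::string& bytes)
{
    bytes.clear();
    bytes.reserve(wide.size());
    for (char16_t unit : wide) {
        if (unit > 0xFF)
            return false;  // not a zero-extended 8-bit string
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
    }
    return true;
}
```

The recovered bytes would then go through MultiByteToWideChar or CA2WEX with code page 1251, as the reply describes. One caveat: if the text was widened by a real cp1252 conversion rather than by plain zero-extension, bytes in the range 0x80-0x9F become code points above 0xFF (0x80 becomes U+20AC, for example), and this helper rejects them. Keeping the original downloaded bytes around is the safer route.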
From: PRMARJORAM on 10 Sep 2009 04:03

It's not junk. It's exactly as you say.

"Mihai N." wrote:

> > CC B3
> >
> > Which 'should' be a cyrillic capital M?
>
> CC is Cyrillic capital M in cp1251
> B3 is Cyrillic lowercase i in cp1251
>
> You have junk in your CString.
> See my previous post.