CString::Replace corrupts unicode strings [Visual C]

From: dududuil on 28 Apr 2010 08:56

My application reads text from a file and puts it in a CString.
On a JP (Japanese) machine, the file might contain Unicode characters.

Although my application isn't compiled with _UNICODE, CString still supports
Unicode characters, and I can do the simple task of replacing a certain string.

After this Replace, I write the string back to the file and notice that some
characters have been replaced with other characters.

This is my problem.

From: Jochen Kalmbach [MVP] on 28 Apr 2010 09:04

Hi dududuil!

> Although my application isn't compiled with _UNICODE, CString still supports
> Unicode characters,

No, it does not! It only supports Unicode if you compile with _UNICODE!
With the current settings it is compiled as ANSI or MBCS, so it does
not know anything about Unicode.

> and I can do the simple task of replacing a certain string.

No, you can't, because it will replace the "characters" on a per-byte
basis, which has the effect you see (it corrupts the string).
This is because CString does not know the encoding of your string,
and it assumes ASCII if you have not set the thread locale!
If you set the thread locale, it will replace the characters correctly!
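
To illustrate the per-byte effect, here is a minimal standalone sketch. It does
not use CString; it uses std::string and assumes the text is Shift-JIS, where
some two-byte characters have a second byte that equals an ASCII character (for
example 0x95 0x5C, whose second byte is the backslash). The function names are
made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Naive replacement that treats every byte as one character.
std::string ReplaceByteWise(std::string text, char from, char to) {
    for (char& c : text) {
        if (c == from) c = to;
    }
    return text;
}

// True if b is a Shift-JIS lead byte (first byte of a two-byte character).
bool IsSjisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Replacement that steps over each two-byte character as a unit, so a
// trail byte is never compared against `from`.
std::string ReplaceSjisAware(std::string text, char from, char to) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (IsSjisLeadByte(static_cast<unsigned char>(text[i]))) {
            ++i;  // skip the trail byte; it is not a character of its own
        } else if (text[i] == from) {
            text[i] = to;
        }
    }
    return text;
}
```

With the input bytes 0x95 0x5C followed by a real backslash, ReplaceByteWise
changes both 0x5C bytes and so damages the two-byte character, while
ReplaceSjisAware changes only the real backslash.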

--
Greetings
Jochen

My blog about Win32 and .NET
http://blog.kalmbachnet.de/

From: Tom Serface on 28 Apr 2010 09:43

Ah... then perhaps the problem is that the string you are modifying really
is a Unicode string (which CString can hold even in a non-Unicode build).
Are you reading the string from a file that might actually be Unicode or
UTF-8 or some other encoding? If so you may still have to use CStringW and
still use the L'\r' method for the character.
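
A small sketch of why the wide character matters, assuming the file is
UTF-16 little-endian. It uses std::u16string as a portable stand-in for
CStringW, and the helper names are invented for the example. The Japanese
closing bracket U+300D is stored as the bytes 0x0D 0x30, so a byte-wise
removal of '\r' (0x0D) eats half of it.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Removes every element equal to `unit` from a string of any character width.
template <typename String>
String RemoveUnit(String text, typename String::value_type unit) {
    text.erase(std::remove(text.begin(), text.end(), unit), text.end());
    return text;
}

// Lays out UTF-16 code units as little-endian bytes, the way they would
// appear if a UTF-16LE file were read into a narrow (char) buffer.
std::string ToUtf16LeBytes(const std::u16string& text) {
    std::string bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // high byte
    }
    return bytes;
}
```

For the text U+300D, '\r', '\n', the narrow buffer holds six bytes and
RemoveUnit(bytes, '\r') drops two of them, one of which belongs to U+300D.
RemoveUnit(wide, u'\r') on the std::u16string removes only the real carriage
return.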

Tom

"dududuil" <dududuil(a)discussions.microsoft.com> wrote in message
news:46371EC6-C173-4870-B38D-FDBFFD270CE8(a)microsoft.com...
> My Application is compiled without _UNICODE - so _T() is ignored, and
> L'\r'
> will shrink back to '\r' when calling the Remove

From: Ulrich Eckhardt on 28 Apr 2010 10:03

Jochen Kalmbach [MVP] wrote:
> Maybe you are using UTF-8 in the string but the CRT/MFC locale is "C". So
> the UTF-8 multibyte characters will also be treated as "normal" chars, and
> therefore it will remove any '\r' (0x0d) inside a multibyte character.

The nice thing about UTF-8 is that no ASCII byte will ever have a different
meaning than the one it has for ASCII. All bytes of a multibyte character
have their bit 7 set, so they are outside the ASCII range.

For that reason I think we can rule out UTF-8, otherwise it should work. ;)
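
The bit 7 property can be checked with a short sketch. The encoder below is a
plain hand-written one for illustration (it does not reject surrogates or
values above U+10FFFF), and both function names are invented for the example.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-8 (no validation of the input).
std::string EncodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));  // ASCII: one byte, bit 7 clear
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// True if every byte has bit 7 set, i.e. none can be mistaken for ASCII.
bool AllBytesHaveBit7(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if ((b & 0x80) == 0) return false;
    }
    return true;
}
```

U+300D encodes as 0xE3 0x80 0x8D, so AllBytesHaveBit7 returns true for it and
a search for the byte 0x0D cannot match inside that character.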

Uli

--
C++ FAQ: http://parashift.com/c++-faq-lite

Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

From: Tom Serface on 28 Apr 2010 13:45

I wish that CString had better built-in handling for in-memory UTF-8. As it
is, I typically only use UTF-8 for files and convert to Unicode (UTF-16) for
in-memory use. It takes more memory, but makes it much easier to interact with
other SDKs.
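
A rough sketch of that conversion step. On Windows this is usually done with
MultiByteToWideChar and CP_UTF8; the hand-written decoder below is only a
portable illustration, the name Utf8ToUtf16 is made up, and it does not reject
overlong encodings or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 bytes (as read from a file) into UTF-16 code units.
std::u16string Utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Once the text is in UTF-16, removing u'\r' or replacing a substring works on
whole code units, so a stray 0x0D byte inside another character is no longer
an issue.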

Tom

