From: RB on 23 Feb 2010 11:09

Hello, Joe was nice enough to educate me about a gap in my awareness of Unicode ramifications. I am in the process of trying to install the safer strsafe libs and includes, but the more pertinent problem is that I am still struggling to get underneath all aspects of this.

I remember reading years ago (when computers usually had less than 20 MB of RAM) that a machine word was the width of the computer's registers, usually matching the width of memory at a single address, i.e. an address would hold, say, 2 bytes in real mode or 4 bytes in protected mode. An int could be defined differently on machines of different widths, depending on how the compiler translated the int declaration down into the machine language of the word, and compilers had to be aware of which platform they were compiling for.

As for old byte-sized chars and the newer 2- and 4-byte Unicode chars, the scenario deepens for me. It would appear that the char/Unicode thing is not machine specific but rather OS and/or language dependent. But I am mainly concerned at this point with when and how this would affect my code running on a Windows OS.

If I do not code for Unicode and I copy a string into a font structure on Windows, I have been made aware that this is dangerous, but I still have trouble understanding exactly what is going on. Is it that in newer Windows OS structures (like fonts) Windows has coded them in wide format so they can accept non-Unicode and Unicode as needed? And this affects my string copy to said struct... but I still cannot see exactly what is going on (or why). In other words, if I am not coding for a Unicode character language, why must I still be concerned about Unicode?

So I have the following questions:

1. For the following code (said to be unsafe)

strcpy(NewFontLogStruct.lfFaceName, "Courier New");

is the following (groping, self-created hack) any safer?

char holder[sizeof(NewFontLogStruct.lfFaceName)] = "Courier New";
for (int i = 0; i < sizeof(NewFontLogStruct.lfFaceName); i++)
    NewFontLogStruct.lfFaceName[i] = holder[i];

2. Could someone direct me to a bibliography that would illuminate a dummy like me on the ramifications more clearly, so I can fully understand the ins and outs of this? (Or feel free to try and explain it in brief if possible.) I.e., Joe has told me to

>Never use 8-bit characters or assume they exist, except in
>exceedingly rare and exotic circumstances, of which this is most
>definitely not an example.

But if I look at a character in a hex editor on my machine (from a text file) it is only one byte in size, so obviously Joe (fantastic helpful guy that he is) is talking over my grasp of the situation. Hopefully I can learn this eventually and bring myself out of Unicode darkness.
From: Giovanni Dicanio on 23 Feb 2010 11:27

"RB" <NoMail(a)NoSpam> wrote in message news:OWzRtKKtKHA.4636(a)TK2MSFTNGP06.phx.gbl...

> Is it that in newer windows OS structures (like fonts) that
> windows has coded them in wide format size so they can accept nonUnicode and
> Unicode as needed ? And this affects my string copy to said struct.... but I still
> cannot see exactly what is going on (or why). In other words if I am not coding
> for Unicode character language why still must I be concerned about unicode.

There are two "LOGFONT" definitions: LOGFONTA (using CHAR lfFaceName[...]) and LOGFONTW (using WCHAR lfFaceName[...]).

LOGFONTA uses the old-style ANSI/MBCS chars; LOGFONTW uses Unicode (UTF-16) wchar_t's.

If you are building in Unicode mode and the UNICODE preprocessor macro is defined, then LOGFONT is typedef'ed as LOGFONTW. Instead, if you are building in ANSI/MBCS (the UNICODE preprocessor macro is not defined), then LOGFONT is typedef'ed as LOGFONTA.

You can read all of that in the <wingdi.h> Win32 header file.

> so I have the following questions:
> 1. For the following code (said to be unsafe)
> strcpy(NewFontLogStruct.lfFaceName, "Courier New");

If you use VS2005 and above, the above line can be made secure in C++ source code, because when the destination is a fixed-size array strcpy can be expanded to a proper form of strcpy_s thanks to C++ template magic (when the secure CRT template overloads are enabled).

> 2. Could someone direct me to a bibliography that would illuminate a dummy like
> me on the ramifications more clearly so I can fully understand the ins and outs of
> this ? (or feel free to try and explain it in brief if possible)

About Unicode, there is an interesting article here:

"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
http://www.joelonsoftware.com/articles/Unicode.html

Mihai Nita's blog is a must-read as well:
http://www.mihai-nita.net/

A couple of posts on "The Old New Thing" blog by Raymond Chen:
http://blogs.msdn.com/oldnewthing/archive/2004/02/12/71851.aspx
http://blogs.msdn.com/oldnewthing/archive/2004/07/15/184076.aspx

And you can't miss these articles on CodeProject written by Mike Dunn:

The Complete Guide to C++ Strings, Part I - Win32 Character Encodings
http://www.codeproject.com/KB/string/cppstringguide1.aspx

The Complete Guide to C++ Strings, Part II - String Wrapper Classes
http://www.codeproject.com/KB/string/cppstringguide2.aspx

HTH,
Giovanni
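A minimal sketch of what Giovanni describes, putting the two points together. The helper name, the -12 height, and the use of _tcscpy_s/_countof are illustrative assumptions (VS2005 or later), not code from the thread:

#include <windows.h>
#include <tchar.h>

// Create a "Courier New" font; this compiles against LOGFONTA/CreateFontIndirectA
// in an ANSI/MBCS build and LOGFONTW/CreateFontIndirectW in a Unicode build.
HFONT CreateCourierFont()
{
    LOGFONT lf = { 0 };            // LOGFONTA or LOGFONTW, per the UNICODE macro
    lf.lfHeight = -12;             // illustrative size; any valid value will do
    _tcscpy_s(lf.lfFaceName, _countof(lf.lfFaceName), _T("Courier New"));
    return ::CreateFontIndirect(&lf);
}

The copy is bounds-checked against the size of lfFaceName, and the same source builds either way, which is the point Giovanni (and Joe, below) is making.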
From: Joseph M. Newcomer on 23 Feb 2010 13:31

On Tue, 23 Feb 2010 11:09:32 -0500, "RB" <NoMail(a)NoSpam> wrote:

> Hello, Joe was nice enough to educate me of a void in my awareness of
>unicode ramifications. I am in the process of trying to install the safer
>strsafe libs and includes. But more pertinent to me is the problem is I am still
>struggling to get underneath all aspects of this.
> I remember reading years ago (when computers usually had less than
>20mb of ram) that a machine word was the width of the computer registers,
****
Generally, this has been the accepted definition. Linguistically, it was always a colossal failure in C that it tied its concept of "int" to the machine registers, which resulted in several disasters when we moved from 16-bit to 32-bit; when Microsoft moved to 64-bit, they retained int as 32-bit instead of making it 64-bit, which was a great idea.
****
>usually matching the width of memory at a single address. I.e. address whatever
>would be say 2 bytes long in real mode or 4 bytes long in protected mode.
****
Well, it wasn't that simple, but it will do for now.
****
>And
>that ints could be defined different on different machine widths depending on how
>the compiler translated the int declaration down into the machine language of the
>word. And compilers had to be aware of which platform they were compiling for.
> As for old byte sized chars and newer 2 and 4 byte unicode chars the scenario
>deepens for me. It would appear that the char/unicode thing is not machine specific
>but rather OS and/or Language dependent.
****
No. The definition of Unicode is independent of all platforms, languages, and operating systems. What matters is the "encoding". For example, Windows, as an operating system, only accepts the UTF-16LE encoding of Unicode, which means that Unicode characters that require more than 16 bits to identify them require two UTF-16LE code units (this is the "surrogate" encoding). And the Microsoft C compiler defines the ANSI/ISO "wchar_t" type as a 16-bit value. So you are constrained, in using these environments, to using a specific encoding of Unicode.
*****
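To make the "surrogate" point concrete, here is a minimal standalone sketch (not from the post): a code point outside the Basic Multilingual Plane takes two UTF-16 code units, so it occupies two wchar_t elements on Windows even though it is a single character.

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    // 'A' (U+0041) fits in one UTF-16 unit; the musical G clef (U+1D11E) does
    // not, so it is stored as a surrogate pair -- two wchar_t units.
    const wchar_t a[]    = L"A";
    const wchar_t clef[] = L"\xD834\xDD1E";   // surrogate pair encoding U+1D11E

    printf("%u %u\n", (unsigned)wcslen(a), (unsigned)wcslen(clef));   // prints: 1 2
    return 0;
}

This is why "number of wchar_t elements" and "number of characters" are not always the same thing, even in a Unicode build.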
>But I mainly am concerned at this point
>as to when and how this would affect my code running on a windows OS.
> If I do not code for Unicode and I copy a string into a font structure on windows
>I have been made aware that this is dangerous. But I still have trouble understanding
>exactly what is going on.
****
There are two symbols which are either both defined or both undefined. One of them is _UNICODE, and the other is UNICODE. One controls the C runtime, one controls the Windows runtime. It doesn't matter which is which, because if you define only one of them and not the other, your program probably won't compile.

Any API that involves a string does not exist under its generic name. For example, there is no CreateFile API. Instead, there are *two* unique API entry points to the kernel:

CreateFileA: takes an 8-bit character string name
CreateFileW: takes a 16-bit UTF-16LE-encoded character string name

In the case of CreateFont, there are not only two entry points, CreateFontA and CreateFontW, but two data structures: LOGFONTA and LOGFONTW. When you compile with Unicode disabled (neither UNICODE nor _UNICODE defined), then your sequence

LOGFONT lf;
...
font.CreateFontIndirect(&lf);

where the method is implemented as

BOOL CFont::CreateFontIndirect(LOGFONT * lf)
   { return ::CreateFontIndirect(lf); }

looks to the compiler like

LOGFONTA lf;
font.CreateFontIndirect(&lf);

with the method actually defined (as far as the compiler sees) as

BOOL CFont::CreateFontIndirect(LOGFONTA * lf)
   { return ::CreateFontIndirectA(lf); }

But if you compile with UNICODE/_UNICODE defined, you get

LOGFONTW lf;
...
font.CreateFontIndirect(&lf);

where it is defined as

BOOL CFont::CreateFontIndirect(LOGFONTW * lf)
   { return ::CreateFontIndirectW(lf); }

Note that all Windows does when you call a -A entry point is convert the strings to Unicode and effectively call the -W entry point.
*****
>Is it that in newer windows OS structures (like fonts) that
>windows has coded them in wide format size so they can accept nonUnicode and
>Unicode as needed ?
****
The fonts may or may not actually have Unicode characters in them. Many do, but it is up to the font designer to have decided which characters to include. The "Arial Unicode MS" font really does have most of the Unicode characters in it, but you can't get at them unless your app is Unicode, or you explicitly call the -W functions and pass in a wide character string.
****
>And this affects my string copy to said struct.... but I still
>cannot see exactly what is going on (or why). In other words if I am not coding
>for Unicode character language why still must I be concerned about unicode.
****
Go back to my original question: when your manager walks in and says "We need the app in Unicode", what are you going to answer? I've been coding Unicode-aware since about 1996, and several of my apps were converted to Unicode by simple recompilation with UNICODE/_UNICODE defined, and worked perfectly the first time. In a few cases, I had not bothered with making everything Unicode-compliant (hard to do in VS6), but the three or six lines that required work failed to compile, and the fixes were essentially trivial. In VS > 6 they truly are trivial, because VS > 6 MFC supports two string types, CStringA (which is always a CString of 8-bit characters) and CStringW (which is always a CString of Unicode UTF-16LE characters), making it truly trivial to support mixed modes (necessary when dealing with embedded systems, some network protocols, etc.).

If you always code Unicode-aware, then when you have to create a Unicode app, you already have all the good programming habits, styles, etc. to make it work. And since all of your coding at that point is Unicode-aware, you can convert it INSTANTLY and have a high confidence that it will work perfectly correctly!

The "T"-types (no, I have NO IDEA why "T" figures so prominently) have definitions based on the settings of these symbols; the representations are:

Declaration     no [_]UNICODE      [_]UNICODE
TCHAR           char               wchar_t
WCHAR           wchar_t            wchar_t
CHAR            char               char
LPTSTR          char *             wchar_t *
LPWSTR          wchar_t *          wchar_t *
LPSTR           char *             char *
CString         CStringA           CStringW    [VS > 6 only]
CStringA        CStringA           CStringA
CStringW        CStringW           CStringW
_ftprintf       fprintf            fwprintf
_tcscmp         strcmp             wcscmp

The use of _T creates literals of the right type.
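A minimal sketch of code written against the table above (illustrative, not from the post; GetWindowsDirectory and _tprintf are just convenient examples of a T-mapped API and CRT function). The same source compiles as either the 8-bit or the Unicode variant with no changes:

#include <windows.h>
#include <tchar.h>
#include <stdio.h>

// TCHAR, _T() and _countof resolve per the table above, and the call below
// becomes GetWindowsDirectoryA or GetWindowsDirectoryW to match the build.
void ShowWindowsDirectory()
{
    TCHAR path[MAX_PATH];
    if (::GetWindowsDirectory(path, _countof(path)) > 0)
    {
        _tprintf(_T("Windows directory: %s\n"), path);   // with the Microsoft CRT, %s matches TCHAR* here
    }
}

Written this way, flipping the project between "Use Multi-Byte Character Set" and "Use Unicode Character Set" is just a recompile, which is exactly the habit being described.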
If you don't use _T, then the literal is exactly as you declare it, and is 8-bit or Unicode no matter what your compilation mode:

                      no [_]UNICODE        [_]UNICODE
_T("x")               "x"                  L"x"
_T('x')               'x'                  L'x'
"x"                   "x"                  "x"
'x'                   'x'                  'x'
L"x"                  L"x"                 L"x"
L'x'                  L'x'                 L'x'

Given TCHAR buffer[2];

_countof(buffer)      2                    2
sizeof(buffer)        2                    4
sizeof(TCHAR)         1                    2

LPSTR s = "x";        works                works
LPWSTR s = L"x";      works                works
LPTSTR s = _T("x");   works                works
LPTSTR s = "x";       works                compilation error
LPTSTR s = L"x";      compilation error    works
LPSTR s = _T("x");    works                compilation error
LPWSTR s = "x";       compilation error    compilation error
LPSTR s = L"x";       compilation error    compilation error

For any API,

AnyApi(LPTSTR)

is really one of

AnyApiA(LPSTR)
AnyApiW(LPWSTR)

If the API takes a pointer to a struct that has a string, we have in the header file (look at winuser.h, winbase.h, wingdi.h, or pretty much any Windows header file):

typedef struct {
    LPSTR p;
    int x;
} SomeStructA;

typedef struct {
    LPWSTR p;
    int x;
} SomeStructW;

void SomeAPIA(SomeStructA * p);
void SomeAPIW(SomeStructW * p);

#ifdef UNICODE
#define SomeStruct SomeStructW
#define SomeAPI SomeAPIW
#else
#define SomeStruct SomeStructA
#define SomeAPI SomeAPIA
#endif

Learning the correct programming style is *not* done when your manager asks you to convert 200K lines of source to Unicode. It is done when you first start programming.
*****
> so I have the following questions:
>1. For the following code (said to be unsafe)
>strcpy(NewFontLogStruct.lfFaceName, "Courier New");
****
Assume strcpy is ALWAYS unsafe. ALWAYS. NEVER use it ANYWHERE, for ANY REASON WHATSOEVER. This has nothing to do with Unicode, and everything to do with safe programming methodologies. strcpy, strcat and sprintf are always-and-forever deadly, and should never, ever be used in modern programming. They are archaic leftovers from an era when software safety was considered fairly unimportant. Years of virus infestation have made us conscious of the fact that they are no longer acceptable. So while there are Unicode versions wcscpy, wcscat, and swprintf, and Unicode-aware versions (look in tchar.h) _tcscpy, _tcscat, and _stprintf, these are equally unsafe and must never be used.

Do you remember "Code Red"? It got in by a buffer overrun caused by a strcpy that didn't check bounds. Hundreds of thousands of machines were infested in a small number of hours.

[For the person who always says, "Joe, you keep saying things have been broken. How did we survive all those years if everything was as broken as you claim?" the answer is that we didn't, and the number of virus infestations that occur because of failure to check buffer bounds is testimony to the fact that things really WERE broken. Mostly, we had apps that crashed. That's no longer the case. We now have mission-critical servers and critical corporate data placed at risk due to these bad practices. Denial-of-service, data corruption, and industrial espionage are among the risks.]
****
> Is the following (groping self created hack) any safer ?
>char holder [ (sizeof(NewFontLogStruct.lfFaceName)) ] = "Courier New";
> for(int i = 0; i < (sizeof(NewFontLogStruct.lfFaceName)); i++)
> NewFontLogStruct.lfFaceName[i] = holder[i];
****
No. These are unnecessary, and the code is in fact incorrect. You don't need a holder at all, for example, and it would be inappropriate to introduce a gratuitous variable for this purpose. Your copy is overkill, because you only need to copy up to the NUL.
It is also erroneous, in that it fails to NUL-terminate a string that is exactly as long as _countof(lfFaceName), or longer, resulting in incorrect behavior when the data is used in the future. Furthermore, you have still assumed that you are using 8-bit characters and that sizeof() is the correct approach. This code will NOT work with Unicode.

I showed you the correct code. Use either _tcscpy_s or StringCchCopy. If you want to write the above code correctly (although it is all completely unnecessary) it would be

TCHAR holder[sizeof(NewFontLogStruct.lfFaceName)/sizeof(TCHAR)] = _T("Courier New");
for(int i = 0; i < sizeof(NewFontLogStruct.lfFaceName)/sizeof(TCHAR); i++)
   {
    NewFontLogStruct.lfFaceName[i] = holder[i];
    if(holder[i] == _T('\0'))
       break;
   }
NewFontLogStruct.lfFaceName[(sizeof(NewFontLogStruct.lfFaceName)/sizeof(TCHAR)) - 1] = _T('\0');

Notice how much easier it is to use _tcscpy_s or StringCchCopy!

It is useful to do the following:

#ifndef _countof
#define _countof(x) (sizeof(x) / sizeof((x)[0]))
#endif

This works for all versions < VS2008 and doesn't do anything in VS2008, where _countof is already defined. Then you could write

TCHAR holder[_countof(NewFontLogStruct.lfFaceName)] = _T("Courier New");
for(int i = 0; i < _countof(NewFontLogStruct.lfFaceName); i++)
   {
    NewFontLogStruct.lfFaceName[i] = holder[i];
    if(holder[i] == _T('\0'))
       break;
   }
NewFontLogStruct.lfFaceName[_countof(NewFontLogStruct.lfFaceName) - 1] = _T('\0');

Now why the "break" statement? Consider the case where the source string sits at the end of a page:

| Courier New\0|###################|

where ##### is a page that does not actually exist. If you try to copy more characters than the string "Courier New" (including the terminal NUL) contains, then you will take an access fault. So you MUST terminate the copy on a NUL character. Maybe you can get away with it in the case of a local variable, but it is not at all good policy, and because of the potential error, it should not be written that way. In particular, you don't need the local variable; you could have written

LPTSTR holder = _T("Courier New");

and then the above error would be potentially fatal.
****
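Joe names _tcscpy_s and StringCchCopy without spelling them out, so here is a minimal sketch of both (illustrative, not code from the post; it assumes the LOGFONT variable name from RB's snippet and that <strsafe.h>, the strsafe library RB mentioned installing, is available):

#include <windows.h>
#include <tchar.h>
#include <strsafe.h>   // StringCchCopy

void SetFaceName(LOGFONT& NewFontLogStruct)
{
    // Secure-CRT form: the destination size is checked, the result is always
    // NUL-terminated, and an over-long source is reported as an error rather
    // than overrunning the buffer.
    _tcscpy_s(NewFontLogStruct.lfFaceName,
              _countof(NewFontLogStruct.lfFaceName),
              _T("Courier New"));

    // strsafe form: never overruns, always NUL-terminates, and reports the
    // outcome through an HRESULT you can test.
    HRESULT hr = StringCchCopy(NewFontLogStruct.lfFaceName,
                               _countof(NewFontLogStruct.lfFaceName),
                               _T("Courier New"));
    if (FAILED(hr))
    {
        // handle truncation or an invalid parameter here
    }
}

Either call alone is enough; the two are shown together only for comparison.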
>
>2. Could someone direct me to a bibliography that would illuminate a dummy like
>me on the ramifications more clearly so I can fully understand the ins and outs of
>this ? (or feel free to try and explain it in brief if possible)
>I.e. Joe has told me to
>>Never use 8-bit characters or assume they exist, except in
>>exceedingly rare and exotic circumstances, of which this is most
>>definitely not an example.
*****
This deals with the notion of always creating programs that represent absolutely best practice. Sure, it's "safe" to use 8-bit characters as a way of life, until the day you land the Chinese, Korean, or Japanese software contract. Then you find your future as an employee of that company severely at risk. It also means that if you have carefully written code assuming that sizeof(buffer) == number-of-characters-in-buffer, your code is riddled with fatal errors. Consider the following:

TCHAR buffer[SOMESIZE];
SomeAPI(buffer, sizeof(buffer));

This works ONLY with 8-bit characters. Suppose SOMESIZE is 20. You get

SomeAPI(buffer, 20);

meaning there is space in the buffer for 20 characters. But when you convert to Unicode, the call becomes

TCHAR buffer[20];
SomeAPI(buffer, 40);

so you tell the OS it has 40 character positions, when in fact you only have 20.

The correct code is

SomeAPI(buffer, sizeof(buffer)/sizeof(TCHAR));

or

SomeAPI(buffer, _countof(buffer));

which, independent of compilation mode, ALWAYS compiles correctly, and will compile as either

SomeAPIA(buffer, 20);

or

SomeAPIW(buffer, 20);

because all APIs that take strings are fictional; only the -A and -W forms actually exist.

Similarly, if you do WriteFile, which writes BYTES, then you have to write, for example,

LPTSTR data;
....
::WriteFile(hFile, data, _tcslen(data) * sizeof(TCHAR), &bytesWritten, NULL);

because you have to convert character counts (_tcslen) to byte counts (required by WriteFile). If you always code this way, the actual conversion to Unicode is often a recompilation. If you just pass _tcslen(data) as the byte count, it works correctly for 8-bit apps but only writes HALF the text for Unicode.

The issues are not the "immediate" safety of an 8-bit app, but the ultimate safety if it is converted to Unicode. Doing a Unicode conversion of 200K lines which were never written Unicode-aware is a tedious, perilous operation which may result in unexpected fatal errors, including application crashes, security problems arising from buffer overruns, and subtle data corruption failures (e.g., if WriteFile only wrote half the text and nobody noticed for a year...).

VS2005 and later by default generate Unicode apps; if you have the correct programming habits, your code will naturally flow even when you upgrade from 8-bit to Unicode. Pieces of code you write can be used in more modern environments. It's just Good Programming Style.
*****
>But if I look at a character in a hex editor on my machine (from a text file)
>it is only one byte in size, so obviously Joe (fantastic helpful guy that he is)
>is talking over my grasp of the situation. Hopefully I can learn this eventually
>to bring myself out of unicode darkness.
****
If you compile with UNICODE/_UNICODE undefined, then by default you get 8-bit characters. If you compile with UNICODE/_UNICODE defined, you will see that your characters are 16-bit. But you will have a ton of errors until you recode as Unicode-aware. Then you can compile in either mode. And you are developing the right programming habits for the "real world" of commercial application design, and modern MFC programming.
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
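Joe's WriteFile example above shows the character-count versus byte-count distinction; combined with the CStringA/CStringW classes he mentions, the "keep the program in Unicode, convert at the 8-bit boundary" idea looks roughly like the sketch below (assumptions, not from the thread: the ATL/MFC string classes and the CW2A conversion helper of VS2005+ are available, and the function name is made up):

#include <windows.h>
#include <atlstr.h>    // CStringA / CStringW
#include <atlconv.h>   // CW2A conversion helper

// Keep text as UTF-16 inside the program and convert to UTF-8 only when it
// hits an 8-bit boundary such as a file on disk.
void WriteAsUtf8(HANDLE hFile, const CStringW& text)
{
    CStringA utf8(CW2A(text, CP_UTF8));    // UTF-16 -> UTF-8
    DWORD written = 0;
    ::WriteFile(hFile, utf8, utf8.GetLength(), &written, NULL);   // bytes, not characters
}

The GetLength() of the CStringA is already a byte count, so the _tcslen * sizeof(TCHAR) arithmetic from Joe's example is not needed here.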
From: RB on 23 Feb 2010 20:16

Thanks Giovanni for the information; I will go over these links. It is going to take me a while to work through Joe's reply, but I will reply to him later. I feel like I am starting to get a bit more comprehension of how this affects any scenario, but I am still assembling all the pieces.

> If you use VS2005 and above, the strcpy is just expanded to a proper form of strcpy_s thanks to C++ template magic.

> About Unicode, there is an interesting article here:
>
> "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Mihai Nita's blog is a must read as well:
> http://www.mihai-nita.net/
>
> Couple of posts on "The Old New Thing" blog by Raymond Chen:
>
> http://blogs.msdn.com/oldnewthing/archive/2004/02/12/71851.aspx
> http://blogs.msdn.com/oldnewthing/archive/2004/07/15/184076.aspx
>
> And you can't miss these articles on CodeProject written by Mike Dunn:
>
> The Complete Guide to C++ Strings, Part I - Win32 Character Encodings
> http://www.codeproject.com/KB/string/cppstringguide1.aspx
>
> The Complete Guide to C++ Strings, Part II - String Wrapper Classes
> http://www.codeproject.com/KB/string/cppstringguide2.aspx
>
> HTH,
> Giovanni
From: RB on 24 Feb 2010 18:00
> tchar.h has these automatic, that is, if you want to check a character for alphabetic,
> you would call _istalpha(...)
> which will work if the build is either Unicode or 8-bit, whereas
> isalpha(...)
> works correctly only if the character is 8-bit, and
> iswalpha(...)
> works correctly only if the character is Unicode (but if you call setlocale correctly,
> will handle alphabetic characters in other languages).

Ok, this sounds good; some of the work can be done for me if I learn enough. Before, I was individually writing separate code sections called from alternating areas in my code, like

#ifdef UNICODE
    iswalpha(...)
#else
    isalpha(...)
#endif

and all the code I wrote in each section, depending upon the returns, was getting to be too much for me. But from what you are saying, it would appear that if I educate myself some more on the text routine mappings of the TCHAR type, I could just call _istalpha(...) [which on my system maps to _ismbcalpha(...)] and then write only the "one and only" return-handling routine. This sounds very good. Heck yes it does.

So if I code all my character variables as TCHAR, a lot of my mapping will be done for me depending on whether UNICODE is defined or not. And for string literals, does it matter if I use TEXT, _TEXT, or _T? Some bibliographies say for C++ I should be using TEXT, while others say I can use either _TEXT or _T; each seems to expand to the same result. And does it matter where I define (or not define) UNICODE in my source files?

> both 8-bit and Unicode as determined on-the-fly during runtime. It is trivial
> in VS > 6 because you can read the data in as 8-bit, immediately convert it to
> Unicode, and continue on, not having to do anything special except use CStringA
> for the 8-bit input)

Yea, I am going to have to talk my wife into the cost of a new VS, it would appear. I don't think I qualify for the upgrade pricing, since I bought my first one under academic discount pricing as a student in college (taking courses at night).

> If you expect to get through your entire career never writing code for anyone other
> than yourself, it won't matter. But if you write anything that goes out the door,
> you should probably expect that Unicode support will be required. Even simple
> things like people's surnames in another language can be an issue, For example,
> suppose you want to get the correct spelling of the composer Antonin Dvorak.
> The "r" has a little accent mark over it, and you can only represent that in Unicode.

Yea, that is a good premise example. Sounds like he might be Swedish.

> When VS2005 came along and by default created Unicode apps, I never noticed.
> I kept programming in exactly the same way I had been programming for a decade.

I have an option to buy a VS Pro 2005 at a good price, but I heard that 2005 did not have a class wizard, etc. What is your input on that?

> It's 2010. As far as writing code, 8-bit characters are almost completely dead.
> Note that many *files* are still kept in Unicode, but that's not the same as
> programming, because you can always use an 8-bit encoding like UTF-8 to keep your
> text in 8-bit files.

Yes, I am aware of the different prefix codes for files.

> But you should always "think Unicode" inside a program. It's worth the time.

Ok, I will start trying immediately. Thanks again (for everything). Later.........RB
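A minimal sketch of the single-source approach described above (illustrative; it uses the <tchar.h> generic-text mappings being discussed, and the helper name is made up):

#include <tchar.h>
#include <ctype.h>    // isalpha for an ANSI build
#include <wchar.h>    // iswalpha for a Unicode build

// One routine instead of parallel #ifdef UNICODE / #else copies: _istalpha
// expands to isalpha, _ismbcalpha, or iswalpha depending on how the project
// is built, so the logic is written exactly once.
// (In an ANSI build, characters outside 7-bit ASCII should be cast to
// unsigned char before being passed to the classification macro.)
int CountAlphabetic(const TCHAR* s)
{
    int count = 0;
    for (; *s != _T('\0'); ++s)
    {
        if (_istalpha(*s))
            ++count;
    }
    return count;
}

Usage: int n = CountAlphabetic(_T("Hello 123")); gives n == 5, and the same line compiles in either build.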