From: Mihai N. on 4 Aug 2006 03:46

> So what? Where we intuitively think that the stated limit of MAX_PATH
> characters means MAX_PATH chars in ANSI, Microsoft informed me that the
> limit really is MAX_PATH characters even if it takes twice that many bytes.

This means our intuition is wrong :-)
It is an internal limitation, so we should think about how Windows works internally. And that is Unicode.
I bet in Windows 9x the limit is MAX_PATH char (the 1-byte programming char, not the user "character").

> You asked for examples of cases where we had been wrong in nearly always
> assuming that MSDN's statements about characters meant TCHARs, and this is
> a big example.

True, the example is good, but the doc is not clear.

> You suspect that Microsoft's e-mail to me was accurate, and as mentioned, I
> have the same impression. Though they send a lot of unbelievable e-mails,
> they send some believable e-mails too and this was one.

Yes, I think the email is accurate, and you are right, the doc is not clear.
Just noting that here the limit is "in the belly", so it might be a bit different from something you pass as a parameter.
For instance, the internal implementation of some ANSI API might be:

int BlaBlaA( char * buffer, int nBufLen ) {
    // here nBufLen is char count
    WCHAR * wideBuff = new WCHAR [nBufLen];
    MultiByteToWideChar( GetACP(), flags, buffer, nBufLen, wideBuff, nBufLen );
    int nRez = BlaBlaW( wideBuff, nBufLen ); // here nBufLen is WCHAR count
    delete [] wideBuff;
    return nRez;
}

OK, I guess the whole thing has some error checking and does some kind of memory reuse, not new/delete for each API :-) but this is the idea.
So for APIs that take the length as a parameter, the limit tends to really be in chars in the ANSI API.

> Yup. By the way, considering that VFAT can store a filename consisting of
> around 250 Kanji, one weekend experiment would be to try opening the file
> under Windows 98 (Japanese version of course).

I am quite sure the limit is in chars there.

> But really I'll consider it
> close enough if it works under Windows 2000, XP, 2003, and Vista beta. I
> haven't had time to test it and I do believe that mail.

I also believe the email :-)

OK, this is getting fuzzy.
So, in the end, I am not arguing with you.
My initial affirmation ("in fact very few APIs") means I know there are some APIs, just that I could not think of one off the top of my head. And I asked you for examples to learn something.
And yes, you are also right that for the example the doc is unclear.

--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
From: Joseph M. Newcomer on 7 Aug 2006 00:59

I think the confusion here is that you are interpreting "character" in one context as "a sequence of bytes representing a glyph", and StringCchPrintf, as I said, when %c is used, does NOT interpret the word 'character' this way. So you can interpret it any way you want, but the only interpretation that matters is the interpretation given by StringCchPrintf, and you can see that easily, as I said, by READING THE CODE and PERFORMING THE EXPERIMENT.

Now, if you have a working system with code page 932 in place, try the experiments I did, and tell us what you get. Try %c, in an ANSI code page, using any bit value of your choice for the character value, and tell us what StringCchPrintf does with respect to %c. I was not discussing %s, but %c, which you insist won't work. So if you're convinced it produces more than one 8-bit character or 16-bit character of output, please demonstrate this.

Note that %lc and %C *do* expand wide character codes to multibyte representations, but that was not what we were discussing.
joe

On Thu, 3 Aug 2006 10:34:15 +0900, "Norman Diamond" <ndiamond(a)community.nospam> wrote:

>> Multibyte Character Set is an *encoding* of a character set.
>
>Yes, ANSI code page 932 is an encoding just like other ANSI code pages such
>as (I might not be remembering these numbers correctly) 1252 and 850.
>
>> however, StringCchPrintf, sprintf, etc. do only convert characters using
>> code pages in special cases, e.g., %lc or %C format.
>
>And %s and stuff like that. (If you're compiling in an ANSI environment
>then simply use %s, but if you're compiling in a Unicode environment and
>want to produce an ANSI encoded string then use %S.)
>
>> For ANSI mode, this means that 'character' is 'byte'. In ANSI mode, one
>> character is one byte.
>
>For some reason I thought that you had sometimes written code targeting
>ANSI code pages in which you knew that these statements are not true. It
>looks like I misremembered. OK, then it seems that this is your
>introduction to such code pages. In ANSI mode, one character is one or more
>bytes. In the ANSI code pages that Microsoft implemented, one character is
>one or two bytes, no more than two.
>
>I haven't been using Japanese Microsoft systems for nearly 20 years, I've
>only been using them for half that length of time and occasionally seen them
>in use the other half of that time while I was using Japanese Unix and
>Japanese VMS systems. I've used %s format in printf in Japanese Unix and
>VMS and Windows systems. This is one kind of experiment that you don't need
>to tell me to do.
>
>I will continue to respect your expertise on matters other than character
>encodings.
>
>
>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
>news:b9i1d2p7ca3n59258h63bc1mavfgjngicd(a)4ax.com...
>> Multibyte Character Set is an *encoding* of a character set. In ANSI
>> mode, MBCS can be used to encode 'characters' in an extended set;
>> however, StringCchPrintf, sprintf, etc. do only convert characters using
>> code pages in special cases, e.g., %lc or %C format. The formal
>> definition for %c, the formatting code being discussed in this example,
>> is that the int argument is converted to 'unsigned char' and formatted
>> as a character. For ANSI mode, this means that 'character' is 'byte'.
>> In ANSI mode, one character is one byte.
>>
>> In a multibyte character set, a glyph might be represented by one to
>> four successive 8-bit bytes. Note that using %c would be erroneous for
>> formatting an integer value, if the intent was to produce a multibyte
>> sequence representing a single logical character.
>>
>> This can easily be seen by looking at the %c formatting code in output.c
>> in the CRT source. %c formats exactly one byte in ANSI mode. So arguing
>> that %c requires two bytes for a character is not correct.
>>
>> The exact code executed for %c formatting is
>>     unsigned short temp;
>>     temp = (unsigned short) get_int_arg(&argptr);
>>     {
>>         buffer.sz[0] = (char) temp;
>>         textlen = 1;
>>     }
>>
>> I see nothing here that can generate more than one byte of output. Note
>> that the %C and %lc formats, which take wide character values and format
>> them in accordance with the code page, *can* generate more than one byte
>> of character, which does satisfy the objection raised. But the format
>> here is clearly %c, and %c is clearly defined, and the implementation
>> reflects that definition. So I'm not sure what the issue is here.
>>
>> StringCchPrintf is defined in terms of 8-bit characters and 16-bit
>> characters, not in terms of logical characters encoded in an MBCS. MBCS
>> does not enter the discussion; if you format using %lc or %C it will
>> actually truncate the multibyte string to fit in the buffer. Thus, it
>> obeys its requirement of not allowing a buffer overrun.
>>
>> This can be seen trivially simply by--get this--DOING THE EXPERIMENT!!!!!
>> So while you can contend until the cows come home that you think that
>> you know how to read the documentation, it is a matter of a couple
>> minutes to actually do the experiment. I found that even when the wctomb
>> function produces a sequence of multiple bytes to represent the wide
>> character as a multibyte character, when formatting with %lc, the ANSI
>> definition of StringCchPrintf is in terms of ANSI characters, 8-bit
>> bytes, and it writes exactly one of the three bytes of the multibyte
>> sequence, the first byte. So the sequence
>>
>>     StringCchPrintf(buffer, sizeof(buffer), "%lc", 0xF95C);
>>
>> will simply transfer to the target buffer the first 8-bit byte of what
>> turned out to be a 3-byte multibyte sequence.
>>
>> Note that since I don't have appropriate multinational support, I had to
>> actually set a breakpoint and "fake" the results of wctomb, because what
>> it does on my machine is fail the conversion and return -1. So I simply
>> placed two bytes and a NUL into the buffer as if wctomb had worked
>> correctly, changed the length to 2, and proceeded with the execution.
>> Otherwise, I just get an empty string.
>>
>> UTF-8 is one of the many multibyte character encodings that exist.
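
The %c claim in the message above can be checked without a Windows machine, at least for the standard C library. The following is a minimal sketch, assuming the CRT follows the C standard's definition of %c (the int argument is converted to unsigned char and written as one character). It uses plain snprintf, not StringCchPrintf, which is Windows-only and is not called here; the helper name format_with_percent_c is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: formats `value` with %c into buf and returns the
   number of bytes produced, not counting the terminating NUL.
   Uses standard snprintf as a stand-in for the ANSI StringCchPrintf. */
static int format_with_percent_c(char *buf, size_t size, int value)
{
    assert(buf != NULL && size >= 2);   /* room for one byte plus NUL */
    return snprintf(buf, size, "%c", value);
}
```

Under that assumption, passing a value wider than 8 bits (for example 0x8341) still yields a single byte of output, the low-order byte, because the argument is narrowed to unsigned char before it is written. Producing a two-byte CP932 character therefore needs %s with a byte string, or a wide-character format, rather than %c.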
From: Joseph M. Newcomer on 7 Aug 2006 01:02

I know that. But it *is* a valid encoding, and it *is* supported in Windows. You just can't pass a UTF-8 string to nearly any API and have anything reasonable happen.
joe

On Wed, 02 Aug 2006 23:11:23 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>> In a multibyte character set, a glyph might be represented by one to four
>> successive 8-bit bytes.
>...
>> UTF-8 is one of the many multibyte character encodings that exist.
>> I chose it as an example because it is specified in the Unicode standard.
>
>You should never use UTF-8 as an example in the Windows world. It is
>guaranteed to give weird results, since it is not supported.
>Windows only knows about ANSI code pages (and UTF-8 cannot be that) or
>UTF-16.
>The only place where utf-8 is ok in Windows is in API doing conversion
>to/from utf-16

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on 7 Aug 2006 01:08

All A-suffix APIs use Unicode internally. The entire kernel is written in terms of Unicode, so all A-suffix APIs first convert the ANSI text to Unicode and then call the actual internal implementation of the API. This means that if you pass in a UTF-8 string, it isn't seen as UTF-8; it's seen as 8-bit ANSI bytes, and will be converted to 16-bit characters as if it were a sequence of 8-bit characters, which leads to the comment that "UTF-8 is not supported". It *is* supported, but not at the kernel API interface level. I use it to send and receive Web page information and similar tasks. CP_UTF8 is one of the supported types in MultiByteToWideChar and WideCharToMultiByte.
joe

On Thu, 3 Aug 2006 10:23:20 +0900, "Norman Diamond" <ndiamond(a)community.nospam> wrote:

>"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
>news:Xns9813D785509FMihaiN(a)207.46.248.16...
>[Norman Diamond:]
>>> Except that all the Win32 APIs don't actually work that way. SOME Win32
>>> APIs count TCHARs, i.e. counting chars in ANSI and counting wchar_ts in
>>> Unicode. But SOME Win32 APIs really count characters. Microsoft has
>>> responded to a few cases, including one personally, to say that for some
>>> Win32 APIs, even in the ANSI versions, internal processing is performed
>>> in Unicode and the limits are counted in actual characters rather than in
>>> the number of bytes required for the ANSI representations.
>>
>> Can you give some examples?
>
>The one for which Microsoft sent personal e-mail was CreateFile. Microsoft
>assured me that even the ANSI version (CreateFileA) uses Unicode internally
>and MAX_PATH is the limit on the number of characters internally, so if an
>ANSI application needs more than MAX_PATH bytes to specify a usable filename
>then it can indeed do so. I've been a bit negligent in not writing a test
>program to test this answer yet.
>
>The other cases that I recall were discussed in newsgroups, most likely
>microsoft.public.win32.programmer.ui. It's been a while now. In general I
>learned from it that even in cases where we think MSDN pretty obviously
>doesn't mean what it says, sometimes it really does mean what it says.
>
>> In my experience <<There are in fact very few APIs that deal with the
>> "user character">> and those are decently documented.
>
>Either that or there are very few that are decently documented ^_^

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
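
To make the "UTF-8 seen as 8-bit ANSI bytes" point concrete, here is a small portable sketch. It is an illustration under a simplifying assumption, not what any A-suffix API actually does: the helper widen_bytes_one_to_one (a hypothetical name) maps each byte to one 16-bit unit, which is roughly what a single-byte code page conversion amounts to, while utf8_code_point_count (also hypothetical) counts the characters a real UTF-8 decoder would see in well-formed input. A real ANSI code page may map individual bytes to other code points, and a double-byte code page such as 932 would pair bytes differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: treat every input byte as one single-byte
   character and widen it to one 16-bit unit. Returns units written. */
static size_t widen_bytes_one_to_one(const unsigned char *src, size_t n,
                                     uint16_t *dst, size_t cap)
{
    size_t i;
    assert(src != NULL && dst != NULL && cap >= n);
    for (i = 0; i < n; ++i)
        dst[i] = src[i];
    return n;
}

/* Counts code points in well-formed UTF-8 by counting every byte that is
   not a continuation byte (continuation bytes look like 10xxxxxx). */
static size_t utf8_code_point_count(const unsigned char *src, size_t n)
{
    size_t i, count = 0;
    assert(src != NULL);
    for (i = 0; i < n; ++i)
        if ((src[i] & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For the five bytes C3 A9 E6 97 A5 (two characters in UTF-8), the byte-wise widening produces five 16-bit units, none of which is the intended character. On Windows, the usual way to avoid this is to convert explicitly with MultiByteToWideChar and CP_UTF8, as mentioned above, and call the W version of the API.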
From: Joseph M. Newcomer on 7 Aug 2006 01:13

But the limit is MAX_PATH characters. That's what we've been discussing. In Unicode mode, the limit is MAX_PATH characters, which would occupy 2*MAX_PATH bytes. That is, MAX_PATH TCHARs, and therefore their comment is completely CONSISTENT with the fact that a 'character' is a 'TCHAR'. Since you can't use any multibyte encoding in CreateFile, I don't see where there is any problem here. 'character', in nearly every context we've discussed, means 'TCHAR'.

This also means that if you are using Unicode to represent Kanji, then you should be able to use MAX_PATH Kanji characters to name a file.
joe

On Thu, 3 Aug 2006 17:13:52 +0900, "Norman Diamond" <ndiamond(a)community.nospam> wrote:

>"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
>news:Xns9813EB11E9700MihaiN(a)207.46.248.16...
>>> The one for which Microsoft sent personal e-mail was CreateFile.
>>> Microsoft assured me that even the ANSI version (CreateFileA) uses
>>> Unicode internally and MAX_PATH is the limit on the number of characters
>>> internally, so if an ANSI application needs more than MAX_PATH bytes to
>>> specify a usable filename then it can indeed do so. I've been a bit
>>> negligent in not writing a test program to test this answer yet.
>>
>> But CreateFile does not take a number of chars as parameter.
>
>So what? Where we intuitively think that the stated limit of MAX_PATH
>characters means MAX_PATH chars in ANSI, Microsoft informed me that the
>limit really is MAX_PATH characters even if it takes twice that many bytes.
>You asked for examples of cases where we had been wrong in nearly always
>assuming that MSDN's statements about characters meant TCHARs, and this is a
>big example.
>
>> What I suspect is happening is that the MAX_PATH is the limit if you don't
>> use "\\?\" and is there both in the W and A versions.
>> And since the A version does a conversion to Unicode and calls the W one,
>> the limit is probably there and expressed in utf16 code units, indeed.
>
>You suspect that Microsoft's e-mail to me was accurate, and as mentioned, I
>have the same impression. Though they send a lot of unbelievable e-mails,
>they send some believable e-mails too and this was one.
>
>> Interesting for some week-end experiments :-)
>
>Yup. By the way, considering that VFAT can store a filename consisting of
>around 250 Kanji, one weekend experiment would be to try opening the file
>under Windows 98 (Japanese version of course). But really I'll consider it
>close enough if it works under Windows 2000, XP, 2003, and Vista beta. I
>haven't had time to test it and I do believe that mail.

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
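
The bytes-versus-characters distinction behind the 250 Kanji example can be sketched in portable C. This is only an illustration of the arithmetic, not a test of CreateFileA: it counts characters in a code page 932 byte string using the CP932 lead-byte ranges (0x81-0x9F and 0xE0-0xFC). The function names are hypothetical, and the sketch assumes each CP932 character ends up as one UTF-16 code unit after conversion, which I believe holds for that code page but have not verified exhaustively.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte ranges of code page 932 (Shift-JIS): 0x81-0x9F, 0xE0-0xFC. */
static int cp932_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts characters in a NUL-terminated CP932 string. A lead byte that is
   directly followed by the terminator is counted as one truncated
   character. Trail bytes are skipped, not validated. */
static size_t cp932_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    while (*s != '\0') {
        if (cp932_is_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;     /* double-byte character */
        else
            s += 1;     /* single-byte character */
        ++count;
    }
    return count;
}
```

A name made of 250 Kanji occupies 500 bytes in CP932, which is more than MAX_PATH (260) bytes, yet it is only 250 characters. Whether such a name is accepted depends on which of the two units the limit is really counted in, which is exactly the experiment proposed in the thread.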