From: Rick Rothstein on 13 Feb 2010 00:11

>> Assuming any characters above ASCII 255 in a text string makes the
>> text non-English, then does something like this work (note that is
>> a space character after the exclamation point)?
>
> You have to distinguish AscW() results from Asc() results. If you use
> AscW(), you will see 'English' characters with codes > 255. Using
> Asc() you won't, but you may then qualify some non-English characters
> as English (which may be OK depending on the circumstance).
>
> I don't know if Like examines the Unicode characters... if so, then it
> will act the way AscW() does and fail some valid characters.

I have no idea about this stuff at all. Having only ever worked with US regional settings, I know next to nothing about the international world of VB... I only threw the idea out there in case it might work. I was hoping someone knowledgeable about such things would test it out and see if it could be used or not.

--
Rick (MVP - Excel)
From: CY on 13 Feb 2010 14:01

Good ideas, but I have a concern: what if a file using ASCII 32-128 is English, but in some country characters such as [ ] and | are remapped, as we do/did... then it gets a bit cumbersome again ;)

That is, if the file is interpreted by the PC's code page, 850 or 437 (or was this just in the good old days?)

//CY
From: Nobody on 13 Feb 2010 18:29

"Phil Hunt" <aaa(a)aaa.com> wrote in message news:unthJABrKHA.1796(a)TK2MSFTNGP02.phx.gbl...
> What is the best way to determine if a string contains "non English"
> character ?

I have not developed international applications, but I know more than those who use one language only.

First, you need to treat a sequence of bytes as an encoded stream of characters that must be decoded first. You can't assume that every byte is a character, or that every two bytes are one character, because of the various encoding schemes, such as Multi-Byte Character Sets (MBCS) and surrogate pairs in UTF-16 (in which case 4 bytes represent one character). You also can't assume that byte values in the range 0 to 127 are English only, although in most cases they are. You have to know how the characters were encoded. For example, some MBCS encodings use byte values in the range 33 to 126 as part of other characters. In UTF-32, however, each character is always 4 bytes, with a fixed meaning.

In ANSI and Unicode: 0-127 have a fixed meaning, and they are one and the same.
In Unicode: 128-255 have a fixed meaning; they follow ISO/IEC 8859-1.
In ANSI: 128-255 have a meaning that depends on the Code Page (CP) in use.

In the US and Western Europe, Windows uses the "Windows-1252" code page (CP1252). Characters in the range 160 to 255 are identical to Unicode, but most of the characters in the range 128 to 159 are not. So even for English it is not safe to assume that characters in the range 128 to 255 are identical to Unicode.

In VB, strings are stored internally as UTF-16. However, the controls are ANSI, and when you call API functions an ANSI copy of the string is created (when using ByVal/ByRef As String) and copied back if you used ByRef As String. To pass Unicode strings to API functions, you must use "ByVal StrPtr(s)", and in most cases you have to use the W version of the function.
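To make the last paragraph concrete, here is a minimal sketch, not taken from the original post. It prints the raw bytes VB holds for a string and then passes the same string to a W function through StrPtr. The names DumpBytes and Demo are made up for illustration; MessageBoxW is the standard user32 entry point, declared here with Long parameters so that StrPtr values can be passed.

```vb
Option Explicit

' Unicode (W) entry point; both strings are passed as raw pointers.
Private Declare Function MessageBoxW Lib "user32" ( _
    ByVal hWnd As Long, _
    ByVal lpText As Long, _
    ByVal lpCaption As Long, _
    ByVal uType As Long) As Long

' Hypothetical helper: print the raw bytes VB holds for a string
' (UTF-16, low byte first).
Private Sub DumpBytes(ByVal s As String)
    Dim b() As Byte
    Dim i As Long

    If Len(s) = 0 Then Exit Sub
    b = s   ' String-to-Byte-array assignment copies the UTF-16 data
    For i = LBound(b) To UBound(b)
        Debug.Print i, Hex$(b(i))
    Next
End Sub

Private Sub Demo()
    Dim sText As String
    Dim sTitle As String

    ' "A" plus GREEK CAPITAL LETTER OMEGA, which CP1252 cannot represent
    sText = "A" & ChrW$(&H3A9)
    sTitle = "StrPtr demo"

    DumpBytes sText   ' two bytes per character

    ' StrPtr passes the address of the UTF-16 data, so no ANSI copy is made
    MessageBoxW 0, StrPtr(sText), StrPtr(sTitle), 0
End Sub
```

If the same text were passed through an "A" declaration with ByVal As String, the omega would go through the ANSI conversion described above and most likely show up as "?" on a CP1252 system.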
The main API functions used for converting between Unicode and non-Unicode are WideCharToMultiByte/MultiByteToWideChar, typically with CP_ACP as the code page, which means "use the current code page". Also, the Chr() function in VB treats the number you provide as a character code in the current system code page and returns a Unicode character, while ChrW() doesn't do any transformation and is therefore faster. The same applies to Asc/AscW. Asc() uses the current system code page, and returns 63 ("?") if the character cannot be represented.

Some links:

http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/Windows-1252
http://en.wikipedia.org/wiki/Multi-byte_character_set

The links above are basically derived from the first link. To answer your question, visit the "Latin characters in Unicode" link above, check the ranges that start with "Latin", and compare them with the AscW() value.

Sample code to show how VB6+SP5 deals with characters in the range 128 to 159 on an English-US based OS (XP+SP2):

Option Explicit

Private Sub Form_Load()
    Dim i As Long
    Dim s As String

    s = ChrW(&H8765&)
    Debug.Print Asc(s), AscB(s), AscW(s), Hex(AscW(s))
    s = Chr(&H80)
    Debug.Print Asc(s), AscB(s), AscW(s), Hex(AscW(s))

    For i = 0 To 255
        s = Chr(i)
        ' Compare Chr() with ChrW(), and print where they differ
        If s <> ChrW(i) Then
            Debug.Print i, Hex(i), Asc(s), AscB(s), Hex(AscB(s)), Hex(AscW(s))
        End If
    Next
End Sub

Output:

 63   101  -30875  8765
 128  172   8364   20AC
 128  80   128  172  AC  20AC
 130  82   130  26   1A  201A
 131  83   131  146  92  192
 132  84   132  30   1E  201E
 133  85   133  38   26  2026
 134  86   134  32   20  2020
 135  87   135  33   21  2021
 136  88   136  198  C6  2C6
 137  89   137  48   30  2030
 138  8A   138  96   60  160
 139  8B   139  57   39  2039
 140  8C   140  82   52  152
 142  8E   142  125  7D  17D
 145  91   145  24   18  2018
 146  92   146  25   19  2019
 147  93   147  28   1C  201C
 148  94   148  29   1D  201D
 149  95   149  34   22  2022
 150  96   150  19   13  2013
 151  97   151  20   14  2014
 152  98   152  220  DC  2DC
 153  99   153  34   22  2122
 154  9A   154  97   61  161
 155  9B   155  58   3A  203A
 156  9C   156  83   53  153
 158  9E   158  126  7E  17E
 159  9F   159  120  78  178

As you can see, when you pass Chr() codes in the range 128-159 on an English-based system, the Unicode values reported by AscW() are not necessarily the same as the codes you passed in.
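The suggestion to compare AscW() values against the "Latin" ranges could be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's code. The function name IsLatinOnly is made up, and the cut-off at U+024F (the end of Latin Extended-B, so it covers Basic Latin, Latin-1 Supplement, Latin Extended-A and Latin Extended-B) is one possible choice among several.

```vb
Option Explicit

' Hypothetical helper: True when every UTF-16 code unit in s is at or
' below U+024F, i.e. inside Basic Latin through Latin Extended-B.
Public Function IsLatinOnly(ByVal s As String) As Boolean
    Dim i As Long
    Dim code As Long

    For i = 1 To Len(s)
        ' AscW returns a signed Integer; the mask maps it to 0..65535
        code = AscW(Mid$(s, i, 1)) And &HFFFF&
        If code > &H24F& Then Exit Function   ' result stays False
    Next
    IsLatinOnly = True
End Function
```

Note that a test like this also rejects characters that are common in English text, such as the curly quotes (U+2018, U+2019) and the euro sign (U+20AC) that appear in the output above, so the accepted ranges may need to be widened depending on the circumstance.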
From: Helmut Meukel on 13 Feb 2010 18:45

"CY" <christery(a)gmail.com> wrote in message news:8769612d-7558-468a-9336-f6de33d3efa3(a)o3g2000yqb.googlegroups.com...
> Good ideas, but I got an concern, if a file using ascii 32-128 is
> English bur in some country remapped for example [ ] and | as we do/
> did... then it gets a bit cumbersome again ;)
>
> That is if the file is interpreted by the PC:s Codepage.. 850 or 437
> (Or was this just in the good old days?)
>
> //CY

That was before extended ASCII, in the days of 7-bit ASCII, or when you were communicating with other computers and needed the parity bit to check for transmission errors.

In 7-bit ASCII some codes were used for national characters: the US characters | [ { ] } and some others could be replaced with specific characters of the national language. Mind, the same code values were used for German umlauts (ä, ö, ü, ...), Scandinavian characters (å, æ, ø, ...), French accents and so on. So you had to know the language of the text to get it right.

IBM's extended ASCII contained some of those national characters above 127, but not enough; most of the code values were used for graphical characters. This was the character set later known as Codepage 437. Codepage 850 contains fewer graphical characters and more national characters.

Start CharMap.exe and you can see the differences. "DOS: USA" is Codepage 437 and "DOS: Western Europe" is Codepage 850.

Helmut.
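As an alternative to CharMap, the difference between the two code pages can also be listed from VB. The sketch below is only an illustration. It assumes code pages 437 and 850 are installed on the machine, and the names ByteToUnicode and CompareCodePages are made up. It uses the MultiByteToWideChar API mentioned earlier in the thread, but with an explicit code page number instead of CP_ACP.

```vb
Option Explicit

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, _
    ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, _
    ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, _
    ByVal cchWideChar As Long) As Long

' Hypothetical helper: Unicode value of one byte under the given
' code page, or -1 if the conversion fails.
Private Function ByteToUnicode(ByVal b As Byte, ByVal CodePage As Long) As Long
    Dim src(0 To 0) As Byte
    Dim dst(0 To 1) As Integer

    ByteToUnicode = -1
    src(0) = b
    If MultiByteToWideChar(CodePage, 0, VarPtr(src(0)), 1, VarPtr(dst(0)), 2) > 0 Then
        ' Mask the signed Integer to get 0..65535
        ByteToUnicode = dst(0) And &HFFFF&
    End If
End Function

' List the bytes 128..255 that decode differently under CP437 and CP850.
Private Sub CompareCodePages()
    Dim i As Long
    Dim u437 As Long
    Dim u850 As Long

    For i = 128 To 255
        u437 = ByteToUnicode(i, 437)
        u850 = ByteToUnicode(i, 850)
        If u437 >= 0 And u850 >= 0 And u437 <> u850 Then
            Debug.Print i, Hex$(i), Hex$(u437), Hex$(u850)
        End If
    Next
End Sub
```

Each printed row is a byte value whose meaning depends on which of the two code pages the file is read with, which is exactly the situation CY was worried about.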
From: Jeff Johnson on 15 Feb 2010 09:16

"Jim Mack" <jmack(a)mdxi.nospam.com> wrote in message news:uYM5FLDrKHA.728(a)TK2MSFTNGP04.phx.gbl...
>>> Thanks. I basically have to examine the bit patterns to determine.
>>> I understand the ASCII, it is the Unicode I have some trouble
>>> with. I know it is 16 bits instead of 8. But in VB/debug window,
>>> I have never been able to see a 16-bit character, maybe it does
>>> not display on the screen. Do you know what I am talking about?
>>> For the character 'A', how can I see the full 16-bit pattern in
>>> VB ?
>>
>> I believe you can use the AscW() function to find this. If you get
>> a value back > 255, I'd say you can safely assume it's a
>> non-English character.
>
> Not even close.

I never said if it's 255 or less it's guaranteed to be an English character, I said if it's ABOVE 255 it's pretty much guaranteed to NOT be an English character. There is a difference.