Non English string ? [Visual Basic]


From: Rick Rothstein on 13 Feb 2010 00:11

>> Assuming any characters above ASCII 255 in a text string makes the
>> text non-English, then does something like this work (note that is
>> a space character after the exclamation point)?
>
> You have to distinguish AscW() results from Asc() results. If you use
> AscW(), you will see 'English' characters with codes > 255. Using
> Asc() you won't, but you may then qualify some non-English characters
> as English (which may be OK depending on the circumstance).
>
> I don't know if Like examines the Unicode characters... if so, then it
> will act the way AscW() does and fail some valid characters.

I have no idea about this stuff at all. Having only ever worked with US
regional settings, I know next to nothing about the international world of
VB... I only threw the idea out there in case it might work. I was hoping
someone knowledgeable about such things would test it out and see if it
could be used or not.

--
Rick (MVP - Excel)

From: CY on 13 Feb 2010 14:01

Good ideas, but I have a concern: if a file using ASCII 32-128 is
English but in some countries characters are remapped, for example [ ] and |,
as we do/did... then it gets a bit cumbersome again ;)

That is, if the file is interpreted by the PC's code page... 850 or 437
(or was this just in the good old days?)

//CY

From: Nobody on 13 Feb 2010 18:29

"Phil Hunt" <aaa(a)aaa.com> wrote in message
news:unthJABrKHA.1796(a)TK2MSFTNGP02.phx.gbl...
> What is the best way to determine if a string contains a "non-English"
> character?

I have not developed international applications, but I know more than those
who use one language only.

First, you need to treat a sequence of bytes as an encoded stream of
characters that must be decoded first. You can't assume that every byte is a
character or that every two bytes are one character, because of various
encoding schemes such as Multi Byte Character Sets (MBCS) and surrogate pairs
in UTF-16 (in which case 4 bytes represent one character). You also can't
assume that byte values in the range 0 to 127 are English only, although in
most cases they are. You have to know how the characters were encoded. For
example, some MBCS encodings use byte values in the range 33 to 126 as part
of multi-byte characters.

In UTF-32, however, each character is always 4 bytes, with a fixed
meaning.

In ANSI and Unicode: 0-127 have a fixed meaning, and it is the same in both.
In Unicode: 128-255 have a fixed meaning; they follow ISO/IEC 8859-1.
In ANSI: 128-255 have a meaning that depends on the code page (CP) in use. In
the US/Western Europe, Windows uses the "Windows-1252" code page (CP1252).
Characters in the range 160 to 255 are identical to Unicode, but most of the
range 128 to 159 is not. So even for English text it's not safe to assume
that characters in the range 128 to 255 are identical to Unicode.

In VB, strings are stored internally as UTF-16. However, the controls are
ANSI, and when you call API functions an ANSI copy of the string is created
(when using ByVal/ByRef As String) and copied back if you used ByRef As
String. To pass Unicode strings to API functions, you must use "ByVal
StrPtr(s)", and in most cases you have to use the W version of the function.
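
As a rough illustration of the StrPtr approach, here is a minimal sketch that
calls the W version of MessageBox from a form. The test string and caption
are just examples I made up, and whether both characters display properly
depends on the fonts installed on the system.

```vb
Option Explicit

' W (Unicode) version of MessageBox. The string parameters are declared
' As Long so that a pointer from StrPtr() can be passed unchanged.
Private Declare Function MessageBoxW Lib "user32" ( _
    ByVal hWnd As Long, _
    ByVal lpText As Long, _
    ByVal lpCaption As Long, _
    ByVal uType As Long) As Long

Private Sub Form_Load()
    Dim s As String

    ' Example text: U+20AC (euro sign) followed by U+0416 (Cyrillic Zhe)
    s = ChrW$(&H20AC) & ChrW$(&H416)

    ' StrPtr passes the address of the UTF-16 data, so VB does not
    ' create an ANSI copy of the string for the call.
    MessageBoxW Me.hWnd, StrPtr(s), StrPtr("Unicode test"), 0
End Sub
```

Declaring the parameters As String instead would make VB convert the text to
ANSI before the call, which is exactly what the W function does not expect.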

The main API functions used for converting between Unicode and non-Unicode
are WideCharToMultiByte/MultiByteToWideChar, typically with CP_ACP as the
code page argument, which means use the current system ANSI code page.

Also, the Chr() function in VB treats the number you provide as a character
code in the current system code page, and returns the matching Unicode
character. ChrW() doesn't do any transformation and is therefore faster. The
same applies to Asc/AscW. Asc() uses the current system code page, and
returns 63 ("?") if the character cannot be represented.

Some links:

http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
http://en.wikipedia.org/wiki/Windows-1252
http://en.wikipedia.org/wiki/Multi-byte_character_set

The links above are basically derived from the first link. To answer your
question, visit the "Latin characters in Unicode" link above, check the
ranges whose names start with "Latin", and compare them with the AscW() value
of each character.
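
A minimal sketch of that check, assuming that "English" means the Basic Latin
block only (U+0000 to U+007F). The function name IsBasicLatinOnly is made up
for this example; widen the upper bound if you also want to accept other
Latin blocks such as Latin-1 Supplement.

```vb
Option Explicit

' Hypothetical helper (not a built-in VB function): returns True when every
' character in s lies in the Basic Latin block (U+0000 to U+007F).
Public Function IsBasicLatinOnly(ByVal s As String) As Boolean
    Dim i As Long
    Dim code As Long

    For i = 1 To Len(s)
        ' AscW returns a signed Integer, so values above &H7FFF come back
        ' negative. Masking with &HFFFF& gives the unsigned 16-bit value.
        code = AscW(Mid$(s, i, 1)) And &HFFFF&
        If code > &H7F& Then Exit Function   ' result stays False
    Next i

    IsBasicLatinOnly = True
End Function
```

For example, IsBasicLatinOnly("Hello!") returns True, while
IsBasicLatinOnly("caf" & ChrW$(&HE9)) returns False because of the e-acute
(U+00E9). An empty string returns True since there is nothing to reject.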

Sample code to show how VB6+SP5 deals with characters in the range 128 to
159 in an English-US based OS(XP+SP2):

Option Explicit

Private Sub Form_Load()
    Dim i As Long
    Dim s As String

    ' A character that cannot be represented in the ANSI code page
    s = ChrW(&H8765&)
    Debug.Print Asc(s), AscB(s), AscW(s), Hex(AscW(s))
    ' ANSI code &H80, converted by Chr() through the system code page
    s = Chr(&H80)
    Debug.Print Asc(s), AscB(s), AscW(s), Hex(AscW(s))

    For i = 0 To 255
        s = Chr(i)
        ' Compare Chr() with ChrW(), and print where they differ
        If s <> ChrW(i) Then
            Debug.Print i, Hex(i), Asc(s), AscB(s), Hex(AscB(s)), Hex(AscW(s))
        End If
    Next

End Sub

Output:

63      101     -30875  8765
128     172     8364    20AC
128     80      128     172     AC      20AC
130     82      130     26      1A      201A
131     83      131     146     92      192
132     84      132     30      1E      201E
133     85      133     38      26      2026
134     86      134     32      20      2020
135     87      135     33      21      2021
136     88      136     198     C6      2C6
137     89      137     48      30      2030
138     8A      138     96      60      160
139     8B      139     57      39      2039
140     8C      140     82      52      152
142     8E      142     125     7D      17D
145     91      145     24      18      2018
146     92      146     25      19      2019
147     93      147     28      1C      201C
148     94      148     29      1D      201D
149     95      149     34      22      2022
150     96      150     19      13      2013
151     97      151     20      14      2014
152     98      152     220     DC      2DC
153     99      153     34      22      2122
154     9A      154     97      61      161
155     9B      155     58      3A      203A
156     9C      156     83      53      153
158     9E      158     126     7E      17E
159     9F      159     120     78      178

As you can see, when you give Chr() character codes in the range 128-159 on
an English based system, the resulting Unicode characters, as shown by AscW,
do not necessarily have the same value.

From: Helmut Meukel on 13 Feb 2010 18:45

"CY" <christery(a)gmail.com> wrote in message
news:8769612d-7558-468a-9336-f6de33d3efa3(a)o3g2000yqb.googlegroups.com...
> Good ideas, but I have a concern: if a file using ASCII 32-128 is
> English but in some countries characters are remapped, for example [ ] and |,
> as we do/did... then it gets a bit cumbersome again ;)
>
> That is, if the file is interpreted by the PC's code page... 850 or 437
> (or was this just in the good old days?)
>
> //CY

That was before extended ASCII, in the days of 7-bit ASCII, or
when communicating with other computers and you needed the
parity bit to check for transmission errors.
In 7-bit ASCII some codes were used for national characters:
the US characters | [ { ] } and some others could be replaced
with specific characters of the national language.
Mind, the same code values were used for German umlauts
(ä, ö, ü, ...), Scandinavian characters (æ, ø, å, ...), French
accents and so on. So you had to know the language
of the text to get it right.

IBM's extended ASCII contained some of those national
characters above 127, but not enough; most code values were
used for graphical characters. This was the character set
later known as Codepage 437. Codepage 850 contains fewer
graphical characters and more national characters.

Start CharMap.exe and you can see the differences.
"DOS: USA" is Codepage 437 and "DOS: Western Europe"
is Codepage 850.

Helmut.

From: Jeff Johnson on 15 Feb 2010 09:16

"Jim Mack" <jmack(a)mdxi.nospam.com> wrote in message
news:uYM5FLDrKHA.728(a)TK2MSFTNGP04.phx.gbl...

>>> Thanks. I basically have to examine the bit patterns to determine.
>>> I understand the ASCII, it is the Unicode I have some trouble
>>> with. I know it is 16 bits insteads of 8. But in VB/debug window,
>>> I have never been able to see a 16 bits character, maybe it does
>>> not display on the screen. Do you know what i am talking ?
>>> For the character 'A', how can I see the full 16 bits pattern in
>>> VB ?
>>
>> I believe you can use the AscW() function to find this. If you get
>> a value back > 255, I'd say you can safely assume it's a
>> non-English character.
>
> Not even close.

I never said if it's 255 or less it's guaranteed to be an English character,
I said if it's ABOVE 255 it's pretty much guaranteed to NOT be an English
character. There is a difference.
