Non English string ? [Visual Basic]

From: Jim Mack on 15 Feb 2010 11:09

Jeff Johnson wrote:
> "Jim Mack" <jmack(a)mdxi.nospam.com> wrote in message
> news:uYM5FLDrKHA.728(a)TK2MSFTNGP04.phx.gbl...
>
>>>> Thanks. I basically have to examine the bit patterns to
>>>> determine it. I understand ASCII; it is Unicode I have some
>>>> trouble with. I know it is 16 bits instead of 8. But in the
>>>> VB debug window, I have never been able to see a 16-bit
>>>> character; maybe it does not display on the screen. Do you know
>>>> what I am talking about? For the character 'A', how can I see the
>>>> full 16-bit pattern in VB?
>>>
>>> I believe you can use the AscW() function to find this. If you get
>>> a value back > 255, I'd say you can safely assume it's a
>>> non-English character.
>>
>> Not even close.
>
> I never said if it's 255 or less it's guaranteed to be an English
> character, I said if it's ABOVE 255 it's pretty much guaranteed to
> NOT be an English character. There is a difference.

And yet the first assertion is mostly correct and the second isn't.

Which the code snippet clearly shows: AscW() produces more than a
dozen results >255 for characters that would be considered English
since they're in the '1033' character set.

--
Jim Mack
Twisted tees at http://www.cafepress.com/2050inc
"We sew confusion"

From: Jeff Johnson on 15 Feb 2010 15:21

"Jim Mack" <jmack(a)mdxi.nospam.com> wrote in message
news:%23fj4TllrKHA.3944(a)TK2MSFTNGP06.phx.gbl...

>>>> I believe you can use the AscW() function to find this. If you get
>>>> a value back > 255, I'd say you can safely assume it's a
>>>> non-English character.
>>>
>>> Not even close.
>>
>> I never said if it's 255 or less it's guaranteed to be an English
>> character, I said if it's ABOVE 255 it's pretty much guaranteed to
>> NOT be an English character. There is a difference.
>
> And yet the first assertion is mostly correct and the second isn't.
>
> Which the code snippet clearly shows: AscW() produces more than a
> dozen results >255 for characters that would be considered English
> since they're in the '1033' character set.

Ahhhh, I see where you're going with this. And it makes me realize I was
unclear (or rather, I was making an assumption based on what I thought the
poster wanted). I interpreted "non-English CHARACTER" to mean "non-English
LETTER." Almost everything in the range of your example code was not a
letter but rather some form of punctuation, and I don't consider punctuation
to be language-specific.

From: Jeff Johnson on 15 Feb 2010 15:23

"Jim Mack" <jmack(a)mdxi.nospam.com> wrote in message
news:%23fj4TllrKHA.3944(a)TK2MSFTNGP06.phx.gbl...

> Which the code snippet clearly shows: AscW() produces more than a
> dozen results >255 for characters that would be considered English
> since they're in the '1033' character set.

[Sent the previous reply too soon.]

Your comment about how they would be considered English since they're in the
1033 code page (or whatever that is) is actually the crux of my first reply.
DOES the poster actually consider all of those to be English?

From: Phil Hunt on 15 Feb 2010 16:52

Thanks to all who replied. After reading all the posts, I think I should
stick with the "0 - 127 is English" assumption. It is safer in the context
of the issue I have.
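
A minimal sketch of that 0 - 127 check in VB6, assuming the goal is simply to
flag any character outside 7-bit ASCII. The function name IsPlainAscii is
hypothetical and does not come from the thread. AscW() returns a signed
Integer, so the value is masked to an unsigned 16-bit code before comparing.

```vb
Option Explicit

' Hypothetical helper (not from the thread): returns True only if every
' character in s has a code in the range 0 to 127.
Public Function IsPlainAscii(ByVal s As String) As Boolean
    Dim i As Long
    Dim code As Long

    For i = 1 To Len(s)
        ' AscW returns a signed Integer (e.g. -30875 for &H8765),
        ' so mask it to get the unsigned 16-bit value.
        code = AscW(Mid$(s, i, 1)) And &HFFFF&
        If code > 127 Then Exit Function   ' result stays False
    Next

    IsPlainAscii = True
End Function
```

For example, IsPlainAscii("Hello") returns True, while a string containing
ChrW(&H2019) (a curly apostrophe) returns False. An empty string returns True.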

"Nobody" <nobody(a)nobody.com> wrote in message
news:%2360qfRQrKHA.4492(a)TK2MSFTNGP05.phx.gbl...
> "Phil Hunt" <aaa(a)aaa.com> wrote in message
> news:unthJABrKHA.1796(a)TK2MSFTNGP02.phx.gbl...
>> What is the best way to determine if a string contains a "non-English"
>> character?
>
> I have not developed international applications, but I know more than
> those who use one language only.
>
> First, you need to treat a sequence of bytes as an encoded stream of
> characters that must be decoded first. You can't assume that every byte is
> a character, or that every two bytes are one character, because of various
> encoding schemes such as Multi-Byte Character Sets (MBCS) and surrogate
> pairs in UTF-16 (in which case 4 bytes represent one character). You also
> can't assume that byte values in the range 0 to 127 are English only,
> although in most cases they are. You have to know how the characters were
> encoded. For example, some MBCS encodings use byte values in the range 33
> to 126 as part of multi-byte characters.
>
> In UTF-32, however, each character is always 4 bytes, with a fixed
> meaning.
>
> In ANSI and Unicode: 0-127 have a fixed meaning and are one and the same.
> In Unicode: 128-255 have a fixed meaning; they follow ISO/IEC 8859-1.
> In ANSI: the meaning of 128-255 depends on which code page (CP) is in use.
> In the US and Western Europe, Windows uses the "Windows-1252" code page
> (CP1252). Characters in the range 160 to 255 are identical to Unicode, but
> most of the range 128 to 159 is not. So it's not safe to assume that, on
> an English system, characters in the range 128 to 255 are identical to
> Unicode.
>
> In VB, strings are stored internally as UTF-16. However, the controls
> are ANSI, and when you call API functions an ANSI version of the string is
> created (when using ByVal/ByRef As String) and copied back if you used
> ByRef As String. To pass Unicode strings to API functions, you must use
> "ByVal StrPtr(s)" and in most cases you have to use the W version of the
> function.
>
> The main API functions used for converting between Unicode and non-Unicode
> are WideCharToMultiByte/MultiByteToWideChar, typically with the CP_ACP
> code page value, which means "use the current ANSI code page".
>
> Also, the Chr() function in VB treats the number you provide as a
> character code in the current system code page and returns the
> corresponding Unicode character, while ChrW() doesn't do any
> transformation and is therefore faster. The same applies to Asc/AscW:
> Asc() uses the current system code page and returns 63 ("?") if the
> character cannot be represented.
>
> Some links:
>
> http://en.wikipedia.org/wiki/Unicode
> http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
> http://en.wikipedia.org/wiki/ISO/IEC_8859-1
> http://en.wikipedia.org/wiki/Windows-1252
> http://en.wikipedia.org/wiki/Multi-byte_character_set
>
> The links above are basically derived from the first link. To answer your
> question, visit the "Latin characters in Unicode" link above, check the
> ranges whose names start with "Latin", and compare them with the AscW()
> value.
>
> Sample code to show how VB6+SP5 deals with characters in the range 128 to
> 159 in an English-US based OS(XP+SP2):
>
> Option Explicit
>
> Private Sub Form_Load()
>     Dim i As Long
>     Dim s As String
>
>     ' A character outside CP1252: Asc() falls back to 63 ("?")
>     s = ChrW(&H8765&)
>     Debug.Print Asc(s), AscB(s), AscW(s), Hex(AscW(s))
>     ' Byte &H80 in CP1252 maps to U+20AC (euro sign)
>     s = Chr(&H80)
>     Debug.Print Asc(s), AscB(s), AscW(s), Hex(AscW(s))
>
>     For i = 0 To 255
>         s = Chr(i)
>         ' Compare Chr() with ChrW(), and print where they differ
>         If s <> ChrW(i) Then
>             Debug.Print i, Hex(i), Asc(s), AscB(s), Hex(AscB(s)), _
>                 Hex(AscW(s))
>         End If
>     Next
>
> End Sub
>
>
> Output:
>
> 63      101     -30875  8765
> 128     172     8364    20AC
> 128     80      128     172     AC      20AC
> 130     82      130     26      1A      201A
> 131     83      131     146     92      192
> 132     84      132     30      1E      201E
> 133     85      133     38      26      2026
> 134     86      134     32      20      2020
> 135     87      135     33      21      2021
> 136     88      136     198     C6      2C6
> 137     89      137     48      30      2030
> 138     8A      138     96      60      160
> 139     8B      139     57      39      2039
> 140     8C      140     82      52      152
> 142     8E      142     125     7D      17D
> 145     91      145     24      18      2018
> 146     92      146     25      19      2019
> 147     93      147     28      1C      201C
> 148     94      148     29      1D      201D
> 149     95      149     34      22      2022
> 150     96      150     19      13      2013
> 151     97      151     20      14      2014
> 152     98      152     220     DC      2DC
> 153     99      153     34      22      2122
> 154     9A      154     97      61      161
> 155     9B      155     58      3A      203A
> 156     9C      156     83      53      153
> 158     9E      158     126     7E      17E
> 159     9F      159     120     78      178
>
> As you can see, when you pass Chr() a value in the range 128-159 on an
> English-based system, the Unicode code point reported by AscW() is not
> necessarily the same value.
>
>

From: Jim Mack on 15 Feb 2010 16:54

Jeff Johnson wrote:
> "Jim Mack" wrote...
>
>> Which the code snippet clearly shows: AscW() produces more than a
>> dozen results >255 for characters that would be considered English
>> since they're in the '1033' character set.
>
> [Sent the previous reply too soon.]
>
> Your comment about how they would be considered English since
> they're in the 1033 code page (or whatever that is) is actually the
> crux of my first reply. DOES the poster actually consider all of
> those to be English?

Maybe he'll see and respond. It's a question about what he wants to
classify and why, but in fact if you examine the normal output of many
modern text editing / word-processing programs, you will very likely
find characters that fall in the range I called out.

If you passed such text through the AscW() test, it would fail. That's
what I was pointing out.
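
A short illustration of that point, as a sketch only: the string below
contains the curly apostrophe U+2019 that many word processors substitute
for a plain "'". The procedure name DemoCurlyQuote is made up for this
example.

```vb
Option Explicit

' Hypothetical demo procedure (name is not from the thread).
Private Sub DemoCurlyQuote()
    Dim s As String
    Dim c As String

    ' Build the word "don't" using a right single quotation mark (U+2019)
    ' in place of the plain apostrophe.
    s = "don" & ChrW(&H2019) & "t"
    c = Mid$(s, 4, 1)

    ' AscW reports the Unicode code point, which is above 255.
    Debug.Print AscW(c), Hex(AscW(c))   ' prints 8217 and 2019

    ' Asc converts through the system code page instead, so its result
    ' depends on which code page is in use.
    Debug.Print Asc(c)
End Sub
```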

--
Jim Mack
Twisted tees at http://www.cafepress.com/2050inc
"We sew confusion"
