Unicode in Regex [Ruby]

From: Greg Willits on 2 Dec 2007 15:35

Greg Willits wrote:

> I'm expecting a validate_format_of with a regex like this
> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
> to allow many of the normal characters like ö é å to be submitted via
> web form. However, the extended characters are being rejected.

So, I've been pounding the web for info on UTF8 in Ruby and Rails the
past couple days to concoct some validations that allow UTF8
characters. I have discovered that I can get a little further by doing
the following:
- declaring $KCODE = 'UTF8'
- adding /u to regex expressions.

The only thing not working now is the ability to define a range of \x
characters in a regex.

So, this /^[a-zA-Z\xE4]*?$/u will validate that a string is allowed
to have an ä in it. Perfect.

But... this fails /^[a-zA-Z\xE4-\xE6]*?$/u

But... this works /^[a-zA-Z\xE4\xE5\xE6]*?$/u

I've boiled the experiments down to realizing I can't define a range
with \x

Is this just one of those things that just doesn't work yet WRT Ruby/
Rails/UTF8, or is there another syntax? I've scoured all the regex
docs I can find, and they seem to indicate a range should work.

For now, I just have all the characters I want included < \xFF listed
individually.

utf_accents = '\xC0\xC1\xC2\.......'

Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u

But I'd like to solve the range notation if I can.

--
def gw
acts_as_n00b
writes_at(www.railsdev.ws)
end
--
Posted via http://www.ruby-forum.com/.

From: Daniel DeLorme on 2 Dec 2007 20:18

MonkeeSage wrote:
> Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).

It enrages me to see this kind of FUD. Through regular expressions, ruby
1.8 has 80-90% complete utf8 support. And oniguruma makes utf8 support
well-nigh 100% complete.

>> 'aébvHögtåwHÅFuG'.scan(/./u)
=> ["a", "é", "b", "v", "H", "ö", "g", "t", "å", "w", "H", "Å", "F",
"u", "G"]

>> 'aébvHögtåwHÅFuG'.scan(/[éöåÅ]/u)
=> ["é", "ö", "å", "Å"]

Ok, sometimes you have to take a weird approach because of the missing
10-20%, but it's still workable
>> 'aébvHögtåwHÅFuG'.scan(/(?:\303\251|\303\266|\303\245|\303\205)/u)
=> ["é", "ö", "å", "Å"]

> Everything in ruby is a bytestring.

YES! And that's exactly how it should be. Who is it that spread the
flawed idea that strings are fundamentally made of characters? I'd like
to slap him around a little. Fundamentally, ever since the word "string"
was applied to computing, strings were made of 8-BIT CHARS, not n-bit
characters. If only the creators of C had called that datatype "byte"
instead of "char" it would have saved us so many misunderstandings.

Usually the complaint about the lack of unicode support is that
something like "日本語".length returns 9 instead of 3, or that
"日本語".index("語") returns 6 instead of 2. It's nice that people want to
completely redefine the API to return character positions and all that,
but please don't complain that it's broken just because you happen to be
using it incorrectly. Use the right tool for the job. SQL for database
queries, non-home-brewed crypto libraries for security, regular
expressions for string manipulation.
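
To make those numbers concrete, here is a minimal sketch of counting and locating characters with nothing but regular expressions. It assumes Ruby 2.0 or later, where String#b returns a binary copy that behaves like a 1.8 byte string; the helper names char_count and char_index are made up for illustration and are not from any library.

```ruby
# Count characters by scanning with a UTF-8 regex, as in the scan examples above.
def char_count(str)
  str.scan(/./mu).size
end

# Character offset of the first occurrence of needle, or nil if it is absent.
def char_index(str, needle)
  match = Regexp.new(Regexp.escape(needle)).match(str)
  match && char_count(match.pre_match)
end

text = "日本語"

p text.b.length           # byte view: 9
p text.b.index("語".b)    # byte offset: 6
p char_count(text)        # 3
p char_index(text, "語")  # 2
```

The byte view and the character view are both available; which one is "correct" depends on the question being asked.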

I'm terribly sorry for the rant but I had to get it off my chest.

Dan

From: Daniel DeLorme on 2 Dec 2007 20:40

Greg Willits wrote:
> Greg Willits wrote:
>
>> I'm expecting a validate_format_of with a regex like this
>> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
>> to allow many of the normal characters like ö é å to be submitted via
>> web form. However, the extended characters are being rejected.
>
>
> So, I've been pounding the web for info on UTF8 in Ruby and Rails the
> past couple days to concoct some validations that allow UTF8
> characters. I have discovered that I can get a little further by doing
> the following:
> - declaring $KCODE = 'UTF8'
> - adding /u to regex expressions.
>
> The only thing not working now is the ability to define a range of \x
> characters in a regex.
>
> So, this /^[a-zA-Z\xE4]*?$/u will validate that a string is allowed
> to have an ä in it. Perfect.
>
> But... this fails /^[a-zA-Z\xE4-\xE6]*?$/u
>
> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?$/u
>
> I've boiled the experiments down to realizing I can't define a range
> with \x
>
> Is this just one of those things that just doesn't work yet WRT Ruby/
> Rails/UTF8, or is there another syntax? I've scoured all the regex
> docs I can find, and they seem to indicate a range should work.

Let me try to explain that in order to redeem myself from my previous
angry post.

Basically, \xE4 is counted as the byte value 0xE4, not the unicode
character U+00E4. And in a range expression, each escaped value is taken
as one character within the range. Which results in not-immediately
obvious situations:

>> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
=> []
>> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

What is happening in the first case is that the string does not contain
characters \303 or \251 because those are invalid utf8 sequences. But
when the value "\303\251" is *inlined* into the regex, that is
recognized as the utf8 character "é" and a match is found.
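
The difference between a byte value and a code point can be checked directly with String#unpack and Array#pack, which exist in 1.8 as well as in current Ruby. A small sketch (the method names byte_values and code_points are illustrative only):

```ruby
# Raw byte values of a string.
def byte_values(str)
  str.unpack("C*")
end

# Unicode code points, decoding the bytes as UTF-8.
def code_points(str)
  str.unpack("U*")
end

e_acute = "\303\251"           # the two bytes that encode é in UTF-8

p byte_values(e_acute)         # [195, 169], i.e. 0xC3 0xA9
p code_points(e_acute)         # [233], i.e. U+00E9
p [0xE9].pack("U") == e_acute  # true
```

So \303 and \251 on their own are bytes, while the pair taken together is the single character U+00E9.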

So ranges *do* work in utf8 but you have to be careful:

>> "àâäçèéêîïôü".scan(/[ä-î]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
>> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
"\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
"\264", "\303", "\274"]
>> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
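
The inlining trick above can be wrapped in a small helper. This is only a sketch: scan_range is a hypothetical name, and the second form assumes Ruby 1.9 or later, where the \u escape names a code point directly (this thread is about 1.8, which has no such escape).

```ruby
# Build the character class from real characters, so the range endpoints
# are whole UTF-8 characters rather than individual bytes.
def scan_range(str, first, last)
  range = "#{Regexp.escape(first)}-#{Regexp.escape(last)}"
  str.scan(Regexp.new("[#{range}]"))
end

# Same range written with code point escapes (Ruby 1.9+ only).
def scan_a_umlaut_to_i_circumflex(str)
  str.scan(/[\u00E4-\u00EE]/)
end

p scan_range("àâäçèéêîïôü", "ä", "î")
# => ["ä", "ç", "è", "é", "ê", "î"]
p scan_a_umlaut_to_i_circumflex("àâäçèéêîïôü")
# => ["ä", "ç", "è", "é", "ê", "î"]
```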

Hope this helps.

Dan

From: MonkeeSage on 2 Dec 2007 20:46

On Dec 2, 2:35 pm, Greg Willits <li...(a)gregwillits.ws> wrote:
> Greg Willits wrote:
> > I'm expecting a validate_format_of with a regex like this
> > /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
> > to allow many of the normal characters like ö é å to be submitted via
> > web form. However, the extended characters are being rejected.
>
> So, I've been pounding the web for info on UTF8 in Ruby and Rails the
> past couple days to concoct some validations that allow UTF8
> characters. I have discovered that I can get a little further by doing
> the following:
> - declaring $KCODE = 'UTF8'
> - adding /u to regex expressions.
>
> The only thing not working now is the ability to define a range of \x
> characters in a regex.
>
> So, this /^[a-zA-Z\xE4]*?$/u will validate that a string is allowed
> to have an ä in it. Perfect.
>
> But... this fails /^[a-zA-Z\xE4-\xE6]*?$/u
>
> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?$/u
>
> I've boiled the experiments down to realizing I can't define a range
> with \x
>
> Is this just one of those things that just doesn't work yet WRT Ruby/
> Rails/UTF8, or is there another syntax? I've scoured all the regex
> docs I can find, and they seem to indicate a range should work.
>
> For now, I just have all the characters I want included < \xFF listed
> individually.
>
> utf_accents = '\xC0\xC1\xC2\.......'
>
> Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u
>
> But I'd like to solve the range notation if I can.

This seems to work...

$KCODE = "UTF8"
p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?/u =~ "Jäsp...it works"
# => 0
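
One caveat about the test above: the pattern has no end anchor and its *? quantifier is lazy, so it can match the empty string at offset 0, and =~ returns 0 for any input at all. Below is a sketch of a stricter check. It assumes Ruby 1.9 or later, where \u escapes are available and $KCODE is not needed; valid_person_name? is an illustrative name, and the ranges are simply the ones from the original question rewritten as code points.

```ruby
# The unanchored, lazy pattern matches the empty string at offset 0,
# so it "accepts" input it was never meant to allow.
p(/^[a-zA-Z]*?/ =~ "1234")  # 0

# Letters a-z and A-Z, the accented letters U+00C0-U+00D6, U+00D9-U+00F6
# and U+00F9-U+00FF, plus . ' - and space, anchored at both ends.
def valid_person_name?(name)
  pattern = /\A[a-zA-Z\u00C0-\u00D6\u00D9-\u00F6\u00F9-\u00FF.'\- ]*\z/
  !!(pattern =~ name)
end

p valid_person_name?("Jäsp...it works")  # true
p valid_person_name?("J4sp")             # false
```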

However, it looks to me like it would be more robust to use a slightly
modified version of UTF8REGEX (found in the link Jimmy posted
above)...

UTF8REGEX = /\A(?:
    [a-zA-Z\.\-\'\ ]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/mnx

p UTF8REGEX =~ "Jäsp...it works here too"
# => 0

Look at the link to see the explanation of the alternations.
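
For readers on a newer Ruby, here is a sketch of the same byte-level check. It assumes Ruby 2.0 or later, where a /n pattern containing \x escapes has to be matched against a binary string (obtained here with String#b). The method name utf8_name? is made up, and the comments on each branch are my reading of the alternations, not a quote from the linked page.

```ruby
# True if str consists only of the allowed ASCII characters and
# well-formed multi-byte UTF-8 sequences.
def utf8_name?(str)
  pattern = /\A(?:
      [a-zA-Z.\-'\ ]                     # allowed ASCII characters
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlong forms
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general case
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, planes 1 to 3
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4 to 15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, plane 16
  )*\z/nx
  !!(pattern =~ str.b)
end

p utf8_name?("Jäsp...it works here too")  # true
p utf8_name?("J\xE4sp")                   # false: a lone ISO-8859-1 byte
p utf8_name?("J4sp")                      # false: digits are not allowed
```

Note that this accepts any well-formed non-ASCII character, not just accented Latin letters. If only well-formedness matters, String#valid_encoding? (Ruby 1.9+) answers that without a hand-written pattern.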

Regards,
Jordan

From: Daniel DeLorme on 2 Dec 2007 20:55

Greg Willits wrote:
> So, this /^[a-zA-Z\xE4]*?$/u will validate that a string is allowed
> to have an ä in it. Perfect.

If that actually works, it means you are really using ISO-8859-1
strings, not UTF-8.

> utf_accents = '\xC0\xC1\xC2\.......'

Nope, that's not UTF-8. UTF-8 characters ÀÁÂ would look like
utf_accents = "\xC3\x80\xC3\x81\xC3\x82..."
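
A quick way to see the two encodings side by side, as a sketch assuming Ruby 2.0 or later (String#b, force_encoding and encode do not exist in 1.8); latin1_to_utf8 is an illustrative helper name:

```ruby
# Re-encode a string whose bytes are ISO-8859-1 into UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

latin1 = "\xC0\xC1\xC2".b    # ÀÁÂ as ISO-8859-1 bytes
utf8   = latin1_to_utf8(latin1)

# The single bytes are not a valid UTF-8 sequence...
p latin1.dup.force_encoding("UTF-8").valid_encoding?  # false

# ...and the UTF-8 form of the same three letters takes two bytes each.
p utf8.unpack("C*").map { |b| "%02X" % b }.join(" ")  # "C3 80 C3 81 C3 82"
```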

Dan
