From: MonkeeSage on 3 Dec 2007 19:48

On Dec 3, 1:47 pm, Greg Willits <li...(a)gregwillits.ws> wrote:
> Daniel DeLorme wrote:
> > Greg Willits wrote:
> >> But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
> >> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>
> >> I've boiled the experiments down to realizing I can't define a range
> >> with \x
>
> > Let me try to explain that in order to redeem myself from my previous
> > angry post.
>
> :-)
>
> > Basically, \xE4 is counted as the byte value 0xE4, not the unicode
> > character U+00E4. And in a range expression, each escaped value is taken
> > as one character within the range. Which results in not-immediately
> > obvious situations:
>
> > >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
> > => []
> > >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
> > => ["é"]
>
> OK, I see oniguruma docs refer to \x as encoded byte value and \x{} as a
> character code point -- which with your explanation I can finally tie
> together what that means.
>
> Took me a second to recognize the #{} as Ruby and not some new regex I'd
> never seen :-P
>
> And I realize now too I wasn't picking up on the use of octal vs
> decimal.
>
> Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?

Oniguruma is not in ruby 1.8 (though you can install it as a gem). It
is in 1.9.

> > What is happening in the first case is that the string does not contain
> > characters \303 or \251 because those are invalid utf8 sequences. But
> > when the value "\303\251" is *inlined* into the regex, that is
> > recognized as the utf8 character "é" and a match is found.
> > So ranges *do* work in utf8 but you have to be careful:
>
> > >> "àâäçèéêîïôü".scan(/[ä-î]/u)
> > => ["ä", "ç", "è", "é", "ê", "î"]
> > >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
> > => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
> > "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
> > "\264", "\303", "\274"]
> > >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
> > => ["ä", "ç", "è", "é", "ê", "î"]
>
> > Hope this helps.
>
> Yes!
>
> -- gw
> --
> Posted via http://www.ruby-forum.com/.
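[Editor's note: the range behavior Daniel describes is easy to check on a modern Ruby (1.9+), where strings carry their encoding and character-class ranges operate on whole characters rather than bytes. A minimal sketch, not 1.8 behavior:]

```ruby
# On Ruby 1.9+, a character-class range like [ä-î] matches whole
# multibyte characters directly; no byte escapes are needed.
s = "àâäçèéêîïôü"

# Each match is a complete character, never a lone UTF-8 byte.
in_range = s.scan(/[ä-î]/)

# Code-point escapes (\u{...}) express the same range without
# literal characters in the source: U+00E4 ("ä") to U+00EE ("î").
by_codepoint = s.scan(/[\u{E4}-\u{EE}]/)
```

Both calls return the same six characters, which matches the `/[ä-î]/u` result quoted above.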
From: Greg Willits on 3 Dec 2007 23:56

Jordan Callicoat wrote:
> On Dec 3, 1:47 pm, Greg Willits <li...(a)gregwillits.ws> wrote:
>> :-)
>>
>> Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?
> Oniguruma is not in ruby 1.8 (though you can install it as a gem). It
> is in 1.9.

Oh. I always thought Oniguruma was the engine in Ruby.

Anyway -- everyone, thanks for all the input. I believe I'm headed in
the right direction now, and have a better hands-on understanding of
UTF-8.

-- gw
--
Posted via http://www.ruby-forum.com/.
From: Daniel DeLorme on 4 Dec 2007 03:46

Charles Oliver Nutter wrote:
> Regular expressions for all character work would be a *terribly* slow
> way to get things done. If you want to get the nth character, should you
> do a match for n-1 characters and a group to grab the nth? Or would it
> be better if you could just index into the string and have it do the

Ok, I'm not very familiar with the internal workings of strings in 1.9,
but it seems to me that for character sets with variable byte size, it
is logically *impossible* to directly index into the string. Unless
there's some trick I'm unaware of, you *have* to count from the
beginning of the string for utf8 strings.

> right thing? How about if you want to iterate over all characters in a
> string? Should the iterating code have to know about the encoding?
> Should you use a regex to peel off one character at a time?

That is certainly one possible way of doing things...
string.scan(/./){ |char| do_something_with(char) }

> Regex for string access goes a long way, but it's just about the
> heaviest way to do it.

Heavy compared to what? Once compiled, regexes are orders of magnitude
faster than jumping in and out of ruby interpreted code.

> Strings should be aware of their encoding and should be
> able to provide you access to characters as easily as bytes. That's what
> 1.9 (and upcoming changes in JRuby) fixes.

Overall I agree that the encoding stuff in 1.9 is very nice.
Encapsulating the encoding with the string is very OO. Very intuitive.
No need to think about encoding anymore; now it "just works" for
encoding-ignorant programmers (at least until the abstraction leaks).
It lets us shut down one frequent complaint about ruby; a clear
political victory. Overall it is more robust and less error-prone than
the 1.8 way. But my point was that there *is* a 1.8 way. The thing that
riled me up and that I was responding to was the claim that 1.8 did not
have unicode support AT ALL.
Unequivocally, it does, and it works pretty well for me. IMHO there is
a certain minimalist elegance in considering strings as
encoding-agnostic and using regexes to get encoding-specific views. I
could do str[/./n] and str[/./u]; I can't do that anymore. 1.9 makes
encodings easier for the english-speaking masses not used to extended
characters, but let's remember that ruby *always* had support for
multibyte character sets; after all, it *did* originate from a country
with two gazillion "characters".

Daniel
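[Editor's note: the character-versus-byte distinction Charles and Daniel are debating is directly visible on a 1.9+ string. A small sketch of the encoding-aware API:]

```ruby
# On Ruby 1.9+, strings carry their encoding, so indexing and length
# are in characters; bytesize exposes the underlying byte count.
# (For variable-width encodings the interpreter still has to walk the
# bytes internally, as Daniel notes; the API just hides that.)
s = "aébvHö"         # 6 characters, 8 bytes in UTF-8

chars  = s.length    # character-aware length
bytes  = s.bytesize  # raw byte count
second = s[1]        # direct character indexing

# Character iteration without a regex:
listed = s.each_char.to_a
```

Here `s[1]` yields the full two-byte character "é", and `each_char` produces the same sequence a `scan(/./)` would.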
From: Daniel DeLorme on 4 Dec 2007 04:07

MonkeeSage wrote:
> Firstly, ruby doesn't have unicode support in 1.8, since unicode *IS*
> a standard mapping of bytes to *characters*. That's what unicode is.
> I'm sorry you don't like that, but don't lie and say ruby 1.8 supports
> unicode when it knows nothing about that standard mapping and treats
> everything as individual bytes (and any byte with a value greater than
> 126 just prints an octal escape)

Ok, then how do you explain this:

>> $KCODE='u'
=> "u"
>> "abc\303\244".scan(/./)
=> ["a", "b", "c", "ä"]

This doesn't require any libraries, and it seems to my eyes that ruby
is converting 5 bytes into 4 characters. It shows an awareness of utf8.
If that's not *some* kind of unicode support then please tell me what
it is. It seems we're disagreeing on some basic definition of what
"unicode support" means.

> Secondly, as I said in my first post to this thread, the characters
> trying to be matched are composite characters, which requires you to
> match both bytes. You can try using a unicode regexp, but then you
> run into the problem you mention--the regexp engine expects the
> pre-composed, one-byte form...
>
> "ò".scan(/[\303\262]/u) # => []
> "ò".scan(/[\xf2]/u) # => ["\303\262"]

Wow, I never knew that second one could work. Unicode support is
actually better than I thought! You learn something new every day.

> ...which is why I said it's more robust to use something like the
> regexp that Jimmy linked to and I reposted, instead of a unicode
> regexp.

I'm not sure what makes that huge regexp more robust than a simple
unicode regexp.

Daniel
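[Editor's note: the composed-versus-decomposed distinction behind the exchange above can be seen with String#unicode_normalize, available from Ruby 2.2. This is a sketch of the concept, not something 1.8 offers:]

```ruby
# "ò" can be a single precomposed code point (U+00F2) or the letter
# "o" followed by a combining grave accent (U+0300). The two render
# identically but are different character sequences until normalized.
decomposed = "o\u0300"                          # 2 characters
composed   = decomposed.unicode_normalize(:nfc) # 1 character, U+00F2

# NFD reverses the composition:
round_trip = composed.unicode_normalize(:nfd)
```

A byte-oriented regexp only sees whichever form the string happens to be in, which is one reason naive matching against "composite characters" fails.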
From: MonkeeSage on 4 Dec 2007 07:25
On Dec 4, 3:07 am, Daniel DeLorme <dan...(a)dan42.com> wrote:
> Ok, then how do you explain this:
> >> $KCODE='u'
> => "u"
> >> "abc\303\244".scan(/./)
> => ["a", "b", "c", "ä"]
>
> This doesn't require any libraries, and it seems to my eyes that ruby is
> converting 5 bytes into 4 characters. It shows an awareness of utf8. If
> that's not *some* kind of unicode support then please tell me what it
> is. It seems we're disagreeing on some basic definition of what
> "unicode support" means.

I guess we were talking about different things then. I never meant to
imply that the regexp engine can't match unicode characters (it's a
"dumb" implementation though; it basically only knows that bytes above
127 can have more bytes following and should be grouped together as
candidates for a match; that's slightly simplified, but basically
accurate). I, like Charles (and I think most people), was referring to
the ability to index into strings by characters, find their lengths in
characters, to compose and decompose composite characters, to
normalize characters, convert them to other encodings like shift-jis,
and other such things. Ruby 1.9 has started adding such support, while
ruby 1.8 lacks it. It can be hacked together with regular expressions
(e.g., the link Jimmy posted), or even as a real, compiled extension
[1], but merely saying that *you* the programmer can implement it
using ruby 1.8 is not the same thing as saying ruby 1.8 supports it
(just like I could build a python VM in ruby, but that doesn't mean
that the ruby interpreter runs python bytecode).

Anyhow, I guess it's just a difference of opinion. I don't mind being
wrong (happens a lot! ;) I just don't like being accused of spreading
FUD about ruby, which to my mind implies malice of forethought rather
than simple mistake.
[1] http://rubyforge.org/projects/char-encodings/
    http://git.bitwi.se/ruby-character-encodings.git/

> > Secondly, as I said in my first post to this thread, the characters
> > trying to be matched are composite characters, which requires you to
> > match both bytes. You can try using a unicode regexp, but then you
> > run into the problem you mention--the regexp engine expects the
> > pre-composed, one-byte form...
>
> > "ò".scan(/[\303\262]/u) # => []
> > "ò".scan(/[\xf2]/u) # => ["\303\262"]
>
> Wow, I never knew that second one could work. Unicode support is
> actually better than I thought! You learn something new every day.
>
> > ...which is why I said it's more robust to use something like the
> > regexp that Jimmy linked to and I reposted, instead of a unicode
> > regexp.
>
> I'm not sure what makes that huge regexp more robust than a simple
> unicode regexp.
>
> Daniel

Well, I won't claim that you can't get a unicode regexp to match the
same. And I only saw that large regexp when it was posted here, so I've
not tested it to any great length. Interestingly, 1.9 uses this regexp
(originally from jcode.rb in stdlib) to classify a string as containing
utf-8: '[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]'. My
thought was that without knowing all of the minute intricacies of
unicode and how ruby strings and regexps work with unicode values
(which I don't, and assume the OP doesn't), I think the huge regexp is
more likely to Just Work in more cases than a home-brewed unicode
regexp. But like I said, that's just an initial conclusion; I don't
claim it's absolutely correct.

Regards,
Jordan
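[Editor's note: the jcode.rb-style classifier Jordan quotes can be sketched as a byte-oriented (/n) regexp on modern Ruby. The version below adds ASCII and 4-byte alternatives and anchors the whole string; it is an illustration of the technique, not a strict validator (it still admits some overlong and surrogate encodings), and modern Ruby's built-in valid_encoding? makes it unnecessary in practice:]

```ruby
# Byte-level regexp in the spirit of jcode.rb's UTF-8 classifier,
# extended with ASCII and 4-byte sequences. /n makes it operate on
# raw bytes; /x allows the multi-line layout.
UTF8_LIKE = /\A(?:[\x00-\x7f]
               | [\xc2-\xdf][\x80-\xbf]
               | [\xe0-\xef][\x80-\xbf]{2}
               | [\xf0-\xf4][\x80-\xbf]{3})*\z/xn

ok  = UTF8_LIKE.match?("ò".b)     # well-formed two-byte sequence
bad = UTF8_LIKE.match?("\xc3".b)  # truncated lead byte, no continuation

# The modern, encoding-aware equivalent:
builtin = "ò".valid_encoding?
```

The `.b` calls matter: matching a /n regexp against raw (ASCII-8BIT) bytes sidesteps encoding-compatibility issues that arise against UTF-8 strings.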