From: Greg Willits on 2 Dec 2007 15:35

Greg Willits wrote:
> I'm expecting a validate_format_of with a regex like this
> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
> to allow many of the normal characters like ö é å to be submitted via
> web form. However, the extended characters are being rejected.

So, I've been pounding the web for info on UTF8 in Ruby and Rails the past couple days to concoct some validations that allow UTF8 characters. I have discovered that I can get a little further by doing the following:

- declaring $KCODE = 'UTF8'
- adding /u to regex expressions.

The only thing not working now is the ability to define a range of \x characters in a regex.

So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed to have an ä in it. Perfect.

But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u

But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u

I've boiled the experiments down to realizing I can't define a range with \x

Is this just one of those things that just doesn't work yet WRT Ruby/Rails/UTF8, or is there another syntax? I've scoured all the regex docs I can find, and they seem to indicate a range should work.

For now, I just have all the characters I want included < \xFF listed individually.

utf_accents = '\xC0\xC1\xC2\.......'

Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u

But I'd like to solve the range notation if I can.

--
def gw
  acts_as_n00b
  writes_at(www.railsdev.ws)
end
--
Posted via http://www.ruby-forum.com/.
From: Daniel DeLorme on 2 Dec 2007 20:18

MonkeeSage wrote:
> Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).

It enrages me to see this kind of FUD. Through regular expressions, ruby 1.8 has 80-90% complete utf8 support. And oniguruma makes utf8 support well-nigh 100% complete.

>> 'aébvHögtåwHÅFuG'.scan(/./u)
=> ["a", "é", "b", "v", "H", "ö", "g", "t", "å", "w", "H", "Å", "F", "u", "G"]
>> 'aébvHögtåwHÅFuG'.scan(/[éöåÅ]/u)
=> ["é", "ö", "å", "Å"]

Ok, sometimes you have to take a weird approach because of the missing 10-20%, but it's still workable

>> 'aébvHögtåwHÅFuG'.scan(/(?:\303\251|\303\266|\303\245|\303\205)/u)
=> ["é", "ö", "å", "Å"]

> Everything in ruby is a bytestring.

YES! And that's exactly how it should be. Who is it that spread the flawed idea that strings are fundamentally made of characters? I'd like to slap him around a little. Fundamentally, ever since the word "string" was applied to computing, strings were made of 8-BIT CHARS, not n-bit characters. If only the creators of C had called that datatype "byte" instead of "char" it would have saved us so many misunderstandings.

Usually the complaint about the lack of unicode support is that something like "日本語".length returns 9 instead of 3, or that "日本語".index("語") returns 6 instead of 2. It's nice that people want to completely redefine the API to return character positions and all that, but please don't complain that it's broken just because you happen to be using it incorrectly. Use the right tool for the job. SQL for database queries, non-home-brewed crypto libraries for security, regular expressions for string manipulation.

I'm terribly sorry for the rant but I had to get it off my chest.

Dan
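
The 9-versus-3 and 6-versus-2 figures quoted above can be checked without relying on how String#length behaves in a given Ruby version. This is a minimal sketch, assuming only the core Array#pack and String#unpack methods; the variable names are illustrative, and the string is built from code points so the result does not depend on the source file's encoding.

```ruby
# "日本語" built from its three code points and encoded as UTF-8.
s = [0x65E5, 0x672C, 0x8A9E].pack("U*")

byte_count = s.unpack("C*").length   # one array entry per byte
char_count = s.unpack("U*").length   # one array entry per code point

# Byte offset of the third character: re-encode the first two code
# points and count the bytes they occupy.
third_char_offset = s.unpack("U*")[0, 2].pack("U*").unpack("C*").length

puts byte_count         # 9
puts char_count         # 3
puts third_char_offset  # 6
```

Each of these three characters takes three bytes in UTF-8, which is where the byte-based numbers come from.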
From: Daniel DeLorme on 2 Dec 2007 20:40

Greg Willits wrote:
> Greg Willits wrote:
>
>> I'm expecting a validate_format_of with a regex like this
>> /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
>> to allow many of the normal characters like ö é å to be submitted via
>> web form. However, the extended characters are being rejected.
>
> So, I've been pounding the web for info on UTF8 in Ruby and Rails the
> past couple days to concoct some validations that allow UTF8
> characters. I have discovered that I can get a little further by doing
> the following:
> - declaring $KCODE = 'UTF8'
> - adding /u to regex expressions.
>
> The only thing not working now is the ability to define a range of \x
> characters in a regex.
>
> So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
> to have an ä in it. Perfect.
>
> But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
>
> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>
> I've boiled the experiments down to realizing I can't define a range
> with \x
>
> Is this just one of those things that just doesn't work yet WRT Ruby/
> Rails/UTF8, or is there another syntax? I've scoured all the regex
> docs I can find, and they seem to indicate a range should work.

Let me try to explain that in order to redeem myself from my previous angry post.

Basically, \xE4 is counted as the byte value 0xE4, not the unicode character U+00E4. And in a range expression, each escaped value is taken as one character within the range. Which results in not-immediately-obvious situations:

>> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
=> []
>> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

What is happening in the first case is that the string does not contain characters \303 or \251 because those are invalid utf8 sequences. But when the value "\303\251" is *inlined* into the regex, that is recognized as the utf8 character "é" and a match is found.
So ranges *do* work in utf8 but you have to be careful:

>> "àâäçèéêîïôü".scan(/[ä-î]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
>> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250", "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303", "\264", "\303", "\274"]
>> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]

Hope this helps.

Dan
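
The difference between the byte 0xE4 and the character U+00E4 can be made visible by dumping the bytes of each. This is a small sketch using only core Array#pack and String#unpack; the variable names are illustrative.

```ruby
# The single byte 0xE4, which is how ISO-8859-1 stores "ä".
latin1_bytes = [0xE4].pack("C").unpack("C*")

# The code point U+00E4 encoded as UTF-8, which takes two bytes.
utf8_bytes = [0xE4].pack("U").unpack("C*")

# The same two bytes written as octal escapes, as in the irb sessions above.
utf8_octal = utf8_bytes.map { |b| "\\%o" % b }.join

puts latin1_bytes.inspect  # [228]
puts utf8_bytes.inspect    # [195, 164]
puts utf8_octal            # \303\244
```

The octal form is the lower end of the [\303\244-\303\256] range shown in the post above.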
From: MonkeeSage on 2 Dec 2007 20:46

On Dec 2, 2:35 pm, Greg Willits <li...(a)gregwillits.ws> wrote:
> Greg Willits wrote:
> > I'm expecting a validate_format_of with a regex like this
> > /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
> > to allow many of the normal characters like ö é å to be submitted via
> > web form. However, the extended characters are being rejected.
>
> So, I've been pounding the web for info on UTF8 in Ruby and Rails the
> past couple days to concoct some validations that allow UTF8
> characters. I have discovered that I can get a little further by doing
> the following:
> - declaring $KCODE = 'UTF8'
> - adding /u to regex expressions.
>
> The only thing not working now is the ability to define a range of \x
> characters in a regex.
>
> So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
> to have an ä in it. Perfect.
>
> But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
>
> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>
> I've boiled the experiments down to realizing I can't define a range
> with \x
>
> Is this just one of those things that just doesn't work yet WRT Ruby/
> Rails/UTF8, or is there another syntax? I've scoured all the regex
> docs I can find, and they seem to indicate a range should work.
>
> For now, I just have all the characters I want included < \xFF listed
> individually.
>
> utf_accents = '\xC0\xC1\xC2\.......'
>
> Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u
>
> But I'd like to solve the range notation if I can.
>
> --
> def gw
> acts_as_n00b
> writes_at(www.railsdev.ws)
> end
> --
> Posted via http://www.ruby-forum.com/.

This seems to work...

$KCODE = "UTF8"
p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?/u =~ "Jäsp...it works"
# => 0

However, it looks to me like it would be more robust to use a slightly modified version of UTF8REGEX (found in the link Jimmy posted above)...
UTF8REGEX = /\A(?:
    [a-zA-Z\.\-\'\ ]
  | [\xC2-\xDF][\x80-\xBF]
  | \xE0[\xA0-\xBF][\x80-\xBF]
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
  | \xED[\x80-\x9F][\x80-\xBF]
  | \xF0[\x90-\xBF][\x80-\xBF]{2}
  | [\xF1-\xF3][\x80-\xBF]{3}
  | \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/mnx

p UTF8REGEX =~ "Jäsp...it works here too"
# => 0

Look at the link to see the explanation of the alternations.

Regards,
Jordan
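
For readers without the link, the alternations can be read as a table keyed on the lead byte: each lead byte fixes the allowed range of the second byte and the number of continuation bytes that follow. Below is a rough sketch of the same table as a byte-walking check. The method name well_formed_utf8? is hypothetical, and unlike UTF8REGEX it accepts every ASCII byte rather than only letters, dot, hyphen, apostrophe and space.

```ruby
# Hypothetical helper: true if the array of byte values is well-formed UTF-8.
def well_formed_utf8?(bytes)
  i = 0
  while i < bytes.length
    lead = bytes[i]
    i += 1
    next if lead <= 0x7F   # plain ASCII, single byte

    # For each lead byte: allowed range of the second byte, and how many
    # ordinary continuation bytes (0x80..0xBF) follow that second byte.
    second, extra = case lead
      when 0xC2..0xDF then [0x80..0xBF, 0]
      when 0xE0       then [0xA0..0xBF, 1]
      when 0xED       then [0x80..0x9F, 1]
      when 0xE1..0xEF then [0x80..0xBF, 1]   # 0xED was handled above
      when 0xF0       then [0x90..0xBF, 2]
      when 0xF1..0xF3 then [0x80..0xBF, 2]
      when 0xF4       then [0x80..0x8F, 2]
      else return false                       # never a valid lead byte
    end

    tail = bytes[i, 1 + extra]
    return false if tail.nil? || tail.length < 1 + extra   # truncated
    return false unless second.include?(tail[0])
    return false unless tail[1..-1].all? { |b| (0x80..0xBF).include?(b) }
    i += 1 + extra
  end
  true
end

puts well_formed_utf8?([0x4A, 0xC3, 0xA4, 0x73, 0x70])  # true  ("Jäsp" in UTF-8)
puts well_formed_utf8?([0x4A, 0xE4, 0x73, 0x70])        # false ("Jäsp" in ISO-8859-1)
```

The narrowed second-byte ranges after 0xE0, 0xED, 0xF0 and 0xF4 are what exclude overlong encodings, surrogates, and values above U+10FFFF.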
From: Daniel DeLorme on 2 Dec 2007 20:55

Greg Willits wrote:
> So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
> to have an ä in it. Perfect.

If that actually works, it means you are really using ISO-8859-1 strings, not UTF-8.

> utf_accents = '\xC0\xC1\xC2\.......'

Nope, that's not UTF-8. UTF-8 characters ÀÁÂ would look like

utf_accents = "\xC3\x80\xC3\x81\xC3\x82..."

Dan
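
One way to get a UTF-8 accent list without typing every byte pair is to generate it from the code point ranges in the original question. This is only a sketch: the ranges are copied from Greg's regex, pack("U*") is the core method that encodes code points as UTF-8, and the names ranges, codepoints, first_bytes and sample are made up for illustration. The anchors \A and \z are swapped in here for ^ and $.

```ruby
# Code point ranges taken from the regex in the original question.
ranges = [0xC0..0xD6, 0xD9..0xF6, 0xF9..0xFF]

# Expand the ranges and encode every code point as UTF-8.
codepoints  = ranges.map { |r| r.to_a }.flatten
utf_accents = codepoints.pack("U*")

# The first six bytes are the UTF-8 forms of À, Á and Â.
first_bytes = utf_accents.unpack("C*")[0, 6]
puts first_bytes.map { |b| "%02X" % b }.join(" ")  # C3 80 C3 81 C3 82

# Inline the generated characters into a /u character class.
is_person_name = /\A[a-zA-Z#{utf_accents}.'\- ]*\z/u

sample = [0x4A, 0xE4, 0x73, 0x70].pack("U*")  # "Jäsp" as UTF-8
puts(is_person_name =~ sample ? "accepted" : "rejected")
```

Because the accents are interpolated as whole characters, the regex engine sees them as characters rather than as individual bytes.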