From: MonkeeSage on 6 Dec 2007 02:02

On Dec 5, 11:29 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> MonkeeSage wrote:
> > Heh, if the topic at hand is only that indexing into a string is
> > slower with native utf-8 strings (don't disagree), then I guess it's
> > irrelevant. ;) Regarding the idea that you can do everything just as
> > efficiently with regexps that you can do with native utf-8
> > encoding...it seems relevant.
>
> How so? These methods work just as well in ruby1.8, which does *not* have
> native utf8 encoding embedded in the strings. Of course, comparing a
> string with a string is more efficient than comparing a string with a
> regexp, but that is irrelevant to whether the string has "native" utf8
> encoding or not:
>
> $ ruby1.8 -rbenchmark -KU
> puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
> puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
> puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
> ^D
> 0.225839138031006
> 0.304145097732544
> 0.313494920730591
>
> $ ruby1.9 -rbenchmark -KU
> puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
> puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
> puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
> ^D
> 0.183344841003418
> 0.255104064941406
> 0.263553857803345
>
> 1.9 is more performant (one would hope so!), but the performance ratio
> between string comparison and regex comparison does not seem affected by
> the encoding at all.

Ok, I wasn't being clear. What I was trying to say is: yes, the methods perform the same on bytestrings, whether you use regexps or standard string operations. The problem is in their behavior, not their performance considered in the abstract. In 1.9, using the ascii default encoding, this bytestring acts just like 1.8:

"日本語".index("本") #=> 3

That's fine! Faster than a regexp, no problems.
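For reference, Daniel's string-vs-regexp comparison can still be reproduced on a modern Ruby, where string literals are UTF-8 by default; a minimal sketch (absolute timings will of course differ by machine):

```ruby
require "benchmark"

# Reproduces the comparison quoted above on a modern Ruby.
# Both forms find the same match; the string form is typically faster.
str_time   = Benchmark.measure { 100_000.times { "日本語".index("本") } }.real
regex_time = Benchmark.measure { 100_000.times { "日本語".index(/[本]/) } }.real

puts "string: #{str_time}"
puts "regexp: #{regex_time}"
```

Note that on a modern Ruby both calls return the character offset 1, not the byte offset 3 the post shows for the ascii default encoding.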
That is, unless I want to know where the character match is (for whatever reason -- say, work-necessitated interoperability with some software that requires it). For that I'd have to do something hackish and likely fragile. It's possible, but not desirable; in 1.9, meanwhile, you gain the performance and ruby already does all the work for you:

"日本語".force_encoding("utf-8").index("本".force_encoding("utf-8")) #=> 1

It's obviously no better to type, though! But that's because I'm using the ascii default encoding. There is, as I understand it, going to be a way to specify the default encoding from the command line, and probably from within ruby, rather than just the magic comments and String#force_encoding; so this extra typing is incidental and will go away. Actually, it goes away right now if you use the utf-8 default and use the byte api to get at the underlying bytestrings.

> > Someone just posted a question today about how to printf("%20s ...",
> > a, ...) when "a" contains unicode (it screws up the alignment, since
> > printf only counts byte width, not character width). There is no
> > *elegant* solution in 1.8, regexps or otherwise.
>
> It's not perfect in 1.9 either. "%20s" % "日本語" results in a string of
> 20 characters... that uses 23 columns of terminal space, because the font
> for Japanese uses double width. In other words, neither bytes nor
> characters have an intrinsic "width" :-/
>
> Daniel

It works as expected in 1.9; you just have to set the right encoding:

printf("%20s\n".force_encoding("utf-8"), "ni\xc3\xb1o".force_encoding("utf-8")) #=> niño
printf("%20s\n", "ni\xc3\xb1o") #=> niño

In any case, I just don't think there is any reason to dislike the new string api. It adds another tool to the toolbox. It doesn't make sense to use it always, everywhere (like trying to make data that naturally has the shape of an array fit into a hash); but I see no reason to try to cobble it together ourselves either (like building a hash api out of arrays ourselves).
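On a modern Ruby, the character-vs-byte distinction the thread keeps circling can be shown without any force_encoding calls; a minimal sketch, assuming a UTF-8 source file:

```ruby
# Sketch: on a modern Ruby, string literals are UTF-8 by default, so both
# #index and "%20s" count characters rather than bytes.
s = "日本語"

puts s.index("本")      # => 1, the character offset
puts s.b.index("本".b)  # => 3, the byte offset (#b returns a binary-encoded copy)

padded = "%20s" % "niño"  # "niño" is 4 characters but 5 bytes
puts padded.length        # => 20 characters
puts padded.bytesize      # => 21 bytes (the ñ occupies two)
```

As Daniel notes, character count still isn't display width: the padded string lines up for Latin scripts but not for double-width CJK glyphs.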
And with that, I'm going to sleep. Have to think more on it tomorrow.

Peace,
Jordan
From: Jimmy Kofler on 7 Dec 2007 05:06

> Re: Unicode in Regex
> Posted by Jordan Callicoat (monkeesage) on 03.12.2007 02:50
>
> This seems to work...
>
> $KCODE = "UTF8"
> p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?/u =~ "J�sp...it works"
> # => 0
> ...
> However, it looks to me like it would be more robust to use a slightly
> modified version of UTF8REGEX (found in the link Jimmy posted above)...
>
> UTF8REGEX = /\A(?:
>   [a-zA-Z\.\-\'\ ]
>   | [\xC2-\xDF][\x80-\xBF]
>   | \xE0[\xA0-\xBF][\x80-\xBF]
>   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
>   | \xED[\x80-\x9F][\x80-\xBF]
>   | \xF0[\x90-\xBF][\x80-\xBF]{2}
>   | [\xF1-\xF3][\x80-\xBF]{3}
>   | \xF4[\x80-\x8F][\x80-\xBF]{2}
> )*\z/mnx

Just to avoid confusion over the meaning of 'UTF8' in UTF8REGEX: the n option sets the encoding of UTF8REGEX to none!

Cheers,

j. k.

--
Posted via http://www.ruby-forum.com/.
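The byte-range regex quoted above validates UTF-8 by hand, which was necessary in 1.8; from 1.9 on the string can simply be asked. A minimal sketch of the built-in check:

```ruby
# Sketch: checking whether a string's bytes form valid UTF-8, which is what
# the hand-rolled UTF8REGEX above does, via the built-in String#valid_encoding?.
good = "niño"                             # well-formed UTF-8
bad  = "J\xFCsp".force_encoding("UTF-8")  # a lone Latin-1 byte, invalid in UTF-8

puts good.valid_encoding?  # => true
puts bad.valid_encoding?   # => false
```

Unlike the regex, valid_encoding? accepts any well-formed sequence rather than the whitelisted name characters, so the two checks are equivalent only for the byte-structure part of the validation.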