From: Daniel DeLorme on 5 Dec 2007 19:15

marc wrote:
> Daniel DeLorme said...
>> MonkeeSage wrote:
>>> Everything in ruby is a bytestring.
>> YES! And that's exactly how it should be. Who is it that spread the
>> flawed idea that strings are fundamentally made of characters?
>
> Are you being ironic?

Not at all. By "fundamentally" I mean the fundamental, lowest level of
representation. If strings were fundamentally made of characters then we
wouldn't be able to access individual bytes, because that's a lower level
than the fundamental level, which is by definition impossible.

If you are using UCS2 it makes sense to consider strings as arrays of
characters because that's what they are. But UTF8 strings do not follow
the characteristics of arrays at all. Each access into the "array" is
O(n) rather than O(1). So IMHO treating it as an array of characters is
a *very* leaky abstraction.

I agree that 99.9% of the time you want to deal with characters, and I
believe that in 99% of those cases you would be better served with regex
than this pretend "array" disguise.

Daniel
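To make that O(n) claim concrete, here is a minimal sketch of the walk a
character lookup has to do over UTF-8 bytes (illustrative only, not from
the thread; it assumes well-formed UTF-8 input):

# -*- coding: utf-8 -*-
# Finding the n-th character of a UTF-8 byte string means walking every
# byte from the start, because characters are 1-4 bytes wide.
def utf8_char_at(str, n)
  bytes = str.unpack("C*")            # raw byte values
  width = lambda do |b|
    if    b < 0x80 then 1             # 0xxxxxxx: ASCII
    elsif b < 0xE0 then 2             # 110xxxxx: 2-byte sequence
    elsif b < 0xF0 then 3             # 1110xxxx: 3-byte sequence
    else                4             # 11110xxx: 4-byte sequence
    end
  end
  i = 0
  n.times { i += width.call(bytes[i]) }     # skip the first n characters
  bytes[i, width.call(bytes[i])].pack("C*")
end

puts utf8_char_at("日本語", 1)   # => 本

With a fixed-width encoding like UCS2 the same lookup would be a single
multiply-and-slice, which is why the array-of-characters picture fits
there but not here.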
From: MonkeeSage on 5 Dec 2007 20:23

On Dec 5, 6:15 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> marc wrote:
> > Daniel DeLorme said...
> >> MonkeeSage wrote:
> >>> Everything in ruby is a bytestring.
> >> YES! And that's exactly how it should be. Who is it that spread the
> >> flawed idea that strings are fundamentally made of characters?
>
> > Are you being ironic?
>
> Not at all. By "fundamentally" I mean the fundamental, lowest level of
> representation. If strings were fundamentally made of characters then we
> wouldn't be able to access individual bytes because that's a lower level
> than the fundamental level, which is by definition impossible.
>
> If you are using UCS2 it makes sense to consider strings as arrays of
> characters because that's what they are. But UTF8 strings do not follow
> the characteristics of arrays at all. Each access into the "array" is
> O(n) rather than O(1). So IMHO treating it as an array of characters is
> a *very* leaky abstraction.
>
> I agree that 99.9% of the time you want to deal with characters, and I
> believe that in 99% of those cases you would be better served with regex
> than this pretend "array" disguise.
>
> Daniel

Here is a micro-benchmark on three common string operations (split,
index, length), using bytestrings and unicode regexps versus native
utf-8 strings in 1.9.0 (release).

$ ruby19 -v
ruby 1.9.0 (2007-10-15 patchlevel 0) [i686-linux]

$ echo && cat bench.rb

#!/usr/bin/ruby19
# -*- coding: ascii -*-

require "benchmark"
require "test/unit/assertions"
include Test::Unit::Assertions

$KCODE = "u"
$target = "!日本語!" * 100
$unichr = "本".force_encoding('utf-8')
$regchr = /[本]/u

def uni_split
  $target.split($unichr)
end
def reg_split
  $target.split($regchr)
end

def uni_index
  $target.index($unichr)
end
def reg_index
  $target =~ $regchr
end

def uni_chars
  $target.length
end
def reg_chars
  $target.unpack("U*").length
  # this is *a lot* slower
  # $target.scan(/./u).length
end

$target.force_encoding("ascii")
a = reg_split
$target.force_encoding("utf-8")
b = uni_split
assert_equal(a.length, b.length)

$target.force_encoding("ascii")
a = reg_index
$target.force_encoding("utf-8")
b = uni_index
assert_equal(a - 2, b)

$target.force_encoding("ascii")
a = reg_chars
$target.force_encoding("utf-8")
b = uni_chars
assert_equal(a, b)

n = 10_000
Benchmark.bm(12) { |x|
  $target.force_encoding("ascii")
  x.report("reg_split") { n.times { reg_split } }
  $target.force_encoding("utf-8")
  x.report("uni_split") { n.times { uni_split } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_index") { n.times { reg_index } }
  $target.force_encoding("utf-8")
  x.report("uni_index") { n.times { uni_index } }
  puts
  $target.force_encoding("ascii")
  x.report("reg_chars") { n.times { reg_chars } }
  $target.force_encoding("utf-8")
  x.report("uni_chars") { n.times { uni_chars } }
}

====
With caches initialized, and 5 prior runs, I got these numbers:

$ ruby19 bench.rb
                   user     system      total        real
reg_split      2.550000   0.010000   2.560000 (  2.799292)
uni_split      1.820000   0.020000   1.840000 (  2.026265)

reg_index      0.040000   0.000000   0.040000 (  0.097672)
uni_index      0.150000   0.000000   0.150000 (  0.202700)

reg_chars      0.790000   0.010000   0.800000 (  0.919995)
uni_chars      0.130000   0.000000   0.130000 (  0.193307)
====

So String#=~ with a bytestring and unicode regexp is faster than
String#index by a factor of ~0.5. In the other two cases, the opposite
is true.

Ps. BTW, in case there is any confusion, bytestrings aren't going away;
you can, as you see above, specify a magic encoding comment to ensure
that you have bytestrings by default. You can also explicitly decode
from utf-8 back to ascii. And you can get a byte enumerator (or an
array, by calling to_a on the enumerator) from String#bytes, and an
iterator from #each_byte, regardless of the encoding.

Regards,
Jordan
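A minimal sketch of those byte accessors (illustrative only; assumes a
1.9 build with a utf-8 source encoding, separate from the benchmark
above):

# -*- coding: utf-8 -*-
s = "日本語"
p s.length        # => 3 characters, since the string is tagged utf-8
p s.bytes.to_a    # => [230, 151, 165, 230, 156, 172, 232, 170, 158]
s.each_byte { |b| printf("%02x ", b) }   # e6 97 a5 e6 9c ac e8 aa 9e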
From: Daniel DeLorme on 5 Dec 2007 21:31

MonkeeSage wrote:
> Here is a micro-benchmark on three common string operations (split,
> index, length), using bytestrings and unicode regexps versus native
> utf-8 strings in 1.9.0 (release).

That's nice, but split and index do not operate using integer indexing
into the string, so they are rather irrelevant to the topic at hand.
They produce the same results in ruby1.8, i.e. uni_split==reg_split and
uni_index==reg_index. I also stated that the point of regex manipulation
is to *obviate* the need for methods like index and length. So a more
accurate benchmark might be something like:

reg_chars           N/A        N/A        N/A  (       N/A)
uni_chars      0.130000   0.000000   0.130000  (  0.193307)

;-)

> Ps. BTW, in case there is any confusion, bytestrings aren't going
> away; you can, as you see above, specify a magic encoding comment to
> ensure that you have bytestrings by default.

Yes, it's still possible to access bytes but it's not possible to run a
utf8 regex on a bytestring if it contains extended characters:

$ ruby1.9 -ve '"abc" =~ /b/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
$ ruby1.9 -ve '"日本語" =~ /本/u'
ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
-e:1:in `<main>': character encodings differ (ArgumentError)

And that kinda kills my whole approach.

Daniel
From: MonkeeSage on 5 Dec 2007 22:07

On Dec 5, 8:31 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> MonkeeSage wrote:
> > Here is a micro-benchmark on three common string operations (split,
> > index, length), using bytestrings and unicode regexps versus native
> > utf-8 strings in 1.9.0 (release).
>
> That's nice, but split and index do not operate using integer indexing
> into the string, so they are rather irrelevant to the topic at hand.

Heh, if the topic at hand is only that indexing into a string is slower
with native utf-8 strings (don't disagree), then I guess it's
irrelevant. ;) Regarding the idea that you can do everything just as
efficiently with regexps that you can do with native utf-8 encoding...
it seems relevant. In other words, it goes to show a general behavior
that is benefited by a native implementation (the same reason we're
using native hashes rather than building our own implementations out of
arrays of pairs).

> They produce the same results in ruby1.8, i.e. uni_split==reg_split
> and uni_index==reg_index.

Yes. My point was to show how a native implementation of unicode
strings affects performance compared to using regular expressions on
bytestrings. The behavior should be the same (hence the asserts).

> I also stated that the point of regex manipulation is to *obviate* the
> need for methods like index and length. So a more accurate benchmark
> might be something like:
> reg_chars           N/A        N/A        N/A  (       N/A)
> uni_chars      0.130000   0.000000   0.130000  (  0.193307)
> ;-)

Someone just posted a question today about how to printf("%20s ...", a,
...) when "a" contains unicode (it screws up the alignment, since
printf only counts byte width, not character width). There is no
*elegant* solution in 1.8, regexps or otherwise. There are hackish
solutions (I provided one in that thread)...but the need was still
there.

Another example is GtkTextView widgets from ruby-gtk2. They deal with
utf-8 in their C backend. So all the cursor functions that deal with
characters mean utf-8 characters, not bytestrings. So without kludges,
stuff doesn't always work right.

> > Ps. BTW, in case there is any confusion, bytestrings aren't going
> > away; you can, as you see above, specify a magic encoding comment to
> > ensure that you have bytestrings by default.
>
> Yes, it's still possible to access bytes but it's not possible to run
> a utf8 regex on a bytestring if it contains extended characters:
>
> $ ruby1.9 -ve '"abc" =~ /b/u'
> ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
> $ ruby1.9 -ve '"日本語" =~ /本/u'
> ruby 1.9.0 (2007-12-03 patchlevel 0) [i686-linux]
> -e:1:in `<main>': character encodings differ (ArgumentError)
>
> And that kinda kills my whole approach.

You can't use mixed encodings (not just in regexps, not anywhere).
You'd have to use a proposed-but-not-implemented-in-1.9.0-release
command line switch to set your encoding to ascii (or whatever), or
else use a magic comment [1] like I did above. That or explicitly
encode both objects in the same encoding.

Regards,
Jordan

[1] http://www.ruby-forum.com/topic/127831
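A minimal sketch of that last option (illustrative only; assumes 1.9
semantics, where relabeling the bytes makes the encodings agree):

# -*- coding: utf-8 -*-
# Put both operands in the same encoding and the ArgumentError goes away.
s = "日本語".force_encoding("utf-8")   # tag the raw bytes as utf-8
p s =~ /本/u                           # => 1 (a character offset, not a byte offset)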
From: Daniel DeLorme on 6 Dec 2007 00:29

MonkeeSage wrote:
> Heh, if the topic at hand is only that indexing into a string is
> slower with native utf-8 strings (don't disagree), then I guess it's
> irrelevant. ;) Regarding the idea that you can do everything just as
> efficiently with regexps that you can do with native utf-8
> encoding...it seems relevant.

How so? These methods work just as well in ruby1.8, which does *not*
have native utf8 encoding embedded in the strings. Of course, comparing
a string with a string is more efficient than comparing a string with a
regexp, but that holds regardless of whether the string has "native"
utf8 encoding or not:

$ ruby1.8 -rbenchmark -KU
puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
^D
0.225839138031006
0.304145097732544
0.313494920730591

$ ruby1.9 -rbenchmark -KU
puts Benchmark.measure{100000.times{ "日本語".index("本") }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/) }}.real
puts Benchmark.measure{100000.times{ "日本語".index(/[本]/u) }}.real
^D
0.183344841003418
0.255104064941406
0.263553857803345

1.9 is more performant (one would hope so!) but the performance ratio
between string comparison and regex comparison does not seem affected
by the encoding at all.

> Someone just posted a question today about how to printf("%20s ...",
> a, ...) when "a" contains unicode (it screws up the alignment, since
> printf only counts byte width, not character width). There is no
> *elegant* solution in 1.8, regexps or otherwise.

It's not perfect in 1.9 either. "%20s" % "日本語" results in a string of
20 characters... that uses 23 columns of terminal space, because the
font for Japanese uses double width. In other words neither bytes nor
characters have an intrinsic "width" :-/

Daniel
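A rough sketch of that closing point (illustrative only; the hard-coded
CJK ranges are a crude stand-in for Unicode's East Asian Width tables):
character count and column count have to be computed separately.

# -*- coding: utf-8 -*-
# Approximate terminal columns by treating common CJK/Hangul/fullwidth
# codepoints as two columns wide and everything else as one.
def display_width(str)
  str.unpack("U*").inject(0) do |cols, cp|
    wide = (0x1100..0x115F).include?(cp) ||   # Hangul Jamo
           (0x2E80..0x9FFF).include?(cp) ||   # CJK radicals through ideographs
           (0xAC00..0xD7A3).include?(cp) ||   # Hangul syllables
           (0xFF00..0xFF60).include?(cp)      # fullwidth forms
    cols + (wide ? 2 : 1)
  end
end

puts display_width("日本語")   # => 6 columns for 3 characters
puts display_width("abc")      # => 3 columns for 3 characters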