From: Daniel DeLorme on 4 Dec 2007 20:58

MonkeeSage wrote:
> I guess we were talking about different things then. I never meant to
> imply that the regexp engine can't match unicode characters

Since regular expressions are embedded in the very syntax of ruby just
as arrays and hashes, IMHO that qualifies as unicode support. So yeah,
it seems like we have a semantic disagreement. :-(

> I, like Charles (and I think most people), was referring to the
> ability to index into strings by characters, find their lengths in
> characters

That is certainly *one* way of supporting unicode but by no means the
only way. My belief is that you can do most string manipulations in a
way that obviates the need for char indexing & char length, if only you
change your mindset from "operating on individual characters" to
"operating on the string as a whole". And since regex are a specialized
language for string manipulation, they're also a lot faster. It's a
little like imperative vs functional programming; if I told you about a
programming language that has no variable assignments you might think
it's completely broken, and yet that's how functional languages work.

> to compose and decompose composite characters, to
> normalize characters, convert them to other encodings like shift-jis,
> and other such things.

Converting encodings is a worthy goal but unrelated to unicode support.
As for character [de]composition, that would be a very nice thing to
have if it were handled automatically (e.g. "a\314\200"=="\303\240"),
but if the programmer has to worry about it then you might as well
leave it to a specialized library. Well, it's not like ruby lets us
abstract away composite characters in either 1.8 or 1.9... I never
claimed unicode support was 100%, just good enough for most needs.

> just a difference of opinion. I don't mind being wrong (happens a
> lot! ;) I just don't like being accused of spreading FUD about ruby,
> which to my mind implies malice aforethought rather than simple
> mistake.

Yes, that was too harsh on my part. My apologies.

Daniel
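Daniel's "operate on the string as a whole" approach can be sketched in modern Ruby (which, unlike the 1.8 bytestrings under discussion, tags every string with an encoding). This is an illustrative sketch, not code from the thread; the sample string is made up:

```ruby
# The "whole string" style: let the regex engine walk the UTF-8 text
# instead of indexing into it character by character.
str = "caf\u00E9 au lait"   # UTF-8 text containing a multibyte "é"

# Character count without char indexing: one pass by the regex engine.
char_count = str.scan(/./).length       # => 12

# A typical "per-character" task done as one whole-string operation.
titled = str.gsub(/\b\w/) { |c| c.upcase }   # => "Café Au Lait"

# When random access really is needed, splitting once up front turns
# every later lookup into a cheap array index.
chars = str.scan(/./)
chars[3]                                 # => "é"
```

The pre-split array is the same trick Daniel benchmarks later in the thread: pay the variable-width decoding cost once instead of on every lookup.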
From: Daniel DeLorme on 4 Dec 2007 21:03

Daniel DeLorme wrote:
> Heavy compared to what? Once compiled, regex are orders of magnitude
> faster than jumping in and out of ruby interpreted code.

Sorry to beat a dead horse, but I just did an interesting little
experiment with 1.9:

>> str = "abcde"*1000
>> str.encoding
=> <Encoding:ASCII-8BIT>
>> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
=> 0.010282039642334
>> str.force_encoding 'utf-8'
>> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
=> 1.29934501647949
>> arr = str.scan(/./u)
>> Benchmark.measure{10.times{ 1000.times{|i|arr[i]} }}.real
=> 0.00343608856201172

indexing into UTF-8 strings is *expensive*

Daniel
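For readers who want to reproduce Daniel's experiment, here is a self-contained version. Exact timings are machine- and version-dependent; note also that current MRI caches a string's "code range", so an all-ASCII string tagged UTF-8 is still indexed in O(1). The sketch therefore includes a genuinely multibyte character, which the original "abcde" test did not need on 1.9.0:

```ruby
require 'benchmark'

# Re-run of the experiment above. The absolute numbers will differ from
# the 2007 figures; the point is the relative cost of each approach.
str = "abcd\u00E9" * 1000           # 5000 chars, 6000 bytes

str.force_encoding('ASCII-8BIT')
binary = Benchmark.measure { 10.times { 1000.times { |i| str[i] } } }.real

str.force_encoding('UTF-8')
utf8   = Benchmark.measure { 10.times { 1000.times { |i| str[i] } } }.real

# Daniel's workaround: split once, then every lookup is an array index.
arr    = str.scan(/./)
presplit = Benchmark.measure { 10.times { 1000.times { |i| arr[i] } } }.real

printf("binary: %.6f  utf-8: %.6f  pre-split: %.6f\n",
       binary, utf8, presplit)
```

Indexing into the UTF-8-tagged string forces a scan through the variable-width encoding, while the byte view and the pre-split array are both constant-time per lookup.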
From: Charles Oliver Nutter on 5 Dec 2007 02:34

Daniel DeLorme wrote:
> Daniel DeLorme wrote:
>> Heavy compared to what? Once compiled, regex are orders of magnitude
>> faster than jumping in and out of ruby interpreted code.
>
> Sorry to beat a dead horse, but I just did an interesting little
> experiment with 1.9:
>
> >> str = "abcde"*1000
> >> str.encoding
> => <Encoding:ASCII-8BIT>
> >> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
> => 0.010282039642334
> >> str.force_encoding 'utf-8'
> >> Benchmark.measure{10.times{ 1000.times{|i|str[i]} }}.real
> => 1.29934501647949
> >> arr = str.scan(/./u)
> >> Benchmark.measure{10.times{ 1000.times{|i|arr[i]} }}.real
> => 0.00343608856201172
>
> indexing into UTF-8 strings is *expensive*

...but correct. I'd rather have correct than broken.

- Charlie
From: MonkeeSage on 5 Dec 2007 06:01

On Dec 4, 7:58 pm, Daniel DeLorme <dan...(a)dan42.com> wrote:
> MonkeeSage wrote:
> > I guess we were talking about different things then. I never meant to
> > imply that the regexp engine can't match unicode characters
>
> Since regular expressions are embedded in the very syntax of ruby just
> as arrays and hashes, IMHO that qualifies as unicode support. So yeah,
> it seems like we have a semantic disagreement. :-(
>
> > I, like Charles (and I think most people), was referring to the
> > ability to index into strings by characters, find their lengths in
> > characters
>
> That is certainly *one* way of supporting unicode but by no means the
> only way. My belief is that you can do most string manipulations in a
> way that obviates the need for char indexing & char length, if only you
> change your mindset from "operating on individual characters" to
> "operating on the string as a whole". And since regex are a specialized
> language for string manipulation, they're also a lot faster. It's a
> little like imperative vs functional programming; if I told you about a
> programming language that has no variable assignments you might think
> it's completely broken, and yet that's how functional languages work.

I think we'll just have to agree to disagree. But there is one point...

main = do
    let i_like = "I like "
    putStrLn $ i_like ++ haskell
  where
    haskell = "a functional language"

;)

> > to compose and decompose composite characters, to
> > normalize characters, convert them to other encodings like shift-jis,
> > and other such things.
>
> Converting encodings is a worthy goal but unrelated to unicode support.
> As for character [de]composition that would be a very nice thing to have
> if it was handled automatically (e.g. "a\314\200"=="\303\240") but if
> the programmer has to worry about it then you might as well leave it to
> a specialized library. Well, it's not like ruby lets us abstract away
> composite characters either in 1.8 or 1.9... I never claimed unicode
> support was 100%, just good enough for most needs.
>
> > just a difference of opinion. I don't mind being wrong (happens a
> > lot! ;) I just don't like being accused of spreading FUD about ruby,
> > which to my mind implies malice aforethought rather than simple
> > mistake.
>
> Yes, that was too harsh on my part. My apologies.

No worries. :) I apologize as well for responding by saying you were
lying about unicode support; I see that we just have a difference of
opinion and were talking past each other.

> Daniel

Regards,
Jordan
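Daniel's "a\314\200"=="\303\240" example is a string of "a" plus a combining grave accent (U+0300) versus the precomposed "à" (U+00E0), which is exactly what Unicode normalization addresses. Modern Ruby (2.2+) ships `String#unicode_normalize`, which postdates this thread but shows the comparison the posters wanted:

```ruby
# "a" followed by combining grave accent vs. the precomposed character.
decomposed  = "a\u0300"   # same bytes Daniel writes as "a\314\200"
precomposed = "\u00E0"    # same bytes Daniel writes as "\303\240"

# As raw strings they are unequal, which is what Daniel means by ruby
# not abstracting composite characters away:
decomposed == precomposed                            # => false

# With explicit normalization (Ruby 2.2+), both directions round-trip:
decomposed.unicode_normalize(:nfc) == precomposed    # => true
precomposed.unicode_normalize(:nfd) == decomposed    # => true
```

So the "handled automatically" behavior Daniel wishes for still does not exist; equality stays bytewise, and normalization remains an explicit step.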
From: marc on 5 Dec 2007 16:35

Daniel DeLorme said...
> MonkeeSage wrote:
> > Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).
>
> It enrages me to see this kind of FUD. Through regular expressions, ruby
> 1.8 has 80-90% complete utf8 support. And oniguruma makes utf8 support
> well-near 100% complete.
>
> > Everything in ruby is a bytestring.
>
> YES! And that's exactly how it should be. Who is it that spread the
> flawed idea that strings are fundamentally made of characters?

Are you being ironic?

> I'd like
> to slap him around a little. Fundamentally, ever since the word "string"
> was applied to computing, strings were made of 8-BIT CHARS, not n-bit
> characters. If only the creators of C had called that datatype "byte"
> instead of "char" it would have saved us so many misunderstandings.

And look at the trouble we're having ditching the waterfall method, all
because someone misread a paper in the 1970s or thereabouts.

You might want to spar with Tim Bray from Sun, who presented at RubyConf
2006, where his slides state:

"99.99999% of the time, programmers want to deal with characters not
bytes. I know of one exception: running a state machine on UTF8-encoded
text. This is done by the Expat XML parser."

"In 2006, programmers around the world expect that, in modern languages,
strings are Unicode and string APIs provide Unicode semantics correctly
& efficiently, by default. Otherwise, they perceive this as an offense
against their language and their culture. Humanities/computing academics
often need to work outside Unicode. Few others do."

He reviews his talk here:
http://www.tbray.org/ongoing/When/200x/2006/10/22/Unicode-and-Ruby
and the slides are here:
http://www.tbray.org/talks/rubyconf2006.pdf

--
Cheers,
Marc
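The bytes-versus-characters split the whole thread argues about can be made concrete in modern Ruby terms, where one String exposes both views (methods like `String#bytes` and `String#chars` arrived after this thread; 1.8 offered only the byte view):

```ruby
# One string, two views: the character view most programmers want,
# and the raw byte view a UTF-8 state machine (Tim Bray's exception)
# would iterate with each_byte.
s = "\u00E9"     # "é": one character encoded as two UTF-8 bytes

s.length         # => 1            character count
s.bytesize       # => 2            byte count
s.chars          # => ["é"]        character view
s.bytes          # => [195, 169]   byte view (0xC3 0xA9)
```

In Ruby 1.8, `s.length` would have reported 2, which is precisely the bytestring behavior Daniel defends and Tim Bray's slides argue against.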