String comparison. Why does Ruby consider this true? [Ruby]


From: Josh Cheek on 19 Jun 2010 03:59

[Note: parts of this message were removed to make it a legal post.]

On Sat, Jun 19, 2010 at 2:04 AM, Michael Fellinger <m.fellinger(a)gmail.com> wrote:

> On Sat, Jun 19, 2010 at 6:21 AM, Josh Cheek <josh.cheek(a)gmail.com> wrote:
> >
> > Thanks, but it doesn't seem to work on 1.8
> >
> >
> > RUBY_VERSION # => "1.8.7"
> >
> > %w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.codepoints.to_a } # =>
> > # ~> -:3: undefined method `codepoints' for "ABC":String (NoMethodError)
> > # ~> from -:3:in `each'
> > # ~> from -:3
> >
> >
> >
> >
> > And the 1.8 ways to get it don't work on 1.9 (ie "a"[0])
>
> >> %w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.unpack('C*') }
> {"ABC"=>[65, 66, 67]}
> {"Xeo"=>[88, 101, 111]}
> {"abc"=>[97, 98, 99]}
> {"ball"=>[98, 97, 108, 108]}
> {"xeo"=>[120, 101, 111]}
> => ["ABC", "Xeo", "abc", "ball", "xeo"]
>
> There is always a way to make things work on both, it's just that I
> don't care much about 1.8 anymore.
>
> --
> Michael Fellinger
> CTO, The Rubyists, LLC
>
>
Well, a lot of systems still ship with it. Snow Leopard, for example, ships
with 1.8.7, so while this is a legitimate personal decision, I think it is
good to be aware of one's audience. Since Abder-rahman is having difficulty
understanding String comparison, it is probably fair to assume he isn't
experienced enough to understand why the example that is supposed to help
him ends up breaking (if he is on 1.8). That could be very discouraging for
someone new: you come to the ML to get a better understanding, and the
answers given by the people who know what they are doing won't even run.

Anyway, I really do like your solution ^_^ It is elegant and uniform, thank
you for providing it.

From: Brian Candler on 21 Jun 2010 06:10

Josh Cheek wrote:
> Well, this used to be easy to show, but apparently since ASCII has been
> abandoned, and I don't know Unicode, I have to resort to hacky things
> like this to explain it.
>
>
> $chars = (1..128).inject(Hash.new) { |chars,num| chars[num.chr] = num ; chars }
>
> def to_number_array(str)
>   str.split(//).map { |char| $chars[char] }
> end
>
> to_number_array 'Xeo' # => [88, 101, 111]
> to_number_array 'xeo' # => [120, 101, 111]
> to_number_array 'ball' # => [98, 97, 108, 108]
> to_number_array 'ABC' # => [65, 66, 67]
> to_number_array 'abc' # => [97, 98, 99]

Except that this is irrelevant, because even ruby 1.9 does not compare
strings by codepoints. It compares them byte-by-byte using memcmp. See
rb_str_cmp_m() and rb_str_cmp() in string.c.
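
To make the byte-by-byte point concrete, here is a rough sketch in plain Ruby. The helper name bytewise_cmp is made up for illustration; it only imitates the memcmp-style comparison (first differing byte wins, and a shorter string that is a prefix of the other sorts first) and is not how string.c is actually written.

```ruby
# Hypothetical helper: compare two strings by their raw bytes.
# Array#<=> compares element by element and then by length, which is
# the same ordering rule as a memcmp followed by a length check.
def bytewise_cmp(a, b)
  a.bytes.to_a <=> b.bytes.to_a
end

words = %w[Xeo xeo ball ABC abc]

sorted_builtin  = words.sort
sorted_bytewise = words.sort { |a, b| bytewise_cmp(a, b) }

p sorted_builtin   # ["ABC", "Xeo", "abc", "ball", "xeo"]
p sorted_bytewise  # same order as above
```

Upper-case letters have smaller byte values than lower-case ones (65..90 versus 97..122), which is why "Xeo" sorts before "abc".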

It's a designed-in side-effect of UTF-8 encoding that higher codepoints
sort after lower ones. There is a table at
http://en.wikipedia.org/wiki/UTF-8 under "Description" which illustrates
this.
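
A small check of that UTF-8 property, as a sketch only: the sample characters below are my own choice, written with \u escapes so the strings are UTF-8 regardless of the source file encoding (this needs 1.9 or later).

```ruby
# Sample characters (chosen for illustration) with 1-, 2- and 3-byte
# UTF-8 encodings: a, z, sharp s, a-macron, euro sign.
words = ["z", "\u0101", "a", "\u00DF", "\u20AC"]

by_bytes      = words.sort_by { |w| w.bytes.to_a }       # byte order
by_codepoints = words.sort_by { |w| w.codepoints.to_a }  # codepoint order

p by_bytes.map { |w| w.codepoints.to_a }
# [[97], [122], [223], [257], [8364]]
p by_bytes == by_codepoints  # true
```

For UTF-8 strings the two orderings agree, so plain String#sort happens to give codepoint order as well.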

However this does not work for other encodings. Try this for size:

>> s1 = 97.chr("UTF-8")
=> "a"
>> s2 = 257.chr("UTF-8")
=> "ā"
>> s1 < s2
=> true

>> s1 = 97.chr("UTF-16LE")
=> "a\x00"
>> s2 = 257.chr("UTF-16LE")
=> "\x01\x01"
>> s1 < s2
=> false

Yes: those are the same two Unicode codepoints, but they sort in the
opposite order. For encodings like UTF-16LE, where the least-significant
byte comes before the most-significant byte, you get an almost arbitrary
ordering.
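
If you actually want codepoint order for such strings, one option is to compare the codepoint arrays instead of the strings themselves. A minimal sketch, assuming 1.9 or later; codepoint_cmp is a name invented here, not a core method:

```ruby
# Hypothetical helper: order two strings by their codepoints rather
# than by their encoded bytes.
def codepoint_cmp(a, b)
  a.codepoints.to_a <=> b.codepoints.to_a
end

s1 = "a".encode("UTF-16LE")       # bytes 0x61 0x00
s2 = "\u0101".encode("UTF-16LE")  # bytes 0x01 0x01

p s1 <=> s2              # 1, byte order puts s1 after s2
p codepoint_cmp(s1, s2)  # -1, codepoint 97 comes before 257
```

Transcoding both strings to UTF-8 before comparing would be another way to get the same ordering.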

Proviso: I tested this with
ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]

ruby 1.9.x string encoding rules are (a) undocumented, and (b) subject
to arbitrary changes between patchlevels, hence YMMV.
--
Posted via http://www.ruby-forum.com/.

From: Brian Candler on 21 Jun 2010 06:27

Michael Fellinger wrote:
> >> %w[Xeo xeo ball ABC abc].sort.each{|word| p word => word.unpack('C*') }
> {"ABC"=>[65, 66, 67]}
> {"Xeo"=>[88, 101, 111]}
> {"abc"=>[97, 98, 99]}
> {"ball"=>[98, 97, 108, 108]}
> {"xeo"=>[120, 101, 111]}
> => ["ABC", "Xeo", "abc", "ball", "xeo"]
>
> There is always a way to make things work on both, it's just that I
> don't care much about 1.8 anymore.

That does work the same on both, but it doesn't give codepoints.

$ irb --simple-prompt
>> "groß".unpack("C*")
=> [103, 114, 111, 195, 159]
>> RUBY_VERSION
=> "1.8.6"

$ irb19 --simple-prompt
>> "groß".unpack('C*')
=> [103, 114, 111, 195, 159]
>> "groß".codepoints.to_a
=> [103, 114, 111, 223]
>> RUBY_DESCRIPTION
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"
