From: Stefano Crocco on 2 Feb 2010 07:54
On Tuesday 02 February 2010, Xavier Noëlle wrote:
> |Hello,
> |I'm trying to deal with Ruby flaws with encoding, which I thought
> |would be almost past with Ruby 1.9. I managed to find a solution for
> |Ruby 1.8 and thought I did for Ruby 1.9...but in fact, no !
> |
> |I fetch rows from an UTF8 database and try to work with the string. To
> |do so, I would like it to be UTF8 encoded.
> |
> |"str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
> |lines would solve the problem
> |str.replace(Iconv.iconv("UTF8", "ascii", self).join())
> |OR
> |self.encode!('UTF-8')
> |
> |But they don't !
> |First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
> |Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
> |(Encoding::UndefinedConversionError)
> |
> |The base string is "Oeuvre complète pour luth" and displays well in
> |PHPMyAdmin.
> |
> |Any idea ?
> |TIA,

I'm not sure, but based on my experience, it may be that the strings are indeed stored as UTF-8, but the library you use to read from the database doesn't tell Ruby so. Ruby therefore treats each string as a generic array of bytes, that is, it gives the string the encoding ASCII-8BIT, which is the same as BINARY. If this is the case, you don't need to transcode the string (which is what encode does), but simply tell Ruby which encoding is the correct one, using the force_encoding method.

I hope this helps

Stefano
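To make the difference between encode and force_encoding concrete, here is a minimal sketch. It assumes the situation Stefano describes, namely that the bytes really are UTF-8 and only the label is wrong; the sample string is built by hand for illustration and does not come from a real database driver.

```ruby
# Assumption: the driver returns correct UTF-8 bytes but labels them as binary.
# "\xC3\xA8" is the two-byte UTF-8 form of "è".
raw = "Oeuvre compl\xC3\xA8te pour luth".dup.force_encoding("ASCII-8BIT")

# encode transcodes: it tries to convert each binary byte to a UTF-8
# character and fails on the first byte above 0x7F.
transcode_error = begin
  raw.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# force_encoding only changes the label; the bytes are left untouched.
utf8 = raw.dup.force_encoding("UTF-8")

puts raw.encoding == Encoding::BINARY  # true
puts transcode_error                   # Encoding::UndefinedConversionError
puts utf8.valid_encoding?              # true
puts utf8.length                       # 25 (characters)
puts utf8.bytesize                     # 26 (bytes)
```

Note that force_encoding never checks the bytes. If they are not valid UTF-8, the relabelled string will raise later, in methods such as downcase! or gsub.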
From: David Palm on 2 Feb 2010 08:26
> I fetch rows from an UTF8 database and try to work with the string. To
> do so, I would like it to be UTF8 encoded.

There are several pieces to this. Even if the DB encoding and collation are utf8, double-check that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app, I think).

> self.encode!('UTF-8')

str.force_encoding('UTF-8') is what you want to use, I think. :)
From: Xavier Noëlle on 2 Feb 2010 09:12
2010/2/2 David Palm <dvdplm(a)gmail.com>:
> There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).

Not a Rails app :-)

> str.force_encoding('UTF-8') is what you want to use I think.

I already tried this method, but it led me to the following error:
in `downcase!': invalid byte sequence in UTF-8 (ArgumentError).

This is due to a call to str.downcase!() later in the application.

Any idea how to solve this? :-)

--
Xavier NOELLE
From: Robert Klemme on 2 Feb 2010 09:48
2010/2/2 Xavier Noëlle <xavier.noelle(a)gmail.com>:
> 2010/2/2 David Palm <dvdplm(a)gmail.com>:
>> There are several pieces to this. Even if the DB encoding and collation is utf8, doublecheck that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app I think).
>
> Not a Rails app :-)
>
>> str.force_encoding('UTF-8') is what you want to use I think.
>
> I already tried this method, but it lead me to the following error: in
> `downcase!': invalid byte sequence in UTF-8 (ArgumentError).
>
> This is due to a call to str.downcase!() later in the application.
>
> Any idea to solve this ? :-)

You probably first want to find out whether the byte sequence is valid UTF-8 or not. For that you would need to look at the bytes in the String. I guess chances are that your String's byte sequence is NOT valid UTF-8 OR you have a character in the string that has no lowercase representation.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
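One way to follow Robert's suggestion is sketched below. The helper name utf8_report is hypothetical (it is not from the thread or from any library), and the sample strings are hand-built assumptions modelled on the "\xE8te pour luth" fragment in the original error message.

```ruby
# Hypothetical helper: reports whether the bytes of str form valid UTF-8
# and lists them in hex. It works on a copy, so str itself is not modified.
def utf8_report(str)
  candidate = str.dup.force_encoding("UTF-8")
  {
    :valid => candidate.valid_encoding?,
    :hex   => str.bytes.map { |b| format("%02X", b) }.join(" ")
  }
end

# Assumption: a single 0xE8 byte, which is "è" in ISO-8859-1 but only the
# start of an incomplete multi-byte sequence in UTF-8.
suspect = "compl\xE8te".dup.force_encoding("ASCII-8BIT")

# The same word with "è" stored as the two UTF-8 bytes C3 A8.
genuine = "compl\xC3\xA8te".dup.force_encoding("ASCII-8BIT")

suspect_report = utf8_report(suspect)
genuine_report = utf8_report(genuine)

puts suspect_report[:valid]  # false
puts suspect_report[:hex]    # 63 6F 6D 70 6C E8 74 65
puts genuine_report[:valid]  # true
puts genuine_report[:hex]    # 63 6F 6D 70 6C C3 A8 74 65
```

In valid UTF-8, every byte of 0x80 or above is part of a multi-byte sequence, so a lone byte such as E8 between ASCII letters is a strong hint that the data is in a single-byte encoding instead.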
From: Xavier Noëlle on 23 Feb 2010 06:10
2010/2/2 Robert Klemme <shortcutter(a)googlemail.com>:
> You probably first want to find out whether the byte sequence is valid
> UTF-8 or not. For that you would need to look at the bytes in the
> String. I guess chances are that your String's byte sequence is NOT
> valid UTF-8 OR you have a character in the string that has no
> lowercase representation.
>
> Kind regards
>
> robert

I dug into the problem and ended up with this line:
self.force_encoding('UTF-8')

Believing that the string's #encoding was right was the wrong choice, so I then assumed that the database provided valid UTF8 strings.

BUT (because there's a but...), for some reason I don't understand, some strings are unwilling to work:

Example:
puts self => médicals
self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115

233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg. self.gsub('ruby', 'zorglub')) on this string leads to:
`gsub': invalid byte sequence in UTF-8 (ArgumentError).

Where am I wrong ?

TIA,

--
Xavier NOELLE
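The byte dump above can be checked directly. A single byte 233 (0xE9) is "é" in ISO-8859-1, whereas UTF-8 stores "é" as the two bytes 195 169, so these eight bytes are not valid UTF-8. The sketch below rebuilds the string from the posted bytes. Treating the data as ISO-8859-1 is an assumption (Windows-1252 uses the same byte for "é"), and the thread does not confirm which encoding the database connection actually delivers.

```ruby
# The bytes printed by each_byte in the post above.
bytes = [109, 233, 100, 105, 99, 97, 108, 115]

# Labelling them as UTF-8 does not make them valid UTF-8.
labelled = bytes.pack("C*").force_encoding("UTF-8")
puts labelled.valid_encoding?  # false

# Assumption: the bytes are ISO-8859-1. Label them as such first, then
# transcode to UTF-8. This time encode is the right tool, because the
# bytes really have to change.
fixed = labelled.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts fixed.valid_encoding?     # true
p fixed.bytes.to_a             # [109, 195, 169, 100, 105, 99, 97, 108, 115]

# String methods now work without raising.
replaced = fixed.gsub("ruby", "zorglub")
puts replaced == fixed         # true (nothing to replace)
```

If this matches the real data, the longer-term fix is probably the one David suggested: make sure the client connection to the database uses UTF-8, so the driver returns UTF-8 bytes in the first place.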