UTF8 hell [Ruby]

From: Stefano Crocco on 2 Feb 2010 07:54

On Tuesday 02 February 2010, Xavier Noëlle wrote:
> |Hello,
> |I'm trying to deal with Ruby flaws with encoding, which I thought
> |would be almost past with Ruby 1.9. I managed to find a solution for
> |Ruby 1.8 and thought I did for Ruby 1.9...but in fact, no !
> |
> |I fetch rows from an UTF8 database and try to work with the string. To
> |do so, I would like it to be UTF8 encoded.
> |
> |"str.encoding()" gives me "ASCII-8BIT"...so, I thought one of these
> |lines would solve the problem
> |str.replace(Iconv.iconv("UTF8", "ascii", self).join())
> |OR
> |self.encode!('UTF-8')
> |
> |But they don't !
> |First one: in `iconv': "\xE8te pour luth" (Iconv::IllegalSequence)
> |Second one: in `encode!': "\xE8" from ASCII-8BIT to UTF-8
> |(Encoding::UndefinedConversionError)
> |
> |The base string is "Oeuvre complète pour luth" and displays well in
> |PHPMyAdmin.
> |
> |Any idea ?
> |TIA,

I'm not sure, but based on my experience, it may be that the strings are
indeed stored as UTF-8, but the library you use to read from the database
doesn't tell Ruby so, and Ruby therefore treats each string as a generic
array of bytes (that is, Ruby gives the string the encoding ASCII-8BIT,
which is the same as BINARY).

If this is the case, you don't need to transcode the string (which is what
encode does), but simply tell Ruby which encoding is correct, using the
force_encoding method.
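
A minimal sketch of the difference between the two calls, assuming the driver really does hand back UTF-8 bytes tagged as binary. The string literal is a stand-in for the database row and the variable names are only illustrative:

```ruby
# Simulate what a database driver might return: UTF-8 bytes tagged as binary.
# (The literal below is an assumption standing in for the real row data.)
raw = "Oeuvre compl\xC3\xA8te pour luth".dup.force_encoding(Encoding::BINARY)

# force_encoding only changes the label; the bytes stay the same.
relabelled = raw.dup.force_encoding(Encoding::UTF_8)
puts relabelled.encoding                        # UTF-8
puts relabelled.valid_encoding?                 # true
puts relabelled.bytes.to_a == raw.bytes.to_a    # true

# encode tries to convert every byte, and binary bytes above 0x7F
# have no defined mapping to UTF-8, so the conversion raises.
transcode_error = begin
  raw.encode(Encoding::UTF_8)
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
puts transcode_error.class                      # Encoding::UndefinedConversionError
```

Relabelling is only safe when the bytes really are UTF-8; valid_encoding? is a cheap way to check that afterwards.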

I hope this helps

Stefano

From: David Palm on 2 Feb 2010 08:26

> I fetch rows from an UTF8 database and try to work with the string. To
> do so, I would like it to be UTF8 encoded.

There are several pieces to this. Even if the DB encoding and collation are utf8, double-check that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app, I think).

> self.encode!('UTF-8')

str.force_encoding('UTF-8') is what you want to use I think.

:)

From: Xavier Noëlle on 2 Feb 2010 09:12

2010/2/2 David Palm <dvdplm(a)gmail.com>:
> There are several pieces to this. Even if the DB encoding and collation are utf8, double-check that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app, I think).

Not a Rails app :-)

> str.force_encoding('UTF-8') is what you want to use I think.

I already tried this method, but it led me to the following error: in
`downcase!': invalid byte sequence in UTF-8 (ArgumentError).

This is due to a call to str.downcase!() later in the application.

Any idea to solve this ? :-)

--
Xavier NOELLE

From: Robert Klemme on 2 Feb 2010 09:48

2010/2/2 Xavier Noëlle <xavier.noelle(a)gmail.com>:
> 2010/2/2 David Palm <dvdplm(a)gmail.com>:
>> There are several pieces to this. Even if the DB encoding and collation are utf8, double-check that the client connection is utf8 as well ("encoding: utf8" in database.yml for a Rails app, I think).
>
> Not a Rails app :-)
>
>> str.force_encoding('UTF-8') is what you want to use I think.
>
> I already tried this method, but it led me to the following error: in
> `downcase!': invalid byte sequence in UTF-8 (ArgumentError).
>
> This is due to a call to str.downcase!() later in the application.
>
> Any idea to solve this ? :-)

You probably first want to find out whether the byte sequence is valid
UTF-8 or not. For that you would need to look at the bytes in the
String. I guess chances are that your String's byte sequence is NOT
valid UTF-8 OR you have a character in the string that has no
lowercase representation.
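
One way to do that check, as a rough sketch: utf8_report is a hypothetical helper (not from any library), and the two sample strings are assumptions built from the "\xE8" byte shown in the original error message.

```ruby
# Hypothetical helper: report whether a string's bytes form valid UTF-8,
# together with a hex dump of those bytes for inspection.
def utf8_report(str)
  candidate = str.dup.force_encoding(Encoding::UTF_8)
  {
    valid: candidate.valid_encoding?,
    hex:   str.bytes.map { |b| format("%02X", b) }.join(" ")
  }
end

# "è" as a single byte 0xE8 (as in the error message) versus the
# two-byte UTF-8 form 0xC3 0xA8.
single_byte = "compl\xE8te".dup.force_encoding(Encoding::BINARY)
two_bytes   = "compl\xC3\xA8te".dup.force_encoding(Encoding::BINARY)

report_single = utf8_report(single_byte)
report_two    = utf8_report(two_bytes)

puts report_single[:valid]   # false
puts report_single[:hex]     # 63 6F 6D 70 6C E8 74 65
puts report_two[:valid]      # true
puts report_two[:hex]        # 63 6F 6D 70 6C C3 A8 74 65
```

If the report says the bytes are not valid UTF-8, relabelling with force_encoding will not help, and the data has to be converted from whatever encoding it is actually in.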

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From: Xavier Noëlle on 23 Feb 2010 06:10

2010/2/2 Robert Klemme <shortcutter(a)googlemail.com>:
> You probably first want to find out whether the byte sequence is valid
> UTF-8 or not. For that you would need to look at the bytes in the
> String. I guess chances are that your String's byte sequence is NOT
> valid UTF-8 OR you have a character in the string that has no
> lowercase representation.
>
> Kind regards
>
> robert

I dug into the problem and ended up with this line: self.force_encoding('UTF-8')
Trusting what String#encoding reported was a mistake, so I now assume
that the database provides valid UTF-8 strings.

BUT (because, there's a but...), for some reason I don't understand,
some strings are unwilling to work:

Example:
puts self => médicals
self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115

233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
byte sequence in UTF-8 (ArgumentError).

Where am I wrong ?

TIA,

--
Xavier NOELLE
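
The byte dump above can be checked directly. In UTF-8, a byte of 233 (0xE9) on its own is not a complete character: it is the lead byte of a three-byte sequence and must be followed by two continuation bytes. The same byte is "é" in ISO-8859-1. The sketch below rebuilds the string from the posted bytes and, purely as an assumption about where the data came from, converts it from ISO-8859-1 to UTF-8:

```ruby
# Rebuild the string from the byte values posted above.
posted_bytes = [109, 233, 100, 105, 99, 97, 108, 115]
raw = posted_bytes.pack("C*")              # packed strings are tagged ASCII-8BIT

# Relabelling as UTF-8 does not make the bytes valid UTF-8.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding?               # false

# Assumption: the bytes are really ISO-8859-1. Label them as such,
# then transcode to UTF-8 so that "é" becomes the bytes 195 169.
fixed = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts fixed.valid_encoding?                 # true
puts fixed.bytes.to_a.inspect              # [109, 195, 169, 100, 105, 99, 97, 108, 115]

# String operations now work on the converted string.
replaced = fixed.gsub("ruby", "zorglub")
puts replaced == fixed                     # true (nothing to replace)
```

Whether ISO-8859-1 is the right source encoding depends on how the rows were written and on the client connection settings, so this is only one plausible reading of the dump.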
