UTF8 hell [Ruby]

From: Marc Heiler on 23 Feb 2010 09:03

How does python solve this?

From: Yukihiro Matsumoto on 23 Feb 2010 09:41

Hi,

In message "Re: [ENCODING] UTF8 hell"
on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle(a)gmail.com> writes:

|self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
|
|233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
|self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
|byte sequence in UTF-8 (ArgumentError).

233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.

matz.

From: Rick DeNatale on 23 Feb 2010 10:03

On Tue, Feb 23, 2010 at 9:41 AM, Yukihiro Matsumoto <matz(a)ruby-lang.org> wrote:
> Hi,
>
> In message "Re: [ENCODING] UTF8 hell"
> on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle(a)gmail.com> writes:
>
> |self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
> |
> |233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
> |self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
> |byte sequence in UTF-8 (ArgumentError).
>
> 233 is not a valid UTF-8 character. The byte sequence for médicals is
> <109 195 169 100 105 99 97 108 115>.

233 for e accent acute would be valid for ISO-8859-1 encoding, not UTF-8.
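
To make the difference concrete, here is a minimal sketch (assuming Ruby 1.9 or later; the variable names are only illustrative) that prints the bytes of the same word in both encodings and shows that the ISO-8859-1 bytes, once labelled as UTF-8, are rejected:

```ruby
# "médicals", written with an escape so the literal is UTF-8 regardless of
# the source file's encoding.
word = "m\u00E9dicals"

utf8_bytes   = word.bytes.to_a
latin1_bytes = word.encode("ISO-8859-1").bytes.to_a

p utf8_bytes    # [109, 195, 169, 100, 105, 99, 97, 108, 115]
p latin1_bytes  # [109, 233, 100, 105, 99, 97, 108, 115]

# The same ISO-8859-1 bytes, wrongly labelled as UTF-8, are not a valid
# sequence; this is the state that makes gsub raise ArgumentError.
mislabelled = latin1_bytes.pack("C*").force_encoding("UTF-8")
p mislabelled.valid_encoding?   # false
```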

--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

From: Xavier Noëlle on 23 Feb 2010 10:18

2010/2/23 Yukihiro Matsumoto <matz(a)ruby-lang.org>:
> 233 is not a valid UTF-8 character. The byte sequence for médicals is
> <109 195 169 100 105 99 97 108 115>.

Indeed. In the meantime, I replaced the code with this:
def isUTF8()
  begin
    self.unpack('U*')
  rescue
    return false
  end
  return true
end

if isUTF8()
  self.force_encoding('UTF-8')
else
  self.force_encoding('ISO-8859-1')
  self.encode!('UTF-8')
end

This (ugly) quick fix works for what I need, but I don't know whether
the problem can be solved some other way. The underlying issue is that
my SQL database has a VARBINARY column whose encoding is unknown. Is
there a way to deal with the various possible encodings, or to ask MySQL
to return data converted to UTF-8, or is it necessary to clean the data
before inserting it?
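
One possible variation on the quick fix, sketched under assumptions: it relies on String#valid_encoding? (Ruby 1.9 or later) instead of rescuing from unpack, and it returns a new string rather than modifying self. The helper name to_utf8 and the ISO-8859-1 fallback are hypothetical choices for illustration, not something the thread settles on.

```ruby
# Hypothetical helper: returns a UTF-8 copy of a string holding raw bytes.
# Assumption: anything that is not valid UTF-8 is in the fallback encoding.
def to_utf8(raw, fallback = "ISO-8859-1")
  candidate = raw.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  # Not valid UTF-8: relabel the bytes as the fallback encoding, then transcode.
  raw.dup.force_encoding(fallback).encode("UTF-8")
end

# The two byte sequences discussed in this thread, as raw binary strings.
latin1 = [109, 233, 100, 105, 99, 97, 108, 115].pack("C*")
utf8   = [109, 195, 169, 100, 105, 99, 97, 108, 115].pack("C*")

p to_utf8(latin1) == to_utf8(utf8)   # true
```

Like the original, this only guesses: bytes from any other single-byte encoding would still be decoded as if they were ISO-8859-1.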

--
Xavier NOELLE

From: Jörg W Mittag on 23 Feb 2010 11:53

Yukihiro Matsumoto wrote:
> In message "Re: [ENCODING] UTF8 hell"
> on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle(a)gmail.com> writes:
>|self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
>|
>|233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
>|self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
>|byte sequence in UTF-8 (ArgumentError).
> 233 is not a valid UTF-8 character. The byte sequence for médicals is
> <109 195 169 100 105 99 97 108 115>.

A general hint for debugging encoding troubles: the UTF-8 encoding
*guarantees* that every Unicode codepoint is *either* encoded into a
*single* octet with its most significant bit cleared to 0 (i.e. a
decimal value between 0 and 127) *or* into a *sequence* of 2 to 4
octets (the original design allowed up to 6, but RFC 3629 limits it to
4), *all* of which have their MSB set to 1 (i.e. a decimal value
between 128 and 255).
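
This rule can be checked on a few sample characters. The following is a minimal sketch, assuming Ruby 1.9 or later; the sample characters are arbitrary picks with 1, 2, 3 and 4 octet encodings:

```ruby
# "m", "é" (U+00E9), "€" (U+20AC) and a musical G clef (U+1D11E), written
# with escapes so the literals are UTF-8 strings.
samples = ["m", "\u00E9", "\u20AC", "\u{1D11E}"]

samples.each do |ch|
  bytes = ch.bytes.to_a
  # Either one octet below 128, or several octets that are all 128 or above.
  rule_holds =
    if bytes.size == 1
      bytes[0] < 128
    else
      bytes.all? { |b| b >= 128 }
    end
  puts format("U+%04X -> %s (rule holds: %s)", ch.ord, bytes.inspect, rule_holds)
end
```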

A *single* octet with its MSB set to 1 can *never* be a valid UTF-8
character; it can only be part of a multi-octet character, i.e. it
must appear immediately before or after another octet with its MSB
set. However, in your string there is no multi-octet sequence, there
is only a single octet with its MSB set (the second one, with the
decimal value 233), so you can see without having to look at any code
tables that this string *cannot* possibly be a UTF-8 string.
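
The check described above can be sketched in a few lines. The helper name lone_high_octets is hypothetical, and the test is only a necessary condition: a string with no lone high octets can still be invalid UTF-8, so String#valid_encoding? remains the authoritative check.

```ruby
# Hypothetical helper: returns the indexes of octets with the MSB set
# (value >= 128) whose neighbours both have the MSB cleared or do not exist.
def lone_high_octets(bytes)
  bytes.each_index.select do |i|
    next false if bytes[i] < 128

    prev_high = i > 0 && bytes[i - 1] >= 128
    next_high = i + 1 < bytes.size && bytes[i + 1] >= 128
    !prev_high && !next_high
  end
end

# The string from the original post: octet 233 at index 1 stands alone.
p lone_high_octets([109, 233, 100, 105, 99, 97, 108, 115])
# The UTF-8 encoding of the same word: 195 and 169 are adjacent.
p lone_high_octets([109, 195, 169, 100, 105, 99, 97, 108, 115])
```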

As Rick already hinted, it is either an ISO-8859-1, ISO-8859-2,
ISO-8859-3, ISO-8859-4, ISO-8859-9, ISO-8859-10, ISO-8859-13,
ISO-8859-14, ISO-8859-15, ISO-8859-16 or Windows-1252 string (it's
impossible to tell which, but it makes no difference in this case).
My guess is ISO-8859-15.

[This property is BTW what makes UTF-8 compatible with ASCII, because
it guarantees that *every* Unicode character which is also in ASCII,
will be encoded the same way as it would be in ASCII and every Unicode
character which is *not* in ASCII will be encoded as a sequence of
octets each of which is illegal in ASCII. It also provides some
robustness against 8-bit encodings such as the ISO-8859 family, because
statistically it is very likely that *somewhere* in the text, there
will be a single octet with its MSB set (in this case, it's the é and
in my name it's the ö), which is surrounded by octets with their MSB
cleared, which cannot ever happen in UTF-8.]

jwm
