From: Marc Heiler on 23 Feb 2010 09:03 How does python solve this? -- Posted via http://www.ruby-forum.com/.
From: Yukihiro Matsumoto on 23 Feb 2010 09:41

Hi,

In message "Re: [ENCODING] UTF8 hell" on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle(a)gmail.com> writes:

|self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
|
|233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
|self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
|byte sequence in UTF-8 (ArgumentError).

233 is not a valid UTF-8 character. The byte sequence for médicals is
<109 195 169 100 105 99 97 108 115>.

matz.
From: Rick DeNatale on 23 Feb 2010 10:03

On Tue, Feb 23, 2010 at 9:41 AM, Yukihiro Matsumoto <matz(a)ruby-lang.org> wrote:
> Hi,
>
> In message "Re: [ENCODING] UTF8 hell"
> on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle(a)gmail.com> writes:
>
> |self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
> |
> |233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
> |self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
> |byte sequence in UTF-8 (ArgumentError).
>
> 233 is not a valid UTF-8 character. The byte sequence for médicals is
> <109 195 169 100 105 99 97 108 115>.

233 for e accent acute would be valid for ISO-8859-1 encoding, not UTF-8.

--
Rick DeNatale
Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale
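The two byte sequences quoted above can be reproduced directly. This is only a small sketch for Ruby 1.9 or later; the constant names are made up for the illustration and do not come from the thread.

```ruby
# "médicals", written with a \u escape so the result does not depend on
# the encoding of this source file.
WORD = "m\u00E9dicals"

UTF8_BYTES   = WORD.encode("UTF-8").bytes.to_a
LATIN1_BYTES = WORD.encode("ISO-8859-1").bytes.to_a

p UTF8_BYTES    # [109, 195, 169, 100, 105, 99, 97, 108, 115]
p LATIN1_BYTES  # [109, 233, 100, 105, 99, 97, 108, 115]

# The 8-octet ISO-8859-1 form, wrongly labelled as UTF-8, is the kind of
# string that made gsub raise in the original report.
MISLABELLED = LATIN1_BYTES.pack("C*").force_encoding("UTF-8")
p MISLABELLED.valid_encoding?  # false
```

The é is one octet (233) in ISO-8859-1 and two octets (195 169) in UTF-8, which is why the string from the database has 8 bytes rather than 9.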
From: Xavier Noëlle on 23 Feb 2010 10:18

2010/2/23 Yukihiro Matsumoto <matz(a)ruby-lang.org>:
> 233 is not a valid UTF-8 character. The byte sequence for médicals is
> <109 195 169 100 105 99 97 108 115>.

Indeed. In the meantime, I changed the code with this one:

  def isUTF8()
    begin
      self.unpack('U*')
    rescue
      return false
    end
    return true
  end

  if isUTF8()
    self.force_encoding('UTF-8')
  else
    self.force_encoding('ISO-8859-1')
    self.encode!('UTF-8')
  end

This (ugly) quickfix works for what I need, but I don't know if this problem can be somehow resolved in another way. The problem being that my SQL database has a VARBINARY column with an unknown encoding. Is there a way to deal with the various possible encodings or to ask MySQL to return UTF8 converted data, or is it necessary to clean data before inserting them?

--
Xavier NOELLE
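For what it's worth, the same quickfix can probably be written without the unpack/rescue trick by asking Ruby 1.9 whether the bytes are valid UTF-8. A minimal sketch follows; the helper name to_utf8 and the ISO-8859-1 fallback are assumptions for the illustration, and the fallback is only a guess about the real data, not something the thread establishes.

```ruby
# Hypothetical helper (not from the thread): returns a UTF-8 copy of `raw`.
# If the bytes are already valid UTF-8 they are kept as they are; otherwise
# they are assumed to be in `fallback` and transcoded to UTF-8.
def to_utf8(raw, fallback = "ISO-8859-1")
  candidate = raw.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  # Assumption: the data is in the fallback encoding. Every octet is a
  # legal ISO-8859-1 character, so this step never raises for that
  # default, but it may produce the wrong characters if the guess is wrong.
  raw.dup.force_encoding(fallback).encode("UTF-8")
end

latin1 = [109, 233, 100, 105, 99, 97, 108, 115].pack("C*")
p to_utf8(latin1).bytes.to_a  # [109, 195, 169, 100, 105, 99, 97, 108, 115]
```

This works on a copy instead of changing the receiver in place, so the original bytes from the VARBINARY column stay available if the guess has to be revisited.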
From: Jörg W Mittag on 23 Feb 2010 11:53

Yukihiro Matsumoto wrote:
> In message "Re: [ENCODING] UTF8 hell"
> on Tue, 23 Feb 2010 20:10:20 +0900, Xavier Noëlle <xavier.noelle(a)gmail.com> writes:
>|self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
>|
>|233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
>|self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
>|byte sequence in UTF-8 (ArgumentError).
> 233 is not a valid UTF-8 character. The byte sequence for médicals is
> <109 195 169 100 105 99 97 108 115>.

A general hint for debugging encoding troubles: the UTF-8 encoding *guarantees* that every Unicode codepoint is *either* encoded into a *single* octet with its most significant bit cleared to 0 (i.e. a decimal value between 0 and 127) *or* into a *sequence* of 2 to 4 octets (RFC 3629; the original design allowed up to 6), *all* of which have their MSB set to 1 (i.e. a decimal value between 128 and 255). A *single* octet with its MSB set to 1 can *never* be a valid UTF-8 character, it can only be part of a multi-octet character, i.e. it must sit directly next to at least one other octet with its MSB set.

However, in your string there is no multi-octet character sequence, there is only a single octet with its MSB set (the second one with the decimal value 233), so you can see without having to look at any code tables that this string *cannot* possibly be a UTF-8 string.

As Rick already hinted, it is either an ISO/IEC 8859-1, ISO/IEC 8859-2, ISO/IEC 8859-3, ISO/IEC 8859-4, ISO/IEC 8859-9, ISO/IEC 8859-10, ISO/IEC 8859-13, ISO/IEC 8859-14, ISO/IEC 8859-15, ISO/IEC 8859-16, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, ISO-8859-16 or Windows-1252 string (it's impossible to tell, but makes no difference in this case). My guess is on ISO-8859-15.
[This property is BTW what makes UTF-8 compatible with ASCII, because it guarantees that *every* Unicode character which is also in ASCII will be encoded the same way as it would be in ASCII, and every Unicode character which is *not* in ASCII will be encoded as a sequence of octets each of which is illegal in ASCII. It also provides some robustness against 8-bit encodings such as the ISO 8859 family, because statistically it is very likely that *somewhere* in the text there will be a single octet with its MSB set (in this case, it's the é and in my name it's the ö) which is surrounded by octets with their MSB cleared, which cannot ever happen in UTF-8.]

jwm
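The "lonely octet" rule described above can be turned into a quick check. This is only a rough sketch; the method name isolated_high_octet? is invented for the illustration. It only finds strings that fail this one necessary condition, so a false result does not prove the data is UTF-8 (String#valid_encoding? in Ruby 1.9 does the complete check).

```ruby
# Hypothetical helper: true if some octet with its MSB set (value >= 128)
# has no neighbouring octet that also has its MSB set. Such an octet can
# never occur in well-formed UTF-8.
def isolated_high_octet?(raw)
  bytes = raw.bytes.to_a
  bytes.each_with_index.any? do |b, i|
    next false if b < 128
    prev_high = i > 0 && bytes[i - 1] >= 128
    next_high = i + 1 < bytes.size && bytes[i + 1] >= 128
    !(prev_high || next_high)
  end
end

p isolated_high_octet?([109, 233, 100, 105, 99, 97, 108, 115].pack("C*"))       # true
p isolated_high_octet?([109, 195, 169, 100, 105, 99, 97, 108, 115].pack("C*"))  # false
```

The first call is the string from the original report, where 233 stands alone between 109 and 100. The second is the UTF-8 form, where 195 and 169 sit next to each other.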