From: Perry Smith on 23 Feb 2010 12:20

> A general hint for debugging encoding troubles: the UTF-8 encoding
> *guarantees* that every Unicode codepoint is *either* encoded into a
> *single* octet with its most significant bit cleared to 0 (i.e. a
> decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
> octets, *all* of which have their MSB set to 1 (i.e. a decimal value
> between 128 and 255).

Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
or 6 but not 3 or 5 octets?

--
Posted via http://www.ruby-forum.com/.
From: Jörg W Mittag on 23 Feb 2010 16:39

Perry Smith wrote:
>> A general hint for debugging encoding troubles: the UTF-8 encoding
>> *guarantees* that every Unicode codepoint is *either* encoded into a
>> *single* octet with its most significant bit cleared to 0 (i.e. a
>> decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
>> octets, *all* of which have their MSB set to 1 (i.e. a decimal value
>> between 128 and 255).
> Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
> or 6 but not 3 or 5 octets?

Nope.

First off: I was wrong, the longest encoding is actually 4 octets, not
6. (I was confused by the algorithm: the algorithm actually allows for
up to 8 bytes, but because of the way Unicode characters are allocated,
and UTF-8 is defined, it is guaranteed that there will never be more
than 4.)

The encodings look like this:

    0xxxxxxx                             for ASCII (U+0000 to U+007F)
    110xxxxx 10xxxxxx                    for U+0080 to U+07FF
    1110xxxx 10xxxxxx 10xxxxxx           for U+0800 to U+FFFF
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  for U+10000 to U+10FFFF

This is actually pretty clever:

* you can always tell whether you are inside a multibyte sequence or
  not because of the high bit,
* you can always tell whether a byte in the sequence is the first one
  or a later one, because the first one always starts with 11 and the
  other ones always start with 10 and
* you can always tell how long a sequence is by the number of leading
  1 bits in the start byte: two-byte sequences start with two 1s,
  three-byte sequences start with three 1s and four-byte sequences
  start with four 1s.

This means that you can usually re-synchronize pretty easily from the
middle of a corrupted network transmission, for example. You can also
jump over bytes if you are counting the length.

jwm
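The bit patterns described in this post can be checked with a few lines of Ruby. This is only a rough sketch: the helper names `utf8_sequence_length`, `continuation_byte?` and `resync` are made up for illustration, and the length check looks at the leading bits only, so it does not reject lead bytes that strict UTF-8 forbids (overlong forms, or values past U+10FFFF).

```ruby
# Hypothetical helpers for illustration; the names are not from the thread.

# Length of a UTF-8 sequence judged from its first byte alone.
# Returns nil for a continuation byte or a byte that cannot start a sequence.
def utf8_sequence_length(byte)
  if    (byte & 0b1000_0000) == 0b0000_0000 then 1 # 0xxxxxxx
  elsif (byte & 0b1110_0000) == 0b1100_0000 then 2 # 110xxxxx
  elsif (byte & 0b1111_0000) == 0b1110_0000 then 3 # 1110xxxx
  elsif (byte & 0b1111_1000) == 0b1111_0000 then 4 # 11110xxx
  end
end

# True for the "later" bytes of a multibyte sequence (10xxxxxx).
def continuation_byte?(byte)
  (byte & 0b1100_0000) == 0b1000_0000
end

# Re-synchronize: drop leading continuation bytes until a start byte appears.
def resync(bytes)
  bytes.drop_while { |b| continuation_byte?(b) }
end

# "a", U+00E9, U+20AC and U+1D11E take 1, 2, 3 and 4 bytes respectively.
bytes = "a\u00E9\u20AC\u{1D11E}".bytes
# => [97, 195, 169, 226, 130, 172, 240, 157, 132, 158]

# Pretend the stream was cut in the middle of the three-byte sequence.
tail = resync(bytes.drop(4))
# => [240, 157, 132, 158]
utf8_sequence_length(tail.first) # => 4
```

Because start bytes and continuation bytes never share a bit pattern, `resync` never has to look at more than three bytes before it finds the next character boundary in well-formed input.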
From: Robert Klemme on 23 Feb 2010 17:05

On 23.02.2010 12:10, Xavier Noëlle wrote:
> 2010/2/2 Robert Klemme <shortcutter(a)googlemail.com>:
>> You probably first want to find out whether the byte sequence is valid
>> UTF-8 or not. For that you would need to look at the bytes in the
>> String. I guess chances are that your String's byte sequence is NOT
>> valid UTF-8 OR you have a character in the string that has no
>> lowercase representation.
>
> I dug into the problem and ended up with this line:
>
>   self.force_encoding('UTF-8')
>
> Believing that the string #encoding was right was a wrong choice, then
> I assumed the database provided valid UTF8 strings.

The string you show below does not look like UTF-8 encoded, probably
rather ISO-8859-1 or such. If you enforce an encoding you leave the
byte sequence untouched. This leads to the kind of error you describe
below.

> BUT (because, there's a but...), for some reason I don't understand,
> some strings are unwilling to work:
>
> Example:
> puts self => médicals
> self.each_byte {|b| print "#{b} "} => 109 233 100 105 99 97 108 115
>
> 233 is, AFAIK, a valid UTF8 character, but calling gsub(anything) (eg.
> self.gsub('ruby', 'zorglub')) on this string leads to: `gsub': invalid
> byte sequence in UTF-8 (ArgumentError).
>
> Where am I wrong ?

As far as I can see 233 starts a three-byte sequence:

http://en.wikipedia.org/wiki/UTF-8#Description

I did not dig deeper but it may be that by forcing UTF-8 on an ISO
something encoded string you broke it.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
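The difference between relabelling a string and transcoding it can be seen with the byte values quoted in this post. The snippet below is a minimal sketch and assumes, as the reply suspects, that the data is really ISO-8859-1; the variable names are illustrative only.

```ruby
# Byte values quoted in the post; 233 is "é" in ISO-8859-1 (an assumption
# about where the data came from, see the lead-in).
bytes = [109, 233, 100, 105, 99, 97, 108, 115]
raw = bytes.pack('C*') # binary string holding exactly these bytes

# force_encoding only changes the label; the bytes stay as they are.
forced = raw.dup.force_encoding('UTF-8')
forced.bytes == bytes  # => true
forced.valid_encoding? # => false (233 announces a three-byte sequence,
                       #    but 100 is not a continuation byte)

# Label the bytes with the encoding they really are in, then transcode.
latin1 = raw.dup.force_encoding('ISO-8859-1')
latin1.valid_encoding? # => true
utf8 = latin1.encode('UTF-8')
utf8.bytes             # => [109, 195, 169, 100, 105, 99, 97, 108, 115]
utf8.valid_encoding?   # => true
```

`encode` is the call that rewrites the byte sequence: the single byte 233 becomes the two-byte sequence 195 169.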
From: Michael Fellinger on 23 Feb 2010 23:12

On Wed, Feb 24, 2010 at 12:18 AM, Xavier Noëlle <xavier.noelle(a)gmail.com> wrote:
> 2010/2/23 Yukihiro Matsumoto <matz(a)ruby-lang.org>:
>> 233 is not a valid UTF-8 character. The byte sequence for médicals is
>> <109 195 169 100 105 99 97 108 115>.
>
> Indeed. In the meantime, I changed the code with this one:
> def isUTF8()
>   begin
>     self.unpack('U*')
>   rescue
>     return false
>   end
>   return true
> end
>
> if isUTF8()
>   self.force_encoding('UTF-8')
> else
>   self.force_encoding('ISO-8859-1')
>   self.encode!('UTF-8')
> end

  string = "\xE8te pour luth"
  # "\xE8te pour luth"
  string.encoding
  # #<Encoding:UTF-8>
  string.valid_encoding?
  # false
  string.force_encoding('ISO-8859-1')
  # "ète pour luth"
  string.valid_encoding?
  # true
  string.upcase
  # "èTE POUR LUTH"

> This (ugly) quickfix works for what I need, but I don't know if this
> problem can be somehow resolved in another way. The problem being that
> my SQL database has a VARBINARY column with an unknown encoding. Is
> there a way to deal with the various possible encoding or to ask MySQL
> to return UTF8 converted data, or is it necessary to clean data before
> inserting them ?
>
> --
> Xavier NOELLE

--
Michael Fellinger
CTO, The Rubyists, LLC
972-996-5199
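Putting the two ideas from this post together, the quick fix could be sketched with `valid_encoding?` instead of rescuing `unpack('U*')`. The helper name `to_utf8` and its `fallback` parameter are hypothetical, and the choice of ISO-8859-1 as the fallback is an assumption about the data rather than something Ruby can detect.

```ruby
# Hypothetical helper, not from the thread: return a UTF-8 copy of a string
# whose real encoding is unknown.
# Assumption: whatever is not valid UTF-8 is encoded in `fallback`.
def to_utf8(str, fallback = Encoding::ISO_8859_1)
  candidate = str.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Relabel the untouched bytes as the fallback encoding, then transcode.
  str.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# String#b returns a binary copy, much like data read from a VARBINARY column.
to_utf8("\xE8te pour luth".b)   # Latin-1 bytes are transcoded to UTF-8
to_utf8("m\xC3\xA9dicals".b)    # bytes that are already UTF-8 are only relabelled
```

Every byte sequence is valid ISO-8859-1, so the fallback branch never raises; that also means it cannot notice when the data was in some other single-byte encoding, and the result should be treated as a best guess.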