From: Jürgen Exner on 3 Sep 2009 11:51

Ted Byers <r.ted.byers(a)gmail.com> wrote:
>Again, I am trying to automatically process data I receive by email,
>so I have no control over the data that is coming in.
>
>The data is supposed to be plain text/HTML, but there are quite a
>number of records where the contraction "rec'd" is misrepresented when
>written to standard out as "Rec\342\200\231d"
>
>When the data is written to a file, these characters are represented
>by the character ' when it is opened using notepad, but by the string
>''' when it is opened by open office.
>
>So how do I tell what character it is when in three different contexts
>it is displayed in three different ways?

By explicitly telling the displaying program the encoding that was used to create/save the file. In your case it very much looks like UTF-8.

>How can I make certain that
>when I either print it or store it in my DB, I get the correct
>"rec'd" (or, better, "received")?
>
>I suspect a minor glitch in the software that makes and send the email
>as this is the ONLY string where what ought to be an ascii ' character
>is identified as a wide character.

That's not a wide character. A wide character is something totally different.

>Regardless of how that happens (as
>I don't control that), I need to clean this. And it gets confusing
>when different applications handle the i18n differently (Notepad is
>undoubtedly using the OS i18n support and Open Office is handling it
>differently, and Emacs is doing it differently from both).

Yep. If the file doesn't contain information about its encoding, and the application doesn't support that encoding, misinterprets it, or cannot guess it correctly, then you will have to tell the application which encoding to use (or use a different application).

Does the file have a BOM? AFAIR Notepad uses the BOM to determine whether a file is in UTF-8, even though UTF-8 is a plain byte sequence with no byte-order ambiguity, so UTF-8 files typically neither have nor need a BOM.

jue
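To make the point about encodings concrete, here is a minimal Perl sketch (not part of the original thread). It assumes the three octets from the question and shows what they become when decoded as UTF-8 versus Windows code page 1252, plus a simple check for a UTF-8 BOM. The helper names `codepoints_as` and `has_utf8_bom` are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The octets from the question, written in hex instead of octal
# (\342\200\231 is the same as \xE2\x80\x99).
my $octets = "Rec\xE2\x80\x99d";

# Hypothetical helper: list the code points a byte string yields
# when it is interpreted in the given encoding.
sub codepoints_as {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $chars;
}

# Hypothetical helper: 1 if the byte string starts with the
# UTF-8 byte order mark (EF BB BF), otherwise 0.
sub has_utf8_bom {
    my ($bytes) = @_;
    return $bytes =~ /\A\xEF\xBB\xBF/ ? 1 : 0;
}

print 'UTF-8:  ', codepoints_as('UTF-8',  $octets), "\n";
print 'cp1252: ', codepoints_as('cp1252', $octets), "\n";
print 'BOM?    ', (has_utf8_bom($octets) ? 'yes' : 'no'), "\n";
```

Read as UTF-8, the three octets collapse into the single code point U+2019. Read as cp1252, they become three separate characters (U+00E2, U+20AC, U+2122), which is the kind of three-character garbage a program shows when it guesses the wrong encoding.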
From: sln on 3 Sep 2009 19:07

On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <r.ted.byers(a)gmail.com> wrote:
>Again, I am trying to automatically process data I receive by email,
>so I have no control over the data that is coming in.
>
>The data is supposed to be plain text/HTML, but there are quite a
>number of records where the contraction "rec'd" is misrepresented when
>written to standard out as "Rec\342\200\231d"
>
>When the data is written to a file, these characters are represented
>by the character ' when it is opened using notepad, but by the string
>''' when it is opened by open office.
>
>So how do I tell what character it is when in three different contexts
>it is displayed in three different ways? How can I make certain that
>when I either print it or store it in my DB, I get the correct
>"rec'd" (or, better, "received")?
>
>I suspect a minor glitch in the software that makes and send the email
>as this is the ONLY string where what ought to be an ascii ' character
>is identified as a wide character. Regardless of how that happens (as
>I don't control that), I need to clean this. And it gets confusing
>when different applications handle the i18n differently (Notepad is
>undoubtedly using the OS i18n support and Open Office is handling it
>differently, and Emacs is doing it differently from both).
>
>A little enlightenment would be appreciated.
>
>Thanks
>
>Ted

What you have there is a UTF-8 encoded character with code point \x{2019}. It is NOT an ASCII single quote, but a Unicode curly single quote (right).

See this table and this web site:

  copyright sign                          00A9   \u00A9
  registered sign                         00AE   \u00AE
  trademark sign                          2122   \u2122
  em-dash                                 2014   \u2014
  euro sign                               20AC   \u20AC
  curly single quotation mark (left)      2018   \u2018
  curly single quotation mark (right)     2019   \u2019
  curly double quotation mark (left)      201C   \u201C
  curly double quotation mark (right)     201D   \u201D

http://moock.org/asdg/technotes/usingSpecialCharacters/

By the way, it displays fine in Notepad and Word. It is not ASCII, so you need a font and an application that can display UTF-8 characters.

If you want to convert these special characters, use a regex to strip them from your data. Before you do that, note that the text apparently arrives as raw octets, 'Rec\342\200\231d', which first need to be decoded from UTF-8 into Perl characters. After that you can use code points in the regex.

You can strip these after you decode. Something like this:

  $str = decode ('utf8', "your received string");  # utf-8 octets
  $str =~ s/\x{2018}/'/g;
  $str =~ s/\x{2019}/'/g;
  $str =~ s/\x{201C}/"/g;
  $str =~ s/\x{201D}/"/g;

etc, ...

Find a more efficient way to do the substitutions though. See below for an example.

-sln

===========================

  use strict;
  use warnings;
  use Encode;

  my $str  = decode ('utf8', "Rec\342\200\231d");  # utf-8 octets
  my $data = "Rec\x{2019}d";                       # Unicode Code Point

  if ($str eq $data) {
      print "yes they're equal\n";
  }

  # Write through an encoding layer so the code point is stored as
  # UTF-8 octets (avoids a "Wide character in print" warning).
  open my $fh, '>:encoding(UTF-8)', 'chr1.txt'
      or die "can't open chr1.txt: $!";
  print $fh $data;
  close $fh;

  exit;

  # Return each character of a string followed by its code point in hex.
  sub ordsplit
  {
      my $string = shift;
      my $buf = '';
      for (map {ord $_} split //, $string) {
          $buf .= sprintf ("%c %02x ", $_, $_);
      }
      return $buf;
  }

  __END__
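One possible "more efficient way", sketched here as an assumption rather than as what the poster had in mind: since each curly quote maps to exactly one ASCII character, Perl's `tr///` can do all four replacements in a single pass over the string. The sub name `ascii_quotes` is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 octets into characters, then map
# the four curly quotes to their ASCII look-alikes in one pass.
sub ascii_quotes {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);
    # Search list: U+2018 U+2019 U+201C U+201D; replacement list: ' ' " "
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

print ascii_quotes("Rec\342\200\231d"), "\n";
```

`tr///` only handles one-character-to-one-character mappings, so it would not fit a replacement such as turning a trademark sign into "(TM)"; a hash lookup inside `s///` is the usual choice for that.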
From: sln on 3 Sep 2009 20:22

On Thu, 03 Sep 2009 16:07:07 -0700, sln(a)netherlands.com wrote:
>On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <r.ted.byers(a)gmail.com> wrote:
>
>You can strip these after you decode. Something like this:
>
>$str = decode ('utf8', "your received string"); # utf-8 octets
>$str =~ s/\x{2018}/'/g;
>$str =~ s/\x{2019}/'/g;
>$str =~ s/\x{201C}/"/g;
>$str =~ s/\x{201D}/"/g;
>
>etc, ...
>

-sln

------------------

  use strict;
  use warnings;
  use Encode;

  binmode (STDOUT, ':utf8');

  my $str  = decode ('utf8', "Rec\342\200\231d");  # utf8 octets
  my $data = "Rec\x{2019}d";                       # Unicode Code Point

  if ($str eq $data) {
      print "yes they're equal\n";
  }
  print ordsplit($data), "\n";

  # Substitute select Unicode to ascii equivalent
  my %unisub = (
      "\x{2018}" => "'",
      "\x{2019}" => "'",
      "\x{201C}" => '"',
      "\x{201D}" => '"',
  );
  $str =~ s/$_/$unisub{$_}/ge for keys (%unisub);
  print $str, "\n";

  # OR -- Substitute all Unicode code points, 100 - 10FFFF (the top of
  # the Unicode range), with ? character. Latin-1 characters 80 - FF
  # are left alone.
  $data =~ s/[\x{100}-\x{10FFFF}]/?/g;
  print $data, "\n";

  exit;

  # Return each character of a string followed by its code point in hex.
  sub ordsplit
  {
      my $string = shift;
      my $buf = '';
      for (map {ord $_} split //, $string) {
          $buf .= sprintf ("%c %02x ", $_, $_);
      }
      return $buf;
  }

  __END__

output:

  yes they're equal
  R 52 e 65 c 63 ’ 2019 d 64
  Rec'd
  Rec?d
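As a variation on the hash approach above, the following sketch (an assumption, not code from the thread) compiles the table keys into one character class so the string is scanned once. Instead of blindly replacing everything else with "?", it reports which non-ASCII code points are still unmapped, so the table can be extended. The names `%ascii_for`, `$quote_re` and `clean_and_report` are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same four replacements as in the post above.
my %ascii_for = (
    "\x{2018}" => "'",
    "\x{2019}" => "'",
    "\x{201C}" => '"',
    "\x{201D}" => '"',
);

# One character class built from the table keys, compiled once.
my $class    = join '', map { quotemeta($_) } keys %ascii_for;
my $quote_re = qr/([$class])/;

# Hypothetical helper: returns the cleaned text followed by a sorted
# list of the non-ASCII code points that the table did not cover.
sub clean_and_report {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);
    $text =~ s/$quote_re/$ascii_for{$1}/g;

    my %seen;
    $seen{$1}++ while $text =~ /([^\x00-\x7F])/g;
    my @leftover = map { sprintf('U+%04X', ord($_)) } sort keys %seen;
    return ($text, @leftover);
}

# Sample input: a curly apostrophe plus a trademark sign (not in the table).
my ($clean, @unknown) = clean_and_report("Rec\342\200\231d \342\204\242");
binmode(STDOUT, ':encoding(UTF-8)');
print "$clean\n";
print "unmapped: @unknown\n";
```

For the sample input the apostrophe is converted and the trademark sign is reported as U+2122 rather than silently turned into "?". Whether to replace, report, or reject such characters depends on what the database is expected to hold.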