Data cleaning issue involving bad wide characters in what ought to be ascii data [Perl]


From: Jürgen Exner on 3 Sep 2009 11:51

Ted Byers <r.ted.byers(a)gmail.com> wrote:
>Again, I am trying to automatically process data I receive by email,
>so I have no control over the data that is coming in.
>
>The data is supposed to be plain text/HTML, but there are quite a
>number of records where the contraction "rec'd" is misrepresented when
>written to standard out as "Rec\342\200\231d"
>
>When the data is written to a file, these characters are represented
>by the character ’ when it is opened using notepad, but by the string
>â€™ when it is opened by open office.
>
>So how do I tell what character it is when in three different contexts
>it is displayed in three different ways?

By explicitly telling the displaying program the encoding that was used
to create/save the file. In your case it very much looks like UTF-8.
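
A minimal sketch of why the same three octets can show up differently. It assumes the second program falls back to Windows-1252; that is a guess, since nothing above says which encoding Open Office actually picked. The helper name code_points is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets with the named encoding and
# return the resulting characters as a list of U+XXXX strings.
sub code_points {
    my ($encoding, $octets) = @_;
    my $chars = decode($encoding, $octets);
    return map { sprintf 'U+%04X', ord } split //, $chars;
}

# The three octets from the original post, written as octal escapes.
my $octets = "\342\200\231";

# Read as UTF-8, the three octets form one character.
my @as_utf8 = code_points('UTF-8', $octets);

# Read as Windows-1252 (assumed fallback), each octet is its own character.
my @as_cp1252 = code_points('cp1252', $octets);

print "UTF-8:  @as_utf8\n";     # UTF-8:  U+2019
print "cp1252: @as_cp1252\n";   # cp1252: U+00E2 U+20AC U+2122
```

The octets never change; only the decoding rule applied by the displaying program does.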

>How can I make certain that
>when I either print it or store it in my DB, I get the correct
>"rec'd" (or, better, "received")?
>
>I suspect a minor glitch in the software that makes and send the email
>as this is the ONLY string where what ought to be an ascii ' character
>is identified as a wide character.

That's not a wide character. A wide character is something totally
different.

>Regardless of how that happens (as
>I don't control that), I need to clean this. And it gets confusing
>when different applications handle the i18n differently (Notepad is
>undoubtedly using the OS i18n support and Open Office is handling it
>differently, and Emacs is doing it differently from both).

Yep. If the file doesn't contain information about the encoding and/or
the application either doesn't support this encoding or misinterprets it
or cannot guess the encoding correctly then you will have to tell the
application which encoding to use (or use a different application).

Does the file have a BOM? AFAIR Notepad uses the BOM to determine whether
a file is in UTF-8, even though UTF-8 is a plain byte sequence and UTF-8
files therefore typically neither have nor need a BOM.
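
To answer the BOM question for a given file, something like the following sketch could be used. The sub names has_utf8_bom and file_has_utf8_bom are invented for this example; the only fact relied on is that the UTF-8 BOM is the octet sequence EF BB BF.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the octet string starts with the UTF-8 BOM.
sub has_utf8_bom {
    my ($octets) = @_;
    return substr($octets, 0, 3) eq "\xEF\xBB\xBF" ? 1 : 0;
}

# Hypothetical helper: read the first three bytes of a file in raw mode
# (no decoding layer) and test them.
sub file_has_utf8_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my $count = read($fh, my $head, 3);
    close $fh;
    die "can't read $path: $!" unless defined $count;
    return has_utf8_bom($head);
}

print has_utf8_bom("\xEF\xBB\xBFRec'd")  ? "BOM\n" : "no BOM\n";   # BOM
print has_utf8_bom("Rec\342\200\231d")   ? "BOM\n" : "no BOM\n";   # no BOM
```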

jue

From: sln on 3 Sep 2009 19:07

On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <r.ted.byers(a)gmail.com> wrote:

>Again, I am trying to automatically process data I receive by email,
>so I have no control over the data that is coming in.
>
>The data is supposed to be plain text/HTML, but there are quite a
>number of records where the contraction "rec'd" is misrepresented when
>written to standard out as "Rec\342\200\231d"
>
>When the data is written to a file, these characters are represented
>by the character ’ when it is opened using notepad, but by the string
>â€™ when it is opened by open office.
>
>So how do I tell what character it is when in three different contexts
>it is displayed in three different ways? How can I make certain that
>when I either print it or store it in my DB, I get the correct
>"rec'd" (or, better, "received")?
>
>I suspect a minor glitch in the software that makes and send the email
>as this is the ONLY string where what ought to be an ascii ' character
>is identified as a wide character. Regardless of how that happens (as
>I don't control that), I need to clean this. And it gets confusing
>when different applications handle the i18n differently (Notepad is
>undoubtedly using the OS i18n support and Open Office is handling it
>differently, and Emacs is doing it differently from both).
>
>A little enlightenment would be appreciated.
>
>Thanks
>
>Ted

What you have there is a UTF-8 encoded character with
code point \x{2019}.

It is NOT an ascii single quote, rather a Unicode curly
single quote (right). See this table and this web site:

Character                              Code point  Escape
copyright sign                         00A9        \u00A9
registered sign                        00AE        \u00AE
trademark sign                         2122        \u2122
em-dash                                2014        \u2014
euro sign                              20AC        \u20AC
curly single quotation mark (left)     2018        \u2018
curly single quotation mark (right)    2019        \u2019
curly double quotation mark (left)     201C        \u201C
curly double quotation mark (right)    201D        \u201D

http://moock.org/asdg/technotes/usingSpecialCharacters/

By the way, it displays fine in Notepad and Word. It is
not ASCII, so you need a font and an app that can display
UTF-8 characters.

If you want to convert these special characters, use a regex
to replace them with their ASCII equivalents.

First, before you do that: apparently the character is embedded
as raw octets 'Rec\342\200\231d', which need to be decoded from
UTF-8 into characters; then you can use code points in the regex.

You can strip these after you decode. Something like this:

$str = decode ('utf8', "your received string"); # utf-8 octets
$str =~ s/\x{2018}/'/g;
$str =~ s/\x{2019}/'/g;
$str =~ s/\x{201C}/"/g;
$str =~ s/\x{201D}/"/g;

etc, ...

Find a more efficient way to do the substitutions though.
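
One possible way to do all four substitutions in a single pass is tr///, sketched here. The sub name ascii_quotes is made up for this illustration, and it expects a string that has already been decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: map curly quotes to their ASCII counterparts.
# tr/// replaces each character in the first list with the character
# at the same position in the second list, in one pass over the string.
sub ascii_quotes {
    my ($text) = @_;
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

# Decode the UTF-8 octets first so that code points can be matched.
my $str = decode('UTF-8', "\342\200\234Rec\342\200\231d\342\200\235");

print ascii_quotes($str), "\n";   # prints "Rec'd" (with the double quotes)
```

tr/// only handles one-character-to-one-character mappings; anything like em-dash to "--" would still need s///.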

See below for an example.
-sln
===========================
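use strict;
use warnings;
use Encode;

# Decode the raw octets into a Perl character string
my $str = decode ('utf8', "Rec\342\200\231d"); # utf-8 octets

my $data = "Rec\x{2019}d"; # Unicode Code Point

if ($str eq $data) {
    print "yes they're equal\n";
}
# Write through an encoding layer to avoid a "Wide character in print" warning
open my $fh, '>:encoding(UTF-8)', 'chr1.txt' or die "can't open chr1.txt: $!";

print $fh $data;
close $fh or die "can't close chr1.txt: $!";
exit;

# Return each character followed by its code point in hex
sub ordsplit
{
    my $string = shift;
    my $buf = '';
    for (map {ord $_} split //, $string) {
        $buf.= sprintf ("%c %02x ",$_,$_);
    }
    return $buf;
}
__END__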

From: sln on 3 Sep 2009 20:22

On Thu, 03 Sep 2009 16:07:07 -0700, sln(a)netherlands.com wrote:

>On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <r.ted.byers(a)gmail.com> wrote:
>
>You can strip these after you decode. Something like this:
>
>$str = decode ('utf8', "your received string"); # utf-8 octets
>$str =~ s/\x{2018}/'/g;
>$str =~ s/\x{2019}/'/g;
>$str =~ s/\x{201C}/"/g;
>$str =~ s/\x{201D}/"/g;
>
>etc, ...
>
-sln
------------------
use strict;
use warnings;
use Encode;

binmode (STDOUT, ':utf8');

my $str = decode ('utf8', "Rec\342\200\231d"); # utf8 octets
my $data = "Rec\x{2019}d"; # Unicode Code Point

if ($str eq $data) {
    print "yes they're equal\n";
}
print ordsplit($data),"\n";

# Substitute select Unicode to ascii equivalent
my %unisub = (
    "\x{2018}" => "'",
    "\x{2019}" => "'",
    "\x{201C}" => '"',
    "\x{201D}" => '"',
);
$str =~ s/$_/$unisub{$_}/g for keys (%unisub);
print $str,"\n";

# OR -- Substitute all Unicode code points, 100 - 10ffff with ? character
$data =~ s/[\x{100}-\x{10ffff}]/?/g;
print $data,"\n";

exit;

# Return each character followed by its code point in hex
sub ordsplit {
    my $string = shift;
    my $buf = '';
    for (map {ord $_} split //, $string) {
        $buf.= sprintf ("%c %02x ",$_,$_);
    }
    return $buf;
}
__END__

output:

yes they're equal
R 52 e 65 c 63 ’ 2019 d 64
Rec'd
Rec?d
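
The original question also asked for "received" rather than "rec'd", and for something safe to store in the DB. A rough sketch of that last step, working on a string whose quotes have already been converted. Both sub names (expand_recd, is_ascii) are invented for this example, and the regex only covers the one contraction mentioned in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: expand the contraction, keeping the case of the
# first letter ("Rec'd" -> "Received", "rec'd" -> "received").
sub expand_recd {
    my ($text) = @_;
    $text =~ s/\b([Rr])ec'd\b/${1}eceived/g;
    return $text;
}

# Hypothetical helper: true if every character is in the 7-bit ASCII range.
sub is_ascii {
    my ($text) = @_;
    return $text =~ /\A[\x00-\x7F]*\z/ ? 1 : 0;
}

my $clean = expand_recd("Rec'd 3 Sep");   # input: string after quote substitution
print "$clean\n" if is_ascii($clean);     # prints: Received 3 Sep
```

Checking is_ascii before the insert makes any leftover non-ASCII character visible instead of letting it reach the database silently.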
