Data cleaning issue involving bad wide characters in what ought to be ascii data [Perl]

Prev: Data cleaning issue involving bad wide characters in what ought to be ascii data
Next: perl regex : surround tabbed numeric field by double quotes

From: Jürgen Exner on 3 Sep 2009 12:40

Ted Byers <r.ted.byers(a)gmail.com> wrote:
>On Sep 3, 11:51 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
>> Ted Byers <r.ted.by...(a)gmail.com> wrote:
>My program needs to store the data as plain ascii

I dare to question the wisdom of this requirement. In today's world
restricting your data to ASCII only is a severe limitation and will more
often than not backfire when you least expect it. Does your data contain
e.g. any names? Customers, employees, places, tools or equipment named
after people or places? Can you guarantee that it will never be used
outside of the English-speaking world, not even for Spanish names in the
US?
A much more robust way is to finally accept that ASCII is almost 50
years old, obsolete, and completely inadequate for today's world and to
use Unicode/UTF-8 as the standard throughout.

>regardless of how the original data was encoded.

If you insist on limiting yourself to ASCII only then obviously you will
have to deal with any non-ASCII character in some way. What do you
propose to do with e.g. my first name?

>And apart from this string, it looks
>like all the data can be safely treated as ascii. The data comes as a
>text/html attachment to the emails, so I am wondering if the headers
>to the email might tell me something about the encoding ...

Sorry, I'm not a MIME expert.
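
For what it is worth, a MIME part normally declares its encoding in the charset parameter of its Content-Type header. A minimal sketch of pulling that parameter out with a regex follows; the sub name and the sample header value are made up for illustration, and a real mail parser would be more robust than this.

```perl
use strict;
use warnings;

# Pull the charset parameter out of a Content-Type header value.
# Returns undef (in scalar context) when no charset parameter is present.
sub charset_from_content_type {
    my ($header) = @_;
    return $1 if $header =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i;
    return;
}

# Hypothetical header value, not taken from the original emails.
my $header  = 'text/html; charset="windows-1252"';
my $charset = charset_from_content_type($header);
my $shown   = defined $charset ? $charset : 'unspecified';
print "$shown\n";
```

If the header names a charset, that name can be handed to a decoder before any cleaning is attempted.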

>> >How can I make certain that
>> >when I either print it or store it in my DB, I get the correct
>> >"rec'd" (or, better, "received")?

Convert it, transform it, remove it, reject it, ....
If it's really, really, really only this one instance ever, then
probably a simple s/// will do. But that will work only until some other
non-ASCII character shows up at your doorstep.
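
A minimal sketch of that single substitution, assuming the data has already been decoded into Perl characters and that the one offender is U+2019 (RIGHT SINGLE QUOTATION MARK); the sub name is hypothetical.

```perl
use strict;
use warnings;

# Replace the typographic apostrophe U+2019 with a plain ASCII one.
# Every other character, ASCII or not, is left untouched.
sub fix_apostrophe {
    my ($text) = @_;
    $text =~ s/\x{2019}/'/g;
    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');
print fix_apostrophe("Rec\x{2019}d 2009-09-03"), "\n";   # prints: Rec'd 2009-09-03
```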

>> Does the file have a BOM? AFAIR Notepad uses the BOM to determine if a
>> file is in UTF-8 in disregard of UTF-8 being a byte sequence and thus
>> files in UTF-8 typically neither having nor needing a BOM.
>>
>I don't know what a BOM is, let alone how to tell if a file has one.

See http://en.wikipedia.org/wiki/Byte-order_mark. You might be able to
use it to determine the encoding of your data.
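
A rough sketch of checking for a BOM by looking at the first raw bytes of the data. The sub name is hypothetical, and the absence of a BOM proves nothing about the encoding.

```perl
use strict;
use warnings;

# Guess an encoding from a leading byte-order mark.
# $bytes must be raw, undecoded octets. Returns undef when there is no BOM.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    # The UTF-32 marks must be tested before UTF-16, because the
    # UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return;
}

my $guess = sniff_bom("\xEF\xBB\xBFRec'd");
my $shown = defined $guess ? $guess : 'no BOM';
print "$shown\n";
```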

>Is there a safe way to ensure that all the data that is being
>processed is plain ascii?

Only if the character set is explicitly specified as ASCII. Every other
character set does contain non-ASCII characters which you will have to
handle.
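
Short of such a guarantee, you can at least check what actually arrived. The sketch below counts every character outside the ASCII range in a decoded string, so you can see what needs handling; the sub name and sample string are illustrative only.

```perl
use strict;
use warnings;

# Count each non-ASCII character in a decoded string.
# Returns a hash reference mapping labels like 'U+2019' to occurrence counts.
sub non_ascii_report {
    my ($text) = @_;
    my %count;
    while ($text =~ /([^\x00-\x7F])/g) {
        my $label = sprintf 'U+%04X', ord $1;
        $count{$label}++;
    }
    return \%count;
}

my $report = non_ascii_report("Rec\x{2019}d, copyright \x{00A9} 2009");
print "$_ seen $report->{$_} time(s)\n" for sort keys %$report;
```

An empty report means the string is plain ASCII; anything else is a list of characters to convert, replace, or reject.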

>I have seen email clients displaying this
>data so I know that there are never characters in it, as displayed,
>that would not be valid ascii.

Would you bet your house on it?

jue

From: Jürgen Exner on 3 Sep 2009 12:51

Ted Byers <r.ted.byers(a)gmail.com> wrote:
>I thought I'd have to resort to a regex, if I could figure out what to
>scan for, but if there is a perl package that will make it easier to
>deal with this odd character, great.

Forgot to mention:
There is Text::Iconv (see
http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which will
convert text between different encodings. However I have no idea what it
does with characters that do not exist in the target character set.

jue

From: Mart van de Wege on 4 Sep 2009 01:44

Jürgen Exner <jurgenex(a)hotmail.com> writes:

> Ted Byers <r.ted.byers(a)gmail.com> wrote:
>>I thought I'd have to resort to a regex, if I could figure out what to
>>scan for, but if there is a perl package that will make it easier to
>>deal with this odd character, great.
>
> Forgot to mention:
> There is Text::Iconv (see
> http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which will
> convert text between different encodings. However I have no idea what it
> does with characters that do not exist in the target character set.
>
If it uses iconv, or works the same as iconv, it'll drop them.
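
Since Text::Iconv may not be installed, here is a sketch using the core Encode module instead (a different tool, not iconv) to show three ways unmappable characters can be treated when converting to ASCII: substitute, drop, or reject. The sub names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Default CHECK: each unmappable character becomes the substitution character '?'.
sub to_ascii_subst {
    my ($text) = @_;
    return encode('ascii', $text);
}

# Code reference as CHECK: it is called with the ordinal of each unmappable
# character; returning an empty string drops the character.
sub to_ascii_drop {
    my ($text) = @_;
    return encode('ascii', $text, sub { '' });
}

# FB_CROAK: die on the first unmappable character; eval turns that into undef.
sub to_ascii_strict {
    my ($text) = @_;
    return eval { encode('ascii', $text, Encode::FB_CROAK) };
}

my $sample = "Rec\x{2019}d";
print to_ascii_subst($sample), "\n";
print to_ascii_drop($sample), "\n";
print defined to_ascii_strict($sample) ? "accepted\n" : "rejected\n";
```

Which of the three is right depends on whether silently losing a character is acceptable for the data in question.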

Mart

--
"We will need a longer wall when the revolution comes."
--- AJS, quoting an uncertain source.

From: Jürgen Exner on 4 Sep 2009 14:22

Ted Byers <r.ted.byers(a)gmail.com> wrote:
>On Sep 4, 1:44 am, Mart van de Wege <mvdw...(a)mail.com> wrote:
>> Jürgen Exner <jurge...(a)hotmail.com> writes:
>> > Ted Byers <r.ted.by...(a)gmail.com> wrote:
>> >>I thought I'd have to resort to a regex, if I could figure out what to
>> >>scan for, but if there is a perl package that will make it easier to
>> >>deal with this odd character, great.
>>
>> > Forgot to mention:
>> > There is Text::Iconv (see
>> >http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which will
>> > convert text between different encodings. However I have no idea what it
>> > does with characters that do not exist in the target character set.
>>
>> If it uses iconv, or works the same as iconv, it'll drop them.
>>
>> Mart
>>
>> --
>> "We will need a longer wall when the revolution comes."
>> --- AJS, quoting an uncertain source.
>
>Does it work on Windows?

What "it" are you referring to? According to your quoting style it must
be the revolution in Mart's signature. However I find that rather
unlikely. There has never been anything revolutionary about Windows.

Or are you referring to the iconv tool that Mart mentioned? I know
nothing about that.

Or are you referring to the Text::Iconv module that I mentioned?
I used it a lot several years ago on Windows.

>I don't find it on any of the repositories
>identified in Activestate's PPM, and haven't had much luck installing
>packages from cpan that aren't in at least one of those PPM
>repositories. The documentation for it says nothing about
>dependencies.

I had no problems installing Text::Iconv from CPAN on Windows (XP and
Server2000). However as I mentioned that was several years ago, no
recent experience.

jue

From: sln on 4 Sep 2009 18:01

On Fri, 4 Sep 2009 10:59:59 -0700 (PDT), Ted Byers <r.ted.byers(a)gmail.com> wrote:

>On Sep 3, 8:22 pm, s...(a)netherlands.com wrote:
>> On Thu, 03 Sep 2009 16:07:07 -0700, s...(a)netherlands.com wrote:
>> >On Thu, 3 Sep 2009 07:10:36 -0700 (PDT), Ted Byers <r.ted.by...(a)gmail.com> wrote:
>>
>
> I learned plenty from this, and
>Jue's posts about this.
>
>Cheers,
>
>Ted

Looking back, it can for the most part be boiled down to this.
A roll-your-own, simple regex, that covers all cases.

Good luck!
-sln
-------------
use strict;
use warnings;
use Encode;

binmode (STDOUT, ':utf8');

#my $charset = 'utf8'; # Decode raw bytes that are in $charset encoding
#my $str = decode ($charset, "Your received string"); # encoded octets

# Example: $str as it would look after decoding the received sample:
my $str = "Rec\x{2019}d, copyright \x{00A9} 2009, trademark\x{2122} affixed";

# Select Unicode to ascii char-to-string substitutions
# ----
my %unisub = (
"\x{00A9}" => '(c)',
"\x{2018}" => "'",
"\x{2019}" => "'",
"\x{201C}" => '"',
"\x{201D}" => '"',
);

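# Substitute every non-ascii character (any code point above 0x7F) with its
# ascii equivalent from the hash, or delete it if it is not in the hash.
# Note: \x{2122} (trademark sign) is not in the hash, so it is deleted.
# ----
$str =~ s/([^\x00-\x7F])/ exists $unisub{$1} ? $unisub{$1} : ''/ge;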
print $str,"\n";

__END__

Output:

Rec'd, copyright (c) 2009, trademark affixed
