From: nobody on 10 Nov 2009 19:19

Given the following data in a flat file:

09020000251   Joe Smith           54 Abbey Road
05020033486   John Jones          98 New York Ave.
07020000279   George Washington   234 Washington Ave.
06020004293   Fred Flintstone     123 Bedrock Road
03020004472   Fred Jones          98 New York Ave.
06020004293   Wilma Flintstone    123 Bedrock Road

You can see that Fred and Wilma both share the same customer number and
street address. I'm traversing the file and looking for any duplicate
customer numbers, such as Wilma and Fred. If there is no duplicate, then
just print the record and move on. When I do encounter a duplicate I'm
trying to print the records one after the other so the output looks like:

Name: Fred Flintstone
Address 123 Bedrock Road
Customer#: 06020004293

Name: Wilma Flintstone
Address 123 Bedrock Road
Customer#: 06020004293

Here's what I have so far, which almost works, though I doubt it is the
best technique regardless.

#!/usr/bin/perl

use strict;
use warnings;

my $datafile = "$ARGV[0]";

my @file = ();
my @fields = ();
my $line;
my $custno;
my $name1;
my $addr1;

my $line2;
my $custno2;
my $name2;
my $addr2;

my $count;

open(HFILE, "<$datafile") || die "Cannot open $datafile: $!\n";

while ( <HFILE> ) {
    push(@file, $_) if $_ =~ /[A-Za-z0-9]/;
}

close(HFILE);

my @sortedFile = sort { $a cmp $b } @file;

foreach $line (@sortedFile) {

    $custno = substr($line, 0, 11);
    $name1  = substr($line, 14, 19);
    $addr1  = substr($line, 34, 20);

    #print "$custno\n";
    #print "$name1\n";
    #print "$addr1\n";

    $count = 0;
    foreach $line2 (@sortedFile) {
        $custno2 = substr($line2, 0, 11);
        if ($custno eq $custno2) {
            $count++;
        }
        if ($count == 2) {
            print "$custno2\n";
            $count = 0;
            $custno2 = substr($line2, 0, 11);
            $name2   = substr($line2, 14, 19);
            $addr2   = substr($line2, 34, 20);
            print "$custno2\n";
            print "$name2\n";
            print "$addr2\n";
            $count++;
            last;
        }
    }
}
From: Dr.Ruud on 10 Nov 2009 20:10

nobody wrote:
> Given the following data in a flat file:
>
> 09020000251   Joe Smith           54 Abbey Road
> 05020033486   John Jones          98 New York Ave.
> 07020000279   George Washington   234 Washington Ave.
> 06020004293   Fred Flintstone     123 Bedrock Road
> 03020004472   Fred Jones          98 New York Ave.
> 06020004293   Wilma Flintstone    123 Bedrock Road
>
> You can see that Fred and Wilma both share the same customer number and
> street address. I'm traversing the file and looking for any duplicate
> customer numbers, such as Wilma and Fred.

perl -MData::Dumper -aF'\s\s+' -nle '
    push @{ $d{$F[0]} }, [ @F[1,2] ]
  }{
    @{ $d{$_} } > 1 or delete $d{$_} for keys %d;
    print Dumper \%d
' flat.txt

$VAR1 = {
          '06020004293' => [
                             [
                               'Fred Flintstone',
                               '123 Bedrock Road'
                             ],
                             [
                               'Wilma Flintstone',
                               '123 Bedrock Road'
                             ]
                           ]
        };

-- 
Ruud
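[Editor's note: for anyone decoding the one-liner, a longhand equivalent might look like the sketch below. It is not Dr.Ruud's code: the sample records are inlined (replacing his assumed flat.txt), and the split pattern comes from his -F'\s\s+' switch.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Sample records from the thread, inlined so the sketch is self-contained.
my @input = (
    "09020000251   Joe Smith           54 Abbey Road",
    "05020033486   John Jones          98 New York Ave.",
    "07020000279   George Washington   234 Washington Ave.",
    "06020004293   Fred Flintstone     123 Bedrock Road",
    "03020004472   Fred Jones          98 New York Ave.",
    "06020004293   Wilma Flintstone    123 Bedrock Road",
);

my %d;
for (@input) {
    # -aF'\s\s+' autosplits each line on runs of two or more
    # whitespace characters into @F.
    my @F = split /\s\s+/;
    push @{ $d{$F[0]} }, [ @F[1, 2] ];
}

# The }{ trick ends the implicit -n loop, so this part of the
# one-liner runs once, after all input: keep only customer numbers
# that occurred more than once.
@{ $d{$_} } > 1 or delete $d{$_} for keys %d;

print Dumper \%d;
```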
From: Tad McClellan on 11 Nov 2009 00:18

nobody <nobody(a)nowhere.com> wrote:

> my $datafile = "$ARGV[0]";

perldoc -q vars

    What's wrong with always quoting "$vars"?

So then:

    my $datafile = $ARGV[0];

-- 
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
From: Jürgen Exner on 11 Nov 2009 01:18
nobody <nobody(a)nowhere.com> wrote:
>Given the following data in a flat file:
>
>09020000251   Joe Smith           54 Abbey Road
>05020033486   John Jones          98 New York Ave.
>07020000279   George Washington   234 Washington Ave.
>06020004293   Fred Flintstone     123 Bedrock Road
>03020004472   Fred Jones          98 New York Ave.
>06020004293   Wilma Flintstone    123 Bedrock Road
>
>You can see that Fred and Wilma both share the same customer number and
>street address. I'm traversing the file and looking for any duplicate
>customer numbers, such as Wilma and Fred. If there is no duplicate, then
>just print the record and move on. When I do encounter a duplicate I'm
>trying to print the records one after the other so the output looks like:
>
>Name: Fred Flintstone
>Address 123 Bedrock Road
>Customer#: 06020004293
>
>Name: Wilma Flintstone
>Address 123 Bedrock Road
>Customer#: 06020004293
>
>Here's what I have so far which almost works, which I doubt is the best
>technique regardless.

So you don't really care if or how many customers are sharing the same
customer id. And each record is printed the same way, no matter if
duplicate or not. In that case, yes, your approach of simply sorting the
lines seems quite adequate.

>#!/usr/bin/perl
>
>use strict;
>use warnings;
>
>my $datafile = "$ARGV[0]";

Don't quote variables, there is no good reason for it.

>my @file = ();
>my @fields = ();
>my $line;
>my $custno;
>my $name1;
>my $addr1;
>
>my $line2;
>my $custno2;
>my $name2;
>my $addr2;

Don't use global variables unless there is a good reason for it. For
almost all of these there is no good reason.

>my $count;
>
>open(HFILE, "<$datafile") || die "Cannot open $datafile: $!\n";
>
>    while ( <HFILE> ) {
>        push(@file, $_) if $_ =~ /[A-Za-z0-9]/;
>    }
>
>close(HFILE);
>
>my @sortedFile = sort { $a cmp $b } @file;

Sorting lexically is the default behaviour of sort() already, no reason
to mention it explicitly.
>foreach $line (@sortedFile) {
>
>    $custno = substr($line, 0, 11);
>    $name1  = substr($line, 14, 19);
>    $addr1  = substr($line, 34, 20);

There are other ways to split the line, but this works and looks ok to me.

>    #print "$custno\n";
>    #print "$name1\n";
>    #print "$addr1\n";

Are these lines relevant in any way?

>    $count = 0;
>    foreach $line2 (@sortedFile) {

What on earth are you doing with this inner loop? And what is $count
about? You don't use it for anything useful. Your output doesn't
distinguish between the first occurrence of a customer ID and subsequent
occurrences. So don't bother about it, just print each record in the
sequence as it appears in the sorted array.

[snipped]

A different approach (just in case you do care whether a customer ID is
duplicated or not) would be to read all customers into an HoA, using the
customer ID as the key for the hash. And then just traverse the whole
hash and print each array.

jue
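[Editor's note: jue's HoA (hash of arrays) suggestion could be sketched roughly as follows. This is an illustrative sketch, not jue's code: the sample data is inlined so the script is self-contained, and the substr offsets are taken from the original post.]

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample fixed-width records from the thread: customer number at
# column 0 (11 chars), name at column 14 (19 chars), address at 34.
my @lines = (
    "09020000251   Joe Smith           54 Abbey Road",
    "05020033486   John Jones          98 New York Ave.",
    "07020000279   George Washington   234 Washington Ave.",
    "06020004293   Fred Flintstone     123 Bedrock Road",
    "03020004472   Fred Jones          98 New York Ave.",
    "06020004293   Wilma Flintstone    123 Bedrock Road",
);

# Build the HoA: customer number => list of [name, address] pairs.
my %records;
for my $line (@lines) {
    my $custno = substr($line, 0, 11);
    my $name   = substr($line, 14, 19);
    my $addr   = substr($line, 34, 20);
    s/\s+$// for $name, $addr;    # strip the fixed-width padding
    push @{ $records{$custno} }, [ $name, $addr ];
}

# Traverse the hash and print each array; records sharing a customer
# number come out grouped, whether duplicated or not.
for my $custno (sort keys %records) {
    for my $rec (@{ $records{$custno} }) {
        print "Name: $rec->[0]\n";
        print "Address: $rec->[1]\n";
        print "Customer#: $custno\n\n";
    }
}
```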