From: Rahul on
I'm trying to join two unsorted files and print lines common to both based
on a "key" field. The first field is the key.

cat file1:
A;Ablah
B;Bblah

cat file2:
A;Ablahblah
B;Bblahblah

A join seems to work:
join -t ';' -j 1 file1 file2

A;Ablah;Ablahblah
B;Bblah;Bblahblah

But the moment there is a non-matching line (which also leaves file1 unsorted
on the key), the join fails:

e.g.

cat file1:
C;Cblah
A;Ablah
B;Bblah

Is there any way around this? To still output the lines where field1
matches? If not, then can awk etc. handle this situation? I've only used
awk on single files before so not sure....
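
The only workaround I can think of is to sort both files on the key first,
something like (untested, and assuming a shell with process substitution):

join -t ';' -j 1 <(sort -t ';' -k1,1 file1) <(sort -t ';' -k1,1 file2)

but ideally I'd like to handle the unsorted files as they are.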



--
Rahul
From: The Natural Philosopher on
Rahul wrote:
> I'm trying to join two unsorted files and print lines common to both based
> on a "key" field. The first field is the key.
>
> cat file1:
> A;Ablah
> B;Bblah
>
> cat file2:
> A;Ablahblah
> B;Bblahblah
>
> A join seems to work:
> join -t ';' -j 1 file1 file2
>
> A;Ablah;Ablahblah
> B;Bblah;Bblahblah
>
> But the moment there is a non matching line the join fails:
>
> e.g.
>
> cat file1:
> C;Cblah
> A;Ablah
> B;Bblah
>
> Is there any way around this? To still output the lines where field1
> matches? If not, then can awk etc. handle this situation? I've only used
> awk on single files before so not sure....
>
>
>
I vaguely remember doing this with awk years ago...
From: Anton Treuenfels on

"Rahul" <nospam(a)nospam.invalid> wrote in message
news:Xns9D8FAB1CD93556650A1FC0D7811DDBC81(a)188.40.43.230...
> I'm trying to join two unsorted files and print lines common to both based
> on a "key" field. The first field is the key.
>
> cat file1:
> A;Ablah
> B;Bblah
>
> cat file2:
> A;Ablahblah
> B;Bblahblah
>
> Is there any way around this? To still output the lines where field1
> matches? If not, then can awk etc. handle this situation? I've only used
> awk on single files before so not sure....

I'm going to assume that any particular key can appear any number of times in
either file, that the rest of each line can vary, and that you only want one
copy of each matching line from either file.

One way is to read the first file twice and the second file once:

# Process file1, then file2, then file1 a second time (appended to ARGV here).
# FS is set to ";" so that $1 is the key field.
BEGIN { FS = ";"; ARGV[ARGC++] = ARGV[1] }

FILENAME == "file1" {
    file1keys[$1] = ".T."       # remember this key was seen in file1
    if ($1 in file2keys)        # only true on the second pass over file1
        print
}

FILENAME == "file2" {
    file2keys[$1] = ".T."       # remember this key was seen in file2
    if ($1 in file1keys)        # file1's keys were all collected on pass one
        print
}

Of course this will print all the matching lines from file2 before any from
file1. You can also, of course, put the filenames on the command line in
whatever order you want.
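
For illustration, if the program above is saved as, say, common.awk (the name
is just an assumption), it can be run as

awk -f common.awk file1 file2

and with the sample files above it should print the file2 matches first, then
the file1 matches:

A;Ablahblah
B;Bblahblah
A;Ablah
B;Bblah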

- Anton Treuenfels

From: pk on
Rahul wrote:

> I'm trying to join two unsorted files and print lines common to both based
> on a "key" field. The first field is the key.
>
> cat file1:
> A;Ablah
> B;Bblah
>
> cat file2:
> A;Ablahblah
> B;Bblahblah
>
> A join seems to work:
> join -t ';' -j 1 file1 file2
>
> A;Ablah;Ablahblah
> B;Bblah;Bblahblah
>
> But the moment there is a non matching line the join fails:
>
> e.g.
>
> cat file1:
> C;Cblah
> A;Ablah
> B;Bblah
>
> Is there any way around this? To still output the lines where field1
> matches? If not, then can awk etc. handle this situation? I've only used
> awk on single files before so not sure....

Assuming no repeated keys, try

awk -F \; -v OFS=\; 'NR==FNR{a[$1]=$2;next}
$1 in a{print $1, a[$1], $2}' file1 file2
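
With the second set of sample files above (file1 containing the extra C;Cblah
line), this should print the same two joined lines as the original join did:

A;Ablah;Ablahblah
B;Bblah;Bblahblah

The C key is stored in a[] but never printed, since file2 has no line with
that key.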
From: Rahul on
"Anton Treuenfels" <teamtempest(a)yahoo.com> wrote in
news:buqdnWDccvgqH5HRnZ2dnUVZ_i2dnZ2d(a)earthlink.com:

> I'm going to assume any particular key field can appear any number of
> times in either file in any line and that the rest of each line can
> vary and that you only want one copy of each line from either file.
>

Thanks Anton for this general solution. My problem is actually simpler, since
the keys are never repeated. My bad, I should have mentioned that.

--
Rahul