comm vs. XML data [Shell]

From: Pankaj on 13 Feb 2010 09:00

Greetings,

I have two files with the following contents:

File1.txt

<abc>CONTENT1-GOES-HERE</abc>
<abc>CONTENT2-GOES-HERE</abc>

File2.txt

<abc>CONTENT1-GOES-HERE</abc>
<abc>CONTENT2-GOES-HERE</abc>
<abc>CONTENT3-GOES-HERE</abc>

To explain: each record starts with <abc> and ends with </abc>, and
after each </abc> the next record starts on a new line.

I want to compare the two files above (not just the content, but the
whole record from <abc> to </abc>). I want every record that is
present in File2.txt but not in File1.txt.

So, for the sample data above, the output (say, in a third file) would be:

Final.output

<abc>CONTENT3-GOES-HERE</abc>

I have tried

cat File1.txt | sort > File11.txt
cat File2.txt | sort > File22.txt

comm -23 File22.txt File11.txt > Final.output

The Final.output file does not show the correct records (it showed
more records than expected).

I am not sure whether the above is the correct way of going about it.
Any help would be appreciated.

We are using Solaris 5.10

TIA
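
One possible cause of the extra records, offered as a guess rather than a diagnosis: comm expects both inputs to be sorted in the collation order it uses itself, and it treats lines that differ only in invisible characters (a trailing carriage return, trailing blanks) as different lines. The sketch below assumes that File1.txt was saved with DOS line endings, which the post does not state, and shows the same sort/comm pipeline with carriage returns stripped and the C locale forced for both sort and comm.

```shell
# Sample data from the post. File1.txt is deliberately written with CRLF line
# endings here; that is an assumption for illustration only.
printf '%s\r\n' '<abc>CONTENT1-GOES-HERE</abc>' '<abc>CONTENT2-GOES-HERE</abc>' > File1.txt
printf '%s\n' '<abc>CONTENT1-GOES-HERE</abc>' '<abc>CONTENT2-GOES-HERE</abc>' '<abc>CONTENT3-GOES-HERE</abc>' > File2.txt

# Strip carriage returns, then sort both files under the same (C) locale.
tr -d '\r' < File1.txt | LC_ALL=C sort > File11.txt
tr -d '\r' < File2.txt | LC_ALL=C sort > File22.txt

# -2 drops lines found only in File11.txt, -3 drops lines common to both,
# which leaves the lines found only in File22.txt.
LC_ALL=C comm -23 File22.txt File11.txt > Final.output
cat Final.output
```

Without the tr step, every line of File2.txt would be reported for this sample, because none of them would match a line ending in a carriage return.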

From: pk on 13 Feb 2010 08:58

Pankaj wrote:

> File1.txt
>
> <abc>CONTENT1-GOES-HERE</abc>
> <abc>CONTENT2-GOES-HERE</abc>
>
> File2.txt
>
> <abc>CONTENT1-GOES-HERE</abc>
> <abc>CONTENT2-GOES-HERE</abc>
> <abc>CONTENT3-GOES-HERE</abc>
>
> To explain: each record starts with <abc> and ends with </abc>, and
> after each </abc> the next record starts on a new line.
>
> I want to compare the two files above (not just the content, but the
> whole record from <abc> to </abc>). I want every record that is
> present in File2.txt but not in File1.txt.
>
> So, for the sample data above, the output (say, in a third file) would be:

This should do that using awk:

awk 'NR==FNR{a[$0];next} !($0 in a)' file1.xml file2.xml

> We are using Solaris 5.10

Then use /usr/xpg4/bin/awk.
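
For anyone who wants to try this, here is a minimal sketch that recreates the sample data under the file names from the original post (File1.txt and File2.txt rather than the file1.xml and file2.xml used in the one-liner above) and writes the result to Final.output. It assumes a POSIX awk is first in the PATH; on Solaris, substitute /usr/xpg4/bin/awk.

```shell
# Recreate the two sample files from the original post.
printf '%s\n' '<abc>CONTENT1-GOES-HERE</abc>' '<abc>CONTENT2-GOES-HERE</abc>' > File1.txt
printf '%s\n' '<abc>CONTENT1-GOES-HERE</abc>' '<abc>CONTENT2-GOES-HERE</abc>' '<abc>CONTENT3-GOES-HERE</abc>' > File2.txt

# Print the lines of File2.txt that do not appear anywhere in File1.txt.
# No sorting is needed, and the output keeps the order of File2.txt.
awk 'NR==FNR{a[$0];next} !($0 in a)' File1.txt File2.txt > Final.output
cat Final.output
```

Note that the whole line is compared, so a record that differs only in trailing whitespace counts as a different record.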

From: Pankaj on 13 Feb 2010 09:36

That works like a charm Pk. Can you please explain the code-flow?

From: pk on 13 Feb 2010 10:12

Pankaj wrote:

>> This should do that using awk:
>>
>> awk 'NR==FNR{a[$0];next} !($0 in a)' file1.xml file2.xml
>>
>> > We are using Solaris 5.10
>>
>> Then use /usr/xpg4/bin/awk.
>
> That works like a charm Pk. Can you please explain the code-flow?

NR==FNR{a[$0];next}

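This stores every line of the first file as an index of the "a" associative
array (or hash). NR counts lines across all input and FNR restarts at 1 for
each file, so NR==FNR holds only while we're reading the first file. $0
represents the input line, so a[$0] creates the element subscripted by $0 in
the hash, and "next" skips the remaining rules for that line.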
----------------
!($0 in a)

This is evaluated when the second file is being read, and essentially tells
awk "if the line we're reading ($0) is NOT present as an index of the hash a
(this is indicated by !($0 in a)), then print it".
It's probably clearer written as follows:

!($0 in a) {print $0}

but the two forms are equivalent, since when awk finds a true condition, by
default it prints the record (line in this case).
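
The same program can be sketched in a longer, commented form. This is an illustrative rewrite, not a different method: the array name "seen" and the output file Expanded.output are hypothetical names chosen here, and the sample files are recreated so the snippet runs on its own.

```shell
# Sample data from the original post.
printf '%s\n' '<abc>CONTENT1-GOES-HERE</abc>' '<abc>CONTENT2-GOES-HERE</abc>' > File1.txt
printf '%s\n' '<abc>CONTENT1-GOES-HERE</abc>' '<abc>CONTENT2-GOES-HERE</abc>' '<abc>CONTENT3-GOES-HERE</abc>' > File2.txt

awk '
    # First file only: NR (lines read overall) equals FNR (lines read in
    # the current file) until awk moves on to the second file.
    NR == FNR {
        seen[$0]    # referencing the element is enough to create it
        next        # skip the remaining rules for this line
    }
    # Second file: print each line that was never stored from the first.
    !($0 in seen) { print $0 }
' File1.txt File2.txt > Expanded.output
cat Expanded.output
```

One caveat with the NR==FNR idiom: if the first file is empty, NR==FNR stays true while the second file is read, so nothing is printed.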

From: Pankaj on 13 Feb 2010 13:06

Thanks again Pk. It seems I really need to learn AWK programming.
Appreciate your time.
