From: Hongyi Zhao on 21 Oct 2009 01:14

On Tue, 20 Oct 2009 08:20:58 -0700 (PDT), Ed Morton
<mortonspam(a)gmail.com> wrote:

>The email address I use for netnews is just a spam trap, I don't read
>it. Post some SMALL sample input and expected output from that input,
>in particular including the IP addresses that are omitted from the
>output.

See the following minimal example:

1- The test.awk is as follows:

BEGIN{ FS="\t"; OFS="#" }
function ip2nr(ip, nr,ipA) {
    # aaa.bbb.ccc.ddd
    split(ip,ipA,".")
    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
    return nr
}
NR==FNR { addrs[$0] = ip2nr($0); next }
FNR>1 {
    start = ip2nr($1)
    end = ip2nr($2)
    for (ip in addrs) {
        if (addrs[ip] >= start && addrs[ip] <= end) {
            print ip,$3" "$4
            #delete addrs[ip]
            #next
        }
    }
}

The file1 has the following content:

$ cat file1
128.83.194.98
129.21.126.99
129.21.136.140
140.180.130.93
140.180.163.6
161.53.160.104
18.127.1.91
18.181.0.128
18.246.2.48
18.246.2.79
18.246.2.83
18.246.2.88
18.251.7.53

The file2 has the following content:

$ cat file2
StartIP       EndIP            Country  Local
4.21.160.8    4.21.160.15      America  MIT
18.0.0.0      18.255.255.255   America  MIT
128.30.0.0    128.31.255.255   America  MIT
128.52.0.0    128.52.255.255   America  MIT
128.83.0.0    128.83.255.255   America  The University of Texas at Austin
129.21.0.0    129.21.255.255   America  Rochester
140.180.0.0   140.180.255.255  America  Princeton
161.53.0.0    161.53.255.255   Croatia  University of Zagreb university central computing
192.12.11.0   192.12.11.255    America  MIT
192.54.222.0  192.54.222.255   America  MIT
192.233.95.0  192.233.95.255   America  MIT

The output by running the test.awk:

$ awk -f test.awk file1 file2
18.251.7.53#America MIT
18.181.0.128#America MIT
18.246.2.83#America MIT
18.246.2.48#America MIT
18.246.2.88#America MIT
18.246.2.79#America MIT
18.127.1.91#America MIT
128.83.194.98#America The University of Texas at Austin
129.21.136.140#America Rochester
129.21.126.99#America Rochester
140.180.130.93#America Princeton
140.180.163.6#America Princeton
161.53.160.104#Croatia University of Zagreb university central computing

2- This time, I use the revised version of your test.awk, i.e.,

BEGIN{ FS="\t"; OFS="#" }
function ip2nr(ip, nr,ipA) {
    # aaa.bbb.ccc.ddd
    split(ip,ipA,".")
    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
    return nr
}
NR==FNR { addrs[$0] = ip2nr($0); next }
FNR>1 {
    start = ip2nr($1)
    end = ip2nr($2)
    for (ip in addrs) {
        if (addrs[ip] >= start && addrs[ip] <= end) {
            print ip,$3" "$4
            delete addrs[ip]
            next
        }
    }
}

The output by running the test.awk will look as follows:

$ awk -f test.awk file1 file2
18.251.7.53#America MIT
128.83.194.98#America The University of Texas at Austin
129.21.136.140#America Rochester
140.180.130.93#America Princeton
161.53.160.104#Croatia University of Zagreb university central computing

Any hints on this issue? Thanks in advance.

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

From: Hongyi Zhao on 21 Oct 2009 01:18

On Wed, 21 Oct 2009 07:48:18 +1100, Grant
<g_r_a_n_t_(a)bugsplatter.id.au> wrote:

>>> >    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>
>The weighting for converting dotquad IP to a number is 256, not
>1000 -- using 1000 will skip IP addresses in your range matching.
>
>Try
>    nr = ipA[1] * 2^24 + ipA[2] * 2^16 + ipA[3] * 2^8 + ipA[4]
>
>or
>    nr = ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]
>
>instead -- the second version is speed optimised for gawk.

I've tried all of the above three expressions for _nr_, and I _always_
get the same results. Could you please give some example to support
your point of view?

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

From: Grant on 21 Oct 2009 03:15

On Wed, 21 Oct 2009 13:18:47 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:

>On Wed, 21 Oct 2009 07:48:18 +1100, Grant
><g_r_a_n_t_(a)bugsplatter.id.au> wrote:
>
>>>> >    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>>
>>The weighting for converting dotquad IP to a number is 256, not
>>1000 -- using 1000 will skip IP addresses in your range matching.
>>
>>Try
>>    nr = ipA[1] * 2^24 + ipA[2] * 2^16 + ipA[3] * 2^8 + ipA[4]
>>
>>or
>>    nr = ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]
>>
>>instead -- the second version is speed optimised for gawk.
>
>I've tried all of the above three expressions for _nr_, and I _always_
>get the same results. Could you please give some example to support
>your point of view?

grant@deltree:~$ echo 123.123.123.123 > dotquad

grant@deltree:~$ awk '{split($1,a,".");ip=((a[1]*256+a[2])*256+a[3])*256+a[4];\
xx=((a[1]*1000+a[2])*1000+a[3])*1000+a[4];print $1, ip, xx}' dotquad
123.123.123.123 2071690107 123123123123

grant@deltree:~$ ccfind 123.123.123.123
123.123.123.123 CN:China

grant@deltree:~$ ccfind 2071690107
123.123.123.123 CN:China

grant@deltree:~$ ccfind 123123123123
(bad query)

grant@deltree:~$ cat $(which ccfind)
#!/bin/bash
#
# ccfind 2006-03-05, last edit 2008-08-15
#
# returns '<query> cc:country name' for IP address input queries,
# using the ip2cn-server daemon.
#
# Copyright (C) 2006-2008 Grant Coady <http://bugsplatter.id.au> GPLv2
#
# 2008-08-13
#   convert to ip2cn-server operation, no more access locking! :)
#
# check got query
[ -z "$1" ] && echo "
ccfind -- lookup country code and name for IP address

usage $0 aa.bb.cc.dd
" && exit

# get server listen port
port=$(gawk '/^inetport/ {print $2}' /etc/ip2cn-server.conf)

# make query, may be dotquad or numeric (decimal) IP address
echo "$@" | gawk -v port=$port '
BEGIN { service = "/inet/tcp/0/localhost/" port }
$1 == "0" { $1 = "0." }
{ print |& service; service |& getline; print }' 2>/dev/null
# end

Grant.
--
http://bugsplatter.id.au

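A minimal sketch of the conversion Grant demonstrates, wrapped in a shell function so both weightings can be tried side by side. The name ip2nr is borrowed from the scripts in this thread; taking the scale as the first argument and printing with "%.0f" (to keep large values out of exponent notation) are assumptions made for this illustration, not part of anyone's posted code.

```shell
# ip2nr SCALE ADDRESS: print the dotted-quad ADDRESS as a single number,
# weighting each octet by SCALE (256 gives the conventional 32-bit value).
ip2nr() {
    awk -v scale="$1" -v ip="$2" 'BEGIN {
        split(ip, a, ".")
        nr = ((a[1] * scale + a[2]) * scale + a[3]) * scale + a[4]
        printf "%.0f\n", nr   # %.0f keeps large values out of exponent notation
    }'
}

ip2nr 256  123.123.123.123   # 2071690107, as in the session above
ip2nr 1000 123.123.123.123   # 123123123123, as in the session above
```

Only the 256-weighted result is the address's usual decimal form, which is presumably why ccfind accepts 2071690107 and rejects 123123123123.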
From: Ed Morton on 21 Oct 2009 03:47

Hongyi Zhao wrote:
> On Tue, 20 Oct 2009 08:20:58 -0700 (PDT), Ed Morton
> <mortonspam(a)gmail.com> wrote:
>
>> The email address I use for netnews is just a spam trap, I don't read
>> it. Post some SMALL sample input and expected output from that input,
>> in particular including the IP addresses that are omitted from the
>> output.
>
> See the following minimal example:
>
> 1- The test.awk is as follows:
>
> BEGIN{ FS="\t"; OFS="#" }
> function ip2nr(ip, nr,ipA) {
>     # aaa.bbb.ccc.ddd
>     split(ip,ipA,".")
>     nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>     return nr
> }
> NR==FNR { addrs[$0] = ip2nr($0); next }
> FNR>1 {
>     start = ip2nr($1)
>     end = ip2nr($2)
>     for (ip in addrs) {
>         if (addrs[ip] >= start && addrs[ip] <= end) {
>             print ip,$3" "$4
>             #delete addrs[ip]
>             #next
>         }
>     }
> }
>
> The file1 has the following content:
>
> $ cat file1
> 128.83.194.98
> 129.21.126.99
> 129.21.136.140
> 140.180.130.93
> 140.180.163.6
> 161.53.160.104
> 18.127.1.91
> 18.181.0.128
> 18.246.2.48
> 18.246.2.79
> 18.246.2.83
> 18.246.2.88
> 18.251.7.53
>
> The file2 has the following content:
>
> $ cat file2
> StartIP       EndIP            Country  Local
> 4.21.160.8    4.21.160.15      America  MIT
> 18.0.0.0      18.255.255.255   America  MIT
> 128.30.0.0    128.31.255.255   America  MIT
> 128.52.0.0    128.52.255.255   America  MIT
> 128.83.0.0    128.83.255.255   America  The University of Texas at Austin
> 129.21.0.0    129.21.255.255   America  Rochester
> 140.180.0.0   140.180.255.255  America  Princeton
> 161.53.0.0    161.53.255.255   Croatia  University of Zagreb university central computing
> 192.12.11.0   192.12.11.255    America  MIT
> 192.54.222.0  192.54.222.255   America  MIT
> 192.233.95.0  192.233.95.255   America  MIT
>
> The output by running the test.awk:
>
> $ awk -f test.awk file1 file2
> 18.251.7.53#America MIT
> 18.181.0.128#America MIT
> 18.246.2.83#America MIT
> 18.246.2.48#America MIT
> 18.246.2.88#America MIT
> 18.246.2.79#America MIT
> 18.127.1.91#America MIT
> 128.83.194.98#America The University of Texas at Austin
> 129.21.136.140#America Rochester
> 129.21.126.99#America Rochester
> 140.180.130.93#America Princeton
> 140.180.163.6#America Princeton
> 161.53.160.104#Croatia University of Zagreb university central computing
>
> 2- This time, I use the revised version of your test.awk, i.e.,
>
> BEGIN{ FS="\t"; OFS="#" }
> function ip2nr(ip, nr,ipA) {
>     # aaa.bbb.ccc.ddd
>     split(ip,ipA,".")
>     nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>     return nr
> }
> NR==FNR { addrs[$0] = ip2nr($0); next }
> FNR>1 {
>     start = ip2nr($1)
>     end = ip2nr($2)
>     for (ip in addrs) {
>         if (addrs[ip] >= start && addrs[ip] <= end) {
>             print ip,$3" "$4
>             delete addrs[ip]
>             next
>         }
>     }
> }
>
> The output by running the test.awk will look as follows:
>
> $ awk -f test.awk file1 file2
> 18.251.7.53#America MIT
> 128.83.194.98#America The University of Texas at Austin
> 129.21.136.140#America Rochester
> 140.180.130.93#America Princeton
> 161.53.160.104#Croatia University of Zagreb university central computing
>
> Any hints on this issue? Thanks in advance.
>
> Best regards.

It's the "next". It's causing the script to skip to the next range in
file2 whenever it finds 1 IP address from file1 in that range, but of
course there could be multiple IP addresses in that same range.

It's not causing this problem, but Grant may be right and you need to
use 256 instead of 1000 as a multiplier - I haven't thought about it
very much so maybe using 1000 will cause problems for some IP
addresses. Try this:

$ cat tst.awk
BEGIN{ FS="\t"; OFS="#"; scale=(scale ? scale : 256) }
function ip2nr(ip, nr,ipA) {
    # aaa.bbb.ccc.ddd
    split(ip,ipA,".")
    nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
    return nr
}
NR==FNR { addrs[$0] = ip2nr($0); next }
FNR>1 {
    start = ip2nr($1)
    end = ip2nr($2)
    for (ip in addrs) {
        if ((addrs[ip] >= start) && (addrs[ip] <= end)) {
            print ip,$3" "$4
            delete addrs[ip]
        }
    }
}

$ awk -f tst.awk file1 file2 > o1
$ awk -v scale=1000 -f tst.awk file1 file2 > o2
$ diff o1 o2

to see if it produces any difference in the output from your real,
large input files. If not, I'd go with 256 as the scale. If it does,
think about it and decide which is correct.

Ed.

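The effect of the "next" can be reproduced on a cut-down copy of the sample data. This is a sketch under stated assumptions: the temporary files, the match_ranges wrapper and its stop flag are invented for this illustration, and the awk body follows tst.awk above with the scale fixed at 256. The lines are counted rather than listed because the order in which "for (ip in addrs)" visits elements is not defined, so which address survives in the stop=1 case can vary between awk implementations.

```shell
# Sample data from the thread: three addresses in the 18.x range and one in
# the 129.21.x range. The ranges file is tab-separated with a header line.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' 18.127.1.91 18.181.0.128 18.246.2.48 129.21.126.99 > "$tmp/file1"
printf 'StartIP\tEndIP\tCountry\tLocal\n' > "$tmp/file2"
printf '18.0.0.0\t18.255.255.255\tAmerica\tMIT\n' >> "$tmp/file2"
printf '129.21.0.0\t129.21.255.255\tAmerica\tRochester\n' >> "$tmp/file2"

# match_ranges STOP: STOP=1 mimics the "next" after the first hit in a range,
# STOP=0 keeps scanning so every address in the range is reported.
match_ranges() {
    awk -v stop="$1" '
    BEGIN { FS = "\t"; OFS = "#" }
    function ip2nr(ip,   a) {
        split(ip, a, ".")
        return ((a[1] * 256 + a[2]) * 256 + a[3]) * 256 + a[4]
    }
    NR == FNR { addrs[$0] = ip2nr($0); next }
    FNR > 1 {
        start = ip2nr($1)
        end = ip2nr($2)
        for (ip in addrs) {
            if (addrs[ip] >= start && addrs[ip] <= end) {
                print ip, $3 " " $4
                delete addrs[ip]
                if (stop == 1) next   # abandons the rest of this range
            }
        }
    }' "$tmp/file1" "$tmp/file2"
}

match_ranges 1 | wc -l   # one line per matching range
match_ranges 0 | wc -l   # one line per matching address
```

With this data the first call reports two lines and the second reports four, which matches the pattern in the two outputs posted earlier in the thread.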
From: Ed Morton on 21 Oct 2009 03:54

Grant wrote:
> On Wed, 21 Oct 2009 13:18:47 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:
>
>> On Wed, 21 Oct 2009 07:48:18 +1100, Grant
>> <g_r_a_n_t_(a)bugsplatter.id.au> wrote:
>>
>>>>>>    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>>> The weighting for converting dotquad IP to a number is 256, not
>>> 1000 -- using 1000 will skip IP addresses in your range matching.
>>>
>>> Try
>>>    nr = ipA[1] * 2^24 + ipA[2] * 2^16 + ipA[3] * 2^8 + ipA[4]
>>>
>>> or
>>>    nr = ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]
>>>
>>> instead -- the second version is speed optimised for gawk.
>> I've tried all of the above three expressions for _nr_, and I _always_
>> get the same results. Could you please give some example to support
>> your point of view?
>
> grant@deltree:~$ echo 123.123.123.123 > dotquad
>
> grant@deltree:~$ awk '{split($1,a,".");ip=((a[1]*256+a[2])*256+a[3])*256+a[4];\
> xx=((a[1]*1000+a[2])*1000+a[3])*1000+a[4];print $1, ip, xx}' dotquad
> 123.123.123.123 2071690107 123123123123
>
> grant@deltree:~$ ccfind 123.123.123.123
> 123.123.123.123 CN:China
>
> grant@deltree:~$ ccfind 2071690107
> 123.123.123.123 CN:China
>
> grant@deltree:~$ ccfind 123123123123
> (bad query)

I expect you're right and that multiplying by 256 does produce a
"better" representation of the IP address as a decimal number, but can
you think of an example where the range check Hongyi cares about would
fail if we used 1000 instead of 256 as the multiplier?

Ed.
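
One way to probe that question is to run the same range check under both weightings. As far as the arithmetic goes, the two should agree for valid addresses: every octet is at most 255, so it is smaller than either multiplier, and both encodings order addresses octet by octet. A difference would only be expected for malformed input with an octet above 255. The sketch below is illustrative only; the helper names in_range and check are assumptions, not code from any poster.

```shell
# in_range SCALE IP START END: print "yes" if START <= IP <= END when each
# dotted quad is weighted by SCALE, otherwise "no".
in_range() {
    awk -v scale="$1" -v ip="$2" -v lo="$3" -v hi="$4" '
    function ip2nr(s,   a) {
        split(s, a, ".")
        return ((a[1] * scale + a[2]) * scale + a[3]) * scale + a[4]
    }
    BEGIN {
        n = ip2nr(ip)
        ok = (n >= ip2nr(lo) && n <= ip2nr(hi))
        print (ok ? "yes" : "no")
    }'
}

# check IP START END: show the verdict under both weightings.
check() {
    echo "$1 in $2..$3: 256=$(in_range 256 "$@") 1000=$(in_range 1000 "$@")"
}

# Addresses and ranges taken from the sample files in this thread.
check 18.251.7.53   18.0.0.0   18.255.255.255
check 128.83.194.98 128.83.0.0 128.83.255.255
check 128.83.194.98 18.0.0.0   18.255.255.255
```

On these inputs both columns agree, which is consistent with Hongyi seeing the same output from all three expressions. The 256 weighting remains the one that yields the conventional decimal form other tools accept.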