From: Hongyi Zhao on 19 Oct 2009 23:52

On Mon, 19 Oct 2009 11:40:51 -0500, Ed Morton <mortonspam(a)gmail.com> wrote:

>Adding a "delete" and a "next" would make the script more efficient if
>you have a large list of IP addresses in file1 and each range in file2
>is distinct:
>
>BEGIN{ FS="\t"; OFS="#" }
>function ip2nr(ip, nr,ipA) {
>    # aaa.bbb.ccc.ddd
>    split(ip,ipA,".")
>    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>    return nr
>}
>NR==FNR { addrs[$0] = ip2nr($0); next }
>FNR>1 {
>    start = ip2nr($1)
>    end = ip2nr($2)
>    for (ip in addrs) {
>        if (addrs[ip] >= start && addrs[ip] <= end) {
>            print ip,$3" "$4
>            delete addrs[ip]
>            next
>        }
>    }
>}
>
>    Ed.

Thanks a lot.

In my case, file2, i.e. the IP database, is huge (373375 lines), and I
find that your revised awk script above omits some of the IP addresses
from file1 in the output.

Considering that it's not advisable to post attachments to this
newsgroup, I've emailed you about this issue, along with all the files
I used and generated.
-- 
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

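One possible explanation for the omitted addresses, offered as an assumption since no sample data appears in the thread: the "next" inside the for loop moves on to the next range line as soon as one address matches, so when several addresses from file1 fall inside the same range, only one of them is printed. The sketch below runs the quoted script, with its original conversion left unchanged, on made-up data. The two addresses, the range line with its header, and the stop_first switch are all hypothetical.

```shell
# Hypothetical sample data: two addresses that fall inside the same range.
tmp=$(mktemp -d)
printf '10.0.0.5\n10.0.0.9\n' > "$tmp/file1"
printf 'start\tend\tcountry\tisp\n10.0.0.0\t10.0.0.255\tXX\tExampleNet\n' > "$tmp/file2"

# match_ranges: the quoted script, with the "next" made optional through
# the hypothetical stop_first switch (1 = keep the "next", 0 = drop it).
match_ranges() {
    awk -v stop_first="$1" '
    BEGIN { FS = "\t"; OFS = "#" }
    function ip2nr(ip,    nr, ipA) {
        split(ip, ipA, ".")
        nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
        return nr
    }
    NR == FNR { addrs[$0] = ip2nr($0); next }
    FNR > 1 {
        start = ip2nr($1)
        end = ip2nr($2)
        for (ip in addrs) {
            if (addrs[ip] >= start && addrs[ip] <= end) {
                print ip, $3 " " $4
                delete addrs[ip]
                if (stop_first) next   # stops scanning addrs for this range
            }
        }
    }' "$tmp/file1" "$tmp/file2"
}

# Count the output lines for each variant.
with_next=$(match_ranges 1 | wc -l | tr -d ' ')
without_next=$(match_ranges 0 | wc -l | tr -d ' ')
echo "with next: $with_next line(s); without next: $without_next line(s)"
```

On this data the variant with "next" prints one matching line and the variant without it prints two, so the second address is lost only when "next" is present.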
From: Grant on 20 Oct 2009 02:58

On Tue, 20 Oct 2009 11:52:51 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:

>In my case, file2, i.e. the IP database, is huge (373375 lines), and I
>find that your revised awk script above omits some of the IP addresses
>from file1 in the output.

In that case it's probably easier to work with decimal IPs, something
like (gawk fragment):

function cc_lookup(addr,    a, i, l, m, h)
{
    ....
    # binary search ip2c-data for country code
    split(addr, a, "."); i = ((a[1]*256+a[2])*256+a[3])*256+a[4]
    l = 1; h = ipdatsize
    while (h - l > 1) {
        m = int((l + h) / 2)
        if (ipdata_str[m] < i) { l = m } else { h = m }
    }
    if (i < ipdata_str[h]) --h
    if (i > ipdata_end[h]) return "--:unassigned"
    # return country code and country name
    return sprintf("%s:%s", ipdata_cc[h], ipname[ipdata_cc[h]])
}

Though I have a smaller lookup table of 102k records; since I'm
interested in country code lookup, adjacent blocks are merged during
database file creation.

>Considering that it's not advisable to post attachments to this
>newsgroup, I've emailed you about this issue, along with all the files
>I used and generated.

Make a very limited system that demonstrates your issues?

Grant.
-- 
http://bugsplatter.id.au

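The binary search idea in the fragment above can be tried out as a self-contained sketch. This is not Grant's code: the loop is a slightly different variant that tracks the last range whose start is at or below the address, and the file layout (space-separated start, end, country code) and the names lo_ip, hi_ip, cc and lookup are all hypothetical. It assumes the ranges are sorted by start address and do not overlap. With a sorted table, each lookup needs about log2(n) comparisons (roughly 19 for 373375 lines) instead of a scan of every range.

```shell
tmp=$(mktemp -d)
# Hypothetical range table: start, end, country code (sorted, non-overlapping).
printf '%s\n' '1.0.0.0 1.0.0.255 AU' '1.0.1.0 1.0.3.255 CN' '8.8.8.0 8.8.8.255 US' > "$tmp/ranges"
# Hypothetical addresses to look up, one per line.
printf '%s\n' '8.8.8.8' '1.0.2.7' '5.5.5.5' > "$tmp/addrs"

result=$(awk '
# Convert a dotted quad to its 32-bit value (base-256 weighting).
function ip2nr(ip,    a) {
    split(ip, a, ".")
    return ((a[1] * 256 + a[2]) * 256 + a[3]) * 256 + a[4]
}
# Binary search for the last range whose start is <= the address,
# then check that the address does not lie past the end of that range.
function lookup(ip,    i, l, h, m, found) {
    i = ip2nr(ip); l = 1; h = n; found = 0
    while (l <= h) {
        m = int((l + h) / 2)
        if (lo_ip[m] <= i) { found = m; l = m + 1 } else { h = m - 1 }
    }
    return (found && i <= hi_ip[found]) ? cc[found] : "--"
}
# First file: load the range table into arrays; n counts the ranges.
NR == FNR { n++; lo_ip[n] = ip2nr($1); hi_ip[n] = ip2nr($2); cc[n] = $3; next }
# Second file: print each address with its country code, or "--" if none.
{ print $1, lookup($1) }
' "$tmp/ranges" "$tmp/addrs")
printf '%s\n' "$result"
```

Loading the table once and then searching it for every address keeps the cost near (addresses x log2(ranges)) rather than (addresses x ranges).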
From: Hongyi Zhao on 20 Oct 2009 04:29

On Tue, 20 Oct 2009 17:58:34 +1100, Grant <g_r_a_n_t_(a)bugsplatter.id.au> wrote:

>Make a very limited system that demonstrates your issues?

I've also sent you a copy of that mail; perhaps it will give you more
information on this issue.

BTW, when the IP database is huge, the lookup takes a very long time.
Are there any ways to reduce the lookup time against the IP database?

Best regards.
-- 
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

From: Ed Morton on 20 Oct 2009 11:20

On Oct 19, 10:52 pm, Hongyi Zhao <hongyi.z...(a)gmail.com> wrote:

> In my case, file2, i.e. the IP database, is huge (373375 lines), and I
> find that your revised awk script above omits some of the IP addresses
> from file1 in the output.
>
> Considering that it's not advisable to post attachments to this
> newsgroup, I've emailed you about this issue, along with all the files
> I used and generated.

The email address I use for netnews is just a spam trap; I don't read
it. Post some SMALL sample input and the expected output from that
input, in particular including the IP addresses that are omitted from
the output.

    Ed.

From: Grant on 20 Oct 2009 16:48

On Tue, 20 Oct 2009 08:20:58 -0700 (PDT), Ed Morton <mortonspam(a)gmail.com> wrote:

>On Oct 19, 10:52 pm, Hongyi Zhao <hongyi.z...(a)gmail.com> wrote:
>> On Mon, 19 Oct 2009 11:40:51 -0500, Ed Morton <mortons...(a)gmail.com>
>> wrote:
>>
>> >Adding a "delete" and a "next" would make the script more efficient if
>> >you have a large list of IP addresses in file1 and each range in file2
>> >is distinct:
>>
>> >BEGIN{ FS="\t"; OFS="#" }
>> >function ip2nr(ip, nr,ipA) {
>> >    # aaa.bbb.ccc.ddd
>> >    split(ip,ipA,".")
>> >    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]

The weighting for converting dotquad IP to a number is 256, not 1000 --
using 1000 will skip IP addresses in your range matching. Try

    nr = ipA[1] * 2^24 + ipA[2] * 2^16 + ipA[3] * 2^8 + ipA[4]

or

    nr = ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]

instead -- the second version is speed optimised for gawk.

>> >    return nr
>> >}
>> >NR==FNR { addrs[$0] = ip2nr($0); next }
>> >FNR>1 {
>> >    start = ip2nr($1)
>> >    end = ip2nr($2)
>> >    for (ip in addrs) {
>> >        if (addrs[ip] >= start && addrs[ip] <= end) {
>> >            print ip,$3" "$4
>> >            delete addrs[ip]
>> >            next
>> >        }
>> >    }
>> >}

Grant.
-- 
http://bugsplatter.id.au
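As a worked check of the two weightings, the sketch below wraps the nested base-256 form in a small shell helper. The helper name ip2nr, its optional weight argument, and the sample address are assumptions made for illustration only. The result is printed with "%.0f" because some awk implementations clamp "%d" for values above 2^31.

```shell
# ip2nr: usage: ip2nr DOTTED_QUAD [WEIGHT]
# Prints the numeric value of the address; WEIGHT defaults to 256.
ip2nr() {
    awk -v ip="$1" -v w="${2:-256}" 'BEGIN {
        split(ip, a, ".")
        # Nested (Horner) form of a[1]*w^3 + a[2]*w^2 + a[3]*w + a[4]
        printf "%.0f\n", ((a[1] * w + a[2]) * w + a[3]) * w + a[4]
    }'
}

ip2nr 192.168.1.10        # base 256: prints 3232235786
ip2nr 192.168.1.10 1000   # base 1000: prints 192168001010
```

The base-256 result is the usual 32-bit value of the address, so it is the form that agrees with databases storing IPs as plain decimal numbers. When the address and both range bounds are converted by the same function and every octet is between 0 and 255, either weighting keeps addresses in the same order, so the weighting on its own may not account for every omitted address.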