From: Ed Morton on 25 Oct 2009 09:54

Grant wrote:
> On Sun, 25 Oct 2009 05:04:43 +0000 (UTC), Kaz Kylheku <kkylheku(a)gmail.com> wrote:
>
>> On 2009-10-25, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:
>>> Another issue is that: if the file1 is a huge one, say including
>>> several thousand entries in it, the above process will be time
>>> consuming. So is it possible to revise the above awk script with
>>> multithread technology to improve the efficiency?
>>
>> What kind of machine are you using that a file of a mere several thousand
>> entries causes a performance problem, yet threads somehow help?
>
> Dunno about threads being applicable, but his lookup file has
> 300k records -- no point re-reading that for each of the thousand
> input lines, which is what Ed's solution does.

No it doesn't, it reads each file once. It reads the first one into
memory and starts deleting records from it as they're found in the
second one.

	Ed.
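For anyone skimming the thread, here is the read-once idiom Ed is
describing, reduced to a minimal sketch (the file names and the
"print matching lines" task are illustrative, not from the thread).
While awk is still reading the first file, NR==FNR holds, so its
records can be cached and the second file streamed against the
in-memory copy:

    # cache every record of the first file; NR==FNR holds only there
    NR == FNR { seen[$0]; next }
    # stream the second file against the in-memory copy
    $0 in seen { print }

run as:

    awk -f once.awk file1 file2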
From: Ed Morton on 25 Oct 2009 10:18

Hongyi Zhao wrote:
> On Wed, 21 Oct 2009 02:47:54 -0500, Ed Morton <mortonspam(a)gmail.com>
> wrote:
>
>> $ cat tst.awk
>> BEGIN{ FS="\t"; OFS="#"; scale=(scale ? scale : 256) }
>> function ip2nr(ip,   nr, ipA) {
>>     # aaa.bbb.ccc.ddd
>>     split(ip, ipA, ".")
>>     nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
>>     return nr
>> }
>> NR==FNR { addrs[$0] = ip2nr($0); next }
>> FNR>1 {
>>     start = ip2nr($1)
>>     end = ip2nr($2)
>>     for (ip in addrs) {
>>         if ((addrs[ip] >= start) && (addrs[ip] <= end)) {
>>             print ip, $3" "$4
>>             delete addrs[ip]
>>         }
>>     }
>> }
>
> Another issue is that: if the file1 is a huge one, say including
> several thousand entries in it, the above process will be time
> consuming. So is it possible to revise the above awk script with
> multithread technology to improve the efficiency?
>
> Thanks in advance.

Don't know about multi-threading, but if your file of IP addresses is
huge and your file of ranges and names isn't, then this should be a lot
more efficient:

BEGIN{ FS="\t"; OFS="#"; scale=256 }
function ip2nr(ip,   nr, ipA) {
    # aaa.bbb.ccc.ddd
    split(ip, ipA, ".")
    nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
    return nr
}
NR==FNR {
    start[FNR] = ip2nr($1)
    end[FNR] = ip2nr($2)
    name[FNR] = $3" "$4
    range = FNR
    next
}
{
    curr = ip2nr($0)
    for (idx=1; idx<=range; idx++) {
        if ((curr >= start[idx]) && (curr <= end[idx])) {
            print $0, name[idx]
            next
        }
    }
    notFound[$0]
}
END {
    for (ip in notFound) {
        print ip, "Not Found." | "cat>&2"
    }
}

Run it as:

    awk -f whatever.awk file2 file1

Note the switch in order of input files.

You could change the search algorithm (for (idx=...)) to binary search
or something if you know any properties of "file2" that you could take
advantage of.

	Ed.
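As a quick sanity check of the ip2nr() packing (an illustration, not
from the thread): with scale=256, the address 1.2.3.4 becomes
((1*256 + 2)*256 + 3)*256 + 4 = 16909060, i.e. the dotted quad read as
a base-256 number:

    $ echo '1.2.3.4' | awk '{ split($0, a, ".")
          print (((a[1]*256 + a[2])*256 + a[3])*256 + a[4]) }'
    16909060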
From: Ed Morton on 25 Oct 2009 10:40

Ed Morton wrote:
> Hongyi Zhao wrote:
[snip]
> Don't know about multi-threading, but if your file of IP addresses is
> huge and your file of ranges and names isn't, then this should be a lot
> more efficient:
[snip]
> NR==FNR {
>     start[FNR] = ip2nr($1)
>     end[FNR] = ip2nr($2)
>     name[FNR] = $3" "$4
>     range = FNR

Change the above 4 lines to:

    start[range] = ip2nr($1)
    end[range] = ip2nr($2)
    name[range] = $3" "$4
    range++

to account for the first line of file2 containing a header rather than
data.

>     next
> }
[snip rest of script]
>
> Run it as:
>
>     awk -f whatever.awk file2 file1
>
> Note the switch in order of input files.
>
> You could change the search algorithm (for (idx=...)) to binary search
> or something if you know any properties of "file2" that you could take
> advantage of.
>
> Ed.
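An equivalent fix, sketched here for comparison rather than quoted from
the thread, guards the loader instead, so that range always equals the
number of data records and the search bounds stay 1..range:

    NR == FNR {
        if (FNR > 1) {                  # skip file2's header line
            start[++range] = ip2nr($1)  # data records indexed 1..range
            end[range]     = ip2nr($2)
            name[range]    = $3" "$4
        }
        next
    }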
From: Grant on 25 Oct 2009 12:25

On Sun, 25 Oct 2009 09:18:46 -0500, Ed Morton <mortonspam(a)gmail.com> wrote:
>Hongyi Zhao wrote:
[snip]
>Don't know about multi-threading, but if your file of IP addresses is
>huge and your file of ranges and names isn't, then this should be a lot
>more efficient:
[snip]
>{
>    curr = ip2nr($0)

And it's in this part that I use the binary search instead of your
linear scan, to replace an average of 150k compares with ~19 compares
(log2 of 300k is about 18) :)
 - - -
>    for (idx=1; idx<=range; idx++) {
>        if ((curr >= start[idx]) && (curr <= end[idx])) {
>            print $0, name[idx]
>            next
>        }
 - - -
    # binary search to suit Ed's script:

    lo = 1; hi = range
    while (hi - lo > 1) {
        mid = int((lo + hi) / 2)
        if (start[mid] < curr) { lo = mid } else { hi = mid }
    }
    if (curr < start[hi]) --hi

    if (curr > end[hi]) { notFound[$0] } else { print $0, name[hi] }
 - - -
>    }
>    notFound[$0]
>}
[snip]
>You could change the search algorithm (for (idx=...)) to binary search
>or something if you know any properties of "file2" that you could take
>advantage of.

I think the OP's data to be looked up is just a list of several
thousand IP addresses, so the binary search above would be closer to
the OP's solution?

Binary search assumes the lookup file (the one with 300k records) is
sorted.

Where I do a similar task with IPs, the database is normalised to two
files:

$ cat er-ip2c
+----------------+
| ip2c-index     |
|----------------|   +--------------+
|*IP block start*|   | ip2c-names   |
| IP block end   |   +--------------+
| Country code   |---|*Country code*|
+----------------+   | Country name |
                     +--------------+

I don't know if the OP has repetitive data in their lookup table (third
field) that could be normalised like this; normalised lookup tables only
save some memory in this type of application.

Grant.
-- 
http://bugsplatter.id.au
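Since the binary search depends on file2 being ordered by block start,
here is one way to impose that ordering, sketched under the assumption
of GNU coreutils and the tab-separated file2 layout from the thread
(the header line is dropped by the FNR > 1 guard):

    $ awk 'BEGIN { FS = OFS = "\t" }
           FNR > 1 { split($1, a, ".")   # numeric sort key from $1
                     print (((a[1]*256 + a[2])*256 + a[3])*256 + a[4]), $0 }
          ' file2 |
      sort -n |     # order blocks by numeric start address
      cut -f2-      # strip the key, leaving the original records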
From: Grant on 25 Oct 2009 12:38
On Mon, 26 Oct 2009 03:25:57 +1100, Grant <g_r_a_n_t_(a)bugsplatter.id.au> wrote:

[snip]

>I think the OP's data to be looked up is just a list of several
>thousand IP addresses, so the binary search above would be closer to
>the OP's solution?
>
>Binary search assumes the lookup file (the one with 300k records) is
>sorted.

And assumes IP blocks are non-overlapping! Gaps in IP blocks return
the unknown.

>Where I do a similar task with IPs, the database is normalised to two
>files:
[snip]
>I don't know if the OP has repetitive data in their lookup table (third
>field) that could be normalised like this; normalised lookup tables only
>save some memory in this type of application.

Context where I'm doing a similar task to what the OP is after:

....
FILENAME == ip2c_index {
    # record format: <block start> <block end> <cc>
    ipdata_str[++ipdatsize] = $1
    ipdata_end[ipdatsize] = $2
    ipdata_cc[ipdatsize] = $3
    next
}
FILENAME == ip2c_names {
    # record format: <cc>:<country name>
    split($0, a, ":")
    ipname[a[1]] = a[2]
    next
}
....
function cc_lookup(addr,   a, i, l, m, h)
{
    ....
    # binary search ip2c-data for country code
    split(addr, a, ".")
    i = ((a[1]*256 + a[2])*256 + a[3])*256 + a[4]
    l = 1; h = ipdatsize
    while (h - l > 1) {
        m = int((l + h) / 2)
        if (ipdata_str[m] < i) { l = m } else { h = m }
    }
    if (i < ipdata_str[h]) --h
    if (i > ipdata_end[h]) return "--:unassigned"

    # return country code and country name
    return sprintf("%s:%s", ipdata_cc[h], ipname[ipdata_cc[h]])
}

Grant.
-- 
http://bugsplatter.id.au
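To show how Grant's fragments hang together, here is a sketched main
rule and invocation; the script name ip2c.awk, the -v variable values,
and the one-IP-per-line data file ips.txt are assumptions, not details
from the thread:

    # ip2c.awk (sketch): the two loader rules and cc_lookup() above,
    # plus a main rule that annotates each input IP with "cc:name"
    { print $0, cc_lookup($0) }

    $ awk -v ip2c_index=ip2c-index -v ip2c_names=ip2c-names \
          -f ip2c.awk ip2c-index ip2c-names ips.txt

The -v variables let the FILENAME tests recognise which argument is
being read, so the two lookup files must be listed before the data.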