Prev: How to ls -lh on a FreeBSD?
Next: shell idiom to kick off background jobs and wait for completion
From: Grant on 26 Oct 2009 14:02 On Mon, 26 Oct 2009 07:36:13 -0500, Ed Morton <mortonspam(a)gmail.com> wrote: >Hongyi Zhao wrote: >> On Mon, 26 Oct 2009 15:08:37 +1100, Grant >> <g_r_a_n_t_(a)bugsplatter.id.au> wrote: >> >>> I forgot about sorting IPs :( >>> >>> sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 file2 > file2-sorted >> >> In my case, whether I do the above sorting process or not, your scipt >> will take the same time approximately, > >But it won't produce the same output. The binary search algorithm relies >on the input data being sorted. Sorry for confusion. I made a boo-boo thinking that the file wasn't sorted 'cos I did a straight sort + diff -- when I did a proper sort and diff, I discoverd OP's file was already sorted -- thus okay for the binary search. The problem now is whether awk can read the raw data file -- my impression is that the binary data file contains indirect pointers to the text strings, the java code OP referenced relies on random file access read the data. This is a bit tricky to teach an awk script to do? So awk would need to read the file and build in-memory data structures then dump proper unix style plain text data file/s in the END block. Possible? Yes. Easy? Dunno... Grant. -- http://bugsplatter.id.au
From: Grant on 26 Oct 2009 19:01 On Mon, 26 Oct 2009 08:03:55 -0500, Ed Morton <mortonspam(a)gmail.com> wrote: >Hongyi Zhao wrote: >> On Mon, 26 Oct 2009 07:36:13 -0500, Ed Morton <mortonspam(a)gmail.com> >> wrote: >> >>>>> I forgot about sorting IPs :( >>>>> >>>>> sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 file2 > file2-sorted >>>> In my case, whether I do the above sorting process or not, your scipt >>>> will take the same time approximately, >>> But it won't produce the same output. The binary search algorithm relies >>> on the input data being sorted. >> >> Do you mean that I should always do the sorting process like the above >> firstly? > >If you want to do a binary search, then yes you need to sort the input >data first. The alternative is to not sort the data and do a linear (or >other uninformed) search, so you can test to see which takes longer >(sort+binary vs linear) and make your choice. That's why I wrote a server daemon to hold database in memory, not pay the data load time for each casual query. Of course which way to go depends if one is processing some log or data file, or making casual queries, for example, a log file pretty-printer or web .cgi script. I have both pretty-printer and .cgi scripts running. An hourly cron job processing a log file will check if the daemon is running and load the database files if required. Grant. -- http://bugsplatter.id.au
From: John W. Krahn on 27 Oct 2009 00:38 Ed Morton wrote: > Hongyi Zhao wrote: >> On Mon, 26 Oct 2009 07:36:13 -0500, Ed Morton <mortonspam(a)gmail.com> >> wrote: >> >>>>> I forgot about sorting IPs :( >>>>> >>>>> sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 file2 > file2-sorted >>>> In my case, whether I do the above sorting process or not, your scipt >>>> will take the same time approximately, >>> But it won't produce the same output. The binary search algorithm >>> relies on the input data being sorted. >> >> Do you mean that I should always do the sorting process like the above >> firstly? > > If you want to do a binary search, then yes you need to sort the input > data first. The alternative is to not sort the data and do a linear (or > other uninformed) search, so you can test to see which takes longer > (sort+binary vs linear) and make your choice. Or you can use a database (Berkeley DB for example) which means no sort and no linear search. John -- The programmer is fighting against the two most destructive forces in the universe: entropy and human stupidity. -- Damian Conway
From: Grant on 27 Oct 2009 17:58 On Mon, 26 Oct 2009 20:52:44 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote: >On Mon, 26 Oct 2009 22:20:27 +1100, Grant ><g_r_a_n_t_(a)bugsplatter.id.au> wrote: > >>Google translate does a fairly good job with Chinese :) Unfortunately >>the diagrams have Chinese text embedded in the images -- so I cannot >>follow the data layout :( > >A quick search in google give me the following codes related to the >format of qqwry.dat, you can checkout it by the following command: > >svn checkout http://qqwry.googlecode.com/svn/trunk/ qqwry-read-only > >Then you can obtain the following subdirectory: >qqwry-read-only\libqqwry > >I hope the source files on the above folder can give you some hints. What's wrong with using the nali* applications included in that source? >If not, I'll translate the webpage for you on which the format of >qqwry.dat is given. I've just about reversed engineered the data format from the source code, but awk is too slow loading a binary data file. The in-place file binary search written in C (that you have now) is very efficient (apart from bugs). I can't see why you want to reinvent the wheel here? Grant. -- http://bugsplatter.id.au
From: Hongyi Zhao on 27 Oct 2009 21:21
On Wed, 28 Oct 2009 08:58:42 +1100, Grant <g_r_a_n_t_(a)bugsplatter.id.au> wrote: >What's wrong with using the nali* applications included in that source? > >>If not, I'll translate the webpage for you on which the format of >>qqwry.dat is given. > >I've just about reversed engineered the data format from the source >code, but awk is too slow loading a binary data file. The in-place >file binary search written in C (that you have now) is very efficient >(apart from bugs). I can't see why you want to reinvent the wheel >here? OK, Thanks a lot. I just want to konw whether awk will do this more efficient or not. Best regards. -- ..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :. |