From: lbrtchx on 4 Feb 2010 12:06

~
I was wondering if you know of a more efficient way to stratify values in a column of a file.
~
Say you have a file with data on each line and you would like to know how many times each value was found in it. The file could look like this:
~
// __ f00.txt
~
6456
qweRt
aAbCC
aAbCC
aabCC
96
qwert
96
645
aAbCC
~
1) the way to go (I think) would be to first sort the initial file:
~
knoppix@Microknoppix:~/tmp$ sort f00.txt -o f02.txt
knoppix@Microknoppix:~/tmp$ cat f02.txt
645
6456
96
96
aAbCC
aAbCC
aAbCC
aabCC
qweRt
qwert
~
2) then compare the sorted file line by line, counting repeated adjacent lines, with a snippet looking like:
~
#!/bin/sh
## reads the file given as the first argument line by line and writes
## "value",count pairs for the repeated values to the second argument
# input file
_IFl="$1"
# output file
_OFl="$2"
rm -f ${_OFl}
ORIGIFS=$IFS
IFS=$(echo -en "\n\b")
exec 3<&0
exec 0<$_IFl
_ln00=""
_icntttl=0
_icnt=1
#
while read line
do
 _ln02=$line
 # trace: current and previous line
 echo \"${_ln02}\" \"${_ln00}\"
 if [[ ${_ln00} == ${_ln02} ]]; then
  _icnt=`expr $_icnt + 1`
 else
  if [[ $_icnt -gt 1 ]]; then
   echo \"$_ln00\",$_icnt >> ${_OFl}
   _icnt=1
  fi
 fi
 _ln00=$_ln02
 _icntttl=`expr $_icntttl + 1`
done
# flush the last group (otherwise a run of duplicates ending at EOF would be lost)
if [[ $_icnt -gt 1 ]]; then
 echo \"$_ln00\",$_icnt >> ${_OFl}
fi
echo "~"
echo "// __ output file: ${_OFl}"
cat ${_OFl}
exec 0<&3
IFS=$ORIGIFS
~
sh ./comp00.sh f02.txt f04.txt
~
3) to get as result:
~
knoppix@Microknoppix:~/tmp$ cat f04.txt
"96",2
"aAbCC",3
~
lbrtchx
From: Ben Bacarisse on 4 Feb 2010 12:18

lbrtchx(a)gmail.com writes:

> ~
> I was wondering if you know of a more efficient way to stratify values in a column of a file.
> ~
> Say you have a file with data on each line and you would like to know how many times each value was found in it. The file could look like this:
> ~
> // __ f00.txt
> ~
> 6456
> qweRt
> aAbCC
> aAbCC
> aabCC
> 96
> qwert
> 96
> 645
> aAbCC
> ~

I'd use awk:

awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'

(the name c is not a good one but it does make the code a one-liner).

--
Ben.
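For reference, the one-liner run against the sample f00.txt from the first post gives something like the following (a sketch only: the file name is the one used above, and the order of the output lines is not guaranteed, since awk's for (k in c) visits keys in no particular order):

$ awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }' f00.txt
96 2
aAbCC 3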
From: Stephane CHAZELAS on 4 Feb 2010 12:27

2010-02-04, 17:06(+00), lbrtchx(a)gmail.com:
[...]
> Say you have a file with data on each line and you would like
> to know how many times the data was found in it. File could
> look like this:
[...]

Maybe uniq -c?

sort < file | uniq -c

sorted by number of occurrences:

sort < file | uniq -c | sort -rn

If you only want the duplicated ones:

sort < file | uniq -c | sort -n | awk '$1>1,0'

or with GNU uniq:

sort < file | uniq -D | uniq -c

--
Stéphane
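Against the same sample data (substituting f00.txt for "file"), the first pipeline would produce roughly the following; the exact padding before the counts and the collation order depend on the uniq and sort implementations and on the locale:

$ sort < f00.txt | uniq -c
      1 645
      1 6456
      2 96
      3 aAbCC
      1 aabCC
      1 qweRt
      1 qwert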
From: Albretch Mueller on 4 Feb 2010 13:16

~
OK, we have three algorithms which need two passes through the original file (even if Ben's looks like a one-liner ;-) and mine looks lengthier).
~
If you use a high-level programming language, say Java, you will effectively be looping twice anyway, once for the sort and again for the accumulation/counting. Even if you recreate the illusion of having only one loop, for example by using a hash table, the hash table would still internally do the sorting part.
~
I can't recall now exactly how you can log what the OS is doing in these three cases, but sometimes what looks like a shorter way to do things is not the most efficient regarding speed and footprint.
~
Database algorithms do this all the time; I am curious as to how they do it. I mean, whether they actually use any slick optimizations instead of going the procedural monkey way as I did.
~
lbrtchx
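For what it's worth, here is a minimal sketch of that single-loop, hash-table style counting done in the shell itself, without an external sort (this assumes bash 4 or later for associative arrays, reuses the sample file name f00.txt, and makes no claim about how the shell implements the table internally):

#!/bin/bash
# count occurrences of each line in a single pass using an associative array
declare -A c
while IFS= read -r line; do
  c["$line"]=$(( ${c["$line"]:-0} + 1 ))
done < f00.txt
# print only the values seen more than once, in the "value",count format used above
for k in "${!c[@]}"; do
  if [ "${c[$k]}" -gt 1 ]; then
    printf '"%s",%s\n' "$k" "${c[$k]}"
  fi
done

The keys come back in no particular order, so the output would still need a final sort if ordering matters.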
From: Seebs on 4 Feb 2010 15:52

On 2010-02-04, Ben Bacarisse <ben.usenet(a)bsb.me.uk> wrote:
> I'd use awk:
>
> awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'
>
> (the name c is not a good one but it does make the code a one-liner).

That was my first thought, actually. But then it occurred to me that you could also do

sort | uniq -c

which is probably (?) faster. I'm actually not sure; I think it depends on the size of the file and the number of duplicates. If you have only a few words, which occur millions of times, it will probably be slower. For the vast majority of real-world data sets, I imagine that both will finish fast enough that you don't actually have to wait for the prompt to come back.

-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / usenet-nospam(a)seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
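One rough way to check that guess on a bigger input (a sketch only: big.txt is a made-up test file, and real timings will vary with the machine, the data and the locale):

# build a file with 1000 distinct values, each repeated about 1000 times
seq 1000000 | awk '{print $1 % 1000}' > big.txt
# time the two approaches; in bash, `time` covers the whole pipeline
time sort big.txt | uniq -c > /dev/null
time awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }' big.txt > /dev/null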