From: w_a_x_man on 4 Feb 2010 15:57

On Feb 4, 12:16 pm, Albretch Mueller <lbrt...(a)gmail.com> wrote:
> ~
> OK, we have three algorithms which need two passes through the
> original file (even if Ben's looks like a one liner ;-)

Wrong.

> and mine looks
> lengthier).
> ~
> if you use a high level programming language, say java, you will be
> effectively looping twice anyway, once for the sort and another for
> the accumulation/counting. Even if you recreate the illusion of having
> only one loop, for example by using a hash table, the hash table would
> still internally do the sorting part

Wrong.
From: Ben Bacarisse on 4 Feb 2010 17:45

Seebs <usenet-nospam(a)seebs.net> writes:
> On 2010-02-04, Ben Bacarisse <ben.usenet(a)bsb.me.uk> wrote:
>> I'd use awk:
>>
>>   awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'
>>
>> (the name c is not a good one but it does make the code a one-liner).
>
> That was my first thought, actually.  But then it occurred to me that
> you could also do
>
>   sort | uniq -c

Except that it does not do what was requested. :-)  (Neither does mine,
in fact, but I decided that adding OFS="" and print k, ", ", c[k] just
confused matters.)  It can be fixed by further filtering (I did not know
about uniq -D in GNU uniq, so I'd never have thought of that one), but
one obvious way to filter counts > 1 is to pipe through awk '$1 > 1'
and, well, once you are running awk you might as well do it all in awk.
I suppose one could do

  sort | uniq -c | grep -v '^ *1 '

but that seems a bit fussy.

<snip>
--
Ben.
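For readers following along, here is a runnable sketch of the two approaches being compared (the file name keys.txt and its sample contents are made up for illustration):

```shell
# Sample data: the first field of each line is the key being counted.
printf '%s\n' apple banana apple cherry banana apple > keys.txt

# One pass with awk: count every key, then report those seen more than once.
awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k]}' keys.txt

# The sort | uniq -c route needs the extra filter Ben mentions:
# keep only lines whose count (field 1 of uniq -c output) exceeds 1.
sort keys.txt | uniq -c | awk '$1 > 1 {print $2, $1}'
```

Both pipelines report apple 3 and banana 2 and omit cherry, though the order of the awk for-in loop's output is unspecified, so the two listings may differ in line order.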
From: Ed Morton on 4 Feb 2010 18:59

On Feb 4, 12:16 pm, Albretch Mueller <lbrt...(a)gmail.com> wrote:
> ~
> OK, we have three algorithms which need two passes through the
> original file (even if Ben's looks like a one liner ;-)

No, Ben's doesn't require 2 passes of the file. He wrote:

  awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'

That loop in the END section runs just once per unique $1 in the file,
not once per line. You could further reduce the iterations in the END
section to loop over only the $1s that appeared more than once:

  awk 'c[$1]++{m[$1]} END {for (k in m) print k, c[k] }'

but I doubt there'd be a noticeable difference in execution time.

Ed.

> and mine looks
> lengthier).
> ~
> if you use a high level programming language, say java, you will be
> effectively looping twice anyway, once for the sort and another for
> the accumulation/counting. Even if you recreate the illusion of having
> only one loop, for example by using a hash table, the hash table would
> still internally do the sorting part
> ~
> I can't recall now exactly how it is you can log what the OS is doing
> in these three cases, but sometimes what looks like a shorter way to
> do things is not the most efficient regarding speed and footprint
> ~
> Database algorithms do this all the time. I am curious as to how they
> do it. I mean if they actually use any slick optimizations instead of
> going the procedural monkey way as I did
> ~
> lbrtchx
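Ed's refinement can be run side by side with the original on a small made-up input. The trick is that the post-increment c[$1]++ is 0 (false) on a key's first occurrence and nonzero (true) from the second on, so m[] only ever gets entries for duplicated keys:

```shell
# Made-up sample: x appears 3 times, y twice, z once.
printf '%s\n' x y x z y x > keys.txt

# Original: the END loop visits every unique key and tests each count.
awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k]}' keys.txt

# Refinement: the action {m[$1]} runs only from a key's second
# occurrence on, so the END loop walks just the duplicated keys.
awk 'c[$1]++{m[$1]} END {for (k in m) print k, c[k]}' keys.txt
```

Both commands print x 3 and y 2 (in some unspecified order) and skip z.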
From: Kenny McCormack on 5 Feb 2010 08:45

In article
<3b09f956-70c6-4109-bb86-59b8dcf55925(a)l26g2000yqd.googlegroups.com>,
Ed Morton <mortonspam(a)gmail.com> wrote:
>On Feb 4, 12:16 pm, Albretch Mueller <lbrt...(a)gmail.com> wrote:
>> ~
>> OK, we have three algorithms which need two passes through the
>> original file (even if Ben's looks like a one liner ;-)
>
>No, Ben's doesn't require 2 passes of the file. He wrote:

Right. I wondered why the poster got confused on that point. Maybe he
never learned AWK.

>  awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'
>
>That loop in the END section runs just once per unique $1 in the file,
>not once per line. You could further reduce the iterations in the END
>section to loop over only the $1s that appeared more than once:
>
>  awk 'c[$1]++{m[$1]} END {for (k in m) print k, c[k] }'

Cute. I like it. I've done exactly this several times in my career
(build up the counts, then print the ones where the count is more than
1), and I always thought there might be a more clever way to do it.

>but I doubt there'd be a noticeable difference in execution time.

True. But artistry is what counts.
From: Janis on 5 Feb 2010 09:57

On 4 Feb., 19:16, Albretch Mueller <lbrt...(a)gmail.com> wrote:
> ~
> OK, we have three algorithms which need two passes through the
> original file (even if Ben's looks like a one liner ;-) and mine looks
> lengthier).

This is partly wrong, and partly an, umm, euphemism.

Don't use your code! Or fix all the problems first. What does your
upthread posted code do if called as, say...

  sh ./comp00.sh whatever "a file called like this"

or even (by accident or maliciously)...

  sh ./comp00.sh whatever "-rf *"    ## DON'T try this !!!

In short: quote your variables where necessary. Other (uncommented)
hints: [[...]] and #!/bin/sh, and probably also == depending on
/bin/sh; echo; while read line; ...etc.

BTW, I always wonder why one writes

  # input file
  _IFl="$1"

instead of

  input_file="$1"

Anyway.

Janis

> ~
> if you use a high level programming language, say java, you will be
> effectively looping twice anyway, once for the sort and another for
> the accumulation/counting. Even if you recreate the illusion of having
> only one loop, for example by using a hash table, the hash table would
> still internally do the sorting part
> ~
> I can't recall now exactly how it is you can log what the OS is doing
> in these three cases, but sometimes what looks like a shorter way to
> do things is not the most efficient regarding speed and footprint
> ~
> Database algorithms do this all the time. I am curious as to how they
> do it. I mean if they actually use any slick optimizations instead of
> going the procedural monkey way as I did
> ~
> lbrtchx
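Janis's quoting point in a runnable form (the variable name f and the file name here are made up to illustrate the general rule, not taken from the OP's actual script):

```shell
# A name containing spaces: without quotes, the shell splits the
# expansion into separate words.
f="a file called like this"
touch "$f"

# Unquoted: $f expands to five arguments, so ls looks for five files
# named "a", "file", "called", "like", and "this" -- and fails.
ls $f 2>/dev/null || echo "unquoted expansion failed"

# Quoted: "$f" stays a single argument, so the real file is found.
ls "$f"

# The same rule applies to positional parameters inside a script:
#   input_file="$1"        # and later:  sort "$input_file"
```

The "-rf *" example in the post is the same failure mode with higher stakes: an unquoted expansion can be split and glob-expanded into arguments the script never intended to pass.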