From: w_a_x_man on 4 Feb 2010 15:57

On Feb 4, 12:16 pm, Albretch Mueller <lbrt...(a)gmail.com> wrote:
> ~
> OK, we have three algorithms which need two passes through the
> original file (even if Ben's looks like a one liner ;-)

Wrong.

> and mine looks
> lengthier).
> ~
> if you use a high level programming language, say java, you will be
> effectively looping twice anyway, once for the sort and another for
> the accumulation/counting. Even if you recreate the illusion of having
> only one loop, for example by using a hash table, the hash table would
> still internally do the sorting part

Wrong.
From: Ben Bacarisse on 4 Feb 2010 17:45

Seebs <usenet-nospam(a)seebs.net> writes:
> On 2010-02-04, Ben Bacarisse <ben.usenet(a)bsb.me.uk> wrote:
>> I'd use awk:
>>
>>   awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'
>>
>> (the name c is not a good one but it does make the code a one-liner).
>
> That was my first thought, actually.  But then it occurred to me that
> you could also do
>
>   sort | uniq -c

Except that it does not do what was requested. :-)  (Neither does mine,
in fact, but I decided that adding OFS="" and print k, ", ", c[k] just
confused matters.)  It can be fixed by further filtering (I did not know
about uniq -D in GNU uniq, so I'd never have thought of that one), but
one obvious way to filter counts > 1 is to pipe through awk '$1 > 1'
and, well, once you are running awk you might as well do it all in awk.
I suppose one could do

  sort | uniq -c | grep -v '^ *1 '

but that seems a bit fussy.

<snip>
--
Ben.
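For readers following along, here is a runnable sketch of the two approaches being compared (the file name keys.txt and its sample contents are made up for illustration):

```shell
# Sample data: the first field of each line is the key being counted.
printf '%s\n' apple banana apple cherry banana apple > keys.txt

# One pass with awk: count every key, then report those seen more than once.
awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k]}' keys.txt

# The sort | uniq -c route needs the extra filter Ben mentions:
# keep only lines whose count (field 1 of uniq -c output) exceeds 1.
sort keys.txt | uniq -c | awk '$1 > 1 {print $2, $1}'
```

Both pipelines report apple 3 and banana 2 and omit cherry, though the order of the awk for-in loop's output is unspecified, so the two listings may differ in line order.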
From: Ed Morton on 4 Feb 2010 18:59

On Feb 4, 12:16 pm, Albretch Mueller <lbrt...(a)gmail.com> wrote:
> ~
> OK, we have three algorithms which need two passes through the
> original file (even if Ben's looks like a one liner ;-)

No, Ben's doesn't require 2 passes of the file. He wrote:

  awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'

That loop in the END section runs just once per unique $1 in the file,
not once per line. You could further reduce the iterations in the END
section to loop over only the $1s that appeared more than once:

  awk 'c[$1]++{m[$1]} END {for (k in m) print k, c[k] }'

but I doubt there'd be a noticeable difference in execution time.

Ed.

> and mine looks
> lengthier).
> ~
> if you use a high level programming language, say java, you will be
> effectively looping twice anyway, once for the sort and another for
> the accumulation/counting. Even if you recreate the illusion of having
> only one loop, for example by using a hash table, the hash table would
> still internally do the sorting part
> ~
> I can't recall now exactly how it is you can log what the OS is doing
> in these three cases, but sometimes what looks like a shorter way to
> do things is not the most efficient regarding speed and footprint
> ~
> Database algorithms do this all the time. I am curious as to how they
> do it. I mean if they actually use any slick optimizations instead of
> going the procedural monkey way as I did
> ~
> lbrtchx
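Ed's refinement can be run side by side with the original on a small made-up input. The trick is that the post-increment c[$1]++ is 0 (false) on a key's first occurrence and nonzero (true) from the second on, so m[] only ever gets entries for duplicated keys:

```shell
# Made-up sample: x appears 3 times, y twice, z once.
printf '%s\n' x y x z y x > keys.txt

# Original: the END loop visits every unique key and tests each count.
awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k]}' keys.txt

# Refinement: the action {m[$1]} runs only from a key's second
# occurrence on, so the END loop walks just the duplicated keys.
awk 'c[$1]++{m[$1]} END {for (k in m) print k, c[k]}' keys.txt
```

Both commands print x 3 and y 2 (in some unspecified order) and skip z.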
From: Kenny McCormack on 5 Feb 2010 08:45

In article
<3b09f956-70c6-4109-bb86-59b8dcf55925(a)l26g2000yqd.googlegroups.com>,
Ed Morton <mortonspam(a)gmail.com> wrote:
>On Feb 4, 12:16 pm, Albretch Mueller <lbrt...(a)gmail.com> wrote:
>> ~
>> OK, we have three algorithms which need two passes through the
>> original file (even if Ben's looks like a one liner ;-)
>
>No, Ben's doesn't require 2 passes of the file. He wrote:

Right. I wondered why the poster got confused on that point. Maybe he
never learned AWK.

>  awk '{c[$1]++} END {for (k in c) if (c[k] > 1) print k, c[k] }'
>
>That loop in the END section runs just once per unique $1 in the file,
>not once per line. You could further reduce the iterations in the END
>section to loop over only the $1s that appeared more than once:
>
>  awk 'c[$1]++{m[$1]} END {for (k in m) print k, c[k] }'

Cute. I like it. I've done exactly this several times in my career
(build up the counts, then print the ones where the count is more than
1), and I always thought there might be a more clever way to do it.

>but I doubt there'd be a noticeable difference in execution time.

True. But artistry is what counts.
From: Janis on 5 Feb 2010 09:57

On 4 Feb., 19:16, Albretch Mueller <lbrt...(a)gmail.com> wrote:
> ~
> OK, we have three algorithms which need two passes through the
> original file (even if Ben's looks like a one liner ;-) and mine looks
> lengthier).

This is partly wrong, and partly an, umm, euphemism.

Don't use your code! Or fix all the problems first. What does your
upthread posted code do if called as, say...

  sh ./comp00.sh whatever "a file called like this"

or even (by accident or maliciously)...

  sh ./comp00.sh whatever "-rf *"    ## DON'T try this !!!

In short: quote your variables where necessary. Other (uncommented)
hints: [[...]] and #!/bin/sh, and probably also == depending on
/bin/sh; echo; while read line; ...etc.

BTW, I always wonder why one writes

  # input file
  _IFl="$1"

instead of

  input_file="$1"

Anyway.

Janis

> ~
> if you use a high level programming language, say java, you will be
> effectively looping twice anyway, once for the sort and another for
> the accumulation/counting. Even if you recreate the illusion of having
> only one loop, for example by using a hash table, the hash table would
> still internally do the sorting part
> ~
> I can't recall now exactly how it is you can log what the OS is doing
> in these three cases, but sometimes what looks like a shorter way to
> do things is not the most efficient regarding speed and footprint
> ~
> Database algorithms do this all the time. I am curious as to how they
> do it. I mean if they actually use any slick optimizations instead of
> going the procedural monkey way as I did
> ~
> lbrtchx
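Janis's quoting point in a runnable form (the variable name f and the file name here are made up to illustrate the general rule, not taken from the OP's actual script):

```shell
# A name containing spaces: without quotes, the shell splits the
# expansion into separate words.
f="a file called like this"
touch "$f"

# Unquoted: $f expands to five arguments, so ls looks for five files
# named "a", "file", "called", "like", and "this" -- and fails.
ls $f 2>/dev/null || echo "unquoted expansion failed"

# Quoted: "$f" stays a single argument, so the real file is found.
ls "$f"

# The same rule applies to positional parameters inside a script:
#   input_file="$1"        # and later:  sort "$input_file"
```

The "-rf *" example in the post is the same failure mode with higher stakes: an unquoted expansion can be split and glob-expanded into arguments the script never intended to pass.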