From: joe chesak on

Guys! Thanks for the great responses, and clever perspectives on this
problem. I have tried out all of them--got 3 of 4 to work on my real
dataset.

Caleb: I had actually tried to apply Set to this problem and failed, because
I thought I should be able to point it at one column of a multi-column CSV
file, thinking of a .csv file almost as a full-fledged database table.
Your minor tweak to my code shows just how easy it is to choose between
Array and Set.
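
For anyone finding this thread later, here is a rough sketch of the Set idea as I understand it. It is not Caleb's actual code. The sample rows are made up to match the stooges example (the shared "larry" row in particular is hypothetical), and I am assuming the id sits in the first column.

```ruby
require 'csv'
require 'set'

# Hypothetical sample data; the id is assumed to be the first column.
sunday_csv = "1,larry,short\n2,curly,tall\n3,moe,meanie\n"
monday_csv = "1,larry,short\n4,shemp,greasy\n"

# Pull out only the id column of each file and collect it into a Set.
sunday_ids = CSV.parse(sunday_csv).map { |row| row[0] }.to_set
monday_ids = CSV.parse(monday_csv).map { |row| row[0] }.to_set

# Set difference gives the ids that appear on one day only.
only_sunday = sunday_ids - monday_ids
only_monday = monday_ids - sunday_ids

puts "only on sunday: #{only_sunday.to_a.join(', ')}"
puts "only on monday: #{only_monday.to_a.join(', ')}"
```

Dropping the two `.to_set` calls leaves plain Arrays, and `-` still works on those, so switching between Array and Set is close to a one-word change. With real files, `CSV.read("sunday.csv")` could stand in for `CSV.parse` on a string.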

Brian: That's a raw and fast solution. I will keep this in mind if my
datasets get too large.

Kirk: Thanks for coding that all the way through. Great idea to create an
index file and a data file in one sweep.


Joe Chesak

On Fri, Aug 6, 2010 at 11:16 PM, Brian Candler <b.candler(a)pobox.com> wrote:

> Another option: sort them by id, then walk through them together. If the
> current id on both list A and list B is the same, then skip forward on
> both. If the current id on list A is less than the current id on list B,
> then it exists only in A (so report this, and skip forward on A). And
> vice versa.
>
> The advantage of this is that it works with huge files - you can read
> them one line at a time instead of reading them all into RAM. And there
> are tools which can do an external sort efficiently, if they're not
> already sorted that way.
>
> To avoid coding this, just use the unix shell command 'join' to do the
> work for you (which you can invoke from IO.popen if you like). Just
> beware that it's very fussy about having the files correctly sorted
> first.
>
> $ join -t, -j1 -v1 sunday.csv monday.csv
> 2,curly,tall
> 3,moe,meanie
> $ join -t, -j1 -v2 sunday.csv monday.csv
> 4,shemp,greasy
>
>
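
For reference, the sorted walk Brian describes might look roughly like the sketch below if coded by hand in Ruby. This is only an illustration, not his code. The helper names (`diff_sorted`, `next_or_nil`, `id_of`) are my own, the sample rows are made up (the shared "larry" row is hypothetical), and it assumes integer ids in the first column with both inputs already sorted numerically by id. Note that `join` itself expects its own sort order, which is not necessarily numeric.

```ruby
require 'stringio'

# Return the next line from the enumerator, or nil once it is exhausted.
def next_or_nil(enum)
  enum.next.chomp
rescue StopIteration
  nil
end

# Assumption: the id is an integer in the first comma-separated field.
def id_of(line)
  line.split(',', 2).first.to_i
end

# Walk two id-sorted IO objects one line at a time.
# Returns [rows only in a, rows only in b].
def diff_sorted(io_a, io_b)
  only_a = []
  only_b = []
  a = io_a.each_line
  b = io_b.each_line
  row_a = next_or_nil(a)
  row_b = next_or_nil(b)
  while row_a || row_b
    if row_b.nil? || (row_a && id_of(row_a) < id_of(row_b))
      only_a << row_a            # id is missing from b
      row_a = next_or_nil(a)
    elsif row_a.nil? || id_of(row_a) > id_of(row_b)
      only_b << row_b            # id is missing from a
      row_b = next_or_nil(b)
    else
      row_a = next_or_nil(a)     # same id on both sides: skip both
      row_b = next_or_nil(b)
    end
  end
  [only_a, only_b]
end

# Hypothetical sample data standing in for sunday.csv and monday.csv.
sunday = StringIO.new("1,larry,short\n2,curly,tall\n3,moe,meanie\n")
monday = StringIO.new("1,larry,short\n4,shemp,greasy\n")

only_sunday, only_monday = diff_sorted(sunday, monday)
puts only_sunday
puts only_monday
```

Since it only ever holds one line from each input at a time, the same method should work with `File.open` handles on large files, apart from the two result arrays, which could be replaced by printing each row as it is found.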