From: joe chesak on 9 Aug 2010 05:18

Guys!

Thanks for the great responses and the clever perspectives on this problem. I have tried out all of them and got 3 of 4 to work on my real dataset.

Caleb: I had actually tried to apply Set to this problem and failed, because I thought I should be able to point it at one column of a multi-column csv file, thinking of a .csv file almost as a full-fledged database table. Your minor tweak to my code shows just how easy it is to choose between Array and Set.

Brian: That's a raw and fast solution. I will keep this in mind if my datasets get too large.

Kirk: Thanks for coding that all the way through. Great idea to create an index file and a data file in one sweep.

Joe Chesak

On Fri, Aug 6, 2010 at 11:16 PM, Brian Candler <b.candler(a)pobox.com> wrote:
> Another option: sort them by id, then walk through them together. If the
> current id on both list A and list B is the same, then skip forward on
> both. If the current id on list A is less than the current id on list B,
> then it exists only in A (so report this, and skip forward on A). And
> vice versa.
>
> The advantage of this is that it works with huge files - you can read
> them one line at a time instead of reading them all into RAM. And there
> are tools which can do an external sort efficiently, if they're not
> already sorted that way.
>
> To avoid coding this, just use the unix shell command 'join' to do the
> work for you (which you can invoke from IO.popen if you like). Just
> beware that it's very fussy about having the files correctly sorted
> first.
>
> $ join -t, -j1 -v1 sunday.csv monday.csv
> 2,curly,tall
> 3,moe,meanie
> $ join -t, -j1 -v2 sunday.csv monday.csv
> 4,shemp,greasy
> --
> Posted via http://www.ruby-forum.com/.
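
For anyone who would rather code the sorted walk that Brian describes than shell out to 'join', here is a minimal sketch of it in Ruby. It is only an illustration, not anyone's posted solution. The names diff_sorted, next_or_nil and key are made up for this example. It assumes the id is the first comma-separated field, that ids are unique within each file, and that both inputs are already sorted by id as plain strings (the same kind of ordering 'join' is fussy about). The row "1,larry,short" is invented sample data so that the two lists have something in common.

```ruby
# Return the next item from an enumerator, or nil when it is exhausted.
def next_or_nil(enum)
  enum.next
rescue StopIteration
  nil
end

# Assumption: the id is the first comma-separated field of each line.
def key(line)
  line.chomp.split(',', 2).first
end

# Walk two id-sorted line sources in step.
# Returns [lines only in a, lines only in b].
# a and b can be Arrays of lines or enumerators such as File.foreach(path),
# so only one line from each source is held at a time.
def diff_sorted(a, b)
  ea = a.each
  eb = b.each
  only_a = []
  only_b = []
  ra = next_or_nil(ea)
  rb = next_or_nil(eb)
  until ra.nil? && rb.nil?
    if rb.nil? || (ra && key(ra) < key(rb))
      only_a << ra               # smaller id on A: exists only in A
      ra = next_or_nil(ea)
    elsif ra.nil? || key(rb) < key(ra)
      only_b << rb               # smaller id on B: exists only in B
      rb = next_or_nil(eb)
    else
      ra = next_or_nil(ea)       # same id on both: skip forward on both
      rb = next_or_nil(eb)
    end
  end
  [only_a, only_b]
end

# Invented sample rows standing in for sunday.csv and monday.csv.
sunday = ["1,larry,short\n", "2,curly,tall\n", "3,moe,meanie\n"]
monday = ["1,larry,short\n", "4,shemp,greasy\n"]

only_sunday, only_monday = diff_sorted(sunday, monday)
puts only_sunday
puts only_monday
```

With real files you could pass File.foreach("sunday.csv") and File.foreach("monday.csv") instead of the arrays. The sketch collects the differences in two arrays for convenience; for truly huge files you would print or write each line as it is found instead.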