From: joe chesak on 6 Aug 2010 11:51

I don't have access to the database, but every day I get a CSV dump of all
live customer listings. There are around 15,000 live listings. A record
contains 3 fields: id, name, description. Every day some new accounts are
created while others are deleted. Every day I want to compare yesterday's
dump with today's dump and print the full record of each terminated record
and each new record. I am achieving this with the script below. That last
line makes the script run very slowly.

Is there a more elegant way to compare two
arrays|hashes|sets|FasterCSV tables while carrying a few fields along for
the ride?

WORKING CODE:

require 'rubygems'
require 'fastercsv'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = mondays_ids - sundays_ids
terminated_ids = sundays_ids - mondays_ids

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}

TEST DATA:

sunday.csv
id,name,otherstuff
1,larry,bald
2,curly,tall
3,moe,meanie

monday.csv
id,name,otherstuff
1,larry,bald
4,shemp,greasy

OUTPUT:
delete,2,curly
delete,3,moe
create,4,shemp
From: Matthew K. Williams on 6 Aug 2010 11:57

On Sat, 7 Aug 2010, joe chesak wrote:

> I don't have access to the database, but every day I get a CSV dump of
> all live customer listings. There are around 15,000 live listings. A
> record contains 3 fields: id, name, description. Every day some new
> accounts are created while others are deleted. Every day I want to
> compare yesterday's dump with today's dump and print the full record of
> each terminated record and each new record. I am achieving this with the
> script below. That last line makes the script run very slowly.
>
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSV tables while carrying a few fields along
> for the ride?
>
> WORKING CODE:
> require 'rubygems'
> require 'fastercsv'
>
> sundays_dump = FasterCSV.read("./sunday.csv")
> mondays_dump = FasterCSV.read("./monday.csv")
>
> sundays_ids = sundays_dump.collect {|row| row[1]}
> mondays_ids = mondays_dump.collect {|row| row[1]}
>
> newaccount_ids = mondays_ids - sundays_ids
> terminated_ids = sundays_ids - mondays_ids
>
> sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if
> terminated_ids.include? row[1]}
> mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if
> newaccount_ids.include? row[1]}

Rather than go through both sets of data, I'd do something like this:

terminated_ids.each do |row|
  puts "delete,#{sundays_dump[row][0]},#{sundays_dump[row][1]}"
end

Same thing for the new accounts (substituting Monday, of course).

This way you're only outputting the changed records.

Matt
From: Kirk Haines on 6 Aug 2010 13:20

On Fri, Aug 6, 2010 at 9:51 AM, joe chesak <rubytalk(a)chesak.com> wrote:

> I don't have access to the database, but every day I get a CSV dump of
> all live customer listings. There are around 15,000 live listings. A
> record contains 3 fields: id, name, description. Every day some new
> accounts are created while others are deleted. Every day I want to
> compare yesterday's dump with today's dump and print the full record of
> each terminated record and each new record. I am achieving this with the
> script below. That last line makes the script run very slowly.
>
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSV tables while carrying a few fields along
> for the ride?
>
> [original script quoted above snipped]

The answer to "it's too slow" is often "don't do that, then."

If comparing arrays is too slow, don't compare arrays. Instead, do
something that's fast. Looking up data in a Hash is fast. And since your
data appears to be keyed by unique ids (though your variable names suggest
you are collecting ids while the array lookups actually use the names,
which is confusing and might be a bug), it is easy to stuff that data into
a hash for later lookup.

-----
require 'rubygems'
require 'fastercsv'

def extract_data(dumpfile)
  dump = FasterCSV.read(dumpfile)
  ids = []
  data = {}
  dump.each do |row|
    ids << row[0]
    data[row[0]] = row
  end
  [ids, data]
end

sunday_ids, sunday_data = extract_data("./sunday.csv")
monday_ids, monday_data = extract_data("./monday.csv")

newaccount_ids = monday_ids - sunday_ids
terminated_ids = sunday_ids - monday_ids

terminated_ids.each do |id|
  row = sunday_data[id]
  puts "delete,#{id},#{row[1]}"
end

newaccount_ids.each do |id|
  row = monday_data[id]
  puts "create,#{id},#{row[1]}"
end
-----

If this were my script, I'd further clean it up so that all of the logic
lives in a simple library. Then either create a generic executable that
takes the file names as arguments, or write scripts with hard-coded file
names that just call into the library. It'd be much cleaner, unless you
are absolutely sure this code will never be more than a one-off hack.

Kirk Haines
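[For reference, a minimal sketch of the "generic executable that takes the
file names as arguments" idea Kirk mentions. The script name, the
load_by_id and diff_listings helpers, and the output format are illustrative
assumptions, not something posted in the thread.]

-----
# Usage (hypothetical): ruby diff_listings.rb sunday.csv monday.csv
require 'rubygems'
require 'fastercsv'

# Build an id => row hash from one dump file (id is column 0).
def load_by_id(path)
  data = {}
  FasterCSV.read(path).each { |row| data[row[0]] = row }
  data
end

# Print one line per terminated and per new account.
def diff_listings(old_path, new_path)
  old_data = load_by_id(old_path)
  new_data = load_by_id(new_path)
  (old_data.keys - new_data.keys).each { |id| puts "delete,#{id},#{old_data[id][1]}" }
  (new_data.keys - old_data.keys).each { |id| puts "create,#{id},#{new_data[id][1]}" }
end

diff_listings(ARGV[0], ARGV[1])
-----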
From: Caleb Clausen on 6 Aug 2010 13:27

On 8/6/10, joe chesak <rubytalk(a)chesak.com> wrote:

> I don't have access to the database, but every day I get a CSV dump of
> all live customer listings. There are around 15,000 live listings. A
> record contains 3 fields: id, name, description. Every day some new
> accounts are created while others are deleted. Every day I want to
> compare yesterday's dump with today's dump and print the full record of
> each terminated record and each new record. I am achieving this with the
> script below. That last line makes the script run very slowly.
>
> Is there a more elegant way to compare two
> arrays|hashes|sets|FasterCSV tables while carrying a few fields along
> for the ride?
>
> [original script, test data, and output quoted above snipped]

Here's an idea... although Matthew's suggestion is really the better way to
go (though I think his code won't quite work... you'll have to save away
array indexes somewhere to make that approach effective). Anyway, it should
be much faster if you use Sets instead of Arrays for terminated_ids and
newaccount_ids. Like this:

require 'rubygems'
require 'fastercsv'
require 'set'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

sundays_ids = sundays_dump.collect {|row| row[1]}
mondays_ids = mondays_dump.collect {|row| row[1]}

newaccount_ids = Set[*mondays_ids - sundays_ids]
terminated_ids = Set[*sundays_ids - mondays_ids]

sundays_dump.each {|row| puts 'delete,'+row[0]+','+row[1] if terminated_ids.include? row[1]}
mondays_dump.each {|row| puts 'create,'+row[0]+','+row[1] if newaccount_ids.include? row[1]}

Turning sundays_ids and mondays_ids into Sets might also make it a tad
faster yet, but I think the above code has already achieved the lion's
share of the potential speedup.
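[As an aside, the "save away array indexes" fix Caleb alludes to for
Matthew's approach might look roughly like the sketch below. The
sunday_index and monday_index names are made up here; this is only one way
to do it.]

-----
require 'rubygems'
require 'fastercsv'

sundays_dump = FasterCSV.read("./sunday.csv")
mondays_dump = FasterCSV.read("./monday.csv")

# Map each id (column 0) to its row index in the dump it came from.
sunday_index = {}
sundays_dump.each_with_index { |row, i| sunday_index[row[0]] = i }
monday_index = {}
mondays_dump.each_with_index { |row, i| monday_index[row[0]] = i }

terminated_ids = sunday_index.keys - monday_index.keys
newaccount_ids = monday_index.keys - sunday_index.keys

# Each changed id is now a single hash lookup away from its full row.
terminated_ids.each do |id|
  row = sundays_dump[sunday_index[id]]
  puts "delete,#{row[0]},#{row[1]}"
end
newaccount_ids.each do |id|
  row = mondays_dump[monday_index[id]]
  puts "create,#{row[0]},#{row[1]}"
end
-----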
From: Brian Candler on 6 Aug 2010 17:16
Another option: sort both files by id, then walk through them together. If
the current id on list A and list B is the same, skip forward on both. If
the current id on list A is less than the current id on list B, then it
exists only in A (so report this, and skip forward on A). And vice versa.

The advantage of this is that it works with huge files - you can read them
one line at a time instead of reading them all into RAM. And there are
tools which can do an external sort efficiently, if the files aren't
already sorted that way.

To avoid coding this yourself, just use the unix shell command 'join' to do
the work for you (which you can invoke from IO.popen if you like). Just
beware that it's very fussy about having the files correctly sorted first.

$ join -t, -j1 -v1 sunday.csv monday.csv
2,curly,tall
3,moe,meanie

$ join -t, -j1 -v2 sunday.csv monday.csv
4,shemp,greasy

--
Posted via http://www.ruby-forum.com/.
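[For completeness, a rough Ruby version of the sorted-merge walk Brian
describes, reading one line at a time, might look like the sketch below. It
assumes both files are already sorted by id in the same lexical order that
join expects, and the variable names are illustrative.]

-----
sunday = File.open("sunday.csv")
monday = File.open("monday.csv")
sunday.gets; monday.gets                  # skip the header rows

a = sunday.gets
b = monday.gets
while a || b
  id_a = a && a[/\A[^,]*/]                # id is everything before the first comma
  id_b = b && b[/\A[^,]*/]
  if b.nil? || (a && id_a < id_b)
    puts "delete,#{a.chomp}"              # id only in Sunday's dump
    a = sunday.gets
  elsif a.nil? || id_b < id_a
    puts "create,#{b.chomp}"              # id only in Monday's dump
    b = monday.gets
  else
    a = sunday.gets                       # same id in both: skip forward on both
    b = monday.gets
  end
end
-----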