From: mk on 3 Mar 2010 13:20

John Filben wrote:
> I am new to Python but have used many other (mostly dead) languages in
> the past. I want to be able to process *.txt and *.csv files. I can
> now read that and then change them as needed – mostly just take a column
> and do some if-then to create a new variable. My problem is sorting
> these files:
>
> 1.) How do I sort file1.txt by position and write out
> file1_sorted.txt; for example, if all the records are 100 bytes long and
> there is a three digit id in the position 0-2; here would be some sample
> data:
>
> a. 001JohnFilben……
>
> b. 002Joe Smith…..

Use a dictionary:

    linedict = {}
    for line in f:
        key = line[:3]
        linedict[key] = line[3:]  # or alternatively 'line' if you want to include key in the line anyway

    sortedlines = []
    for key in linedict.keys().sort():
        sortedlines.append(linedict[key])

(untested)

This is the simplest, though probably inefficient, approach. But it should work.

> 2.) How do I sort file1.csv by column name; for example, if all the
> records have three column headings, “id”, “first_name”, “last_name”;
> here would be some sample data:
>
> a. Id, first_name,last_name
>
> b. 001,John,Filben
>
> c. 002,Joe, Smith

This is more complicated: I would make a list of lines, where each line is a list split according to columns (like ['001', 'John', 'Filben']), and then I would sort this list using operator.itemgetter, like this:

    lines.sort(key = operator.itemgetter(num))  # where num is the number of column, starting with 0 of course

Read up on operator.*, it's very useful.

> 3.) What about if I have millions of records and I am processing on a
> laptop with a large external drive – basically, are there space
> considerations? What are the work arounds.

The simplest is to use something like SQLite: define a table, fill it up, and then do SELECT with ORDER BY.

But with a million records I wouldn't worry about it, it should fit in RAM. Observe:

    >>> import sys
    >>> a={}
    >>> for i in range(1000000):
    ...     a[i] = 'spam'*10
    ...
    >>> sys.getsizeof(a)
    25165960

So that's what, 25 MB?

Although I have to note that TEMPORARY RAM usage in the Python process on my machine did go up to 113MB.

Regards,
mk
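The list-of-lists idea for question 2 can be sketched roughly as follows. This is a Python 3 sketch, not code from the thread: it assumes the standard csv module does the splitting, that the first row holds the column headings, and that values compare correctly as strings (true for zero-padded ids like '001'). The helper name sort_csv_rows and the sample data are made up for illustration.

```python
import csv
import io
from operator import itemgetter


def sort_csv_rows(text, column):
    """Return (header, rows) with rows sorted by the named column.

    Hypothetical helper, not from the thread. Values are compared as
    strings, so numeric columns need zero padding to sort as expected.
    """
    # skipinitialspace drops the blank after a comma, as in "Joe, Smith"
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)            # keep the heading row out of the sort
    num = header.index(column)       # column name -> position, starting at 0
    rows = sorted(reader, key=itemgetter(num))
    return header, rows


sample = "id,first_name,last_name\n002,Joe, Smith\n001,John,Filben\n"
header, rows = sort_csv_rows(sample, "id")
print(header)
print(rows)
```

Looking the position up with header.index() is what turns "sort by column name" into the numeric index that operator.itemgetter expects.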
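The SQLite route for question 3 might look something like this with the standard sqlite3 module. It is only a sketch under stated assumptions: the table name records, its two text columns, and the helper name sorted_by_id are all invented here, and the in-memory database is used only to keep the demo self-contained. For data that does not fit in RAM, a file path would be passed instead.

```python
import sqlite3


def sorted_by_id(records, db_path=":memory:"):
    """Load (id, rest) pairs into SQLite and yield them back ordered by id.

    Hypothetical helper, not from the thread. Pass a file path as
    db_path to keep the table on disk instead of in memory.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE records (id TEXT, rest TEXT)")
        conn.executemany("INSERT INTO records VALUES (?, ?)", records)
        conn.commit()
        # Let the database do the sorting; rows are streamed one at a time.
        for row in conn.execute("SELECT id, rest FROM records ORDER BY id"):
            yield row
    finally:
        conn.close()


rows = list(sorted_by_id([("002", "Joe Smith"), ("001", "JohnFilben")]))
print(rows)
```

Because the function is a generator, the sorted rows can be written out one by one without building the whole result as a Python list.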
From: MRAB on 3 Mar 2010 14:59

mk wrote:
> John Filben wrote:
>> I am new to Python but have used many other (mostly dead) languages in
>> the past. I want to be able to process *.txt and *.csv files. I can
>> now read that and then change them as needed – mostly just take a
>> column and do some if-then to create a new variable. My problem is
>> sorting these files:
>>
>> 1.) How do I sort file1.txt by position and write out
>> file1_sorted.txt; for example, if all the records are 100 bytes long
>> and there is a three digit id in the position 0-2; here would be some
>> sample data:
>>
>> a. 001JohnFilben……
>>
>> b. 002Joe Smith…..
>
> Use a dictionary:
>
>     linedict = {}
>     for line in f:
>         key = line[:3]
>         linedict[key] = line[3:]  # or alternatively 'line' if you want to
>                                   # include key in the line anyway
>
>     sortedlines = []
>     for key in linedict.keys().sort():
>         sortedlines.append(linedict[key])
>
> (untested)
>
> This is the simplest, and probably inefficient approach. But it should
> work.
>
[snip]

Simpler would be:

    lines = f.readlines()
    lines.sort(key=lambda line: line[ : 3])

or even:

    lines = sorted(f.readlines(), key=lambda line: line[ : 3])
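The original question also asked for the result to be written to file1_sorted.txt. Building on the key-based sort suggested above, a complete read, sort, and write round trip could be sketched like this in Python 3. The function name sort_file_by_prefix and its width parameter are illustrative only, and the sketch assumes every record, including the last one, ends with a newline.

```python
import os
import tempfile


def sort_file_by_prefix(src, dst, width=3):
    """Sort the lines of src by their first `width` characters into dst."""
    with open(src) as f:
        lines = f.readlines()
    # list.sort is stable: records with equal ids keep their input order
    lines.sort(key=lambda line: line[:width])
    with open(dst, "w") as out:
        out.writelines(lines)


# Demo in a throwaway directory; the file names mirror the question.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "file1.txt")
dst = os.path.join(workdir, "file1_sorted.txt")
with open(src, "w") as f:
    f.write("002Joe Smith\n001JohnFilben\n")

sort_file_by_prefix(src, dst)
with open(dst) as f:
    result = f.read()
print(result)
```

This reads the whole file into memory, which matches the approach in the thread and should be fine for files of the size discussed here.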
From: Arnaud Delobelle on 3 Mar 2010 15:58

MRAB <python(a)mrabarnett.plus.com> writes:

> mk wrote:
>> John Filben wrote:
>>> I am new to Python but have used many other (mostly dead) languages
>>> in the past. I want to be able to process *.txt and *.csv files.
>>> I can now read that and then change them as needed – mostly just
>>> take a column and do some if-then to create a new variable. My
>>> problem is sorting these files:
>>>
>>> 1.) How do I sort file1.txt by position and write out
>>> file1_sorted.txt; for example, if all the records are 100 bytes
>>> long and there is a three digit id in the position 0-2; here would
>>> be some sample data:
>>>
>>> a. 001JohnFilben……
>>>
>>> b. 002Joe Smith…..
>>
>> Use a dictionary:
>>
>>     linedict = {}
>>     for line in f:
>>         key = line[:3]
>>         linedict[key] = line[3:]  # or alternatively 'line' if you want
>>                                   # to include key in the line anyway
>>
>>     sortedlines = []
>>     for key in linedict.keys().sort():
>>         sortedlines.append(linedict[key])
>>
>> (untested)
>>
>> This is the simplest, and probably inefficient approach. But it
>> should work.
>>
> [snip]
> Simpler would be:
>
>     lines = f.readlines()
>     lines.sort(key=lambda line: line[ : 3])
>
> or even:
>
>     lines = sorted(f.readlines(), key=lambda line: line[ : 3])

Or even:

    lines = sorted(f)

--
Arnaud
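Sorting whole lines gives the same id order as sorting on the first three characters, because the id is a fixed-width prefix. The two differ only in how records with the same id are ordered, as this small Python 3 illustration shows (the sample records are made up):

```python
lines = ["002Zed\n", "001Bob\n", "001Amy\n"]

by_whole_line = sorted(lines)
by_id_only = sorted(lines, key=lambda line: line[:3])

print(by_whole_line)  # ties on the id are ordered by the rest of the line
print(by_id_only)     # ties on the id keep their input order (stable sort)
```

If ids are unique, as in the sample data from the question, both calls produce the same list.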
From: mk on 3 Mar 2010 16:46

John, there's an error in my program, I forgot that list.sort() method doesn't return the list (it sorts in place). So it should look like:

    #!/usr/bin/python

    def sortit(fname):
        fo = open(fname)
        linedict = {}
        for line in fo:
            key = line[:3]
            linedict[key] = line
        sortedlines = []
        keys = linedict.keys()
        keys.sort()
        for key in keys:
            sortedlines.append(linedict[key])
        return sortedlines

    if __name__ == '__main__':
        sortit('testfile.txt')

MRAB's solution is obviously better, provided you know about Python's lambda.

Regards,
mk
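Two caveats about the dictionary approach are worth illustrating. The script above targets Python 2; in Python 3, dict.keys() returns a view with no sort() method, so sorted(linedict) is the usual equivalent. Also, a dictionary holds one value per key, so records that share an id overwrite one another. A small Python 3 sketch, with the function name sortit_dict and the sample records invented for the demonstration:

```python
def sortit_dict(lines):
    """Dict-based sort from the post above, adjusted for Python 3."""
    linedict = {}
    for line in lines:
        linedict[line[:3]] = line   # a repeated key overwrites the earlier line
    # sorted() over a dict iterates its keys and returns them as a sorted list
    return [linedict[key] for key in sorted(linedict)]


records = ["002Joe Smith\n", "001JohnFilben\n", "002Jane Doe\n"]
result = sortit_dict(records)
print(len(records), len(result))
print(result)
```

With unique ids nothing is lost, but any duplicate id silently drops a record, which the list-based sorts do not.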
From: mk on 3 Mar 2010 16:52

MRAB wrote:
> [snip]
> Simpler would be:
>
>     lines = f.readlines()
>     lines.sort(key=lambda line: line[ : 3])
>
> or even:
>
>     lines = sorted(f.readlines(), key=lambda line: line[ : 3])

Sure, but a complete newbie (I have this impression about OP) doesn't have to know about lambda.

I expected my solution to be slower, but it's not (on a file with 100,000 random string lines):

    # time ./sort1.py

    real 0m0.386s
    user 0m0.372s
    sys 0m0.014s

    # time ./sort2.py

    real 0m0.303s
    user 0m0.286s
    sys 0m0.017s

sort1.py:

    #!/usr/bin/python

    def sortit(fname):
        lines = open(fname).readlines()
        lines.sort(key = lambda x: x[:3])

    if __name__ == '__main__':
        sortit('testfile.txt')

sort2.py:

    #!/usr/bin/python

    def sortit(fname):
        fo = open(fname)
        linedict = {}
        for line in fo:
            key = line[:3]
            linedict[key] = line
        sortedlines = []
        keys = linedict.keys()
        keys.sort()
        for key in keys:
            sortedlines.append(linedict[key])
        return sortedlines

    if __name__ == '__main__':
        sortit('testfile.txt')

Any idea why? After all, I'm "manually" doing quite a lot: allocating key in a dict, then sorting dict's keys, then iterating over keys and accessing dict value.

Regards,
mk
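One possible explanation, not confirmed anywhere in the thread, is that the two scripts are not doing the same amount of work. With 100,000 random lines and a three-character key, many lines are likely to share a key, and the dictionary keeps only one line per key, so sort2.py may be sorting far fewer items than sort1.py. The Python 3 sketch below counts distinct keys in generated data; the alphabet (lowercase letters) and the line length of 20 are assumptions, since the contents of testfile.txt are not shown.

```python
import random
import string

random.seed(0)  # fixed seed so the run is repeatable

# Assumed test data: 100,000 lines of 20 random lowercase letters each.
lines = ["".join(random.choices(string.ascii_lowercase, k=20)) + "\n"
         for _ in range(100_000)]

# Each distinct 3-character prefix becomes exactly one dictionary entry.
distinct_keys = len({line[:3] for line in lines})
print(len(lines), "lines,", distinct_keys, "distinct 3-character keys")
```

Under these assumptions there can be at most 26**3 = 17,576 distinct keys, so the dictionary version would sort at most that many strings and return fewer lines than it read. A fair timing comparison would need input in which every key is unique.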