Sort Big File Help [Python]

From: mk on 3 Mar 2010 13:20

John Filben wrote:
> I am new to Python but have used many other (mostly dead) languages in
> the past. I want to be able to process *.txt and *.csv files. I can
> now read that and then change them as needed – mostly just take a column
> and do some if-then to create a new variable. My problem is sorting
> these files:
>
> 1.) How do I sort file1.txt by position and write out
> file1_sorted.txt; for example, if all the records are 100 bytes long and
> there is a three digit id in the position 0-2; here would be some sample
> data:
>
> a. 001JohnFilben……
>
> b. 002Joe Smith…..

Use a dictionary:

linedict = {}
for line in f:
    key = line[:3]
    linedict[key] = line[3:]  # or 'line' if you want to keep the key in the line

sortedlines = []
for key in linedict.keys().sort():
    sortedlines.append(linedict[key])

(untested)

This is the simplest approach, and probably an inefficient one. But it should work.

>
> 2.) How do I sort file1.csv by column name; for example, if all the
> records have three column headings, “id”, “first_name”, “last_name”;
> here would be some sample data:
>
> a. Id, first_name,last_name
>
> b. 001,John,Filben
>
> c. 002,Joe, Smith

This is more complicated: I would make a list of lines, where each line
is a list split according to columns (like ['001', 'John', 'Filben']),
and then I would sort this list using operator.itemgetter, like this:

import operator
lines.sort(key=operator.itemgetter(num))  # num is the column index, starting at 0

Read up on operator.*, it's very useful.
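
A minimal sketch of the approach described above, assuming the standard csv module is acceptable for splitting the columns. The function name sort_csv_by_column and the in-memory sample are made up for illustration; the header row is used to turn a column name into the index that operator.itemgetter needs. Values are compared as strings, which is fine for zero-padded ids such as 001.

```python
import csv
import io
import operator

def sort_csv_by_column(src, dst, column):
    """Copy CSV rows from src to dst, sorted by the named column."""
    # skipinitialspace drops the blank after a comma, as in "Joe, Smith"
    reader = csv.reader(src, skipinitialspace=True)
    header = next(reader)
    num = header.index(column)  # translate the column name to its position
    rows = sorted(reader, key=operator.itemgetter(num))
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

# Hypothetical sample data modelled on the question
sample = io.StringIO("id,first_name,last_name\n002,Joe, Smith\n001,John,Filben\n")
out = io.StringIO()
sort_csv_by_column(sample, out, "last_name")
print(out.getvalue())
```

With real files, src and dst would be file objects opened with newline="" as the csv documentation recommends.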

>
> 3.) What about if I have millions of records and I am processing on a
> laptop with a large external drive – basically, are there space
> considerations? What are the work arounds.

The simplest is to use something like SQLite: define a table, fill it up,
and then do SELECT with ORDER BY.
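
A rough sketch of the SQLite route, using the sqlite3 module from the standard library. The table name people, the column layout and the function name sort_with_sqlite are assumptions made for illustration. Passing a file path instead of ":memory:" keeps the data on disk, which is the point when the records do not fit in RAM.

```python
import sqlite3

def sort_with_sqlite(records, db_path=":memory:"):
    """Load (id, first_name, last_name) tuples and yield them ordered by id."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE people (id TEXT, first_name TEXT, last_name TEXT)")
        conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
        conn.commit()
        # Rows are fetched lazily, so the sorted result is never held
        # in a Python list unless the caller builds one.
        query = "SELECT id, first_name, last_name FROM people ORDER BY id"
        for row in conn.execute(query):
            yield row
    finally:
        conn.close()

# Hypothetical sample records modelled on the question
rows = list(sort_with_sqlite([("002", "Joe", "Smith"),
                              ("001", "John", "Filben")]))
print(rows)
```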

But with a million records I wouldn't worry about it; it should fit in
RAM. Observe:

>>> import sys
>>> a = {}
>>> for i in range(1000000):
...     a[i] = 'spam'*10
...
>>> sys.getsizeof(a)
25165960

So that's what, 25 MB?

Although I have to note that temporary RAM usage of the Python process on
my machine did go up to 113 MB.
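
Part of the gap between the two figures is that sys.getsizeof reports only the dict's own hash table, not the keys and values it refers to. The sketch below adds those in; rough_size is a hypothetical helper, and the values are given a distinct numeric prefix (an assumption, to stand in for real records that differ from one another). It is an approximation, not a full memory profile.

```python
import sys

def rough_size(d):
    """Approximate bytes used by a dict plus its keys and values."""
    total = sys.getsizeof(d)  # the hash table only
    seen = set()
    for key, value in d.items():
        for obj in (key, value):
            if id(obj) not in seen:  # count shared objects once
                seen.add(id(obj))
                total += sys.getsizeof(obj)
    return total

a = {}
for i in range(100000):
    # distinct string per record, unlike a shared constant
    a[i] = '%06d' % i + 'spam' * 10

print(sys.getsizeof(a), rough_size(a))
```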

Regards,
mk

From: MRAB on 3 Mar 2010 14:59

mk wrote:
[snip]
Simpler would be:

lines = f.readlines()
lines.sort(key=lambda line: line[ : 3])

or even:

lines = sorted(f.readlines(), key=lambda line: line[ : 3])
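
The original question also asked for the result to be written to file1_sorted.txt. A minimal sketch of the whole round trip, assuming the file fits in memory; the function name sort_fixed_width, the key_width parameter and the third sample record are made up for illustration.

```python
import os
import tempfile

def sort_fixed_width(src_path, dst_path, key_width=3):
    """Sort the lines of src_path by their first key_width characters."""
    with open(src_path) as src:
        lines = src.readlines()
    # Make sure the last record ends with a newline so records do not
    # fuse together once the order changes.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    lines.sort(key=lambda line: line[:key_width])
    with open(dst_path, "w") as dst:
        dst.writelines(lines)

# Small demonstration with made-up records in a temporary directory
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "file1.txt")
dst = os.path.join(workdir, "file1_sorted.txt")
with open(src, "w") as f:
    f.write("002Joe Smith\n001JohnFilben\n003Ann Jones\n")

sort_fixed_width(src, dst)
with open(dst) as f:
    print(f.read())
```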

From: Arnaud Delobelle on 3 Mar 2010 15:58

MRAB <python(a)mrabarnett.plus.com> writes:

> mk wrote:
> [snip]
> Simpler would be:
>
> lines = f.readlines()
> lines.sort(key=lambda line: line[ : 3])
>
> or even:
>
> lines = sorted(f.readlines(), key=lambda line: line[ : 3])

Or even:

lines = sorted(f)
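
A plain sort gives the same order here because the key is the first three characters of each line, and strings compare from the left. A small sketch with made-up records shows where the two calls agree and where they differ: when ids repeat, the key-based sort keeps the input order of the tied lines, while the plain sort also compares the rest of the line.

```python
import io

records = "002Joe Smith\n001JohnFilben\n003Ann Jones\n"

# Unique ids: both calls give the same result
by_key = sorted(io.StringIO(records), key=lambda line: line[:3])
by_line = sorted(io.StringIO(records))
print(by_key == by_line)

# Repeated ids: the results can differ
dupes = ["001b\n", "001a\n"]
stable = sorted(dupes, key=lambda line: line[:3])  # ties keep input order
plain = sorted(dupes)                              # ties broken by the rest
print(stable, plain)
```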

--
Arnaud

From: mk on 3 Mar 2010 16:46

John, there's an error in my program: I forgot that the list.sort() method
doesn't return the list (it sorts in place). So it should look like:

#!/usr/bin/python

def sortit(fname):
    fo = open(fname)
    linedict = {}
    for line in fo:
        key = line[:3]
        linedict[key] = line
    sortedlines = []
    keys = linedict.keys()
    keys.sort()
    for key in keys:
        sortedlines.append(linedict[key])
    return sortedlines

if __name__ == '__main__':
    sortit('testfile.txt')

MRAB's solution is obviously better, provided you know about Python's
lambda.

Regards,
mk

From: mk on 3 Mar 2010 16:52

MRAB wrote:

> [snip]
> Simpler would be:
>
> lines = f.readlines()
> lines.sort(key=lambda line: line[ : 3])
>
> or even:
>
> lines = sorted(f.readlines(), key=lambda line: line[ : 3])

Sure, but a complete newbie (that's my impression of the OP) won't
necessarily know about lambda.

I expected my solution to be slower, but it's not (on a file with
100,000 random string lines):

# time ./sort1.py

real 0m0.386s
user 0m0.372s
sys 0m0.014s

# time ./sort2.py

real 0m0.303s
user 0m0.286s
sys 0m0.017s

sort1.py:

#!/usr/bin/python

def sortit(fname):
    lines = open(fname).readlines()
    lines.sort(key = lambda x: x[:3])

if __name__ == '__main__':
    sortit('testfile.txt')

sort2.py:

#!/usr/bin/python

def sortit(fname):
    fo = open(fname)
    linedict = {}
    for line in fo:
        key = line[:3]
        linedict[key] = line
    sortedlines = []
    keys = linedict.keys()
    keys.sort()
    for key in keys:
        sortedlines.append(linedict[key])
    return sortedlines

if __name__ == '__main__':
    sortit('testfile.txt')

Any idea why? After all, I'm "manually" doing quite a lot: allocating a
key in a dict, then sorting the dict's keys, then iterating over the keys
and accessing the dict values.
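
One thing worth checking before comparing the timings: the two scripts do not necessarily do the same work. A dict keeps a single value per key, so lines that share their first three characters overwrite each other, and the dict version then sorts fewer items. Whether that happens in testfile.txt is an assumption, since the file is not shown. A small sketch with made-up records (the function names are hypothetical):

```python
def sort_with_dict(lines):
    """Dict-based approach: one line survives per three-character key."""
    linedict = {}
    for line in lines:
        linedict[line[:3]] = line  # a repeated key overwrites the earlier line
    return [linedict[key] for key in sorted(linedict)]

def sort_with_key(lines):
    """Key-function approach: every line is kept."""
    return sorted(lines, key=lambda line: line[:3])

# Two of these made-up records share the key '002'
lines = ["002Joe Smith\n", "001JohnFilben\n", "002Jane Doe\n"]
print(len(sort_with_dict(lines)), len(sort_with_key(lines)))
```

Comparing the length of each result against the number of lines in the input file would show whether the benchmark is affected.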

Regards,
mk
