From: Karin Lagesen on 29 Apr 2010 05:38

Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?

TIA,
Karin
From: Peter Otten on 29 Apr 2010 06:06

Karin Lagesen wrote:

> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

Do you need all matches or do you just have to know whether there are
any? Can the search string be shorter than 14 characters?

One simple improvement over the list may be using one big string instead
of the 83 million short ones and then search it using string methods.

Peter
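[Editor's note: a minimal sketch of Peter's "one big string" idea, in
Python 3 syntax even though the thread targets 2.6. The sample strings
are made up; the real ones would come from Karin's data. Since every
record is exactly 14 characters, a hit from str.find() only counts if it
starts on a record boundary.]

```python
# Hypothetical stand-in data; the real 83 million strings would be
# joined the same way (83e6 * 14 bytes is roughly 1.1 GB as one string,
# with none of the per-object overhead of 83 million small strings).
strings = ["AACGTTGCAACGTT", "GGGGCCCCAAAATT", "TTTTAAAACCCCGG"]
big = "".join(strings)  # one contiguous block, no separators

def contains(needle, haystack=big, width=14):
    """Return True if needle is one of the fixed-width records."""
    assert len(needle) == width
    pos = haystack.find(needle)
    while pos != -1:
        if pos % width == 0:      # aligned on a record boundary
            return True
        pos = haystack.find(needle, pos + 1)
    return False

print(contains("GGGGCCCCAAAATT"))  # True
print(contains("CGTTGGGGCCCCAA"))  # False (spans a record boundary)
```

The alignment check is what makes this correct: without it, a 14-character
window straddling two adjacent records would be reported as a false match.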
From: Stefan Behnel on 29 Apr 2010 06:23

Karin Lagesen, 29.04.2010 11:38:

> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

Try one of the dbm modules in the stdlib. They give you dictionary-like
lookups on top of a persistent database.

Stefan
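[Editor's note: a sketch of Stefan's suggestion using Python 3's dbm
module (the Python 2.6 equivalent is anydbm). The filename and sample
keys are made up. Because the keys live on disk, the 83 million strings
never need to fit in RAM at once.]

```python
import dbm

# Build the database once; the values are unused, so store empty bytes.
with dbm.open("strings.db", "c") as db:
    for s in [b"AACGTTGCAACGTT", b"GGGGCCCCAAAATT"]:  # stand-in data
        db[s] = b""

# Later, lookups are dictionary-like and hit the disk, not RAM:
with dbm.open("strings.db", "r") as db:
    print(b"AACGTTGCAACGTT" in db)  # True
    print(b"TTTTAAAACCCCGG" in db)  # False
```

Building the database is a one-time cost; after that, each membership
test is a single keyed lookup rather than a scan.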
From: Mark Tolonen on 29 Apr 2010 07:00

"Karin Lagesen" <karin.lagesen(a)bio.uio.no> wrote in message
news:416f727c6f5b0edb932b425db9579808.squirrel(a)webmail.uio.no...

> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

You may be overthinking it. How fast does it need to be? I generated the
following file:

>>> f=open('test.txt','wt')
>>> for i in xrange(83000000):
...     f.write('0123456789ABCD\n')
...
>>> f.close()

It took about 15 seconds to search that file with a command line search
tool (Windows, and a slow IDE hard drive):

findstr xyz test.txt

It took about twice that to search via Python with 'in':

>>> for line in open('test.txt'):
...     if 'xyz' in line:
...         print line
...

Reading only a line at a time has the advantage of using very little
memory. Storing 83 million 14-character strings in memory requires a
minimum of about 1.2GB, not counting overhead for lists/sets/dictionaries.

-Mark
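[Editor's note: one caveat about Mark's scan: 'xyz' in line is a
substring test, while Karin wants exact membership of a 14-character
record. A Python 3 variant comparing whole stripped lines is sketched
below; the tiny sample file stands in for Mark's 83-million-line one.]

```python
def found_in_file(needle, path="test.txt"):
    """Stream the file line by line; compare whole records, not substrings."""
    with open(path) as f:
        return any(line.rstrip("\n") == needle for line in f)

# Tiny stand-in file so the sketch runs end to end:
with open("test.txt", "w") as f:
    f.write("0123456789ABCD\nAACGTTGCAACGTT\n")

print(found_in_file("AACGTTGCAACGTT"))  # True
print(found_in_file("0123456789"))      # False: a substring of a line,
                                        # but not a whole record
```

Like Mark's version, this holds only one line in memory at a time, so
memory use stays flat no matter how large the file grows.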
From: MRAB on 29 Apr 2010 10:02

Karin Lagesen wrote:
> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?
>
You could sort the list and then use a binary chop (Google can tell you
what that is if you don't know already).
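[Editor's note: a sketch of MRAB's binary chop using the stdlib bisect
module, in Python 3 syntax. The sample data is made up, and this still
assumes the sorted list fits in memory; the payoff is O(log n)
comparisons per lookup instead of a linear scan with 'in'.]

```python
import bisect

# Stand-in for the sorted 83-million-string list:
strings = sorted(["GGGGCCCCAAAATT", "AACGTTGCAACGTT", "TTTTAAAACCCCGG"])

def present(needle, haystack=strings):
    """Binary search: find the insertion point, then check for a match."""
    i = bisect.bisect_left(haystack, needle)
    return i < len(haystack) and haystack[i] == needle

print(present("AACGTTGCAACGTT"))  # True
print(present("CCCCCCCCCCCCCC"))  # False
```

For 83 million entries that is about 27 comparisons per query, versus up
to 83 million for a linear scan of the list.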