From: Karin Lagesen on 29 Apr 2010 05:38

Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?

TIA,
Karin
From: Peter Otten on 29 Apr 2010 06:06

Karin Lagesen wrote:

> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

Do you need all matches or do you just have to know whether there are
any? Can the search string be shorter than 14 characters?

One simple improvement over the list may be using one big string instead
of the 83 million short ones and then search it using string methods.

Peter
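[Editor's note: a minimal sketch of Peter's "one big string" idea, in
Python 3 syntax even though the thread targets 2.6. The sample strings
are made up; the real ones would come from Karin's data. Since every
record is exactly 14 characters, a hit from str.find() only counts if it
starts on a record boundary.]

```python
# Hypothetical stand-in data; the real 83 million strings would be
# joined the same way (83e6 * 14 bytes is roughly 1.1 GB as one string,
# with none of the per-object overhead of 83 million small strings).
strings = ["AACGTTGCAACGTT", "GGGGCCCCAAAATT", "TTTTAAAACCCCGG"]
big = "".join(strings)  # one contiguous block, no separators

def contains(needle, haystack=big, width=14):
    """Return True if needle is one of the fixed-width records."""
    assert len(needle) == width
    pos = haystack.find(needle)
    while pos != -1:
        if pos % width == 0:      # aligned on a record boundary
            return True
        pos = haystack.find(needle, pos + 1)
    return False

print(contains("GGGGCCCCAAAATT"))  # True
print(contains("CGTTGGGGCCCCAA"))  # False (spans a record boundary)
```

The alignment check is what makes this correct: without it, a 14-character
window straddling two adjacent records would be reported as a false match.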
From: Stefan Behnel on 29 Apr 2010 06:23

Karin Lagesen, 29.04.2010 11:38:

> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

Try one of the dbm modules in the stdlib. They give you dictionary-like
lookups on top of a persistent database.

Stefan
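[Editor's note: a sketch of Stefan's suggestion using Python 3's dbm
module (the Python 2.6 equivalent is anydbm). The filename and sample
keys are made up. Because the keys live on disk, the 83 million strings
never need to fit in RAM at once.]

```python
import dbm

# Build the database once; the values are unused, so store empty bytes.
with dbm.open("strings.db", "c") as db:
    for s in [b"AACGTTGCAACGTT", b"GGGGCCCCAAAATT"]:  # stand-in data
        db[s] = b""

# Later, lookups are dictionary-like and hit the disk, not RAM:
with dbm.open("strings.db", "r") as db:
    print(b"AACGTTGCAACGTT" in db)  # True
    print(b"TTTTAAAACCCCGG" in db)  # False
```

Building the database is a one-time cost; after that, each membership
test is a single keyed lookup rather than a scan.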
From: Mark Tolonen on 29 Apr 2010 07:00

"Karin Lagesen" <karin.lagesen(a)bio.uio.no> wrote in message
news:416f727c6f5b0edb932b425db9579808.squirrel(a)webmail.uio.no...

> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

You may be overthinking it. How fast does it need to be? I generated the
following file:

>>> f=open('test.txt','wt')
>>> for i in xrange(83000000):
...     f.write('0123456789ABCD\n')
...
>>> f.close()

It took about 15 seconds to search that file with a command line search
tool (Windows, and a slow IDE hard drive):

findstr xyz test.txt

It took about twice that to search via Python with 'in':

>>> for line in open('test.txt'):
...     if 'xyz' in line:
...         print line
...

Reading only a line at a time has the advantage of using very little
memory. Storing 83 million 14-character strings in memory requires a
minimum of about 1.2GB, not counting overhead for lists/sets/dictionaries.

-Mark
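[Editor's note: one caveat about Mark's scan: 'xyz' in line is a
substring test, while Karin wants exact membership of a 14-character
record. A Python 3 variant comparing whole stripped lines is sketched
below; the tiny sample file stands in for Mark's 83-million-line one.]

```python
def found_in_file(needle, path="test.txt"):
    """Stream the file line by line; compare whole records, not substrings."""
    with open(path) as f:
        return any(line.rstrip("\n") == needle for line in f)

# Tiny stand-in file so the sketch runs end to end:
with open("test.txt", "w") as f:
    f.write("0123456789ABCD\nAACGTTGCAACGTT\n")

print(found_in_file("AACGTTGCAACGTT"))  # True
print(found_in_file("0123456789"))      # False: a substring of a line,
                                        # but not a whole record
```

Like Mark's version, this holds only one line in memory at a time, so
memory use stays flat no matter how large the file grows.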
From: MRAB on 29 Apr 2010 10:02

Karin Lagesen wrote:
> Hello.
>
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a dictionary.
> I know that finding things in a set and a dictionary is very much faster
> than working with a list, so I tried those first. However, I run out of
> memory building both the set and the dictionary, so what I seem to be left
> with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?
>
You could sort the list and then use a binary chop (Google can tell you
what that is if you don't know already).
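[Editor's note: a sketch of MRAB's binary chop using the stdlib bisect
module, in Python 3 syntax. The sample data is made up, and this still
assumes the sorted list fits in memory; the payoff is O(log n)
comparisons per lookup instead of a linear scan with 'in'.]

```python
import bisect

# Stand-in for the sorted 83-million-string list:
strings = sorted(["GGGGCCCCAAAATT", "AACGTTGCAACGTT", "TTTTAAAACCCCGG"])

def present(needle, haystack=strings):
    """Binary search: find the insertion point, then check for a match."""
    i = bisect.bisect_left(haystack, needle)
    return i < len(haystack) and haystack[i] == needle

print(present("AACGTTGCAACGTT"))  # True
print(present("CCCCCCCCCCCCCC"))  # False
```

For 83 million entries that is about 27 comparisons per query, versus up
to 83 million for a linear scan of the list.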