Constructing and using a large lexicon in a program [Python]

From: Michael Torrie on 2 Aug 2010 13:52

On 08/02/2010 11:46 AM, Majdi Sawalha wrote:
> I am developing a morphological analyzer that depends on a large lexicon. I
> have a Lexicon class that reads a text file and constructs a dictionary of
> the lexicon entries.
> Another class uses the Lexicon class to check whether a word is found in the
> lexicon. The problem is that this takes a long time: each time an object of
> that class is created, it needs to call the lexicon many times, and each
> call reconstructs the lexicon. Is there any way to construct the lexicon
> once during the execution of the program, so that the other modules search
> the already-constructed lexicon?

Can you not create a module that, upon import, initializes this lexicon
as a module attribute? Modules are by definition singleton objects,
which is the pattern that you probably need. Any other module could
import this module and get the already-created lexicon object.
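
A minimal sketch of that idea follows. The module name lexicon_store, the
function get_lexicon and the one-entry-per-line file format are all made up
for illustration. It also builds the lexicon on the first request rather than
at import time, which is a small variation on the suggestion above, so that
importing the module does not fail when the file is missing.

```python
# lexicon_store.py: hypothetical module name, used only for illustration.
# Module-level state lives for the whole program run, because Python keeps
# one copy of each imported module.
_cache = {}  # maps file path -> dict built from that file


def _read_entries(path):
    """Parse 'word<whitespace>value' lines into a dict (assumed file format)."""
    entries = {}
    with open(path) as instream:
        for line in instream:
            parts = line.strip().split(None, 1)
            if parts:  # skip blank lines
                entries[parts[0]] = parts[1] if len(parts) > 1 else ""
    return entries


def get_lexicon(path):
    """Return the lexicon for path, parsing the file only on the first call."""
    if path not in _cache:
        _cache[path] = _read_entries(path)
    return _cache[path]
```

Any other module would then do "import lexicon_store" and call
lexicon_store.get_lexicon(path); every caller gets the same dictionary object
back, and the text file is parsed only once per run.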

From: Peter Otten on 3 Aug 2010 04:00

Majdi Sawalha wrote:

> I am developing a morphological analyzer that depends on a large lexicon.
> I have a Lexicon class that reads a text file and constructs a
> dictionary of the lexicon entries.
> Another class uses the Lexicon class to check whether a word is found
> in the lexicon. The problem is that this takes a long time: each time an
> object of that class is created, it needs to call the lexicon many
> times, and each call reconstructs the lexicon.
> Is there any way to construct the lexicon once during the execution of
> the program, so that the other modules search the already-constructed
> lexicon?

Normally you just structure your application accordingly. Load the dictionary
once and then pass it around explicitly:

import loader
import user_one
import user_two

filename = ...
large_dict = loader.load(filename)

user_one.use_dict(large_dict)
user_two.use_dict(large_dict)
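
The same pattern applied to classes might look like the sketch below. The
class names Lexicon and Analyzer and the method is_known are assumptions made
for illustration, not the original poster's code; the point is only that the
analyzer receives a lexicon that already exists instead of building one in
its constructor.

```python
class Lexicon(object):
    """Thin wrapper around a dict of entries (hypothetical class)."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def __contains__(self, word):
        return word in self._entries


class Analyzer(object):
    """Receives an already-built lexicon instead of constructing its own."""

    def __init__(self, lexicon):
        self.lexicon = lexicon

    def is_known(self, word):
        return word in self.lexicon


# Build the lexicon once ...
lexicon = Lexicon({"alpha": "value for alpha", "beta": "BETA"})

# ... and hand the same object to every analyzer.
first = Analyzer(lexicon)
second = Analyzer(lexicon)
```

Creating many Analyzer objects is then cheap, because each one only stores a
reference to the shared lexicon.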

You may also try a caching scheme to avoid parsing the text file unless it has
changed. Here's a simple example:

$ cat cachedemo.py
import cPickle as pickle
import os

def load_from_text(filename):
    # replace with your code
    with open(filename) as instream:
        return dict(line.strip().split(None, 1) for line in instream)

def load(filename, cached=None):
    if cached is None:
        cached = filename + ".pickle"
    if os.path.exists(cached) and os.path.getmtime(filename) <= os.path.getmtime(cached):
        print "using pickle"
        with open(cached, "rb") as instream:
            return pickle.load(instream)
    else:
        print "loading from text"
        d = load_from_text(filename)
        with open(cached, "wb") as out:
            pickle.dump(d, out, pickle.HIGHEST_PROTOCOL)
        return d

if __name__ == "__main__":
    if not os.path.exists("tmp.txt"):
        print "creating example data"
        with open("tmp.txt", "w") as out:
            out.write("""\
alpha value for alpha
beta BETA
gamma GAMMA
""")
    print load("tmp.txt")

$ python cachedemo.py
creating example data
loading from text
{'alpha': 'value for alpha', 'beta': 'BETA', 'gamma': 'GAMMA'}
$ python cachedemo.py
using pickle
{'alpha': 'value for alpha', 'beta': 'BETA', 'gamma': 'GAMMA'}
$ echo 'delta modified text' >> tmp.txt
$ python cachedemo.py
loading from text
{'alpha': 'value for alpha', 'beta': 'BETA', 'gamma': 'GAMMA', 'delta': 'modified text'}
$ python cachedemo.py
using pickle
{'alpha': 'value for alpha', 'beta': 'BETA', 'gamma': 'GAMMA', 'delta': 'modified text'}

Peter
