Prev: External Hashing
Next: Access class from staticmethod
From: Tim Chase on 30 Apr 2010 15:45 On 04/30/2010 12:51 PM, Helmut Jarausch wrote: > I think one could apply an external hashing technique which would require only > very few disk accesses per lookup. > Unfortunately, I'm now aware of an implementation in Python. > Does anybody know about a Python implementation of external hashing? While you don't detail what you're hashing, Stephan Behnel already suggested (in the parent thread) using one of Python's native dbm modules (I just use anydbm and let it choose). The underlying implementations should be fairly efficient assuming you don't use the dumbdbm last-resort fallback). With the anydbm interface, you can implement dict/set semantics as long as you take care that everything is marshalled into and out of strings for keys/values. -tkc
From: Dave Angel on 30 Apr 2010 16:44 Helmut Jarausch wrote: > I think one could apply an external hashing technique which would require only > very few disk accesses per lookup. > Unfortunately, I'm now aware of an implementation in Python. > Does anybody know about a Python implementation of external hashing? > > Thanks, > Helmut. > > That's what databases are for. But if you're trying to streamline it further than a database, you need to give some constraints. In an earlier thread, you mentioned strings with a constant size. I don't think you ever actually said that the match string was also of the same size, but I'll assume it is. Final question is how often you'll be updating the list, versus searching it. If you do both frequently, stick to a database. On the other hand, if the list on disk is fixed, you could thoroughly optimize its makeup. Perhaps the easiest efficient form would be to have a file in two parts. The first part is the hash table, where each entry has a seek-offset and a size. And the rest of the file is just a "list" of items, sorted by hash value. If you make a hash large enough that the maximum collision list is a couple of k, then two seeks would get any record. First seek to the hash entry, then use that to decide where to seek for the data, and how much to read. Then you simply search that block in memory. You could also play games with a two-level hash, where each block of records has its own mini hash table. Doesn't seem worth it for a mere 83million records. DaveA
|
Pages: 1 Prev: External Hashing Next: Access class from staticmethod |