Prev: freewrap is awesome!
Next: Tcl and .NET
From: Nick Hounsome on 2 Dec 2009 00:09 On 1 Dec, 19:01, Georgios Petasis <peta...(a)iit.demokritos.gr> wrote: > O/H drscr...(a)gmail.com Ýãñáøå: > > > > > Georgios Petasis wrote: > >> Hi all, > > >> I have a large hash table, whose keys are words, and the values are > >> dicts, that contain integer pairs. > >> I am creating this structure in memory, taking care to reuse objects > >> as much as possible, with the result occupying ~ 1.3GB of memory. > > > Do you really need all of the data in memory at once? If the > > requirements are flexible and you can work with a relatively large chunk > > but one at a time (like 500MB), then I would suggest sqlite. > > > DrS > > SQLite was something I haven't thought of. > I am giving it now a try. It seems that it is way too slow if I use a > file on disk, but is quite fast if I keep it in memory (using :memory" > as a filename). > > The faster code I could get was a combination of SQL and hash table > lookups in a single place. For some reason, string comparison seem to be > too expensive in time. For this table: > > $database eval { > CREATE TABLE words ( > id INTEGER PRIMARY KEY AUTOINCREMENT, > word TEXT NOT NULL > ); > } > > This query seems to be a bottleneck: > > $database onecolumn "SELECT id FROM words WHERE word='$word'" > > The task has not finished yet (it seems that the code based on sqlite > needs twice the time over the dict approach), but seems to use much less > memory (about half). We will see :-) You need indexes on anything that you want to search on otherwise it must scan the whole table (alternatively make word the primary key).
From: Helmut Giese on 2 Dec 2009 08:46
Hi George, >Again a proposal I didn't think of :-) well, that's why we are here :) >I am not sure I have the courage to test it though, as this would be the >6th implementation from scratch of the same task... Yeah, I understand that. > >I am currently running something with sqlite. I will check timings and >decide... Well, while waiting for the loading and processing to proceed you may want to ponder the following idea: Use a tree to index into <something> (<something> being a DB or e.g. a MetaKit). - The tree would be constructed such that any node identifies a particular character of a word. - The node which corresponds to the end of a word would contain this word's index into the above mentioned <something>. An example: Starting with the word 'mankind' the tree would look like this: m \ a \ n \ k \ i \ n \ d* where the '*' denotes mankind's index. If you now add the word 'man' you wouldn't add any nodes since this word is already "contained" in the tree, but just fill in its index. Result: m \ a \ n* \ k \ i \ n \ d* And adding the word 'manner' would produce m \ a \ n* \ \ k n \ \ i e \ \ n r* \ d* where you see (modulo visual representation of your news reader) a second sub-tree originating off of the first 'n'. The procedure I imagine would be like so: - Taking your set of words you build the tree and populate the <something> at the same time (this gives you the indices). - You save both. (Use the 'tree' package in TclLib) Then - Loading the tree should be relatively fast - it's only the smaller part of the total data. - Traversing the tree in search of a word's corresponding index should be extremely fast. - Handling the <something>:If you can handle it as a memory mapped file it may not be necessary to load it into memory up front but rather access it 'on demand' when locating the entry which corresponds to a particular index. The OS should help in virtualising the memory actually used by the mem mapped file - so this should help with the overall memory footprint. (ISTR that MetaKit supports mem mapped files but I am not sure.) What I would probably do: - Implement the tree. - As said above build the tree and re-build the DB omitting the words and keeping only the corresponding data. -- 2nd thought: You can even leave the DB's structure as is - just ignore the 'word' when retrieving a 'row'. - Use any index you retrieve from the tree to directly access the wanted 'row'. This way you can leave most of your current infrastructure in place - and maybe later replace the (backend) DB with a MetaKit. Sort of sketchy and many details lacking, but maybe thinking about it can fill the time while your current solution is loading :) Good luck Helmut Giese |