From: drscrypt on 1 Dec 2009 14:19

Georgios Petasis wrote:
> This query seems to be a bottleneck:
>
> $database onecolumn "SELECT id FROM words WHERE word='$word'"
>
> The task has not finished yet (it seems that the code based on sqlite
> needs twice the time over the dict approach), but seems to use much less
> memory (about half). We will see :-)
>

This is one of the things about sqlite. It processes the whole query and
returns it in one chunk. Given your data, you may sometimes face the same
situation as before if you leave out the where clause.

On this particular speed issue, you can try creating an index on the
column (word) and see how it helps.

DrS
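A minimal sketch of the index suggestion, assuming the Tcl sqlite3 package is available and reusing the words table quoted in the thread. The handle name db, the index name words_word_idx and the sample rows are made up for illustration.

```tcl
package require sqlite3

# In-memory database, as tried in the thread.
sqlite3 db :memory:

db eval {
    CREATE TABLE words (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        word TEXT NOT NULL
    );
    -- Hypothetical index name, lets lookups by word avoid a full table scan.
    CREATE INDEX words_word_idx ON words(word);
}

# Sample data, inserted inside a single transaction.
db transaction {
    foreach w {tenjin lightyear salary?} {
        db eval {INSERT INTO words(word) VALUES($w)}
    }
}

set word lightyear
set id [db onecolumn {SELECT id FROM words WHERE word = $word}]
puts "id of $word: $id"

db close
```

Whether the index pays off for a given data set is something to measure; it speeds up lookups at the cost of extra work on every insert.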
From: Robert Heller on 1 Dec 2009 15:21

At Tue, 01 Dec 2009 21:01:15 +0200 Georgios Petasis <petasis(a)iit.demokritos.gr> wrote:

> drscrypt(a)gmail.com wrote:
> > Georgios Petasis wrote:
> >> Hi all,
> >>
> >> I have a large hash table, whose keys are words, and the values are
> >> dicts, that contain integer pairs.
> >> I am creating this structure in memory, taking care to reuse objects
> >> as much as possible, with the result occupying ~ 1.3GB of memory.
> >
> > Do you really need all of the data in memory at once? If the
> > requirements are flexible and you can work with a relatively large chunk
> > but one at a time (like 500MB), then I would suggest sqlite.
> >
> > DrS
>
> SQLite was something I hadn't thought of.
> I am giving it a try now. It seems that it is way too slow if I use a
> file on disk, but is quite fast if I keep it in memory (using ":memory:"
> as a filename).
>
> The fastest code I could get was a combination of SQL and hash table
> lookups in a single place. For some reason, string comparisons seem to be
> too expensive in time. For this table:
>
> $database eval {
>     CREATE TABLE words (
>         id INTEGER PRIMARY KEY AUTOINCREMENT,
>         word TEXT NOT NULL
>     );
> }
>
> This query seems to be a bottleneck:
>
> $database onecolumn "SELECT id FROM words WHERE word='$word'"

Yes, this would do a linear search. Is there some reason for the *id* to
be the PRIMARY KEY and not the word? Are the words unique? What sort of
performance does this table yield:

$database eval {
    CREATE TABLE words (
        id INTEGER AUTOINCREMENT,
        word TEXT NOT NULL UNIQUE PRIMARY KEY
    );
}

> The task has not finished yet (it seems that the code based on sqlite
> needs twice the time over the dict approach), but seems to use much less
> memory (about half). We will see :-)

You need to find the bottlenecks and work out solutions to them.

> Regards,
>
> George

--
Robert Heller -- 978-544-6933
Deepwoods Software -- Download the Model Railroad System
http://www.deepsoft.com/ -- Binaries for Linux and MS-Windows
heller(a)deepsoft.com -- http://www.deepsoft.com/ModelRailroadSystem/
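One way to try the idea of keying the table on the word, sketched under the assumption that the Tcl sqlite3 package is used as elsewhere in the thread. SQLite accepts AUTOINCREMENT only on an INTEGER PRIMARY KEY column, so this variant keeps id as the primary key and declares word UNIQUE instead, which makes SQLite maintain an index on word automatically. The helper name word_id is hypothetical.

```tcl
package require sqlite3
sqlite3 db :memory:

db eval {
    CREATE TABLE words (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        word TEXT NOT NULL UNIQUE
    );
}

# Hypothetical helper: return the id of a word, inserting it on first use.
proc word_id {word} {
    # INSERT OR IGNORE does nothing when the word is already present.
    db eval {INSERT OR IGNORE INTO words(word) VALUES($word)}
    return [db onecolumn {SELECT id FROM words WHERE word = $word}]
}

set a [word_id tenjin]
set b [word_id lightyear]
set c [word_id tenjin]   ;# same word again, so the same id as $a
puts "$a $b $c"

db close
```

This only applies if the words really are unique, which is the question raised above.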
From: Will Duquette on 1 Dec 2009 15:27

On Dec 1, 11:01 am, Georgios Petasis <peta...(a)iit.demokritos.gr> wrote:
>
> $database onecolumn "SELECT id FROM words WHERE word='$word'"
>

Others have already mentioned adding an index, which you'll definitely
want to do. I just wanted to point out that the usual way to write this
query is

$database onecolumn {SELECT id FROM words WHERE word=$word}

SQLite will do the variable interpolation for you, according to SQL rules
rather than Tcl rules, which generally speaking is what you want. Among
other things, it prevents SQL injection attacks/errors. For example, in
your version if $word is some'word you'll get an SQL syntax error.
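A small sketch contrasting the two forms, assuming the Tcl sqlite3 package and a cut-down version of the words table. The handle name db and the variable names id, failed and msg are illustrative only.

```tcl
package require sqlite3
sqlite3 db :memory:
db eval {CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL)}

set word "some'word"
db eval {INSERT INTO words(word) VALUES($word)}

# Braces: SQLite sees the text $word and binds the Tcl variable as a value.
set id [db onecolumn {SELECT id FROM words WHERE word=$word}]

# Double quotes: Tcl substitutes first, so the apostrophe ends the SQL
# string literal early and SQLite reports an error.
set failed [catch {
    db onecolumn "SELECT id FROM words WHERE word='$word'"
} msg]

puts "bound lookup: id=$id"
puts "interpolated lookup failed: $failed ($msg)"

db close
```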
From: Donald Arseneau on 1 Dec 2009 15:40

On Dec 1, 9:05 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> On Dec 1, 5:33 pm, Donald Arseneau <a...(a)triumf.ca> wrote:
> > On Nov 30, 1:59 pm, Georgios Petasis <peta...(a)iit.demokritos.gr>
> > wrote:
> >
> > > However, I don't know how to serialise and restore such a large
> > > structure. Just using "array get" needs much more memory, and tcl needs
> > > more than the 2GB a 32-bit application can use. So, I wrote some code
> > > that serialises all elements without requiring conversion to strings.
> >
> > ... array nextelement ...
>
> Ahem, the question is about serialization, not iteration, and reuse
> (sharing) of values. What does the array iterator have to do with
> that ?

It lets you save the contents of a 1.3GB Tcl array to a file without
overflowing process memory as [array get] would. I was presuming that
"serialise and restore" meant "serialize for writing, and restore from
a file".

I agree that a Tcl array is not ideal for such a big hash table, and
something more like a database is more appropriate.

Donald Arseneau
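A rough sketch of what writing an array out with the search commands might look like, using plain Tcl 8.5 or later. The sample data, the file name word_matrix.txt and the one-record-per-line format are assumptions for illustration, not the format used in the thread.

```tcl
# Hypothetical sample data standing in for the large array.
array set word_matrix {
    tenjin    {48422 1}
    lightyear {4779 1 29113 2 44221 1}
}

# Hypothetical output file in the current directory.
set path [file join [pwd] word_matrix.txt]

# Write one element at a time instead of building one huge [array get] list.
set out [open $path w]
set search [array startsearch word_matrix]
while {[array anymore word_matrix $search]} {
    set word [array nextelement word_matrix $search]
    # One record per line: a two-element Tcl list of word and value.
    puts $out [list $word $word_matrix($word)]
}
array donesearch word_matrix $search
close $out

# Restore: read the file back line by line into a fresh array.
set in [open $path r]
while {[gets $in line] >= 0} {
    lassign $line word value
    set restored($word) $value
}
close $in
file delete $path
```

One caveat: writing a value with puts still produces a string representation for it, which Tcl may keep attached to the value, so this avoids the single huge list from [array get] but not string conversion as such. It also does nothing about sharing of values when the file is read back.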
From: Helmut Giese on 1 Dec 2009 16:39

Hi George,
it could be that MetaKit is your friend. It is not a database (just
"persistent storage"), but it seems to me that you don't really need
true DB capabilities (which for me is the possibility to formulate
complex queries).
Its performance can be quite astonishing and it probably has less of a
memory overhead than a database solution.
You already have it installed - it's part of ActiveState's Tcl.

I haven't used it for a couple of years so I cannot offhand produce
an example, but if you want to check if it fits your needs, there are
probably enough knowledgeable people around here to help you get going.
Good luck
Helmut Giese

>Hi all,
>
>I have a large hash table, whose keys are words, and the values are
>dicts, that contain integer pairs.
>I am creating this structure in memory, taking care to reuse objects as
>much as possible, with the result occupying ~ 1.3GB of memory.
>
>However, I don't know how to serialise and restore such a large
>structure. Just using "array get" needs much more memory, and tcl needs
>more than the 2GB a 32-bit application can use. So, I wrote some code
>that serialises all elements without requiring conversion to strings.
>The format I chose was as tcl code, to be easy to load it back:
>
>set dict [dict create]
>dict set dict 48422 1
>set word {tenjin}
>set word_matrix($word) $dict
>set dict [dict create]
>dict set dict 4779 1
>dict set dict 29113 2
>dict set dict 44221 1
>set word {lightyear}
>set word_matrix($word) $dict
>set dict [dict create]
>dict set dict 25399 1
>set word {salary?}
>set word_matrix($word) $dict
>set dict [dict create]
>dict set dict 366 1
>dict set dict 819 1
>dict set dict 1154 2
>dict set dict 2580 1
>dict set dict 3164 1
>dict set dict 3244 2
>dict set dict 3420 2
>dict set dict 3833 1
>... 313 MB of similar data.
>
>However, I cannot load back the data from this file. The problem is that
>a new object is created for every number in the file, which is memory
>expensive since there is some repetition.
>
>I tried to enclose the data in a proc (hoping that tcl will compile the
>proc into bytecode internally, and end up reusing the same objects for
>the same integers), but it didn't work (wish terminated around 1.3 GB
>with a message of not being able to re-alloc a large memory piece).
>
>Any ideas?
>
>George