From: Alexandre Ferrieux on 1 Dec 2009 12:05

On Dec 1, 5:33 pm, Donald Arseneau <a...(a)triumf.ca> wrote:
> On Nov 30, 1:59 pm, Georgios Petasis <peta...(a)iit.demokritos.gr>
> wrote:
>
> > I have a large hash table, whose keys are words, and the values are
> > dicts, that contain integer pairs.
> > I am creating this structure in memory, taking care to reuse objects as
> > much as possible, with the result occupying ~ 1.3GB of memory.
>
> > However, I don't know how to serialise and restore such a large
> > structure. Just using "array get" needs much more memory, and tcl needs
> > more than the 2GB a 32-bit application can use. So, I wrote some code
> > that serialises all elements without requiring conversion to strings.
>
> I've never actually had cause to use them, but this sounds like
> a case for:
>
> array startsearch
> array nextelement
> array anymore
> array donesearch

Ahem, the question is about serialization, not iteration, and reuse (sharing) of values. What does the array iterator have to do with that?

-Alex
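
For reference, the four search commands named above can walk an array one key at a time, without first building the complete list that "array get" returns. The following is only a minimal sketch: the sample data, the helper name dump_array and the one-line-per-word output format are assumptions made for illustration. It covers the iteration part only and, as the reply points out, does nothing for value sharing when the data is loaded back.

```tcl
# Hypothetical sample data standing in for the real array.
array set word_matrix {
    tenjin    {48422 1}
    lightyear {4779 1 29113 2 44221 1}
}

# Write one line per array element: the word followed by its key/value pairs.
# dump_array is an illustrative helper name, not a Tcl built-in.
proc dump_array {arrayName out} {
    upvar 1 $arrayName arr
    set count 0
    set search [array startsearch arr]
    while {[array anymore arr $search]} {
        set word [array nextelement arr $search]
        puts -nonewline $out $word
        # Emit the pairs one by one instead of interpolating the whole dict.
        dict for {k v} $arr($word) {
            puts -nonewline $out " $k $v"
        }
        puts $out ""
        incr count
    }
    array donesearch arr $search
    return $count
}

dump_array word_matrix stdout
```

The array must not gain or lose elements while the search is open, since that invalidates the search.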
From: Georgios Petasis on 1 Dec 2009 13:51

Bruce Hartweg wrote:
> Bruce Hartweg wrote:
>
>>
>> stuff
>
> forgot to add unset VALUE_MAP when loading is complete
> to remove the overhead of the map itself
>
> Bruce
>

This was also my first guess, but it will make loading quite slow. Still, it seems that there is no better alternative if I want to stay at the Tcl level...

Thank you,
George
From: Georgios Petasis on 1 Dec 2009 13:54

Alexandre Ferrieux wrote:
> On Nov 30, 10:59 pm, Georgios Petasis <peta...(a)iit.demokritos.gr>
> wrote:
>> Hi all,
>>
>> I have a large hash table, whose keys are words, and the values are
>> dicts, that contain integer pairs.
>> I am creating this structure in memory, taking care to reuse objects as
>> much as possible, with the result occupying ~ 1.3GB of memory.
>>
>> However, I don't know how to serialise and restore such a large
>
> As a side note, see a few exchanges about a similar project on the tcl
> core archive, titled "Serializing and Mmapping dicts", Feb. 2009.
> Never got the time to do it though, only preliminary analysis on value
> sharing (surprise !)...
>
>
>> structure. Just using "array get" needs much more memory, and tcl needs
>> more than the 2GB a 32-bit application can use. So, I wrote some code
>> that serialises all elements without requiring conversion to strings.
>> The format I chose was as tcl code, to be easy to load it back:
>>
>> set dict [dict create]
>> dict set dict 48422 1
>> set word {tenjin}
>> set word_matrix($word) $dict
>> set dict [dict create]
>> dict set dict 4779 1
>> dict set dict 29113 2
>> dict set dict 44221 1
>> set word {lightyear}
>> set word_matrix($word) $dict
>> set dict [dict create]
>> dict set dict 25399 1
>> set word {salary?}
>> set word_matrix($word) $dict
>> set dict [dict create]
>> dict set dict 366 1
>> dict set dict 819 1
>> dict set dict 1154 2
>> dict set dict 2580 1
>> dict set dict 3164 1
>> dict set dict 3244 2
>> dict set dict 3420 2
>> dict set dict 3833 1
>> ... 313 MB of similar data.
>>
>> However, I cannot load back the data from this file. The problem is that
>> a new object is created for every number in the file, which is memory
>> expensive since there is some repetition.
>>
>> I tried to enclose the data in a proc (hoping that tcl will compile the
>> proc into bytecode internally, and end up reusing the same objects for
>> the same integers), but it didn't work (wish terminated around 1.3 GB
>> with a message of not being able to re-alloc a large memory piece).
>>
>> Any ideas?
>
> OK two possibilities here.
>
> (a) you stick to the text form and arrange for the loader to "intern"
> the strings with an extra load-time hashtable.
>
> (b) you switch to a binary form with explicitly shared values, in
> which case the "interning" happens at write time.
>
> If you choose (a), what about something like this:
>
> proc i x {
>     global int
>     if {[info exists int($x)]} {return $int($x)}
>     set int($x) $x
>     return $x
> }
>
> dict set dict 1234 [i 1]
> dict set dict 6789 [i 1]
> ...
>
> Then the [i 1] above is guaranteed to return a shared value.
> Of course this needs temporary extra memory for the "int" array.
> But it seems hard to avoid it anyway: either it's an explicit
> hashtable of yours, or it's the Tcl interp's table of literals...
> Note that the above uses a global array and not a dict. The reason is
> that with the value semantics of dicts, careful steps must be taken to
> guarantee in-place operations (you don't want to duplicate that big
> beast). In particular, having just one reference to the dict in the
> system is not easy when you're inside a proc body.
>
> For (b), the idea is basically to do the same "projection" at write
> time, and then serialize (as Tclish UTF8+C080) the unique values,
> using offsets as references. When looking up a value, the C code
> generates values of a dedicated Tcl_ObjType that I'd call
> "ROMstrings", containing just the offset in the mmap'ed binary file,
> but able to produce a true string rep on first notice by simple copy
> of the UTF8+C080.
>
> -Alex

Dear Alex,

I saw your postings on the TclCore list. Thanks for the info.
Regarding the two solutions, I am confused about (b). Is this still doable from the Tcl level?

Regards,
George
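
As an illustration of option (a), here is a minimal sketch of what a dump file using the "i" helper could look like when it is sourced. The proc follows the one quoted above (with the closing bracket of the "info exists" test in place); the sample entries are taken from the original post, and the final "unset" follows the suggestion made earlier in the thread to drop the map once loading is complete. Whether this saves enough memory for the real 313 MB file has not been measured here.

```tcl
# Intern helper: return one shared value per distinct string.
proc i x {
    global int
    if {[info exists int($x)]} {return $int($x)}
    set int($x) $x
    return $x
}

# Shape of the generated dump file (illustrative entries only).
set dict [dict create]
dict set dict 48422 [i 1]
set word_matrix(tenjin) $dict

set dict [dict create]
dict set dict 4779 [i 1]
dict set dict 29113 [i 2]
dict set dict 44221 [i 1]
set word_matrix(lightyear) $dict

# Loading is complete: remove the intern table to reclaim its overhead.
# The values already stored in word_matrix remain valid.
unset int
```

Every call to [i 1] returns the value stored by the first call, so all the "1" entries refer to the same value instead of a fresh one per line of the file.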
From: Georgios Petasis on 1 Dec 2009 14:01

drscrypt(a)gmail.com wrote:
> Georgios Petasis wrote:
>> Hi all,
>>
>> I have a large hash table, whose keys are words, and the values are
>> dicts, that contain integer pairs.
>> I am creating this structure in memory, taking care to reuse objects
>> as much as possible, with the result occupying ~ 1.3GB of memory.
>
>
>
> Do you really need all of the data in memory at once? If the
> requirements are flexible and you can work with a relatively large chunk
> but one at a time (like 500MB), then I would suggest sqlite.
>
>
> DrS
>

SQLite was something I hadn't thought of. I am giving it a try now. It seems to be far too slow if I use a file on disk, but quite fast if I keep the database in memory (using ":memory:" as the filename).

The fastest code I could get was a combination of SQL and hash table lookups in a single place. For some reason, string comparison seems to be too expensive in time. For this table:

    $database eval {
        CREATE TABLE words (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            word TEXT NOT NULL
        );
    }

this query seems to be a bottleneck:

    $database onecolumn "SELECT id FROM words WHERE word='$word'"

The task has not finished yet (the sqlite-based code seems to need twice the time of the dict approach), but it seems to use much less memory (about half). We will see :-)

Regards,
George
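
One likely cause of the slow lookup is that the table above has no index on the "word" column, so each SELECT has to scan the table. The sketch below shows one possible variant under that assumption: the index name words_word_idx and the handle name db are made up for illustration, and the sample words come from the original post. It also lets the sqlite3 Tcl interface bind $word as a parameter (braces instead of double quotes), which avoids quoting problems with words that contain an apostrophe.

```tcl
package require sqlite3

sqlite3 db :memory:

db eval {
    CREATE TABLE words (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        word TEXT NOT NULL
    );
    -- Assumed addition: index the column used in the WHERE clause.
    CREATE INDEX words_word_idx ON words(word);
}

# Group the inserts in one transaction; $word is bound by sqlite,
# not substituted into the SQL text.
db transaction {
    foreach word {tenjin lightyear salary?} {
        db eval {INSERT INTO words(word) VALUES($word)}
    }
}

set word lightyear
set id [db onecolumn {SELECT id FROM words WHERE word = $word}]
puts "id of $word: $id"
```

Grouping the inserts in a single transaction may also help with the file-on-disk case, since each standalone INSERT is otherwise committed separately; this has not been timed against the data set discussed here.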
From: Georgios Petasis on 1 Dec 2009 14:05

Georgios Petasis wrote:
> Hi all,
>
> I have a large hash table, whose keys are words, and the values are
> dicts, that contain integer pairs.
> I am creating this structure in memory, taking care to reuse objects as
> much as possible, with the result occupying ~ 1.3GB of memory.
>
> However, I don't know how to serialise and restore such a large
> structure. Just using "array get" needs much more memory, and tcl needs
> more than the 2GB a 32-bit application can use. So, I wrote some code
> that serialises all elements without requiring conversion to strings.
> The format I chose was as tcl code, to be easy to load it back:
>
> set dict [dict create]
> dict set dict 48422 1
> set word {tenjin}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 4779 1
> dict set dict 29113 2
> dict set dict 44221 1
> set word {lightyear}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 25399 1
> set word {salary?}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 366 1
> dict set dict 819 1
> dict set dict 1154 2
> dict set dict 2580 1
> dict set dict 3164 1
> dict set dict 3244 2
> dict set dict 3420 2
> dict set dict 3833 1
> ... 313 MB of similar data.
>
> However, I cannot load back the data from this file. The problem is that
> a new object is created for every number in the file, which is memory
> expensive since there is some repetition.
>
> I tried to enclose the data in a proc (hoping that tcl will compile the
> proc into bytecode internally, and end up reusing the same objects for
> the same integers), but it didn't work (wish terminated around 1.3 GB
> with a message of not being able to re-alloc a large memory piece).
>
> Any ideas?
>
> George

Off-topic, but I want to mention that when wish runs out of memory, the "Fatal Error in Wish" dialog that shows up has no content inside it. Nothing is drawn, not even buttons. However, pressing Return works, and after 1-4 more similar windows, you get the usual Windows crash message.

George