From: Georgios Petasis on 30 Nov 2009 16:59

Hi all,

I have a large hash table whose keys are words and whose values are dicts
containing integer pairs. I am creating this structure in memory, taking
care to reuse objects as much as possible, with the result occupying
~1.3 GB of memory.

However, I don't know how to serialise and restore such a large structure.
Just using "array get" needs much more memory, and Tcl needs more than the
2 GB a 32-bit application can use. So I wrote some code that serialises all
elements without requiring conversion to strings. The format I chose was
Tcl code, to be easy to load back:

set dict [dict create]
dict set dict 48422 1
set word {tenjin}
set word_matrix($word) $dict
set dict [dict create]
dict set dict 4779 1
dict set dict 29113 2
dict set dict 44221 1
set word {lightyear}
set word_matrix($word) $dict
set dict [dict create]
dict set dict 25399 1
set word {salary?}
set word_matrix($word) $dict
set dict [dict create]
dict set dict 366 1
dict set dict 819 1
dict set dict 1154 2
dict set dict 2580 1
dict set dict 3164 1
dict set dict 3244 2
dict set dict 3420 2
dict set dict 3833 1
.... 313 MB of similar data.

However, I cannot load the data back from this file. The problem is that a
new object is created for every number in the file, which is memory
expensive since there is some repetition.

I tried to enclose the data in a proc (hoping that Tcl would compile the
proc into bytecode internally and end up reusing the same objects for the
same integers), but it didn't work (wish terminated around 1.3 GB with a
message about not being able to re-alloc a large memory piece).

Any ideas?

George
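For reference, a writer producing a dump in this shape might look roughly
like the following untested sketch. It assumes the data lives in a global
array named word_matrix whose values are dicts mapping integer ids to
counts, as described above; the proc name save_word_matrix and its argument
are made up for illustration.

# Untested sketch: emit the dump format shown above, one block per word.
# Assumes a global array "word_matrix" of dicts (id -> count).
proc save_word_matrix {filename} {
    global word_matrix
    set f [open $filename w]
    foreach word [array names word_matrix] {
        puts $f {set dict [dict create]}
        dict for {id count} $word_matrix($word) {
            puts $f [list dict set dict $id $count]
        }
        puts $f [list set word $word]
        puts $f {set word_matrix($word) $dict}
    }
    close $f
}

Loading such a file back with a plain source is what triggers the
per-number object explosion discussed in the replies below.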
From: Bruce Hartweg on 30 Nov 2009 17:42

Georgios Petasis wrote:
> Hi all,
>
> I have a large hash table whose keys are words and whose values are dicts
> containing integer pairs. I am creating this structure in memory, taking
> care to reuse objects as much as possible, with the result occupying
> ~1.3 GB of memory.

is this done in Tcl or in C?

> However, I don't know how to serialise and restore such a large structure.
> Just using "array get" needs much more memory, and Tcl needs more than the
> 2 GB a 32-bit application can use. So I wrote some code that serialises
> all elements without requiring conversion to strings. The format I chose
> was Tcl code, to be easy to load back:
>
> set dict [dict create]
> dict set dict 48422 1
> set word {tenjin}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 4779 1
..........
> dict set dict 3833 1
> ... 313 MB of similar data.
>
> However, I cannot load the data back from this file. The problem is that
> a new object is created for every number in the file, which is memory
> expensive since there is some repetition.
>
> I tried to enclose the data in a proc (hoping that Tcl would compile the
> proc into bytecode internally and end up reusing the same objects for
> the same integers), but it didn't work (wish terminated around 1.3 GB
> with a message about not being able to re-alloc a large memory piece).

Tcl doesn't automatically find objects that can be shared, so if you want
to make sure all of these are shared, you need to handle that yourself.

> Any ideas?

If your original code for building it is in C, I would create companion
procedures that do all the figuring out of what can be shared, and have
your Tcl script call them to rebuild the structure. Doing it all in Tcl
would be tricky: to make sure all possible refs are shared, instead of
creating new objects directly you would have to maintain another map of
objects and keep looking them up to make sure you are using the shared
rep. Then you wouldn't get N copies of everything, but you would have the
overhead of the number map itself, so you still might get bitten.

Something like this: your save file would contain

set word_matrix(keystr) [myLoader 1234 1 2345 2 847 3 ...]

and your code would define

proc myLoader {args} {
    set dict [dict create]
    foreach {k v} $args {
        dict set dict [sharedValue $k] [sharedValue $v]
    }
    return $dict
}

set VALUE_MAP(1) 1 ;# init array
proc sharedValue {inVal} {
    global VALUE_MAP
    if {![info exists VALUE_MAP($inVal)]} {
        set VALUE_MAP($inVal) $inVal
    }
    return $VALUE_MAP($inVal)
}

Note: I didn't run this, so it may have typos or errors, and even if it
works syntactically there is no guarantee that it solves your gigantic
data issue.

Bruce
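A possible write-side counterpart to the loader above is sketched below,
untested; the proc name saveWithLoader and the array name word_matrix are
assumptions. It emits one myLoader call per word in the save-file format
Bruce describes.

# Untested sketch of a writer for the save-file format described above:
# one "set word_matrix(...) [myLoader k v k v ...]" line per word.
# Assumes a global array "word_matrix" whose values are dicts.
proc saveWithLoader {filename} {
    global word_matrix
    set f [open $filename w]
    foreach word [array names word_matrix] {
        set pairs {}
        dict for {k v} $word_matrix($word) {
            lappend pairs $k $v
        }
        # [list ...] keeps unusual characters in the word safe in the output
        puts $f "set [list word_matrix($word)] \[myLoader $pairs\]"
    }
    close $f
}

The resulting file is restored with a plain source, after which VALUE_MAP
can be unset, as noted in the follow-up below.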
From: peter.devoil on 30 Nov 2009 17:51

I've had a lot of success with netCDF files for long time series (i.e.
lists of floats). You can play with the file's dimensions to your heart's
content, and there's a (very old) Tcl wrapper floating around.

Yours, pdev

> However, I don't know how to serialise and restore such a large structure.
> Just using "array get" needs much more memory, and Tcl needs more than the
> 2 GB a 32-bit application can use. So I wrote some code that serialises
> all elements without requiring conversion to strings.
From: Bruce Hartweg on 30 Nov 2009 17:44

Bruce Hartweg wrote:
> > stuff

Forgot to add: unset VALUE_MAP when loading is complete, to remove the
overhead of the map itself.

Bruce
From: Alexandre Ferrieux on 30 Nov 2009 18:45
On Nov 30, 10:59 pm, Georgios Petasis <peta...(a)iit.demokritos.gr> wrote:
> Hi all,
>
> I have a large hash table whose keys are words and whose values are dicts
> containing integer pairs. I am creating this structure in memory, taking
> care to reuse objects as much as possible, with the result occupying
> ~1.3 GB of memory.
>
> However, I don't know how to serialise and restore such a large

As a side note, see a few exchanges about a similar project on the tcl-core
archive, titled "Serializing and Mmapping dicts", Feb. 2009. I never got
the time to do it, though, only some preliminary analysis on value sharing
(surprise!)...

> structure. Just using "array get" needs much more memory, and Tcl needs
> more than the 2 GB a 32-bit application can use. So I wrote some code
> that serialises all elements without requiring conversion to strings.
> The format I chose was Tcl code, to be easy to load back:
>
> set dict [dict create]
> dict set dict 48422 1
> set word {tenjin}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 4779 1
> dict set dict 29113 2
> dict set dict 44221 1
> set word {lightyear}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 25399 1
> set word {salary?}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 366 1
> dict set dict 819 1
> dict set dict 1154 2
> dict set dict 2580 1
> dict set dict 3164 1
> dict set dict 3244 2
> dict set dict 3420 2
> dict set dict 3833 1
> ... 313 MB of similar data.
>
> However, I cannot load the data back from this file. The problem is that
> a new object is created for every number in the file, which is memory
> expensive since there is some repetition.
>
> I tried to enclose the data in a proc (hoping that Tcl would compile the
> proc into bytecode internally and end up reusing the same objects for
> the same integers), but it didn't work (wish terminated around 1.3 GB
> with a message about not being able to re-alloc a large memory piece).
>
> Any ideas?

OK, two possibilities here:

(a) you stick to the text form and arrange for the loader to "intern" the
strings with an extra load-time hashtable;

(b) you switch to a binary form with explicitly shared values, in which
case the "interning" happens at write time.

If you choose (a), what about something like this:

proc i x {
    global int
    if {[info exists int($x)]} {return $int($x)}
    set int($x) $x
    return $x
}

dict set dict 1234 [i 1]
dict set dict 6789 [i 1]
...

Then the [i 1] above is guaranteed to return a shared value. Of course this
needs temporary extra memory for the "int" array, but that seems hard to
avoid anyway: either it's an explicit hashtable of yours, or it's the Tcl
interp's table of literals...

Note that the above uses a global array and not a dict. The reason is that
with the value semantics of dicts, careful steps must be taken to guarantee
in-place operations (you don't want to duplicate that big beast). In
particular, having just one reference to the dict in the system is not easy
when you're inside a proc body.

For (b), the idea is basically to do the same "projection" at write time
and then serialize (as Tclish UTF8+C080) the unique values, using offsets
as references. When looking up a value, the C code generates values of a
dedicated Tcl_ObjType that I'd call "ROMstrings", containing just the
offset into the mmap'ed binary file, but able to produce a true string rep
on first use by simply copying the UTF8+C080 bytes.

-Alex
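For completeness, here is a write-side sketch of variant (a), untested and
with the same assumed names as the earlier sketches: it differs from the
plain dump only in routing each count through [i ...], so that sourcing the
file reuses one shared object per distinct integer.

# Untested sketch: same dump format as before, but each count is wrapped
# in [i ...] so the loader interns repeated integers. Assumes a global
# array "word_matrix" of dicts; the proc name is made up.
proc save_interned {filename} {
    global word_matrix
    set f [open $filename w]
    foreach word [array names word_matrix] {
        puts $f {set dict [dict create]}
        dict for {id count} $word_matrix($word) {
            # the ids could be wrapped in [i ...] too if they also repeat
            puts $f "dict set dict $id \[i $count\]"
        }
        puts $f [list set word $word]
        puts $f {set word_matrix($word) $dict}
    }
    close $f
}

After defining proc i and sourcing such a file, the temporary int array can
be unset to give back the memory held by the interning table.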