From: Georgios Petasis on
Hi all,

I have a large hash table, whose keys are words, and the values are
dicts, that contain integer pairs.
I am creating this structure in memory, taking care to reuse objects as
much as possible, with the result occupying ~ 1.3GB of memory.

However, I don't know how to serialise and restore such a large
structure. Just using "array get" needs much more memory, and tcl needs
more than the 2GB a 32-bit application can use. So, I wrote some code
that serialises all elements without requiring conversion to strings.
The format I chose was Tcl code, to make it easy to load back:

set dict [dict create]
dict set dict 48422 1
set word {tenjin}
set word_matrix($word) $dict
set dict [dict create]
dict set dict 4779 1
dict set dict 29113 2
dict set dict 44221 1
set word {lightyear}
set word_matrix($word) $dict
set dict [dict create]
dict set dict 25399 1
set word {salary?}
set word_matrix($word) $dict
set dict [dict create]
dict set dict 366 1
dict set dict 819 1
dict set dict 1154 2
dict set dict 2580 1
dict set dict 3164 1
dict set dict 3244 2
dict set dict 3420 2
dict set dict 3833 1
.... 313 MB of similar data.

However, I cannot load back the data from this file. The problem is that
a new object is created for every number in the file, which is memory
expensive since there is some repetition.

I tried to enclose the data in a proc (hoping that tcl will compile the
proc into bytecode internally, and end up reusing the same objects for
the same integers), but it didn't work (wish terminated around 1.3 GB
with a message of not being able to re-alloc a large memory piece).

Any ideas?

George
From: Bruce Hartweg on

Georgios Petasis wrote:

> Hi all,
>
> I have a large hash table, whose keys are words, and the values are
> dicts, that contain integer pairs.
> I am creating this structure in memory, taking care to reuse objects as
> much as possible, with the result occupying ~ 1.3GB of memory.
>

is this done in tcl or in C?

> However, I don't know how to serialise and restore such a large
> structure. Just using "array get" needs much more memory, and tcl needs
> more than the 2GB a 32-bit application can use. So, I wrote some code
> that serialises all elements without requiring conversion to strings.
> The format I chose was Tcl code, to make it easy to load back:
>
> set dict [dict create]
> dict set dict 48422 1
> set word {tenjin}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 4779 1
..........
> dict set dict 3833 1
> ... 313 MB of similar data.
>
> However, I cannot load back the data from this file. The problem is that
> a new object is created for every number in the file, which is memory
> expensive since there is some repetition.
>
> I tried to enclose the data in a proc (hoping that tcl will compile the
> proc into bytecode internally, and end up reusing the same objects for
> the same integers), but it didn't work (wish terminated around 1.3 GB
> with a message of not being able to re-alloc a large memory piece).
>
Tcl doesn't automatically find objects that can be shared, so if you want
to make sure all of these are shared, you need to handle that yourself.

> Any ideas?
>

If your original code for building it is in C, I would create a companion
procedure to rebuild it that does all the figuring out of what can be
shared, and have your Tcl script call that proc.


To do it all in Tcl, it would be tricky to make sure all possible refs are
shared. Instead of creating new objects directly, you would have to
maintain another map of objects and keep looking them up to make sure
you are using the shared rep. Then you would not get N copies of
everything, but you would have the overhead of the number map
itself, so you still might get bitten.

something like this..

your save file would contain

set word_matrix(keystr) [myLoader 1234 1 2345 2 847 3 ...]

and your code would define

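proc myLoader {args} {
    set dict [dict create]
    foreach {k v} $args {
        dict set dict [sharedValue $k] [sharedValue $v]
    }
    ;# return the dict; without this the proc returns an empty string
    return $dict
}
set VALUE_MAP(1) 1 ;# init array
proc sharedValue {inVal} {
    global VALUE_MAP
    ;# remember the first object seen for each value, hand it out afterwards
    if {![info exists VALUE_MAP($inVal)]} {
        set VALUE_MAP($inVal) $inVal
    }
    return $VALUE_MAP($inVal)
}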

Note: I didn't run this. It may have typos or errors, and
even if it works syntactically, there are no guarantees that it solves
your gigantic data issue.


Bruce
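
One way to check that a helper like sharedValue really hands back a single shared object is to compare object pointers. The sketch below assumes Tcl 8.6 or later, where the debugging command ::tcl::unsupported::representation is available (it is unsupported and its output format is not guaranteed). objPtr is a hypothetical helper introduced only for this check.

```tcl
# Same idea as sharedValue above: keep the first object seen per value.
proc sharedValue {inVal} {
    global VALUE_MAP
    if {![info exists VALUE_MAP($inVal)]} {
        set VALUE_MAP($inVal) $inVal
    }
    return $VALUE_MAP($inVal)
}

# Hypothetical helper: extract the Tcl_Obj address from the debug output.
# Assumes the 8.6 text format "... object pointer at 0x..., ...".
proc objPtr {value} {
    set rep [::tcl::unsupported::representation $value]
    if {![regexp {object pointer at (\S+),} $rep -> ptr]} {
        error "unexpected representation format: $rep"
    }
    return $ptr
}

# Two equal values built separately at run time.
set a [expr {1233 + 1}]
set b [expr {1230 + 4}]

# After passing through sharedValue, both names refer to one object.
set sa [sharedValue $a]
set sb [sharedValue $b]
puts "same object after sharedValue: [expr {[objPtr $sa] eq [objPtr $sb]}]"
```

This is only a diagnostic aid; it says nothing about total memory use, which also includes the map itself while it exists.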
From: peter.devoil on
I've had a lot of success with netcdf files for long timeseries (i.e.
lists of floats). You can play with the file's dimensions to your
heart's content. And there's a (very old) Tcl wrapper floating around.

Yours,
pdev

> However, I don't know how to serialise and restore such a large
> structure. Just using "array get" needs much more memory, and tcl needs
> more than the 2GB a 32-bit application can use. So, I wrote some code
> that serialises all elements without requiring conversion to strings.
From: Bruce Hartweg on
Bruce Hartweg wrote:

>
> stuff

I forgot to add: unset VALUE_MAP when loading is complete,
to remove the overhead of the map itself.

Bruce
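
Putting the pieces of this suggestion together, the load sequence might look like the sketch below. It is untested against data of the size discussed here. The variable saved is a stand-in for the real save file, which would be read with source instead, and the two sample entries are taken from the original post.

```tcl
proc sharedValue {inVal} {
    global VALUE_MAP
    if {![info exists VALUE_MAP($inVal)]} {
        set VALUE_MAP($inVal) $inVal
    }
    return $VALUE_MAP($inVal)
}

proc myLoader {args} {
    set d [dict create]
    foreach {k v} $args {
        dict set d [sharedValue $k] [sharedValue $v]
    }
    return $d
}

# Stand-in for the contents of the save file (assumption: one line per word).
set saved {
    set word_matrix(tenjin)    [myLoader 48422 1]
    set word_matrix(lightyear) [myLoader 4779 1 29113 2 44221 1]
}

# Load the data; with a real file this would be: source $fileName
eval $saved

# The map is only needed while loading, so drop it afterwards.
unset ::VALUE_MAP

puts [dict get $word_matrix(lightyear) 29113]
```

Unsetting the map releases the array's own hash table and its references; the shared value objects stay alive because the dicts still refer to them.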

From: Alexandre Ferrieux on
On Nov 30, 10:59 pm, Georgios Petasis <peta...(a)iit.demokritos.gr>
wrote:
> Hi all,
>
> I have a large hash table, whose keys are words, and the values are
> dicts, that contain integer pairs.
> I am creating this structure in memory, taking care to reuse objects as
> much as possible, with the result occupying ~ 1.3GB of memory.
>
> However, I don't know how to serialise and restore such a large

As a side note, see a few exchanges about a similar project on the tcl
core archive, titled "Serializing and Mmapping dicts", Feb. 2009.
Never got the time to do it though, only preliminary analysis on value
sharing (surprise !)...


> structure. Just using "array get" needs much more memory, and tcl needs
> more than the 2GB a 32-bit application can use. So, I wrote some code
> that serialises all elements without requiring conversion to strings.
> The format I chose was Tcl code, to make it easy to load back:
>
> set dict [dict create]
> dict set dict 48422 1
> set word {tenjin}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 4779 1
> dict set dict 29113 2
> dict set dict 44221 1
> set word {lightyear}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 25399 1
> set word {salary?}
> set word_matrix($word) $dict
> set dict [dict create]
> dict set dict 366 1
> dict set dict 819 1
> dict set dict 1154 2
> dict set dict 2580 1
> dict set dict 3164 1
> dict set dict 3244 2
> dict set dict 3420 2
> dict set dict 3833 1
> ... 313 MB of similar data.
>
> However, I cannot load back the data from this file. The problem is that
> a new object is created for every number in the file, which is memory
> expensive since there is some repetition.
>
> I tried to enclose the data in a proc (hoping that tcl will compile the
> proc into bytecode internally, and end up reusing the same objects for
> the same integers), but it didn't work (wish terminated around 1.3 GB
> with a message of not being able to re-alloc a large memory piece).
>
> Any ideas?

OK two possibilities here.

(a) you stick to the text form and arrange for the loader to "intern"
the strings with an extra load-time hashtable.

(b) you switch to a binary form with explicitly shared values, in
which case the "interning" happens at write time.

If you choose (a), what about something like this:

proc i x {
    global int
    ;# return the already-stored object if this value was seen before
    if {[info exists int($x)]} {return $int($x)}
    set int($x) $x
    return $x
}

dict set dict 1234 [i 1]
dict set dict 6789 [i 1]
...

Then the [i 1] above is guaranteed to return a shared value.
Of course this needs temporary extra memory for the "int" array.
But it seems hard to avoid it anyway: either it's an explicit
hashtable of yours, or it's the Tcl interp's table of literals...
Note that the above uses a global array and not a dict. The reason is
that with the value semantics of dicts, careful steps must be taken to
guarantee in-place operations (you don't want to duplicate that big
beast). In particular, having just one reference to the dict in the
system is not easy when you're inside a proc body.

For (b), the idea is basically to do the same "projection" at write
time, and then serialize (as Tclish UTF8+C080) the unique values,
using offsets as references. When looking up a value, the C code
generates values of a dedicated Tcl_ObjType that I'd call
"ROMstrings", containing just the offset in the mmap'ed binary file,
but able to produce a true string rep on first notice by simple copy
of the UTF8+C080.

-Alex
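
The write-time interning in (b) can be approximated in pure Tcl, without the mmap'ed binary file or the dedicated Tcl_ObjType described above. The sketch below is only a text-level analogue under that assumption: the writer emits a table of unique values plus index references, and the loader resolves each reference with lindex, which returns the element already stored in the table list. encodeMatrix and decodeMatrix are hypothetical names.

```tcl
# Writer: intern values at write time; return {table rows}.
proc encodeMatrix {arrayName} {
    upvar 1 $arrayName m
    set table {}              ;# unique values, in first-seen order
    set index [dict create]   ;# value -> position in $table
    set rows {}
    foreach word [lsort [array names m]] {
        set refs {}
        dict for {k v} $m($word) {
            foreach x [list $k $v] {
                if {![dict exists $index $x]} {
                    dict set index $x [llength $table]
                    lappend table $x
                }
                lappend refs [dict get $index $x]
            }
        }
        lappend rows $word $refs
    }
    return [list $table $rows]
}

# Loader: every key and value is fetched from the table by offset,
# so equal values come from the same table element.
proc decodeMatrix {encoded arrayName} {
    upvar 1 $arrayName m
    lassign $encoded table rows
    foreach {word refs} $rows {
        set d [dict create]
        foreach {ki vi} $refs {
            dict set d [lindex $table $ki] [lindex $table $vi]
        }
        set m($word) $d
    }
}

# Usage with two entries from the original post.
set src(tenjin)    [dict create 48422 1]
set src(lightyear) [dict create 4779 1 29113 2 44221 1]
set encoded [encodeMatrix src]
decodeMatrix $encoded restored
puts [lindex $encoded 0]
```

The index numbers are themselves temporary objects during loading, so this trades the load-time hashtable of (a) for a larger file-level table; whether that helps at 1.3 GB would need measuring.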