How to serialize/restore a large data structure? [TCL]


From: drscrypt on 1 Dec 2009 14:19

Georgios Petasis wrote:
> This query seems to be a bottleneck:
>
> $database onecolumn "SELECT id FROM words WHERE word='$word'"
>
> The task has not finished yet (it seems that the code based on sqlite
> needs twice the time over the dict approach), but seems to use much less
> memory (about half). We will see :-)
>

This is one of the things about sqlite. It processes the whole query
and returns it in one chunk. Given your data, you may sometimes face
the same situation as before if you leave out the where clause.

On this particular speed issue, you can try creating an index on the
column (word) and see how it helps.
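
A minimal sketch of that suggestion, assuming the Tcl sqlite3 package and an in-memory database. The handle name db, the index name words_word_idx and the sample words are only illustrative; the table is the one quoted in this thread.

```tcl
package require sqlite3

# In-memory database for illustration only.
sqlite3 db :memory:
db eval {
    CREATE TABLE words (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        word TEXT NOT NULL
    );
    -- words_word_idx is a made-up name; any unused name works.
    CREATE INDEX words_word_idx ON words(word);
}

# A few sample rows; $w is bound by SQLite because the SQL is in braces.
foreach w {tenjin lightyear salary?} {
    db eval {INSERT INTO words(word) VALUES($w)}
}

# With the index in place this lookup no longer has to scan the whole table.
set word lightyear
set id [db onecolumn {SELECT id FROM words WHERE word=$word}]
puts "id of $word: $id"
db close
```

The index costs extra space and slows inserts somewhat, so whether it wins overall depends on the ratio of lookups to inserts.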

DrS

From: Robert Heller on 1 Dec 2009 15:21

At Tue, 01 Dec 2009 21:01:15 +0200 Georgios Petasis <petasis(a)iit.demokritos.gr> wrote:

>
> drscrypt(a)gmail.com wrote:
> > Georgios Petasis wrote:
> >> Hi all,
> >>
> >> I have a large hash table, whose keys are words, and the values are
> >> dicts, that contain integer pairs.
> >> I am creating this structure in memory, taking care to reuse objects
> >> as much as possible, with the result occupying ~ 1.3GB of memory.
> >
> >
> >
> > Do you really need all of the data in memory at once? If the
> > requirements are flexible and you can work with a relatively large chunk
> > but one at a time (like 500MB), then I would suggest sqlite.
> >
> >
> > DrS
> >
>
> I hadn't thought of SQLite.
> I am giving it a try now. It seems far too slow if I use a
> file on disk, but quite fast if I keep it in memory (using ":memory:"
> as the filename).
>
> The fastest code I could get was a combination of SQL and hash table
> lookups in a single place. For some reason, string comparison seems to be
> too expensive in time. For this table:
>
> $database eval {
> CREATE TABLE words (
> id INTEGER PRIMARY KEY AUTOINCREMENT,
> word TEXT NOT NULL
> );
> }
>
> This query seems to be a bottleneck:
>
> $database onecolumn "SELECT id FROM words WHERE word='$word'"

Yes, without an index on word this would do a linear search of the table.

Is there some reason for the *id* to be the PRIMARY KEY and not the
word? Are the words unique? What sort of performance does this table
yield:

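$database eval {
CREATE TABLE words (
-- AUTOINCREMENT is only accepted on an INTEGER PRIMARY KEY column,
-- so id is a plain INTEGER here and must be supplied on insert.
id INTEGER,
word TEXT NOT NULL PRIMARY KEY
);
}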

>
> The task has not finished yet (it seems that the code based on sqlite
> needs twice the time over the dict approach), but seems to use much less
> memory (about half). We will see :-)

You need to find the bottlenecks and work out solutions to them.
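
One way to start is Tcl's built-in time command. The sketch below is only an illustration, assuming the sqlite3 package, an in-memory database, and synthetic words (word0, word1, ...) that are not from the original data. It times the inserts and the lookups separately to see which one dominates.

```tcl
package require sqlite3

sqlite3 db :memory:
db eval {
    CREATE TABLE words (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        word TEXT NOT NULL
    )
}

# Time the inserts as a whole (synthetic data, one transaction).
set insertTime [time {
    db transaction {
        for {set i 0} {$i < 5000} {incr i} {
            set w "word$i"
            db eval {INSERT INTO words(word) VALUES($w)}
        }
    }
}]

# Time the lookup that the thread identifies as slow, averaged over 100 runs.
set word word4999
set lookupTime [time {
    db onecolumn {SELECT id FROM words WHERE word=$word}
} 100]

puts "inserts: $insertTime"
puts "lookup:  $lookupTime"
db close
```

Running the same measurement before and after each change (index, different key, file versus memory) shows whether the change actually helped.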

>
> Regards,
>
> George
>

--
Robert Heller -- 978-544-6933
Deepwoods Software -- Download the Model Railroad System
http://www.deepsoft.com/ -- Binaries for Linux and MS-Windows
heller(a)deepsoft.com -- http://www.deepsoft.com/ModelRailroadSystem/

From: Will Duquette on 1 Dec 2009 15:27

On Dec 1, 11:01 am, Georgios Petasis <peta...(a)iit.demokritos.gr>
wrote:
>
> $database onecolumn "SELECT id FROM words WHERE word='$word'"
>

Others have already mentioned adding an index, which you'll definitely
want to do. I just wanted to point out that the usual way to write
this query is

$database onecolumn {SELECT id FROM words WHERE word=$word}

SQLite will do the variable interpolation for you, according to SQL
rules rather than Tcl rules, which generally speaking is what you
want. Among other things, it prevents SQL injection attacks/errors.
For example, in your version if $word is

some'word

you'll get an SQL syntax error.
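
A small sketch of the difference, assuming the sqlite3 package and an in-memory database (the handle name db and the simplified table are illustrative):

```tcl
package require sqlite3

sqlite3 db :memory:
db eval {CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT NOT NULL)}

set word "some'word"
db eval {INSERT INTO words(word) VALUES($word)}

# Braces: Tcl leaves $word alone and SQLite binds it as a value,
# so the apostrophe is harmless.
set bound [db onecolumn {SELECT id FROM words WHERE word=$word}]

# Double quotes: Tcl substitutes first, so SQLite sees
#   ... WHERE word='some'word'
# and rejects the statement.
set failed [catch {
    db onecolumn "SELECT id FROM words WHERE word='$word'"
} msg]

puts "bound lookup returned id $bound"
puts "interpolated lookup failed: $failed"
db close
```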

From: Donald Arseneau on 1 Dec 2009 15:40

On Dec 1, 9:05 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> On Dec 1, 5:33 pm, Donald Arseneau <a...(a)triumf.ca> wrote:
>
>
>
> > On Nov 30, 1:59 pm, Georgios Petasis <peta...(a)iit.demokritos.gr>
> > wrote:
>
> > > However, I don't know how to serialise and restore such a large
> > > structure. Just using "array get" needs much more memory, and tcl needs
> > > more than the 2GB a 32-bit application can use. So, I wrote some code
> > > that serialises all elements without requiring conversion to strings.
>
> > ... array nextelement ...
>
> Ahem, the question is about serialization, not iteration, and reuse
> (sharing) of values. What does the array iterator have to do with
> that ?

It lets you save the contents of a 1.3GB Tcl array to a file without
overflowing process memory as [array get] would. I was presuming that
"serialise and restore" meant "serialize for writing, and restore
from a file".
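
A rough sketch of what I mean, not tested on anything near 1.3GB. The proc names saveArray and loadArray and the file name word_matrix.dump are made up, and the one-line-per-element format assumes keys and values contain no newlines (true for words mapped to dicts of integers).

```tcl
# Write one array element per line as a two-element Tcl list,
# walking the array with a search instead of calling [array get].
proc saveArray {arrayName fileName} {
    upvar 1 $arrayName arr
    set out [open $fileName w]
    set search [array startsearch arr]
    while {[array anymore arr $search]} {
        set key [array nextelement arr $search]
        puts $out [list $key $arr($key)]
    }
    array donesearch arr $search
    close $out
}

# Read the file back one line at a time.
proc loadArray {arrayName fileName} {
    upvar 1 $arrayName arr
    set in [open $fileName r]
    while {[gets $in line] >= 0} {
        lassign $line key value
        set arr($key) $value
    }
    close $in
}

# Tiny usage example with data in the shape described in the thread.
array set word_matrix {
    tenjin    {48422 1}
    lightyear {4779 1 29113 2 44221 1}
}
saveArray word_matrix word_matrix.dump
loadArray restored word_matrix.dump
puts [dict get $restored(lightyear) 29113]
```

Only one element is formatted at a time, so peak memory stays close to the size of the array itself. Note that writing a value generates its string representation, which Tcl may keep cached on the object, and that loading this way does nothing to share repeated integers.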

I agree that a Tcl array is not ideal for such a big hash table, and
something more like a database is more appropriate.

Donald Arseneau

From: Helmut Giese on 1 Dec 2009 16:39

Hi George,
it could be that MetaKit is your friend. It is not a database (just
"persistent storage"), but it seems to me that you don't really need
true DB capabilities (which for me is the possibility to formulate
complex queries).
Its performance can be quite astonishing and it probably has less of a
memory overhead than a database solution.

You already have it installed - it's part of ActiveState's Tcl. I
haven't used it for a couple of years so I cannot offhand produce an
example, but if you want to check if it fits your needs, there are
probably enough knowledgeable people around here to help you get going.

Good luck
Helmut Giese

>Hi all,
>
>I have a large hash table, whose keys are words, and the values are
>dicts, that contain integer pairs.
>I am creating this structure in memory, taking care to reuse objects as
>much as possible, with the result occupying ~ 1.3GB of memory.
>
>However, I don't know how to serialise and restore such a large
>structure. Just using "array get" needs much more memory, and tcl needs
>more than the 2GB a 32-bit application can use. So, I wrote some code
>that serialises all elements without requiring conversion to strings.
>The format I chose was Tcl code, to make it easy to load back:
>
>set dict [dict create]
>dict set dict 48422 1
>set word {tenjin}
>set word_matrix($word) $dict
>set dict [dict create]
>dict set dict 4779 1
>dict set dict 29113 2
>dict set dict 44221 1
>set word {lightyear}
>set word_matrix($word) $dict
>set dict [dict create]
>dict set dict 25399 1
>set word {salary?}
>set word_matrix($word) $dict
>set dict [dict create]
>dict set dict 366 1
>dict set dict 819 1
>dict set dict 1154 2
>dict set dict 2580 1
>dict set dict 3164 1
>dict set dict 3244 2
>dict set dict 3420 2
>dict set dict 3833 1
>... 313 MB of similar data.
>
>However, I cannot load back the data from this file. The problem is that
>a new object is created for every number in the file, which is memory
>expensive since there is some repetition.
>
>I tried to enclose the data in a proc (hoping that tcl will compile the
>proc into bytecode internally, and end up reusing the same objects for
>the same integers), but it didn't work (wish terminated around 1.3 GB
>with a message of not being able to re-alloc a large memory piece).
>
>Any ideas?
>
>George
