How to serialize/restore a large data structure? [TCL]

Prev: freewrap is awesome!
Next: Tcl and .NET

From: Georgios Petasis on 1 Dec 2009 17:51

O/H drscrypt(a)gmail.com ��:
> Georgios Petasis wrote:
>> This query seems to be a bottleneck:
>>
>> $database onecolumn "SELECT id FROM words WHERE word='$word'"
>>
>> The task has not finished yet (it seems that the code based on sqlite
>> needs twice the time over the dict approach), but seems to use much
>> less memory (about half). We will see :-)
>>
>
>
> This is one of the things about sqlite. It processes the whole query
> and returns it in one chunk. Given your data, you may sometimes face
> the same situation as before if you leave out the where clause.
>
> On this particular speed issue, you can try creating an index on the
> column (word) and see how it helps.
>
>
> DrS

Creating an index reduces the time to half. However, indexing doubled
the database size. But since there is repetition in my data (and calls),
caching results in a tcl hash table makes it much more fast than
indexing, and speed up as more stuff is cached (with acceptable memory
footprint).

So, my current best performing solution is sqlite & caching the searches
that compare words.

Lets see if this combination beats the tcl dict implementation (~6:30
hours for all data).

George

From: Georgios Petasis on 1 Dec 2009 17:54

O/H Robert Heller έγραψε:
> At Tue, 01 Dec 2009 21:01:15 +0200 Georgios Petasis <petasis(a)iit.demokritos.gr> wrote:
>
>> O/H drscrypt(a)gmail.com �
>> סברו:
>>> Georgios Petasis wrote:
>>>> Hi all,
>>>>
>>>> I have a large hash table, whose keys are words, and the values are
>>>> dicts, that contain integer pairs.
>>>> I am creating this structure in memory, taking care to reuse objects
>>>> as much as possible, with the result occupying ~ 1.3GB of memory.
>>>
>>>
>>> Do you really need all of the data in memory at once? If the
>>> requirements are flexible and you can work with a relatively large chunk
>>> but one at a time (like 500MB), then I would suggest sqlite.
>>>
>>>
>>> DrS
>>>
>> SQLite was something I haven't thought of.
>> I am giving it now a try. It seems that it is way too slow if I use a
>> file on disk, but is quite fast if I keep it in memory (using :memory"
>> as a filename).
>>
>> The faster code I could get was a combination of SQL and hash table
>> lookups in a single place. For some reason, string comparison seem to be
>> too expensive in time. For this table:
>>
>> $database eval {
>> CREATE TABLE words (
>> id INTEGER PRIMARY KEY AUTOINCREMENT,
>> word TEXT NOT NULL
>> );
>> }
>>
>> This query seems to be a bottleneck:
>>
>> $database onecolumn "SELECT id FROM words WHERE word='$word'"
>
> Yes, this would do a linear search.
>
> Is there some reason for the *id* to be the PRIMARY KEY and not the
> word? Are the words unique? What sort of performance does this table
> yield:
>
> $database eval {
> CREATE TABLE words (
> id INTEGER AUTOINCREMENT,
> word TEXT NOT NULL UNIQUE PRIMARY KEY
> );
> }
No, there is not. Yes words are unique. Do you think that making "word"
a primary key will yield better results in performance?
(I liked the fact that sqlite "autoincrement" the id, so I didn't have
to handle also this. In reality, both fields can be primary key, as both
are unique...

>
>> The task has not finished yet (it seems that the code based on sqlite
>> needs twice the time over the dict approach), but seems to use much less
>> memory (about half). We will see :-)
>
> You need to find the bottlenecks and work out solutions to them.
>
>> Regards,
>>
>> George
>>
>

From: Georgios Petasis on 1 Dec 2009 17:56

O/H Will Duquette έγραψε:
> On Dec 1, 11:01 am, Georgios Petasis <peta...(a)iit.demokritos.gr>
> wrote:
>> $database onecolumn "SELECT id FROM words WHERE word='$word'"
>>
>
> Others have already mentioned adding an index, which you'll definitely
> want to do. I just wanted to point out that the usual way to write
> this query is
>
> $database onecolumn {SELECT id FROM words WHERE word=$word}
>
> SQLite will do the variable interpolation for you, according to SQL
> rules rather than Tcl rules, which generally speaking is what you
> want. Among other things, it prevents SQL injection attacks/errors.
> For example, in your version if $word is
>
> some'word
>
> you'll get an SQL syntax error.

This is brilliant!! I couldn't imagine about this, so I converted
manually single quotes to '' before the sql statements!

Many thanks,

George

From: Will Duquette on 1 Dec 2009 18:02

On Dec 1, 2:54Â pm, Georgios Petasis <peta...(a)iit.demokritos.gr> wrote:
> O/H Robert Heller ÎÎ³ÏÎ±ÏÎµ:
>
> > At Tue, 01 Dec 2009 21:01:15 +0200 Georgios Petasis <peta...(a)iit.demokritos.gr> wrote:
>
> >> O/H drscr...(a)gmail.com
> >> ×¡××¨×:
> >>> Georgios Petasis wrote:
> >>>> Hi all,
>
> >>>> I have a large hash table, whose keys are words, and the values are
> >>>> dicts, that contain integer pairs.
> >>>> I am creating this structure in memory, taking care to reuse objects
> >>>> as much as possible, with the result occupying ~ 1.3GB of memory.
>
> >>> Do you really need all of the data in memory at once? Â If the
> >>> requirements are flexible and you can work with a relatively large chunk
> >>> but one at a time (like 500MB), then I would suggest sqlite.
>
> >>> DrS
>
> >> SQLite was something I haven't thought of.
> >> I am giving it now a try. It seems that it is way too slow if I use a
> >> file on disk, but is quite fast if I keep it in memory (using :memory"
> >> as a filename).
>
> >> The faster code I could get was a combination of SQL and hash table
> >> lookups in a single place. For some reason, string comparison seem to be
> >> too expensive in time. For this table:
>
> >> $database eval {
> >> Â Â Â Â CREATE TABLE words (
> >> Â Â Â Â Â id Â INTEGER PRIMARY KEY AUTOINCREMENT,
> >> Â Â Â Â Â word TEXT Â Â NOT NULL
> >> Â Â Â Â );
> >> Â Â Â }
>
> >> This query seems to be a bottleneck:
>
> >> $database onecolumn "SELECT id FROM words WHERE word='$word'"
>
> > Yes, this would do a linear search.
>
> > Is there some reason for the *id* to be the PRIMARY KEY and not the
> > word? Â Are the words unique? Â What sort of performance does this table
> > yield:
>
> > $database eval {
> > Â Â CREATE TABLE words (
> > Â Â Â id Â INTEGER AUTOINCREMENT,
> > Â Â Â word TEXT Â Â NOT NULL UNIQUE PRIMARY KEY
> > Â Â );
> > Â Â Â }
>
> No, there is not. Yes words are unique. Do you think that making "word"
> a primary key will yield better results in performance?
> (I liked the fact that sqlite "autoincrement" the id, so I didn't have
> to handle also this. In reality, both fields can be primary key, as both
> are unique...
>
>
>
> >> The task has not finished yet (it seems that the code based on sqlite
> >> needs twice the time over the dict approach), but seems to use much less
> >> memory (about half). We will see :-)
>
> > You need to find the bottlenecks and work out solutions to them.
>
> >> Regards,
>
> >> George

I'd leave the ID as the primary key, and simply declare the word
"UNIQUE". That will do two things:

* The table will be indexed on the "word" column.
* You'll get an error if you try to insert the same word twice.

Making the word the primary key shouldn't speed it up, but leaving the
"id" as the primary key might--since SQLite gives every table an
integer rowid anyway, and if your "id" is the primary key it will
simply be an alias to the predefined rowid.

From: Robert Heller on 1 Dec 2009 18:14

At Wed, 02 Dec 2009 00:54:11 +0200 Georgios Petasis <petasis(a)iit.demokritos.gr> wrote:

>
> O/H Robert Heller έγραψε:
> > At Tue, 01 Dec 2009 21:01:15 +0200 Georgios Petasis <petasis(a)iit.demokritos.gr> wrote:
> >
> >> O/H drscrypt(a)gmail.com �
> >> סברו:
> >>> Georgios Petasis wrote:
> >>>> Hi all,
> >>>>
> >>>> I have a large hash table, whose keys are words, and the values are
> >>>> dicts, that contain integer pairs.
> >>>> I am creating this structure in memory, taking care to reuse objects
> >>>> as much as possible, with the result occupying ~ 1.3GB of memory.
> >>>
> >>>
> >>> Do you really need all of the data in memory at once? If the
> >>> requirements are flexible and you can work with a relatively large chunk
> >>> but one at a time (like 500MB), then I would suggest sqlite.
> >>>
> >>>
> >>> DrS
> >>>
> >> SQLite was something I haven't thought of.
> >> I am giving it now a try. It seems that it is way too slow if I use a
> >> file on disk, but is quite fast if I keep it in memory (using :memory"
> >> as a filename).
> >>
> >> The faster code I could get was a combination of SQL and hash table
> >> lookups in a single place. For some reason, string comparison seem to be
> >> too expensive in time. For this table:
> >>
> >> $database eval {
> >> CREATE TABLE words (
> >> id INTEGER PRIMARY KEY AUTOINCREMENT,
> >> word TEXT NOT NULL
> >> );
> >> }
> >>
> >> This query seems to be a bottleneck:
> >>
> >> $database onecolumn "SELECT id FROM words WHERE word='$word'"
> >
> > Yes, this would do a linear search.
> >
> > Is there some reason for the *id* to be the PRIMARY KEY and not the
> > word? Are the words unique? What sort of performance does this table
> > yield:
> >
> > $database eval {
> > CREATE TABLE words (
> > id INTEGER AUTOINCREMENT,
> > word TEXT NOT NULL UNIQUE PRIMARY KEY
> > );
> > }
> No, there is not. Yes words are unique. Do you think that making "word"
> a primary key will yield better results in performance?
> (I liked the fact that sqlite "autoincrement" the id, so I didn't have
> to handle also this. In reality, both fields can be primary key, as both
> are unique...

Making the word the PRIMARY key and declaring it UNIQUE will
allow/cause the database system to implement things more efficently (eg
use a hash table structure or an ordered table or something) -- basicly
it allows the database system to optimize things for certain classes of
queries, since the UNIQUEness has some implications (the database
system can use a storage method that does not need to store any given
word more than once) and the the fact that it is a PRIMARY KEY implies
that you will be doing things like 'select ... where word=...' and the
database system can take steps to make this faster, etc.

>
> >
> >> The task has not finished yet (it seems that the code based on sqlite
> >> needs twice the time over the dict approach), but seems to use much less
> >> memory (about half). We will see :-)
> >
> > You need to find the bottlenecks and work out solutions to them.
> >
> >> Regards,
> >>
> >> George
> >>
> >
>

--
Robert Heller -- 978-544-6933
Deepwoods Software -- Download the Model Railroad System
http://www.deepsoft.com/ -- Binaries for Linux and MS-Windows
heller(a)deepsoft.com -- http://www.deepsoft.com/ModelRailroadSystem/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7
Prev: freewrap is awesome!
Next: Tcl and .NET