Data organization & access procs dilemma [TCL]

Prev: "HUH???" and/or "wiki broken, VERY broken?" was [Re: Plotting3D surfaces]
Next: How to send a virtual event from C++

From: cattaghia on 11 Apr 2010 01:18

I got myself in a big curiosity and only the Tcl masters could give me
the answers.

PROBLEM: The program needs to store in memory a variable-size
collection of dicts, which can themselves have variable sizes, in
spite of having always the same set of keys (the data in each key can
be a string, or a nested dict, or a list, whatever). This collection
is manipulated by access procedures, which by their turn must be able
to search and gather a number of these dicts according to different
criteria. For example, if these dicts have keys like "name" and
"salary", among others, the access procedures must be able to search
those dicts by the values found in any of these keys. Preferably, the
solution would be based in pure and "vanilla" Tcl (i.e., no Tcllibs
and alikes).

QUESTION: According to your experience, what would be the best way to
organize this collection, in terms of speed in access and in easiness
to implement the access routines? Supposing the employees data
previously mentioned, I imagined a number of possible combinations of
data types and Tcl facilities:

1) Having a big dict in which the primary keys would be the employees
names, then subdicts with the rest of data. I presume this would make
access harder and slow if I need to retrieve a number of dicts by one
of the inner dicts keys -- for example, by the "city" key.

2) Having a central dict in which each key would be an index and would
have as value a subdict containing all of each employee's data. Then
arrays named by search criteria and value -- like "City(London)",
"City(Rio)", or "Gender(Female)" -- would hold secondary lists
containing the indexes to subdicts in the central dict which match the
given criteria. It seems complex but fast.

3) Have the dicts containing each employee's data stored in different
dicts, nested according to different values for certain keys. For
example, a dict in which the "City" value would be the primary key,
other dict in which the "Gender" would be the primary key, and so on
as needed. I presume this could allow fast searches, but seems quite
inefficient and a waste of memory.

So, what do you suggest??

Thanks and regards!

Fabricio Rocha
Brasilia, Brasil

From: Andrew Mangogna on 11 Apr 2010 16:35

cattaghia wrote:

> I got myself in a big curiosity and only the Tcl masters could give me
> the answers.
>
> PROBLEM: The program needs to store in memory a variable-size
> collection of dicts, which can themselves have variable sizes, in
> spite of having always the same set of keys (the data in each key can
> be a string, or a nested dict, or a list, whatever). This collection
> is manipulated by access procedures, which by their turn must be able
> to search and gather a number of these dicts according to different
> criteria. For example, if these dicts have keys like "name" and
> "salary", among others, the access procedures must be able to search
> those dicts by the values found in any of these keys. Preferably, the
> solution would be based in pure and "vanilla" Tcl (i.e., no Tcllibs
> and alikes).
>
> QUESTION: According to your experience, what would be the best way to
> organize this collection, in terms of speed in access and in easiness
> to implement the access routines? Supposing the employees data
> previously mentioned, I imagined a number of possible combinations of
> data types and Tcl facilities:
>
> 1) Having a big dict in which the primary keys would be the employees
> names, then subdicts with the rest of data. I presume this would make
> access harder and slow if I need to retrieve a number of dicts by one
> of the inner dicts keys -- for example, by the "city" key.
>
> 2) Having a central dict in which each key would be an index and would
> have as value a subdict containing all of each employee's data. Then
> arrays named by search criteria and value -- like "City(London)",
> "City(Rio)", or "Gender(Female)" -- would hold secondary lists
> containing the indexes to subdicts in the central dict which match the
> given criteria. It seems complex but fast.
>
> 3) Have the dicts containing each employee's data stored in different
> dicts, nested according to different values for certain keys. For
> example, a dict in which the "City" value would be the primary key,
> other dict in which the "Gender" would be the primary key, and so on
> as needed. I presume this could allow fast searches, but seems quite
> inefficient and a waste of memory.
>
> So, what do you suggest??
>
> Thanks and regards!
>
> Fabricio Rocha
> Brasilia, Brasil

The best way, IMHO, to organize any complicated data set is as a normalized
relational schema. In this way you may use some relational management scheme in
which case there are no access routines to write (or at least they are very
shallow wrappers) and you will automatically support any ad hoc query that the
organization of the data set allows. Alternatively, if the application has a
fixed set of operations on the data set that are known in advance, those
operations can be analyzed and a conventional data structuring can be derived
to support those particular operations. It will depend very much on the precise
nature of the schema and operations performed (which were not entirely clear to
me from your question).

If you are intent that the solution be "pure and vanilla Tcl" then the first
alternative is probably not open to you. However, TclRAL and SQLite both
provide a full complement of relational data structuring capabilities and the
requisite relational algebra for the query operations on any data set. TclRAL
provides this in the form of Tcl commands and SQLite does so by providing a
means to execute SQL language statements. Both, however, require a binary
extension to Tcl. Otherwise, it seems that you are favoring an implementation
based on dictionaries, which when nested provide a convenient organization for
things that are related in a one to many fashion, but can be less suitable when
other operations are required.

Andrew Mangogna

|
Pages: 1
Prev: "HUH???" and/or "wiki broken, VERY broken?" was [Re: Plotting3D surfaces]
Next: How to send a virtual event from C++