From: vanekl on
Raffael Cavallaro wrote:
> On 2010-03-17 04:03:01 -0400, Alex Mizrahi said:
>
>> Emm... People use databases when they have shitloads of records
>
> You do realize that the title of the thread is "lightweight database"
> right?
> I suggested cl-prevalence precisely because it is useful for light
> duty, when the entire dataset fits easily in memory, not because it
> scales to what, to put this in technical parlance, is sometimes
> referred to as "shitloads of records."
>
> warmest regards,
>
> Ralph

"Shitload" has been quantified by a dictionary. The precision leaves a
little to be desired, however.

1. shitload
more than an assload but still less than a fuckton.
[Urban Dictionary]


From: Alex Mizrahi on
??>> Emm... People use databases when they have shitloads of records

RC> You do realize that the title of the thread is "lightweight database"
RC> right?

Tamas has described a very simple database:
"The database consists of timestamp/count pairs, a timestamp is a date&time,
while the count is a single integer."

So I thought that "lightweight" here means that it doesn't need to have lots
of features and shouldn't be complex to install/setup/etc.

That is, its code base should be lightweight, not the number of rows it
supports.

Also he has mentioned: "server which runs the database collects the data
continuously". Unless "continuously" means "once a day", that suggests it
needs to deal with significant amounts of data.

RC> I suggested cl-prevalence precisely because it is useful for light
RC> duty, when the entire dataset fits easily in memory,

You can fit hundreds of millions of records into memory, but going through
all of them on each query might be a bad idea if you're doing more than a
few queries.

RC> not because it scales to what, to put this in technical parlance, is
RC> sometimes referred to as "shitloads of records."

It looks like you fail to understand that each database management system
has many different characteristics, not just the single one you call
"scalability".
Do you think that software needs to be complex to support large quantities
of data?
That isn't true. You can make a very simple piece of software -- a couple
hundred lines of code -- that deals with lots of data with very good
performance.
The trick is that it will be missing some features, like support for complex
queries, transactions, etc. But as long as those features are not in the
requirements, that's OK.
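
To illustrate what I mean (this is just a sketch I'm making up here, not an
existing library), for the timestamp/count case the whole store can be a
sorted vector plus binary search:

(defvar *records* (make-array 0 :adjustable t :fill-pointer 0))

(defun add-record (timestamp count)
  ;; Assumes records arrive in timestamp order, as they would from a logger.
  (vector-push-extend (cons timestamp count) *records*))

(defun lower-bound (timestamp)
  ;; Index of the first stored record whose timestamp is >= TIMESTAMP.
  (let ((lo 0) (hi (fill-pointer *records*)))
    (loop while (< lo hi)
          do (let ((mid (floor (+ lo hi) 2)))
               (if (< (car (aref *records* mid)) timestamp)
                   (setf lo (1+ mid))
                   (setf hi mid))))
    lo))

(defun range-query (start end)
  ;; All records with START <= timestamp < END: O(log N) search, then the matches.
  (subseq *records* (lower-bound start) (lower-bound end)))

That is maybe twenty lines, it handles as many records as fit in memory, and
it obviously has no transactions, no general queries, nothing -- which is
fine when those aren't required.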

Once again, I'm not against in-memory databases; I've been using things like
that myself.

But if the OP explicitly asks about range queries, and cl-prevalence
absolutely does not address them in any way, then cl-prevalence probably
isn't relevant here.

From: Raffael Cavallaro on
On 2010-03-17 13:49:06 -0400, Alex Mizrahi said:

> But if the OP explicitly asks about range queries, and cl-prevalence
> absolutely does not address them in any way, then cl-prevalence probably
> isn't relevant here.

Once again, if *his* code addresses range queries of his own in-memory
data, then he doesn't need a database to do this, and cl-prevalence is
relevant. I find it hard to believe that someone would be working with
any sort of data and not have in place functions to evaluate and query
that data.

You seem to be missing the point of the whole in-memory thing - that
you don't need an extra layer on top of your existing code because the
in-memory data *is* the database. You just need to add transactions (if
you need them) and persistence. cl-prevalence adds both of these.

The principal reason that people have traditionally used databases is
that, as you put it, they have a "shitload of records." Since these
records would not in the past all fit in memory, one had no choice but
to keep them in some sort of mass storage (e.g., disk), and then put in
place a system for querying that mass storage of one's data. If all
your data fits in memory, you can use perfectly ordinary common lisp
arrays, clos objects, structs, hash tables, etc. to hold your data, and
perfectly ordinary common lisp functions and clos generic functions to
"query" it.

Moreover, you gain the benefit of having everything defined in common
lisp, so all of your queries can be arbitrary lisp code, not restricted
to what a particular database query language allows.

warmest regards,

Ralph

--
Raffael Cavallaro

From: Tamas K Papp on
On Wed, 17 Mar 2010 14:58:35 -0400, Raffael Cavallaro wrote:

Hi Raffael,

First, let me thank you for all the posts you have written in this
thread; they were very informative. Since my original spec comes up a
lot, I think I should add some comments.

> On 2010-03-17 13:49:06 -0400, Alex Mizrahi said:
>
>> But if the OP explicitly asks about range queries, and cl-prevalence
>> absolutely does not address them in any way, then cl-prevalence probably
>> isn't relevant here.
>
> Once again, if *his* code addresses range queries of his own in-memory
> data, then he doesn't need a database to do this, and cl-prevalence is
> relevant. I find it hard to believe that someone would be working with
> any sort of data and not have in place functions to evaluate and query
> that data.

You are right, but I am lazy and I want to avoid addressing range
queries etc. in Common Lisp! I was not super explicit about this, but
the workflow will be the following: timestamp/count pairs come in
continuously and are saved into a database by an external program.
This program is unlikely to be written in CL, but there is nothing to
prevent that. But the simpler the better.

Occasionally, I sit down and analyze the data. I want to be able to
make queries like "Give me all the records between dates A and B.",
and then I play around with these. In CL. And the whole thing might
fit into memory.

> You seem to be missing the point of the whole in-memory thing - that you
> don't need an extra layer on top of your existing code because the
> in-memory data *is* the database. You just need to add transactions (if
> you need them) and persistence. cl-prevalence adds both of these.

Having stuff in memory is certainly advantageous, and probably
that's how I will work. But I don't even need persistence! I just
need a single (query start-date end-date) call, which gives me a
vector or similar, which I can then process in CL. No prevalence
needed: I can hide all the database-specific access in a single
function and be done with it, so I can concentrate on the data
analysis.
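
Concretely, something like this is all I have in mind -- a rough sketch
assuming the cl-sqlite binding and a made-up table "samples" with integer
columns "ts" and "cnt":

(defun query (start-date end-date &optional (db-path "/path/to/data.db"))
  ;; Vector of (ts . cnt) conses with START-DATE <= ts < END-DATE.
  (sqlite:with-open-database (db db-path)
    (map 'vector
         (lambda (row) (cons (first row) (second row)))
         (sqlite:execute-to-list
          db
          "SELECT ts, cnt FROM samples WHERE ts >= ? AND ts < ? ORDER BY ts"
          start-date end-date))))

Everything after that call is plain CL on a vector, which is exactly where I
want to spend my effort.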

> The principal reason that people have traditionally used databases is
> that, as you put it, they have a "shitload of records." Since these
> records would not in the past all fit in memory, one had no choice but
> to keep them in some sort of mass storage (e.g., disk), and then put in
> place a system for querying that mass storage of one's data. If all your
> data fits in memory, you can use perfectly ordinary common lisp arrays,
> clos objects, structs, hash tables, etc. to hold your data, and
> perfectly ordinary common lisp functions and clos generic functions to
> "query" it.

For me, keeping everything in CL would not really be a benefit; quite
the opposite: a simple database would allow me to modularize my setup
like this:

( data collection ) => [ database ] => ( analysis, done in CL )

I want to stress that the one central requirement is that the system
is really robust: I can always screw up something in the analysis and
remedy that later, but if data isn't collected, it is lost.

The good thing about sqlite is that

1. the [database] part will use a robust, widely tested solution, and
2. the data collection can be done by a really simple command-line
program (maybe written in C or something, it doesn't matter; a rough
sketch follows below)
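
For illustration, the collector's whole job against the same made-up
"samples" table, shown in CL here although a few lines of C using the
sqlite3 API (or the sqlite3 shell) would do just as well:

(defun record-observation (ts cnt &optional (db-path "/path/to/data.db"))
  ;; Append one timestamp/count pair; sqlite takes care of durability.
  (sqlite:with-open-database (db db-path)
    (sqlite:execute-non-query
     db "INSERT INTO samples (ts, cnt) VALUES (?, ?)" ts cnt)))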

Sure, maybe I could hack together some extensible storage structure in
CL, but I don't really see the point when one is already available.

> Moreover, you gain the benefit of having everything defined in common
> lisp, so all of your queries can be arbitrary lisp code, not restricted
> to what a particular database query language allows.

Certainly. But I don't really need arbitrary queries, just plain
vanilla stuff. And I want it to be really robust (I am not implying that
pure CL solutions aren't, I just have no way of being sure, so I prefer
sqlite).

I am sure that all your points are valid, and in certain
circumstances, your approach would be the way to go. I just wanted to
clarify what my specs are, and why I think that sqlite + CL is the
preferable solution in my case.

Thanks for all the help,

Tamas
From: Thomas A. Russ on
"vanekl" <vanek(a)acd.net> writes:

> Alex Mizrahi wrote:
> >
> > Emm... People use databases when they have shitloads of records, and
> > so O(N) operations are not acceptable.
>
> That type of thinking is only relevant when working with old hardware, and
> is quickly turning into a fiction. All databases will be stored in primary
> memory and machines will commonly work in parallel. It's just a matter of
> time. With this new capability algorithms and database programming will
> become simpler. The database will exist in multiple locations
> simultaneously. O(n) ops will become normal. We're already seeing this

Well, this is a really poor argument. In essence, it says we don't have
to be efficient in the design of our algorithms because the hardware
will save us.

But that isn't really true. The amount of information being stored is
also going up, so N is rapidly becoming larger. So the difference
between an O(N) and an O(log N) algorithm can be really decisive.
Consider the difference between the values when N is 10^6 or 10^9
(using base-2 logarithms):

N = 10^6    log N ~= 20
N = 10^9    log N ~= 30

The difference between 1,000,000,000 and 30 is simply huge! So
something as simple as storing the data in a balanced tree can greatly
speed up the retrieval process. Or, particularly if the data arrive in
order, simply storing them in order and using binary search on the
resulting vector would be great.
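
To put numbers on it at the REPL (base-2 logarithms):

(log 1d6 2)   ; => ~19.9, about 20 probes for a million records
(log 1d9 2)   ; => ~29.9, about 30 for a billion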

Even today, you can easily afford to put a terabyte of disk storage on
your computer, but you would be hard pressed to have even 256
gibibytes of RAM. So having technology that doesn't rely on in-memory
storage is needed if you really want to tackle big problems.

--
Thomas A. Russ, USC/Information Sciences Institute