From: vanekl on
Andrew Poelstra wrote:
> On 2010-03-17, vanekl <vanek(a)acd.net> wrote:
>> Andrew Poelstra wrote:
>>>
>>> To suggest otherwise - specifically, to suggest that an O(n)
>>> algorithm is preferable to an O(1) one because it is "simpler to
>>> use", is foolhardy and dangerous.
>>
>> Unsupported.
>>
>
> Proper abstractions mean that good algorithms and bad algorithms
> will look the same from a programmer's (or user's) perspective.
> You can have brilliantly fast code with a horrible interface and
> horrible O(n^4) or whatever code with a clean interface.
>
> In other words, the two are unrelated.

And I never suggested otherwise. I said a simple underlying architecture can
be leveraged to solve problems that current databases are having difficulty
with. And as an example, I said that Lisp is based on the single-linked list
(including the language itself!), and due to this one primary constraint the
language can be highly leveraged. The simple db architecture I am advocating
is full-table-scan via map/reduce/merge. The advantages are 1) minimal
schema, 2) easier replication, 3) easier db updates, 4) easier scalability.
Of course there are trade-offs, one of them being data consistency being
more of a challenge. If you've spelunked through a large number of
databases, however, you'll see there's a lot of sewage being stored. This
feature -- while academically useful -- is routinely ignored in the real
world.


> Choosing a higher-order algorithm, however, restricts the amount
> (and type of) data you can feed into a program.

Again, you make broad, unsupported assertions that I, frankly, find
unbelievable.


> It makes things
> slower, and slowness restricts portability (to embedded devices,
> netbooks, phones, virtual machines, "green" computers, etc, etc)

I'm not sure why you are going off on this tangent.


> and can also cause other bugs - exposing race conditions, timeouts,
> and watchdog resets.

On the contrary, simple, robust algorithms (such as map/reduce/merge) tend
to reduce the number of bugs. The exact opposite conclusion of what you
advocate.


> Not to mention the user frustration and corresponding financial
> loss. Plus /his/ electrical and cooling costs.

Yes! As I already mentioned.


>>> Needing to replace your algorithm every time your data scales is
>>> pretty damn costly, too. Not to mention that doing this once negates
>>> any simplicity gained from starting out with a retarded algorithm.
>>
>> I don't understand how you can assert that the algorithm must be
>> revised every time the data grows. My position is that if you choose
>> an inherently scalable algorithm the algorithm remains relatively
>> stable -- the hardware changes, not so much the software.
>>
>
> That's not your position. You claimed that an O(n) and O(log n)
> are interchangable since computing power will increase. This is
> the /opposite/ of scalability.

So you are claiming that an O(n) algorithm is impossible to scale? Even with
Moore's law in effect? That if you choose a parallel algorithm that can run
on separate machines that plugging in additional boxes is simply
unreasonable? Sorry, I don't buy it. You failed to convince me.

My position is that if you /can/ handle a database load at time t_1 using an
O(n) algorithm, you will be able to handle a database load at time t_2 using
the same algorithm assuming you choose a scalable one (as previously
discussed), simply because hardware is becoming affordable at roughly the
same rate that information is growing.


>> Moore's law is making hardware affordable as fast as data is
>> growing. Sure there is a battle going on between the two, but
>> Moore's law is keeping up just fine, thank you very much.
>>
>
> No, it isn't. Clock speed maxed out ten years ago.

Clock speed? You want to pin your argument on clock speed when the number of
transistors continues to increase exponentially? Look back at what I said. I
said the scalable algorithm would have to rely upon being parallelized,
something that clock speed has little to do with.


> The dominant
> trends now are memory size, small form factors, low power usage
> and parallel computing (which does /not/ magically make poorly-
> chosen algorithms smart).

Not magically, no. Thank God there's one thing we can agree on. But this
technique does both parallel and compliment the growth in data well.


>>
>>> Check out
>>> http://projecteuler.net/index.php?section=problems&id=67
>>>
>>> and let me know what kind of hardware you needed in order to avoid
>>> using an intelligent algorithm.
>>
>> So your answer is to change the problem domain from databases to
>> NP-Complete problems? And you think that's fair?
>
> Yes, I do. Databases must respond to arbitrarily-complex queries,
> and this in general is an NP-complete problem.

Yikes. You're way off. The worst-case scenario is O(m.n),
where:
m is the number of records in the database, and
n is the number of joins.
I'm sure that if I gave this some thought I could do better.


> You can make a
> brilliantly-fast SELECT statement, and maybe you can do it with
> only linked lists, but good luck joining tables and filtering on
> all sorts of data in all sorts of ways, sorting and storing the
> results as you go.

Good luck? That's what map/reduce/merge does now!


>> I made no claim that brute-force is the
>> preferred algorithm in all situations and all domains. Come up with a
>> stronger argument and I'll defend my position. I would advise not
>> reading your biases into what I have said.

Look, I don't think we are getting anywhere. I've said everything that I
think needs being said, and you have not been very convincing to me, so I
hope you don't mind if we stop this conversation.


From: Rob Warnock on
Thomas A. Russ <tar(a)sevak.isi.edu> wrote:
+---------------
| Even today, you can easily afford to put a terabyte of disk storage
| on your computer, but you would be sorely pressed to have even 256
| gibibytes of RAM.
+---------------

Unless your[1] computer is, say, an SGI Altix[2], which supports
off-the-shelf up to 16 TiB of main RAM [global, cache-coherent,
single system image Linux] with up to 2048 Xeon-64 cores (256 sockets). ;-}

But your main point is still valid: With surveillance [ELINT],
oil & gas, space imagery, multimedia, and other large datasets
becoming so common [some customers at a PPoE routinely threw
around numbers in the large PB range!], even 16 TiB of RAM isn't
"enough" for some apps.


-Rob

[1] For organizationally-appropriate values of "your". ;-} ;-}

[2] Yes, they still make/sell them! See:

http://www.sgi.com/products/servers/altix/uv/

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

From: Tim X on
Tamas K Papp <tkpapp(a)gmail.com> writes:

> On Wed, 17 Mar 2010 14:58:35 -0400, Raffael Cavallaro wrote:
>
> Hi Raffael,
>
> First, let me thank you for all the posts you have written in this
> thread, they were very informative. Since my original spec comes up a
> lot, I think I should add some comments.
>
>> On 2010-03-17 13:49:06 -0400, Alex Mizrahi said:
>>
>>> But if OP explicitly asks about range queries, and cl-prevalence
>>> absolutely does not address them in any way, that means that probably
>>> cl-prevalence isn't relevant here.
>>
>> Once again, if *his* code addresses range queries of his own in-memory
>> data, then he doesn't need a database to do this, and cl-prevalence is
>> relevant. I find it hard to belive that someone would be working with
>> any sort of data and not have in place functions to evaluate and query
>> that data.
>
> You are right, but I am lazy and I want to avoid addressing range
> queries etc in Common Lisp! I was not super explicit about this, but
> the workflow will be the following: a timestamp/count pairs come in
> continuously, they are saved into a database by an external program.
> This program is unlikely to be written in CL, but there is nothing to
> prevent that. But the simpler the better.
>
> Occassionally, I sit down and analyze the data. I want to be able to
> make queries like "Give me all the records between dates A and B.",
> and then I play around with these. In CL. And the whole thing might
> fit into memory.
>
>> You seem to be missing the point of the whole in-memory thing - that you
>> don't need an extra layer on top of your existing code because the
>> in-memory data *is* the database. You just need to add transactions (if
>> you need them) and persistence. cl-prevalence adds both of these.
>
> Having stuff in memory is certainly advantageous, and probably
> that's how I will work. But I don't even need persistence! I just
> need a single (query start-date end-date) call, which gives me a
> vector or similar, which I can then process in CL. No prevalence
> needed: I can hide all the database-specific access in a single
> function and be done with it, so I can concentrate on the data
> analysis.
>
>> The principal reason that people have traditionally used databases is
>> that, as you put it, they have a "shitload of records." Since these
>> records would not in the past all fit in memory, one had no choice but
>> to keep them in some sort of mass storage (e.g., disk), and then put in
>> place a system for querying that mass storage of one's data. If all your
>> data fits in memory, you can use perfectly ordinary common lisp arrays,
>> clos objects, structs, hash tables, etc. to hold your data, and
>> perfectly ordinary common lisp functions and clos generic functions to
>> "query" it.
>
> For me, keeping everything in CL would not really be a benefit; quite
> the opposite: a simple database would allow me to modularize my setup
> like this:
>
> ( data collection ) => [ database ] => ( analysis, done in CL )
>
> I want to stress that the one central requirement is that the system
> is really robust: I can always screw up something in the analysis and
> remedy that later, but if data isn't collected, it is lost.
>
> The good thing about sqlite is that
>
> 1. the [database] part will use a robust, widely tested solution, and
> 2. the data collection can be done by a really simple command-line
> program (maybe written in C or something, it doesn't matter)
>
> Sure, maybe I could hack together some extensible storage structure in
> CL, but I don't really see the point when one is already available.
>
>> Moreover, you gain the benefit of having everything defined in common
>> lisp, so all of your queries can be arbitrary lisp code, not restricted
>> to what a particular database query language allows.
>
> Certainly. But I don't really need arbitrary queries, just plain
> vanilla stuff. And I want it to be really robust (I am not implying that
> pure CL solutions aren't, I just have no way of being sure, so I prefer
> sqlite).
>
> I am sure that all your points are valid, and in certain
> circumstances, your approach would be the way to go. I just wanted to
> clarify what my specs are, and why I think that sqlite + CL is the
> preferable solution in my case.
>

Tamas,

based on your additional explination of the requirements, I wonder if
you really need a database at all!

Your application is sounding more and more like a typical log processing
app and what you really need is an efficient way to parse log records. A
simple database (and I mean simple) might be useful for tracking past
analysis/summary of the data, but I don't htink you need a database for
the raw data itself.

You mentioned that it needs to be reliable. Just logging your data to a
file is likely to be more reliable than even sqlite. You could even use
things like logrotate to rotate the log files after every x
days/hours/minutes or after the file reaches y bytes in size. This also
eliminates the need to find a method that works for both the data
generating program (which you mentioned may not be CL) and the program
that will process the data. Depending on how rapidly new data items are
generated, existing logging approaches, such as syslog/rsyslog might be
useful and could take care of many of the smaller,mostly uninteresting
but important details associated with logging data.

Then, all you need is some functions to parse your log file to create
whatever type of data structure abstraction suits your analysis
requirements. No need for SQL, transactions, persistance or anyting
else.

If necessary, you can even have a parse and summarise stage, which could
store summary data for later processing etc.
>
just another 2 cents worth.....

Tim


--
tcross (at) rapttech dot com dot au
From: Nicolas Neuss on
"vanekl" <vanek(a)acd.net> writes:

> Alex Mizrahi wrote:
>> a bunch of stuff
>
> You have been added to my kill file. If I don't respond, you know
> why. I suggest you do likewise.

You should better kiss his feet for giving you that much time. Please
reread everything he has said, until you understand it. And then come
back and repent.

Nicolas
From: Raymond Wiker on
"vanekl" <vanek(a)acd.net> writes:

> Yikes. You're way off. The worst-case scenario is O(m.n),
> where:
> m is the number of records in the database, and
> n is the number of joins.
> I'm sure that if I gave this some thought I could do better.

O(m^n), surely?

Good luck getting that to scale linearly.

map/reduce works because you don't have joins... the "table" is
effectively a completely de-normalized view of your data.

Also (and this should have been mentioned earlier): quoting
Moore's law is fine as long as you are talking about sequential
processing. When you switch to parallel processing, you should also
consider Amdahl's law.