From: Thomas A. Russ on
"vanekl" <vanek(a)acd.net> writes:

> (defun between (i lower-bound upper-bound)
>   "Is number 'i' between the two bounds (inclusive)?"
>   (and (<= lower-bound i)
>        (<= i upper-bound)))

Silly to define a function for this, since it is trivially written
in-line as

(<= lower-bound i upper-bound)

since CL allows more than two arguments to the comparison functions.
It only gets tricky if you want to mix inclusive and exclusive bounds.
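For example, a half-open check (inclusive lower bound, exclusive upper
bound) can no longer be written as a single chained comparison; a quick
sketch (the name is only illustrative) would be something like

(defun between-half-open (i lower-bound upper-bound)
  ;; Inclusive lower bound, exclusive upper bound.
  (and (<= lower-bound i)
       (< i upper-bound)))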

> (let* ((db ({} 1 "one" 2 "two" 3 "three" 4 "four" 5 "five"))
>        search-results)
>
>   (time
>    (dotimes (idx 10000)
>      (setq search-results
>            (iter (for (key val) in-hashtable db)
>                  (when (between key 2 3)
>                    (collect key))))))
>
>   (iter (for answer in search-results)
>         (print answer)))
>
> Evaluation took:
>   0.010 seconds of real time
>   0.010000 seconds of total run time (0.010000 user, 0.000000 system)
>   100.00% CPU
>   11,524,904 processor cycles
>   159,744 bytes consed

What happens when you create your DB with something like

(dotimes (i 1000000)
  (setf (gethash i db) (format nil "~R" i)))

That would start to take a bunch more time, I would think.

--
Thomas A. Russ, USC/Information Sciences Institute
From: vanekl on
Thomas A. Russ wrote:
> "vanekl" <vanek(a)acd.net> writes:
>
>> (defun between (i lower-bound upper-bound)
>>   "Is number 'i' between the two bounds (inclusive)?"
>>   (and (<= lower-bound i)
>>        (<= i upper-bound)))
>
> Silly to define a function for this, since it is trivially written
> in-line as
>
> (<= lower-bound i upper-bound)

I disagree. The 'between function can easily be revised to work as a generic
function and thus handle multiple data types. Adding abstractions makes for
a more robust program.
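A sketch of that revision (illustrative only; it replaces the plain DEFUN,
and the string method just picks lexicographic, case-sensitive ordering):

(defgeneric between (i lower-bound upper-bound)
  (:documentation "Is I between the two bounds (inclusive)?"))

(defmethod between ((i number) (lower-bound number) (upper-bound number))
  (<= lower-bound i upper-bound))

(defmethod between ((i string) (lower-bound string) (upper-bound string))
  ;; STRING<= returns a generalized boolean, so it works directly as a test.
  (and (string<= lower-bound i)
       (string<= i upper-bound)))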



> since CL allows more than two arguments to the comparison functions.
> It only gets tricky if you want to mix inclusive and exclusive bounds.
>
>> (let* ((db ({} 1 "one" 2 "two" 3 "three" 4 "four" 5 "five"))
>>        search-results)
>>
>>   (time
>>    (dotimes (idx 10000)
>>      (setq search-results
>>            (iter (for (key val) in-hashtable db)
>>                  (when (between key 2 3)
>>                    (collect key))))))
>>
>>   (iter (for answer in search-results)
>>         (print answer)))
>>
>> Evaluation took:
>>   0.010 seconds of real time
>>   0.010000 seconds of total run time (0.010000 user, 0.000000 system)
>>   100.00% CPU
>>   11,524,904 processor cycles
>>   159,744 bytes consed
>
> What happens when you create your DB with something like
>
> (dotimes (i 1000000)
>   (setf (gethash i db) (format nil "~R" i)))
>
> That would start to take a bunch more time, I would think.

Yes, unless you changed the algorithm from map-hash/reduce to
map-hash/reduce/combine, i.e., using multiple processors in parallel. The
point I was trying to make is that in-memory databases are so much faster
than disk-based databases that you can get away with less efficient
algorithms.
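As a rough sketch of what the combine step could look like (assuming the
LPARALLEL library is loaded; the function name and the chunking scheme are
illustrative, not benchmarked):

(setf lparallel:*kernel* (lparallel:make-kernel 4))

(defun parallel-range-query (db lower upper &key (chunks 4))
  "Split DB's keys into CHUNKS pieces, scan each piece in parallel,
then append (combine) the partial results."
  (let* ((keys (loop for k being the hash-keys of db collect k))
         (size (max 1 (ceiling (length keys) chunks)))
         (parts (loop for start from 0 below (length keys) by size
                      collect (subseq keys start
                                      (min (length keys) (+ start size))))))
    (reduce #'append
            (lparallel:pmapcar
             (lambda (part)
               (remove-if-not (lambda (k) (<= lower k upper)) part))
             parts))))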


From: vanekl on
Thomas A. Russ wrote:
> "vanekl" <vanek(a)acd.net> writes:
>
>> Alex Mizrahi wrote:
>>>
>>> Emm... People use databases when they have shitloads of records, and
>>> so O(N) operations are not acceptable.
>>
>> That type of thinking is only relevant when working with old
>> hardware, and is quickly turning into a fiction. All databases will
>> be stored in primary memory and machines will commonly work in
>> parallel. It's just a matter of time. With this new capability
>> algorithms and database programming will become simpler. The
>> database will exist in multiple locations simultaneously. O(n) ops
>> will become normal. We're already seeing this
>
> Well, this is a really poor argument. In essence, it says we don't
> have to be efficient in the design of our algorithms because the
> hardware will save us.

Yes, that is essentially where the future is going. It's not going to "save"
us as much as change the style of computing. Instead of the bottleneck being
disk, it will be the energy and heat generated by the RAM storing the database.
(I'm using the word "bottleneck" loosely, but it will be a major
constraint.)


> But that isn't really true. The amount of information being stored is
> also going up, so N is rapidly becoming larger.

I don't doubt that the amount of information in the world is increasing, but
I take issue with your statement that it is "rapidly becoming larger" at
typical corporate sites, simply based on my experience at a number of
installations. The size of the data is not the problem; there are about a
dozen other problems that eclipse it.


> So the difference
> between an O(N) and an O(log N) algorithm can be really decisive.
> Consider the difference between the values when N is 10^6 or 10^9
>
> N = 10^6   log N ~= 20
> N = 10^9   log N ~= 30
>
> The difference between 1,000,000,000 and 30 is simply huge! So
> something as simple as storing the data in a balanced tree can greatly
> speed up the retrieval process. Or, particularly if the data arrive
> in order, simply storing it in order and using binary search on the
> resulting vector would be great.

Too bad Google didn't take your advice.

I'm not arguing that there are more efficient algorithms. I agree with your
point that there are mechanisms that allow us to structure the data in
complicated ways to make data retrieval optimal. I'm saying those days are
slowly going away.
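For reference, the sorted-vector mechanism is only a few lines of CL; a
quick sketch, not tuned:

(defun bsearch (key vec)
  "Return the position of KEY in the sorted simple-vector VEC, or NIL."
  (do ((lo 0)
       (hi (1- (length vec))))
      ((> lo hi) nil)
    (let ((mid (floor (+ lo hi) 2)))
      (cond ((= (svref vec mid) key) (return mid))
            ((< (svref vec mid) key) (setf lo (1+ mid)))
            (t (setf hi (1- mid)))))))

;; e.g. (bsearch 5 (coerce (sort (list 9 5 1 7) #'<) 'simple-vector)) => 1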

There are advantages to using a simplified data structure. Why is Lisp
timeless? It is built on a simple, ubiquitous data structure that is
leveraged to the hilt. You can apply that same concept to databases, and
people are.


> Even today, you can easily afford to put a terabyte of disk storage on
> your computer, but you would be sorely pressed to have even 256
> gibibytes of RAM.

Do you know what the typical corporate database size is? I think you would
be surprised. One of my gigs involved me porting the corporate database that
the _entire_ company based all their work on. Today, storing this database
would cost about $750 for the RAM. Believe me, the long pole in the tent is
not going to be hardware, it's going to be the people and the software using
the hardware. Even with such a small database the DBA still couldn't get the
database to work. What was the problem? Not the hardware. The database
software was so complex that he didn't set it up right. There were just too
many knobs, buttons, and levers, and it took more than a month to fix this
one problem. A complicated underlying structure requires more complicated
control and set-up, and makes changes difficult and costly.

> So having technology that doesn't rely on in-memory
> storage is needed if you really want to tackle big problems.

Most corporate problems (where my interest lies) are not terabyte-sized.
Most problems walk on two legs, or involve overly complex, brittle software
that calcifies the database.


From: vanekl on
Alex Mizrahi wrote:
> a bunch of stuff

You have been added to my kill file. If I don't respond, you know why. I
suggest you do likewise.


From: Andrew Poelstra on
On 2010-03-17, vanekl <vanek(a)acd.net> wrote:
> Andrew Poelstra wrote:
>>
>> To suggest otherwise - specifically, to suggest that an O(n) algorithm
>> is preferable to an O(1) one because it is "simpler to use", is
>> foolhardy and dangerous.
>
> Unsupported.
>

Proper abstractions mean that good algorithms and bad algorithms
will look the same from a programmer's (or user's) perspective.
You can have brilliantly fast code with a horrible interface and
horrible O(n^4) or whatever code with a clean interface.

In other words, the two are unrelated.
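As a trivial CL illustration (made-up names): both of these present the
same lookup interface to the caller, but one is O(n) and the other is
roughly O(1).

(defun find-record/list (key records)
  "O(n): linear scan of an alist of (key . value) pairs."
  (cdr (assoc key records :test #'equal)))

(defun find-record/hash (key table)
  "Roughly O(1): hash-table lookup."
  (gethash key table))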

Choosing a higher-complexity algorithm, however, restricts the amount
(and type of) data you can feed into a program. It makes things
slower, and slowness restricts portability (to embedded devices,
netbooks, phones, virtual machines, "green" computers, etc, etc)
and can also cause other bugs - exposing race conditions, timeouts,
and watchdog resets.

Not to mention the user frustration and corresponding financial
loss. Plus /his/ electrical and cooling costs.

>
>> Needing to replace your algorithm every time your data scales is
>> pretty damn costly, too. Not to mention that doing this once negates
>> any simplicity gained from starting out with a retarded algorithm.
>
> I don't understand how you can assert that the algorithm must be revised
> every time the data grows. My position is that if you choose an inherently
> scalable algorithm the algorithm remains relatively stable -- the hardware
> changes, not so much the software.
>

That's not your position. You claimed that O(n) and O(log n) algorithms
are interchangeable because computing power will increase. This is
the /opposite/ of scalability.

> Moore's law is making hardware affordable as fast as data is growing. Sure
> there is a battle going on between the two, but Moore's law is keeping up
> just fine, thank you very much.
>

No, it isn't. Clock speed maxed out ten years ago. The dominant
trends now are memory size, small form factors, low power usage
and parallel computing (which does /not/ magically make poorly-
chosen algorithms smart).

>
>> Check out
>> http://projecteuler.net/index.php?section=problems&id=67
>>
>> and let me know what kind of hardware you needed in order to avoid
>> using an intelligent algorithm.
>
> So your answer is to change the problem domain from databases to NP-Complete
> problems? And you think that's fair?

Yes, I do. Databases must respond to arbitrarily complex queries,
and this in general is an NP-complete problem. You can make a
brilliantly-fast SELECT statement, and maybe you can do it with
only linked lists, but good luck joining tables and filtering on
all sorts of data in all sorts of ways, sorting and storing the
results as you go.

> I made no claim that brute-force is the
> preferred algorithm in all situations and all domains. Come up with a
> stronger argument and I'll defend my position. I would advise not reading
> your biases into what I have said.
>

--
Andrew Poelstra
http://www.wpsoftware.net/andrew