From: vanekl on
Raymond Wiker wrote:
> "vanekl" <vanek(a)acd.net> writes:
>
>> Yikes. You're way off. The worst-case scenario is O(m.n),
>> where:
>> m is the number of records in the database, and
>> n is the number of joins.
>> I'm sure that if I gave this some thought I could do better.
>
> O(m^n), surely?

God no, not if ^ represents the exponential operator. Why did you add an
exponent? To be clear, I guess I should have explicitly added a multiply
sign: O(m*n).


> Good luck getting that to scale linearly.

The analysis is linear in m, not exponential, not even polynomial. I also gave an
upper bound based on a minute and a half of thinking. Surely somebody can do
better when they sit down and apply themselves for more than that.


> map/reduce works because you don't have joins... the "table" is
> effectively a completely de-normalized view of your data.

I take issue with your presumption that the data must be completely
denormalized. It doesn't have to be. Joins can be simulated even if the full
denormalized record is not incorporated within one atomic record. That's why
I said that
the analysis required (an upper bound of) O(m*_n_) time, because a
brute-force algorithm would have to search the entire db for each join. If
the db was completely denormalized, as you suggest, the time analysis
reduces to O(m).
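
To make that concrete, here's a rough Python sketch of what I mean: one
brute-force scan of the db per join, which is where the O(m*n) upper bound
comes from (the records and field names below are invented for illustration):

# Hypothetical illustration: resolve the join references of a record by
# scanning the whole database once per reference.  Each scan is O(m), so
# n joins cost O(m*n).
db = [
    {"id": 1,  "type": "order",    "customer_ref": 10, "product_ref": 20},
    {"id": 10, "type": "customer", "name": "Alice"},
    {"id": 20, "type": "product",  "name": "Widget"},
]

def lookup(ref):
    # brute-force scan of the db: O(m)
    for record in db:
        if record["id"] == ref:
            return record
    return None

def resolve_joins(record, ref_fields):
    # one O(m) scan per join field, i.e. O(m*n)
    return {field: lookup(record[field]) for field in ref_fields}

order = db[0]
joined = resolve_joins(order, ["customer_ref", "product_ref"])
print(joined["customer_ref"]["name"], joined["product_ref"]["name"])  # Alice Widget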


> Also (and this should have been mentioned earlier): quoting
> Moore's law is fine as long as you are talking about sequential
> processing. When you switch to parallel processing, you should also
> consider Amdahl's law.

The debate on that would take me on a tangent I'm not willing to spend time
on at the moment. It's an interesting point you raise, but, like I said, a
tangent, especially considering you dispute the main points of my position.


From: Raymond Wiker on
"vanekl" <vanek(a)acd.net> writes:

> Raymond Wiker wrote:
>> "vanekl" <vanek(a)acd.net> writes:
>>
>>> Yikes. You're way off. The worst-case scenario is O(m.n),
>>> where:
>>> m is the number of records in the database, and
>>> n is the number of joins.
>>> I'm sure that if I gave this some thought I could do better.
>>
>> O(m^n), surely?
>
> God no, not if ^ represents the exponential operator. Why did you add an
> exponent? To be clear, I guess I should have explicitly added a multiply
> sign: O(m*n).

^ is the exponential operator, right. What do you think the cost
of a self join on a table of m records is?

My guess would be

O(m^2) if the table is not indexed on the (single) join column, and
O(m*log(m)) if the table is indexed.

This does not take into account actual data retrieval, which would carry
a non-negligible cost if the data is distributed.

I would further expect this to extend to
O(m^n) and O(m*(log(m) ^ (n - 1))) for the non-indexed and indexed
cases, respectively (with a reduction based on elements that are thrown away
during the processing of the join).

There are other possibilities, too, like splitting the full join
into a set of pairs, and combining those. I do not see any obvious
reasons that this should have any lower complexity, but it may be a
better fit for parallel processing.
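
For what it's worth, here's a toy Python sketch of the two cases I have in
mind, with a sorted copy of the join column plus binary search standing in
for the index (names and data are invented):

# Toy illustration of the two costs; not a real query planner.
import bisect

table = [{"id": i, "key": i % 100} for i in range(1000)]

def nested_loop_self_join(rows):
    # no index on the join column: compare every pair of rows, O(m^2)
    return [(a, b) for a in rows for b in rows if a["key"] == b["key"]]

def indexed_self_join(rows):
    # sorted join column as a stand-in index; each probe is a binary
    # search, so the join is O(m*log(m)) plus the size of the output
    srt = sorted(rows, key=lambda r: r["key"])
    keys = [r["key"] for r in srt]
    out = []
    for a in rows:
        lo = bisect.bisect_left(keys, a["key"])
        hi = bisect.bisect_right(keys, a["key"])
        out.extend((a, srt[i]) for i in range(lo, hi))
    return out

assert len(nested_loop_self_join(table)) == len(indexed_self_join(table))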

>> Good luck getting that to scale linearly.
>
> The analysis is linear in m, not exponential, not even polynomial. I also gave an
> upper bound based on a minute and a half of thinking. Surely somebody can do
> better when they sit down and apply themselves for more than that.

I'd like to know the details of your analysis - I cannot see
that a join can be a linear operation. If the data is denormalized, the
join is not necessary (duh), and you also have a data set that is a
natural fit for map/reduce (no relationship between elements of the
mapped data set).
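
As a toy example (Python rather than an actual map/reduce framework; the
record layout is invented):

# With fully denormalized records, the map step treats every record
# independently and the reduce step is just an associative merge.
from functools import reduce
from collections import Counter

records = [   # each record already carries everything it needs
    {"order": 1, "customer": "Alice", "region": "EU", "amount": 30},
    {"order": 2, "customer": "Bob",   "region": "US", "amount": 20},
    {"order": 3, "customer": "Alice", "region": "EU", "amount": 50},
]

mapped = [Counter({r["region"]: r["amount"]}) for r in records]  # map: per record
totals = reduce(lambda a, b: a + b, mapped, Counter())           # reduce: merge
print(totals)   # Counter({'EU': 80, 'US': 20})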

Note: I'm not saying that you're wrong, but you haven't produced
any analysis (yet).

>> map/reduce works because you don't have joins... the "table" is
>> effectively a completely de-normalized view of your data.
>
> I take issue with your presumption that the data must be completely
> denormalized. It doesn't have to be. Joins can be simulated even if the full
> denormalized record is not incorporated within one atomic record. That's why
> I said that
> the analysis required (an upper bound of) O(m*_n_) time, because a
> brute-force algorithm would have to search the entire db for each join. If
> the db was completely denormalized, as you suggest, the time analysis
> reduces to O(m).

If the data is denormalized, the join becomes O(1). A
brute-force single (self-) join is O(m^2).

>> Also (and this should have been mentioned earlier): quoting
>> Moore's law is fine as long as you are talking about sequential
>> processing. When you switch to parallel processing, you should also
>> consider Amdahl's law.
>
> The debate on that would take me on a tangent I'm not willing to spend time
> on at the moment. It's an interesting point you raise, but, like I said, a
> tangent, especially considering you dispute the main points of my position.

*You* started off with the parallel processing tangent; it would
be good if you also tried to keep in mind the problems particular to
parallel processing.
From: Alex Mizrahi on
RW> I'd like to know the details of your analysis - I cannot see
RW> that a join can be a linear operation.

As I understand it, you can get something like O(N*M) complexity if one table is
not indexed and all other tables are indexed.
That is, you do a full scan on a table with N elements, doing M operations
on each element. Well, you also have a (log N) somewhere there, but who
cares?
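
Roughly this shape, I mean (a Python sketch; dicts stand in for the hash
indexes, and the table and column names are invented):

# Full scan over the un-indexed table (N rows), with one indexed lookup per
# joined table (M of them) for every row: O(N*M).  A tree index instead of
# a hash would add the log factor mentioned above.
orders = [{"id": i, "cust": i % 50, "prod": i % 20} for i in range(200)]  # N rows, no index
customers = {c: {"name": "cust-%d" % c} for c in range(50)}               # indexed
products  = {p: {"name": "prod-%d" % p} for p in range(20)}               # indexed
indexes = [("cust", customers), ("prod", products)]                       # M = 2

result = []
for row in orders:                    # O(N) scan
    joined = dict(row)
    for field, index in indexes:      # M indexed probes per row
        joined[field] = index[row[field]]
    result.append(joined)
print(len(result))   # 200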

I don't think any sane person would seriously think about this case, so
chances that vanekl is trolling are very high.

From: Raymond Wiker on
"Alex Mizrahi" <udodenko(a)users.sourceforge.net> writes:

> RW> I'd like to know the details of your analysis - I cannot see
> RW> that a join can be a linear operation.
>
> As I understand it, you can get something like O(N*M) complexity if one
> table is not indexed and all other tables are indexed.
> That is, you do a full scan on a table with N elements, doing M
> operations on each element. Well, you also have a (log N) somewhere
> there, but who cares?

The lookup of the indexed element would typically be an
O(log(m)) operation, unless the lookup used a perfect hash (maybe even
an identity hash :-). Thus, the full scan for the single join should be
an O(m*log(m)) operation - as I stated earlier.

> I don't think any sane person would seriously think about this case, so
> chances that vanekl is trolling are very high.

I'm waiting for vanekl's analysis - best case, I learn something
new...
From: vanekl on
Raymond Wiker wrote:
> "vanekl" <vanek(a)acd.net> writes:
>
>> Raymond Wiker wrote:
>>> "vanekl" <vanek(a)acd.net> writes:
>>>
>>>> Yikes. You're way off. The worst-case scenario is O(m.n),
>>>> where:
>>>> m is the number of records in the database, and
>>>> n is the number of joins.
>>>> I'm sure that if I gave this some thought I could do better.
>>>
>>> O(m^n), surely?
>>
>> God no, not if ^ represents the exponential operator. Why did you
>> add an exponent? To be clear, I guess I should have explicitly added
>> a multiply sign: O(m*n).
>
> ^ is the exponential operator, right. What do you think the cost
> of a self join on a table of m records is?
>
> My guess would be
>
> O(m^2) if the table is not indexed on the (single) join column, and
> O(m*log(m)) if the table is indexed.
>
> This does not take into account actual data retrieval, which would
> carry a non-negligible cost if the data is distributed.
>
> I would further expect this to extend to
> O(m^n) and O(m*(log(m) ^ (n - 1))) for the non-indexed and indexed
> cases, respectively (with a reduction based on elements that are thrown away
> during the processing of the join).
>
> There are other possibilities, too, like splitting the full join
> into a set of pairs, and combining those. I do not see any obvious
> reasons that this should have any lower complexity, but it may be a
> better fit for parallel processing.
>
>>> Good luck getting that to scale linearly.
>>
>> The analysis is linear in m, not exponential, not even polynomial. I also
>> gave an upper bound based on a minute and a half of thinking. Surely
>> somebody can do better when they sit down and apply themselves for
>> more than that.
>
> I'd like to know the details of your analysis - I cannot see
> that a join can be a linear operation. If the data is denormalized,
> the join is not necessary (duh), and you also have a data set that is
> a natural fit for map/reduce (no relationship between elements of the
> mapped data set).
>
> Note: I'm not saying that you're wrong, but you haven't produced
> any analysis (yet).


Yes, you're right, I waved my hands there and you deserve an explanation. My
minute and a half assessment was hasty.

The way I was picturing it, the algorithm that performs joins across the
entire database is essentially a merge sort performed n times.

m, number of records in database
n, number of joins
s, average number of records in a selection set

I envisioned that one pass through the entire db would produce the n
selection sets, each of average size s.

So, where O(s log s) is the time complexity of merge sort,

O(m + n(s log s))

Since n is typically a small constant, I drop it. I acknowledge this isn't a
worst case analysis.

O(m + s log s)

I believe we are both in agreement that this reduces to O(m) when working
with a completely denormalized database.
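
A rough Python sketch of that shape (the selection predicates and join keys
are invented, and s is assumed to be much smaller than m):

# One pass over the db builds the n selection sets: O(m).  Each set is then
# sorted on its join key: n sorts at O(s log s) apiece.  The sorted sets can
# finally be merged pairwise to produce the joined rows.
db = [{"id": i, "kind": ("order" if i % 2 else "customer"), "key": i // 2}
      for i in range(1000)]

selections = [                        # hypothetical join predicates, n = 2
    lambda r: r["kind"] == "order",
    lambda r: r["kind"] == "customer",
]

sets = [[] for _ in selections]
for record in db:                     # single O(m) pass
    for i, pred in enumerate(selections):
        if pred(record):
            sets[i].append(record)

sets = [sorted(s, key=lambda r: r["key"]) for s in sets]   # n merge sorts

def merge_join(a, b):
    # linear merge of two sets sorted on "key" (assumes unique keys per side)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i]["key"] < b[j]["key"]:
            i += 1
        elif a[i]["key"] > b[j]["key"]:
            j += 1
        else:
            out.append((a[i], b[j]))
            i += 1
            j += 1
    return out

print(len(merge_join(sets[0], sets[1])))   # 500 joined pairs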