From: vanekl on 18 Mar 2010 12:41

Raymond Wiker wrote:
> "vanekl" <vanek(a)acd.net> writes:
>
>> Yikes. You're way off. The worst-case scenario is O(m.n),
>> where:
>> m is the number of records in the database, and
>> n is the number of joins.
>> I'm sure that if I gave this some thought I could do better.
>
> O(m^n), surely?

God no, not if ^ represents the exponential operator. Why did you add an
exponent? To be clear, I guess I should have explicitly added a multiply
sign: O(m*n).

> Good luck getting that to scale linearly.

The analysis is linear, not exponential, not even polynomial. I also gave
an upper bound based on a minute and a half of thinking. Surely somebody
can do better when they sit down and apply themselves for more than that.

> map/reduce works because you don't have joins... the "table" is
> effectively a completely de-normalized view of your data.

I take issue with your presumption that the data must be completely
denormalized. It doesn't have to be. Joins can be simulated even if the
full denormalized record is not stored as one atomic record. That's why I
said the analysis required (an upper bound of) O(m*_n_) time, because a
brute-force algorithm would have to search the entire db for each join. If
the db were completely denormalized, as you suggest, the time analysis
reduces to O(m).

> Also (and this should have been mentioned earlier): quoting
> Moore's law is fine as long as you are talking about sequential
> processing. When you switch to parallel processing, you should also
> consider Amdahl's law.

The debate on that would take me on a tangent I'm not willing to spend
time on at the moment. It's an interesting point you raise, but, like I
said, a tangent, especially considering you dispute the main points of my
position.
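As a rough sketch of the brute-force simulation being described here (the
in-memory list-of-dicts layout and the field names below are assumptions
for illustration, not anything from the post): each of the n joins is one
full scan of the m records with a constant-time membership test per record,
which is where the claimed O(m*n) upper bound comes from.

def simulate_joins(db, root_pred, join_specs):
    """db: list of record dicts (m entries).
    root_pred: predicate picking the initial records.
    join_specs: list of (parent_field, child_field) pairs, one per join (n)."""
    selected = [r for r in db if root_pred(r)]          # first scan: O(m)
    for parent_field, child_field in join_specs:        # n joins
        wanted = {r[parent_field] for r in selected if parent_field in r}
        # one full scan of the db per join, O(1) set lookup per record
        selected += [r for r in db if r.get(child_field) in wanted]
    return selected

A call might look like simulate_joins(db, lambda r: r.get('type') ==
'order', [('customer_id', 'id')]), again with made-up field names; the
sketch only collects matching records per join rather than building full
joined rows.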
From: Raymond Wiker on 18 Mar 2010 16:03

"vanekl" <vanek(a)acd.net> writes:
> Raymond Wiker wrote:
>> "vanekl" <vanek(a)acd.net> writes:
>>
>>> Yikes. You're way off. The worst-case scenario is O(m.n),
>>> where:
>>> m is the number of records in the database, and
>>> n is the number of joins.
>>> I'm sure that if I gave this some thought I could do better.
>>
>> O(m^n), surely?
>
> God no, not if ^ represents the exponential operator. Why did you add an
> exponent? To be clear, I guess I should have explicitly added a multiply
> sign: O(m*n).

^ is the exponential operator, right. What do you think the cost of a
self join on a table of m records is?

My guess would be

  O(m^2) if the table is not indexed on the (single) join column, and
  O(m*log(m)) if the table is indexed.

This does not take into account actual data retrieval, which would carry a
non-negligible cost if the data is distributed.

I would further expect this to extend to O(m^n) and O(m*(log(m) ^ (n - 1)))
for the non-indexed and indexed cases, respectively (with a reduction based
on elements that are thrown away during the processing of the join).

There are other possibilities, too, like splitting the full join into a
set of pairs and combining those. I do not see any obvious reason that
this should have lower complexity, but it may be a better fit for parallel
processing.

>> Good luck getting that to scale linearly.
>
> The analysis is linear, not exponential, not even polynomial. I also gave
> an upper bound based on a minute and a half of thinking. Surely somebody
> can do better when they sit down and apply themselves for more than that.

I'd like to know the details of your analysis - I cannot see that a join
can be a linear operation. If the data is denormalized, the join is not
necessary (duh), and you also have a data set that is a natural fit for
map/reduce (no relationship between elements of the mapped data set).

Note: I'm not saying that you're wrong, but you haven't produced any
analysis (yet).

>> map/reduce works because you don't have joins... the "table" is
>> effectively a completely de-normalized view of your data.
>
> I take issue with your presumption that the data must be completely
> denormalized. It doesn't have to be. Joins can be simulated even if the
> full denormalized record is not stored as one atomic record. That's why I
> said the analysis required (an upper bound of) O(m*_n_) time, because a
> brute-force algorithm would have to search the entire db for each join.
> If the db were completely denormalized, as you suggest, the time analysis
> reduces to O(m).

If the data is denormalized, the join becomes O(1). A brute-force single
(self-) join is O(m^2).

>> Also (and this should have been mentioned earlier): quoting
>> Moore's law is fine as long as you are talking about sequential
>> processing. When you switch to parallel processing, you should also
>> consider Amdahl's law.
>
> The debate on that would take me on a tangent I'm not willing to spend
> time on at the moment. It's an interesting point you raise, but, like I
> said, a tangent, especially considering you dispute the main points of my
> position.

*You* started off with the parallel processing tangent; it would be good
if you also tried to keep in mind the problems particular to parallel
processing.
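To make the two self-join cost estimates above concrete, here is a minimal
Python sketch (the table layout and names are invented for illustration):
the nested-loop version compares every pair of rows, O(m^2), while the
sorted-index version pays O(m*log(m)) to sort and then O(log(m)) per probe.

from bisect import bisect_left, bisect_right

def nested_loop_self_join(rows, key):
    # O(m^2): every row is compared against every other row
    return [(a, b) for a in rows for b in rows if a[key] == b[key]]

def indexed_self_join(rows, key):
    # O(m*log(m)) to build the sorted index, then O(log(m)) per probe;
    # the size of the result can still dominate if the join key repeats a lot
    ordered = sorted(rows, key=lambda r: r[key])
    keys = [r[key] for r in ordered]
    out = []
    for a in rows:
        lo, hi = bisect_left(keys, a[key]), bisect_right(keys, a[key])
        out.extend((a, ordered[i]) for i in range(lo, hi))
    return out

Either way, the result of a self join can itself hold up to m^2 pairs, which
is why the indexed bound is often quoted with the output size added on.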
From: Alex Mizrahi on 18 Mar 2010 16:45

RW> I'd like to know the details of your analysis - I cannot see
RW> that a join can be a linear operation.

As I understand it, you can get something like O(N*M) complexity if one
table is not indexed and all the other tables are indexed. That is, you do
a full scan of the table with N elements, doing M operations on each
element. Well, you also have a (log N) in there somewhere, but who cares?

I don't think any sane person would seriously think about this case, so
the chances that vanekl is trolling are very high.
From: Raymond Wiker on 18 Mar 2010 16:59

"Alex Mizrahi" <udodenko(a)users.sourceforge.net> writes:
> RW> I'd like to know the details of your analysis - I cannot see
> RW> that a join can be a linear operation.
>
> As I understand it, you can get something like O(N*M) complexity if one
> table is not indexed and all the other tables are indexed. That is, you
> do a full scan of the table with N elements, doing M operations on each
> element. Well, you also have a (log N) in there somewhere, but who cares?

The lookup of the indexed element would typically be an O(log(m))
operation, unless the lookup used a perfect hash (maybe even an identity
hash :-). Thus, the full scan for the single join should be O(m*log(m)) -
as I stated earlier.

> I don't think any sane person would seriously think about this case, so
> the chances that vanekl is trolling are very high.

I'm waiting for vanekl's analysis - best case, I learn something new...
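A small Python sketch of the probe costs being compared in the last two
posts (everything below is hypothetical, in-memory data): a dict plays the
role of the (perfect) hash index with ~O(1) probes, a sorted list gives the
O(log(m)) probes, so scanning the unindexed table and probing once per row
costs roughly O(m) or O(m*log(m)) per joined table.

from bisect import bisect_left

def hash_index(rows, key):
    return {r[key]: r for r in rows}              # assumes unique join keys

def sorted_index(rows, key):
    ordered = sorted(rows, key=lambda r: r[key])  # O(m*log(m)), paid once
    return [r[key] for r in ordered], ordered

def probe_sorted(keys, ordered, value):
    i = bisect_left(keys, value)                  # O(log(m)) per probe
    return ordered[i] if i < len(keys) and keys[i] == value else None

def scan_and_probe(outer, index, fk):
    # one pass over the unindexed table, one hash probe per row: ~O(m)
    return [(row, index.get(row[fk])) for row in outer]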
From: vanekl on 18 Mar 2010 17:06
Raymond Wiker wrote:
> "vanekl" <vanek(a)acd.net> writes:
>
>> Raymond Wiker wrote:
>>> "vanekl" <vanek(a)acd.net> writes:
>>>
>>>> Yikes. You're way off. The worst-case scenario is O(m.n),
>>>> where:
>>>> m is the number of records in the database, and
>>>> n is the number of joins.
>>>> I'm sure that if I gave this some thought I could do better.
>>>
>>> O(m^n), surely?
>>
>> God no, not if ^ represents the exponential operator. Why did you
>> add an exponent? To be clear, I guess I should have explicitly added
>> a multiply sign: O(m*n).
>
> ^ is the exponential operator, right. What do you think the cost
> of a self join on a table of m records is?
>
> My guess would be
>
>   O(m^2) if the table is not indexed on the (single) join column, and
>   O(m*log(m)) if the table is indexed.
>
> This does not take into account actual data retrieval, which would
> carry a non-negligible cost if the data is distributed.
>
> I would further expect this to extend to O(m^n) and
> O(m*(log(m) ^ (n - 1))) for the non-indexed and indexed cases,
> respectively (with a reduction based on elements that are thrown away
> during the processing of the join).
>
> There are other possibilities, too, like splitting the full join
> into a set of pairs and combining those. I do not see any obvious
> reason that this should have lower complexity, but it may be a
> better fit for parallel processing.
>
>>> Good luck getting that to scale linearly.
>>
>> The analysis is linear, not exponential, not even polynomial. I also
>> gave an upper bound based on a minute and a half of thinking. Surely
>> somebody can do better when they sit down and apply themselves for
>> more than that.
>
> I'd like to know the details of your analysis - I cannot see
> that a join can be a linear operation. If the data is denormalized,
> the join is not necessary (duh), and you also have a data set that is
> a natural fit for map/reduce (no relationship between elements of the
> mapped data set).
>
> Note: I'm not saying that you're wrong, but you haven't produced
> any analysis (yet).

Yes, you're right, I waved my hands there and you deserve an explanation.
My minute-and-a-half assessment was hasty.

The way I was picturing the algorithm that performs joins across the
entire database is as a merge sort performed n times.

  m, the number of records in the database
  n, the number of joins
  s, the average number of records in a selection set

I envisioned that one pass through the entire db would be able to produce
n selection sets of size s. So, where O(s log s) is the time complexity of
merge sort,

  O(m + n(s log s))

Since n is typically a small constant, I drop it. I acknowledge this isn't
a worst-case analysis.

  O(m + s log s)

I believe we are both in agreement that this reduces to O(m) when working
with a completely denormalized database.
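A Python sketch of the analysis above (the record layout, predicates, and
field names are made up for illustration): one pass over the db buckets
records into the n selection sets, O(m), and each set of roughly s records
is then sorted on its join key, O(s log s) apiece, which is where the
O(m + n(s log s)) bound comes from.

def build_selection_sets(db, set_specs):
    # set_specs: list of (name, predicate, join_key) triples, one per join (n)
    sets = {name: [] for name, _, _ in set_specs}
    for record in db:                               # single pass: O(m)
        for name, pred, _ in set_specs:
            if pred(record):
                sets[name].append(record)
    for name, _, join_key in set_specs:             # n sorts of ~s records each
        # assumes every record in a set carries its join key
        sets[name].sort(key=lambda r: r[join_key])
    return sets

Merging two of the sorted sets on a shared key is then linear in their
combined sizes when the join keys are close to unique, as in the merge step
of merge sort.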