From: Janis Papanagnou on 4 Jun 2010 11:22

pk wrote:
> Janis Papanagnou wrote:
>
>> It would be interesting if there'd be a tool to directly access entities
>> in a file by seek'ing to the respective position to avoid such performance
>> degradation; say, as an intelligent tail(1) program may operate on
>> character entities. But that seems impossible without having a
>> "structured file"; a database, maybe, or using file concepts of higher
>> abstraction levels which I haven't heard to be in use in practise.
>
> Well, to nitpick a bit, tail can certainly be told to count bytes rather
> than lines (using -c), and if all the "cells" are the same width, it may be
> possible to calculate the offsets for a certain [row,column] cell.

I was referring to something else. I've heard (not checked the tail(1)
source code) that modern implementations would make a guessed seek close
to the point where the wanted data might start. I was specifically not
speaking of equal-width elements, which would certainly make the task
quite trivial.

Janis

> In the same way, dd can seek to arbitrary positions in the file, so
> (just for fun!) one could do, say
>
> # assuming each element is 2 characters wide, 7 rows, 5 columns
>
> $ cat matrix
> 01 02 03 04 05
> 06 07 08 09 10
> 11 12 13 14 15
> 16 17 18 19 20
> 21 22 23 24 25
> 26 27 28 29 30
> 31 32 33 34 35
>
> # matrix[row,column] is at offset row * 15 + col * 3
> # in the file, if row and col are 0-based
>
> # bash
> for col in {0..4}; do
>   line=""
>   sep=""
>   for row in {0..6}; do
>     offset=$(( row * 15 + col * 3 ))
>     elem=$(dd if=matrix skip=$offset bs=1 count=2 2>/dev/null)
>     line="$line$sep$elem"
>     sep=" "
>   done
>   printf "%s\n" "$line"
> done > transposed
>
> $ cat transposed
> 01 06 11 16 21 26 31
> 02 07 12 17 22 27 32
> 03 08 13 18 23 28 33
> 04 09 14 19 24 29 34
> 05 10 15 20 25 30 35
>
> Again, it's just for fun. With a large file, that ends up calling dd
> possibly millions of times. And although seeks may be fast, I would not
> bet that in terms of pure runtime it would be faster than the plain
> O(N^2) approach (but I haven't tried).
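For the fixed-width case pk describes, the offset arithmetic can also be
wrapped in a small helper that fetches a single cell. This is only an
illustrative sketch, not something from the thread; the cell() name and
the assumptions (single-space separators, one newline per row) are mine:

  # Fetch matrix[row,col] (0-based) from a file of fixed-width cells.
  # Assumes every cell is WIDTH characters, cells are separated by one
  # space, and each row ends in a single newline.
  cell() {
      local file=$1 row=$2 col=$3 width=$4 cols=$5
      local rowlen=$(( cols * (width + 1) ))   # cell width + separator/newline
      local offset=$(( row * rowlen + col * (width + 1) ))
      dd if="$file" bs=1 skip="$offset" count="$width" 2>/dev/null
  }

  # For the sample matrix above:  cell matrix 2 3 2 5   prints "14"

As pk notes, this only pays off for a handful of lookups; for a full
transpose the per-call overhead of dd dominates long before the seeks do.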
From: Janis Papanagnou on 4 Jun 2010 14:36

Thomas 'PointedEars' Lahn wrote:
> Janis Papanagnou wrote:
>
>> Maxwell Lol wrote:
>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>> Janis Papanagnou wrote:
>>
>> [To make sure; I didn't write the subsequent two lines; that was Mr
>> Lahn.]
>
> Isn't it a bit pointless to include that attribution line then?

Tell that to Maxwell if you're happy playing the control freak. (Stripping
context is fine if done right and if it's not misleading.) It certainly is
*not* pointless to make clear that I haven't written anything that YOU,
specifically, wrote. No big issue, but some clarification is in order.

>
>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>> columns as rows one by one.
>>
>> I didn't say they are necessary; rather I suggested to use one of the two
>> approaches, depending on the actual problem domain. Your suggestion seems
>> to me to be much inferior WRT performance; O(N^2) instead of O(N), which
>> is unacceptable if you're processing gigabytes of data as the OP actually
>> does.
>
> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
> asked for a solution that would save memory, hence my suggestion.

Suggestions, especially sensible ones, are always welcome.

>
> As to whether that iterative approach would be unacceptable, that depends
> on the machine speed, and the other processes it is supposed to be
> running. ISTM nowadays processing a large text file multiple times is to
> be preferred over using a multiple of the amount of memory required to do
> the former in order to store the user data of the text file.

You seem to not know, or haven't thought much about, the computational
complexity; about the _practical relevance_ of the difference between
O(N^2) and O(N), I suppose?

>
> Using an implementation of my suggestion, transposing a 1000×1000 matrix
> in a 6.6 MiB text file and storing the result in another text file took
> only 84 seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB)
> that already had an estimated combined load of 50-75% primarily due to a
> multimedia application running in the background. The shell process that
> did it and its child processes combined never required more than 4 MiB of
> memory.

You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
going to compare an O(N^2) algorithm to an O(N) algorithm.

If you've not yet learned about computational complexity, I suggest you try
a linear O(N) algorithm and compare its timing with your O(N^2) algorithm
on a data set of 10^6 bytes, then increase that to 4*10^6 bytes, and
finally increase it to the data set that is the topic here; which is
approx. 1358 times larger than your toy data, and that factor enters the
equation squared (1358^2). You won't see any results soon[*] with your
approach or with any other approach of complexity O(N^2).

Another point you may recognize is that the "Core2"/"GHz"/... terminology
you are celebrating above is good if you want to sell computers, not if
you're searching for a good algorithm that performs well and scales well
with real-world application data. Don't expect commercial system
development/evolution to solve real-world performance issues of a
significant magnitude. If you want to compare algorithms of *different*
complexity classes, it's not that helpful. OTOH, your data *is* helpful if
you make the effort to calculate - or try out - the difference to the data
size relevant in this thread; you've omitted this crucial step. This just
BTW.
Janis

[*] And while you wait for the results you can do the math; it's surely an
eye opener, I promise.

>
>
> PointedEars
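To make the footnote concrete, here is the arithmetic worked out under one
debatable assumption - that the 84-second measurement quoted above scales
quadratically with the data size for the repeated-pass approach and
linearly for a single-pass one (PointedEars argues further down that the
number of passes tracks the column count rather than the file size, so
read these figures as illustrative only):

  $ awk 'BEGIN {
      base = 84        # seconds measured for the 6.6 MiB toy file
      f    = 1358      # ratio 9.4 GB / 6.6 MiB
      printf "linear:    ~%.0f hours\n", base * f / 3600
      printf "quadratic: ~%.1f years\n", base * f * f / (3600 * 24 * 365)
    }'
  linear:    ~32 hours
  quadratic: ~4.9 years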
From: Maxwell Lol on 4 Jun 2010 16:21

Janis Papanagnou <janis_papanagnou(a)hotmail.com> writes:
> It would be interesting if there'd be a tool to directly access entities
> in a file by seek'ing to the respective position to avoid such performance
> degradation; say, as an intelligent tail(1) program may operate on
> character entities. But that seems impossible without having a
> "structured file"; a database, maybe, or using file concepts of higher
> abstraction levels which I haven't heard to be in use in practise.

There are databases that are designed to do this. I was reading an article
in the July 2010 Linux Journal that talks about the NoSQL databases
(BigTable, HBase, Cassandra, Redis, MongoDB, Voldemort, CouchDB, Dynomite,
HyperTable, etc.). They just use a friggin' large database and a simple
key-to-value lookup.

As a simple form, use awk's associative arrays:

  a[row "," column] = value

In perl (which might perform better), it's

  $a{"$row $column"} = $value;

This keeps the entire data set in memory, but with virtual memory it should
work. You just get pages swapped in and out. I have no idea of the
performance. But if you have a 4 GB file, and 4 GB of memory...

Well, you can also get a machine with BIG RAM. It might be cheaper than a
MATLAB license.
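A minimal sketch of that associative-array idea, in awk (my own
illustration, assuming whitespace-separated columns; as Maxwell notes, the
whole matrix ends up in memory, so it stands or falls with the available
RAM and swap):

  awk '
      { if (NF > maxnf) maxnf = NF
        for (i = 1; i <= NF; i++) a[i "," NR] = $i }
      END {
        for (i = 1; i <= maxnf; i++) {
          line = a[i "," 1]
          for (j = 2; j <= NR; j++) line = line " " a[i "," j]
          print line
        }
      }' matrix > transposed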
From: Maxwell Lol on 4 Jun 2010 16:30

Janis Papanagnou <janis_papanagnou(a)hotmail.com> writes:
> Tell that to Maxwell if you're happy playing the control freak.

Oops. If I screwed up, I apologize.
From: Thomas 'PointedEars' Lahn on 4 Jun 2010 17:03
Janis Papanagnou wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Janis Papanagnou wrote:
>>> Maxwell Lol wrote:
>>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>>> Janis Papanagnou wrote:
>>>
>>> [To make sure; I didn't write the subsequent two lines; that was Mr
>>> Lahn.]
>>
>> Isn't it a bit pointless to include that attribution line then?
>
> Tell that to Maxwell if you're happy playing the control freak.

I am telling it to *you* because I am referring to *your* reply. This has
nothing to do with me being a control freak. Pot, kettle, black.

> (Stripping context is fine if done right and if it's not misleading.) It
> certainly is *not* pointless to make clear that I haven't written anything
> that YOU, specifically, wrote. No big issue, but some clarification is in
> order.

Why, you could have simply removed that attribution line and spared
yourself the explanation. Since computational complexity appears to be the
most important thing to you, you could have scored there.

>>>>> Temporary files and arrays are unnecessary. You can read and print
>>>>> the columns as rows one by one.
>>>
>>> I didn't say they are necessary; rather I suggested to use one of the
>>> two approaches, depending on the actual problem domain. Your suggestion
>>> seems to me to be much inferior WRT performance; O(N^2) instead of
>>> O(N), which is unacceptable if you're processing gigabytes of data as
>>> the OP actually does.
>>
>> AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP
>> asked for a solution that would save memory, hence my suggestion.
>
> Suggestions, especially sensible ones, are always welcome.

You should face the simple fact that what you consider sensible might not
be considered sensible by others, depending on the circumstances. In this
case, one might consider it sensible to do what I suggested because one
does not have a choice.

>> As to whether that iterative approach would be unacceptable, that depends
>> on the machine speed, and the other processes it is supposed to be
>> running. ISTM nowadays processing a large text file multiple times is to
>> be preferred over using a multiple of the amount of memory required to do
>> the former in order to store the user data of the text file.
>
> You seem to not know, or haven't thought much about, the computational
> complexity; about the _practical relevance_ of the difference between
> O(N^2) and O(N), I suppose?

No, you are mistaken. As you could have noticed, complexity is exactly what
I have in mind in this case. You don't seem to understand that reducing
computational complexity is not always the ultimate goal. It certainly is
not in this particular case.

>> Using an implementation of my suggestion, transposing a 1000×1000 matrix
>> in a 6.6 MiB text file and storing the result in another text file took
>> only 84 seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB)
>> that already had an estimated combined load of 50-75% primarily due to a
>> multimedia application running in the background. The shell process that
>> did it and its child processes combined never required more than 4 MiB of
>> memory.
>
> You've tried "toy data"; 6.6 MiB is nothing compared to 9.4 GB if you're
> going to compare an O(N^2) algorithm to an O(N) algorithm.

I am well aware that it is only a fraction of the data that the OP wants to
process. You could multiply the time required to get an idea of how long it
would probably take with 9.4 GiB on the same machine (ca. 17 to 34 hours);
however, you should also note that the matrix was described as rectangular,
not square. So it might have a lot of columns and measurably fewer rows,
which would reduce the number of rows that cut(1) or awk(1) would have to
process considerably. Compare that against the space requirements of
storing the entire matrix in memory.

In any case, the point that you don't seem to get is that the OP does not
want to store 9.4 GiB of data in memory (or a database), maybe because they
just cannot do that. So they could not care less how long it takes (i.e.,
the time complexity of the algorithm) as long as it can be done.

> [snip more pointless babbling]


PointedEars
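For reference, a minimal sketch of the multiple-pass strategy described
above: one full pass over the input per output row, so only a single row is
ever held in memory. This is my own illustration (whitespace-separated
columns assumed), not PointedEars' actual script:

  #!/bin/sh
  # Transpose by re-reading the input once per column: O(rows * cols)
  # work overall, but a constant, tiny memory footprint.
  infile=matrix
  outfile=transposed

  ncols=$(awk 'NR == 1 { print NF; exit }' "$infile")

  : > "$outfile"
  col=1
  while [ "$col" -le "$ncols" ]; do
      # One pass over the whole file, emitting column $col as a row.
      awk -v c="$col" '
          { printf "%s%s", (NR > 1 ? " " : ""), $c }
          END { print "" }' "$infile" >> "$outfile"
      col=$((col + 1))
  done

Whether this or the in-memory variant is preferable is exactly the
trade-off argued over in this thread: the repeated passes cost time
proportional to columns times file size, while the memory footprint stays
at a few MiB.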