From: Ed Morton on 4 Jun 2010 09:15 On 6/4/2010 3:12 AM, Greg Russell wrote: > "Janis Papanagnou"<janis_papanagnou(a)hotmail.com> wrote in message > news:hua1jh$4j3$1(a)news.m-online.net... > >> You're already aware of the problem that you have in principle with >> this type of task. One more thought; are your matrices fully filled >> or sparsely populated? > > Excellemt point. > >> In the latter case you might be able to use >> the all-in-memory approach anyway, because you'd need just allocate >> the actual values and leave the zero-values away. > > Are you assuming that zero values comstitute "sparse" elements of the array? > The OP made no such indication, only that "... columns [are] separated by > tabs, rows [are] separated by newlines." > > Null values e.g. ...\t\t... are far more reasonable for the sparse entries, > but of course the OP can define the data structure that is of question. > >> (In any language, like awk, that supports associative arrays this >> would require just a few lines of code.) > > Please educate us in such simplicity ... please. > > Even such well-known routines as the R/S/SPlus http://cran.r-project.org > tramspose function, or the Fortran transpose routines from http://ornl.gov > (U.S. Oak Ridge National Laboratories) that we used 20+ years ago on such > arrays, are more than "a few lines of code." > > If you can provide the "few lines" of awk necessary then you too can be > famous rather than merely flippant. > I don't think Janis was being flippant, it's just that, as he showed in a followup, it's a trivial solution as long as we can identify a "null" value for input fields (e.g. either zero or a null string). I didn't read the references - if they make it sound difficult then they must be dealing with a different problem or coded in a language that doesn't support associative arrays or addressing input that has no specific null value (?) or..... Ed.
From: Janis Papanagnou on 4 Jun 2010 09:59 Maxwell Lol wrote: > Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes: > >> Janis Papanagnou wrote: [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.] >> >> >> Temporary files and arrays are unnecessary. You can read and print the >> columns as rows one by one. I didn't say they are necessary; rather I suggested to use one of the two approaches, depending on the actual problem domain. Your suggestion seems to me to be much inferiors WRT performance; O(N^2) instead of O(N), which is inacceptable if you're processing Gigabytes of data as the OP actually does. > > Yes, by re-reading the input file each time you want a new row. Indeed. It would be interesting if there'd be a tool to directly access entities in a file by seek'ing to the respective position to avoid such performance degradation; say, as intelligent tail(1) program may operate on character entities. But that seems impossible without having a "structured file"; a database, maybe, or using file concepts of higher abstraction levels which I haven't heard to be in use in practise. Janis
From: pk on 4 Jun 2010 10:25 Janis Papanagnou wrote: > It would be interesting if there'd be a tool to directly access entities > in a file by seek'ing to the respective position to avoid such performance > degradation; say, as intelligent tail(1) program may operate on character > entities. But that seems impossible without having a "structured file"; a > database, maybe, or using file concepts of higher abstraction levels which > I haven't heard to be in use in practise. Well, to nitpick a bit, tail can certainly be told to count bytes rather than lines (using -c), and if all the "cells" are the same width, it may be possible to calculate the offsets for a certain [row,column] cell. In the same way, dd can seek to arbitrary positions in the file, so (just for fun!) one could do, say # assuming each element is 2 characters wide, 7 rows, 5 columns $ cat matrix 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 # matrix[row,column] is at offset row * 15 + col * 3 # in the file, if row and col are 0-based # bash for col in {0..5}; do line="" sep="" for row in {0..6}; do offset=$(( row * 15 + col * 3 )) elem=$(dd if=matrix skip=$offset bs=1 count=2) line="$line$sep$elem" sep=" " done printf "%s\n" "$line" done > transposed $ cat transposed Again, it's just for fun. With a large file that ends up calling dd possibly millions of times. And although seeks may be fast, I would not bet that in terms of pure runtime it would be faster than the pure O^2 approach (but I haven't tried).
From: pk on 4 Jun 2010 10:34 pk wrote: > Again, it's just for fun. With a large file that ends up calling dd > possibly millions of times. And although seeks may be fast, I would not > bet that in terms of pure runtime it would be faster than the pure O^2 > approach (but I haven't tried). Just to follow up a bit on that, still assuming fixed widths, the same algorithm could also be implemented using some programming language like perl pr python, which would probably be much more efficient as it won't have all the overhead of spawning processes. Here's an example in Perl: open(M,"+<",$ARGV[0]); for $col (0..4) { $line=$sep=""; for $row (0..6) { seek(M, $row * 15 + $col * 3, 0); read(M, $elem, 2); print $sep, $elem; $sep = " "; } print "\n"; } close(M); (all error checking omitted)
From: Thomas 'PointedEars' Lahn on 4 Jun 2010 10:41
Janis Papanagnou wrote: > Maxwell Lol wrote: >> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes: >>> Janis Papanagnou wrote: > [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.] Isn't it a bit pointless to include that attribution line then? >>> Temporary files and arrays are unnecessary. You can read and print the >>> columns as rows one by one. > > I didn't say they are necessary; rather I suggested to use one of the two > approaches, depending on the actual problem domain. Your suggestion seems > to me to be much inferiors WRT performance; O(N^2) instead of O(N), which > is inacceptable if you're processing Gigabytes of data as the OP actually > does. AISB, it is a tradeoff: Memory vs. runtime efficiency/complexity. The OP asked for a solution that would save memory, hence my suggestion. As to whether that iterative approach would be inacceptable, that depends on the machine speed, and the other processes it is supposed to be running. ISTM nowadays processing a large text file multiple times is to be preferred over using a multiple of the amount of memory required to do the former in order to store the user data of the text file. Using an implementation of my suggestion, transposing a 1000×1000 matrix in a 6.6 MiB text file and storing the result in another text file took only 84 seconds on an Intel Core2 Duo SU9400 CPU (1.4 GHz, 800 MHz FSB) that already had an estimated combined load of 50-75% primarily due to a multimedia application running in the background. The shell process that did it and its child processes combined never required more than 4 MiB of memory. PointedEars |