From: pg on 4 Jun 2010 20:27

I use the approach with intermediate files, using Unix "split", "tr", then
"paste", admittedly on much smaller data sets.

This is faster than "cut" because cutting out each column in turn means
reading n lines up to the first delimiter, then n lines up to the second
delimiter, ..., then n lines to the end of line. It therefore reads
O(n + 2n + ... + mn) = O(n m^2) amount of data to produce the m output rows,
then O(mn) to read those back and write them out as a single file.

Using "split" requires one O(mn) pass to produce n single-row files, one
pass to read each row, transpose it and write it to a column file, then one
pass for "paste" to read one line from each column file and join them into
the final output file.

On datasets of several MB, split was also 5-6 times faster than a
"while read" loop achieving the same thing.

So here is the code I use (more or less):

#!/bin/sh
d1=$(mktemp -d)
d2=$(mktemp -d)
# One file per input row; -a 20 allows up to 26^20 names (assumes n < 26^20).
split -l 1 -a 20 "$1" "$d1"/
for f in "$d1"/*; do
    # Turn each row file into a column file, keeping only the basename.
    tr '\t' '\n' < "$f" > "$d2"/"${f##*/}"
done
paste "$d2"/* > "$1".t
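For reference, the cut-based loop this is being compared against would look
roughly like the following sketch (it assumes tab-delimited input and takes
the column count from the first line):

#!/bin/sh
# One full pass over the input per column: the O(n m^2) behaviour described
# above.  Output goes to "$1".t, as in the script above.
m=$(head -n 1 "$1" | awk -F '\t' '{ print NF }')
: > "$1".t
i=1
while [ "$i" -le "$m" ]; do
    # Extract column i and fold it into a single tab-separated output row.
    cut -f "$i" "$1" | paste -s - >> "$1".t
    i=$((i + 1))
done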
From: Robert Bonomi on 24 Jun 2010 14:08

In article <86rqqmF14dU1(a)mid.individual.net>,
Greg Russell <grussell(a)example.con> wrote:
>"Janis Papanagnou" <janis_papanagnou(a)hotmail.com> wrote in message
>news:hua1jh$4j3$1(a)news.m-online.net...
>
>> You're already aware of the problem that you have in principle with
>> this type of task. One more thought; are your matrices fully filled
>> or sparsely populated?
>
>Excellent point.
>
>> In the latter case you might be able to use the all-in-memory approach
>> anyway, because you'd need to allocate just the actual values and leave
>> the zero-values away.
>
>Are you assuming that zero values constitute "sparse" elements of the array?
>The OP made no such indication, only that "... columns [are] separated by
>tabs, rows [are] separated by newlines."
>
>Null values e.g. ...\t\t... are far more reasonable for the sparse entries,
>but of course the OP can define the data structure in question.
>
>> (In any language, like awk, that supports associative arrays this
>> would require just a few lines of code.)
>
>Please educate us in such simplicity ... please.
>
>Even such well-known routines as the R/S/SPlus http://cran.r-project.org
>transpose function, or the Fortran transpose routines from http://ornl.gov
>(U.S. Oak Ridge National Laboratories) that we used 20+ years ago on such
>arrays, are more than "a few lines of code."

Yup. They don't handle arrays the way AWK does, either. Awk has *BUILT-IN*
support for sparse arrays.

>If you can provide the "few lines" of awk necessary then you too can be
>famous rather than merely flippant.

In awk it *IS* trivial: you pull the input value into a simple variable,
compare it against the 'missing data' value, and only on a *mismatch*
assign it to array[row,col].

Awk arrays are _dynamic_ in size; when you reference a previously
un-referenced element, the array grows by the size of *one* element,
regardless of whether it is contiguous to another element or not.
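A minimal sketch of that idea (assuming tab-separated input, "0" as the
missing-data value to leave out of the array, and empty output fields for
the missing cells) might look like:

#!/bin/sh
# Store only the non-missing cells in an associative array indexed by
# (row,col), then print the transpose; absent cells become empty fields.
awk -F '\t' '
{
    if (NF > maxcol) maxcol = NF
    for (c = 1; c <= NF; c++)
        if ($c != 0)                     # skip the missing-data value
            cell[NR, c] = $c
}
END {
    for (c = 1; c <= maxcol; c++) {
        line = ""
        for (r = 1; r <= NR; r++) {
            v = ((r, c) in cell) ? cell[r, c] : ""
            line = line (r > 1 ? "\t" : "") v
        }
        print line
    }
}' "$@"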
From: Robert Bonomi on 24 Jun 2010 14:20

In article <hub0s6$fkp$1(a)news.m-online.net>,
Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
>Maxwell Lol wrote:
>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>
>>> Janis Papanagnou wrote:
>[To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>
>>> Temporary files and arrays are unnecessary. You can read and print the
>>> columns as rows one by one.
>
>I didn't say they are necessary; rather I suggested using one of the two
>approaches, depending on the actual problem domain. Your suggestion seems
>to me to be much inferior WRT performance; O(N^2) instead of O(N), which
>is unacceptable if you're processing gigabytes of data as the OP actually
>does.
>
>> Yes, by re-reading the input file each time you want a new row.
>
>Indeed.
>
>It would be interesting if there were a tool to directly access entities
>in a file by seek'ing to the respective position to avoid such performance
>degradation; say, as an intelligent tail(1) program may operate on character
>entities. But that seems impossible without having a "structured file"; a
>database, maybe, or file concepts of a higher abstraction level, which I
>haven't heard of being used in practice.

Actually, 'dd' will do that. IF the fields are fixed-width (which they
probably are *NOT*, in the OP's case) one could do
"dd if=sourcefile bs={fieldwidth} skip={fieldindex} count=1"
to extract a single field. I suspect that this would be _seriously_ slower
than the 'read a row, write to per-column tempfiles' approach. It would
eliminate the need for an additional 9+ gigs of disk space for the
tempfiles, though.
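To make that concrete (a purely hypothetical layout: FIELDWIDTH bytes per
field, FIELDS_PER_ROW fields per row, no delimiters, in a file called
matrix.dat):

#!/bin/sh
# Extract the single fixed-width cell at (row, col), both 0-based, by
# seeking straight to its byte offset with dd.
FIELDWIDTH=8
FIELDS_PER_ROW=1000
row=5
col=42
offset=$(( row * FIELDS_PER_ROW + col ))
dd if=matrix.dat bs="$FIELDWIDTH" skip="$offset" count=1 2>/dev/null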
From: Janis Papanagnou on 24 Jun 2010 15:18

On 24/06/10 20:20, Robert Bonomi wrote:
> In article <hub0s6$fkp$1(a)news.m-online.net>,
> Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
>> Maxwell Lol wrote:
>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>
>>>> Janis Papanagnou wrote:
>> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>>
>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>> columns as rows one by one.
>>
>> I didn't say they are necessary; rather I suggested using one of the two
>> approaches, depending on the actual problem domain. Your suggestion seems
>> to me to be much inferior WRT performance; O(N^2) instead of O(N), which
>> is unacceptable if you're processing gigabytes of data as the OP actually
>> does.
>>
>>> Yes, by re-reading the input file each time you want a new row.
>>
>> Indeed.
>>
>> It would be interesting if there were a tool to directly access entities
>> in a file by seek'ing to the respective position to avoid such performance
>> degradation; say, as an intelligent tail(1) program may operate on character
>> entities. But that seems impossible without having a "structured file"; a
>> database, maybe, or file concepts of a higher abstraction level, which I
>> haven't heard of being used in practice.
>
> Actually, 'dd' will do that. IF the fields are fixed-width

That's exactly the crux of why it doesn't apply in the general case we
have here.

Janis

> (which they probably are *NOT*, in the OP's case) one could do
> "dd if=sourcefile bs={fieldwidth} skip={fieldindex} count=1"
> to extract a single field. I suspect that this would be _seriously_ slower
> than the 'read a row, write to per-column tempfiles' approach. It would
> eliminate the need for an additional 9+ gigs of disk space for the
> tempfiles, though.
From: Robert Bonomi on 24 Jun 2010 15:38
In article <i00b2s$cof$1(a)news.m-online.net>,
Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
>On 24/06/10 20:20, Robert Bonomi wrote:
>> In article <hub0s6$fkp$1(a)news.m-online.net>,
>> Janis Papanagnou <janis_papanagnou(a)hotmail.com> wrote:
>>> Maxwell Lol wrote:
>>>> Thomas 'PointedEars' Lahn <PointedEars(a)web.de> writes:
>>>>
>>>>> Janis Papanagnou wrote:
>>> [To make sure; I didn't write the subsequent two lines; that was Mr Lahn.]
>>>>>
>>>>> Temporary files and arrays are unnecessary. You can read and print the
>>>>> columns as rows one by one.
>>>
>>> I didn't say they are necessary; rather I suggested using one of the two
>>> approaches, depending on the actual problem domain. Your suggestion seems
>>> to me to be much inferior WRT performance; O(N^2) instead of O(N), which
>>> is unacceptable if you're processing gigabytes of data as the OP actually
>>> does.
>>>
>>>> Yes, by re-reading the input file each time you want a new row.
>>>
>>> Indeed.
>>>
>>> It would be interesting if there were a tool to directly access entities
>>> in a file by seek'ing to the respective position to avoid such performance
>>> degradation; say, as an intelligent tail(1) program may operate on character
>>> entities. But that seems impossible without having a "structured file"; a
>>> database, maybe, or file concepts of a higher abstraction level, which I
>>> haven't heard of being used in practice.
>>
>> Actually, 'dd' will do that. IF the fields are fixed-width
>
>That's exactly the crux of why it doesn't apply in the general case we
>have here.

Just playing devil's advocate -- it would be "somewhat" (may be a record for
understatement!) painful, but one *could* still use dd, if you had another
tool that made a pass through the file and recorded the start position and
length of each value.

Seriously, your best bet is going to be to take a 'strip' of rows that will
fit into memory, transpose it, and write it out to a per-strip file; repeat
for as many additional strips as it takes to process the whole file, then
paste the per-strip files together side by side to form the final output.

Or, if you know the maximum width of a field, you can write a little program
that reads sequential values from the source file and plugs them into a
calculated position in a transposition file, being careful to include an
appropriate delimiter with each field written. Then use another hand-crafted
program that reads that transposition file and 'squeezes out' all the 'null'
padding (this converts back to the variable-width fields of the original).
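A rough sketch of that strip-wise approach (the 10000-row strip size and the
file names are placeholders; "$1" is the input and "$1".t the output):

#!/bin/sh
# Transpose one strip of rows at a time with awk, then paste the per-strip
# results side by side.  Memory use is bounded by the strip size, not by
# the whole file.
STRIP=10000
dir=$(mktemp -d)
split -l "$STRIP" -a 6 "$1" "$dir"/strip.
for f in "$dir"/strip.*; do
    awk -F '\t' '
    {
        if (NF > maxcol) maxcol = NF
        for (c = 1; c <= NF; c++) cell[c, NR] = $c
    }
    END {
        for (c = 1; c <= maxcol; c++) {
            line = cell[c, 1]
            for (r = 2; r <= NR; r++) line = line "\t" cell[c, r]
            print line
        }
    }' "$f" > "$f".t
done
paste "$dir"/strip.*.t > "$1".t
rm -rf "$dir"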