From: kj on 3 Jun 2010 14:19

(1) I have a relatively large file (9.4GB) containing a rectangular matrix (columns separated by tabs, rows separated by newlines). I want to generate a file that contains the transpose of this matrix, and do so without slurping the entire matrix into memory all at once.

Is there a utility that would be helpful for this task?

(2) The only approach I can think of is to write temporary files containing the transposes of the submatrices corresponding to "strips" of n consecutive rows, and then use /usr/bin/paste to glue all these submatrices into a single file.

Still, even this strategy requires transposing the n rows. I can do this easily with a Python or Perl script, but I was wondering if there is some Unix utility to do it?

Any suggestions for accomplishing (1) or (2) from the command line using Unix utilities would be appreciated.

(FWIW, I use zsh.)

TIA!

~K
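For illustration, approach (2) above might look roughly like the following Python sketch. The function name `transpose_in_strips` and the `rows_per_strip` parameter are made up for this example, and the final gluing step is done in Python rather than with paste(1) so that the sketch is self-contained. It assumes every row has the same number of fields.

```python
import os
import tempfile
from itertools import islice


def transpose_in_strips(src, dst, rows_per_strip=1000, sep="\t"):
    """Transpose src into dst, holding at most rows_per_strip rows in memory."""
    strip_paths = []
    with open(src) as fin:
        while True:
            # Read the next strip of at most rows_per_strip input rows.
            strip = [line.rstrip("\n").split(sep)
                     for line in islice(fin, rows_per_strip)]
            if not strip:
                break
            # zip(*strip) transposes the strip: one output line per column.
            fd, path = tempfile.mkstemp(suffix=".strip")
            with os.fdopen(fd, "w") as fout:
                for column in zip(*strip):
                    fout.write(sep.join(column) + "\n")
            strip_paths.append(path)

    # Glue the strips side by side, as paste(1) would: line i of the result
    # is line i of every strip file, joined by the separator.
    files = [open(p) for p in strip_paths]
    try:
        with open(dst, "w") as fout:
            for parts in zip(*files):
                fout.write(sep.join(p.rstrip("\n") for p in parts) + "\n")
    finally:
        for f in files:
            f.close()
        for p in strip_paths:
            os.remove(p)
```

All strip files are open at once during the gluing step, so `rows_per_strip` should be large enough to keep their number below the per-process open-file limit.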
From: Thomas 'PointedEars' Lahn on 3 Jun 2010 17:35

kj wrote:

> (1) I have a relatively large file (9.4GB) containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> memory all at once.
  ^^^^^^^^^^^^^^^^^^^

The only way to do this that I can see is to cut out the i-th column, convert newlines to tabs, print newline, and repeat that for every column with increasing `i'. IOW, you would be trading memory efficiency against runtime efficiency, since you would have to read the entire file as often as you have columns.

Utilities that would come in handy then are cut(1) or awk(1), and tr(1). Shell redirection can optionally write the result into a new file.

HTH

PointedEars
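The one-pass-per-column idea described above can be sketched in Python as follows. This is only an illustration, not the cut(1)/tr(1) pipeline itself; the function name `transpose_by_column_passes` is hypothetical, and the sketch assumes every row has as many fields as the first one.

```python
def transpose_by_column_passes(src, dst, sep="\t"):
    """Transpose src into dst, rereading src once per column."""
    # Count the columns from the first line only.
    with open(src) as fin:
        first = fin.readline()
    ncols = len(first.rstrip("\n").split(sep)) if first else 0

    with open(dst, "w") as fout:
        for i in range(ncols):
            # One full pass over the input per column: column i of the
            # input becomes row i of the output.
            with open(src) as fin:
                for lineno, line in enumerate(fin):
                    if lineno:
                        fout.write(sep)
                    fout.write(line.rstrip("\n").split(sep)[i])
            fout.write("\n")
```

Only one input line is held in memory at a time, but the whole file is read once per column, which is the memory-for-runtime trade mentioned in the post.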
From: pk on 3 Jun 2010 17:48

kj wrote:

> (1) I have a relatively large file (9.4GB) containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
>
> Is there a utility that would be helpful for this task?
>
> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then use /usr/bin/paste to
> glue all these submatrices into a single file.
>
> Still, even this strategy requires transposing the n rows. I can
> do this easily with a Python or Perl script, but I was wondering
> if there is some Unix utility to do it?
>
> Any suggestions for accomplishing (1) or (2) from the command line
> using Unix utilities would be appreciated.

AFAICT, you can't even write a single complete line of the transposition without having read up to the last line of the original matrix.

While on 64-bit machines with enough RAM a process could probably keep 9.4GB of data in memory, I think the most practical approach is to read the file line by line, append each item to its own "column" file, and then just cat all these files together to get the transposed matrix.

For example, with a small matrix using awk:

$ cat matrix
a b c d
e f g h
i j k l
m n o p
q r s t

$ awk 'NR==1{
    # build the output file names once; remember the column count for END
    nf=NF
    for(i=1;i<=nf;i++)names[i]=sprintf("column%03d", i)
  }
  {
    # append field i to its column file, separator first except for row 1
    for(i=1;i<=NF;i++){
      printf "%s%s", s[i], $i > names[i]
      s[i]=FS
    }
  }
  END{
    # add terminating newlines
    for(i=1;i<=nf;i++) print "" > names[i]
  }' matrix

(adapt to your actual number of columns, separator, etc.)

To avoid the overhead of calling sprintf() for every field, the file names are saved in an array at the beginning (the NR==1 block), and output is then redirected to names[i].

When that has run,

$ cat column001
a e i m q
$ cat column002
b f j n r
$ cat column* > transposed
$ cat transposed
a e i m q
b f j n r
c g k o s
d h l p t
From: Seebs on 3 Jun 2010 22:08

On 2010-06-03, kj <no.email(a)please.post> wrote:

> (1) I have a relatively large file (9.4GB) that containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.

> Is there a utility that would be helpful for this task?

There are utilities that would be helpful, but I don't think it can be done without temporary files.

Quite simply: So far as I can tell, before you can finish the first line of your output, you have to have read the last line, so either you're jumping around a lot, which will be excruciatingly slow, or you have it all in memory, or...

> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then using /usr/bin/paste to
> glue all these submatrices into a single file.

Well, pragmatically, the shortest path is probably to do something much like this.

A thought: Do you have any expectations about the contents of the fields? Say, are they of fixed length? Could they be padded to a fixed length without undue hardship? Do you know in advance the number of rows and columns?

It wouldn't be exceptionally hard to write a new file containing a fixed-size grid of fixed-size fields, then write a tiny little C program to go through populating fields appropriately.

Finally, last but not least: Use sqlite, shove everything into a table, extract from the table. It will use a ton of disk space and CPU time, but it will work within available memory and be surprisingly zippy, I'd guess.

-s
-- 
Copyright 2010, all wrongs reversed. Peter Seebach / usenet-nospam(a)seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
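The fixed-size grid idea from the post above could be sketched like this, in Python rather than C purely for brevity. The function name `transpose_fixed_width` and its parameters are invented for this example. It assumes the number of rows and columns and a maximum field width are known in advance, that the separator is a single byte, and that space-padded fields are acceptable in the output.

```python
def transpose_fixed_width(src, dst, nrows, ncols, width, sep=b"\t"):
    """Transpose src into dst, padding every field to `width` bytes."""
    cell = width + 1          # one field plus one separator/newline byte
    row_len = nrows * cell    # bytes per output row (one per input column)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Pre-size the output so that every cell has a fixed offset.
        fout.truncate(ncols * row_len)
        for r, line in enumerate(fin):
            fields = line.rstrip(b"\n").split(sep)
            # The last input row supplies the last field of each output row.
            end = b"\n" if r == nrows - 1 else sep
            for c, field in enumerate(fields):
                if len(field) > width:
                    raise ValueError("field wider than %d bytes" % width)
                # Input cell (r, c) lands at output row c, column r.
                fout.seek(c * row_len + r * cell)
                fout.write(field.ljust(width) + end)
```

Memory use stays at one input line, but each field costs a seek in the output file, so the run time depends heavily on how well the operating system caches those scattered writes.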
From: Janis Papanagnou on 4 Jun 2010 01:05

kj wrote:

> (1) I have a relatively large file (9.4GB) containing a
> rectangular matrix (columns separated by tabs, rows separated by
> newlines). I want to generate a file that contains the transpose
> of this matrix, and do so without slurping the entire matrix into
> memory all at once.
>
> Is there a utility that would be helpful for this task?
>
> (2) The only approach I can think of is to write temporary files
> containing the transposes of the submatrices corresponding to
> "strips" of n consecutive rows, and then use /usr/bin/paste to
> glue all these submatrices into a single file.
>
> Still, even this strategy requires transposing the n rows. I can
> do this easily with a Python or Perl script, but I was wondering
> if there is some Unix utility to do it?
>
> Any suggestions for accomplishing (1) or (2) from the command line
> using Unix utilities would be appreciated.

You're already aware of the problem that you have in principle with this type of task.

One more thought: are your matrices fully filled or sparsely populated? In the latter case you might be able to use the all-in-memory approach anyway, because you'd only need to store the actual values and could leave out the zero values. (In any language that supports associative arrays, like awk, this would require just a few lines of code.)

If your matrices are not sparsely populated and your memory is limited, then go the way of using temporary files.

Janis

>
> (FWIW, I use zsh.)
>
> TIA!
>
> ~K
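The sparse-matrix suggestion above might look like the following sketch, with a Python dictionary standing in for awk's associative arrays. The function name `transpose_sparse` is hypothetical, and the sketch assumes that empty cells are written literally as `0` in the file.

```python
from collections import defaultdict


def transpose_sparse(src, dst, zero="0", sep="\t"):
    """Transpose src into dst, keeping only the non-zero cells in memory."""
    nonzero = defaultdict(dict)   # column index -> {row index: value}
    nrows = ncols = 0
    with open(src) as fin:
        for r, line in enumerate(fin):
            fields = line.rstrip("\n").split(sep)
            ncols = max(ncols, len(fields))
            nrows = r + 1
            for c, value in enumerate(fields):
                if value != zero:
                    nonzero[c][r] = value

    with open(dst, "w") as fout:
        for c in range(ncols):
            column = nonzero.get(c, {})
            # Fill in the zero value wherever nothing was stored.
            fout.write(sep.join(column.get(r, zero)
                                for r in range(nrows)) + "\n")
```

Memory use grows with the number of non-zero cells rather than with the file size, so this only helps when the matrix really is sparse.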