From: Claus Yeh on
Dear SAS users,

Hi, I have an ASCII file that looks like this (about 2 million columns
and 300 rows):

A C G T...
C T C A...

Basically, each letter is separated by a space.

However, I want my SAS dataset to look like this:

Var1    Var2
A C     G T
C T     C A

where each variable holds two letters. Is there a way to force SAS to
read in increments of two letters even though they are all separated
by a space? I am a little hesitant to use @ because there are 2
million columns.

thank you so much,
claus
From: Alex on
On Mar 10, 10:35 pm, Claus Yeh <phoebe.caulfiel...(a)gmail.com> wrote:

Hi Claus,

I'm not an expert in reading raw data, but you could create your
variables directly from the input buffer (_infile_). I'm not sure how
this will perform with millions of variables, but it works with a small
example file. Please see the code below.

Best,
Alex


filename have 'd:\temp\test.txt' ;

data _null_;
    file have ;
    put 'A C G T' ;
    put 'C T C A' ;
    put 'A T C T' ;
    put 'T T C A' ;
run;

%let n_vars = 2 ;

data want (keep = Var: );
    infile have ;
    input ;

    length Var1-Var&n_vars $ 3 ;
    array Var ( &n_vars ) $;

    do i = 1 to dim(Var) ;
        Var(i) = catx(' ', scan( _infile_, i, ' ' ), scan( _infile_, i+1, ' ' ) );
    end;
run;

proc print;
run;
From: Alex on
On Mar 11, 4:17 pm, Alex <alexander.k...(a)gmail.com> wrote:

Oops, the position argument in the scan() calls was incorrect: pair i
needs words i*2-1 and i*2, not i and i+1. Please use this code instead:

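filename have 'd:\temp\test.txt' ;

/* Write a small test file: four records of space-separated letters */
data _null_;
    file have ;
    put 'A C G T' ;
    put 'C T C A' ;
    put 'A T C T' ;
    put 'T T C A' ;
run;

/* Number of two-letter variables per record */
%let n_vars = 2 ;

data want (keep = Var: );
    infile have ;
    input ;    /* load the record into the input buffer (_infile_) only */

    length Var1-Var&n_vars $ 3 ;    /* two letters plus one space */
    array Var ( &n_vars ) $;

    /* Pair i is built from words 2i-1 and 2i of the record */
    do i = 1 to dim(Var) ;
        Var(i) = catx(' ', scan( _infile_, i*2-1, ' ' ), scan( _infile_, i*2, ' ' ) );
    end;
run;

proc print;
run;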
From: Claus Yeh on
On Mar 11, 7:43 am, Alex <alexander.k...(a)gmail.com> wrote:

Thank you so much, Alex. Your method works really well. However, _infile_
has a limit of about 32,000 characters, so maybe we could add a nested
loop? The record length (lrecl) is about 2.4 million characters (600K
variables times 4 characters each).
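
One possible way around the limit, as an untested sketch: read the
letters with list input and a trailing @ instead of going through
_infile_, so the loop walks along the record two letters at a time. The
lrecl value of 3000000 and the helper variables a and b are assumptions
for illustration only, and this presumes the SAS host accepts an lrecl
that large. n_vars is left at 2 to match the small test file; it would
need to be the real number of pairs (about 600000) for the full data.

```sas
filename have 'd:\temp\test.txt' ;

%let n_vars = 2 ;    /* number of two-letter variables per record */

data want (keep = Var: );
    /* lrecl must cover the longest record; 3000000 is an assumed value. */
    /* truncover keeps a short record from spilling onto the next line.  */
    infile have lrecl = 3000000 truncover ;

    length Var1-Var&n_vars $ 3  a b $ 1 ;
    array Var ( &n_vars ) $;

    do i = 1 to dim(Var) ;
        /* List input reads the next two space-delimited letters.   */
        /* The trailing @ holds the record, so the next pass of the */
        /* loop continues on the same line.                         */
        input a b @ ;
        Var(i) = catx(' ', a, b) ;
    end;
    /* The held record is released when this data step iteration ends, */
    /* so the next iteration starts on the next line of the file.      */
run;

proc print;
run;
```

Because the data are never copied out of _infile_, its length limit
should not matter here; whether a dataset with 600K character variables
is practical is a separate question.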

thanks,
claus