From: Arthur Tabachneck on
Claus,

I haven't kept up with this thread but, from what I did see, I didn't notice
a brute force approach. Would something like the following work?

/*Create test file*/
data have;
array stuff(1000);
do i= 1 to 1000;
do j=1 to 1000;
stuff(j)=ranuni(543210);
end;
output;
end;
run;

/*Create driver macro variables*/
data steps;
do start=1 to 1000 by 100;
sequence+1;
end=start+99;
output;
end;
run;
proc sql noprint;
select catt('part',sequence),
catt('stuff',start,'-stuff',end)
into :files separated by ',',
:contents separated by ','
from steps;
quit;

/*Macro to run all of the data steps*/
%macro doit;
%local i;
%let i = 1;
%do %while (%scan(%quote(&files),&i,%str(,)) ne );
data %scan(%quote(&files),&i,%str(,));
set have (keep=%qscan(%quote(&contents),&i,%str(,)));
run;
%let i = %eval(&i + 1);
%end;
%mend;

%doit

Of course, the data step could always be replaced with whatever proc is
desired.
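
For example, the loop body could run a proc against each slice directly,
something along these lines (untested sketch; proc means chosen arbitrarily):

proc means data=have (keep=%qscan(%quote(&contents),&i,%str(,))) noprint
out=%scan(%quote(&files),&i,%str(,));
run;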

HTH,
Art
---------
On Wed, 7 Oct 2009 15:12:42 -0700, Claus Yeh <phoebe.caulfield42(a)GMAIL.COM>
wrote:

>On Oct 7, 1:22 pm, michaelrait...(a)WESTAT.COM (Michael Raithel) wrote:
>> Dear SAS-L-ers,
>>
>> Claus Yeh, posted the following:
>>
>>
>>
>> > Dear all,
>>
>> > I have a very large SAS dataset - 500,000 variables and 4000
>> > observations.
>>
>> > I want to create smaller datasets that contains about 1000 to 10,000
>> > variables of the original 500,000 variable dataset.
>>
>> > I used data step to do this but it was very very slow (I need to
>> > create multiple smaller steps).
>>
>> > ie. data small;
>> > set large;
>> > keep var1-var1000;
>> > run;
>>
>> > Is there a way to do it in Proc Dataset that can output the smaller
>> > dataset much quicker? If there are other efficient ways, please let
>> > me know too.
>>
>> Claus, yeh, I can think of a way of doing this that will run so fast,
>> that you will hear a sonic boom as the DATA Step reaches Mach I! And, it
>> won't cost you one bit more of storage, to boot!
>>
>> How about using a DATA Step view? You could code:
>>
>> data smallarge/view=smallarge;
>> set large;
>> keep var1-var1000;
>> run;
>>
>> ...which would create a DATA Step view file in the blink of an eye.
>> Thereafter, you could use that view to surface only Var1 - Var1000 in
>> future SAS PROCs or DATA Steps.
>>
>> Would that work for you, or are you going to wait for some other
>> SAS-L-sharpie's clever-er suggestion?
>>
>> Claus, best of luck in all of your SAS endeavors!
>>
>> I hope that this suggestion proves helpful now, and in the future!
>>
>> Of course, all of these opinions and insights are my own, and do not
>> reflect those of my organization or my associates. All SAS code and/or
>> methodologies specified in this posting are for illustrative purposes only
>> and no warranty is stated or implied as to their accuracy or applicability.
>> People deciding to use information in this posting do so at their own risk.
>>
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Michael A. Raithel
>> "The man who wrote the book on performance"
>> E-mail: MichaelRait...(a)westat.com
>>
>> Author: Tuning SAS Applications in the MVS Environment
>>
>> Author: Tuning SAS Applications in the OS/390 and z/OS Environments,
>> Second Edition
>> http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=58172
>>
>> Author: The Complete Guide to SAS Indexes
>> http://www.sas.com/apps/pubscat/bookdetails.jsp?catid=1&pc=60409
>>
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> ...fire all of your guns at once and explode into space... - Steppenwolf,
>> Born to be Wild
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>Hi Michael,
>
>Thank you so much. I will do some test runs for "view" by running proc
>logistic on it.
>
>thanks again,
>claus
From: "Data _null_;" on
You need to post the code you ran, because you may be making another
performance "mistake". If you reference the VIEW more than once, any
performance gain that the view provides is negated.
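
For example, every step that reads a view re-runs the view's subsetting
logic (illustrative sketch, not necessarily the code Claus ran):

data v / view=v;
set big(keep=c1-c1000);
run;
proc means data=v; run; /* view code executes here */
proc freq data=v; run; /* ...and executes all over again here */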

Back to your original bit of code. It is better to subset variables on
the input side rather than the output side. Plus it makes you look like
you know what you are doing. This simple example demonstrates that;
notice S1000_2 uses less memory and time.

9320 data big;
9321 array c[500000] $1;
9322 do i = 1 to dim(c);
9323 c[i] = put(rantbl(12345,.4,.3),f1.);
9324 end;
9325 do _n_ = 1 to 1000;
9326 output;
9327 end;
9328 run;

NOTE: The data set WORK.BIG has 1000 observations and 500001 variables.
NOTE: DATA statement used (Total process time):
real time 38.51 seconds
user cpu time 7.03 seconds
system cpu time 0.52 seconds
Memory 125958k


9329
9330 data s1000_1;
9331 set big;
9332 keep c1-c1000;
9333 run;

NOTE: There were 1000 observations read from the data set WORK.BIG.
NOTE: The data set WORK.S1000_1 has 1000 observations and 1000 variables.
NOTE: DATA statement used (Total process time):
real time 4.40 seconds
user cpu time 3.45 seconds
system cpu time 0.84 seconds
Memory 166687k


9334
9335 data s1000_2;
9336 set big(keep=c1-c1000);
9337 run;

NOTE: There were 1000 observations read from the data set WORK.BIG.
NOTE: The data set WORK.S1000_2 has 1000 observations and 1000 variables.
NOTE: DATA statement used (Total process time):
real time 0.73 seconds
user cpu time 0.27 seconds
system cpu time 0.43 seconds
Memory 25501k


On 10/7/09, Claus Yeh <phoebe.caulfield42(a)gmail.com> wrote:
> > [Michael Raithel's reply, quoted in full above, snipped]
>
> Hi Mike,
>
> I tried the "view". It did finish creating the dataset fast, but the
> subsequent procedures took quite a long time.
>
> I was wondering if "indexing" the dataset would help?
>
> thanks,
> claus
>
From: Joe Matise on
Don't know about the documentation, but at least for 9.1.3 and above that's
not true. I've had a few projects with more than that [unfortunately] due
to client requirements. I've had up to about 60k; not sure if I've ever
had over 66k.

-Joe

On Wed, Oct 7, 2009 at 9:23 PM, Lou <lpogoda(a)hotmail.com> wrote:

> "Claus Yeh" <phoebe.caulfield42(a)gmail.com> wrote in message
> news:8bd4021c-4ba3-4f51-92a8-5b3bc23a6aa2(a)x6g2000prc.googlegroups.com...
> > Dear all,
> >
> > I have a very large SAS dataset - 500,000 variables and 4000
> > observations.
>
> Huh? According to the documentation, the maximum number of variables in a
> single SAS data set under Windows is 32,767. I know Windows is not the
> be-all and end-all, but what platform are you on?
>
> > I want to create smaller datasets that contains about 1000 to 10,000
> > variables of the original 500,000 variable dataset.
> >
> > I used data step to do this but it was very very slow (I need to
> > create multiple smaller steps).
> >
> > ie. data small;
> > set large;
> > keep var1-var1000;
> > run;
>
> Well, you don't necessarily need multiple steps - you could try something
> like
>
> data small1 (keep = var1 - var1000)
> small2 (keep = var1001 - var2000)
> ....;
> set large;
> run;
>
> You might also try the BUFNO and BUFSIZE options. Increasing the number
> and/or size of the buffers allocated for processing SAS datasets can speed
> up execution (at the expense of consuming more memory).
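>
> Something along these lines, for what it's worth (untested sketch; the
> option values are arbitrary):
>
> data small1 (bufsize=64k bufno=10);
> set large (keep=var1-var1000 bufno=10);
> run;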
>
> > Is there a way to do it in Proc Dataset that can output the smaller
> > dataset much quicker? If there are other efficient ways, please let
> > me know too.
> >
> > thank you so much,
> > claus
>