how to pick the last 65000 rows [SAS]

Prev: Help in running batch SAS code in UNIX
Next: Macro question

From: eric on 13 Jan 2010 14:57

I have a database with several million rows and need a daily "snapshot" file
with just 65000 rows, according to latest month, day and year

The database has on occasion rows inserted out of order so can not depend on
it containing the items in date order

I did not create it and the month, day, year variables are defined as
character so I have to use input (see below)

this code does it, but can it be made more efficient?

libname mylib <database path> ;

proc sql;
create table work.work as select *,
mdy( input (month, 2.), input (day, 2.), input(year, 4.)) as filedate
from mylib.db1
order by filedate DESC ;
quit;

data last;set work; if _n_ = 65000 then stop;run;

From: rjf2 on 13 Jan 2010 16:59

> From: eric
> Sent: Wednesday, January 13, 2010 2:57 PM
> Subject: how to pick the last 65000 rows
>
> I have a database with several million rows and need a daily
> "snapshot" file
> with just 65000 rows, according to latest month, day and year
>
> The database has on occasion rows inserted out of order so
> can not depend on
> it containing the items in date order
>
> I did not create it and the month, day, year variables are defined as
> character so I have to use input (see below)
>
> this code does it, but can it be made more efficient?
>
> libname mylib <database path> ;
>
> proc sql;
> create table work.work as select *,
> mdy( input (month, 2.), input (day, 2.), input(year, 4.))
> as filedate
> from mylib.db1
> order by filedate DESC ;
> quit;
>
> data last;set work; if _n_ = 65000 then stop;run;

you might use an index to avoid the sort

RTFM on both data set options: (last)obs and firstobs.

There are several macro functions available that return nobs

http://www.datasavantconsulting.com/roland/nobs.sas

0.
%Let Nobs = %eval(%Nobs(MyLib.db1) - 65000));

oops! this may not work on an sql database table!

1.
PROC Sort data = MyLib.db1
out = Work.SnapShot
(firstobs = &Nobs.) ;
by filedate DESC;

2.
DATA Work.SnapShot;
do until(EndoFile);
set MyLib.db1
(firstobs = &Nobs.)
end = EndoFile;
output;
end;
stop;
run;

Ron Fehd the macro maven CDC Atlanta GA USA RJF2 at cdc dot gov

From: "Keintz, H. Mark" on 13 Jan 2010 17:28

Eric:

If you know the date of interest (i.e. you know the
latest date), then you are wasting resources using
mdy and input functions and then sorting.

Instead you could do this (say you want 65,000
cases for 09jan2010).

%let yyyy=2010;
%let m=1;
%let d=9;

%let datevalue=%sysfunc(mdy(&m,&d,&yyyy));

data want;
set have (obs=65000);
where year="&yyyy" and month="&m" and day="&d";
retain filedate &datevalue;
format filedate date9.;
run;

Note that the WHERE filter operates before the
OBS= parameter, so only qualifying records are counted.

And no sorting is required, or iterative application of MDY.

Regards,
Mark

> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of
> eric
> Sent: Wednesday, January 13, 2010 2:57 PM
> To: SAS-L(a)LISTSERV.UGA.EDU
> Subject: how to pick the last 65000 rows
>
> I have a database with several million rows and need a daily "snapshot"
> file
> with just 65000 rows, according to latest month, day and year
>
> The database has on occasion rows inserted out of order so can not
> depend on
> it containing the items in date order
>
> I did not create it and the month, day, year variables are defined as
> character so I have to use input (see below)
>
> this code does it, but can it be made more efficient?
>
> libname mylib <database path> ;
>
> proc sql;
> create table work.work as select *,
> mdy( input (month, 2.), input (day, 2.), input(year, 4.)) as
> filedate
> from mylib.db1
> order by filedate DESC ;
> quit;
>
> data last;set work; if _n_ = 65000 then stop;run;

From: Sigurd Hermansen on 13 Jan 2010 18:38

I see this as another instance of the "top n rows" problem for which SAS SQL does not offer an explicit option. A dataset first has to be arranged in filedate order. Next, rows have to be updated with sequential row numbers. Finally, the top 65,000 rows and only those rows have to be selected. Your program does each of these tasks in an efficient manner. It simply takes a substantial amount of time to order millions of tuples while defining an ordering variable at the same time.

I'd look for a short-cut that the SAS SQL compiler could not figure out without knowledge of the context of your programming problem. What doesn't the compiler know?

What was the 65,000th filedate in sequence of the last snapshot? Say after you execute your original program, you add a SQL UPDATE query that updates a permanent dataset in a SAS library with that date value. Once that date parameter dataset exists, you know (but the compiler doesn't) that the next iteration of the snapshot program can limit the date range of relevant tuples to filedate's greater or equal to that date value:

..... WHERE filedate >= (select filedate from parmDS.minDate) ....

The WHERE clause will then limit the number of tuples that have to be ordered by filedate to a number much closer to 65,000 than to the millions in the full dataset. Adding the update query that advances the stored filedate will set up the program for the next round.

Query optimization by compilers suffers from the narrow scope of the knowledge base that developers can build into compiler. The short-cut solution that I have proposed will likely prove easier to implement or more efficient than more generic alternatives (say, computing frequencies by year or year,month, finding the sum of class sums that exceeds 65,000, and using the year or year,month of that class in a WHERE clause that subsets the full dataset).
S

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of eric
Sent: Wednesday, January 13, 2010 2:57 PM
To: SAS-L(a)LISTSERV.UGA.EDU
Subject: how to pick the last 65000 rows

I have a database with several million rows and need a daily "snapshot" file
with just 65000 rows, according to latest month, day and year

The database has on occasion rows inserted out of order so can not depend on
it containing the items in date order

I did not create it and the month, day, year variables are defined as
character so I have to use input (see below)

this code does it, but can it be made more efficient?

libname mylib <database path> ;

proc sql;
create table work.work as select *,
mdy( input (month, 2.), input (day, 2.), input(year, 4.)) as filedate
from mylib.db1
order by filedate DESC ;
quit;

data last;set work; if _n_ = 65000 then stop;run;

|
Pages: 1
Prev: Help in running batch SAS code in UNIX
Next: Macro question