A faster way to do what this line of code does [SAS]

Prev: about sheet= statement in proc export
Next: Duplicate Observations

From: bcubeb3 on 19 Jun 2010 19:17

Both data sets CO and EA are about a million observations and this
line of code below is taking forever to finish. I wonder how to do
what I want to do below any faster:

PROC SQL;
CREATE TABLE home.matches AS
SELECT *
FROM home.CO as b
where EXISTS
(SELECT *
FROM home.EA as a
WHERE b.var2b = a.var4a and b.var4b = a.var6a
)
;
QUIT;

PROC SQL;
CREATE TABLE home.nomatches AS
SELECT *
FROM home.CO as b
where NOT EXISTS
(SELECT *
FROM home.EA as a
WHERE b.var2b = a.var4a and b.var4b = a.var6a
)
;
QUIT;

From: Arthur Tabachneck on 21 Jun 2010 22:32

Barry,

You will likely get more responses if you provide some sample data,
for both files (in the form of a data step), along with examples of
the files you want to end up with by using the above code.

Art
------------------
On Jun 19, 7:17 pm, bcubeb3 <barry.brian.barr...(a)gmail.com> wrote:
> Both data sets CO and EA are about a million observations and this
> line of code below is taking forever to finish. I wonder how to do
> what I want to do below any faster:
>
> PROC SQL;
> CREATE TABLE home.matches AS
> SELECT *
> FROM home.CO as b
> where EXISTS
> (SELECT *
> FROM home.EA as a
> WHERE b.var2b = a.var4a and b.var4b = a.var6a
> )
> ;
> QUIT;
>
> PROC SQL;
> CREATE TABLE home.nomatches AS
> SELECT *
> FROM home.CO as b
> where NOT EXISTS
> (SELECT *
> FROM home.EA as a
> WHERE b.var2b = a.var4a and b.var4b = a.var6a
> )
> ;
> QUIT;

From: Jerome on 22 Jun 2010 12:02

Hi,
for the first one,
did you tried to join directly the 2 tables.

PROC SQL;
CREATE TABLE home.matches AS
SELECT b.*
FROM home.CO as b ,home.EA as a
WHERE b.var2b = a.var4a and b.var4b = a.var6a
;
QUIT;

You could also try to index your datasets (if it's not yet).

For the second ,
try "minus" to select the "keys" to exclude.

Jérôme.
On Jun 20, 1:17 am, bcubeb3 <barry.brian.barr...(a)gmail.com> wrote:
> Both data sets CO and EA are about a million observations and this
> line of code below is taking forever to finish. I wonder how to do
> what I want to do below any faster:
>
> PROC SQL;
> CREATE TABLE home.matches AS
> SELECT *
> FROM home.CO as b
> where EXISTS
> (SELECT *
> FROM home.EA as a
> WHERE b.var2b = a.var4a and b.var4b = a.var6a
> )
> ;
> QUIT;
>
> PROC SQL;
> CREATE TABLE home.nomatches AS
> SELECT *
> FROM home.CO as b
> where NOT EXISTS
> (SELECT *
> FROM home.EA as a
> WHERE b.var2b = a.var4a and b.var4b = a.var6a
> )
> ;
> QUIT;

From: James B on 23 Jun 2010 20:41

Hi,

I'm fairly green (new) to sql, so can't explain the mechanics of why
this does/does not work, but I just tried your query with and without
the distinct keyword in the sql view on my shared system (so can't
guarantee another job didn't slow either down) and got the following
results:

proc sql;
create table work.test1 as
select *
from lib.table1 as a
where not exists (select distinct dis_cm_site_cd
from lib.table2 as b where a.DIS_SITE_CD = b.DIS_CM_SITE_CD)
;quit;

NOTE: PROCEDURE SQL used (Total process time):
real time 7.63 seconds
user cpu time 1.15 seconds
system cpu time 0.07 seconds
Memory 783k

proc sql;
create table work.test2 as
select *
from lib.table1 as a
where not exists (select *
from lib.table2 as b where a.DIS_SITE_CD = b.DIS_CM_SITE_CD)
;quit;

NOTE: PROCEDURE SQL used (Total process time):
real time 8:44:48.59
user cpu time 16.73 seconds
system cpu time 16.68 seconds
Memory 675k

I did the 'distinct' query first on the expectation that it would be
quicker to try to rule caching out as the reason for any reduction in
processing time. Also, the queries retrieved the rows in a slightly
different order (could probably be fixed using an 'order by' clause) -
after a quick sort, proc compare says the test1 and test2 are
identical. For info, table1 has about 25 thousand rows x 98 columns
and table2 has about 25 million rows x 122 columns. Test1 & Test2 came
out to 23 thousand odd rows. I/O is by far the largest constraint on
the system I use.

James

On Jun 20, 9:17 am, bcubeb3 <barry.brian.barr...(a)gmail.com> wrote:
> Both data sets CO and EA are about a million observations and this
> line of code below is taking forever to finish. I wonder how to do
> what I want to do below any faster:
>
> PROC SQL;
> CREATE TABLE home.matches AS
> SELECT *
> FROM home.CO as b
> where EXISTS
> (SELECT *
> FROM home.EA as a
> WHERE b.var2b = a.var4a and b.var4b = a.var6a
> )
> ;
> QUIT;
>
> PROC SQL;
> CREATE TABLE home.nomatches AS
> SELECT *
> FROM home.CO as b
> where NOT EXISTS
> (SELECT *
> FROM home.EA as a
> WHERE b.var2b = a.var4a and b.var4b = a.var6a
> )
> ;
> QUIT;

From: tanwan on 26 Jun 2010 09:33

But you scan both tables TWICE, once for each query. In each case
looking to see if the same condition

WHERE b.var2b = a.var4a and b.var4b = a.var6a
is satsified or not.

If you could use a datastep with something like

Data AAA BBB;
Merge Y Z;
.. . .
If CONDITION SATSIFIED output AAA;
Else output BBB;
Run;

This will scan both tables once. I think it will be faster that two
SQL statements.
Run;

|
Pages: 1
Prev: about sheet= statement in proc export
Next: Duplicate Observations