From: bcubeb3 on 19 Jun 2010 19:17 Both data sets CO and EA are about a million observations and this line of code below is taking forever to finish. I wonder how to do what I want to do below any faster: PROC SQL; CREATE TABLE home.matches AS SELECT * FROM home.CO as b where EXISTS (SELECT * FROM home.EA as a WHERE b.var2b = a.var4a and b.var4b = a.var6a ) ; QUIT; PROC SQL; CREATE TABLE home.nomatches AS SELECT * FROM home.CO as b where NOT EXISTS (SELECT * FROM home.EA as a WHERE b.var2b = a.var4a and b.var4b = a.var6a ) ; QUIT;
From: Arthur Tabachneck on 21 Jun 2010 22:32 Barry, You will likely get more responses if you provide some sample data, for both files (in the form of a data step), along with examples of the files you want to end up with by using the above code. Art ------------------ On Jun 19, 7:17 pm, bcubeb3 <barry.brian.barr...(a)gmail.com> wrote: > Both data sets CO and EA are about a million observations and this > line of code below is taking forever to finish. I wonder how to do > what I want to do below any faster: > > PROC SQL; > CREATE TABLE home.matches AS > SELECT * > FROM home.CO as b > where EXISTS > (SELECT * > FROM home.EA as a > WHERE b.var2b = a.var4a and b.var4b = a.var6a > ) > ; > QUIT; > > PROC SQL; > CREATE TABLE home.nomatches AS > SELECT * > FROM home.CO as b > where NOT EXISTS > (SELECT * > FROM home.EA as a > WHERE b.var2b = a.var4a and b.var4b = a.var6a > ) > ; > QUIT;
From: Jerome on 22 Jun 2010 12:02 Hi, for the first one, did you tried to join directly the 2 tables. PROC SQL; CREATE TABLE home.matches AS SELECT b.* FROM home.CO as b ,home.EA as a WHERE b.var2b = a.var4a and b.var4b = a.var6a ; QUIT; You could also try to index your datasets (if it's not yet). For the second , try "minus" to select the "keys" to exclude. Jérôme. On Jun 20, 1:17 am, bcubeb3 <barry.brian.barr...(a)gmail.com> wrote: > Both data sets CO and EA are about a million observations and this > line of code below is taking forever to finish. I wonder how to do > what I want to do below any faster: > > PROC SQL; > CREATE TABLE home.matches AS > SELECT * > FROM home.CO as b > where EXISTS > (SELECT * > FROM home.EA as a > WHERE b.var2b = a.var4a and b.var4b = a.var6a > ) > ; > QUIT; > > PROC SQL; > CREATE TABLE home.nomatches AS > SELECT * > FROM home.CO as b > where NOT EXISTS > (SELECT * > FROM home.EA as a > WHERE b.var2b = a.var4a and b.var4b = a.var6a > ) > ; > QUIT;
From: James B on 23 Jun 2010 20:41 Hi, I'm fairly green (new) to sql, so can't explain the mechanics of why this does/does not work, but I just tried your query with and without the distinct keyword in the sql view on my shared system (so can't guarantee another job didn't slow either down) and got the following results: proc sql; create table work.test1 as select * from lib.table1 as a where not exists (select distinct dis_cm_site_cd from lib.table2 as b where a.DIS_SITE_CD = b.DIS_CM_SITE_CD) ;quit; NOTE: PROCEDURE SQL used (Total process time): real time 7.63 seconds user cpu time 1.15 seconds system cpu time 0.07 seconds Memory 783k proc sql; create table work.test2 as select * from lib.table1 as a where not exists (select * from lib.table2 as b where a.DIS_SITE_CD = b.DIS_CM_SITE_CD) ;quit; NOTE: PROCEDURE SQL used (Total process time): real time 8:44:48.59 user cpu time 16.73 seconds system cpu time 16.68 seconds Memory 675k I did the 'distinct' query first on the expectation that it would be quicker to try to rule caching out as the reason for any reduction in processing time. Also, the queries retrieved the rows in a slightly different order (could probably be fixed using an 'order by' clause) - after a quick sort, proc compare says the test1 and test2 are identical. For info, table1 has about 25 thousand rows x 98 columns and table2 has about 25 million rows x 122 columns. Test1 & Test2 came out to 23 thousand odd rows. I/O is by far the largest constraint on the system I use. James On Jun 20, 9:17 am, bcubeb3 <barry.brian.barr...(a)gmail.com> wrote: > Both data sets CO and EA are about a million observations and this > line of code below is taking forever to finish. I wonder how to do > what I want to do below any faster: > > PROC SQL; > CREATE TABLE home.matches AS > SELECT * > FROM home.CO as b > where EXISTS > (SELECT * > FROM home.EA as a > WHERE b.var2b = a.var4a and b.var4b = a.var6a > ) > ; > QUIT; > > PROC SQL; > CREATE TABLE home.nomatches AS > SELECT * > FROM home.CO as b > where NOT EXISTS > (SELECT * > FROM home.EA as a > WHERE b.var2b = a.var4a and b.var4b = a.var6a > ) > ; > QUIT;
From: tanwan on 26 Jun 2010 09:33 But you scan both tables TWICE, once for each query. In each case looking to see if the same condition WHERE b.var2b = a.var4a and b.var4b = a.var6a is satsified or not. If you could use a datastep with something like Data AAA BBB; Merge Y Z; .. . . If CONDITION SATSIFIED output AAA; Else output BBB; Run; This will scan both tables once. I think it will be faster that two SQL statements. Run;
|
Pages: 1 Prev: about sheet= statement in proc export Next: Duplicate Observations |