Prev: Exact binomial test
Next: ORACLE XMLTYPE
From: bcubeb3 on 22 Jun 2010 08:46 Suppose I have two data sets: CO and EA and I want to find the subset of CO that are matches to EA and likewise the subset of CO that aren't matches to EA The following code works. However when CO and EA are over a million observations the proc sql procedure is very very very slow. I want to know if there is an optimized way to do what I am doing below: /*************CO *********************/ data home.CO; input fname $ lname $; datalines; John Smith Mary Jane Will Smith Eric Manelow Christina Aguilera ; data home.CO; set home.CO; FL = CATS(fname,lname); run; data home.EA; input fname $ lname $; datalines; Thomas Jefferson Lady Gaga Michael Jackson Kobe Bryant Will Smith Eric Manelow Christina Aguilera ; data home.EA; set home.EA; FL = CATS(fname,lname); run; PROC SQL; CREATE TABLE home.matches AS SELECT * FROM home.CO as b where EXISTS (SELECT * FROM home.EA as a WHERE compare(b.FL,a.FL,'I') EQ 0 ) ; QUIT; PROC SQL; CREATE TABLE home.nomatches AS SELECT * FROM home.CO as b where NOT EXISTS (SELECT * FROM home.EA as a WHERE compare(b.FL,a.FL,'I') EQ 0 ) ; QUIT;
From: Patrick on 22 Jun 2010 10:00 Hi Thanks that you provided your code and sample data. That makes it so much easier to give an answer. If you're processing this in SAS then I'd suggest to use a SAS hash lookup instead of any SQL code. Do you know this paper? http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf Using a hash lookup your code could then look like this: data work.CO; input fname $ lname $; datalines; John Smith Mary Jane Will Smith Eric Manelow Christina Aguilera ; data work.EA; input fname $ lname $; datalines; Thomas Jefferson Lady Gaga Michael Jackson Kobe Bryant Will Smith Eric Manelow Christina Aguilera ; data CoMatches CoNonMatches; length fname lname $ 8; drop rc; if _N_ = 1 then do; declare hash h(dataset:'work.EA'); h.defineKey('fname','lname'); h.defineDone(); call missing(fname,lname); end; set work.co; rc=h.check(); if rc=0 then output CoMatches; else output CoNonMatches; run; Only one pass through the data is needed. I'm sure this will perform much better than the code you've posted. As your lookup table has 1M rows: Make sure to read the section about memory usage in the paper I've sent the link for. If you have to do it in SQL (i.e. because this should execute in a Data Base) then I believe to remember that a subselect performs better than EXIST. HTH Patrick
|
Pages: 1 Prev: Exact binomial test Next: ORACLE XMLTYPE |