Prev: Orthogonal Procrustes Rotation in SAS 9.1
Next: Reminder: Toronto Area SAS Society (TASS) June 2010 Meeting
From: theorbo on 1 Jun 2010 16:44 Hi. I'm seeking advice on an alternative way to solve my problem. Let me describe the problem: First my two datasets: ANCHOR GROUPS (AG) - 100 obs & 500 variables NEW PERSONS (NP) - 5,000 obs & 500 variables (these are the same variables as in dataset AG) I have my anchor groups and need to figure out which group each of the 5000 people from NP are most similar to. Unfortunately data reduction techniques with the variables is not an option currently. To do this I've been doing the following ... Iteratively going through NP and merging 1 observation's 500 variables' values to the AG dataset. Then I compute the difference between VAR1_AG and VAR1_NP ... VAR500_AG and VAR500_NP. I then sum the square of the 500 differences to get a "distance" measure. I then know that the NP observation is most similar to the AG observation where it has the minimum "distance" measure. Iteratively going through these 5000 observations (merge, compute distance, etc.) takes an extremely long time, especially if the number of variables or observations increase. Is there a different approach that you might suggest? The math is relatively simple but like with most things in SAS there are always a million and one ways to do something. And to reiterate, we considered some data reduction techniques but they are not going to be feasible at the current time. Thank you. Please post if you have any questions or need any clarification to help me.
From: Ya on 1 Jun 2010 18:22 On Jun 1, 1:44 pm, theorbo <theo...(a)gmail.com> wrote: > Hi. I'm seeking advice on an alternative way to solve my problem. > Let me describe the problem: > > First my two datasets: > > ANCHOR GROUPS (AG) - 100 obs & 500 variables > NEW PERSONS (NP) - 5,000 obs & 500 variables (these are the same > variables as in dataset AG) > > I have my anchor groups and need to figure out which group each of the > 5000 people from NP are most similar to. Unfortunately data reduction > techniques with the variables is not an option currently. > > To do this I've been doing the following ... > > Iteratively going through NP and merging 1 observation's 500 > variables' values to the AG dataset. Then I compute the difference > between VAR1_AG and VAR1_NP ... VAR500_AG and VAR500_NP. I then sum > the square of the 500 differences to get a "distance" measure. I then > know that the NP observation is most similar to the AG observation > where it has the minimum "distance" measure. > > Iteratively going through these 5000 observations (merge, compute > distance, etc.) takes an extremely long time, especially if the number > of variables or observations increase. > > Is there a different approach that you might suggest? The math is > relatively simple but like with most things in SAS there are always a > million and one ways to do something. And to reiterate, we considered > some data reduction techniques but they are not going to be feasible > at the current time. > > Thank you. Please post if you have any questions or need any > clarification to help me. Instead of comparing 500 vars, you can transpose the data first, then compare one variable, it also makes the calculation of distance much easier (use by processing): data anchor; input id1 x1 x2 x3; cards; 1 23 34 65 2 45 67 89 3 32 43 76 ; data person; input id2 x1 x2 x3; cards; 1 24 32 64 2 34 42 74 3 145 167 189 4 30 40 70 ; proc transpose data=anchor out=tanchor; by id1; var x1-x3; run; proc transpose data=person out=tperson; by id2; var x1-x3; run; proc sql; create table distance as select distinct a.id1,b.id2,sum((a.col1-b.col1)**2) as ss from tanchor a, tperson b where a._name_=b._name_ group by id2,id1 order by id2,ss ; data distance; set distance; by id2 ss; if first.id2; run; proc print; var id2 id1 ss; run; id2 id1 ss 1 1 6 2 3 9 3 2 30000 4 3 49 HTH Ya
From: theorbo on 1 Jun 2010 21:26
On Jun 1, 6:22 pm, Ya <huang8...(a)gmail.com> wrote: > On Jun 1, 1:44 pm, theorbo <theo...(a)gmail.com> wrote: > > > > > > > Hi. I'm seeking advice on an alternative way to solve my problem. > > Let me describe the problem: > > > First my two datasets: > > > ANCHOR GROUPS (AG) - 100 obs & 500 variables > > NEW PERSONS (NP) - 5,000 obs & 500 variables (these are the same > > variables as in dataset AG) > > > I have my anchor groups and need to figure out which group each of the > > 5000 people from NP are most similar to. Unfortunately data reduction > > techniques with the variables is not an option currently. > > > To do this I've been doing the following ... > > > Iteratively going through NP and merging 1 observation's 500 > > variables' values to the AG dataset. Then I compute the difference > > between VAR1_AG and VAR1_NP ... VAR500_AG and VAR500_NP. I then sum > > the square of the 500 differences to get a "distance" measure. I then > > know that the NP observation is most similar to the AG observation > > where it has the minimum "distance" measure. > > > Iteratively going through these 5000 observations (merge, compute > > distance, etc.) takes an extremely long time, especially if the number > > of variables or observations increase. > > > Is there a different approach that you might suggest? The math is > > relatively simple but like with most things in SAS there are always a > > million and one ways to do something. And to reiterate, we considered > > some data reduction techniques but they are not going to be feasible > > at the current time. > > > Thank you. Please post if you have any questions or need any > > clarification to help me. > > Instead of comparing 500 vars, you can transpose the data first, then > compare one variable, > it also makes the calculation of distance much easier (use by > processing): > > data anchor; > input id1 x1 x2 x3; > cards; > 1 23 34 65 > 2 45 67 89 > 3 32 43 76 > ; > > data person; > input id2 x1 x2 x3; > cards; > 1 24 32 64 > 2 34 42 74 > 3 145 167 189 > 4 30 40 70 > ; > > proc transpose data=anchor out=tanchor; > by id1; > var x1-x3; > run; > > proc transpose data=person out=tperson; > by id2; > var x1-x3; > run; > > proc sql; > create table distance as > select distinct a.id1,b.id2,sum((a.col1-b.col1)**2) as ss > from tanchor a, tperson b > where a._name_=b._name_ > group by id2,id1 > order by id2,ss > ; > > data distance; > set distance; > by id2 ss; > if first.id2; > run; > > proc print; > var id2 id1 ss; > run; > > id2 id1 ss > > 1 1 6 > 2 3 9 > 3 2 30000 > 4 3 49 > > HTH > > Ya- Hide quoted text - > > - Show quoted text - Thanks, Ya. I had thought about that but didn't know if the PROC SQL portion would choke on the large number of variables, etc. Thanks again for your suggestion. |