From: olivesecret on 22 Oct 2009 14:15 I have a large data set consisting of subject id, response y and other interesting variables. A subset of data is like this: ID Y ... 1 1 1 1 1 1 1 0.8 1 0.6 1 0.6 1 0.4 1 0.2 2 1 2 1 2 0.4 2 0 3 1 3 1 3 0.8 3 0.8 4 1 .... What I need do is for each ID, find the two observations, with one having y immediately larger than 0.5 and the other having y immediately smaller 0.5. For the example above, then the observations needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are no observations where y is less than 0.5, then I need the the two obs which having y immediately larger than 0.5, which are ID=3 y=1 and ID=3 y=0.8. Any hints? Thanks a lot!
From: Joe Matise on 22 Oct 2009 14:46 The simple part of your request, and two solutions: data have; input ID Y; datalines; 1 1 1 1 1 1 1 0.8 1 0.6 1 0.6 1 0.4 1 0.2 2 1 2 1 2 0.4 2 0 3 1 3 1 3 0.8 3 0.8 4 1 ;;;; run; data temp; set have; if y > 0.5 then yval='G'; else yval='L'; *how do you handle 0.5 exactly; run; proc sort data=temp nodupkey; by id yval descending y; run; data want; set temp; by id yval descending y; if (y>0.5 and last.yval) or (y<0.5 and first.yval); run; proc sql; create table want as select * from ( select distinct id, max(y) from have where y le 0.5 group by id union all select distinct id, min(Y) from have where y ge 0.5 group by id ) ; quit; Both solutions can be adapted to solve the part of your request that involves adding a second row for other values where the bottom value is not present (or the top value), but I don't have time right now to finish the problem - the SQL solution would involve another pair of UNION ALLs where you exclude the already found rows, and the DATA step solution could be arrived at in a handful of ways, including a double DoW loop or just another datastep similar to the above, again dropping out the relevant rows and re-pulling, then merging so to retrieve the right number of rows. -Joe On Thu, Oct 22, 2009 at 1:15 PM, olivesecret <olivesecret(a)gmail.com> wrote: > I have a large data set consisting of subject id, response y and other > interesting variables. A subset of data is like this: > > ID Y ... > 1 1 > 1 1 > 1 1 > 1 0.8 > 1 0.6 > 1 0.6 > 1 0.4 > 1 0.2 > 2 1 > 2 1 > 2 0.4 > 2 0 > 3 1 > 3 1 > 3 0.8 > 3 0.8 > 4 1 > ... > > What I need do is for each ID, find the two observations, with one > having y immediately larger than 0.5 and the other having y > immediately smaller 0.5. For the example above, then the observations > needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations > needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are > no observations where y is less than 0.5, then I need the the two obs > which having y immediately larger than 0.5, which are ID=3 y=1 and > ID=3 y=0.8. > Any hints? > Thanks a lot! >
From: Ya Huang on 22 Oct 2009 14:47 data xx; input ID Y; cards; 1 1 1 1 1 1 1 0.8 1 0.6 1 0.6 1 0.4 1 0.2 2 1 2 1 2 0.4 2 0 3 1 3 1 3 0.8 3 0.8 ; data xx; set xx; n_=_n_; s=sign(y-0.5); dist=abs(y-0.5); run; proc sql; create table yy as select *, count(distinct s) as us from xx group by id order by id,s,dist ; data zz; set yy; by id s dist; if us=1 then do; if first.s then keep=0; keep+1; if keep <=2 then output; end; else if us=2 then do; if first.s then output; end; run; proc sort; by n_; run; proc print; run; Bascially, you group the records based on which side they are from 0.5, then for each group, you keep the least distance record. For the case of only one side, we need to output the least two, this is done by a flag us. On Thu, 22 Oct 2009 11:15:52 -0700, olivesecret <olivesecret(a)GMAIL.COM> wrote: >I have a large data set consisting of subject id, response y and other >interesting variables. A subset of data is like this: > >ID Y ... >1 1 >1 1 >1 1 >1 0.8 >1 0.6 >1 0.6 >1 0.4 >1 0.2 >2 1 >2 1 >2 0.4 >2 0 >3 1 >3 1 >3 0.8 >3 0.8 >4 1 >... > >What I need do is for each ID, find the two observations, with one >having y immediately larger than 0.5 and the other having y >immediately smaller 0.5. For the example above, then the observations >needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations >needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are >no observations where y is less than 0.5, then I need the the two obs >which having y immediately larger than 0.5, which are ID=3 y=1 and >ID=3 y=0.8. >Any hints? >Thanks a lot!
From: Ya Huang on 22 Oct 2009 14:52 For ID=3, why do you have 0.8 and 1? Based on your rule, you should have 0.8 and 0.8, since they are closer. On Thu, 22 Oct 2009 11:15:52 -0700, olivesecret <olivesecret(a)GMAIL.COM> wrote: >I have a large data set consisting of subject id, response y and other >interesting variables. A subset of data is like this: > >ID Y ... >1 1 >1 1 >1 1 >1 0.8 >1 0.6 >1 0.6 >1 0.4 >1 0.2 >2 1 >2 1 >2 0.4 >2 0 >3 1 >3 1 >3 0.8 >3 0.8 >4 1 >... > >What I need do is for each ID, find the two observations, with one >having y immediately larger than 0.5 and the other having y >immediately smaller 0.5. For the example above, then the observations >needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations >needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are >no observations where y is less than 0.5, then I need the the two obs >which having y immediately larger than 0.5, which are ID=3 y=1 and >ID=3 y=0.8. >Any hints? >Thanks a lot!
From: "Data _null_;" on 22 Oct 2009 15:33 This appears to work using the data you posted. I don't know how it will work with your real data. It might be useful as a check of the other methods or to show that it does not work. data test; input id:1. y @@; array yy[10]; do _n_ = 1 to dim(yy); yy[_n_] = rannor(345785); end; cards; 1 1 1 1 1 1 1 0.8 1 0.6 1 0.6 1 0.4 1 0.2 2 1 2 1 2 0.4 2 0 3 1 3 1 3 0.8 3 0.8 4 1 ;;;; run; data testV / view=testV; set test(keep=id y); x = y-.5; s = sign(x); run; proc summary nway data=testV; class id; output out=obs(drop=_type_ _freq_) idgroup(max(s) min(x) obs out[1](y)=idY1) idgroup(min(s) max(x) obs out[1](y)=idY2) / autoname ; run; proc print; run; data subset; set obs; do point = _obs_,_obs2_; set test point=point; output; end; run; proc print; run; On 10/22/09, olivesecret <olivesecret(a)gmail.com> wrote: > I have a large data set consisting of subject id, response y and other > interesting variables. A subset of data is like this: > > ID Y ... > 1 1 > 1 1 > 1 1 > 1 0.8 > 1 0.6 > 1 0.6 > 1 0.4 > 1 0.2 > 2 1 > 2 1 > 2 0.4 > 2 0 > 3 1 > 3 1 > 3 0.8 > 3 0.8 > 4 1 > ... > > What I need do is for each ID, find the two observations, with one > having y immediately larger than 0.5 and the other having y > immediately smaller 0.5. For the example above, then the observations > needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations > needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are > no observations where y is less than 0.5, then I need the the two obs > which having y immediately larger than 0.5, which are ID=3 y=1 and > ID=3 y=0.8. > Any hints? > Thanks a lot! >
|
Next
|
Last
Pages: 1 2 Prev: Quality of logistic regression model Next: Proc gplot annotate question |