data manipulation problem [SAS]

Prev: Quality of logistic regression model
Next: Proc gplot annotate question

From: olivesecret on 22 Oct 2009 14:15

I have a large data set consisting of subject id, response y and other
interesting variables. A subset of data is like this:

ID Y ...
1 1
1 1
1 1
1 0.8
1 0.6
1 0.6
1 0.4
1 0.2
2 1
2 1
2 0.4
2 0
3 1
3 1
3 0.8
3 0.8
4 1
....

What I need do is for each ID, find the two observations, with one
having y immediately larger than 0.5 and the other having y
immediately smaller 0.5. For the example above, then the observations
needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations
needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are
no observations where y is less than 0.5, then I need the the two obs
which having y immediately larger than 0.5, which are ID=3 y=1 and
ID=3 y=0.8.
Any hints?
Thanks a lot!

From: Joe Matise on 22 Oct 2009 14:46

The simple part of your request, and two solutions:

data have;
input ID Y;
datalines;
1 1
1 1
1 1
1 0.8
1 0.6
1 0.6
1 0.4
1 0.2
2 1
2 1
2 0.4
2 0
3 1
3 1
3 0.8
3 0.8
4 1
;;;;
run;

data temp;
set have;
if y > 0.5 then yval='G';
else yval='L'; *how do you handle 0.5 exactly;
run;

proc sort data=temp nodupkey;
by id yval descending y;
run;
data want;
set temp;
by id yval descending y;
if (y>0.5 and last.yval) or (y<0.5 and first.yval);
run;
proc sql;
create table want as select * from (
select distinct id, max(y) from have where y le 0.5 group by id
union all
select distinct id, min(Y) from have where y ge 0.5 group by id
)
;
quit;

Both solutions can be adapted to solve the part of your request that
involves adding a second row for other values where the bottom value is not
present (or the top value), but I don't have time right now to finish the
problem - the SQL solution would involve another pair of UNION ALLs where
you exclude the already found rows, and the DATA step solution could be
arrived at in a handful of ways, including a double DoW loop or just another
datastep similar to the above, again dropping out the relevant rows and
re-pulling, then merging so to retrieve the right number of rows.

-Joe

On Thu, Oct 22, 2009 at 1:15 PM, olivesecret <olivesecret(a)gmail.com> wrote:

> I have a large data set consisting of subject id, response y and other
> interesting variables. A subset of data is like this:
>
> ID Y ...
> 1 1
> 1 1
> 1 1
> 1 0.8
> 1 0.6
> 1 0.6
> 1 0.4
> 1 0.2
> 2 1
> 2 1
> 2 0.4
> 2 0
> 3 1
> 3 1
> 3 0.8
> 3 0.8
> 4 1
> ...
>
> What I need do is for each ID, find the two observations, with one
> having y immediately larger than 0.5 and the other having y
> immediately smaller 0.5. For the example above, then the observations
> needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations
> needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are
> no observations where y is less than 0.5, then I need the the two obs
> which having y immediately larger than 0.5, which are ID=3 y=1 and
> ID=3 y=0.8.
> Any hints?
> Thanks a lot!
>

From: Ya Huang on 22 Oct 2009 14:47

data xx;
input ID Y;
cards;
1 1
1 1
1 1
1 0.8
1 0.6
1 0.6
1 0.4
1 0.2
2 1
2 1
2 0.4
2 0
3 1
3 1
3 0.8
3 0.8
;

data xx;
set xx;
n_=_n_;
s=sign(y-0.5);
dist=abs(y-0.5);
run;

proc sql;
create table yy as
select *, count(distinct s) as us
from xx
group by id
order by id,s,dist
;

data zz;
set yy;
by id s dist;
if us=1 then do;
if first.s then keep=0;
keep+1;
if keep <=2 then output;
end;
else if us=2 then do;
if first.s then output;
end;
run;

proc sort;
by n_;
run;

proc print;
run;

Bascially, you group the records based on which side they are from 0.5,
then for each group, you keep the least distance record. For the case
of only one side, we need to output the least two, this is done
by a flag us.

On Thu, 22 Oct 2009 11:15:52 -0700, olivesecret <olivesecret(a)GMAIL.COM>
wrote:

>I have a large data set consisting of subject id, response y and other
>interesting variables. A subset of data is like this:
>
>ID Y ...
>1 1
>1 1
>1 1
>1 0.8
>1 0.6
>1 0.6
>1 0.4
>1 0.2
>2 1
>2 1
>2 0.4
>2 0
>3 1
>3 1
>3 0.8
>3 0.8
>4 1
>...
>
>What I need do is for each ID, find the two observations, with one
>having y immediately larger than 0.5 and the other having y
>immediately smaller 0.5. For the example above, then the observations
>needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations
>needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are
>no observations where y is less than 0.5, then I need the the two obs
>which having y immediately larger than 0.5, which are ID=3 y=1 and
>ID=3 y=0.8.
>Any hints?
>Thanks a lot!

From: Ya Huang on 22 Oct 2009 14:52

For ID=3, why do you have 0.8 and 1? Based on your rule, you
should have 0.8 and 0.8, since they are closer.

On Thu, 22 Oct 2009 11:15:52 -0700, olivesecret <olivesecret(a)GMAIL.COM>
wrote:

>I have a large data set consisting of subject id, response y and other
>interesting variables. A subset of data is like this:
>
>ID Y ...
>1 1
>1 1
>1 1
>1 0.8
>1 0.6
>1 0.6
>1 0.4
>1 0.2
>2 1
>2 1
>2 0.4
>2 0
>3 1
>3 1
>3 0.8
>3 0.8
>4 1
>...
>
>What I need do is for each ID, find the two observations, with one
>having y immediately larger than 0.5 and the other having y
>immediately smaller 0.5. For the example above, then the observations
>needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations
>needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are
>no observations where y is less than 0.5, then I need the the two obs
>which having y immediately larger than 0.5, which are ID=3 y=1 and
>ID=3 y=0.8.
>Any hints?
>Thanks a lot!

From: "Data _null_;" on 22 Oct 2009 15:33

This appears to work using the data you posted. I don't know how it
will work with your real data. It might be useful as a check of the
other methods or to show that it does not work.

data test;
input id:1. y @@;
array yy[10];
do _n_ = 1 to dim(yy);
yy[_n_] = rannor(345785);
end;
cards;
1 1 1 1 1 1 1 0.8 1 0.6 1 0.6 1 0.4 1 0.2
2 1 2 1 2 0.4 2 0 3 1 3 1
3 0.8 3 0.8
4 1
;;;;
run;
data testV / view=testV;
set test(keep=id y);
x = y-.5;
s = sign(x);
run;
proc summary nway data=testV;
class id;
output out=obs(drop=_type_ _freq_)
idgroup(max(s) min(x) obs out[1](y)=idY1)
idgroup(min(s) max(x) obs out[1](y)=idY2)
/ autoname
;
run;
proc print;
run;
data subset;
set obs;
do point = _obs_,_obs2_;
set test point=point;
output;
end;
run;
proc print;
run;
On 10/22/09, olivesecret <olivesecret(a)gmail.com> wrote:
> I have a large data set consisting of subject id, response y and other
> interesting variables. A subset of data is like this:
>
> ID Y ...
> 1 1
> 1 1
> 1 1
> 1 0.8
> 1 0.6
> 1 0.6
> 1 0.4
> 1 0.2
> 2 1
> 2 1
> 2 0.4
> 2 0
> 3 1
> 3 1
> 3 0.8
> 3 0.8
> 4 1
> ...
>
> What I need do is for each ID, find the two observations, with one
> having y immediately larger than 0.5 and the other having y
> immediately smaller 0.5. For the example above, then the observations
> needed for ID=1 are ID=1 y=0.6 and ID=1 y=0.4, and the observations
> needed for ID=2 are ID=2 y=1 and ID=2 y=0.4. For ID=3, since there are
> no observations where y is less than 0.5, then I need the the two obs
> which having y immediately larger than 0.5, which are ID=3 y=1 and
> ID=3 y=0.8.
> Any hints?
> Thanks a lot!
>

| Next | Last
Pages: 1 2
Prev: Quality of logistic regression model
Next: Proc gplot annotate question