Reshaping the data to run the simple correlation analysis [SAS]

Prev: Numeric ID Values with decimals -- probably not the best idea in
Next: nps74.tmp.pdf, nps75.tmp.pdf

From: xlr82sas on 21 Nov 2009 20:15

On Nov 21, 10:07 am, Henry <chchanghe...(a)gmail.com> wrote:
> Hi Tom,
>
> Thanks for your reply. I guess I know why it is confused. I have a
> 900000 unique firms I denoted as i. As each firm has a value with
> another firm, denoted as j(including itself). Then, I will have
> 900000*900000 observations.
>
> F1 here denoted i firm and F2 denoted as the corresponding firm j for
> i. That's why I wrote something like:
>
> F1 F2 V1
>
> 1 1 0.1
> 1 2 0.3
> 1 3 0.5
> 2 1 0.56
> 2 2 0.99
> 2 3 0.78
>
> Now, I tried to get the correlation between these firms except
> themselves which is when F1 = F2.
>
> So, in brief, I want the correlation between these two columns
>
> 0.3 0.56
> 0.5 0.78
>
> that's why I put :
>
> F1 F2 V1 F1 F2 V1
> 1 2 0.3 2 1 0.56
> 1 3 0.5 2 3 0.78
>
> As I am thinking about reshaping the values so I can get a pair of
> values as you mentioned to calculate the correlation.
>
> then as firm1(when F1 = 1) will also need a correlation with firm 3
> (when F1=3) So, I put:
>
> F1 F2 V1 F1 F2 V1
> 1 2 0.3 3 1 0.84
> 1 3 0.5 3 2 0.83
>
> Meaning I need the correlation between
> 0.3 0.84
> 0.5 0.83
>
> This will have to be done for this 900000 unique firms which has its
> another 900000 corresponding firms (including itself).
>
> Then, after obtaining the correlation between firm i and j, I will
> have to merge them back to the original data. So, I think I still
> need the value for F1 and F2 so I can do the merge. Of course, the F1
> and F2 they are characteristic values (as they also contain a, b ,c
> within numbers. i.e. firm i's F1 = 200c00).
>
> I am not sure if you can understand my issues here? Thanks again for
> your reply!
>
> Henry

Hi,

I think you have to work with one firm against the thousands of other
at a time. I might macrotize this code create a massive
long and skinny and process with one datastep. I decided to use a
datastep to get the corr coef, because I had some old code. Also I
think it might be more flexible and quicker than 'proc corr'.

/* reshape for 'proc corr or datastep */
/* this reshaping can easily be done by creating a datastep and then
using proc sql or merge
to add the common firm. I think this is more efficient */
/* you might want to use arrays */
data res;
retain cnt 0 grp frm1x2 frm1x3 frm1x4 frm1x24;
drop cnt frm1x2 frm1x3 frm1x4 ;
input f1 f2 v1;
if f1 ne f2;
cnt+1;
grp="Firm &frm1";
select;
when(_n_=2) frm1x2=v1;
when(_n_=3) frm1x3=v1;
when(_n_=4) do; frm1x4=v1;cnt=0;end;
when(mod(cnt,3)=1) do; frm1x24=frm1x2; output;end;
when(mod(cnt,3)=2) do; frm1x24=frm1x3; output;end;
when(mod(cnt,3)=0) do; frm1x24=frm1x4; output;end;
end;
cards;
1 1 0.1
1 2 0.3
1 3 0.5
1 4 0.9
2 1 0.56
2 2 0.55
2 3 0.78
2 4 0.99
3 1 0.84
3 2 0.83
3 3 0.65
3 4 0.34
;
run;

/* I added the slope and intercept. I aslo added a fourth F2 since two
points give a corr coef of 1 or -1 */
data res2nd;
retain sumx sumy sumxx sumxy sumyy n;
set res(rename=(frm1x24=x v1=y));
by f1 notsorted;
if first.f1 then do;
sumx = 0;
sumy = 0;
sumxx = 0;
sumxy = 0;
sumyy = 0;
n = 0;
end;
sumx = sum(sumx, x);
sumy = sum(sumy, y);
sumxx = sum(sumxx, x ** 2);
sumxy = sum(sumxy, x * y);
sumyy = sum(sumyy, y * y);
n = sum(n, 1);
if last.f1 then do;
nmr = sumxy - sumx * sumy /n;
den = sqrt ( sumxx -(sumx)**2/n ) * sqrt ( sumyy -(sumy)**2/n );
slope = (sumxy - sumx * sumy / n)/(sumxx - sumx ** 2 / n);
intercept = sumy / n - slope * sumx / n;
corr=nmr/den;
put slope = ;
put intercept = ;
put corr=;
end;
run;

From: Henry on 23 Nov 2009 13:08

Hello,

Thanks for sharing your codes with me!! I will work on it to see how
it goes.

Henry

On Nov 21, 6:15 pm, xlr82sas <xlr82...(a)aol.com> wrote:
> On Nov 21, 10:07 am, Henry <chchanghe...(a)gmail.com> wrote:
>
>
>
> > Hi Tom,
>
> > Thanks for your reply. I guess I know why it is confused. I have a
> > 900000 unique firms I denoted as i. As each firm has a value with
> > another firm, denoted as j(including itself). Then, I will have
> > 900000*900000 observations.
>
> > F1 here denoted i firm and F2 denoted as the corresponding firm j for
> > i. That's why I wrote something like:
>
> > F1 F2 V1
>
> > 1 1 0.1
> > 1 2 0.3
> > 1 3 0.5
> > 2 1 0.56
> > 2 2 0.99
> > 2 3 0.78
>
> > Now, I tried to get the correlation between these firms except
> > themselves which is when F1 = F2.
>
> > So, in brief, I want the correlation between these two columns
>
> > 0.3 0.56
> > 0.5 0.78
>
> > that's why I put :
>
> > F1 F2 V1 F1 F2 V1
> > 1 2 0.3 2 1 0.56
> > 1 3 0.5 2 3 0.78
>
> > As I am thinking about reshaping the values so I can get a pair of
> > values as you mentioned to calculate the correlation.
>
> > then as firm1(when F1 = 1) will also need a correlation with firm 3
> > (when F1=3) So, I put:
>
> > F1 F2 V1 F1 F2 V1
> > 1 2 0.3 3 1 0.84
> > 1 3 0.5 3 2 0.83
>
> > Meaning I need the correlation between
> > 0.3 0.84
> > 0.5 0.83
>
> > This will have to be done for this 900000 unique firms which has its
> > another 900000 corresponding firms (including itself).
>
> > Then, after obtaining the correlation between firm i and j, I will
> > have to merge them back to the original data. So, I think I still
> > need the value for F1 and F2 so I can do the merge. Of course, the F1
> > and F2 they are characteristic values (as they also contain a, b ,c
> > within numbers. i.e. firm i's F1 = 200c00).
>
> > I am not sure if you can understand my issues here? Thanks again for
> > your reply!
>
> > Henry
>
> Hi,
>
> I think you have to work with one firm against the thousands of other
> at a time. I might macrotize this code create a massive
> long and skinny and process with one datastep. I decided to use a
> datastep to get the corr coef, because I had some old code. Also I
> think it might be more flexible and quicker than 'proc corr'.
>
> /* reshape for 'proc corr or datastep */
> /* this reshaping can easily be done by creating a datastep and then
> using proc sql or merge
> to add the common firm. I think this is more efficient */
> /* you might want to use arrays */
> data res;
> retain cnt 0 grp frm1x2 frm1x3 frm1x4 frm1x24;
> drop cnt frm1x2 frm1x3 frm1x4 ;
> input f1 f2 v1;
> if f1 ne f2;
> cnt+1;
> grp="Firm &frm1";
> select;
> when(_n_=2) frm1x2=v1;
> when(_n_=3) frm1x3=v1;
> when(_n_=4) do; frm1x4=v1;cnt=0;end;
> when(mod(cnt,3)=1) do; frm1x24=frm1x2; output;end;
> when(mod(cnt,3)=2) do; frm1x24=frm1x3; output;end;
> when(mod(cnt,3)=0) do; frm1x24=frm1x4; output;end;
> end;
> cards;
> 1 1 0.1
> 1 2 0.3
> 1 3 0.5
> 1 4 0.9
> 2 1 0.56
> 2 2 0.55
> 2 3 0.78
> 2 4 0.99
> 3 1 0.84
> 3 2 0.83
> 3 3 0.65
> 3 4 0.34
> ;
> run;
>
> /* I added the slope and intercept. I aslo added a fourth F2 since two
> points give a corr coef of 1 or -1 */
> data res2nd;
> retain sumx sumy sumxx sumxy sumyy n;
> set res(rename=(frm1x24=x v1=y));
> by f1 notsorted;
> if first.f1 then do;
> sumx = 0;
> sumy = 0;
> sumxx = 0;
> sumxy = 0;
> sumyy = 0;
> n = 0;
> end;
> sumx = sum(sumx, x);
> sumy = sum(sumy, y);
> sumxx = sum(sumxx, x ** 2);
> sumxy = sum(sumxy, x * y);
> sumyy = sum(sumyy, y * y);
> n = sum(n, 1);
> if last.f1 then do;
> nmr = sumxy - sumx * sumy /n;
> den = sqrt ( sumxx -(sumx)**2/n ) * sqrt ( sumyy -(sumy)**2/n );
> slope = (sumxy - sumx * sumy / n)/(sumxx - sumx ** 2 / n);
> intercept = sumy / n - slope * sumx / n;
> corr=nmr/den;
> put slope = ;
> put intercept = ;
> put corr=;
> end;
> run;

First | Prev |
Pages: 1 2
Prev: Numeric ID Values with decimals -- probably not the best idea in
Next: nps74.tmp.pdf, nps75.tmp.pdf