From: xlr82sas on
On Nov 21, 10:07 am, Henry <chchanghe...(a)gmail.com> wrote:
> Hi  Tom,
>
> Thanks for your reply.  I guess I know why it is confused.   I have a
> 900000 unique firms  I denoted as i.   As each firm has a value with
> another firm, denoted as j(including itself). Then, I will have
> 900000*900000 observations.
>
> F1 here denoted i firm and F2 denoted as the corresponding firm j for
> i.  That's why I wrote something like:
>
> F1    F2    V1
>
> 1      1      0.1
> 1      2      0.3
> 1      3      0.5
> 2      1      0.56
> 2      2      0.99
> 2      3      0.78
>
> Now, I tried to get the correlation between these firms except
> themselves which is when F1 = F2.
>
> So, in brief,  I want the correlation  between these two columns
>
> 0.3   0.56
> 0.5   0.78
>
> that's why I put :
>
> F1   F2      V1     F1    F2     V1
> 1      2      0.3      2     1       0.56
> 1      3      0.5      2     3       0.78
>
> As I am thinking about reshaping the values so I can get a pair of
> values as you mentioned to calculate the correlation.
>
> then as firm1(when F1 = 1) will also need a correlation with firm 3
> (when F1=3) So,  I put:
>
> F1   F2      V1     F1    F2     V1
> 1      2      0.3      3     1       0.84
> 1      3      0.5      3     2       0.83
>
> Meaning I need the correlation between
> 0.3   0.84
> 0.5   0.83
>
> This will have to be done for this 900000 unique firms which has its
> another 900000 corresponding firms (including itself).
>
> Then, after obtaining the correlation between firm i and j, I will
> have to merge them back to the original data.  So, I think I still
> need the value for F1 and F2 so I can do the merge.  Of course, the F1
> and F2 they are characteristic values (as they also contain a, b ,c
> within numbers.  i.e. firm i's F1 = 200c00).
>
> I am not sure if you can understand my issues here?   Thanks again for
> your reply!
>
> Henry

Hi,

I think you have to work with one firm against the thousands of other
at a time. I might macrotize this code create a massive
long and skinny and process with one datastep. I decided to use a
datastep to get the corr coef, because I had some old code. Also I
think it might be more flexible and quicker than 'proc corr'.

/* reshape for 'proc corr or datastep */
/* this reshaping can easily be done by creating a datastep and then
using proc sql or merge
to add the common firm. I think this is more efficient */
/* you might want to use arrays */
data res;
retain cnt 0 grp frm1x2 frm1x3 frm1x4 frm1x24;
drop cnt frm1x2 frm1x3 frm1x4 ;
input f1 f2 v1;
if f1 ne f2;
cnt+1;
grp="Firm &frm1";
select;
when(_n_=2) frm1x2=v1;
when(_n_=3) frm1x3=v1;
when(_n_=4) do; frm1x4=v1;cnt=0;end;
when(mod(cnt,3)=1) do; frm1x24=frm1x2; output;end;
when(mod(cnt,3)=2) do; frm1x24=frm1x3; output;end;
when(mod(cnt,3)=0) do; frm1x24=frm1x4; output;end;
end;
cards;
1 1 0.1
1 2 0.3
1 3 0.5
1 4 0.9
2 1 0.56
2 2 0.55
2 3 0.78
2 4 0.99
3 1 0.84
3 2 0.83
3 3 0.65
3 4 0.34
;
run;

/* I added the slope and intercept. I aslo added a fourth F2 since two
points give a corr coef of 1 or -1 */
data res2nd;
retain sumx sumy sumxx sumxy sumyy n;
set res(rename=(frm1x24=x v1=y));
by f1 notsorted;
if first.f1 then do;
sumx = 0;
sumy = 0;
sumxx = 0;
sumxy = 0;
sumyy = 0;
n = 0;
end;
sumx = sum(sumx, x);
sumy = sum(sumy, y);
sumxx = sum(sumxx, x ** 2);
sumxy = sum(sumxy, x * y);
sumyy = sum(sumyy, y * y);
n = sum(n, 1);
if last.f1 then do;
nmr = sumxy - sumx * sumy /n;
den = sqrt ( sumxx -(sumx)**2/n ) * sqrt ( sumyy -(sumy)**2/n );
slope = (sumxy - sumx * sumy / n)/(sumxx - sumx ** 2 / n);
intercept = sumy / n - slope * sumx / n;
corr=nmr/den;
put slope = ;
put intercept = ;
put corr=;
end;
run;


From: Henry on
Hello,

Thanks for sharing your codes with me!! I will work on it to see how
it goes.

Henry




On Nov 21, 6:15 pm, xlr82sas <xlr82...(a)aol.com> wrote:
> On Nov 21, 10:07 am, Henry <chchanghe...(a)gmail.com> wrote:
>
>
>
> > Hi  Tom,
>
> > Thanks for your reply.  I guess I know why it is confused.   I have a
> > 900000 unique firms  I denoted as i.   As each firm has a value with
> > another firm, denoted as j(including itself). Then, I will have
> > 900000*900000 observations.
>
> > F1 here denoted i firm and F2 denoted as the corresponding firm j for
> > i.  That's why I wrote something like:
>
> > F1    F2    V1
>
> > 1      1      0.1
> > 1      2      0.3
> > 1      3      0.5
> > 2      1      0.56
> > 2      2      0.99
> > 2      3      0.78
>
> > Now, I tried to get the correlation between these firms except
> > themselves which is when F1 = F2.
>
> > So, in brief,  I want the correlation  between these two columns
>
> > 0.3   0.56
> > 0.5   0.78
>
> > that's why I put :
>
> > F1   F2      V1     F1    F2     V1
> > 1      2      0.3      2     1       0.56
> > 1      3      0.5      2     3       0.78
>
> > As I am thinking about reshaping the values so I can get a pair of
> > values as you mentioned to calculate the correlation.
>
> > then as firm1(when F1 = 1) will also need a correlation with firm 3
> > (when F1=3) So,  I put:
>
> > F1   F2      V1     F1    F2     V1
> > 1      2      0.3      3     1       0.84
> > 1      3      0.5      3     2       0.83
>
> > Meaning I need the correlation between
> > 0.3   0.84
> > 0.5   0.83
>
> > This will have to be done for this 900000 unique firms which has its
> > another 900000 corresponding firms (including itself).
>
> > Then, after obtaining the correlation between firm i and j, I will
> > have to merge them back to the original data.  So, I think I still
> > need the value for F1 and F2 so I can do the merge.  Of course, the F1
> > and F2 they are characteristic values (as they also contain a, b ,c
> > within numbers.  i.e. firm i's F1 = 200c00).
>
> > I am not sure if you can understand my issues here?   Thanks again for
> > your reply!
>
> > Henry
>
> Hi,
>
> I think you have to work with one firm against the thousands of other
> at a time. I might macrotize this code create a massive
> long and skinny and process with one datastep. I decided to use a
> datastep to get the corr coef, because I had some old code. Also I
> think it might be more flexible and quicker than 'proc corr'.
>
> /* reshape for 'proc corr or datastep */
> /* this reshaping can easily be done by creating a datastep and then
> using proc sql or merge
> to add the common firm. I think this is more efficient */
> /* you might want to use arrays */
> data res;
> retain cnt 0 grp frm1x2 frm1x3 frm1x4 frm1x24;
> drop   cnt       frm1x2 frm1x3 frm1x4 ;
> input f1 f2 v1;
> if f1 ne f2;
> cnt+1;
> grp="Firm &frm1";
> select;
>   when(_n_=2)            frm1x2=v1;
>   when(_n_=3)            frm1x3=v1;
>   when(_n_=4)        do; frm1x4=v1;cnt=0;end;
>   when(mod(cnt,3)=1) do; frm1x24=frm1x2; output;end;
>   when(mod(cnt,3)=2) do; frm1x24=frm1x3; output;end;
>   when(mod(cnt,3)=0) do; frm1x24=frm1x4; output;end;
> end;
> cards;
> 1 1 0.1
> 1 2 0.3
> 1 3 0.5
> 1 4 0.9
> 2 1 0.56
> 2 2 0.55
> 2 3 0.78
> 2 4 0.99
> 3 1 0.84
> 3 2 0.83
> 3 3 0.65
> 3 4 0.34
> ;
> run;
>
> /* I added the slope and intercept. I aslo added a fourth F2 since two
> points give a corr coef of 1 or -1 */
> data res2nd;
>    retain sumx sumy sumxx sumxy sumyy n;
>    set res(rename=(frm1x24=x v1=y));
>    by f1 notsorted;
>    if first.f1 then do;
>      sumx  = 0;
>      sumy  = 0;
>      sumxx = 0;
>      sumxy = 0;
>      sumyy = 0;
>      n     = 0;
>    end;
>    sumx  = sum(sumx, x);
>    sumy  = sum(sumy, y);
>    sumxx = sum(sumxx, x ** 2);
>    sumxy = sum(sumxy, x * y);
>    sumyy = sum(sumyy, y * y);
>    n     = sum(n, 1);
>    if last.f1 then do;
>       nmr = sumxy  - sumx * sumy /n;
>       den = sqrt ( sumxx -(sumx)**2/n ) * sqrt ( sumyy -(sumy)**2/n );
>       slope = (sumxy - sumx * sumy / n)/(sumxx - sumx ** 2 / n);
>       intercept = sumy / n - slope * sumx / n;
>       corr=nmr/den;
>       put slope = ;
>       put intercept = ;
>       put corr=;
>    end;
> run;