From: Pierrre Gogin on
Hi everybody,

I have a question regarding dimensionality reduction with PCA (the princomp command in MATLAB). The reduction itself is not the problem, but I could not figure out how to identify which of the original features are actually important and which are not.
Small example:
I create a feature space of 100 observations and 3 features:
vec1 = rand(1,100)*2;
vec2 = rand(1,100)*20;
vec3 = rand(1,100)*12;
A = [vec1; vec2; vec3];
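
(Just as a rough sanity check, since these are uniform rand() features: the variance of a uniform variable on [0,a] is a^2/12, so I'd expect roughly 0.33, 33.3 and 12 here. The line below is only a quick check, not part of the actual analysis.)

var(A')   % sample variances of the three features, roughly [0.33  33.3  12]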

Obviously the 2nd feature has a higher variance than the 3rd, etc. So from this generated data I would expect vec2 to contribute the most to describing my dataset. I followed the "normal" approach to do the PCA:
[COEFF2,SCORE2, e] = princomp(A)

From the output I get the eigenvectors and eigenvalues, which tell me how much each dimension of the transformed (!) feature space contributes to the representation of the dataset. What I want to know is what percentage each of my three features (vec1, vec2, vec3) from the original (!) distribution contributes to my dataset, without prior knowledge of how they were built. From the MATLAB output I can't tell. Does somebody have an idea how to get this information?

Thanks in advance
Pierre
From: Peter Perkins on
On 7/16/2010 12:02 PM, Pierrre Gogin wrote:
> I create a feature space of 100 observations and 3 features:
> vec1 = rand(1,100)*2;
> vec2 = rand(1,100)*20;
> vec3 = rand(1,100)*12;
> A = [vec1; vec2; vec3];
>
> Obviously the 2nd feature has a higher variance than the 3rd,
> etc. So from this generated data I would expect vec2 to contribute
> the most to describing my dataset. I followed the "normal"
> approach to do the PCA:
> [COEFF2,SCORE2, e] = princomp(A)


Pierre, I bet this (note the transpose) will be a bit more obvious to
figure out:

>> [COEFF2,~,e] = princomp(A')
COEFF2 =
  -0.00042265   -0.0045291     0.99999
   0.99997       0.0081763     0.00045967
  -0.0081783     0.99996       0.0045255
e =
     26.508
     10.351
    0.33826

Which is to say, the first PC [-0.00042265 0.99997 -0.0081783]' picks
out the second feature, and accounts for 26.5/37.2 ≈ 71% of the total
variance, and so on. PRINCOMP, like all functions in the Statistics
Toolbox, is column-oriented (observations in rows, variables in columns).
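
If you want those numbers as percentages directly, a minimal sketch using the outputs above (pctExplained is just a name I'm making up here):

pctExplained = 100 * e / sum(e)   % percent of total variance per PC, roughly [71 28 1]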

Hope this helps.
From: Pierrre Gogin on
Hi Peter,

Thanks for your answer. Unfortunately it did not really answer my question (see below).
>
>
> Pierre, I bet this (note the transpose) will be a bit more obvious to
> figure out:
I actually had the transpose in my code and just forgot to copy it into my post, so I get the same results (the same structure at least; with rand() the numbers will of course differ a bit each run).

>
> Which is to say, the first PC [-0.00042265 0.99997 -0.0081783]' picks
> out the second feature, and accounts for 26.5/37.2 ≈ 71% of the total
> variance, and so on.


So here's the point. The first PC accounts for 71% of the total variance. I agree, but I guess that is the first PC of the transformed (!) feature space. The question I'm interested in is: what percentage do vec1, vec2 and vec3 account for of the total variance?
PCA is a linear transformation, so going backwards should be possible, but I just don't know how.
From: Peter Perkins on
On 7/16/2010 4:13 PM, Pierrre Gogin wrote:
> So here's the point. The first PC accounts for 71% of the total variance.
> I agree, but I guess that is the first PC of the transformed (!) feature
> space. The question I'm interested in is: what percentage do vec1, vec2
> and vec3 account for of the total variance? PCA is a linear
> transformation, so going backwards should be possible, but I just don't
> know how.

Is this what you're asking?

>> [COEFF2,~, e] = princomp(A')
COEFF2 =
   -0.017612    -0.024113     0.99955
    0.99486      0.099277     0.019924
   -0.099713     0.99477      0.022241
e =
     33.805
     10.657
    0.33649
>> sum(e)
ans =
     44.799
>> S = cov(A')
S =
    0.35288    -0.61112    -0.18879
   -0.61112    33.564      -2.3009
   -0.18879    -2.3009     10.882
>> sum(diag(S))
ans =
     44.799
>> diag(S)/sum(diag(S))
ans =
    0.0078769
      0.74921
      0.24291
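
In other words, a minimal sketch (assuming A holds the features in rows, as in your original post; featureShare is just an illustrative name):

S = cov(A');                        % covariance matrix of the original features
featureShare = diag(S) / trace(S)   % per-feature share of total variance, ~[0.01 0.75 0.24]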
From: Pierrre Gogin on
Hi Peter,

Thanks a lot, that looks very much like what I'm looking for. I was confused because the PCA is actually not really necessary to get the information I'm after: it is sufficient to evaluate the main diagonal of the covariance matrix. The princomp command in MATLAB does not do much more than that, I guess. Only if you also compute the second output do you get the input matrix rotated by the matrix containing all eigenvectors. Correct?
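
Just to make sure I have it right, here is a small check of that statement (assuming the same A as above; X and Xc are only names for this check): SCORE should be the mean-centered data rotated by the eigenvector matrix COEFF.

X = A';                                   % observations in rows, features in columns
[COEFF, SCORE] = princomp(X);
Xc = X - repmat(mean(X), size(X,1), 1);   % subtract the column means
max(max(abs(SCORE - Xc*COEFF)))           % should be ~0 up to rounding error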


Peter Perkins <Peter.Perkins(a)MathRemoveThisWorks.com> wrote in message <i1qhrn$kio$1(a)fred.mathworks.com>...
> On 7/16/2010 4:13 PM, Pierrre Gogin wrote:
> > So here's the point. The first PC accounts for 71% of the total variance.
> > I agree, but I guess that is the first PC of the transformed (!) feature
> > space. The question I'm interested in is: what percentage do vec1, vec2
> > and vec3 account for of the total variance? PCA is a linear
> > transformation, so going backwards should be possible, but I just don't
> > know how.
>
> Is this what you're asking?
>
> >> [COEFF2,~, e] = princomp(A')
> COEFF2 =
>    -0.017612    -0.024113     0.99955
>     0.99486      0.099277     0.019924
>    -0.099713     0.99477      0.022241
> e =
>      33.805
>      10.657
>     0.33649
> >> sum(e)
> ans =
>      44.799
> >> S = cov(A')
> S =
>     0.35288    -0.61112    -0.18879
>    -0.61112    33.564      -2.3009
>    -0.18879    -2.3009     10.882
> >> sum(diag(S))
> ans =
>      44.799
> >> diag(S)/sum(diag(S))
> ans =
>     0.0078769
>       0.74921
>       0.24291