From: Philip Mewes on 20 Jul 2010 12:12
"Rob Campbell" <matlab(a)robertREMOVEcampbell.removethis.co.uk> wrote in message <i24fno$6s3$1(a)fred.mathworks.com>...
> >The reduction itself is not the problem, but I could not figure out how to identify the
> >original features that are actually important components and the ones that are not
> >important.
>
> I see, this is what you want to know? Why not look at the direction of the eigenvectors? Plot the eigenvectors: the resulting "shapes" will tell you which features of your data they explain.

If I do [vec, ~, eigenv] = pca(A), and A is a matrix that contains one of my vectors in each row, then vec contains the eigenvectors as the columns of a matrix. Are those eigenvectors still correlated to my original data?
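To make the eigenvector-to-feature correspondence concrete, here is a hedged Python/NumPy sketch of the same computation the thread does with MATLAB's princomp/pca (the data and variable names are illustrative, not the poster's): each column of the eigenvector matrix is one principal component, and its rows are indexed by the original features, so a large entry means that original feature loads heavily on that component.

```python
import numpy as np

# Toy data: 100 observations x 3 features, mirroring the thread's example
# where the second feature has by far the largest variance.
rng = np.random.default_rng(0)
A = np.column_stack([rng.random(100) * 2,
                     rng.random(100) * 20,
                     rng.random(100) * 12])

# PCA via the covariance eigendecomposition (what princomp does internally).
Ac = A - A.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Ac, rowvar=False))  # ascending
order = np.argsort(eigvals)[::-1]                            # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# eigvecs[i, j] is the loading of ORIGINAL feature i on component j;
# the magnitudes in column 0 show which original features drive the
# largest-variance direction.
print(eigvecs[:, 0])
```

With these toy inputs the first column is dominated by the entry for the second feature, which is exactly the link back to the original variables that the poster is asking about.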
From: Rob Campbell on 20 Jul 2010 12:18
Ah! So you have two groups and you can produce a training set where you know whether or not each image contains a car? In that case you have a supervised classification problem, and PCA isn't the right way to go. Why not conduct a discriminant analysis? This will produce a single direction in your space which best separates the car images from the non-car images, and you can plot your data as two histograms along this axis.

The direction of the vector will tell you the basis upon which the discrimination was made: each of your original variables will have a "weighting", and you can use the magnitude of each weighting to decide whether or not it is significant. You can calculate confidence intervals for these weightings (perhaps using a permutation test) to help you determine significance.

You want: help classify
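The discriminant direction Rob describes can be sketched directly. This is a hedged Python/NumPy illustration of Fisher's linear discriminant, not MATLAB's classify (which the thread recommends); the two synthetic "car"/"no car" classes and all names are assumptions made for the example. The entries of the resulting vector w are the per-variable weightings he mentions.

```python
import numpy as np

# Two labelled groups, separated only along the first variable.
rng = np.random.default_rng(1)
X0 = rng.normal(loc=0.0, scale=1.0, size=(200, 3))               # "no car"
X1 = rng.normal(loc=[2.0, 0.0, 0.0], scale=1.0, size=(200, 3))   # "car"

# Fisher direction: w is proportional to Sw^-1 (m1 - m0), where Sw is the
# pooled within-class covariance.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
w = np.linalg.solve(Sw, m1 - m0)
w /= np.linalg.norm(w)

# |w[i]| is the weighting of original variable i in the discrimination;
# projecting each class onto w gives the two histograms along one axis.
print(w)
```

Projecting X0 @ w and X1 @ w and histogramming the two sets of scores reproduces the two-histogram picture from the post.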
From: Greg Heath on 21 Jul 2010 00:58
On Jul 16, 12:02 pm, "Pierrre Gogin" <pierre.go...(a)freemail.de> wrote:
> Hi everybody,
>
> I have a question regarding the reduction of dimensions with PCA (princomp command in MATLAB). The reduction itself is not the problem, but I could not figure out how to identify the original features that are actually important components and the ones that are not important.
> Small example:
> I create a feature space of 100 observations and 3 features:
> vec1 = rand(1,100)*2;
> vec2 = rand(1,100)*20;
> vec3 = rand(1,100)*12;
> A = [vec1; vec2; vec3];
>
> Obviously the 2nd feature has a higher variance than the 3rd, etc. So from this generic data I would expect that vec2 contributes most to describing my dataset.

Contributes the most to what? That's almost like saying that in the volume of a rectangular box, V = L*W*H = (12 in)*(1 ft)*((1/3) yard) = 4 in-ft-yards, the length of 12 is contributing the most to the volume.

For many-variable comparisons it is usually prudent to:
1. Use standardized variables, so that the result is independent of scaling and of the choice of origin.
2. Define a characteristic for making comparisons.
3. Use a quantitative measure of the amount of that characteristic accounted for by an arbitrary variable subset.
4. Be cognizant of the fact that if the variables are correlated, the contribution of a variable will depend on which other variables are present. For example, when all variables are present, removing x2 might decrease the measure the most; yet, starting from no variables, the measure might be increased the most by adding x3.

> I followed the "normal" approach to do the PCA:
> [COEFF2, SCORE2, e] = princomp(A)
>
> From the output I get the eigenvectors and eigenvalues, telling me which dimension of the transformed (!) feature space contributed how much to the representation of the dataset?

Representation of what characteristic? The eigenvalues of the covariance matrix indicate scale-dependent spread.
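Greg's point 1 (standardize first) and his warning that covariance eigenvalues are scale dependent can be demonstrated in a few lines. This is a hedged Python/NumPy sketch with made-up data: the same information measured in two wildly different units gives covariance eigenvalues dominated by the large-unit column, while z-scoring the variables (i.e. working with the correlation matrix) removes that artifact.

```python
import numpy as np

# Two columns carrying essentially the same signal, one rescaled by 1000
# (think metres vs millimetres).
rng = np.random.default_rng(2)
x = rng.normal(size=500)
A = np.column_stack([x * 1000.0,
                     x + rng.normal(scale=0.5, size=500)])

# Eigenvalues of the raw covariance: dominated by the 1000x column.
raw_eigvals = np.linalg.eigvalsh(np.cov(A, rowvar=False))

# Standardize each variable; the covariance of Z is the correlation
# matrix, whose eigenvalues sum to the number of variables.
Z = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)
std_eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))

print(raw_eigvals, std_eigvals)
```

The raw eigenvalues change completely if you change units; the standardized ones do not, which is why "which feature contributes most" is ill-posed before a scale convention is fixed.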
Is spread the characteristic that is most important? What happens if you rescale the data? What about the classification of data from two parallel cigar-shaped distributions? The spread is largest along the length of the cigars; however, the direction of largest class separation can be perpendicular to the maximum-spread direction.

> I want to know with which percentage each of my three features (vec1, vec2, vec3) from the original (!) distribution is contributing to my dataset, without prior knowledge of how they are built. From the output of MATLAB I can't tell. Does somebody have an idea how to get this information?

As implied above, the meaning of "contributing to the dataset" has to be defined. If you are only interested in the directions of largest spread, beware that they may be useless if you are trying to separate classes of different data types. Begin by defining an output as a function of the input variables and a goodness measure for that output. The rest depends on your particular problem and may be very difficult. However, you'll seldom achieve your goal without a well-defined goal, a well-planned approach and a good start.

Hope this helps.

Greg
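The cigar-shaped example above can be sketched numerically. This is a hedged Python/NumPy illustration with synthetic data (the cluster geometry is an assumption chosen to match the description): two parallel elongated clusters whose separation lies along y, while almost all the variance lies along x, so the top principal component carries no class information.

```python
import numpy as np

# Two parallel "cigars": long axis along x (std 10), classes offset by
# +/-1 along y (std 0.3 within each class).
rng = np.random.default_rng(3)
t = rng.normal(scale=10.0, size=(300, 1))       # shared long axis
noise = rng.normal(scale=0.3, size=(300, 1))
class0 = np.hstack([t[:150], noise[:150] - 1.0])
class1 = np.hstack([t[150:], noise[150:] + 1.0])
X = np.vstack([class0, class1])

# Direction of maximum spread = eigenvector of the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, -1]

# pc1 points along x (the cigar length), not along y (the axis that
# actually separates the classes).
print(pc1)
```

Projecting onto pc1 mixes the two classes completely, while projecting onto the perpendicular direction separates them cleanly: maximum spread and maximum class separation are different questions.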