From: Greg Heath on 7 Sep 2006 12:57

Gian Piero Bandieramonte wrote:
-----SNIP
> I have been researching the use of prepca and trapca, It seems good,
> but I have a doubt regarding its use:
>
> I test prepca with a vector

You mean a matrix. Please read this and the following post.
PREPCA erroneously bombs and outputs an error when Nvar > Nobs.

% CASE 1: Nobs > Nvar0 + 1

clear all, close all, clc

X = [    1   2    3    4   5;    1   3   5    7   9;   23   5   77    3   2;    3   5   7  35 456; ...
       345 456 1234  568  34;  234 523 235  123  34;    1 234   63  346  23;  234 234 234 234 234; ...
         4  56  234  423   5;    1   4   6    4   6;   23   6  234   63   2;   34 345   3 346  74; ...
        24  34   34   34  54;   34  23   2    4   5;   23   3    3   24  32; 2314 234 234 234 234; ...
         2   4   52   14  52;   24  42   5    3  63; 2314  52   52  253   5;  213   4  12  42  52 ]'

% 1. Note the transpose
% 2. The numbers look strange. Did you make them up?

[Nvar0 Nobs] = size(X)              % = [ 5 20 ]

% Notice that Nobs/(Nvar0+1) = 3.33 is not large enough to
% guarantee very accurate estimates of the weights for
% even a simple linear model (much less a neural net with
% a hidden layer).
%
% Also notice that if [Nvar0 Nobs] = [20 5], we would be dealing
% with a degenerate case: 5 observations of 20-dimensional
% vectors. The degeneracy results from the fact that the 5
% observations span, at most, a 4-dimensional space.
% This case is considered in the next reply.

meanX = mean(X')';
stdX  = std(X')';
[ meanX stdX ]
% 292.6000   698.7164
% 113.4500   164.9218
% 136.4500   275.4331
% 138.2000   171.1649
%  69.0500   113.3112

rankX = rank(X)                     % = 5
condX = cond(X)                     % = 8.2737

% Since condX << 100, the 5 variables are not multicollinear
% and variable reduction (by PCA or any other means)
% is not warranted.
%
% The only thing PCA will do in this case is transform from
% 5 independent variables to 5 orthogonal variables.

Xn = prestd(X);
sizeXn = size(Xn)                   % = [ 5 20 ]

meanXn = mean(Xn')';
stdXn  = std(Xn')';
[ meanXn stdXn ]
%  0.0000   1.0000
%  0        1.0000
%  0.0000   1.0000
%  0.0000   1.0000
%  0.0000   1.0000

rankXn = rank(Xn)                   % = 5
condXn = cond(Xn)                   % = 2.9378

[Xnt,T] = prepca(Xn,0.02);
[ Nvar Nobs ] = size(Xnt)           % = [ 5 20 ] ==> No variable reduction

rankXnt = rank(Xnt)                 % 5
condXnt = cond(Xnt)                 % 2.9378

sizeT = size(T)                     % [ 5 5 ]
rankT = rank(T)                     % 5
condT = cond(T)                     % 1.0000

approxerr1 = max(max(abs(Xnt-T*Xn)))    % 0
approxerr2 = max(max(abs(Xn-T'*Xnt)))   % 2.6645e-015

return
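A minimal sketch of how the quantities computed above are normally reused on
new data: the statistics returned by prestd and the transformation matrix
returned by prepca are saved from the training set and applied to any later
inputs with trastd and trapca. The names p, t, pnew and net below are
placeholders, not variables from the post.

% Sketch, assuming p = training inputs, t = training targets, and
% pnew = new inputs, all with variables in rows and observations in columns.
[pn,meanp,stdp,tn,meant,stdt] = prestd(p,t);   % standardize the training data
[ptrans,transMat] = prepca(pn,0.02);           % PCA reduction, as above
% ... train the network "net" on ptrans and tn ...

pnewn     = trastd(pnew,meanp,stdp);           % reuse the training means/stds
pnewtrans = trapca(pnewn,transMat);            % reuse the training transform
anewn     = sim(net,pnewtrans);                % simulate the trained network
anew      = poststd(anewn,meant,stdt);         % back to original target units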
From: Greg Heath on 7 Sep 2006 13:01

Gian Piero Bandieramonte wrote:
-----SNIP
> I have been researching the use of prepca and trapca, It seems good,
> but I have a doubt regarding its use:
>
> I test prepca with a vector

You mean a matrix. Please read this and the previous post.
PREPCA erroneously bombs and outputs an error when Nvar > Nobs.

% CASE 2: Nobs < Nvar0 + 1

clear all, close all, clc

X = [    1   2    3    4   5;    1   3   5    7   9;   23   5   77    3   2;    3   5   7  35 456; ...
       345 456 1234  568  34;  234 523 235  123  34;    1 234   63  346  23;  234 234 234 234 234; ...
         4  56  234  423   5;    1   4   6    4   6;   23   6  234   63   2;   34 345   3 346  74; ...
        24  34   34   34  54;   34  23   2    4   5;   23   3    3   24  32; 2314 234 234 234 234; ...
         2   4   52   14  52;   24  42   5    3  63; 2314  52   52  253   5;  213   4  12  42  52 ]

[Nvar0 Nobs] = size(X)              % = [ 20 5 ]

% This is a degenerate case: 5 observations of 20-dimensional
% vectors. The degeneracy results from the fact that the 5
% observations span, at most, a 4-dimensional space.
%
% Therefore, expect that the PCA reduction will yield a
% transformed input matrix, Xnt, of dimensions [ Nvar Nobs ]
% with Nvar <= 4.

meanX = mean(X')';
stdX  = std(X')';
[ meanX stdX ]
%    3.0000     1.5811
%    5.0000     3.1623
%   22.0000    31.9218
%  101.2000   198.7692
%  527.4000   442.3639
%  229.8000   184.2246
%  133.4000   149.9943
%  234.0000     0       <== BAD NEWS: SHOULD DELETE
%                            THIS CONSTANT VARIABLE
%  144.4000   182.0750
%    4.2000     2.0494
%   65.6000    97.1818
%  160.4000   170.8371
%   36.0000    10.9545
%   13.6000    14.1880
%   17.0000    13.2476
%  650.0000   930.2043
%   24.8000    25.2428
%   27.4000    25.4421
%  535.2000   998.9798
%   64.6000    85.3393
%
% MATLAB WARNING MATLAB WARNING MATLAB WARNING
%
% Warning: Some standard deviations are zero. Those inputs won't
% be transformed.
%
% COMMENT COMMENT COMMENT COMMENT COMMENT
%
% Probably should either set the transformed value to zero
% or eliminate the variable from the data matrix.
%
% Not sure how MATLAB's decision will affect the analysis.

rankX = rank(X)                     % = 5
condX = cond(X)                     % = 8.2737

Xn = prestd(X);
sizeXn = size(Xn)                   % = [ 20 5 ]

meanXn = mean(Xn')';
stdXn  = std(Xn')';
[ meanXn stdXn ]
%    0.0000   1.0000
%    0.0000   1.0000
%    0.0000   1.0000
%    0        1.0000
%    0.0000   1.0000
%   -0.0000   1.0000
%   -0.0000   1.0000
%  234.0000   0       <== I don't like this
%   -0.0000   1.0000
%   -0.0000   1.0000
%    0.0000   1.0000
%   -0.0000   1.0000
%    0        1.0000
%    0.0000   1.0000
%    0        1.0000
%    0.0000   1.0000
%   -0.0000   1.0000
%    0.0000   1.0000
%   -0.0000   1.0000
%    0.0000   1.0000

rankXn = rank(Xn)                   % = 5
condXn = cond(Xn)                   % = 178.6037

% Since condXn > 100, the 20 variables are multicollinear
% and variable reduction is warranted.
%
% ... HOWEVER ...
%
% PREPCA erroneously bombs when Nvar > Nobs.
% See my pretraining advice post for the fix,
% named below as PREPCAGH. Besides commenting
% out the offending line of code, PREPCAGH also
% introduces a 3rd input that allows thresholding
% on the cumulative sum of eigenvalues via
%
%    [Xnt,T] = prepcagh(Xn,0.98,'cumulative');
%
% This modification is not used below.

[Xnt,T] = prepcagh(Xn,0.02,'individual');
[ Nvar Nobs ] = size(Xnt)           % = [ 1 5 ] ==> Variable reduction complete

rankXnt = rank(Xnt)                 % 1
condXnt = cond(Xnt)                 % 1

sizeT = size(T)                     % [ 1 20 ]
rankT = rank(T)                     % 1
condT = cond(T)                     % 1

approxerr1 = max(max(abs(Xnt-T*Xn)))    % 0
approxerr2 = max(max(abs(Xn-T'*Xnt)))   % 1.7889

return
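As a rough illustration of the cumulative-eigenvalue thresholding idea
mentioned above, the sketch below keeps the smallest number of principal
components whose eigenvalues cover a chosen fraction (here 98%, an arbitrary
example) of the total variance. This is only a sketch of the concept, not
Greg's PREPCAGH code.

% Sketch: Xn is the standardized data, variables in rows, observations
% in columns, as in the post above.
[V,D]        = eig(cov(Xn'));              % eigen-decomposition of the covariance
[lambda,idx] = sort(diag(D),'descend');    % eigenvalues, largest first
V            = V(:,idx);                   % reorder eigenvectors to match
cumfrac      = cumsum(lambda)/sum(lambda); % cumulative variance fraction
k            = find(cumfrac >= 0.98, 1);   % number of components to keep
T            = V(:,1:k)';                  % [ k x Nvar ] transformation matrix
Xnt          = T*Xn;                       % [ k x Nobs ] reduced data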
From: Greg Heath on 7 Sep 2006 13:17

Gian Piero Bandieramonte wrote:
> > 1. Go to Google Groups and search on
> >
> >    greg-heath pretraining-advice
> >
> > 2. Sort by date
> > 3. It should be near the last post
> >
> > Hope this helps.
> >
> > Greg
>
> ------------SNIP
>
> I have read your post on pretraining advice, and what gets closest to
> the answer to my question is point 2 of your advice:
>
> "2. Use TRANSPOSE and PRESTD to standardize the columns of Z. On
>     special occasions normalization to the bounded interval [-1,1]
>     (PREMNMX) is used for some columns. However, this is most
>     useful only if you know that all unknown data must fall
>     within the original bounds of the training data."
>
> It says to use transpose and prestd, but it does not tell me anything
> about their order of use. So, should I use transpose or prestd first?
> Sorry if I'm being tedious; I want to be sure of this before I change
> my code from using princomp to using prepca.

Unfortunately, the row/column variable/observation convention is
reversed between the statistics and neural net toolboxes. Therefore,
in order to be sure I give you the correct answer, I have to go to the
documentation via

help prestd
help prepca

etc., to determine whether the rows should be variables or
observations. That determines whether transposition is needed or not.

I also find it very helpful to print the size, rank and condition
number of matrices throughout the code in order to make sure I am
doing what I want. See my last two replies w.r.t. the 20 x 5 matrix
you posted.

I would rather help you understand how to figure out the answers than
just give them to you, so I'd rather you go to the documentation than
rely on me.

Hope this helps.

Greg
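To make the convention concrete, here is a minimal sketch assuming the raw
data matrix Z holds one observation per row (the statistics-toolbox
convention), which is the usual reason a transpose is needed before the
neural net toolbox calls. Z, Nvar and Nobs are placeholder names.

% Sketch: Z is [ Nobs x Nvar ] (rows = observations). prestd standardizes
% each ROW, so transpose first so that the rows are variables.
X = Z';                             % now [ Nvar x Nobs ]
[Nvar, Nobs] = size(X)              % sanity check of the orientation
[Xn, meanX, stdX] = prestd(X);      % standardize each row (variable)
rankXn = rank(Xn)                   % the checks Greg recommends above
condXn = cond(Xn)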
From: Gian Piero Bandieramonte on 12 Sep 2006 14:08

I have tried processing my network inputs and targets using prestd,
prepca, trastd, trapca and poststd. By doing this, my network's
generalization performance decreases hugely, to dangerous levels. I
have used them the way the MATLAB help tells me to use them. So I had
to stick with using princomp to process my data so as to reduce its
dimensionality (the net's generalization using princomp is
approximately 1000 times better than using prepca, trapca, ...).

My design set, consisting of a training set of 252 rows, has been
processed by princomp to reduce the dimensionality of the input from
37 to 22 (I'm being conservative for now). Then I simulate my net with
the same training set and get an excellent fit. Then I simulate with a
new test set of 5000 rows and get somewhat good generalization.
Obviously, I preprocessed this test set with princomp before
simulating. But there is a problem here: I'm not supposed to
preprocess my test set using the whole batch at one time; I'm supposed
to preprocess each row of the test set individually. But I don't know
how to do this with princomp. If I use this function with only one
row, a MATLAB error appears. If I use two rows, it creates a new 2x37
matrix with zeros from column 2 to 37, but if I process the whole
batch, the transformed matrix wouldn't have zeros at M(1:2,2:37). The
output changes depending on the number of rows. So I need a consistent
way of processing each row individually, maybe using some
transformation matrix or vector, but I don't know how to find it, and
it is not available the way you said in:

> > > I use eigs(corrcoef(X)) for PCA (instead of princomp), so
> > > I don't know if the transformation matrix is available to you
> > > without solving T*X = PC using T = X/PC.

I tried calculating T this way, but it doesn't properly transform my
rows when I use it. I'm also a bit confused about this. In the case of
using prepca and trapca I do have a transformation matrix, because it
is one of the outputs of prepca. This sort of transformation matrix is
what I need. But since I'm not using prepca, but princomp, what can I
do?
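princomp does in fact return a reusable transformation: its first output is
the matrix of principal-component loadings, and new rows can be projected
with it after subtracting the training-set means, which is what princomp
itself centers by. A minimal sketch, assuming Xtrain is the 252 x 37 training
matrix (rows = observations) and 22 components are kept, as described above;
Xtrain, xnew and the other names below are placeholders.

% Sketch: build a fixed transformation from the TRAINING set only, then
% apply it to single new rows (or whole batches) without redoing the PCA.
[coeff, score] = princomp(Xtrain);  % coeff is [ 37 x 37 ], columns = PCs
mu = mean(Xtrain);                  % 1 x 37 training means
W  = coeff(:,1:22);                 % fixed [ 37 x 22 ] transformation matrix
Ztrain = score(:,1:22);             % reduced training inputs, [ 252 x 22 ]

znew = (xnew - mu)*W;               % xnew is a new 1 x 37 row -> 1 x 22,
                                    % consistent with the training-set PCA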
From: Greg Heath on 24 Sep 2006 17:14
Gian Piero Bandieramonte wrote:
> I have tried processing my network inputs and targets using prestd,
> prepca, trastd, trapca and poststd. By doing this, my network's
> generalization performance decreases hugely, to dangerous levels. I
> have used them the way the MATLAB help tells me to use them. So I had
> to stick with using princomp to process my data so as to reduce its
> dimensionality (the net's generalization using princomp is
> approximately 1000 times better than using prepca, trapca, ...).
>
> My design set, consisting of a training set of 252 rows, has been
> processed by princomp to reduce the dimensionality of the input from
> 37 to 22 (I'm being conservative for now). Then I simulate my net
> with the same training set and get an excellent fit. Then I simulate
> with a new test set of 5000 rows and get somewhat good
> generalization. Obviously, I preprocessed this test set with princomp
> before simulating. But there is a problem here: I'm not supposed to
> preprocess my test set using the whole batch at one time; I'm
> supposed to preprocess each row of the test set individually. But I
> don't know how to do this with princomp. If I use this function with
> only one row, a MATLAB error appears. If I use two rows, it creates a
> new 2x37 matrix with zeros from column 2 to 37, but if I process the
> whole batch, the transformed matrix wouldn't have zeros at
> M(1:2,2:37). The output changes depending on the number of rows. So I
> need a consistent way of processing each row individually, maybe
> using some transformation matrix or vector, but I don't know how to
> find it, and it is not available the way you said in:
>
> > > > I use eigs(corrcoef(X)) for PCA (instead of princomp), so
> > > > I don't know if the transformation matrix is available to you
> > > > without solving T*X = PC using T = X/PC.
>
> I tried calculating T this way, but it doesn't properly transform my
> rows when I use it. I'm also a bit confused about this. In the case
> of using prepca and trapca I do have a transformation matrix, because
> it is one of the outputs of prepca. This sort of transformation
> matrix is what I need. But since I'm not using prepca, but princomp,
> what can I do?

I can't tell you what you are doing wrong from your reply.
Post a copy of your code.

Hope this helps.

Greg