From: Fred on
Hello,
I'm quite new with Matlab. I'm having some problems with the speed of "pdist" function in matlab 2007. I'm trying to calculate between-class distances using this particular function but it takes a hell of a long time to do it. Here is the code I'm using:

Inter=zeros(1,244978);
D=1;
for i=1:(size(PCA_Scores.data,1))
for n=i+1:(size(PCA_Scores.data,1))
if PCA_Scores(i,:).class{1}==PCA_Scores(n,:).class{1}
else
Inter(1,D)=pdist([PCA_Scores(i,:).data;PCA_Scores(n,:).data],'cityblock');
D=D+1;
end
end
end

PCA_Scores is a 785x10 matrix containing the PC scores calculated from 785 near-infrared spectra (785X3112 data matrix).
PCA_Scores.class{1} contains the class index for each sample (785x1 vector).

What I want to do is to calculate the distance between samples which aren't in the same class (since usually pdist calculates the between-distance of all samples in the data matrix).
I've tried to preallocate the size of the final matrix, but speed doesn't seem to increase particularly.
Could someone help me with this issue?

Thanks for help

Fred
From: Peter Perkins on
On 5/4/2010 3:06 AM, Fred wrote:
> Hello,
> I'm quite new with Matlab. I'm having some problems with the speed of
> "pdist" function in matlab 2007.

> What I want to do is to calculate the distance between samples which
> aren't in the same class (since usually pdist calculates the
> between-distance of all samples in the data matrix).

If you have access to R2010b, PDIST2 would be closer to what you need:

>> help pdist2
PDIST2 Pairwise distance between two sets of observations.

<http://www.mathworks.com/access/helpdesk/help/toolbox/stat/pdist2.html>

You would still want two nested loops, but this time over classes,
not over observations.
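
A minimal sketch of that loop over class pairs, assuming the scores sit in a plain numeric matrix and the class labels are integers 1..nclasses. The names scores, classes and Inter, and the toy data, are placeholders for illustration rather than variables from the original post:

```matlab
% Sketch only: scores (n-by-p numeric) and classes (n-by-1 integer labels)
% are assumed names; the toy values below just make the example runnable.
scores   = [0 0; 1 1; 2 0; 5 5; 6 5];
classes  = [1; 1; 2; 2; 3];
nclasses = max(classes);

Inter = [];                                  % collected between-class distances
for i = 1:nclasses-1
    Xi = scores(classes == i, :);            % observations in class i
    for j = i+1:nclasses
        Xj = scores(classes == j, :);        % observations in class j
        Dij = pdist2(Xi, Xj, 'cityblock');   % size(Xi,1)-by-size(Xj,1) block
        Inter = [Inter; Dij(:)];             %#ok<AGROW> only a few class pairs
    end
end
```

Each call returns only the distances between the two classes, so nothing has to be discarded afterwards.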


> Inter=zeros(1,244978);
> D=1;
> for i=1:(size(PCA_Scores.data,1))
> for n=i+1:(size(PCA_Scores.data,1))
> if PCA_Scores(i,:).class{1}==PCA_Scores(n,:).class{1}
> else
> Inter(1,D)=pdist([PCA_Scores(i,:).data;PCA_Scores(n,:).data],'cityblock');
> D=D+1;
> end
> end
> end
>
> PCA_Scores is a 785x10 matrix containing the PC scores calculated from
> 785 near-infrared spectra (785X3112 data matrix).

That can't be exactly true, since you're indexing into it as a
structure. It's hard to tell, but it appears that you've stored all
your scores in a structure array? That isn't going to be the best way
to do this. What you probably want is one numeric matrix for the
scores, one for the class. You've only got 785 observations, and so
vortse a's suggestion (call PDIST once, then carve away the things you
don't need) makes sense. Or something like this:

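for i = 1:nclasses-1
    scoresi = scores(classes==i,:);
    ni = size(scoresi,1);               % number of observations in class i
    for j = i+1:nclasses
        scoresj = scores(classes==j,:);
        nj = size(scoresj,1);           % number of observations in class j
        % pdist takes a single matrix, so stack the two classes first
        D = squareform(pdist([scoresi; scoresj]));
        [something] = D(1:ni,ni+1:end); % class i vs. class j block
    end
end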

where scores is a 785x10 numeric matrix and classes is a 785x1 numeric
vector. I've probably made mistakes, but that's the rough idea. If you
used PDIST2, you wouldn't need the call to squareform or the
subscripting to pull out the off-diagonal block.
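
For the "call PDIST once, then carve away" route, a rough sketch might look like the following. It assumes the same numeric scores matrix and numeric classes vector (toy values here, not the real data) and relies on pdist returning the pairs in the order (2,1), (3,1), ..., (n,1), (3,2), ..., which is the column-wise order of the lower triangle:

```matlab
% Sketch only: scores and classes are assumed names with toy values.
scores  = [0 0; 1 1; 2 0; 5 5; 6 5];
classes = [1; 1; 2; 2; 3];

D = pdist(scores, 'cityblock');              % all n*(n-1)/2 distances in one call
n = size(scores, 1);
[I, J] = find(tril(true(n), -1));            % pair indices in pdist's output order
differentClass = classes(I) ~= classes(J);   % true where a pair spans two classes
Inter = D(differentClass');                  % keep only between-class distances
```

With 785 observations, D has only about 308,000 entries, so the single call plus a logical mask avoids the per-pair function-call overhead of the original double loop.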

Hope this helps.
From: Peter Perkins on
On 5/4/2010 9:12 AM, Peter Perkins wrote:
> If you have access to R2010b, PDIST2 would be closer to what you need:

Sorry, R2010a, not R2010b.