From: C T on
How does MATLAB calculate the centroid when the distance is correlation?
I tried to look at the code, but it is just too much information.

For example, if I have
data = [1,2,3,4,5;10,20,30,40,50;10,9,8,6,7;20,30,40,50,60;]
rand('twister',1);
[idx ctr]=kmeans(data,2,'distance','correlation');

I got:
>> idx
idx =
2
2
1
2

>> ctr
ctr =
0.6325 0.3162 0 -0.6325 -0.3162
-0.6325 -0.3162 0 0.3162 0.6325

How did MATLAB calculate ctr?
I tried to calculate (each_row_in_cluster2 - mean_of_each_row_in_cluster2)/standard_deviation_of_each_row_in_cluster2, but the result is not exactly equal to ctr(2,:).

Thanks
From: Peter Perkins on
On 4/27/2010 3:40 PM, C T wrote:
> How does matlab calculate the centroid when the distance is correlation?
> I tried to look at the code but it just too much information.

For correlation distance, the data are first normalized to have zero row
mean and unit row norm (which puts them on the unit hypersphere).

X = X - repmat(mean(X,2),1,p);
Xnorm = sqrt(sum(X.^2, 2));
X = X ./ Xnorm(:,ones(1,p));
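
As a rough check against the data in the question, the sketch below applies this normalization and compares it with dividing by the row standard deviation, which is what the question attempted. The variable names (Xc, Xunit, Xstd, ratio) are made up for illustration and are not taken from the KMEANS source. The two scalings should differ only by the constant factor sqrt(p-1), which would explain why the attempt did not reproduce ctr exactly.

```matlab
% Data from the question: 4 observations (rows), p = 5 variables.
data = [1,2,3,4,5; 10,20,30,40,50; 10,9,8,6,7; 20,30,40,50,60];
p = size(data, 2);

% Center each row, then scale it to unit Euclidean length.
Xc    = data - repmat(mean(data, 2), 1, p);
Xunit = Xc ./ repmat(sqrt(sum(Xc.^2, 2)), 1, p);

% Scale by the row standard deviation instead (the attempt in the question).
Xstd  = Xc ./ repmat(std(data, 0, 2), 1, p);

% Length of each std-scaled row; the unit-length rows have length 1,
% so this is the factor by which the two scalings differ.
ratio = sqrt(sum(Xstd.^2, 2));
disp(ratio')          % compare with sqrt(p-1)
disp(sqrt(p - 1))
```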

As for the centroids, they are not really defined as points, but rather
as directions -- their magnitude is arbitrary. So given normalized
data, it suffices to compute the centroid as coordinate-wise arithmetic
means

centroids(i,:) = sum(X(members,:),1) / counts(i);

Note that in your example, each centroid has mean zero. Is the norm of
the centroids important to you?
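
To tie the two steps together, here is a minimal sketch that recomputes the centroids for the example in the question. It takes the idx values reported there as given rather than rerunning KMEANS, and the loop is only an illustration of the formula above, not the actual KMEANS code.

```matlab
% Data and cluster assignments as reported in the question.
data = [1,2,3,4,5; 10,20,30,40,50; 10,9,8,6,7; 20,30,40,50,60];
idx  = [2; 2; 1; 2];
k = 2;
p = size(data, 2);

% Normalize each row to zero mean and unit Euclidean length.
X = data - repmat(mean(data, 2), 1, p);
X = X ./ repmat(sqrt(sum(X.^2, 2)), 1, p);

% Centroid of each cluster: coordinate-wise mean of its normalized members.
centroids = zeros(k, p);
for i = 1:k
    members = (idx == i);
    centroids(i, :) = sum(X(members, :), 1) / sum(members);
end
disp(centroids)   % compare with ctr in the question
```

In this particular example rows 1, 2 and 4 are identical after normalization, so their mean still has unit length. In general the mean of several unit vectors is shorter than 1, which is why only the direction of the centroid is meaningful.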
From: C T on
Thank you! I guess I'm just curious about what MATLAB did.
From: Peter Perkins on
On 4/28/2010 11:56 AM, C T wrote:

> I guess I'm just curious about what MATLAB did.

A bit of an explanation:

K-Means the algorithm (as opposed to KMEANS the function) is supposed to
minimize the sum of within-cluster point-to-centroid distances. And so
for squared Euclidean distance, the centroid for each cluster is the
element-wise arithmetic mean; for city block distance, it's the
component-wise median. There are not a lot of distances for which the
minimizer is easy to compute -- even for (unsquared) Euclidean distance,
it's hard. It would seem kind of funny to choose a centroid that did
not minimize that sum within its own cluster. So that's why KMEANS,
unlike LINKAGE, only supports five distances.
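
The claim about the minimizers can be checked numerically with a small sketch. The one-dimensional cluster x and the grid of candidate centroids c below are hypothetical and chosen only for illustration; a brute-force search over the grid should land near mean(x) for squared Euclidean cost and near median(x) for city block cost.

```matlab
x = [1; 2; 3; 4; 100];          % hypothetical 1-D cluster with an outlier
c = linspace(0, 100, 10001);    % candidate centroid positions, step 0.01

% Point-to-candidate differences: one row per point, one column per candidate.
D = repmat(x, 1, numel(c)) - repmat(c, numel(x), 1);

% Total within-cluster cost for every candidate centroid.
sqCost = sum(D.^2, 1);          % squared Euclidean
cbCost = sum(abs(D), 1);        % city block

[~, iSq] = min(sqCost);
[~, iCb] = min(cbCost);
bestSq = c(iSq);                % compare with mean(x)
bestCb = c(iCb);                % compare with median(x)
disp([bestSq, mean(x); bestCb, median(x)])
```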