how to cluster points into n-unknown groups? [Matlab]

Prev: ignore imaginary numbers only if they're small
Next: Map or Region

From: Greg Heath on 22 Apr 2010 10:44

On Apr 22, 9:01 am, "Aadarsh " <imfromwa...(a)hotmail.com> wrote:
> If i have a n unknown points and n unknown clusters, is there any algorithm which will try to group the points into the most appropriate groups.
>
> I could use K-Means clustering algorithm but with that you have to give the predicted number of groups. For my problem, I do not know how many groups there will be - it changes each with each new run of the algorithm.
>

For k-means you give the maximum possible number of groups.
Some may be empty.

Greg

From: Doug Schwarz on 22 Apr 2010 12:13

Greg Heath wrote:
> On Apr 22, 9:01 am, "Aadarsh " <imfromwa...(a)hotmail.com> wrote:
>> If i have a n unknown points and n unknown clusters, is there any algorithm which will try to group the points into the most appropriate groups.
>>
>> I could use K-Means clustering algorithm but with that you have to give the predicted number of groups. For my problem, I do not know how many groups there will be - it changes each with each new run of the algorithm.
>>
>
>
> For k-means you give the maximum possible number of groups.
> Some may be empty.
>
> Greg

It is possible for kmeans to return empty clusters, but it is not the
normal behavior. Try this:

s = randn(10000,2);
idx = kmeans(s,2);
gplotmatrix(s,[],idx)

and you will see that kmeans has split the single cluster into two, as
requested.

As to the OP's original question: yes, but it's not an easy problem.
I'm no expert, but I have recently implemented something called
Superparamagnetic Clustering. It's too complicated to explain here,
but, within limits, it is capable of determining the number of clusters.
It is slow. If the range of number of clusters is small and your data
is suitable well behaved then just run kmeans multiple times with
different values of k and use the third and/or fourth outputs to pick
the best one.

--
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.

From: Greg Heath on 22 Apr 2010 14:48

On Apr 22, 12:13 pm, Doug Schwarz <s...(a)sig.for.address.edu> wrote:
> Greg Heath wrote:
> > On Apr 22, 9:01 am, "Aadarsh " <imfromwa...(a)hotmail.com> wrote:
> >> If i have a n unknown points and n unknown clusters, is there any algorithm which will try to group the points into the most appropriate groups.
>
> >> I could use K-Means clustering algorithm but with that you have to give the predicted number of groups. For my problem, I do not know how many groups there will be - it changes each with each new run of the algorithm.
>
> > For k-means you give the maximum possible number of groups.
> > Some may be empty.
>
> > Greg
>
> It is possible for kmeans to return empty clusters, but it is not the
> normal behavior. Try this:
>
> s = randn(10000,2);
> idx = kmeans(s,2);
> gplotmatrix(s,[],idx)
>
> and you will see that kmeans has split the single cluster into two, as
> requested.

.... and I have used MATLABs kmeans with empty cluster results.

What is "normal" should be understood with a full realization
of what can occur.

Canned kmeans tends to be too restrictive

With my own clustering codes (FORTRAN/decades ago) I introduced
initialization options, some of which would encourage empty clusters.

> As to the OP's original question: yes, but it's not an easy problem.
> I'm no expert, but I have recently implemented something called
> Superparamagnetic Clustering. It's too complicated to explain here,

Is it anything like the popular SuperDuperparaelectric algorithm?

> but, within limits, it is capable of determining the number of clusters.
> It is slow. If the range of number of clusters is small and your data
> is suitable well behaved then just run kmeans multiple times with
> different values of k and use the third and/or fourth outputs to pick
> the best one.

Greg

From: Doug Schwarz on 22 Apr 2010 15:13

Greg Heath wrote:
> On Apr 22, 12:13 pm, Doug Schwarz <s...(a)sig.for.address.edu> wrote:
>> Greg Heath wrote:
>> > On Apr 22, 9:01 am, "Aadarsh " <imfromwa...(a)hotmail.com> wrote:
>> >> If i have a n unknown points and n unknown clusters, is there any algorithm which will try to group the points into the most appropriate groups.
>>
>> >> I could use K-Means clustering algorithm but with that you have to give the predicted number of groups. For my problem, I do not know how many groups there will be - it changes each with each new run of the algorithm.
>>
>> > For k-means you give the maximum possible number of groups.
>> > Some may be empty.
>>
>> > Greg
>>
>> It is possible for kmeans to return empty clusters, but it is not the
>> normal behavior. Try this:
>>
>> s = randn(10000,2);
>> idx = kmeans(s,2);
>> gplotmatrix(s,[],idx)
>>
>> and you will see that kmeans has split the single cluster into two, as
>> requested.
>
> ... and I have used MATLABs kmeans with empty cluster results.

And I agreed that that can happen, but you implied that one could simply
enter the maximum number of clusters expected and kmeans will "do the
right thing" and return the "true" number of clusters. That is not
usually true and I illustrated it with a trivial example.

>> As to the OP's original question: yes, but it's not an easy problem.
>> I'm no expert, but I have recently implemented something called
>> Superparamagnetic Clustering. It's too complicated to explain here,
>
> Is it anything like the popular SuperDuperparaelectric algorithm?

Are you trying to discredit my opinion by poking fun at the name? Hey,
I didn't invent it, I merely wrote an implementation.

The algorithm is analogous to the way magnetic domains are formed in a
ferrous material. When the temperature is low all the domains are
oriented in the same direction (ferromagnetic phase -- one cluster);
when the temperature is high the domains are randomly oriented
(paramagnetic phase -- many clusters). The goal is the find the
intermediate temperature that gives rise to the superparamagnetic phase
at which points that are close change state together. Google it for
more details. The paper I found to be easiest to understand is called
"Data Clustering Using a Model Granular Magnet" by Blatt, Wiseman and
Domany.

--
Doug Schwarz
dmschwarz&ieee,org
Make obvious changes to get real email address.

From: Aadarsh on 22 Apr 2010 17:42

"Aadarsh " <imfromwales(a)hotmail.com> wrote in message <hqphal$40s$1(a)fred.mathworks.com>...
> If i have a n unknown points and n unknown clusters, is there any algorithm which will try to group the points into the most appropriate groups.
>
> I could use K-Means clustering algorithm but with that you have to give the predicted number of groups. For my problem, I do not know how many groups there will be - it changes each with each new run of the algorithm.
>
> Thanks

Thanks for the information, ive taken it on board and decided to go with matlab's built in kmeans(), and set the maximum number of clusters to be 1/10 the total number of points to be clustered. This seems to be the best way of doing it with the knwledge i have, although im sure (as someone mentioned earlier) to take an iterative approach and to try to find the best model. How can I do this?

Below is my code:

for i=1:clusters
[idx, c, sumd,D] = kmeans(correct,i*6, 'distance','cityblock');
avDistance = mean(sumd);
i=i+1;
end

where clusters is 1/10 the number of points to cluster.
I have looked into looking at the distance between each point to its centroid (avDistance above), but this is bound to keep getting smaller the more clusters I get. How can i set a good tolerance or stopping creterion for this iterative approach?

Can anyone suggest any better method of stopping?

Thanks a lot :)

| Next | Last
Pages: 1 2
Prev: ignore imaginary numbers only if they're small
Next: Map or Region