From: Aadarsh on
If I have n points and an unknown number of clusters, is there an algorithm that will group the points into the most appropriate clusters?

I could use the k-means clustering algorithm, but that requires specifying the number of groups in advance. For my problem, I do not know how many groups there will be - it changes with each new run of the algorithm.

Thanks
From: ImageAnalyst on
On Apr 22, 9:01 am, "Aadarsh " <imfromwa...(a)hotmail.com> wrote:
> If i have a n unknown points and n unknown clusters, is there any algorithm which will try to group the points into the most appropriate groups.
>
> I could use K-Means clustering algorithm but with that you have to give the predicted number of groups. For my problem, I do not know how many groups there will be - it changes each with each new run of the algorithm.
>
> Thanks

-----------------------------------------------------------
Aadarsh:
The QT clustering method (http://en.wikipedia.org/wiki/Cluster_analysis#K-means_and_derivatives) is pretty simple and it doesn't require specifying in advance how many clusters there should be.

"The algorithm is:
* The user chooses a maximum diameter for clusters.
* Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold.
* Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. Must clarify what happens if more than 1 cluster has the maximum number of points?
* Recurse with the reduced set of points."
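A minimal sketch of those steps in Python (assuming Euclidean distance and first-seen tie-breaking, since the quoted description leaves ties open; this is an illustration, not the file-exchange implementation):

```python
import numpy as np

def qt_cluster(points, max_diameter):
    """QT clustering: grow a candidate cluster around each point,
    keep the largest, remove its points, and recurse."""
    points = np.asarray(points, dtype=float)
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = None
        for seed in remaining:
            # Grow a candidate cluster around this seed point.
            candidate = [seed]
            others = [j for j in remaining if j != seed]
            # Add the nearest remaining points while the diameter stays small.
            others.sort(key=lambda j: np.linalg.norm(points[j] - points[seed]))
            for j in others:
                trial = candidate + [j]
                diam = max(np.linalg.norm(points[a] - points[b])
                           for a in trial for b in trial)
                if diam > max_diameter:
                    break
                candidate = trial
            if best is None or len(candidate) > len(best):
                best = candidate  # ties broken by first-seen, per the open question
        clusters.append([points[i].tolist() for i in best])
        remaining = [i for i in remaining if i not in best]
    return clusters
```

Note the only parameter the user supplies is the maximum diameter, not the number of clusters.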


That Wikipedia page also has a pretty good discussion of cluster
analysis in general.
-ImageAnalyst
From: Zaphod on
> Thanks for the information; I've taken it on board and decided to go with MATLAB's built-in kmeans(), setting the maximum number of clusters to 1/10 the total number of points to be clustered. This seems the best way of doing it with the knowledge I have, although I'm sure (as someone mentioned earlier) I should take an iterative approach and try to find the best model. How can I do this?
>
> Below is my code:
>
> for i = 1:clusters
>     [idx, c, sumd, D] = kmeans(correct, i*6, 'distance', 'cityblock');
>     avDistance(i) = mean(sumd);  % store the mean point-to-centroid distance per run
> end
>
> where clusters is 1/10 the number of points to cluster.
> I have looked at the distance between each point and its centroid (avDistance above), but this is bound to keep getting smaller the more clusters I use. How can I set a good tolerance or stopping criterion for this iterative approach?
>
> Can anyone suggest any better method of stopping?
>
> Thanks a lot :)

It is to be expected that as you increase the number of clusters, the average distance from a point to the cluster to which it is assigned will decrease. Hence, it is not a good idea to simply pick the value of k that gives the minimum average distance, as this will almost always result in selecting the maximum value of k evaluated.

A better approach would be to plot avDistance vs. k and look for the value of k where a sharp inflection "elbow" occurs. This is pretty much the standard way of selecting k for k-means.
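For example, a simple automated version of that elbow pick, sketched in Python (the name elbow_index and the chord-distance heuristic are one illustrative choice, not the only way to locate the bend):

```python
import math

def elbow_index(values):
    """Return the index of the 'elbow' in a decreasing curve: the point
    farthest from the straight line joining the two endpoints."""
    n = len(values)
    x0, y0 = 0, values[0]
    x1, y1 = n - 1, values[-1]
    # Line through the endpoints in implicit form: a*x + b*y + c = 0
    a, b = y1 - y0, x0 - x1
    c = -(a * x0 + b * y0)
    norm = math.hypot(a, b)
    dists = [abs(a * i + b * v + c) / norm for i, v in enumerate(values)]
    return max(range(n), key=dists.__getitem__)

# An avDistance-style curve: steep drop, then flat after the third k
av = [10.0, 5.0, 1.0, 0.9, 0.8, 0.7]
print(elbow_index(av))  # → 2, the index of the sharpest bend
```

Here the index would map back to whatever k values you actually ran (i*6 in the quoted loop).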

Another option would be to use an algorithm which automatically finds the number of clusters while it is clustering. The 'mean-shift' algorithm is probably the best known of these. It amounts to following the density gradient to a local maximum (the basin of which defines one cluster).
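A rough Python sketch of that idea, using a flat kernel and a bandwidth parameter (this is only one of several mean-shift variants, and the mode-merging tolerance is an arbitrary choice):

```python
import numpy as np

def mean_shift(points, bandwidth, iters=50):
    """Shift each point to the mean of its neighbours (flat kernel) until it
    settles at a local density maximum; points converging to the same mode
    form one cluster, so the number of clusters falls out automatically."""
    pts = np.asarray(points, dtype=float)
    modes = pts.copy()
    for _ in range(iters):
        for i, m in enumerate(modes):
            near = pts[np.linalg.norm(pts - m, axis=1) <= bandwidth]
            modes[i] = near.mean(axis=0)
    # Merge modes that ended up (almost) in the same place.
    labels, centers = [], []
    for m in modes:
        for j, ctr in enumerate(centers):
            if np.linalg.norm(m - ctr) < bandwidth / 2:
                labels.append(j)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers
```

The bandwidth plays a role similar to QT's maximum diameter: it encodes the scale of a cluster rather than the count.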

Yet another option is hierarchical agglomerative clustering. Essentially, each point begins in its own cluster; then, until everything is in a single cluster, the two 'most similar' clusters are repeatedly merged. Finally, a suitable number of clusters can be determined by looking at the similarity measures at which the merges occurred.
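A toy single-linkage sketch in Python (single linkage is just one choice of 'most similar'; the merge distances it records are what you would inspect, looking for a large jump, to pick the cluster count):

```python
def agglomerative(points, dist):
    """Single-linkage agglomerative clustering: start with singleton
    clusters and repeatedly merge the two closest, recording each merge
    distance. A jump in those distances suggests where to stop merging."""
    clusters = [[p] for p in points]
    merge_dists = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between closest members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_dists.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merge_dists
```

On two tight 1-D groups such as [0, 0.1, 5, 5.1], the first two merges are cheap and the last is expensive, so cutting before the jump gives two clusters.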
From: Zaphod on
> set the maximum number of clusters to be 1/10 the total number of points to be clustered.

Depending on the size of your data set, you are probably looking at values of k that are too large (try picking k points at random from the data set, imagine they are cluster centers, and evaluate the average distance; if your k-means distance isn't better, you are wasting time on excessively large values of k). Note that the runtime of k-means clustering depends on the number of clusters. For most data sets, I would recommend a value of k between 2 and 100.
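That random-centers sanity check could be sketched like this in Python (1-D points for brevity; random_baseline and avg_dist_to_nearest are illustrative names, not library functions):

```python
import random

def avg_dist_to_nearest(points, centers):
    """Mean distance from each point to its nearest center (1-D for brevity)."""
    return sum(min(abs(p - c) for c in centers) for p in points) / len(points)

def random_baseline(points, k, trials=20, seed=0):
    """Average distance when the k centers are just random data points.
    A sanity check: k-means at this k should beat this comfortably,
    otherwise k is too large to be doing useful work."""
    rng = random.Random(seed)
    scores = [avg_dist_to_nearest(points, rng.sample(points, k))
              for _ in range(trials)]
    return sum(scores) / len(scores)
```

You would compare random_baseline(data, k) against the avDistance your k-means loop produced for the same k.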
From: Aadarsh on
"Ashish Uthama" <first.last(a)mathworks.com> wrote in message

> http://www.mathworks.com/matlabcentral/fileexchange/26431-qt-clustering-using-euclidean-distance

Do you know how the algorithm works? I've tried it several times but can't get it to work - have a look on the page, I left a comment.

Thanks