From: jay vaughan on 25 May 2010 18:45

Hi Folks,

I have a list of 10^3 to 10^6 positions that I would like to use for cluster analysis. I have been using the function clusterdata:

    T = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
        distance_threshold,'criterion','distance');

Not surprisingly, the analysis doesn't work beyond around 10^4 points; I suspect this is because of memory limitations in calculating pairwise distances between so many points.

I am wondering if there is another way to do this, especially considering that I do not need the whole cluster hierarchy, but only 'local' clustering within a radius that would only ever include a very small number of the points in the whole data set.

Any suggestions?

Thanks,
jay

I am running Windows XP and R2010a and have access to the toolboxes.
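[Editor's note: to make the suspected memory limitation concrete, here is a back-of-envelope sketch, in Python for illustration only. The function name is made up, and clusterdata's actual working set in R2010a may differ; this just counts the condensed pairwise-distance vector that single-linkage clustering typically stores.]

```python
# Rough, illustrative arithmetic: single-linkage on n points via a full
# pairwise-distance vector stores n*(n-1)/2 distances, 8 bytes each.
def pairwise_distance_bytes(n):
    """Bytes for the condensed pairwise-distance vector of n points."""
    return n * (n - 1) // 2 * 8

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: {pairwise_distance_bytes(n) / 2**30:10.1f} GiB")
```

At 10^4 points this is already a few hundred MB, and at 10^6 points it is in the terabyte range, which is consistent with the analysis breaking down around 10^4 points on a 32-bit Windows XP machine.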
From: ImageAnalyst on 25 May 2010 22:50

Assuming your clusters are not all clustered (no pun intended), or at least all the clusters are fairly well represented with the first 10,000 points, then why not just use the first 10,000 points? Does including points 10,000 - 1,000,000 really improve your clustering anyway (if you could do it)? Probably not.
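[Editor's note: ImageAnalyst's suggestion can be sketched as follows, in illustrative Python. The helper name is made up; taking a uniform random subset is a slightly safer variant of "the first 10,000 points", in case the data file happens to be ordered by position or acquisition time.]

```python
import random

def subsample(points, k=10_000, seed=0):
    """Return k points chosen uniformly without replacement
    (or all of them, if there are at most k)."""
    if len(points) <= k:
        return list(points)
    return random.Random(seed).sample(list(points), k)
```

If the clusters really are well represented in any 10,000-point subset, clustering the subsample with clusterdata should recover essentially the same cluster structure at a fraction of the memory cost.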
From: jay vaughan on 26 May 2010 10:14

Thanks for the tip, ImageAnalyst. If I am unable to find a better solution, I may have to resign myself to tossing most of the data, heartbreaking as that may be. I guess I was hoping there might be other algorithms or approaches for efficient 'local' clustering of large data sets. For instance: sort the data into smaller, manageable chunks, process each with clusterdata, and then figure out how to handle the edge points.

ImageAnalyst <imageanalyst(a)mailinator.com> wrote in message <31092b43-472c-4aa7-9c9a-6ab8f7c69030(a)l6g2000vbo.googlegroups.com>...
> Assuming your clusters are not all clustered (no pun intended), or at
> least all the clusters are fairly well represented with the first
> 10,000 points, then why not just use the first 10,000 points? Does
> including points 10,000 - 1,000,000 really improve your clustering
> anyway (if you could do it)? Probably not.
From: Walter Roberson on 27 May 2010 01:05

jay vaughan wrote:
> I have a list of 10^3 to 10^6 positions that I would like to use for
> cluster analysis. I have been using the function clusterdata...
>
> T = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
>     distance_threshold,'criterion','distance');
>
> Not surprisingly, the analysis doesn't work beyond around 10^4 points, I
> suspect because of memory limitations in calculating pairwise distances
> between so many points.
>
> I am wondering if there is another way to do this, especially
> considering that I do not need the whole cluster hierarchy, but only
> 'local clustering' within a radius that would only ever include a very
> small number of points of the whole data.

The below assumes that the data is at least 2-dimensional, but it could be adapted for the case of 1D positions.

Use histc to bin along Y at a constant width and get the indices. You will use the bins and related indices to extract subsets of the points.

Start at the top and proceed downwards.

Take a window of data at a time (i.e., extract the subset of points in that Y interval). Cluster on that subset. Extract all the clusters whose centers are at least R (the radius) above the bottom of the window and record those as found clusters.

Move the window down so that it has at least R of overlap with the old window, and repeat.

If you set your bin size to R and use two bins together as your window, then the clusters to be recorded are the ones whose centers fall in the upper bin, and moving the window down corresponds to making the old lower bin the new upper bin and taking the next bin as the new lower one. Since you have already done the binning, this saves you a bunch of data extraction and recalculation.
If you find that the overhead of firing up the clustering is too high and you have enough free memory, you could use several bins in sequence, saving the clusters whose centers fall in any but the final bin, and re-using just that final bin as the first bin of the new window.

Why discard the clusters whose centers lie within R of the bottom of the window? Because their radius R extends into the next bin (whose data was not presented for clustering), so there might be additional points belonging to the cluster (and the cluster center might shift). Discarding such a cluster the first time is not a problem, because on the next iteration all of its points will be within the new window (unless the stated assumption about clustering within a radius is violated).
From: jay vaughan on 27 May 2010 09:05
Walter Roberson <roberson(a)hushmail.com> wrote in message <y4nLn.32339$wV2.7729(a)newsfe23.iad>...
> [full quote of the windowed-binning suggestion snipped]

Hi Walter, that sounds like just what I am looking for. I will give it a try in the next day or two and see how it goes.

thanks,
jay