clustering 'locally' for large data set? [Matlab]

Prev: help about segmentation!
Next: quad2d and singularities

From: Ting Su on 27 May 2010 10:50

Jay,
If hierarchical clustrering is not necessary what you need, you can try
function "kmeans" in the MATLAB Statistics toolbox. Function "kmeans" can be
used for data sets with much larger number of observations than the current
implementation of "clusterdata".

-Ting

"jay vaughan" <jvaughan5.nospam(a)gmail.com> wrote in message
news:hthju0$22m$1(a)fred.mathworks.com...
> Hi Folks,
>
> I have a list of 10^3 to 10^6 positions that I would like to use for
> cluster analysis. I have been using the function clusterdata...
>
> T = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
> distance_threshold,'criterion','distance');
>
> Not surprisingly, the analysis doesn't work beyond around 10^4 points, I
> suspect because of memory limitations in calculating pairwise distances
> between so many points.
>
> I am wondering if there is another way to do this, especially considering
> that I do not need the whole cluster heierarchy, but only 'local
> clustering' within a radius that would only ever include a very small
> number of points of the whole data.
>
> Any suggestions?
>
> Thanks,
> jay
>
> I am running Windows XP and R2010a and have access to the toolboxes.
>

From: jay vaughan on 28 May 2010 12:12

Ting,

thanks for the suggestion. I read up on kmeans clustering but don't think it is right for my job. I don't actually know how many clusters I have, and if I understood the description correctly, kmeans uses number of clusters as an input.

best regards,
jay

"Ting Su" <Ting.Su(a)mathworks.com> wrote in message <htm0r8$cq6$1(a)fred.mathworks.com>...
> Jay,
> If hierarchical clustrering is not necessary what you need, you can try
> function "kmeans" in the MATLAB Statistics toolbox. Function "kmeans" can be
> used for data sets with much larger number of observations than the current
> implementation of "clusterdata".
>
> -Ting
>
> "jay vaughan" <jvaughan5.nospam(a)gmail.com> wrote in message
> news:hthju0$22m$1(a)fred.mathworks.com...
> > Hi Folks,
> >
> > I have a list of 10^3 to 10^6 positions that I would like to use for
> > cluster analysis. I have been using the function clusterdata...
> >
> > T = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
> > distance_threshold,'criterion','distance');
> >
> > Not surprisingly, the analysis doesn't work beyond around 10^4 points, I
> > suspect because of memory limitations in calculating pairwise distances
> > between so many points.
> >
> > I am wondering if there is another way to do this, especially considering
> > that I do not need the whole cluster heierarchy, but only 'local
> > clustering' within a radius that would only ever include a very small
> > number of points of the whole data.
> >
> > Any suggestions?
> >
> > Thanks,
> > jay
> >
> > I am running Windows XP and R2010a and have access to the toolboxes.
> >
>

From: jay vaughan on 28 May 2010 12:40

Hi Walter,

thanks again for the tip. Although I am anything but an ace programmer, I managed to implement what you described for 1D slicing or 2D tiling of large numbers of points. Now I can get back to my research.

For ~500 points, slicing or tiling were comparable or a little slower than just using clusterdata. For ~5000 points, slicing or tiling were ~5x faster than clusterdata. For ~50,000 points, clusterdata failed, and slicing was ~2x faster than tiling for some reason. For ~500,000 points, tiling was quite slow (5 min) but was the only routine that didn't have a memory overload. If clusters were 'too large' relative to the window, some points were lost, although these appear as zeros in the cluster_assignment output so the user can be aware of it.

I included a testing routine and the slicing/tiling cluster functions below in case you or others want checking it out. Due to the lack of generality (e.g. it is hard coded for 2-dimensional data and a certain clustering routine) I didn't think it was worthy of the file exchange.

best regards,
jay

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all

%% input parameters
num_seed_pts = 100;
max_num_pts_per_cluster = 200;
cluster_spread = 0.002;

num_slices_per_dim = 20;
clust_dist_thr = cluster_spread;

%% generate test 2D cluster data and view it
seed_pts = [rand(num_seed_pts,1) rand(num_seed_pts,2)];
num_pts_per_cluster = round(max_num_pts_per_cluster*rand(num_seed_pts,1));
X = [];
ctr = 1;
for k = 1:num_seed_pts
x = seed_pts(k,1) + cluster_spread*rand(num_pts_per_cluster(k),1);
y = seed_pts(k,2) + cluster_spread*rand(num_pts_per_cluster(k),1);
X(ctr:ctr+num_pts_per_cluster(k)-1,:) = [x y];
ctr = ctr + num_pts_per_cluster(k);
end

figure(1)
plot(X(:,1),X(:,2),'.','markersize',1)
xlim([-0.1 1.1])
ylim([-0.1 1.1])

clear num_seed_pts seed_pts max_num_pts_per_cluster num_pts_per_clutser ctr k x y num_pts_per_cluster

%% do the cluster analysis
try % conventional (unsliced/untiled) cluster analysis
tic
cluster_assignments1 = clusterdata(X,'distance','euclid','linkage','single','cutoff',...
clust_dist_thr,'criterion','distance');
conv_time = toc;
catch
conv_time = 0;
cluster_assignments1 = zeros(size(X,1),1);
end

try % 1D sliced cluster analysis
tic
cluster_assignments2 = clusterdata_slices(X,num_slices_per_dim,clust_dist_thr);
clust_1Dslices_time = toc;
catch
clust_1Dslices_time = 0;
cluster_assignments2 = zeros(size(X,1),1);
end

try % 2D, tiled cluster analysis
tic
cluster_assignments3 = clusterdata_tiles(X,num_slices_per_dim,clust_dist_thr);
clust_2Dslices_time = toc;
catch
clust_2Dslices_time = 0;
cluster_assignments3 = zeros(size(X,1),1);
end

%% results
disp(['number of input points=' num2str(size(X,1)) '; slices per dim=' num2str(num_slices_per_dim)])
disp(['conventional time=' num2str(conv_time) 's; ' ...
'1D slices time=' num2str(clust_1Dslices_time) 's; ' ...
'2D tiling time=' num2str(clust_2Dslices_time) 's'])
disp(['clusterdata lost points=' num2str(sum(cluster_assignments1==0)) '; ' ...
'1D slices lost points=' num2str(sum(cluster_assignments2==0)) '; ' ...
'2D tiles lost points=' num2str(sum(cluster_assignments3==0))])
disp([' '])

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function cluster_assignments = clusterdata_slices(X,num_slices,clust_dist_thr)

% The function clusterdata_tiles can perform cluster analysis on 2D data. It
% is based on clusterdata from the statistical toolbox but can handle larger
% input matrices. Note that the entire cluster heierarchy is not returned,
% but instead, the bottom level clusters based on a distance threshold provided
% by the user. num_slices should be chosen to be large enough that no one slice
% has so many points that it will overload clusterdata, but not so large that a
% slice is smaller than 4x the size of the largest expected cluster.
%
% The data is divided into num_slices slices. All clusters within a hard
% threshold of clust_dist_thr are identified within the slice but
% excluding clusters whose center positions are within 1/4 the height of
% a slice from the leading edge. The routine then advances by half a slice
% in order to pick up clusters from the edge of the previously analyzed
% region as well as clusters within the new region. The process is repeated.
%
% Unassigned points may occur when the cluster size is larger than 1/4 of a
% slice's height. num_unassigned_points = sum(cluster_assignments==0).
%
% SYNTAX: cluster_assignments = clusterdata_slices(X,num_slices,clust_dist_thr)
% X; % input from user, Nx2 array containing 2D position information
% num_slices = 5; % input from user, number of slices
% clust_dist_thr = 0.05;

% code starts here
num_grid_pieces = num_slices*2;
ygrid = linspace(min(X(:,2)),1.1*max(X(:,2)),num_grid_pieces+1);
grid_inc = ygrid(2)-ygrid(1);

remaining_X_inds = 1:size(X,1);
cluster_assignments = zeros(size(X,1),1);
ctr = 1;

% damn all these indices are confusing!
for m = 1:num_grid_pieces-1
[n,bins_remaining_points] = histc(X(remaining_X_inds,2),ygrid);
remaining_points_inds_in_curr_bin = find(bins_remaining_points==m | bins_remaining_points==m+1);
if length(remaining_points_inds_in_curr_bin)>1
Xsubset = X(remaining_X_inds(remaining_points_inds_in_curr_bin),:);
T2 = clusterdata(Xsubset,'distance','euclid','linkage','single','cutoff',...
clust_dist_thr,'criterion','distance');
inds_to_dump = [];
num_clusters = max(T2);
for k = 1:num_clusters
Xsubset_cluster_inds = find(T2(:)==k);

% if the clutster is not at the edge of the bin, keep the
% cluster and remove the points from the master list
if or(mean(Xsubset(Xsubset_cluster_inds,2))<ygrid(m+2)-0.5*grid_inc,m==num_grid_pieces-1)
clust_inds = remaining_X_inds(remaining_points_inds_in_curr_bin(Xsubset_cluster_inds));
inds_to_dump = union(inds_to_dump,clust_inds);
cluster_assignments(clust_inds) = ctr;
ctr = ctr + 1;
end
end
remaining_X_inds = setdiff(remaining_X_inds,inds_to_dump);
end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function cluster_assignments = clusterdata_tiles(X,num_slices_per_dim,clust_dist_thr)

% The function clusterdata_tiles can perform cluster analysis on 2D data. It
% is based on clusterdata from the statistical toolbox and clusterdata_slices
% but can handle larger input matrices. Note that the entire cluster heierarchy
% is not returned, but instead, the bottom level clusters based on a distance
% threshold provided by the user. num_slices_per_dim should be chosen to be
% large enough that no one slice has so many points that it will overload
% clusterdata, but not so large that half a slice is smaller than 4x the size
% of the largest expected cluster.
%
% The data is divided into num_slices_per_dim^2 tiles. All clusters within a
% hard threshold of clust_dist_thr are identified within the tile but
% excluding clusters whose center positions are within 1/4 the of a tile from
% the leading edges. The routine then advances by half a tile in order to pick
% up clusters from the edge of the previously analyzed region as well as clusters
% within the new region. The process is repeated.
%
% Unassigned points may occur when the cluster size is larger than 1/4 of a
% tile's width/height. num_unassigned_points = sum(cluster_assignments==0).
%
% SYNTAX: cluster_assignments = clusterdata_slices(X,num_slices_per_dim,clust_dist_thr)
% X; % input from user, Nx2 array containing 2D position information
% num_slices_per_dim = 5; % input from user, number of slices
% clust_dist_thr = 0.05;

% code starts here
num_grid_pieces = num_slices_per_dim*2;
xgrid = linspace(min(X(:,1)),1.1*max(X(:,1)),num_grid_pieces+1);
grid_inc = xgrid(2)-xgrid(1);

remaining_X_inds = 1:size(X,1);
cluster_assignments = zeros(size(X,1),1);
ctr = 1;

h=waitbar(0,'tile clustering in progress');
% damn all these indices are confusing!
try
for m = 1:num_grid_pieces-1
[n,bins_remaining_points] = histc(X(remaining_X_inds,1),xgrid);
remaining_points_inds_in_curr_bin = find(bins_remaining_points==m | bins_remaining_points==m+1);
if length(remaining_points_inds_in_curr_bin)>1
Xsubset = X(remaining_X_inds(remaining_points_inds_in_curr_bin),:);
T2 = clusterdata_slices(Xsubset,num_slices_per_dim,clust_dist_thr);
inds_to_dump = [];
num_clusters = max(T2);
for k = 1:num_clusters
Xsubset_cluster_inds = find(T2(:)==k);

% if the clutster is not at the edge of the bin, keep the
% cluster and remove the points from the master list
if or(mean(Xsubset(Xsubset_cluster_inds,1))<xgrid(m+2)-0.5*grid_inc,m==num_grid_pieces-1)
clust_inds = remaining_X_inds(remaining_points_inds_in_curr_bin(Xsubset_cluster_inds));
inds_to_dump = union(inds_to_dump,clust_inds);
cluster_assignments(clust_inds) = ctr;
ctr = ctr + 1;
end
end
remaining_X_inds = setdiff(remaining_X_inds,inds_to_dump);
end
waitbar(m/(num_grid_pieces-1),h,['tile clustering in progress']);
end
catch
end
close(h)

From: Walter Roberson on 28 May 2010 13:41

jay vaughan wrote:

Good that you got it going. I haven't had time to analyze the code. The
boundary conditions and amount you advance the slice or tile sounds a
bit wrong to me, but I'd have to think about it more.

> seed_pts = [rand(num_seed_pts,1) rand(num_seed_pts,2)];

As a trivial note compared to what you have done: the above line is
equivalent to

seed_pts = rand(num_seed_pts, 3);

which would be more efficient and less likely to run out of memory.

From: jay vaughan on 28 May 2010 14:35

Walter Roberson <roberson(a)hushmail.com> wrote in message <4fTLn.22587$Gx2.3077(a)newsfe20.iad>...
> jay vaughan wrote:
>
> Good that you got it going. I haven't had time to analyze the code. The
> boundary conditions and amount you advance the slice or tile sounds a
> bit wrong to me, but I'd have to think about it more.

yeah, i was lazy and just set the boundary to be 1/4 of the slice thickness and advanced by 1/2 the slice thickness. i agree it's not optimized, especially when the slice thickness is much much greater than the maximum expected cluster size. in that case i think it would be beneficial to have a smaller boundary and move the window by more than 1/2 the slice thickness. in the limit of an infinitesimal boundary region, there would be up to a 50% reduction in the number of slices per dimension. i also wondered about advancing the window to the next unclaimed point, or something like that, but saw a few issues and just went for the easier-to-code route.

best regards,
jay

First | Prev |
Pages: 1 2
Prev: help about segmentation!
Next: quad2d and singularities