From: Misha Koshelev on 19 Jan 2010 18:23 Dear Sirs or Madams: I am trying to find and/or implement the Quality Threshold clustering algorithm in MATLAB: http://en.wikipedia.org/wiki/Cluster_analysis#QT_clustering_algorithm Actually, any clustering algorithm that results in the same clustering given initial inputs is good for me. Something easy & fast would be best. Any tips much appreciated. Thank you! Sincerely yours Misha Koshelev
From: misha680 on 19 Jan 2010 18:50 Here is what I have found so far: R has a version http://rss.acs.unt.edu/Rdoc/library/flexclust/html/qtclust.html Original paper: http://genome.cshlp.org/content/9/11/1106.full with this step-by-step diagram http://genome.cshlp.org/content/9/11/1106/F5.large.jpg Perl implementation: http://stackoverflow.com/questions/1973109/how-can-i-implement-qt-quality-threshold-clustering-in-perl At the moment I am a little confused on how to compute a cluster diameter...? Thank you Sincerely yours Misha Koshelev On Jan 19, 5:23 pm, "Misha Koshelev" <mk144...(a)bcm.edu> wrote: > Dear Sirs or Madams: > > I am trying to find and/or implement the Quality Threshold clustering algorithm in MATLAB:http://en.wikipedia.org/wiki/Cluster_analysis#QT_clustering_algorithm > > Actually, any clustering algorithm that results in the same clustering given initial inputs is good for me. > > Something easy & fast would be best. > > Any tips much appreciated. > > Thank you! > > Sincerely yours > Misha Koshelev
From: Misha Koshelev on 20 Jan 2010 17:14 misha680 <misha680(a)gmail.com> wrote in message <6b2d2873-b332-4160-87a3-af2bc7cbd471(a)34g2000yqp.googlegroups.com>... > Here is what I have found so far: > R has a version > http://rss.acs.unt.edu/Rdoc/library/flexclust/html/qtclust.html > > Original paper: > http://genome.cshlp.org/content/9/11/1106.full > with this step-by-step diagram > http://genome.cshlp.org/content/9/11/1106/F5.large.jpg > > Perl implementation: > http://stackoverflow.com/questions/1973109/how-can-i-implement-qt-quality-threshold-clustering-in-perl > > At the moment I am a little confused on how to compute a cluster > diameter...? > > Thank you > Sincerely yours > Misha Koshelev > > On Jan 19, 5:23 pm, "Misha Koshelev" <mk144...(a)bcm.edu> wrote: > > Dear Sirs or Madams: > > > > I am trying to find and/or implement the Quality Threshold clustering algorithm in MATLAB:http://en.wikipedia.org/wiki/Cluster_analysis#QT_clustering_algorithm > > > > Actually, any clustering algorithm that results in the same clustering given initial inputs is good for me. > > > > Something easy & fast would be best. > > > > Any tips much appreciated. > > > > Thank you! > > > > Sincerely yours > > Misha Koshelev Ok here's my implementation using Euclidean distance: function idx=qtclusteuclid(G,d,D) % QT clustering algorithm as described in: % % Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression % data: Identification and analysis of coexpressed genes. Genome Research % 9, 1106–1115. % % http://genome.cshlp.org/content/9/11/1106.full % http://genome.cshlp.org/content/9/11/1106/F5.large.jpg % % if two sets A{i} have same cardinality, we pick one with smallest diameter % our distance measure is Euclidean distance % % input: % G-nxp data to cluster % d-diameter threshold % D-Euclidean distance for all 0<i<j<=n % % output: % idx-nx1 vector containing cluster indices % % Misha Koshelev % January 20th, 2009 % Montague Laboratory n=size(G,1); if n<=1 idx=ones(n,1); return; end if nargin<3 D=Inf*ones(n,n); for i=1:n D(i,i)=0; for j=i+1:n D(i,j)=sqrt(sum((G(i,:)-G(j,:)).^2));D(j,i)=D(i,j); end end end C=[];Ccard=0;Cdiam=0; for i=1:n flag=true; A=[i];Acard=1;Adiam=0; while flag&&length(A)<n pts=1:n;pts(A)=[]; jdiam=zeros(length(pts),1); for pidx=1:length(pts) % We only need to compute maximal distance from new point j to all % existing points in cluster jdiam(pidx)=max(D(pts(pidx),A)); end [minjdiam,pidx]=min(jdiam);j=pts(pidx); if sum(jdiam==minjdiam)>1 dbstack;keyboard; end if max(Adiam,minjdiam)>d flag=false; else A=[A,j]; Acard=Acard+1; Adiam=max(Adiam,minjdiam); end end A=sort(A); if Acard>Ccard C=A; Ccard=Acard; Cdiam=Adiam; end end idx=ones(n,1); GmC=1:n;GmC(C)=[]; idx(GmC)=qtclusteuclid(G(GmC,:),d,D(GmC,GmC))+1; function d=diam(G,clust) % http://thesaurus.maths.org/mmkb/entry.html?action=entryByConcept&id=3279 % The largest distance between any two points in a set is called the set's diameter. % % input: % G-nxp data for our cluster % clust-vector of row indices into G % dsqr=0; for k=1:length(clust) for k2=k:length(clust) dsqr=max(dsqr,sum((G(clust(k),:)-G(clust(k2),:)).^2)); end end d=sqrt(dsqr);
From: Misha Koshelev on 20 Jan 2010 17:20 Oops should be this sorry :( function idx=qtclusteuclid(G,d,D) % QT clustering algorithm as described in: % % Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression % data: Identification and analysis of coexpressed genes. Genome Research % 9, 1106–1115. % % http://genome.cshlp.org/content/9/11/1106.full % http://genome.cshlp.org/content/9/11/1106/F5.large.jpg % % if two sets A{i} have same cardinality, we pick first one % our distance measure is Euclidean distance % % input: % G-nxp data to cluster % d-diameter threshold % D-Euclidean distance for all 0<i<j<=n % % output: % idx-nx1 vector containing cluster indices % % Misha Koshelev % January 20th, 2009 % Montague Laboratory n=size(G,1); if n<=1 idx=ones(n,1); return; end if nargin<3 D=Inf*ones(n,n); for i=1:n D(i,i)=0; for j=i+1:n D(i,j)=sqrt(sum((G(i,:)-G(j,:)).^2));D(j,i)=D(i,j); end end end C=[];Ccard=0;Cdiam=0; for i=1:n flag=true; A=[i];Acard=1;Adiam=0; while flag&&length(A)<n pts=1:n;pts(A)=[]; jdiam=zeros(length(pts),1); for pidx=1:length(pts) % We only need to compute maximal distance from new point j to all % existing points in cluster jdiam(pidx)=max(D(pts(pidx),A)); end [minjdiam,pidx]=min(jdiam);j=pts(pidx); if sum(jdiam==minjdiam)>1 dbstack;keyboard; end if max(Adiam,minjdiam)>d flag=false; else A=[A,j]; Acard=Acard+1; Adiam=max(Adiam,minjdiam); end end A=sort(A); if Acard>Ccard C=A; Ccard=Acard; Cdiam=Adiam; end end idx=ones(n,1); GmC=1:n;GmC(C)=[]; idx(GmC)=qtclusteuclid(G(GmC,:),d,D(GmC,GmC))+1; function d=diam(G,clust) % http://thesaurus.maths.org/mmkb/entry.html?action=entryByConcept&id=3279 % The largest distance between any two points in a set is called the set's diameter. % % input: % G-nxp data for our cluster % clust-vector of row indices into G % dsqr=0; for k=1:length(clust) for k2=k:length(clust) dsqr=max(dsqr,sum((G(clust(k),:)-G(clust(k2),:)).^2)); end end d=sqrt(dsqr);
From: Misha Koshelev on 20 Jan 2010 17:59
Actually here is three different versions: http://people.hnl.bcm.edu/misha/tmp/qtclust.zip I posted to File Exchange. Will take above link down at some point. QT clustering algorithm as described in: Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 9, 1106–1115. http://genome.cshlp.org/content/9/11/1106.full http://genome.cshlp.org/content/9/11/1106/F5.large.jpg qtclusteuclid.m - use Euclidean distance as measure qtclustjacknife.m - use Jackknife Correlation (see article above) qtclustmod.m - use Euclidean distance, modified to always return 2 clusters. Misha |