From: oloolo on 6 Feb 2010 17:07

It turns out that the algorithm behind the Gap Statistic doesn't require much time for the resampling. The sufficient statistic for the resampling is the min/max of each variable. For method 1, you just sample 10000 uniformly distributed r.v. N times on the range [min, max]. Method 2 takes one more step: apply the linear transformation given by the right eigenvector matrix V from the SVD to this sample.

So for method 1, the sampling time is linear in Nobs \times N samples. For method 2, an extra SVD of the original data is needed, which may be a problem for high-dimensional data.

But all in all, from my own experience, I think the Gap Statistic provides a fairly consistent and accurate estimate of the true underlying clusters.

On Wed, 15 Jun 2005 17:38:29 -0700, David L. Cassell <cassell.david(a)EPAMAIL.EPA.GOV> wrote:

>A Little Birdie(tm) chirped to me:
>> After seeing this thread, I looked up the Gap statistic (I haven't
>> done a lot of clustering in the last 15 years), and lo and behold,
>> the Gap statistic is based on exactly the same simulation model as
>> the CCC. The CCC is more or less a numerical approximation to the Gap
>> statistic applied to the R-squared statistic from the cluster analysis.
>> So if you are interested in R-squared or the within-cluster sums of
>> squares or any of numerous other equivalent statistics, the CCC saves
>> you the bother of doing the simulations.
>
>Thanks!
>
>I did wonder how the original poster planned to take the time for
>all the resampling necessary for an arbitrary number of clusters
>on 10,000 points...
>
>David
>--
>David Cassell, CSC
>Cassell.David(a)epa.gov
>Senior computing specialist
>mathematical statistician
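
As a rough illustration of the "method 1" resampling described in the reply above, here is a minimal Python sketch. It assumes the data are a list of numeric rows. The function names (`column_ranges`, `reference_sample`, `reference_samples`) are hypothetical and not taken from any package. Once the per-variable min/max are stored, the original data are no longer needed, and the cost grows with Nobs times the number of reference samples. Method 2 (the SVD rotation) is not shown.

```python
import random


def column_ranges(data):
    """Return (min, max) for each variable (column) of a list of rows."""
    return [(min(col), max(col)) for col in zip(*data)]


def reference_sample(ranges, n_obs, rng):
    """Draw n_obs points, each coordinate uniform on that variable's [min, max]."""
    return [[rng.uniform(lo, hi) for lo, hi in ranges] for _ in range(n_obs)]


def reference_samples(data, n_samples, seed=0):
    """Build n_samples reference data sets with as many rows as the original data.

    Only the per-variable ranges are used, so the work is proportional to
    (number of observations) x (number of variables) x n_samples.
    """
    ranges = column_ranges(data)
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return [reference_sample(ranges, len(data), rng) for _ in range(n_samples)]


if __name__ == "__main__":
    toy = [[0.0, 10.0], [1.0, 20.0], [0.5, 15.0]]  # made-up toy data
    refs = reference_samples(toy, n_samples=4)
    print(len(refs), len(refs[0]), len(refs[0][0]))
```

Each reference data set would then be clustered the same way as the real data, and its within-cluster dispersion compared with the observed one. That step is left out here.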