From: oloolo on
It turns out that the algorithm behind the Gap statistic doesn't require much
time for the resampling.
The sufficient statistic for the resampling is the min/max of each
variable. For method 1, you just draw 10000 uniformly distributed random
values per variable, N times, on the range [min, max]. Method 2 needs one
more step: the ranges are taken on the data rotated by the right singular
vector matrix V from the SVD, and the uniform sample is mapped back with V'.

So for method 1, the sampling time is linear in Nobs x N samples.
For method 2, an extra SVD of the original data is needed, which may be a
problem for high-dimensional data.
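
A minimal sketch of the two sampling schemes, in Python with NumPy. The
function names, the (N, Nobs, Nvars) array layout, and adding the column
means back in method 2 are my own assumptions for illustration, not anything
taken from a particular package:

```python
import numpy as np

def reference_uniform(X, n_ref, rng):
    """Method 1: uniform draws over the [min, max] range of each variable."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Result has shape (n_ref, n_obs, n_vars): n_ref reference sets,
    # each the same size as X. Cost is linear in n_ref * n_obs * n_vars.
    return rng.uniform(lo, hi, size=(n_ref,) + X.shape)

def reference_svd(X, n_ref, rng):
    """Method 2: uniform draws in the SVD-rotated coordinates, mapped back."""
    mean = X.mean(axis=0)
    Xc = X - mean  # column-center before the SVD
    # Xc = U @ diag(s) @ Vt; the rows of Vt are the right singular vectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Xrot = Xc @ Vt.T  # data in the rotated coordinates
    Zrot = reference_uniform(Xrot, n_ref, rng)
    # Rotate back; re-adding the means is an assumption so that the
    # reference sets sit on top of the original data.
    return Zrot @ Vt + mean
```

Each reference set would then be clustered the same way as the real data;
with something like rng = np.random.default_rng(0) and N = 20, both calls
return an array of 20 reference data sets.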

But all in all, from my own experience, I think the Gap statistic provides a
fairly consistent and accurate estimate of the true number of underlying
clusters.


On Wed, 15 Jun 2005 17:38:29 -0700, David L. Cassell
<cassell.david(a)EPAMAIL.EPA.GOV> wrote:

>A Little Birdie(tm) chirped to me:
>> After seeing this thread, I looked up the Gap statistic (I haven't
>> done a lot of clustering in the last 15 years), and lo and behold,
>> the Gap statistic is based on exactly the same simulation model as
>> the CCC. The CCC is more or less a numerical approximation to the Gap
>> statistic applied to the R-squared statistic from the cluster analysis.
>> So if you are interested in R-squared or the within-cluster sums of
>> squares or any of numerous other equivalent statistics, the CCC saves
>> you the bother of doing the simulations.
>
>Thanks!
>
>I did wonder how the original poster planned to take the time for
>all the resampling necessary for an arbitrary number of clusters
>on 10,000 points...
>
>David
>--
>David Cassell, CSC
>Cassell.David(a)epa.gov
>Senior computing specialist
>mathematical statistician