Bootstrapping a dendrogram? [SAS]


From: "Wall, Steven" on 28 Jan 2010 11:54

SAS-L:

A request from a co-worker is included below regarding a bootstrapping approach to clustering. I have a tentative solution that I think is working correctly, but I would feel better if I could discuss the problem with others who have done something similar.

In short(?), my approach was to:

- create an alternate version of the original input dataset by sampling it with replacement
- run that alternate version through PROC CLUSTER and PROC TREE
- post-process the output from PROC TREE to create unique text-string identifiers for each cluster
- store that result
- rinse and repeat n-thousand times
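
The clustering and identifier steps above can be sketched outside SAS. The following is a minimal Python illustration, not the program described in this post: it assumes single-linkage clustering (the PROC CLUSTER method actually used is not stated), and the function names, the `|` separator, and the toy distances are all hypothetical. The point it shows is that sorting the member labels before joining them gives each cluster the same text identifier no matter which bootstrap sample produced it.

```python
from itertools import combinations

def single_linkage_clusters(names, dist):
    """Return every cluster formed during single-linkage agglomeration.

    names: list of item labels.
    dist:  dict mapping frozenset({a, b}) -> distance between items a and b.
    Each returned cluster is a frozenset of the labels it contains.
    """
    active = [frozenset([n]) for n in names]
    formed = []
    while len(active) > 1:
        # Merge the two active clusters whose closest members are nearest.
        a, b = min(
            combinations(active, 2),
            key=lambda pair: min(
                dist[frozenset([x, y])] for x in pair[0] for y in pair[1]
            ),
        )
        merged = a | b
        active = [c for c in active if c not in (a, b)] + [merged]
        formed.append(merged)
    return formed

def cluster_id(cluster):
    """Canonical text identifier: member labels sorted and joined with '|'."""
    return "|".join(sorted(cluster))

# Toy distances between four hypothetical items.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(["A", "B"]): 1.0,
    frozenset(["A", "C"]): 4.0,
    frozenset(["A", "D"]): 5.0,
    frozenset(["B", "C"]): 3.0,
    frozenset(["B", "D"]): 6.0,
    frozenset(["C", "D"]): 2.0,
}
ids = [cluster_id(c) for c in single_linkage_clusters(names, dist)]
print(ids)
```

Storing the identifiers for each replicate, rather than the tree itself, keeps the later comparison across samples down to simple string matching.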

I then had a follow-up program to summarize the n-thousand samples and count how many times each cluster appeared. If a cluster appeared in more than 50% of the samples, we labeled the corresponding node on the original cluster tree with that percentage.
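
The summary step can be sketched as a simple tally. This is an illustrative Python version under stated assumptions, not the follow-up program itself: `support_table` and the four toy replicates are hypothetical, and each replicate is assumed to be a list of cluster identifier strings. The 50% cutoff is taken from the description above and applied as a strict "greater than".

```python
from collections import Counter

def support_table(replicates, threshold=0.5):
    """Percentage of replicates in which each cluster identifier appears.

    replicates: list of collections of identifiers, one per bootstrap sample.
    Only identifiers appearing in more than `threshold` of replicates are kept.
    """
    counts = Counter()
    for ids in replicates:
        counts.update(set(ids))  # count a cluster at most once per replicate
    n = len(replicates)
    return {cid: 100.0 * k / n for cid, k in counts.items() if k / n > threshold}

# Four hypothetical replicates of cluster identifiers.
replicates = [
    ["A|B", "C|D", "A|B|C|D"],
    ["A|B", "A|B|C", "A|B|C|D"],
    ["A|B", "C|D", "A|B|C|D"],
    ["B|C", "A|B|C", "A|B|C|D"],
]
support = support_table(replicates)
print(support)
```

In this toy input `C|D` appears in exactly half of the replicates, so it is dropped by the strict cutoff, while `A|B` is kept.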

If you think the following sounds familiar and are willing to share your work, please let me know.

Thanks.
Steve

Original request:
In a nutshell, bootstrapping a dendrogram involves creating any number of data sets based on the original allele data. The idea is to give one confidence in how the GEs cluster. Generally, to have 95% confidence in a cluster you would bootstrap the data set 2000 times. So this means you create 2000 data sets (allele data) by replacement bootstrapping. My understanding is that you conduct random draws of loci across GEs until you've selected the same number of loci that occurred in the original data set. The replacement concept is key because each locus has the same opportunity to be drawn each time, i.e. a locus could occur in the same data set multiple times. Next you would create 2000 distance matrices and subsequently 2000 dendrograms from all this data. This is where it might get dicey, and I had to use Phylip to do this, but you have to assess all the dendrograms to determine the number of times specific branches occurred across all 2000 trees. The closer the count is to 2000, the stronger the evidence is that the branching is "real". The output ends up being a "consensus" dendrogram with the % of time each branch occurred.
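
The locus resampling and distance-matrix steps in the request can be sketched as follows. This is a hedged Python illustration, not Phylip's procedure: it assumes the allele data are laid out with one row per GE and one column per locus, and it uses a simple proportion-of-mismatches distance purely as a stand-in, since the request does not name a distance measure. The names `resample_loci`, `mismatch_distance`, the GE labels, and the 0/1 allele codes are hypothetical.

```python
import random

def resample_loci(alleles, rng):
    """Draw locus columns with replacement until the original count is reached.

    alleles: dict mapping each GE label to its list of allele codes by locus.
    Returns the resampled data and the list of column indices that were drawn.
    """
    n_loci = len(next(iter(alleles.values())))
    picks = [rng.randrange(n_loci) for _ in range(n_loci)]
    # The same columns are used for every GE, so rows stay comparable.
    return {ge: [row[j] for j in picks] for ge, row in alleles.items()}, picks

def mismatch_distance(alleles):
    """Proportion of loci at which two GEs differ (an assumed, simple distance)."""
    names = sorted(alleles)
    n_loci = len(alleles[names[0]])
    return {
        frozenset([a, b]): sum(x != y for x, y in zip(alleles[a], alleles[b])) / n_loci
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Toy allele data: three hypothetical GEs scored at five loci.
alleles = {
    "GE1": [1, 0, 1, 1, 0],
    "GE2": [1, 0, 0, 1, 0],
    "GE3": [0, 1, 0, 0, 1],
}
rng = random.Random(2010)  # fixed seed so the sketch is repeatable
boot, picks = resample_loci(alleles, rng)
dist = mismatch_distance(boot)
print(picks, dist)
```

Repeating the last three lines 2000 times, with one clustering per distance matrix, gives the set of trees whose branches are then counted.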
