From: "Wall, Steven" on 28 Jan 2010 11:54

SAS-L:

A request from a co-worker is included below regarding a bootstrapping approach to clustering. I have a tentative solution that I think is working correctly, but I would feel better if I could discuss the problem with others who have done something similar.

In short(?), my approach was to:
- create an alternate version of the original input dataset by sampling it with replacement
- run that alternate version through PROC CLUSTER and PROC TREE
- post-process the output from PROC TREE to create unique text-string identifiers for each cluster
- store that result
- rinse and repeat n-thousand times

I then had a follow-up program to summarize the n-thousand samples and count how many times each cluster appeared. If a cluster appeared > 50% of the time, we labeled the node on the original cluster with the percentage.

If the following sounds familiar and you are willing to share your work, please let me know.

Thanks.
Steve

Original request:

In a nutshell, bootstrapping a dendrogram involves the creation of any number of data sets based on the original allele data. The idea is to give one confidence in how the GE's cluster. Generally, to have 95% confidence in a cluster you would bootstrap the data set 2000 times. So this means you create 2000 data sets (allele data) by replacement bootstrapping (my understanding is that you conduct random draws of loci across GE's until you've selected the same number of loci that occurred in the original data set. The replacement concept is key because each locus would have the same opportunity to be drawn each time, i.e. a locus could occur in the same data set multiple times). Next you would create 2000 distance matrices and subsequently 2000 dendrograms from all this data.
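
For illustration only, the resampling step described above might be sketched as follows. This is Python rather than SAS, and it is not the program from this post: the toy `alleles` table, the function names, and the simple mismatch distance are all assumptions made for the example. Loci are treated as columns and GE's as rows, so each replicate redraws columns with replacement and keeps every GE.

```python
import random

def bootstrap_loci(data, rng):
    """Return one bootstrap replicate: loci (columns) drawn with replacement.

    data maps each GE label to its list of allele scores, one per locus.
    The same drawn column indices are applied to every GE so rows stay aligned.
    """
    n_loci = len(next(iter(data.values())))
    picks = [rng.randrange(n_loci) for _ in range(n_loci)]
    return {ge: [scores[j] for j in picks] for ge, scores in data.items()}

def mismatch_distance(a, b):
    """Fraction of loci at which two GE's differ (an illustrative choice only)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(data):
    """Pairwise distances keyed by (label_1, label_2) with label_1 < label_2."""
    labels = sorted(data)
    return {
        (p, q): mismatch_distance(data[p], data[q])
        for p in labels for q in labels if p < q
    }

# Hypothetical toy data: 4 GE's scored at 6 loci (1 = allele present).
alleles = {
    "GE1": [1, 0, 1, 1, 0, 1],
    "GE2": [1, 0, 1, 0, 0, 1],
    "GE3": [0, 1, 0, 1, 1, 0],
    "GE4": [0, 1, 0, 0, 1, 0],
}

rng = random.Random(2010)  # fixed seed so the run is repeatable
replicates = [bootstrap_loci(alleles, rng) for _ in range(5)]  # 2000 in practice
matrices = [distance_matrix(rep) for rep in replicates]
```

Each entry of `matrices` would then be handed to whatever clustering routine builds the dendrogram for that replicate.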
This is where it might get dicey, and I had to use Phylip to do this, but you have to assess all the dendrograms to determine the number of times specific branches occurred across all 2000 trees. The closer the count is to 2000, the stronger the evidence is that the branching is "real". The output ends up being a "consensus" dendrogram with the % of time each branch occurred.
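
The branch-counting step can be sketched in a few lines. Again this is a hedged Python illustration, not the SAS follow-up program or Phylip's method: it assumes the clusters of each bootstrap dendrogram have already been extracted (for example from PROC TREE output) as sets of GE labels, and the names `cluster_id`, `branch_support`, and the toy trees are hypothetical. Sorting the members before joining them gives each cluster a text-string identifier that does not depend on member order.

```python
from collections import Counter

def cluster_id(members):
    """Order-independent text key for a cluster, e.g. 'A|B'."""
    return "|".join(sorted(members))

def branch_support(trees):
    """Percent of trees in which each cluster (branch) appears."""
    counts = Counter()
    for clusters in trees:
        # Build a set of keys first so a cluster counts once per tree.
        counts.update({cluster_id(c) for c in clusters})
    return {key: 100.0 * n / len(trees) for key, n in counts.items()}

# Hypothetical non-trivial clusters from 4 bootstrap dendrograms over GE's A-D.
boot_trees = [
    [{"A", "B"}, {"C", "D"}],
    [{"A", "B"}, {"A", "B", "C"}],
    [{"A", "B"}, {"C", "D"}],
    [{"A", "C"}, {"A", "C", "D"}],
]
support = branch_support(boot_trees)

# Label only the nodes of the original tree that appear in > 50% of replicates.
original_tree = [{"A", "B"}, {"C", "D"}]
labeled = {
    cluster_id(c): support.get(cluster_id(c), 0.0)
    for c in original_tree
    if support.get(cluster_id(c), 0.0) > 50.0
}
print(labeled)  # {'A|B': 75.0}
```

Here the cluster C|D appears in exactly half of the replicates, so under a strict > 50% rule it is left unlabeled.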