From: "Nick ." on 11 May 2006 17:50 Dan, I thank you for your explanation. The example you presented solidified for me the BIG PICTURE of what Jonas was/is trying to do. I know D.C. will explain more in the future about this. But for now, I am happy with what I've gotten from you. Thanks. NICK ----- Original Message ----- From: "Nordlund, Dan (DSHS)" To: SAS-L(a)LISTSERV.UGA.EDU Subject: Re: jackknife concept Date: Wed, 10 May 2006 14:16:30 -0700 > -----Original Message----- > From: Nick . [mailto:ni14(a)mail.com] > Sent: Wednesday, May 10, 2006 12:56 PM > To: Nordlund, Dan (DSHS); SAS-L(a)LISTSERV.UGA.EDU > Subject: Re: Re: jackknife concept > > Dan, > > What is the objective of this macro? It will run 20 times, it will give > you statistics on 20 differnt data sets? What is the objective of this > macro for those of us who don't understand what Jonas is trying to > implement and how to interprete? > NICK > > ----- Original Message ----- > From: "Nordlund, Dan (DSHS)" > To: SAS-L(a)LISTSERV.UGA.EDU > Subject: Re: jackknife concept > Date: Wed, 10 May 2006 12:25:43 -0700 > > > > -----Original Message----- > > From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of > Jonas > > Bilenas > > Sent: Wednesday, May 10, 2006 5:16 AM > > To: SAS-L(a)LISTSERV.UGA.EDU > > Subject: Re: jackknife concept > > > > On Tue, 9 May 2006 17:49:53 -0400, Luo, Peter wrote: > > > > >David, for what Jonas was trying to do, i.e. to get some 'error' > > estimates > > >for model predictors, is N sub-samples or N bootstrapping samples the > > better > > >method? > > > > > I modified the code a bit, based on suggestions from David. Similar but > > different results: <<>> > Jonas, > > you haven't incorporated one of the most important suggestions that David > made, which is to use BY processing in Proc Logistic. That will eliminate > having to continually open and close the file of bootstrap samples, and > the > file will only have to be read through once. Remove the %DO loop and > replace the where statement with a BY statement. You can also eliminate > the > Proc Transpose and the Proc Append. Something like the following (I'm not > sure where the macro variable &ivs is defined) : > > %macro boot(iter); > proc surveyselect data=reg out=outdata > rep=&ITER method=urs samprate=1 outhits; > run; > > ods listing close; > ods output ParameterEstimates=bout; > > proc logistic data=outdata; > by replicate; > model bad=&ivs; > run; > > ods output close; > ods listing; > > > proc means data=bout mean min max std n nmiss; > class variable; > var estimate; > output out=estimate_summary; > run; > %mend; > > %boot(20); > > Hope this is helpful, > > Dan > Nick, I haven't got the time, space, or probably even the skill to adequately explain bootstrapping, but I will try to briefly respond. I am sure that David will be only too kind to correct me if I go to far astray. :-) First the fact that I used a macro here was simply because I was responding to what had been written. Unless I was going to try to create a much more general boot macro with many parameters for flexibility (and I wouldn't because it's already been done) I would just write the basic code here with the number of replications hard coded. I am oversimplifying here, but bootstrapping is based on the assumption that your original sample is representative of the population from which it is drawn. 
So sampling with replacement from your original sample will produce a sample similar to what you could have gotten if you took a new sample from the parent population. Now take many bootstrap samples and compute a desired statistic, say the mean, on each one. Then you can empirically estimate what the sampling distribution of the statistic is, rather than assuming that the distribution is normal (or some other distribution) and estimating the standard errors using standard formulas. Bootstrapping can also be useful in those situations where you don't have an analytical solution for the standard error of your statistic.

Here is a toy example logistic regression that you could play with.

** create sample data;
data test;
  do i=1 to 100;
    y = i GT 50;
    x0 = i + 20*normal(1234);
    x1 = uniform(4321) > .5;
    x2 = normal(1234);
    x3 = normal(1234);
    output;
  end;
run;

/** run your initial logistic regression. It might be instructive to
    compare the mean of the bootstrap sample estimated coefficients
    (below) to the estimates here **/
proc logistic data=test;
  model y = x0 x1 x2 x3;
run;

/** create 20 bootstrap samples; in real life you would probably
    want many more **/
proc surveyselect data=test out=outdata
  rep=20 method=urs samprate=1 outhits;
run;

ods listing close;
ods output ParameterEstimates=bout;

/** run your analysis using BY processing; the ODS OUTPUT statement
    will collect the 20 sets of regression coefficients into one
    data set, bout **/
proc logistic data=outdata;
  by replicate;
  model y = x0 x1 x2 x3;
run;

ods output close;
ods listing;

/** compute the mean and std. dev. of the 20 regression coefficients
    for each variable. The std. dev. is *an* estimate of the standard
    error of the original regression coefficients. You might then use
    these standard errors (or an empirically estimated confidence
    interval) to assess whether your estimated coefficients are
    different from zero **/
proc means data=bout nway mean min max std n nmiss;
  class variable;
  var estimate;
  output out=estimate_summary;
run;

I hope this description has not been too far off the mark. Do not go out and try to bootstrap your own estimates using this partial, simplified explanation. I haven't dealt with a whole host of issues, including but not limited to things like bias estimation and whether you should be resampling cases or residuals.

I hope this has been helpful for following this discussion thread,

Dan

Daniel J. Nordlund
Research and Data Analysis
Washington State Department of Social and Health Services
Olympia, WA 98504-5204
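[A minimal sketch of the "empirically estimated confidence interval" step Dan mentions but doesn't code. This is not from the original post; it assumes the BOUT data set created above and uses PROC UNIVARIATE's standard PCTLPTS/PCTLPRE output options. The data set name BOOT_CI and the CI_ prefix are illustrative choices.]

/** sort so PROC UNIVARIATE can process each coefficient separately;
    BOUT arrives sorted by replicate, not by variable **/
proc sort data=bout;
  by variable;
run;

/** empirical 95% percentile interval for each coefficient: the 2.5th
    and 97.5th percentiles of the bootstrap estimates. PCTLPRE=ci_
    names the output variables ci_2_5 and ci_97_5 **/
proc univariate data=bout noprint;
  by variable;
  var estimate;
  output out=boot_ci mean=boot_mean pctlpts=2.5 97.5 pctlpre=ci_;
run;

/** a coefficient whose interval excludes zero is "significant" in the
    naive percentile-bootstrap sense (with only 20 replicates these
    tails are very rough; use many more replicates in practice) **/
proc print data=boot_ci noobs;
run;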
From: David L Cassell on 12 May 2006 00:16

Peter replied:
> >>> David L Cassell <davidlcassell(a)MSN.COM> 5/9/2006 2:20 pm >>> wrote
> <<<
> [2] No, they are not concerned with regression only. And they
> make fundamental implicit assumptions that no one bothers to
> warn you about, so they are not applicable everywhere. Don't
> use the naive bootstrap or jackknife on time series data, or
> sample survey data, or ...
> >>>
>
> and also not to estimate maxima or minima, or extreme quantiles.
> And be careful with things like factor analysis, where a big problem is
> that the signs of the factors are arbitrary, and averaging can lead to
> odd results (THAT error cost me a lot of hours to find and correct).

Etc., etc. Bootstrapping is a nice tool, but the people who sell it as THE SOLUTION TO ALL KNOWN STAT PROBLEMS are really over-hyping the product. Oh yeah, and it also makes hundreds of julienne fries. :-)

> <<<
> [3] Can they be used in prediction intervals? Yes. Should they?
> No. Why not? Because in simple linear regression, you already have
> a nice, linear estimation form for your prediction intervals.
> Bootstrapping an already-linearized estimate is about as useful as
> changing a tire that isn't flat.
> >>>
>
> OK, as I know David knows, this is fine if the assumptions of the model
> are met. But I was under the impression that bootstrapping can deal
> with some fairly serious violations of those assumptions. Am I
> incorrect?

<Ed McMahon>You are correct, SIR!</Ed McMahon>

Actually, a Little Birdie(tm) pecked me on the head about this, reminding me that the OLS assumptions are just assumptions.

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
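[For reference, not part of the original post: the "nice, linear estimation form" David alludes to is the standard OLS prediction interval for a new observation at x0,

    yhat(x0) +/- t(alpha/2, n-2) * s * sqrt( 1 + 1/n + (x0 - xbar)**2 / Sxx )

where s is the root mean squared error and Sxx = sum of (xi - xbar)**2. PROC REG produces these limits directly via the CLI option on the MODEL statement, so no resampling is needed when the OLS assumptions hold.]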
From: David L Cassell on 16 May 2006 01:45

ni14(a)MAIL.COM wrote:
> ----- Original Message -----
> From: "Jonas Bilenas"
> To: SAS-L(a)LISTSERV.UGA.EDU
> Subject: Re: jackknife concept
> Date: Wed, 10 May 2006 08:16:26 -0400
>
> On Tue, 9 May 2006 17:49:53 -0400, Luo, Peter wrote:
>
> > David, for what Jonas was trying to do, i.e. to get some 'error' estimates
> > for model predictors, is N sub-samples or N bootstrapping samples the better
> > method?
>
> I modified the code a bit, based on suggestions from David. Similar but
> different results:
>
> %macro boot(iter);
>
> proc surveyselect data=reg out=outdata
>   rep=&ITER method=urs samprate=1 outhits;
> run;
>
> %do i=1 %to &iter;
>   ods listing close;
>   ods output ParameterEstimates=bout;
>   proc logistic data=outdata;
>     where replicate=&i;
>     model bad=&ivs;
>   run;
>   ods output close;
>
>   proc transpose data=bout out=bt&i;
>     var estimate;
>     id variable;
>   run;
>   %if "&i" ne "1" %then %do;
>     proc append base=bt1 data=bt&i;
>     run;
>   %end;
> %end;
> ods listing;
>
> proc means data=bt1 mean min max std n nmiss;
> run;
> %mend;
>
> %boot(20);
>
> Hello,
> Can one of you, Jonas or Dave or Peter or someone, explain what this code
> is doing??? I am trying to follow this thread and I am lost already.
> Having a sample dataset would help a lot!!!
> NICK

First of all, this is not a jackknife, despite the Subject line. And this is really a pretty small number of repetitions for a bootstrap.

Now let's look at what's really going on. I'll use a simpler example, and a data set chosen so we have a more obvious problem.

/* Some fake data which is obviously not meeting the assumptions
   of OLS regression: */
data dreck;
  x=1; y=180; output;
  do x = 2 to 29;
    y = 50 + 6*x + 8*rannor(2354);
    output;
  end;
  x=30; y=82; output;
run;

proc plot data=dreck;
  plot y*x;
run;

/* Okay, we can see that we have outliers which are also leverage
   points. This is bad. */
proc reg data=dreck;
  model y=x;
run;

/* Oops. The PROC REG results do not really match up with the
   parameters we *know* to be correct. This is why we need
   diagnostic plots. */

/* Let's re-sample our set of data points. I'm using n=1000 because
   we're going to be building 98% confidence limits and we need
   enough points in those tails. */
proc surveyselect data=dreck out=MySample seed=49578574
  rep=1000 method=urs outhits samprate=1;
run;

/* Now let's look at the output. You can see that it looks like the
   original data set, with a REPLICATE variable for our by-processing. */
proc print data=MySample(obs=100);
run;

/* Now we'll run all our 1000 regressions really quickly, with no
   output and no notes. This is way cleaner than some big macro loop.
   And it's a lot faster. Try it if you don't believe me. */
ods listing close;
options nonotes;

proc reg data=MySample outest=MyEsts;
  model y=x;
  by replicate;
run;

ods listing;
options notes;

/* So now we have some idea of the variability and bias inherent in
   the data, so we can make a 98% CI for the intercept and the slope. */
proc univariate data=MyEsts;
  var Intercept x;
  output out=outie mean=meanint meanx p1=lclint lclx p99=uclint uclx;
run;

proc print data=outie noobs;
run;

And that's all there is to it. Re-sampling plans like the jackknife and the bootstrap 'smooth out' the noisy surface that is our real-world data, and give us a chance to examine a linearized surface. Sort of like looking at a nice, flat map of an area instead of hiking up and down all over the place to guess where something lies.

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
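[Since David points out that the posted code is a bootstrap rather than a jackknife, here is a minimal sketch, not from the original thread, of what an actual leave-one-out jackknife of the same DRECK regression might look like. The replicate-building DATA step and all data set names other than DRECK are illustrative assumptions.]

/* Build one replicate per observation, each omitting exactly that
   observation (leave-one-out). POINT= random access with NOBS= is
   a standard DATA step pattern. */
data jack;
  drop i;
  do replicate = 1 to nobs;
    do i = 1 to nobs;
      if i ne replicate then do;
        set dreck point=i nobs=nobs;
        output;
      end;
    end;
  end;
  stop;
run;

/* One regression per replicate; OUTEST= collects the coefficients. */
proc reg data=jack outest=JackEsts noprint;
  by replicate;
  model y=x;
run;

/* Jackknife SE of the slope: sqrt( (n-1)/n * sum( (b(i)-bbar)**2 ) ),
   which equals std(b)*(n-1)/sqrt(n) when STD uses the n-1 divisor. */
proc means data=JackEsts noprint;
  var x;
  output out=jsum std=sd_b n=n_rep;
run;

data _null_;
  set jsum;
  jack_se = sd_b * (n_rep - 1) / sqrt(n_rep);
  put 'Jackknife SE of slope: ' jack_se;
run;

Note the contrast with David's bootstrap above: the jackknife gives exactly n replicates and a standard-error formula, while the bootstrap gives as many replicates as you care to draw plus an entire empirical sampling distribution, which is one reason the naive jackknife is a poor tool for quantiles and confidence limits.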
From: "Nick ." on 22 May 2006 13:44
Thank you for the explanation, David.

NICK

----- Original Message -----
From: "David L Cassell"
To: SAS-L(a)LISTSERV.UGA.EDU
Subject: Re: jackknife concept
Date: Mon, 15 May 2006 22:45:03 -0700

<<>>