From: "Nick ." on 11 May 2006 17:50 Dan, I thank you for your explanation. The example you presented solidified for me the BIG PICTURE of what Jonas was/is trying to do. I know D.C. will explain more in the future about this. But for now, I am happy with what I've gotten from you. Thanks. NICK ----- Original Message ----- From: "Nordlund, Dan (DSHS)" To: SAS-L(a)LISTSERV.UGA.EDU Subject: Re: jackknife concept Date: Wed, 10 May 2006 14:16:30 -0700 > -----Original Message----- > From: Nick . [mailto:ni14(a)mail.com] > Sent: Wednesday, May 10, 2006 12:56 PM > To: Nordlund, Dan (DSHS); SAS-L(a)LISTSERV.UGA.EDU > Subject: Re: Re: jackknife concept > > Dan, > > What is the objective of this macro? It will run 20 times, it will give > you statistics on 20 differnt data sets? What is the objective of this > macro for those of us who don't understand what Jonas is trying to > implement and how to interprete? > NICK > > ----- Original Message ----- > From: "Nordlund, Dan (DSHS)" > To: SAS-L(a)LISTSERV.UGA.EDU > Subject: Re: jackknife concept > Date: Wed, 10 May 2006 12:25:43 -0700 > > > > -----Original Message----- > > From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of > Jonas > > Bilenas > > Sent: Wednesday, May 10, 2006 5:16 AM > > To: SAS-L(a)LISTSERV.UGA.EDU > > Subject: Re: jackknife concept > > > > On Tue, 9 May 2006 17:49:53 -0400, Luo, Peter wrote: > > > > >David, for what Jonas was trying to do, i.e. to get some 'error' > > estimates > > >for model predictors, is N sub-samples or N bootstrapping samples the > > better > > >method? > > > > > I modified the code a bit, based on suggestions from David. Similar but > > different results: <<>> > Jonas, > > you haven't incorporated one of the most important suggestions that David > made, which is to use BY processing in Proc Logistic. That will eliminate > having to continually open and close the file of bootstrap samples, and > the > file will only have to be read through once. Remove the %DO loop and > replace the where statement with a BY statement. You can also eliminate > the > Proc Transpose and the Proc Append. Something like the following (I'm not > sure where the macro variable &ivs is defined) : > > %macro boot(iter); > proc surveyselect data=reg out=outdata > rep=&ITER method=urs samprate=1 outhits; > run; > > ods listing close; > ods output ParameterEstimates=bout; > > proc logistic data=outdata; > by replicate; > model bad=&ivs; > run; > > ods output close; > ods listing; > > > proc means data=bout mean min max std n nmiss; > class variable; > var estimate; > output out=estimate_summary; > run; > %mend; > > %boot(20); > > Hope this is helpful, > > Dan > Nick, I haven't got the time, space, or probably even the skill to adequately explain bootstrapping, but I will try to briefly respond. I am sure that David will be only too kind to correct me if I go to far astray. :-) First the fact that I used a macro here was simply because I was responding to what had been written. Unless I was going to try to create a much more general boot macro with many parameters for flexibility (and I wouldn't because it's already been done) I would just write the basic code here with the number of replications hard coded. I am oversimplifying here, but bootstrapping is based on the assumption that your original sample is representative of the population from which it is drawn. 
So sampling with replacement from your original sample will produce a sample similar to what you could have gotten if you took a new sample from the parent population. Now take many bootstrap samples and compute a desired statistic, say the mean, on each one. Then you can empirically estimate what the sampling distribution of the statistic is, rather than assuming that the distribution is normal (or some other distribution) and estimating the standard errors using standard formulas. Bootstrapping can also be useful in those situations where you don't have an analytical solution for the standard error of your statistic.

Here is a toy example logistic regression that you could play with.

** create sample data;
data test;
  do i=1 to 100;
    y = i GT 50;
    x0 = i + 20*normal(1234);
    x1 = uniform(4321) > .5;
    x2 = normal(1234);
    x3 = normal(1234);
    output;
  end;
run;

/** run your initial logistic regression. It might be instructive to
    compare the mean of the bootstrap sample estimated coefficients
    (below) to the estimates here **/
proc logistic data=test;
  model y = x0 x1 x2 x3;
run;

/** create 20 bootstrap samples; in real life you would probably
    want many more **/
proc surveyselect data=test out=outdata
  rep=20 method=urs samprate=1 outhits;
run;

ods listing close;
ods output ParameterEstimates=bout;

/** run your analysis using BY processing; the ODS OUTPUT statement
    will collect the 20 sets of regression coefficients into one
    data set, bout **/
proc logistic data=outdata;
  by replicate;
  model y = x0 x1 x2 x3;
run;

ods output close;
ods listing;

/** compute the mean and std. dev. of the 20 regression coefficients
    for each variable. The std. dev. is *an* estimate of the standard
    error of the original regression coefficients. You might then use
    these standard errors (or an empirically estimated confidence
    interval) to assess whether your estimated coefficients are
    different from zero **/
proc means data=bout nway mean min max std n nmiss;
  class variable;
  var estimate;
  output out=estimate_summary;
run;

I hope this description has not been too far off the mark. Do not go out and try to bootstrap your own estimates using this partial, simplified explanation. I haven't dealt with a whole host of issues, including but not limited to things like bias estimation and whether you should be resampling cases or residuals.

I hope this has been helpful for following this discussion thread,

Dan

Daniel J. Nordlund
Research and Data Analysis
Washington State Department of Social and Health Services
Olympia, WA 98504-5204
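[A minimal sketch of the "empirically estimated confidence interval" step Dan mentions but doesn't code. This is not from the original post; it assumes the BOUT data set created above and uses PROC UNIVARIATE's standard PCTLPTS/PCTLPRE output options. The data set name BOOT_CI and the CI_ prefix are illustrative choices.]

/** sort so PROC UNIVARIATE can process each coefficient separately;
    BOUT arrives sorted by replicate, not by variable **/
proc sort data=bout;
  by variable;
run;

/** empirical 95% percentile interval for each coefficient: the 2.5th
    and 97.5th percentiles of the bootstrap estimates. PCTLPRE=ci_
    names the output variables ci_2_5 and ci_97_5 **/
proc univariate data=bout noprint;
  by variable;
  var estimate;
  output out=boot_ci mean=boot_mean pctlpts=2.5 97.5 pctlpre=ci_;
run;

/** a coefficient whose interval excludes zero is "significant" in the
    naive percentile-bootstrap sense (with only 20 replicates these
    tails are very rough; use many more replicates in practice) **/
proc print data=boot_ci noobs;
run;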
From: David L Cassell on 12 May 2006 00:16

Peter replied:
> >>> David L Cassell <davidlcassell(a)MSN.COM> 5/9/2006 2:20 pm >>> wrote
> <<<
> [2] No, they are not concerned with regression only. And they
> make fundamental implicit assumptions that no one bothers to
> warn you about, so they are not applicable everywhere. Don't
> use the naive bootstrap or jackknife on time series data, or
> sample survey data, or ...
> >>>
>
> and also not to estimate maxima or minima, or extreme quantiles.
> And be careful with things like factor analysis, where a big problem is
> that the signs of the factors are arbitrary, and averaging can lead to
> odd results (THAT error cost me a lot of hours to find and correct).

Etc., etc. Bootstrapping is a nice tool, but the people who sell it as THE SOLUTION TO ALL KNOWN STAT PROBLEMS are really over-hyping the product. Oh yeah, and it also makes hundreds of julienne fries. :-)

> <<<
> [3] Can they be used in prediction intervals? Yes. Should they?
> No. Why not? Because in simple linear regression, you already have
> a nice, linear estimation form for your prediction intervals.
> Bootstrapping an already-linearized estimate is about as useful as
> changing a tire that isn't flat.
> >>>
>
> OK, as I know David knows, this is fine if the assumptions of the model
> are met. But I was under the impression that bootstrapping can deal
> with some fairly serious violations of those assumptions. Am I
> incorrect?

<Ed McMahon>You are correct, SIR!</Ed McMahon>

Actually, a Little Birdie(tm) pecked me on the head about this, reminding me that the OLS assumptions are just assumptions.

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
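[For reference, not part of the original post: the "nice, linear estimation form" David alludes to is the standard OLS prediction interval for a new observation at x0,

    yhat(x0) +/- t(alpha/2, n-2) * s * sqrt( 1 + 1/n + (x0 - xbar)**2 / Sxx )

where s is the root mean squared error and Sxx = sum of (xi - xbar)**2. PROC REG produces these limits directly via the CLI option on the MODEL statement, so no resampling is needed when the OLS assumptions hold.]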
From: David L Cassell on 16 May 2006 01:45

ni14(a)MAIL.COM wrote:
> ----- Original Message -----
> From: "Jonas Bilenas"
> To: SAS-L(a)LISTSERV.UGA.EDU
> Subject: Re: jackknife concept
> Date: Wed, 10 May 2006 08:16:26 -0400
>
> On Tue, 9 May 2006 17:49:53 -0400, Luo, Peter wrote:
>
> > David, for what Jonas was trying to do, i.e. to get some 'error' estimates
> > for model predictors, is N sub-samples or N bootstrapping samples the better
> > method?
>
> I modified the code a bit, based on suggestions from David. Similar but
> different results:
>
> %macro boot(iter);
>
> proc surveyselect data=reg out=outdata
>   rep=&ITER method=urs samprate=1 outhits;
> run;
>
> %do i=1 %to &iter;
>   ods listing close;
>   ods output ParameterEstimates=bout;
>   proc logistic data=outdata;
>     where replicate=&i;
>     model bad=&ivs;
>   run;
>   ods output close;
>
>   proc transpose data=bout out=bt&i;
>     var estimate;
>     id variable;
>   run;
>   %if "&i" ne "1" %then %do;
>     proc append base=bt1 data=bt&i;
>     run;
>   %end;
> %end;
> ods listing;
>
> proc means data=bt1 mean min max std n nmiss;
> run;
> %mend;
>
> %boot(20);
>
> Hello,
> Can one of you, Jonas or Dave or Peter or someone, explain what this code
> is doing??? I am trying to follow this thread and I am lost already.
> Having a sample dataset would help a lot!!!
> NICK

First of all, this is not a jackknife, despite the Subject line. And this is really a pretty small number of repetitions for a bootstrap.

Now let's look at what's really going on. I'll use a simpler example, and a data set chosen so we have a more obvious problem.

/* Some fake data which is obviously not meeting the assumptions
   of OLS regression: */
data dreck;
  x=1; y=180; output;
  do x = 2 to 29;
    y = 50 + 6*x + 8*rannor(2354);
    output;
  end;
  x=30; y=82; output;
run;

proc plot data=dreck;
  plot y*x;
run;

/* Okay, we can see that we have outliers which are also leverage
   points. This is bad. */
proc reg data=dreck;
  model y=x;
run;

/* Oops. The PROC REG results do not really match up with the
   parameters we *know* to be correct. This is why we need
   diagnostic plots. */

/* Let's re-sample our set of data points. I'm using n=1000 because
   we're going to be building 98% confidence limits and we need
   enough points in those tails. */
proc surveyselect data=dreck out=MySample seed=49578574
  rep=1000 method=urs outhits samprate=1;
run;

/* Now let's look at the output. You can see that it looks like the
   original data set, with a REPLICATE variable for our by-processing. */
proc print data=MySample(obs=100);
run;

/* Now we'll run all our 1000 regressions really quickly, with no
   output and no notes. This is way cleaner than some big macro loop.
   And it's a lot faster. Try it if you don't believe me. */
ods listing close;
options nonotes;

proc reg data=MySample outest=MyEsts;
  model y=x;
  by replicate;
run;

ods listing;
options notes;

/* So now we have some idea of the variability and bias inherent in
   the data, so we can make a 98% CI for the intercept and the slope. */
proc univariate data=MyEsts;
  var Intercept x;
  output out=outie mean=meanint meanx p1=lclint lclx p99=uclint uclx;
run;

proc print data=outie noobs;
run;

And that's all there is to it. Re-sampling plans like the jackknife and the bootstrap 'smooth out' the noisy surface that is our real-world data, and give us a chance to examine a linearized surface. Sort of like looking at a nice, flat map of an area instead of hiking up and down all over the place to guess where something lies.

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
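[Since David points out that the posted code is a bootstrap rather than a jackknife, here is a minimal sketch, not from the original thread, of what an actual leave-one-out jackknife of the same DRECK regression might look like. The replicate-building DATA step and all data set names other than DRECK are illustrative assumptions.]

/* Build one replicate per observation, each omitting exactly that
   observation (leave-one-out). POINT= random access with NOBS= is
   a standard DATA step pattern. */
data jack;
  drop i;
  do replicate = 1 to nobs;
    do i = 1 to nobs;
      if i ne replicate then do;
        set dreck point=i nobs=nobs;
        output;
      end;
    end;
  end;
  stop;
run;

/* One regression per replicate; OUTEST= collects the coefficients. */
proc reg data=jack outest=JackEsts noprint;
  by replicate;
  model y=x;
run;

/* Jackknife SE of the slope: sqrt( (n-1)/n * sum( (b(i)-bbar)**2 ) ),
   which equals std(b)*(n-1)/sqrt(n) when STD uses the n-1 divisor. */
proc means data=JackEsts noprint;
  var x;
  output out=jsum std=sd_b n=n_rep;
run;

data _null_;
  set jsum;
  jack_se = sd_b * (n_rep - 1) / sqrt(n_rep);
  put 'Jackknife SE of slope: ' jack_se;
run;

Note the contrast with David's bootstrap above: the jackknife gives exactly n replicates and a standard-error formula, while the bootstrap gives as many replicates as you care to draw plus an entire empirical sampling distribution, which is one reason the naive jackknife is a poor tool for quantiles and confidence limits.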
From: "Nick ." on 22 May 2006 13:44
Thank you for the explanation, David.

NICK

----- Original Message -----
From: "David L Cassell"
To: SAS-L(a)LISTSERV.UGA.EDU
Subject: Re: jackknife concept
Date: Mon, 15 May 2006 22:45:03 -0700

<<>>