From: Sigurd Hermansen on
Steve:
I didn't have time to reply when you posted this message. I hope this response will still be of some use to you.

What you are proposing amounts to automated variable selection with the c-statistic as a criterion. While I wouldn't recommend that you rely exclusively on the c-statistic, I have demonstrated an efficient method for computing it for a set of predictions and observed binary outcomes. See Lex Jansen's excellent archives for a two-part paper on Evaluating Predictive Models:
http://www.lexjansen.com/cgi-bin/xsl_transform.php?x=sesug2008&s=sesug&c=sesug
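
As a quick baseline (this is not necessarily the method in the paper), the c-statistic equals the Wilcoxon-Mann-Whitney rank-sum statistic, so it can be computed from ranks in a single pass. An untested sketch, assuming a data set SCORED with a 0/1 outcome Y and a predicted probability PHAT:

/* Rank all predictions; TIES=MEAN (the default) gives tied pairs half weight. */
proc rank data=scored out=ranked;
   var phat;
   ranks r_phat;
run;

/* c = (R1 - n1*(n1+1)/2) / (n1*n0), where R1 is the rank sum among events. */
proc sql;
   select (sum(case when y=1 then r_phat else 0 end)
           - sum(y=1)*(sum(y=1)+1)/2)
          / (sum(y=1)*sum(y=0)) as c_statistic
   from ranked;
quit;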

As for techniques, I'd recommend that you look first at the JMP analog of classification trees (recursive partitioning, essentially CART). Variables selected for early "splits" will be good candidates for a predictive model. Correlations among predictors don't affect CART, and CART makes good use of proxies for missing values. CART does, however, tend to commit very early to a hierarchy and may miss a better model. You may need to remove predictors chosen in early splits and explore alternative models.

The new GLMSELECT procedure in SAS implements stage-wise variable selection (LAR or LASSO) along with shrinkage methods. I wouldn't start with 600 predictors. Perhaps the early-split variables in a classification tree, plus others that seem important a priori, would give you a good start. Once you have a potential model or several potential models specified, try a logistic regression model, compute scores, and graph the ROC.
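
Something along these lines would run the LASSO path (untested; X1-X600 and the data set name HAVE are placeholders). Note that GLMSELECT fits least-squares models, so a 0/1 outcome is treated as continuous here; that makes this a rough screening device rather than a logistic fit:

ods graphics on;
proc glmselect data=have plots=coefficients;
   /* LASSO path over all candidates; SBC picks the final step */
   model y = x1-x600 / selection=lasso(choose=sbc stop=none);
run;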

All automated variable/predictor selection programs suffer from the same well-known generic defects: estimation methods optimize within an observed sample (model optimism), confidence bounds on parameter estimates are misleading, and predictions are conditioned on outcomes presumed known with certainty. Cross-validation of models may help. Also, Efron has recently written extensively about using expected false discovery rates to evaluate models selected from sets of many predictors and many observations.
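
For instance, PROC LOGISTIC can score a holdout sample directly, so you can compare in-sample and out-of-sample ROC results (untested sketch; TRAIN, HOLDOUT, and the short variable list are hypothetical):

proc logistic data=train;
   model y(event='1') = x1 x7 x42;
   /* OUTROC= writes the holdout ROC coordinates; FITSTAT requests
      fit statistics, including the AUC, for the scored data */
   score data=holdout out=hscored outroc=hroc fitstat;
run;
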
S


-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of Steven Raimi
Sent: Wednesday, December 02, 2009 9:09 AM
To: SAS-L(a)LISTSERV.UGA.EDU
Subject: Screening factors for logistic regression

I have developed 600+ potential predictors for use in a logistic
regression model I'm working on. I want to screen each as efficiently as
possible for predictive power (using the c-statistic). We have a brute-
force method to generate the c-statistics (proc logistic on
yvar=xvar_in_question, then numerically integrate the ROC curve to
estimate it), but there has to be a more straightforward (and efficient)
way to perform this task, right?

Also, I want to identify variables/groups of variables that are collinear,
so I can leave out all but the most sensible one(s) (per subject matter
knowledge). I could use PROC CORR, but that will be overwhelmed trying to
do 600*600 combinations. Again, isn't there a better way to attack this?

FYI - I have both SAS and JMP available. Only about 5% of the dataset can
fit in JMP - but we'll be developing the regression there (using all
target outcomes, and a few percent of the other records so there's a
minimum of two non-target records per target one).

Thanks for the guidance!
Steve
From: Steven Raimi on
Sigurd,

Thank you very much for the useful input - and no less for following
up on the old request!!!

FYI - we got some advice off-list (from a "little birdie") not to even
attempt to preselect factors based on the ROC curve - just to let
stepwise logistic regression run on a few bootstrap samples to narrow
them down. The same person recommended not trying to run down all the
potential correlations.

However, it turns out that PROC CORR handled hundreds of variables
just fine (and fairly swiftly, too), and we were able to reduce our
candidate factors to below 300; further subject-matter expertise then got
us even lower. Our biggest model (of the 3 we were developing) started
with 286 variables and kept 70 - and we'll be implementing a smaller
model that had virtually identical lift.
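
In case it helps anyone searching the archives later, the screening step
looked roughly like this (a sketch; the X1-X286 names and the data set
name HAVE stand in for our real variables):

/* Write the correlation matrix to a data set, then list highly
   correlated pairs. */
proc corr data=have outp=corrmat noprint;
   var x1-x286;
run;

data highcorr;
   set corrmat(where=(_type_='CORR'));
   array xs{*} x1-x286;
   do j = 1 to dim(xs);
      /* keep each pair once and skip the diagonal */
      if vname(xs{j}) > _name_ and abs(xs{j}) > 0.8 then do;
         var1 = _name_;
         var2 = vname(xs{j});
         r    = xs{j};
         output;
      end;
   end;
   keep var1 var2 r;
run;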

We had started with JMP's Recursive Partitioning platform (before I
posted the original question), but I think we had specified our target
incorrectly, so we abandoned it. We will use it for further validation
of our final models. That and your other suggestions (especially your
paper(s)!) will be included in the internal model-development
methodology we're writing based on what we've learned.

Thanks again,
Steve

From: Wensui Liu on
Sigurd,
While I agree with most of your suggestions to Steve, I have to disagree
with you on using CART as a variable selection tool. CART selects
variables on a local scale rather than a global one, and the child splits
depend heavily on the parent split, which is itself unstable and very
sensitive to the data structure.

Also, using GLMSELECT as a variable selection tool for logistic
regression is very heuristic, without a sound theoretical grounding.

"ALL automated variable/predictor selection programs suffer from the same
well-known generic defects" is itself a false claim. How many does "ALL"
represent, and which are they? It is very dangerous to say "ALL" in a
statistical world.


--
==============================
WenSui Liu
Blog : statcompute.spaces.live.com
Tough Times Never Last. But Tough People Do. - Robert Schuller
==============================
From: Sigurd Hermansen on
Wensui:
I mention in the Predictive Models papers that CART suffers from what
amounts to a greedy selection: the method doesn't go back and reconsider
the first split in light of other linear combinations of predictors. I
actually recommended using more than one of the more robust but partially
flawed variable selection methods as a preliminary look at possible
models. CART performs better in the face of missing predictor values than
other methods, while GLMSELECT's shrinkage of estimates minimizes the
impact of nuisance predictors in sample data. I would not trust either
alone, but would use them to help evaluate predictors that might serve as
proxies for important but unobserved predictors.

My view that all automated variable selection methods suffer from "model
optimism" rests on logic, not statistics. If all automated methods
attempt to optimize model fit to an observed sample (as I believe that
they do), then they will select features of the sample that differ from
features of the population. I haven't seen a good counterargument.
S


