From: Sigurd Hermansen on 16 Dec 2009 12:21

Steve:
I didn't have time to reply when you posted this message. I hope this response will still be of some use to you.

What you are proposing amounts to automated variable selection with the c-statistic as the criterion. While I wouldn't recommend that you rely exclusively on the c-statistic, I have demonstrated an efficient method for computing it for a set of predictions and observed binary outcomes. See Lex Jansen's excellent archives for a two-part paper on Evaluating Predictive Models:
http://www.lexjansen.com/cgi-bin/xsl_transform.php?x=sesug2008&s=sesug&c=sesug

As for techniques, I'd recommend that you look first at the JMP analog of classification trees (recursive partitioning? CART?). Variables selected for early "splits" will be good candidates for a predictive model. Correlations among predictors don't affect CART, and CART makes good use of proxies for missing values. CART does, however, tend to commit very early to a hierarchy and may not find a better model; you may need to remove the predictors used in early splits and explore alternative models.

The new GLMSELECT procedure in SAS implements stage-wise variable selection (LAR or LASSO) with shrinkage of estimates. I wouldn't start with 600 predictors. Perhaps the early-split variables from a classification tree, plus others that seem important a priori, would give you a good start. Once you have one or several potential models specified, fit a logistic regression model, compute scores, and graph the ROC curve.

All automated variable/predictor selection programs suffer from the same well-known generic defects: estimation methods optimize within an observed sample (model optimism), confidence bounds on parameter estimates are misleading, and predictions are conditioned on outcomes presumed known with certainty. Cross-validation of models may help. Efron has also written extensively in recent years about using expected false discovery rates to evaluate models selected from sets of many predictors and many observations.

S

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of Steven Raimi
Sent: Wednesday, December 02, 2009 9:09 AM
To: SAS-L(a)LISTSERV.UGA.EDU
Subject: Screening factors for logistic regression

I have developed 600+ potential predictors for use in a logistic regression model I'm working on. I want to screen each as efficiently as possible for predictive power (using the c-statistic). We have a brute-force method to generate the c-statistics (PROC LOGISTIC on yvar=xvar_in_question, then numerically integrating the ROC curve to estimate it), but there has to be a more straightforward (and efficient) way to perform this task, right?

Also, I want to identify variables/groups of variables that are collinear, so I can leave out all but the most sensible one(s) (per subject-matter knowledge). I could use PROC CORR, but it would be overwhelmed trying to do 600*600 combinations. Again, isn't there a better way to attack this?

FYI - I have both SAS and JMP available. Only about 5% of the dataset can fit in JMP - but we'll be developing the regression there (using all target outcomes, plus a few percent of the other records, so there's a minimum of two non-target records per target one).

Thanks for the guidance!
Steve
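On the brute-force c-statistic computation Steve describes: PROC LOGISTIC already reports c in its "Association of Predicted Probabilities and Observed Responses" table, so there is no need to integrate the ROC curve numerically. A minimal sketch follows; the macro, dataset, and variable names (c_stat, mydata, target, x001) are invented for illustration, and a binary target is assumed.

/* Hedged sketch: harvest the c-statistic PROC LOGISTIC computes   */
/* anyway, rather than integrating the ROC curve by hand.          */
%macro c_stat(data=, y=, x=);
  ods select none;                      /* suppress printed output  */
  ods output Association=_assoc;        /* table that contains c    */
  proc logistic data=&data descending;
    model &y = &x;
  run;
  ods select all;

  data _one;
    length variable $32;
    set _assoc(where=(Label2='c'));     /* row holding c            */
    variable = "&x";
    c = nValue2;
    keep variable c;
  run;

  proc append base=c_stats data=_one force;
  run;
%mend c_stat;

/* Example: screen one of the 600+ candidates */
%c_stat(data=mydata, y=target, x=x001);

Calling the macro once per candidate (for instance, from a data-driven %DO loop over the variable names) leaves a single c_stats data set that can simply be sorted by c.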
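A minimal sketch of the GLMSELECT selection Sigurd mentions, assuming a training set named train and numbered predictors x001-x600 (both invented here). Note that GLMSELECT fits least-squares models, so with a binary target this is only a rough screening device, not a logistic fit.

ods graphics on;

/* Hedged sketch: LASSO path with the final model chosen by        */
/* cross-validation; all names are illustrative assumptions.       */
proc glmselect data=train plots=coefficients;
  model target = x001-x600
        / selection=lasso(choose=cv stop=none)
          cvmethod=random(5);           /* 5-fold cross-validation  */
run;

SELECTION=LAR works the same way; CHOOSE=CV picks the step on the path with the smallest cross-validated prediction error rather than running to a fixed stopping rule.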
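For the last step Sigurd describes (fit the candidate logistic model and graph the ROC), PROC LOGISTIC in SAS 9.2 can draw the curve itself; on earlier releases, the MODEL statement's OUTROC= option writes the curve's coordinates to a data set for plotting. A sketch with a hypothetical three-variable shortlist:

ods graphics on;

/* Hedged sketch: the shortlist x017 x042 x108 is hypothetical */
proc logistic data=train descending plots(only)=roc;
  model target = x017 x042 x108;
run;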
From: Steven Raimi on 17 Dec 2009 11:44

Sigurd,

Thank you very much for the useful input - and no less for following up on the old request!

FYI - we got advice that wasn't posted on the list (from a "little birdie") not to even attempt to preselect factors based on the ROC curve: just let stepwise logistic run on a few bootstrap samples to narrow them down. The same person recommended not trying to run down all the potential correlations. However, it turns out that PROC CORR handled hundreds of variables just fine (and fairly swiftly, too), and we were able to reduce our candidate factors to below 300; further subject-matter expertise got us even lower. Our biggest model (of the 3 we were developing) started with 286 variables and kept 70 - and we'll be implementing a smaller model that had virtually identical lift.

We had started with JMP's Recursive Partitioning platform (before I posted the original question), but I think we had our target specified incorrectly, and we had abandoned it. We will use it for further validation of our final models. That and your other suggestions (especially your papers!) will be included in the internal model-development methodology we're writing based on what we've learned.

Thanks again,
Steve
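A hedged sketch of the "little birdie" recipe Steve mentions: PROC SURVEYSELECT draws the bootstrap samples (METHOD=URS is sampling with replacement), and BY-group processing runs stepwise selection within each replicate. The data set and variable names, the seed, the replicate count, and the entry/stay levels are all invented for illustration.

/* Draw 10 bootstrap samples the same size as the original data */
proc surveyselect data=mydata out=boot seed=20091217
     method=urs samprate=1 outhits reps=10;
run;

/* Stepwise logistic selection within each bootstrap replicate */
proc logistic data=boot descending;
  by replicate;
  model target = x001-x286
        / selection=stepwise slentry=0.05 slstay=0.05;
  ods output ParameterEstimates=boot_est;
run;

A PROC FREQ on boot_est by Variable then shows how often each candidate survives across replicates; variables that enter nearly every bootstrap model are the stable ones.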
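A minimal sketch of one way to do the PROC CORR screening Steve reports worked: write the correlation matrix to a data set with OUTP=, then scan its upper triangle for highly correlated pairs. All names and the 0.8 cutoff are arbitrary choices for illustration.

/* Correlation matrix for all candidates, written to a data set */
proc corr data=mydata outp=_corr noprint;
  var x001-x600;
run;

/* Keep each pair once (upper triangle by name) when |r| > 0.8 */
data high_corr;
  length var1 var2 $32;
  set _corr(where=(_type_='CORR'));
  array xs{*} x001-x600;
  do j = 1 to dim(xs);
    if _name_ < vname(xs{j}) and abs(xs{j}) > 0.8 then do;
      var1 = _name_;
      var2 = vname(xs{j});
      r = xs{j};
      output;
    end;
  end;
  keep var1 var2 r;
run;

Groups of mutually correlated variables can then be read off high_corr and thinned by subject-matter judgment, as Steve describes.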
From: Wensui Liu on 27 Dec 2009 21:10

Sigurd,

While I agree with most of your suggestions to Steve, I have to disagree with you on using CART as a variable selection tool. CART selects variables on a local scale rather than a global one, and each child split depends heavily on its parent split, which is itself unstable and very sensitive to the structure of the data.

Also, using GLMSELECT as a variable selection tool for logistic regression is a heuristic without a sound theoretical ground.

"ALL automated variable/predictor selection programs suffer from the same well-known generic defects" is itself a false claim. How many methods does "ALL" represent, and which are they? It is very dangerous to say "ALL" in a statistical world.

--
WenSui Liu
Blog: statcompute.spaces.live.com
Tough Times Never Last. But Tough People Do. - Robert Schuller
From: Sigurd Hermansen on 27 Dec 2009 22:16

Wensui:

I mention in the Predictive Models papers that CART suffers from what amounts to greedy selection: the method doesn't go back and reconsider the first split in light of other linear combinations of predictors. I actually recommended using more than one of the more robust but partially flawed variable selection methods as a preliminary look at possible models. CART performs better in the face of missing predictor values than other methods, while GLMSELECT's shrinkage of estimates minimizes the impact of nuisance predictors in sample data. I would not trust either alone, but would use them to help evaluate predictors that might serve as proxies for important but unobserved predictors.

My view that all automated variable selection methods suffer from "model optimism" rests on logic, not statistics. If all automated methods attempt to optimize model fit to an observed sample (as I believe they do), then they will select features of the sample that differ from features of the population. I haven't seen a good counterargument.

S