Prev: Screening factors for logistic regression
Next: Common Programming Mistake with Proc Sort NODUPRECS
From: yoonsup on 2 Dec 2009 10:22

Hi all,

I was wondering how PROC GLMSELECT is implemented in SAS. I ask because I have a dataset with around 400,000 rows, and I need to pick the significant columns out of a total of around 1,200 by stepwise selection.

There are two class variables, say A and B. Both are included in the MODEL statement, along with the interaction between them (A*B). These three terms are forced to stay in the model during stepwise selection. Each of the 1,200 candidate columns is nested within the A*B combination, e.g. COL1(A*B). Hence my model statement looks like

model y = A B A*B COL1(A*B) ... COL1200(A*B)

I ran the procedure on a 32-bit Intel Xeon 2.67 GHz computer with 3 GB of RAM, and it failed with insufficient memory. I then moved to a high-performance computer with 124 GB of RAM, and it failed with the same warning: insufficient memory. The procedure stopped almost immediately, so PROC GLMSELECT did not even get through a few selection steps before running out of memory. In theory, a stepwise procedure should start by adding one column at a time, and if that were the case I would not expect to run out of memory this quickly.

Does anyone have any idea how PROC GLMSELECT is implemented?

Thanks.
Yoon
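A rough way to see why the size of this model matters even before any selection step runs: under GLM-style coding, a continuous column nested in A*B gets one slope per A*B cell, so 1,200 nested effects expand to 1,200 times the number of cells in design columns. The Python sketch below (not SAS) only does that arithmetic. The level counts for A and B are hypothetical, since the post does not give them, and the idea that the procedure allocates a dense cross-product array up front is an assumption, not documented behaviour.

```python
def design_columns(levels_a, levels_b, n_nested):
    """Count design columns for: y = A B A*B COL1(A*B) ... COLn(A*B).

    Assumes GLM-style (non-full-rank) coding: one column per class level,
    one per A*B cell, and one slope per A*B cell for each nested column.
    """
    cells = levels_a * levels_b
    forced = 1 + levels_a + levels_b + cells   # intercept, A, B, A*B
    nested = n_nested * cells                  # COLk(A*B), k = 1..n_nested
    return forced + nested


def crossproduct_gib(n_columns, bytes_per_value=8):
    """Size in GiB of a dense p x p array of doubles (e.g. X'X)."""
    return n_columns ** 2 * bytes_per_value / 2 ** 30


# Hypothetical level counts for A and B; the original post does not state them.
for levels_a, levels_b in [(2, 3), (5, 10), (10, 20)]:
    p = design_columns(levels_a, levels_b, n_nested=1200)
    print(f"A={levels_a:>2} levels, B={levels_b:>2} levels: "
          f"{p:>7} columns, dense p x p array ~ {crossproduct_gib(p):8.1f} GiB")
```

Because the array grows with the square of the column count, modest increases in the number of A*B cells move the requirement from a fraction of a gigabyte to far more than 124 GB, which would be consistent with a failure at start-up on both machines.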
From: Sigurd Hermansen on 4 Dec 2009 18:19

Yoon:

It depends on which of the options you choose: LASSO, LAR, etc. These so-called stage-wise methods employ a general class of algorithm that does not overfit models as badly as stepwise methods do.

I suspect that SAS attempts to construct an internal array with the dimensions of the full model. When the dimensions of that array exceed the amount of memory available, the process fails with an out-of-memory error.

I've said, only partly in jest, that only those who don't need automatic variable selection methods should use them. Though better than stepwise selection, stage-wise selection should merely save time when refining a statistical model whose alternative specifications are already clearly limited by content knowledge and theory. Automated model selection won't help discover anything unless one tightly controls the expected false discovery rate. Every method fits to a sample that, even if it is a large one, may have features unique to it and not shared by the population from which it comes. Further, the inevitably high levels of collinearity among some linear combinations of variables will lead to overly favorable model fit statistics and to misspecification of the model.

If you start with a model of much smaller dimensions, but with the values of the variables generated at random, you will likely "discover" variables that appear to be significant. You then know that you cannot trust automated selection of variables for a model of those or any greater dimensions.

At least take a close look at the literature on false discovery rates. You'll see some nasty trade-offs that explain why supercomputers haven't put statistical modelers out of business.

S
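The random-variable check suggested above can be sketched in a few lines. This is a minimal Python illustration (not SAS, and not PROC GLMSELECT's stepwise algorithm): it generates a response and a set of predictors that are all independent noise, then counts how many predictors pass a simple one-at-a-time significance screen. The sample size, number of predictors, seed, and the 1.96 cutoff (a normal approximation to a two-sided 5% test) are all assumptions chosen for illustration.

```python
import math
import random


def count_false_discoveries(n_rows=200, n_noise=200, z_crit=1.96, seed=0):
    """Count pure-noise predictors that look 'significant' against a noise response.

    Each predictor is screened on its own with the t statistic of its
    correlation with y; |t| > z_crit is a normal-approximation 5% test.
    """
    rng = random.Random(seed)

    # Response is pure noise: no predictor has any real effect.
    y = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
    y_mean = sum(y) / n_rows
    y_dev = [v - y_mean for v in y]
    y_ss = sum(d * d for d in y_dev)

    hits = 0
    for _ in range(n_noise):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
        x_mean = sum(x) / n_rows
        x_dev = [v - x_mean for v in x]
        x_ss = sum(d * d for d in x_dev)

        # Pearson correlation and its t statistic with n_rows - 2 degrees of freedom.
        r = sum(a * b for a, b in zip(x_dev, y_dev)) / math.sqrt(x_ss * y_ss)
        t = r * math.sqrt((n_rows - 2) / (1.0 - r * r))
        if abs(t) > z_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_false_discoveries()
    print(f"{hits} of 200 pure-noise predictors passed the 5% screen")
```

With a 5% test, roughly one in twenty noise predictors is expected to pass by chance, so a screen over 1,200 candidate columns would be expected to flag dozens of them even if none were related to the response.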