From: yoonsup on
Hi all,

I was wondering how PROC GLMSELECT is implemented in SAS.
I ask because I have a dataset with around 400,000 rows and need to
select the significant columns out of a total of around 1200 by
stepwise selection. There are two class variables, say A and B. Both
are included in the MODEL statement, along with their interaction
(A*B), and these three terms are forced to stay in during the stepwise
selection. Each of the 1200 candidate columns is nested within the
A*B combination, for example COL1(A*B).

Hence, my model statement looks like

model y = A B A*B COL1(A*B) all the way to COL1200(A*B)

I ran the procedure on a 32-bit Intel Xeon 2.67 GHz computer with
3 GB of RAM and it failed with insufficient memory.

So I moved to a high-performance computer with 124 GB of RAM and it
failed with the same message: insufficient memory.

The procedure stopped almost immediately with the error, so PROC
GLMSELECT ran out of memory before it had gone through even a few
selection steps.

In theory, the stepwise procedure should start out with one column at
a time, and if that is the case I would not expect to run out of
memory this fast. Does anyone have any idea how PROC GLMSELECT is
implemented?

Thanks.

Yoon
From: Sigurd Hermansen on
Yoon:
Depends on which of the options you choose: LASSO, LAR, etc. These so-called stage-wise methods employ a general class of algorithm that does not overfit models as badly as stepwise methods.

I suspect that SAS attempts to construct an internal array of the dimensions of the model. When the dimensions of the array exceed the amount of memory available, the process fails with an out-of-memory error.
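To put rough numbers on that suspicion, here is a back-of-the-envelope sketch in Python. It assumes, without confirmation from SAS, that the procedure allocates a dense cross-product matrix over every column of the full design before selection starts. The original post does not give the number of levels of A and B, so the 10 x 10 used here is purely hypothetical, and the function names are made up for illustration.

```python
def design_columns(levels_a, levels_b, n_nested):
    """Count design-matrix columns for: y = A B A*B COL1(A*B) ... COLn(A*B).

    A continuous column nested within A*B gets one slope per A*B cell,
    so each of the n_nested columns expands to levels_a * levels_b columns.
    """
    cells = levels_a * levels_b
    intercept = 1
    return intercept + levels_a + levels_b + cells + n_nested * cells


def sscp_bytes(p, bytes_per_value=8):
    """Bytes needed for a dense p x p cross-product matrix of doubles."""
    return p * p * bytes_per_value


# Hypothetical level counts: the post does not say how many levels A and B have.
p = design_columns(levels_a=10, levels_b=10, n_nested=1200)
gib = sscp_bytes(p) / 2**30
print(f"{p} design columns, about {gib:.1f} GiB for a dense cross-product matrix")
```

Under those assumptions the required storage grows with the square of the number of A*B cells, and it does not depend on the 400,000 rows at all. That would be consistent with the job failing at once on both machines, before any selection step runs.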

I've said, only partly in jest, that automatic variable selection methods should be used only by those who don't need them. Though better than stepwise selection, stage-wise selection should merely save time when refining a statistical model whose alternative specifications are already clearly limited by content knowledge and theory. Automated model selection won't help discover anything unless one tightly controls the expected false discovery rate. Every method fits to a sample that, even if large, may have features unique to it and not shared by the population from which it comes. Further, the inevitably high levels of collinearity among some linear combinations of variables will lead to overly favorable model fit statistics and misspecification of the model.

If you start with a model of much smaller dimensions but with values of variables generated at random, you will likely "discover" variables of a model that appear to have significance. You then know that you cannot trust automated selection of variables of a model of those or any greater dimensions.
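That experiment can be sketched in a few lines of Python. This is a simplified stand-in and not a stepwise or LASSO run: it screens each pure-noise predictor against a pure-noise response one at a time and counts how many look "significant" at roughly the 5% level. The sizes, the seed, and the function names are illustrative assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious(n_rows=200, n_cols=200, crit=1.96, seed=0):
    """Count noise predictors whose correlation with a noise response
    passes an approximate two-sided 5% test (normal approximation to t)."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_cols):
        x = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson_r(x, y)
        # t statistic for H0: correlation is zero
        t = r * math.sqrt((n_rows - 2) / (1 - r * r))
        if abs(t) > crit:
            hits += 1
    return hits


print(count_spurious(), "of 200 pure-noise predictors look significant")
```

With nothing but noise, about 5% of the candidates are expected to pass the test by chance alone. A selection procedure that picks the best-looking candidates at each step will tend to find exactly those.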

At least take a close look at the literature on false discovery rates. You'll see some nasty trade-offs that explain why supercomputers haven't put statistical modelers out of business.
S

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L(a)LISTSERV.UGA.EDU] On Behalf Of yoonsup(a)gmail.com
Sent: Wednesday, December 02, 2009 10:23 AM
To: SAS-L(a)LISTSERV.UGA.EDU
Subject: Proc GLMSELECT
