From: kangtsui on 20 Jul 2010 10:53

I've searched this forum for an answer to my question, but it seems it has not been discussed.

I was trying to do variable selection with GLMSELECT (LASSO or LAR). However, I have many categorical IVs in my pool. The manual says that GLMSELECT would split the columns of those categorical IVs, which I think would be an issue for me. I want the columns for the same variable to enter or exit the model together. Is there a way to get around this?

Actually, my real problem is to build a model with a continuous DV and a lot of continuous IVs. The reasons I don't want to run variable selection on the original variables are:
1. There are missing values here and there. Sometimes I could replace them with the mean, min, or max, but sometimes it does not make sense to fill the hole with any number.
2. Often the relation (I'm looking for) between the DV and the IVs is not linear, or even monotonic.

Therefore, I was thinking of applying some algorithm to bin all the IVs (based on the size of each bin and also the relation with the DV) and keeping missing values as one category, which makes perfect sense to me. Then I ran into the problem of how to select the categorical variables. I hate to use forward/backward/stepwise approaches since they usually overfit a lot.

Does anyone have an idea? Many thanks.

Jun
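The binning idea in the post above can be sketched roughly as follows. This is Python rather than SAS, uses only the standard library, and assumes plain equal-frequency bins; it does not use the relation with the DV to place the cut points, which the post also asks for. The function names (`quantile_edges`, `bin_with_missing`, `dummy_columns`) are made up for illustration.

```python
import math

def is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or math.isnan(v)

def quantile_edges(values, n_bins):
    """Return n_bins - 1 interior cut points taken from the non-missing values."""
    observed = sorted(v for v in values if not is_missing(v))
    return [observed[(len(observed) * k) // n_bins] for k in range(1, n_bins)]

def bin_with_missing(values, edges):
    """Map each value to a bin label; missing values get their own label."""
    labels = []
    for v in values:
        if is_missing(v):
            labels.append("missing")
        else:
            # The bin index is the number of cut points the value has reached.
            labels.append("bin%d" % sum(v >= e for e in edges))
    return labels

def dummy_columns(labels):
    """One 0/1 column per level; together the columns form one group."""
    levels = sorted(set(labels))
    rows = [[1 if lab == lev else 0 for lev in levels] for lab in labels]
    return levels, rows

# Toy IV with two missing entries (hypothetical data).
x = [1.0, 2.0, None, 4.0, 5.0, float("nan"), 7.0, 8.0]
edges = quantile_edges(x, 2)
labels = bin_with_missing(x, edges)
levels, rows = dummy_columns(labels)
```

Every row has exactly one 1 across the columns of a variable, so the columns only make sense as a set. That is why selecting them one at a time is awkward.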
From: Paige Miller on 20 Jul 2010 14:46

On Jul 20, 10:53 am, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:
> I was trying to do variable selection by GLMSELECT (LASSO or LAR).
> However, I have so many categorical IVs in my pool. The manual says
> that GLMSELECT would split the columns of those categorical IVs, but I
> think it would be an issue for me. I hope that columns for the same
> variable enter or exit the model together. Is there a way to get
> around this?
>
> Actually my real problem is to build a model with a continuous DV and
> a lot of continuous IVs. The reason I don't want to run variable
> selection on the original variables are
> 1. there are missing values here and there. sometimes I could replace
> it with mean, min, or max, but sometimes it does not make sense to
> fill the hole with any number
> 2. many times that the relation (I'm looking for) between the DV and
> IVs are not linear, or even monotonic.

It's hard to see how these two paragraphs go together. In paragraph 1, you say that you have categorical IVs, but in paragraph 2, your real problem has nothing to do with categorical IVs, your real problem is missing values, and furthermore you want non-linear modeling on top of that (which means you shouldn't be using PROC GLMSELECT).

So mark me down as confused. Perhaps you could explain further?

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
From: kangtsui on 20 Jul 2010 22:02

Thanks for your response. Let me try to make it clearer.

What I have is a continuous DV and a bunch of continuous IVs, which have missing values to deal with. My goal is to build an interpretable model on these variables, whether they're binned or not. There are two approaches I could think of.

One is to fill in the missing values first for all IVs and run GLMSELECT (LASSO). The issue is that there's no perfect way to replace missing values for so many variables, and some useful variables might not have a linear effect on the DV.

The other approach is to bin the variables first and run variable selection on the binned ones. It's simple to make missing values one category for each variable; however, GLMSELECT will split the categorical variables while doing selection. I want all the columns of the same variable to enter or exit the model together. Grouped LASSO is not built into GLMSELECT, right?

Sorry for the confusion, but I really wanted to give the whole story of what I was doing instead of asking one specific question. Thanks.

Jun
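For readers unfamiliar with the "grouped LASSO" mentioned above, here is a minimal sketch of the group soft-thresholding step that gives it the all-in or all-out behaviour. It is Python, not SAS, and it is not a description of what GLMSELECT does. It shows only the shrinkage operator and is not a full fitting routine. The coefficient values, group ids, and `lam` are hypothetical.

```python
import math

def group_soft_threshold(coefs, groups, lam):
    """Shrink each group of coefficients according to its Euclidean norm.

    coefs  : list of floats, one per dummy column
    groups : list of group ids, one per coefficient (one id per original variable)
    lam    : penalty strength (hypothetical tuning value)
    """
    # Accumulate the squared norm of every group.
    sq_norms = {}
    for c, g in zip(coefs, groups):
        sq_norms[g] = sq_norms.get(g, 0.0) + c * c

    shrunk = []
    for c, g in zip(coefs, groups):
        norm = math.sqrt(sq_norms[g])
        # The whole group is set to zero when its norm does not exceed lam;
        # otherwise every member is scaled by the same factor.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        shrunk.append(c * scale)
    return shrunk

# Two binned variables with two dummy columns each (made-up numbers).
coefs = [3.0, 4.0, 0.3, -0.4]
groups = ["x1", "x1", "x2", "x2"]
result = group_soft_threshold(coefs, groups, lam=1.0)
```

With these numbers the `x1` columns are both kept (shrunk by a common factor), while the `x2` columns are both dropped. An ordinary LASSO penalty would instead threshold each column separately.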
From: Paige Miller on 21 Jul 2010 11:14

I never think binning is a good idea with continuous variables.

This whole question boils down to: how best to deal with missing values in a complicated modeling situation, which may be nonlinear, but I just don't see PROC GLMSELECT as an option here.

I don't think SAS has great tools for what may be a non-linear modeling situation; however, there are tools for linear modeling. You may want to look at PROC MI and PROC MIANALYZE. Also, Partial Least Squares modeling (PROC PLS) has the ability to "impute" missing values based upon the EM algorithm, so that may be an option as well. As far as I know, these procedures only handle linear modeling situations.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
From: Paige Miller on 21 Jul 2010 14:22

Clarification: when I say "I don't think SAS has great tools for what may be a non-linear modeling situation, however there are tools for linear modeling", I am referring to handling missing values in non-linear modeling situations.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com