From: kangtsui on 21 Jul 2010 15:26

On Jul 21, 2:22 pm, Paige Miller <paige.mil...(a)kodak.com> wrote:
> On Jul 21, 11:14 am, Paige Miller <paige.mil...(a)kodak.com> wrote:
> > On Jul 20, 10:02 pm, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:
> > > On Jul 20, 2:46 pm, Paige Miller <paige.mil...(a)kodak.com> wrote:
> > > > On Jul 20, 10:53 am, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:
> > > > > I was trying to do variable selection by GLMSELECT (LASSO or LAR).
> > > > > However, I have so many categorical IVs in my pool. The manual says
> > > > > that GLMSELECT would split the columns of those categorical IVs, but I
> > > > > think it would be an issue for me. I hope that columns for the same
> > > > > variable enter or exit the model together. Is there a way to get
> > > > > around this?
> > > > >
> > > > > Actually my real problem is to build a model with a continuous DV and
> > > > > a lot of continuous IVs. The reason I don't want to run variable
> > > > > selection on the original variables are
> > > > > 1. there are missing values here and there. sometimes I could replace
> > > > > it with mean, min, or max, but sometimes it does not make sense to
> > > > > fill the hole with any number
> > > > > 2. many times that the relation (I'm looking for) between the DV and
> > > > > IVs are not linear, or even monotonic.
> > > >
> > > > It's hard to see how these two paragraphs go together. In paragraph 1,
> > > > you say that you have categorical IVs, but in paragraph 2, your real
> > > > problem has nothing to do with categorical IVs, your real problem is
> > > > missing values, and furthermore you want non-linear modeling on top of
> > > > that (which means you shouldn't be using PROC GLMSELECT).
> > > >
> > > > So mark me down as confused. Perhaps you could explain further?
> > > >
> > > > --
> > > > Paige Miller
> > > > paige\dot\miller \at\ kodak\dot\com
> > >
> > > Thanks for your response. Let me try to make it more clear.
> > >
> > > What I have for the problem is a continuous DV and a bunch of
> > > continuous IVs, which have missing values to deal with. My goal is to
> > > build an interpretable model on these variables, no matter they're
> > > binned or not. There're two approaches I could think of. One is to
> > > filling the missing values first for all IVs and run GLMSELECT(LASSO).
> > > The issue is that there's no perfect way to replace missing values for
> > > such many variables and some useful variable might not have a linear
> > > effect to the DV. Then the next approach came into my mind, which is
> > > to bin the variables first and run variable selection on the binned
> > > ones. It's simple to make missing values as one category for each
> > > variable, however, GLMSELECT will split the categorical variables
> > > while doing selection. I hope all the columns of the same variable
> > > would enter or exit the model together. Grouped LASSO is not built
> > > into GLMSELECT right?
> > >
> > > Sorry for the confusing, but I really wanted to give the whole story
> > > of what I was doing instead of asking one specific question. Thanks.
> > >
> > > Jun
> >
> > I never think binning is a good idea with continuous variables.
> >
> > This whole question boils down to: how best to deal with missing
> > values in a complicated modeling situation, which may be nonlinear,
> > but I just don't see PROC GLMSELECT as an option here.
> >
> > I don't think SAS has great tools for what may be a non-linear
> > modeling situation, however there are tools for linear modeling. You
> > may want to look at PROC MI and PROC MIANALYZE. Also Partial Least
> > Squares modeling (PROC PLS) has the ability to "impute" missing values
> > based upon the EM algorithm, so that may be an option as well. As far
> > as I know, these procedures only handle linear modeling situations.
> >
> > --
> > Paige Miller
> > paige\dot\miller \at\ kodak\dot\com
>
> Clarification: when I say "I don't think SAS has great tools for what
> may be a non-linear modeling situation, however there are tools for
> linear modeling" I am referring to handling missing value in
> non-linear modeling situations.
>
> --
> Paige Miller
> paige\dot\miller \at\ kodak\dot\com

Thanks for your help. Do you recommend PROC GAM for my case, if I
could handle missing values on my own? Is there a tool to do variable
selection for non-linear models? Thanks.

Jun
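
The request in this thread, that all dummy columns of one binned variable
enter or leave the model together, is what a grouped LASSO penalty does. As a
rough illustration only, here is a minimal NumPy sketch (Python, not SAS) of
group LASSO fitted by proximal gradient descent. The function name
group_lasso, the toy data, and the value of lam are all assumptions made for
this example; nothing here is taken from PROC GLMSELECT.

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal-gradient group lasso (illustrative sketch).

    Minimizes (1/2n)||y - X b||^2 + lam * sum_g sqrt(p_g) * ||b_g||_2,
    where p_g is the number of columns in group g.
    """
    n, p = X.shape
    beta = np.zeros(p)
    # Step size = 1 / Lipschitz constant of the least-squares gradient.
    step = n / (np.linalg.norm(X, 2) ** 2)
    group_ids = np.unique(groups)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - step * grad
        for g in group_ids:
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            thresh = step * lam * np.sqrt(idx.sum())
            # Block soft-thresholding: the whole group is zeroed or shrunk together.
            z[idx] = 0.0 if norm <= thresh else (1 - thresh / norm) * z[idx]
        beta = z
    return beta

# Toy data (hypothetical): two binned IVs with three bins each; the third bin
# could stand for the "missing" category. Only the first IV affects y.
levels = np.arange(3)
x1, x2 = [a.ravel() for a in np.meshgrid(levels, levels)]
x1, x2 = np.tile(x1, 4), np.tile(x2, 4)

def one_hot(v):
    return (v[:, None] == levels).astype(float)

X = np.hstack([one_hot(x1), one_hot(x2)])
X -= X.mean(axis=0)                      # center the dummy columns
y = np.array([0.0, 1.0, 3.0])[x1]
y = y - y.mean()                         # center the response
groups = np.array([0, 0, 0, 1, 1, 1])    # column -> variable it came from

beta = group_lasso(X, y, groups, lam=0.05)
print(np.round(beta, 3))
```

In this toy design the three columns of the second variable are dropped as a
block while the columns of the first variable stay in together, which is the
behaviour asked about. Real data would need a tuning strategy for lam, for
example cross-validation.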
From: Paige Miller on 21 Jul 2010 16:00

I think I still wasn't clear.

SAS has good tools for linear and non-linear modeling.

SAS has good tools for handling missing values in linear models: PROC MI,
PROC MIANALYZE, or PROC PLS. SAS does not (as far as I know) have good tools
for handling missing values in non-linear modeling.

I don't see how PROC GAM handles non-linear models with continuous
variables. PROC NLIN is the procedure that will fit almost any non-linear
model you can devise; however, as far as I know, the only missing-value
handling in PROC NLIN is to remove from the fitting algorithm any
observation that has even one missing value in the IVs or DV.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
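
The fallback described in the post above, dropping every observation that has
a missing value in any IV or in the DV, is usually called listwise or
complete-case deletion. The following is only a small NumPy sketch (Python,
not SAS) of what that does to a data set; the helper name complete_cases and
the numbers are invented for illustration.

```python
import numpy as np

def complete_cases(X, y):
    """Keep only rows with no missing value in any IV or in the DV."""
    keep = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    return X[keep], y[keep], keep

# Hypothetical data: 6 observations, 3 IVs, a few scattered missing cells.
X = np.array([
    [1.0,    2.0,    3.0],
    [np.nan, 1.0,    0.5],
    [0.3,    0.1,    np.nan],
    [2.0,    2.5,    1.0],
    [1.5,    np.nan, 2.0],
    [0.7,    0.9,    1.1],
])
y = np.array([1.0, 2.0, 3.0, np.nan, 5.0, 6.0])

Xc, yc, keep = complete_cases(X, y)
print(keep)       # which observations survive
print(Xc.shape)   # rows left for fitting
```

Only four cells are missing here, yet four of the six observations are
discarded. With many IVs that each have holes "here and there", this can
leave very little data to fit.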
From: kangtsui on 22 Jul 2010 14:00

Sorry for confusing you. By mentioning GAM, I was thinking of applying a
generalized additive model to my case: additive, but with possibly
non-linear forms of the IVs. I never thought of running models with
non-linear forms. I could have a link function on my DV, but the form of the
model (the right-hand side of the equation, in other words) should be as
simple as linear: additive in components of the IVs after transformation,
whether polynomial, spline, or some other form.

Thanks.

Jun
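
The model form described in this post, a right-hand side that is additive in
transformed IVs but still linear in its coefficients, can be fitted by
ordinary least squares once each IV is expanded into basis columns. Here is a
minimal NumPy sketch (Python, not SAS) assuming a cubic polynomial basis per
IV and an identity link. The helper names poly_basis and additive_design and
the toy data are made up for the example; a spline basis would slot in the
same way.

```python
import numpy as np

def poly_basis(x, degree=3):
    """Columns x, x^2, ..., x^degree for a single IV."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def additive_design(columns, degree=3):
    """Intercept plus one polynomial block per IV, with no interaction terms."""
    blocks = [np.ones((len(columns[0]), 1))]
    blocks += [poly_basis(x, degree) for x in columns]
    return np.hstack(blocks)

# Hypothetical data on a 10 x 5 grid: y is quadratic (non-monotonic) in x1
# and linear in x2.
x1 = np.repeat(np.linspace(-1.0, 1.0, 10), 5)
x2 = np.tile(np.array([-1.0, -0.5, 0.0, 0.5, 1.0]), 10)
y = 1.0 + x1 ** 2 - 2.0 * x2

A = additive_design([x1, x2], degree=3)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficient order: intercept, x1, x1^2, x1^3, x2, x2^2, x2^3
print(np.round(coef, 6))
```

Because the fit is an ordinary linear regression on the expanded columns, any
linear-model selection method can in principle be applied to it, although the
columns belonging to one IV then form a group that should be kept or dropped
together.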
From: Paige Miller on 22 Jul 2010 16:01

If that's what you are considering, transformation of the DV, then any of
the SAS modeling procedures might work. Again, I do think you should
investigate PROC PLS: not only does it "impute" missing values, as I
mentioned, but the algorithm works well when you have many correlated IVs.

In general, you would choose the modeling algorithm independently of how you
handle missing values. One doesn't determine the other. But as a practical
matter, you have only a limited number of software options for handling the
missing values, and many options for modeling, which is why PROC PLS looks
good to me.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com