From: Christoph on 28 Jun 2010 07:58

Hi!

I am trying to run a regression to explain the price of certain objects. I have identified about 30 variables that might be significant, but the problem is that some of them are mutually exclusive.

Example: some of the objects are flat, so their size is in square inches; some are 3-dimensional, so it's in cubic inches. I cannot put these two numbers in one vector, but when I put them in 2 vectors, I have at least one NaN in each observation, so MATLAB disregards all of them.

I don't want to run two regressions, because most of the other factors apply to both 2- and 3-dimensional objects and I want to use all observations (besides, the 2D/3D thing is not the only mutually exclusive variable, so I would have to run 8 or more regressions and would "lose" a lot of observations).

Do you know what I can do to compute the regression using all observations? Is this possible at all?

Thank you very much in advance!!

Christoph
From: Peter Perkins on 28 Jun 2010 12:34

On 6/28/2010 11:58 AM, Christoph wrote:
> Hi!
>
> I am trying to run a regression to explain the price of certain objects. I have identified about 30 variables that might be significant, but the problem is that some of them are mutually exclusive.
>
> Example: some of the objects are flat, so their size is in square inches, some are 3 dimensional, so it's in cubic inches. I cannot put these two numbers in one vector, but when I put them in 2 vectors, I have at least one NaN in each observation, so matlab disregards all of them.

I suspect what you want is an interaction term for "shape" that gives you one coef for 2D objects and another for 3D. The model might look something like

y = intercept + I_2D*sizeCoef_2D*size + I_3D*sizeCoef_3D*size + ...

where I_2D and I_3D are indicator functions (1 for objects of that shape, 0 otherwise). In other words, you want an interaction between "shape" and "size", and probably no main effect for either (though it's hard to say without knowing what else is in your model).

If you have the Statistics Toolbox, you can get this with a combination of dummy variables and the REGSTATS function, and perhaps the X2FX function. If not, it's not too hard to build up the appropriate design matrix X by hand, just by taking products of the binary dummy vars with the size var.

Hope this helps.
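The by-hand construction described in the reply above can be sketched in a few lines of base MATLAB. This is only an illustration under stated assumptions: the data, the variable names (sizeVal, is2D, is3D, trueBeta) and the coefficient values are all made up, and the response is generated without noise so that the fit can be checked against known coefficients.

```matlab
% Hypothetical data (not from the thread): one size value per object,
% an area for 2D objects and a volume for 3D objects.
sizeVal = [10; 15; 100; 12; 140; 8; 120; 20];
is3D    = [0; 0; 1; 0; 1; 0; 1; 0];    % indicator I_3D
is2D    = 1 - is3D;                    % indicator I_2D

% Design matrix for y = intercept + I_2D*b2D*size + I_3D*b3D*size.
% Each product column is zero for rows of the other shape.
X = [ones(size(sizeVal)), is2D .* sizeVal, is3D .* sizeVal];

% Assumed "true" coefficients: intercept, 2D slope, 3D slope.
trueBeta = [2; 0.3; 0.01];
y = X * trueBeta;                      % noise-free response, for checking only

% Ordinary least squares with the backslash operator (no toolbox needed).
betaHat = X \ y;
disp([trueBeta, betaHat]);             % the two columns should agree
```

With real data, y would be the observed price (or log price), and one column per additional predictor would be appended to X before solving.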
From: Richard Willey on 28 Jun 2010 12:42

Hi Christoph

Based on your example - one variable shows size in square inches, another shows volume in cubic inches - I'd be very worried that your independent variables are correlated with one another.

I would strongly recommend that you start by looking at some Statistics Toolbox demos that discuss how to handle this type of problem. The Partial Least Squares / Principal Component Regression demo is probably your best starting point.

http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/plspcrdemo.html

Once you feel comfortable working with multiple regression in MATLAB you can turn your attention to the "Missing Data" problem. There are a number of different techniques that you can use to compensate for missing data. Casewise deletion is the easiest (especially if you have a lot of data).

If you really need to impute values for the missing data, your best bet is to look at the "mvregress" function in Statistics Toolbox. "mvregress" includes an option to use the ECM algorithm to handle missing data (ECM = Expectation Conditional Maximization).

regards,
Richard

"Christoph" <ennobehrens(a)web.de> wrote in message news:334561268.17344.1277740720511.JavaMail.root(a)gallium.mathforum.org...
> Hi!
>
> I am trying to run a regression to explain the price of certain objects. I
> have identified about 30 variables that might be significant, but the
> problem is that some of them are mutually exclusive.
>
> Example: some of the objects are flat, so their size is in square inches,
> some are 3 dimensional, so it's in cubic inches. I cannot put these two
> numbers in one vector, but when I put them in 2 vectors, I have at least
> one NaN in each observation, so matlab disregards all of them.
>
> I don't want to run two regressions, because most of the other factors
> apply to both 2 and 3 dimensional objects and I want to use all
> observations (besides, the 2D/3D thing is not the only mutually exclusive
> variable, so I would have to run 8 or more regressions and would "lose" a
> lot of observations).
>
> Do you know what I can do to compute the regression using all
> observations? Is this possible at all?
>
> Thank you very much in advance!!
>
> Christoph
From: Christoph on 29 Jun 2010 03:25

Thanks Peter, that's what I meant. But are you sure that this leads to unbiased results? I was thinking that when I use dummy variables, I would have zeros instead of NaNs, and wouldn't the betas be biased then?

I'm sorry, I should've been more precise before. Here is a simplified example of how part of my datasheet could look, with the two size vectors "area" and "volume" and a third vector "color" (I want to regress the log price of the objects on them):

area    volume    color
10      NaN       red
15      NaN       blue
NaN     100       yellow
12      NaN       red
NaN     140       blue

and when I use dummy vars:

area    volume    color
10      0         red
15      0         blue
0       100       yellow
12      0         red
0       140       blue

What I would like to do is run a regression that uses the information in "area" to estimate the beta for "area", disregarding the NaN in "volume" without assigning a zero to that value. And the same for the "volume" coeff as well.

But then: even if this is possible, wouldn't the coeffs still be biased, because the variable "color" could have a different impact on Y depending on whether "area" or "volume" is included in the regression for each row of Xi?

Thus, my dilemma is: either I use dummy variables, always include both size variables and have the "zero bias", OR I somehow use only one of the variables at a time and have the biased impact on the other variables...

Does anyone have a solution for this?

Btw, the datasheet has around 3000 obs and 30 to 40 independent variables.

Thanks a lot!!

PS @ Richard: thanks, but my problem is not that the values for the second size variable are missing. I only have values for EITHER of the two; the objects are either 2D or 3D. So I should not impute values for the "missing" data...
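As a rough sketch of what the zero-filled table means in terms of a design matrix, the five example rows can be encoded as below in base MATLAB. The names is3D, area0, volume0, isBlue and isYellow are hypothetical, and both the separate shape indicator column and the choice of 'red' as baseline color are assumptions of this sketch, not something stated in the thread. A zero in a size column only means that the corresponding coefficient contributes nothing to the fitted value of that row; the shape indicator lets 2D and 3D objects have different intercepts and can be dropped if a common intercept is wanted.

```matlab
% The five example rows from the post above.
area   = [10; 15; NaN; 12; NaN];
volume = [NaN; NaN; 100; NaN; 140];
color  = {'red'; 'blue'; 'yellow'; 'red'; 'blue'};

% Shape indicator: 1 where the object has a volume instead of an area.
is3D = double(isnan(area));

% Replace the "not applicable" NaNs by zeros.
area0   = area;    area0(isnan(area))     = 0;
volume0 = volume;  volume0(isnan(volume)) = 0;

% Color dummies, with 'red' as the (assumed) baseline category.
isBlue   = double(strcmp(color, 'blue'));
isYellow = double(strcmp(color, 'yellow'));

% Columns: intercept, 3D offset, area slope, volume slope, blue, yellow.
X = [ones(5, 1), is3D, area0, volume0, isBlue, isYellow];
disp(X);
```

With the full data set (around 3000 rows), the coefficients could then be estimated from X and the log price y, for example with X \ y.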