From: Christoph on
Hi!

I am trying to run a regression to explain the price of certain objects. I have identified about 30 variables that might be significant, but the problem is that some of them are mutually exclusive.

Example: some of the objects are flat, so their size is in square inches; some are 3-dimensional, so it's in cubic inches. I cannot put these two numbers in one vector, but when I put them in 2 vectors, I have at least one NaN in each observation, so MATLAB disregards all of them.

I don't want to run two regressions, because most of the other factors apply to both 2- and 3-dimensional objects and I want to use all observations (besides, the 2D/3D thing is not the only mutually exclusive variable, so I would have to run 8 or more regressions and would "lose" a lot of observations).

Do you know what I can do to compute the regression using all observations? Is this possible at all?

Thank you very much in advance!!

Christoph
From: Peter Perkins on
On 6/28/2010 11:58 AM, Christoph wrote:
> Hi!
>
> I am trying to run a regression to explain the price of certain objects. I have identified about 30 variables that might be significant, but the problem is that some of them are mutually exclusive.
>
> Example: some of the objects are flat, so their size is in square inches, some are 3 dimensional, so it's in cubic inches. I cannot put these two numbers in one vector, but when I put them in 2 vectors, I have at least one NaN in each observation, so matlab disregards all of them.

I suspect what you want is an interaction term for "shape" that gives
you one coef for 2D objects and another for 3D. The model might look
something like

y = intercept + I_2D*sizeCoef_2D*size + I_3D*sizeCoef_3D*size + ...

where I_2D and I_3D are indicator functions. In other words, you want an
interaction between "shape" and "size", and probably no main effect for
either (though it's hard to say without knowing what else is in your
model). If you have the Statistics Toolbox, you can get this with a
combination of dummy variables and the REGSTATS function, and perhaps
the X2FX function. If not, it's not too hard to build up the
appropriate design matrix X by hand, just by taking products of the
binary dummy vars with the size var.
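
Building the design matrix by hand might look roughly like the sketch
below. The data values and the names sz, is3D and b are made up for
illustration, and only base MATLAB is used (no Statistics Toolbox).

```matlab
% Hypothetical example data (not from the thread): 3 flat and 3 solid objects.
sz   = [10; 15; 12; 100; 140; 120];      % square inches for 2D, cubic inches for 3D
is3D = logical([0; 0; 0; 1; 1; 1]);      % shape indicator
y    = [2.1; 2.6; 2.3; 3.0; 3.5; 3.2];   % response, e.g. log price

I_2D = double(~is3D);                    % 1 for flat objects, 0 otherwise
I_3D = double(is3D);                     % 1 for solid objects, 0 otherwise

% Design matrix: intercept, size term for 2D rows, size term for 3D rows.
X = [ones(size(y)), I_2D .* sz, I_3D .* sz];

% Least-squares estimates: [intercept; sizeCoef_2D; sizeCoef_3D]
b = X \ y;
```

Here b(2) plays the role of sizeCoef_2D and b(3) that of sizeCoef_3D in
the model above.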

Hope this helps.
From: Richard Willey on
Hi Christoph

Based on your example - one variable shows size in square inches, another shows
volume in cubic inches - I'd be very worried that your independent variables
are correlated with one another. I would strongly recommend that you start
by looking at some Statistics Toolbox demos that discuss how to handle this
type of problem.

The Partial Least Squares / Principal Component Regression demo is probably
your best starting point.
http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/plspcrdemo.html

Once you feel comfortable working with multiple regression in MATLAB you can
turn your attention to the "Missing Data" problem.

There are a number of different techniques that you can use to compensate
for missing data.

Casewise deletion is the easiest (especially if you have a lot of data).
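
A minimal sketch of casewise deletion in base MATLAB, using made-up values
and hypothetical variable names. Note that when two columns are mutually
exclusive, as described in the original question, every row contains a NaN
and nothing is kept.

```matlab
% Made-up predictor columns in which exactly one of the two is NaN per row.
area   = [10; 15; NaN; 12; NaN];
volume = [NaN; NaN; 100; NaN; 140];
X = [area, volume];

% Casewise deletion: keep only rows with no NaN in any column.
complete  = ~any(isnan(X), 2);
Xcomplete = X(complete, :);

nKept = sum(complete);   % 0 here, because every row has one NaN
```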

If you really need to impute values for the missing data your best bet is
to look at the "mvregress" function in Statistics Toolbox. "mvregress"
includes an option to use the ECM algorithm to handle missing data (ECM =
Expectation Conditional Maximization).

regards,

Richard

"Christoph" <ennobehrens(a)web.de> wrote in message
news:334561268.17344.1277740720511.JavaMail.root(a)gallium.mathforum.org...
> Hi!
>
> I am trying to run a regression to explain the price of certain objects. I
> have identified about 30 variables that might be significant, but the
> problem is that some of them are mutually exclusive.
>
> Example: some of the objects are flat, so their size is in square inches,
> some are 3 dimensional, so it's in cubic inches. I cannot put these two
> numbers in one vector, but when I put them in 2 vectors, I have at least
> one NaN in each observation, so matlab disregards all of them.
>
> I dont want to run two regressions, because most of the other factors
> apply to both 2 and 3 dimensional objects and I want to use all
> observations (besides, the 2D/3D thing is not the only mutually exclusive
> variable, so I would have to run 8 or more regressions and would "loose" a
> lot of observations).
>
> Do you know what can I do to compute the regression using all
> observations? is this possible at all?
>
> Thank you very much in advance!!
>
> Christoph


From: Christoph on
Thanks Peter, that's what I meant.

But are you sure that this leads to unbiased results? I was thinking that when I use dummy variables, I would have zeros instead of NaNs, and wouldn't the betas be biased then?

I'm sorry, I should have been more precise before. Here is a simplified example of what part of my data sheet could look like, with the two size vectors "area" and "volume" and a third "color" vector (I want to regress the log price of the objects on them):

area    volume    color
10      NaN       red
15      NaN       blue
NaN     100       yellow
12      NaN       red
NaN     140       blue

And when I use dummy vars:

area    volume    color
10      0         red
15      0         blue
0       100       yellow
12      0         red
0       140       blue
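
The zero-filled columns in this second table are exactly the
indicator-times-size products of an interaction model, which the following
sketch checks using the five rows above. The names is2D, is3D, sz, areaZero
and volZero are introduced here for illustration only.

```matlab
% The five example rows, with NaN where a measurement does not apply.
area   = [10; 15; NaN; 12; NaN];
volume = [NaN; NaN; 100; NaN; 140];

is2D = ~isnan(area);       % shape indicators recovered from the NaN pattern
is3D = ~isnan(volume);

% One combined size vector: area for 2D rows, volume for 3D rows.
sz = area;
sz(is3D) = volume(is3D);

% Zero-filled versions of the two columns, as in the second table.
areaZero = area;    areaZero(isnan(area))  = 0;
volZero  = volume;  volZero(isnan(volume)) = 0;

% The zero-filled columns coincide with the interaction terms I_2D*size and I_3D*size.
sameArea = isequal(areaZero, is2D .* sz);
sameVol  = isequal(volZero,  is3D .* sz);
```

Both sameArea and sameVol come out true, so a zero in these columns only
switches the corresponding term off for rows where it does not apply.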


What I would like to do is to run a regression that uses the information in "area" to estimate the beta for "area", disregarding the NaN in "volume" without assigning a zero to that value. And the same for the "volume" coeff as well.

But then: even if this is possible, wouldn't the coeffs still be biased, because the variable "color" could have a different impact on Y depending on whether "area" or "volume" is included in the regression for each row of Xi?

Thus, my dilemma is: either I use dummy variables, always include both size variables and have the "zero bias", OR I somehow use only one of the variables at a time and have the biased impact on the other variables...
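
One way to look at this dilemma, sketched with made-up data: if every
regressor is given a separate 2D copy and 3D copy, a single pooled
least-squares fit returns the same coefficients as two separate regressions,
so the zeros themselves introduce no bias. Whether "color" needs its own
per-shape coefficient or can share one is then a modelling choice. All names
and numbers below are hypothetical, and only base MATLAB is used.

```matlab
% Made-up data: 20 flat and 20 solid objects (illustration only).
rng(0);                                   % make the random draws repeatable
n    = 40;
is3D = (1:n)' > n/2;                      % rows 1-20 are 2D, rows 21-40 are 3D
sz   = [10 + 5*rand(n/2,1); 100 + 40*rand(n/2,1)];
blue = double(mod((1:n)', 2) == 0);       % hypothetical colour dummy
y    = 1 + 0.1*sz.*(~is3D) + 0.01*sz.*is3D + 0.3*blue + 0.1*randn(n,1);

d2 = double(~is3D);
d3 = double(is3D);

% Pooled model in which every regressor has a 2D copy and a 3D copy.
% The zeros only switch a term off for rows where it does not apply.
Xpool = [d2, d2.*sz, d2.*blue, d3, d3.*sz, d3.*blue];
bPool = Xpool \ y;

% Two separate regressions, one per shape.
b2 = [ones(n/2,1), sz(~is3D), blue(~is3D)] \ y(~is3D);
b3 = [ones(n/2,1), sz(is3D),  blue(is3D)]  \ y(is3D);

% The pooled coefficients match the separate fits up to rounding error.
maxDiff = max(abs(bPool - [b2; b3]));

% Restricted alternative: one shared colour coefficient for both shapes.
Xshared = [d2, d2.*sz, d3, d3.*sz, blue];
bShared = Xshared \ y;
```

The restricted fit (bShared) uses all observations to estimate the colour
effect, at the cost of assuming that effect is the same for 2D and 3D
objects.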

Does anyone have a solution for this?

By the way, the data sheet has around 3000 observations and 30 to 40 independent variables.

Thanks a lot!!



PS @ Richard: thanks, but my problem is not that the values for the second size variable are missing. I only have values for EITHER of the two because the objects are either 2D or 3D, so I should not impute values for the "missing" data...



> On 6/28/2010 12:34 PM, Peter Perkins wrote:
>
> I suspect what you want is an interaction term for
> "shape" that gives
> you one coef for 2D objects and another for 3D. The
> model might look
> something like
>
> y = intercept + I_2D*sizeCoef_2D*size + I_3D*sizeCoef_3D*size + ...
>
> where I_2D and I_3D are indicator functions. In other
> words, you want an
> interaction between "shape" and "size", and probably
> no main effect for
> either (though its hard to say without knowing what
> else is in your
> model). If you have the Statistics Toolbox, you can
> get this with a
> combination of dummy variables and the REGSTATS
> function, and perhaps
> the X2FX function. If not, it's not too hard to
> build up the
> appropriate design matrix X by hand, just by taking
> products of the
> binary dummy vars with the size var.
>
> Hope this helps.