From: condor on 1 Jun 2010 00:48

I have an easy question...

n = number of observations
p = number of terms included in the final model

I performed two kinds of stepwisefit regressions (for a total of 40,000 regressions):
a) between 20,000 Y and 10,000 possible regressors with n=5 (5 years); the regressors considered are always the same for every Y
b) between 20,000 Y and 10,000 possible regressors with n=30 (30 years); the regressors considered are always the same for every Y

In a) the number of terms included is never > 3; in b) the number of terms included is never > 24. The R-squared is always very high and there isn't multicollinearity.

Since the number of coefficients included is a p-by-1 vector (so I could have 5 terms included in a) and 30 terms included in b)), I have two questions:

1) Is there something strange?
2) Is there a relation (in Matlab) between the number of terms included and the number of observations? Or is it just a coincidence?

THANKS
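A minimal sketch of the kind of loop being described, assuming the candidate regressors sit in an n-by-10000 matrix X and the responses in an n-by-20000 matrix Y (the variable names are illustrative, not taken from the original post):

    % X: n-by-10000 candidate regressors (same for every response)
    % Y: n-by-20000 responses, with n = 5 in case a) and n = 30 in case b)
    numTerms = zeros(1, size(Y, 2));
    for k = 1:size(Y, 2)
        % inmodel is a logical vector marking which regressors stepwisefit kept
        % in the final model (the intercept is added automatically and is not
        % counted here)
        [b, se, pval, inmodel] = stepwisefit(X, Y(:, k), 'display', 'off');
        numTerms(k) = sum(inmodel);
    end
    max(numTerms)   % reported as never > 3 when n = 5, never > 24 when n = 30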
From: Tom Lane on 1 Jun 2010 10:16

> I performed two kinds of stepwisefit regressions:
> a) between 20,000 Y and 10,000 possible regressors with n=5 (5 years);
> the regressors considered are always the same for every Y
> b) between 20,000 Y and 10,000 possible regressors with n=30 (30 years);
> the regressors considered are always the same for every Y
> In a) the number of terms included is never > 3; in b) the number of terms
> included is never > 24. The R-squared is always very high and there isn't
> multicollinearity.
>
> Since the number of coefficients included is a p-by-1 vector (so I could
> have 5 terms included in a) and 30 terms included in b)), I have two
> questions:

If I understand you correctly, you have 5 observations on many variables. You also have an implied constant term. So 4 predictors would lead to a saturated model with no ability to estimate error. When stepwisefit has 3 predictors, it's not possible to compute the significance of a 4th one, so it would always stop short of that.

I can't think of any reason why 24 ought to be a strict upper limit on the number of predictors when you have 30 observations.

-- Tom
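To make the counting in Tom's reply concrete, here is the degrees-of-freedom bookkeeping for case a), written out as a small, purely illustrative calculation:

    n   = 5;              % observations
    p   = 3;              % predictors currently in the model
    dfe = n - (p + 1);    % error df with an intercept: 5 - 4 = 1
    % Adding a 4th predictor would leave dfe = 5 - 5 = 0: the fit passes
    % exactly through all 5 points, and there is no residual variance left
    % to judge whether that 4th predictor is significant.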
From: condor on 1 Jun 2010 21:54

"Tom Lane" <tlaneATmathworksDOTcom(a)nospam.com> wrote in message <hu34or$8h9$1(a)fred.mathworks.com>...
> If I understand you correctly, you have 5 observations on many variables.
> You also have an implied constant term. So 4 predictors would lead to a
> saturated model with no ability to estimate error. When stepwisefit has 3
> predictors, it's not possible to compute the significance of a 4th one, so
> it would always stop short of that.
>
> I can't think of any reason why 24 ought to be a strict upper limit on the
> number of predictors when you have 30 observations.
>
> -- Tom

Yes, I have 5 observations on many variables, and stepwisefit in Matlab automatically includes a constant term. When you say:

> When stepwisefit has 3 predictors, it's not possible to compute the
> significance of a 4th one

I don't understand why... With 5 observations I should have a maximum of 4 predictors plus the constant term, shouldn't I?

As written above, the R-squared is almost always near 1 (from 0.995 to 1). The number of predictors used varies from 1 to 3 (plus the constant term). What is strange is that in the 20,000 regressions I ran, I haven't found a single one with 4 predictors (plus the constant term).
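One way to see why the search stops at 3 is that with 5 observations, an intercept plus 4 predictors already fits the data exactly. A quick check with made-up numbers (base MATLAB only, not the original data):

    X = randn(5, 4);              % any 4 generic predictors
    y = randn(5, 1);              % any 5 responses
    b = [ones(5,1) X] \ y;        % 5 equations, 5 unknowns: exact solution
    r = y - [ones(5,1) X] * b;    % residuals are zero up to round-off
    norm(r)                       % on the order of 1e-15

With zero residuals the R-squared is exactly 1 and the residual degrees of freedom are 0, which is the saturation Tom describes.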
From: Tom Lane on 2 Jun 2010 13:36
>> When stepwisefit has 3 predictors, it's not possible to compute the
>> significance of a 4th one
>
> I don't understand why... With 5 observations I should have a maximum of
> 4 predictors plus the constant term, shouldn't I?

In the process of deciding whether to add a predictor, we (the stepwisefit function) compute F statistics that compare:

1. The reduction in error sum of squares that results from adding the predictor.
2. The remaining error sum of squares after adding the predictor.

If adding a new predictor will saturate the model, we're guaranteed that #1 is 100% of the error sum of squares, and #2 is zero. So we can't meaningfully test one against the other.

-- Tom
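For reference, the comparison described here is the usual partial F test for adding a single term; written in the standard textbook form (not copied from the toolbox source):

    F = [ (SSE_without - SSE_with) / 1 ] / [ SSE_with / dfe_with ]

With n = 5, an intercept, and 3 terms already in the model, adding a 4th term makes SSE_with = 0 and dfe_with = 0, so the denominator vanishes and no F statistic or p-value can be formed. That explains the hard stop at 3 terms in case a); as noted earlier in the thread, the 24-term ceiling seen in case b) has no such mechanical explanation.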