WQD.2 - Multiple Regression


Sample R code for Multiple Regression

A linear regression model is fitted to the training data.

Model I: All predictors in the model

regression coefficients table

Because of its extremely high VIF, density was removed from the model. Other predictors also have high VIFs, but they were not removed at this step.
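The VIF of a predictor is 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing that predictor on all the others. The following is a minimal base-R sketch of that calculation, not the course's own code. It uses the built-in mtcars data as a stand-in because the wine data are not reproduced here, and vif_manual is a hypothetical helper name.

```r
# Stand-in data: the built-in mtcars data set, NOT the wine quality data.
dat <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared
# from regressing predictor j on all the other predictors.
vif_manual <- function(data, response) {
  predictors <- setdiff(names(data), response)
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(dat, response = "mpg")
print(round(vifs, 2))

# Drop the predictor with the largest VIF and recompute, mirroring the step
# from Model I to Model II.
worst <- names(which.max(vifs))
vifs_reduced <- vif_manual(dat[, setdiff(names(dat), worst)], response = "mpg")
print(round(vifs_reduced, 2))
```

If the car package is installed, car::vif() applied to the fitted lm object should report the same values for a model with only numeric predictors.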

Model II: After removal of density (VIFs improved)

regression coefficients table

Not all predictors are significant, so a forward selection method is used to build a working model. The sample R output follows:

R output (forward selection steps)
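A forward selection of this kind might be run in base R with step(), as sketched below. This is an illustration under assumptions, not the course's code: the built-in mtcars data stand in for the wine data, and the entry criterion is AIC (the default of step()), since the criterion actually used is not stated here.

```r
# Stand-in data: built-in mtcars, NOT the wine quality data.
dat <- mtcars

# Start from the intercept-only model; the full model defines the search scope.
null_fit <- lm(mpg ~ 1, data = dat)
full_fit <- lm(mpg ~ ., data = dat)

# Forward selection: at each step add the term that lowers AIC the most,
# stopping when no addition improves AIC.
working_fit <- step(null_fit,
                    scope = list(lower = formula(null_fit),
                                 upper = formula(full_fit)),
                    direction = "forward",
                    trace = 1)  # trace = 1 prints each step

summary(working_fit)
```

Setting trace = 0 suppresses the step-by-step printout and returns only the selected model.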

Model III: Working model

regression coefficients table

Sample R output:

R output

Note that the multiple R² is 25%. Regression diagnostics are examined for possible improvement of the model.

distribution of residuals

The residuals have an approximately symmetric distribution, but there appear to be outliers at both ends. Partial residual plots are given below. Note the pattern in the fitted-value plot: it arises because the response takes only integer values but has been treated as continuous.

fitted values plot

Outliers and leverage points are identified through the following:

  • Studentized deleted residuals (a point is an outlier if its residual is outside the [-3, 3] limits)
  • DFITS (a point is an outlier if its DFITS value is outside the [-1, 1] limits)
  • Cook’s distance

All three plots are given below. Note that no point is identified as an outlier by its DFITS value.

studentized residuals and DFITS plots

Cook's distance plot

Only 26 points are identified as outliers by the above criteria. A final model is fitted after eliminating these points, and a slight improvement in the R² value is noted.
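The three diagnostics and the refit described above can be sketched in base R as follows. This is an illustration only. The data are simulated rather than the wine data, the variable names are hypothetical, and because no cutoff for Cook's distance is stated, the common 4/n rule of thumb is assumed.

```r
# Stand-in data: simulated, NOT the wine quality data.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(n)
dat$y[1:3] <- dat$y[1:3] + 8   # plant three gross outliers

fit <- lm(y ~ x1 + x2, data = dat)

# The three diagnostics named above, all from base R's stats package.
rstud <- rstudent(fit)        # studentized deleted residuals
dfits <- dffits(fit)          # DFITS (the R function is dffits)
cookd <- cooks.distance(fit)  # Cook's distance

# Cutoffs: [-3, 3] and [-1, 1] come from the text. The 4/n cutoff for
# Cook's distance is an assumption (a common rule of thumb).
flagged <- abs(rstud) > 3 | abs(dfits) > 1 | cookd > 4 / n
cat("points flagged:", sum(flagged), "\n")

# Refit without the flagged points and compare R-squared.
fit_final <- lm(y ~ x1 + x2, data = dat[!flagged, ])
cat("R-squared before:", round(summary(fit)$r.squared, 4), "\n")
cat("R-squared after: ", round(summary(fit_final)$r.squared, 4), "\n")
```

Dropping points this way changes the population the model describes, so the flagged rows are worth inspecting before they are removed.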

Model IV: Final model

regression coefficients table

Sample R output:

R output

Applying this model to the test data gives a sum of squared differences between the actual and predicted responses of 1196.205, whereas the sum of squared deviations of the actual response is 1554.754. The ratio of these two may be taken as the ratio of the error sum of squares to the total sum of squares. Hence a measure similar to R² may be computed as 1 - 1196.205/1554.754 = 0.2306.
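The calculation above can be checked in R with the short sketch below. The helper test_r2 is a hypothetical name, and it assumes the deviations of the actual response are taken from the mean of the actual test responses. The toy vectors are made-up numbers, not wine data.

```r
# Hypothetical helper: an R-squared-like measure on held-out data.
# Assumption: deviations of the actual response are taken from the mean of
# the actual test responses.
test_r2 <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)     # error sum of squares
  sst <- sum((actual - mean(actual))^2)  # total sum of squares
  1 - sse / sst
}

# Reproduce the arithmetic from the text using the two reported sums.
sse_reported <- 1196.205
sst_reported <- 1554.754
r2_reported <- 1 - sse_reported / sst_reported
print(round(r2_reported, 4))  # 0.2306

# Toy check of the helper with made-up numbers (not wine data).
actual    <- c(5, 6, 5, 7, 6)
predicted <- c(5.5, 5.8, 5.2, 6.1, 5.9)
print(test_r2(actual, predicted))
```

With a fitted model and a held-out data frame, the helper would be called as test_r2(test$quality, predict(fit_final, newdata = test)), where test, quality, and fit_final are placeholder names.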

Sample R code for Final Model