31.1 - Lesson Notes

B. Designed Regression

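It is worth reinforcing the comment on page 287 that the adjusted R-squared is the more appropriate measure of the amount of variation explained by a regression with more than one explanatory variable. Ordinary R-squared never decreases when another predictor is added, even a useless one, whereas adjusted R-squared penalizes each additional predictor.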
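To see the penalty at work, here is a minimal sketch using the usual textbook formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name and the numbers are hypothetical, chosen only for illustration; they do not come from the text.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Return adjusted R-squared given ordinary R-squared,
    the number of observations, and the number of predictors."""
    residual_df = n_obs - n_predictors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n_obs - 1) / residual_df


# Hypothetical example: adding a third predictor nudges R-squared
# from 0.50 up to 0.51 with only 11 observations.
print(adjusted_r_squared(0.50, n_obs=11, n_predictors=2))
print(adjusted_r_squared(0.51, n_obs=11, n_predictors=3))
```

With these made-up numbers, ordinary R-squared goes up slightly when the third predictor is added, but adjusted R-squared goes down (from 0.375 to roughly 0.30), which is the behavior that makes it the better yardstick for comparing models of different sizes.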

D. Stepwise and Other Variable Selection Methods

To drive home the point about how difficult it might be to select the best model when you have a number of predictor variables, let's make it concrete. Suppose we have five possible predictor variables: a, b, c, d, and e. In that case, there'd be as many as 31 different regression models (2^5 - 1, one for every non-empty subset of the predictors) that we'd have to try out on our data:

  • 5 models with just one predictor variable: a, b, c, d, and e
  • 10 models with two predictor variables: ab, ac, ad, ae, bc, bd, be, cd, ce, and de
  • 10 models with three predictor variables: abc, abd, abe, acd, ace, ade, bcd, bce, bde, and cde
  • 5 models with four predictor variables: abcd, abce, abde, acde, and bcde
  • 1 model with all five predictor variables: abcde
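The count above can be checked by brute force. The following is a small sketch that enumerates every non-empty subset of the five predictors using Python's standard itertools.combinations; the variable names are just for illustration.

```python
from itertools import combinations

predictors = ["a", "b", "c", "d", "e"]

# Build every candidate model: each non-empty subset of the predictors.
models = []
for size in range(1, len(predictors) + 1):
    for subset in combinations(predictors, size):
        models.append("".join(subset))

# Tally how many models there are for each number of predictors.
counts_by_size = {
    size: sum(1 for m in models if len(m) == size)
    for size in range(1, len(predictors) + 1)
}

print(counts_by_size)  # models per number of predictors
print(len(models))     # total number of candidate models
```

The total doubles (plus one) with each added predictor, so with ten candidate predictors there would already be 1,023 models to consider. That growth is why automated selection procedures are attractive.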

Page 295. Did we learn anything from this study? The best predictor of reading achievement at the end of the sixth grade is reading achievement at the end of the fifth grade. Hmmm.

G. Logistic Regression

Page 309. Sensitivity and specificity rates are typically used in quantifying the value of a diagnostic test. Sensitivity answers the question: given that a person has a disease, what is the probability that the diagnostic test will detect the disease? Specificity answers the question: given that a person is healthy, what is the probability that the diagnostic test will indicate that the person is healthy? Based on these definitions, it is clear that we want the highest sensitivity and specificity rates we can get. As you can see in the classification table on page 307, though, the two values play off of each other. That is, as the cutoff is lowered and sensitivity increases, specificity generally decreases. The goal is to find the cutoff at which we can live with both the sensitivity and the specificity (or find another diagnostic test!). In the example here, the authors suggest a cutoff of 0.3, so that the sensitivity is high (92%) but the specificity is not too low (45%).
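To make the trade-off concrete, here is a minimal sketch that computes sensitivity and specificity at several cutoffs. It assumes a person is classified as having the disease when the predicted probability is at or above the cutoff. The function name and the ten data points are hypothetical and are not the textbook's data, so the rates will not match the 92% and 45% quoted above.

```python
def sensitivity_specificity(predicted_probs, has_disease, cutoff):
    """Classify as 'disease' when predicted probability >= cutoff,
    then return (sensitivity, specificity)."""
    true_pos = false_neg = true_neg = false_pos = 0
    for prob, diseased in zip(predicted_probs, has_disease):
        predicted_positive = prob >= cutoff
        if diseased and predicted_positive:
            true_pos += 1
        elif diseased:
            false_neg += 1
        elif predicted_positive:
            false_pos += 1
        else:
            true_neg += 1
    sensitivity = true_pos / (true_pos + false_neg)  # P(test + | disease)
    specificity = true_neg / (true_neg + false_pos)  # P(test - | healthy)
    return sensitivity, specificity


# Hypothetical predicted probabilities from a fitted logistic regression:
# the first five people are healthy, the last five have the disease.
probs = [0.1, 0.2, 0.35, 0.4, 0.6, 0.25, 0.5, 0.7, 0.8, 0.9]
disease = [False] * 5 + [True] * 5

for cutoff in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(probs, disease, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this made-up data, raising the cutoff from 0.3 to 0.7 lifts specificity from 0.40 to 1.00 while sensitivity falls from 0.80 to 0.60, the same tug-of-war seen in the classification table.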