5.5 - Further Examples

Example 5-4: Pastry Sweetness Data


A designed experiment is done to assess how the moisture content and sweetness of a pastry product affect a taster’s rating of the product (Pastry dataset). All eight possible combinations of four moisture levels and two sweetness levels are studied, and two pastries are prepared and rated at each combination, so the total sample size is n = 16. The y-variable is the rating of the pastry, and the two x-variables are moisture and sweetness. The levels of the x-variables (and the number of observations at each) were chosen so that the x-variables would be uncorrelated.

Correlation: Moisture, Sweetness

Pearson correlation of Moisture and Sweetness = 0.000

P-Value = 1.000

A plot of moisture versus sweetness (the two x-variables) is as follows:

[Scatterplot of Moisture versus Sweetness]

Notice that the points fall on a rectangular grid, so the correlation between the two variables is 0. (Note that the plot cannot show that there are actually two observations at each point on the grid.)
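This is easy to check numerically. The following minimal Python sketch rebuilds the 4 × 2 design grid with two replicates per cell (the specific levels used here are assumptions; any balanced rectangular grid gives a correlation of exactly zero):

import numpy as np

# The 4 x 2 grid of the designed experiment, two replicates per cell.
# The specific levels (4, 6, 8, 10 and 2, 4) are assumptions.
moisture = np.repeat([4, 6, 8, 10], 4)   # each moisture level appears 4 times
sweetness = np.tile([2, 2, 4, 4], 4)     # two sweetness levels, two replicates each
print(len(moisture))                                      # n = 16
print(round(np.corrcoef(moisture, sweetness)[0, 1], 3))   # 0.0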

The following figure shows how the two x-variables affect the pastry rating.

[Scatterplot of Rating versus Moisture]

There is a linear relationship between rating and moisture and there is also a sweetness difference. The Minitab results given in the following output are for three different regressions - separate simple regressions for each x-variable and a multiple regression that incorporates both x-variables.

Regression Analysis: Rating versus Moisture

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 1566.45 1566.45 54.75 0.000
    Moisture 1 1566.45 1566.45 54.75 0.000
Error 14 400.55 28.61
    Lack-of-Fit 2 15.05 7.52 0.23 0.795
    Pure Error 12 385.50 32.13
Total 15 1967.00

Model Summary

S R-sq R-sq(adj) R-sq(pred)
5.34890 79.64% 78.18% 72.71%

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant 50.77 4.39 11.55 0.000
Moisture 4.425 0.598 7.40 0.000 1.00

Regression Equation

Rating = 50.77 + 4.425 Moisture

Regression Analysis: Rating versus Sweetness

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 1 306.3 306.3 2.58 0.130
    Sweetness 1 306.3 306.3 2.58 0.130
Error 14 1660.8 118.6
Total 15 1967.0

Model Summary

S R-sq R-sq(adj) R-sq(pred)
10.8915 15.57% 9.54% 0.00%

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant 68.63 8.61 7.97 0.000  
Sweetness 4.38 2.72 1.61 0.130 1.00

Regression Equation

Rating = 68.63 + 4.38 Sweetness

Regression Analysis: Rating versus Moisture, Sweetness

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 2 1872.70 936.35 129.08 0.000
    Moisture 1 1566.45 1566.45 215.95 0.000
    Sweetness 1 306.25 306.25 42.22 0.000
Error 13 94.30 7.25
    Lack-of-Fit 5 37.30 7.46 1.05 0.453
    Pure Error 8 57.00 7.13
Total 15 1967.00

Model Summary

S R-sq R-sq(adj) R-sq(pred)
2.69330 95.21% 94.47% 92.46%

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant 37.65 3.00 12.57 0.000  
Moisture 4.425 0.301 14.70 0.000 1.00
Sweetness 4.375 0.673 6.50 0.000 1.00

Regression Equation

Rating = 37.65 + 4.425 Moisture + 4.375 Sweetness

There are three important features to notice in the results:

  1. The sample coefficient that multiplies Moisture is 4.425 in both the simple and the multiple regression. The sample coefficient that multiplies Sweetness is 4.375 in both the simple and the multiple regression. This result does not generally occur; the only reason that it does, in this case, is that Moisture and Sweetness are not correlated, so the estimated slopes are independent of each other. For most observational studies, predictors are typically correlated and estimated slopes in a multiple linear regression model do not match the corresponding slope estimates in simple linear regression models.

  2. The \(R^{2}\) for the multiple regression, 95.21%, is the sum of the \(R^{2}\) values for the simple regressions (79.64% + 15.57%). Again, this happens only because the x-variables are uncorrelated (see the sketch following this list).

  3. The variable Sweetness is not statistically significant in the simple regression (p = 0.130), but it is in the multiple regression. This is a benefit of doing a multiple regression. By putting both variables into the equation, we have greatly reduced the standard deviation of the residuals (notice the S values). This in turn reduces the standard errors of the coefficients, leading to greater t-values and smaller p-values.
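The first two features can be verified with plain least squares. The sketch below reuses the design grid from before and simulates a response from the fitted coefficients (an assumption; the actual Pastry ratings are not reproduced here). The slope equality and the additivity of \(R^{2}\) hold for any response on this balanced grid:

import numpy as np

rng = np.random.default_rng(0)
moisture = np.repeat([4.0, 6.0, 8.0, 10.0], 4)
sweetness = np.tile([2.0, 2.0, 4.0, 4.0], 4)
# Simulated ratings (an assumption; not the real Pastry data).
y = 37.65 + 4.425 * moisture + 4.375 * sweetness + rng.normal(0, 2.7, 16)

def fit(cols, y):
    # Least squares with an intercept; returns (coefficients, R-squared).
    X = np.column_stack([np.ones(len(y)), *cols])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return b, r2

b_m, r2_m = fit([moisture], y)               # simple regression on Moisture
b_s, r2_s = fit([sweetness], y)              # simple regression on Sweetness
b_ms, r2_ms = fit([moisture, sweetness], y)  # multiple regression

print(b_m[1], b_ms[1])     # identical Moisture slopes
print(b_s[1], b_ms[2])     # identical Sweetness slopes
print(r2_m + r2_s, r2_ms)  # the two simple R-squared values sum to the multiple R-squared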

(Data source: Applied Regression Models, 4th edition, by Kutner, Neter, and Nachtsheim.)

Example 5-5: Female Stat Students

The data are from n = 214 females in statistics classes at the University of California at Davis (Stat Females dataset). The variables are y = student’s self-reported height, \(x_{1}\) = student’s guess at her mother’s height, and \(x_{2}\) = student’s guess at her father’s height. All heights are in inches. The scatterplots below show each student’s height versus her mother’s height and versus her father’s height.

[Scatterplots of student’s height versus mother’s height and versus father’s height]

Both show a moderate positive association with a straight-line pattern and no notable outliers.

Interpretations

The Minitab output below shows that the sample multiple regression equation is predicted student height = 18.55 + 0.3035 × mother’s height + 0.3879 × father’s height:

Regression Analysis: Height versus momheight, dadheight

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 2 666.1 333.074 80.73 0.000
    momheight 1 128.1 128.117 31.05 0.000
    dadheight 1 278.5 278.488 67.50 0.000
Error 211 870.5 4.126
    Lack-of-Fit 101 446.3 4.419 1.15 0.242
    Pure Error 110 424.2 3.857
Total 213 1536.6

Model Summary

S R-sq R-sq(adj) R-sq(pred)
2.03115 43.35% 42.81% 41.58%

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant 18.55 3.69 5.02 0.000  
momheight 0.3035 0.0545 5.57 0.000 1.19
dadheight 0.3879 0.0472 8.22 0.000 1.19

Regression Equation

Height = 18.55 + 0.3035 momheight + 0.3879 dadheight

To use this equation for prediction, we substitute specified values for the two parents’ heights.
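For example, a minimal sketch (the parents’ heights here are made up for illustration):

# Hypothetical prediction: mother 66 inches tall, father 70 inches tall.
mom, dad = 66, 70
predicted = 18.55 + 0.3035 * mom + 0.3879 * dad
print(round(predicted, 2))  # 65.73 inches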

We can interpret the “slopes” in the same way that we do for a simple linear regression model, with the added condition that the values of all other x-variables are held constant. For example:

– When the father’s height is held constant, the average student height increases by 0.3035 inches for each one-inch increase in the mother’s height.

– When the mother’s height is held constant, the average student height increases by 0.3879 inches for each one-inch increase in the father’s height.

  • The p-values given for the two x-variables tell us that student height is significantly related to each.
  • The value of \(R^{2}\) = 43.35% means that the model (the two x-variables) explains 43.35% of the observed variation in student heights.
  • The value S = 2.03115 is the estimated standard deviation of the regression errors. Roughly, it is the average absolute size of a residual.
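Both \(R^{2}\) and S can be recomputed from the ANOVA table above; a quick check in Python:

import math

# R-squared is SSR/SSTO, and S is the square root of the MSE,
# all taken from the ANOVA table above.
ssr, ssto, mse = 666.1, 1536.6, 4.126
print(f"R-sq = {ssr / ssto:.2%}")    # 43.35%
print(f"S = {math.sqrt(mse):.5f}")   # 2.03125 (Minitab's 2.03115 uses the unrounded MSE)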

Residual Plots

Just as in simple regression, we can use a plot of residuals versus fits to evaluate the validity of assumptions. The residual plot for these data is shown in the following figure:

[Residuals versus the fitted values plot]

It looks about as it should - a random horizontal band of points. Other residual analyses can be done exactly as we did for simple regression. For instance, we might wish to examine a normal probability plot of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. This might help us identify sources of curvature or non-constant variance.
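For readers working outside Minitab, here is a minimal matplotlib sketch of a residuals-versus-fits plot. The student heights data are not reproduced here, so a synthetic simple fit stands in; with real output, plot the model’s fitted values against its residuals instead:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption; not the Stat Females dataset).
rng = np.random.default_rng(1)
x = rng.uniform(58, 72, 214)
y = 18 + 0.7 * x + rng.normal(0, 2, 214)
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

plt.scatter(fitted, resid)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.title("Residuals Versus the Fitted Values")
plt.show()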

Example 5-6: Hospital Data


Data from n = 113 hospitals in the United States are used to assess factors related to the likelihood that a hospital patient acquires an infection while hospitalized. The variables here are y = infection risk, \(x_{1}\) = average length of patient stay, \(x_{2}\) = average patient age, \(x_{3}\) = measure of how many x-rays are given in the hospital (Hospital Infection dataset). The Minitab output is as follows:

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant 1.00 1.31 0.76 0.448  
Stay 0.3082 0.0594 5.19 0.000 1.23
Age -0.0230 0.0235 -0.98 0.330 1.05
Xray 0.01966 0.00576 3.41 0.001 1.18

Regression Equation

InfctRsk = 1.00 + 0.3082 Stay - 0.0230 Age + 0.01966 Xray

Interpretations for this example include:

  • The p-value for testing the coefficient that multiplies Age is 0.330. Thus we cannot reject the null hypothesis \(H_{0}\colon \beta_{2} = 0\). The variable Age is not a useful predictor within this model, which already includes Stay and Xray (this test is rechecked numerically after the list).
  • For the variables Stay and Xray, the p-values for testing their coefficients are 0.000 and 0.001, respectively, so both are useful predictors of infection risk (within the context of this model!).
  • We usually don’t worry about the p-value for Constant. It has to do with the “intercept” of the model and seldom has any practical meaning. It also doesn’t give information about how changing an x-variable might change y-values.
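As a numerical check on the first bullet, the t-value and p-value for Age can be recomputed from the coefficient and its standard error:

from scipy.stats import t

# The t-value is the coefficient divided by its standard error; the
# two-sided p-value uses n - 4 = 109 error degrees of freedom
# (113 hospitals, 4 estimated coefficients).
coef, se, df = -0.0230, 0.0235, 113 - 4
t_val = coef / se
p_val = 2 * t.sf(abs(t_val), df)
print(f"t = {t_val:.2f}, p = {p_val:.3f}")  # t = -0.98, p = 0.330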

(Data source: Applied Regression Models, 4th edition, by Kutner, Neter, and Nachtsheim.)

Example 5-7: Physiological Measurements Data


For a sample of n = 20 individuals, we have measurements of y = body fat, \(x_{1}\) = triceps skinfold thickness, \(x_{2}\) = thigh circumference, and \(x_{3}\) = midarm circumference (Body Fat dataset). Minitab results for the sample coefficients, the MSE, and \(\left(X^{T} X \right)^{-1}\) are given below:

Regression Analysis: Bodyfat versus Triceps, Thigh, Midarm

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value
Regression 3 396.985 132.328 21.52 0.000
    Triceps 1 12.705 12.705 2.07 0.170
    Thigh 1 7.529 7.529 1.22 0.285
    Midarm 1 11.546 11.546 1.88 0.190
Error 16 98.405 6.150
Total 19 495.390

Model Summary

S R-sq R-sq(adj) R-sq(pred)
2.47998 80.14% 76.41% 67.55%

Coefficients

Term Coef SE Coef T-Value P-Value VIF
Constant 117.1 99.8 1.17 0.258  
Triceps 4.33 3.02 1.44 0.170 708.84
Thigh -2.86 2.58 -1.11 0.285 564.34
Midarm -2.19 1.60 -1.37 0.190 104.61

Regression Equation

Bodyfat = 117.1 + 4.33 Triceps - 2.86 Thigh - 2.19 Midarm

\(\left(X^{T} X \right)^{-1}\) (calculated manually; see the note below):

\[\left(X^{T} X \right)^{-1} = \begin{pmatrix} 1618.87 & 48.8103 & -41.8487 & -25.7988\\ 48.81 & 1.4785 & -1.2648 & -0.7785\\ -41.85 & -1.2648 & 1.0840 & 0.6658\\ -25.80 & -0.7785 & 0.6658 & 0.4139 \end{pmatrix}\]
Note! There is no real need to know how to calculate this matrix in Minitab, but in case you're curious: first store the design matrix, X, under Storage when you run the regression. Then select Calc > Matrices > Transpose to find the transpose of X and store the result as a matrix (M1, say). Then select Calc > Matrices > Arithmetic to multiply the transpose of X by X, again storing the result as a matrix. Then select Calc > Matrices > Invert to invert this matrix, again storing the result as a matrix. Finally, select Data > Display Data to view the final matrix.

The variance-covariance matrix of the sample coefficients is found by multiplying each element of \(\left(X^{T} X \right)^{-1}\) by the MSE. Common notation for the resulting matrix is either \(s^{2}(b)\) or \(se^{2}(b)\). Thus, the standard errors of the coefficients given in the Minitab output can be calculated as follows:

  • Var(\(b_{0}\)) = (6.15031)(1618.87) = 9956.55, so se(\(b_{0}\)) = \(\sqrt{9956.55}\) = 99.782.
  • Var(\(b_{1}\)) = (6.15031)(1.4785) = 9.0932, so se(\(b_{1}\)) = \(\sqrt{9.0932}\) = 3.016.
  • Var(\(b_{2}\)) = (6.15031)(1.0840) = 6.6669, so se(\(b_{2}\)) = \(\sqrt{6.6669}\) = 2.582.
  • Var(\(b_{3}\)) = (6.15031)(0.4139) = 2.54561, so se(\(b_{3}\)) = \(\sqrt{2.54561}\) = 1.595.

As an example of the covariance and correlation between two coefficients, we consider \(b_{1}\) and \(b_{2}\).

  • Cov(\(b_{1}\), \(b_{2}\)) = (6.15031)(−1.2648) = −7.7789. The value -1.2648 is in the second row and third column of \(\left(X^{T} X \right)^{−1}\). (Keep in mind that the first row and first column give information about \(b_0\), so the second row has information about \(b_{1}\), and so on.)
  • Corr(\(b_{1}\), \(b_{2}\)) = covariance divided by product of standard errors = −7.7789 / (3.016 × 2.582) = −0.999.
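These calculations are straightforward to reproduce with numpy, using the MSE and the \(\left(X^{T} X \right)^{-1}\) values given above:

import numpy as np

# Variance-covariance matrix of the coefficients: multiply (X'X)^{-1}
# elementwise by the MSE, then read standard errors off the diagonal.
mse = 6.15031
xtx_inv = np.array([
    [1618.87,  48.81,   -41.85,  -25.80],
    [  48.81,   1.4785,  -1.2648, -0.7785],
    [ -41.85,  -1.2648,   1.0840,  0.6658],
    [ -25.80,  -0.7785,   0.6658,  0.4139],
])
cov_b = mse * xtx_inv
se = np.sqrt(np.diag(cov_b))
print(se.round(2))                              # [99.78  3.02  2.58  1.6], the SE Coef column
print(round(cov_b[1, 2] / (se[1] * se[2]), 3))  # Corr(b1, b2) = -0.999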

The extremely high correlation between these two sample coefficient estimates results from a high correlation between the Triceps and Thigh variables. The consequence is that it is difficult to separate the individual effects of these two variables.

If all x-variables are uncorrelated with each other, then all covariances between pairs of sample coefficients that multiply x-variables will equal 0. This means that the estimate of one beta is not affected by the presence of the other x-variables. Many experiments are designed to achieve this property. With observational data, however, we’ll most likely not have this happen.

(Data source: Applied Regression Models, 4th edition, by Kutner, Neter, and Nachtsheim.)