9.8 - Polynomial Regression Examples

Example 9-5: How is the length of a bluegill fish related to its age?

In 1981, n = 78 bluegills were randomly sampled from Lake Mary in Minnesota. The researchers (Cook and Weisberg, 1999) measured and recorded the following data (Bluegills dataset):

  • Response \(\left(y \right) \colon\) length (in mm) of the fish
  • Potential predictor \(\left(x_1 \right) \colon \) age (in years) of the fish

The researchers were primarily interested in learning how the length of a bluegill fish is related to its age.

A scatter plot of the data:

scatter plot

suggests that there is a positive trend in the data. That is, not surprisingly, as the age of bluegill fish increases, the length of the fish tends to increase. The trend, however, doesn't appear to be quite linear. It appears as if the relationship is slightly curved.

One way of modeling the curvature in these data is to formulate a "second-order polynomial model" with one quantitative predictor:

\(y_i=(\beta_0+\beta_1x_{i}+\beta_{11}x_{i}^2)+\epsilon_i\)

where:

  • \(y_i\) is length of bluegill (fish) \(i\) (in mm)
  • \(x_i\) is age of bluegill (fish) \(i\) (in years)

and the independent error terms \(\epsilon_i\) follow a normal distribution with mean 0 and equal variance \(\sigma^{2}\).

You may recall from your previous studies that "quadratic function" is another name for our formulated regression function. Nonetheless, you'll often hear statisticians referring to this quadratic model as a second-order model, because the highest power on the \(x_i\) term is 2.

Incidentally, observe the notation used. Because there is only one predictor variable to keep track of, the 1 in the subscript of \(x_{i1}\) has been dropped. That is, we use our original notation of just \(x_i\). Also note the double subscript used on the slope term, \(\beta_{11}\), of the quadratic term, as a way of denoting that it is associated with the squared term of the one and only predictor.

The estimated quadratic regression function looks like it does a pretty good job of fitting the data:

estimated quadratic regression function

To answer the following potential research questions, do the procedures identified in parentheses seem reasonable?

  • How is the length of a bluegill fish related to its age? (Describe the nature — "quadratic" — of the regression function.)
  • What is the length of a randomly selected five-year-old bluegill fish? (Calculate and interpret a prediction interval for the response.)

Among other things, the Minitab output:

Analysis of Variance

Source         DF   Adj SS    Adj MS   F-Value  P-Value
Regression      2  35938.0   17969.0    151.07    0.000
  age           1   8252.5    8252.5     69.38    0.000
  age^2         1   2972.1    2972.1     24.99    0.000
Error          75   8920.7     118.9
  Lack-of-Fit   3    108.0      36.0      0.29    0.829
  Pure Error   72   8812.7     122.4
Total          77  44858.7

Model Summary

S        R-sq    R-sq(adj)  R-sq(pred)
10.9061  80.11%  79.58%     78.72%

Coefficients

Term      Coef    SE Coef  T-Value  P-Value  VIF
Constant  13.6    11.0        1.24    0.220
age       54.05    6.49       8.33    0.000  23.44
age^2     -4.719   0.944     -5.00    0.000  23.44

Regression Equation

length = 13.6 + 54.05 age - 4.719 age^2

Predictions for length

Variable  Setting
age       5
age^2     25

Fit      SE Fit   95% CI              95% PI
165.902  2.76901  (160.386, 171.418)  (143.487, 188.318)

tells us that:

  • 80.1% of the variation in the length of bluegill fish is explained by a quadratic function of the age of the fish.
  • We can be 95% confident that the length of a randomly selected five-year-old bluegill fish is between 143.5 and 188.3 mm.
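
As a rough check, the prediction interval limits can be reproduced by hand from the output above. This sketch assumes the usual prediction interval formula for a new response and a t multiplier of about 1.992 for 75 error degrees of freedom; the multiplier is taken from a t table and does not appear in the Minitab output.

\(\hat{y}_h \pm t_{(0.975,\,75)}\sqrt{S^{2}+\left(\text{SE Fit}\right)^{2}} \approx 165.902 \pm 1.992\sqrt{10.9061^{2}+2.76901^{2}} \approx 165.902 \pm 22.41\)

This gives roughly (143.49, 188.32), which agrees with the reported 95% PI up to rounding. Note that plugging age = 5 into the printed equation gives 13.6 + 54.05(5) - 4.719(25) = 165.875 rather than 165.902, because the printed coefficients are rounded.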

Example 9-6: Yield Data Set

This data set of size n = 15 (Yield data) contains measurements of yield from an experiment done at five different temperature levels. The variables are y = yield and x = temperature in degrees Fahrenheit. The table below gives the data used for this analysis.

i Temperature Yield
1 50 3.3
2 50 2.8
3 50 2.9
4 70 2.3
5 70 2.6
6 70 2.1
7 80 2.5
8 80 2.9
9 80 2.4
10 90 3.0
11 90 3.1
12 90 2.8
13 100 3.3
14 100 3.5
15 100 3.0

The figures below give a scatterplot of the raw data and then another scatterplot with lines pertaining to a linear fit and a quadratic fit overlaid. The trend in these data is clearly better suited to a quadratic fit.

fitted line plot for yield
quadratic fitted line plot

Here we have the linear fit results:

Regression Analysis: Yield versus Temp

Model Summary

S         R-sq   R-sq(adj)  R-sq(pred)
0.391312  9.24%  2.26%      0.00%

Coefficients

Term      Coef     SE Coef  T-Value  P-Value  VIF
Constant  2.306    0.469       4.92    0.000
Temp      0.00676  0.00587     1.15    0.271  1.00

Regression Equation

Yield = 2.306 + 0.00676 Temp

Here we have the quadratic fit results:

Polynomial Regression Analysis: Yield versus Temp

Model Summary

S         R-sq    R-sq(adj)  R-sq(pred)
0.244399  67.32%  61.87%     46.64%

Coefficients

Term       Coef      SE Coef   T-Value  P-Value  VIF
Constant   7.96      1.26         6.32    0.000
Temp       -0.1537   0.0349      -4.40    0.001  90.75
Temp*Temp  0.001076  0.000233     4.62    0.001  90.75

Regression Equation

Yield = 7.96 - 0.1537 Temp + 0.001076 Temp*Temp

We see that both temperature and temperature squared are significant predictors in the quadratic model (with p-values of 0.0009 and 0.0006, respectively) and that the fit is much better than the linear fit (R-sq of 67.32% versus 9.24%). From this output, the estimated regression equation is \(\hat{y}_{i}=7.960-0.1537x_{i}+0.001076x_{i}^{2}\). Furthermore, the ANOVA table below shows that the overall model is statistically significant at the 0.05 significance level, with a p-value of 0.001.

Analysis of Variance

Source         DF  Adj SS   Adj MS    F-Value  P-Value
Regression      2  1.47656  0.738282    12.36    0.001
  Temp          1  1.1560   1.15596     19.35    0.001
  Temp*Temp     1  1.2739   1.27386     21.33    0.001
Error          12  0.71677  0.059731
  Lack-of-Fit   2  0.1368   0.06838      1.18    0.347
  Pure Error   10  0.5800   0.05800
Total          14  2.19333
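
The linear and quadratic fits, and the lack-of-fit F statistic, can be reproduced from the yield table with a short script. The following is a minimal sketch in plain Python that solves the normal equations directly; the helper names `solve` and `fit_ols` are made up for this illustration, and this is not how Minitab computes its output. The results should agree with the Minitab values above up to rounding.

```python
# Yield data from the table above (temperature in degrees F, yield).
temp = [50, 50, 50, 70, 70, 70, 80, 80, 80, 90, 90, 90, 100, 100, 100]
yld = [3.3, 2.8, 2.9, 2.3, 2.6, 2.1, 2.5, 2.9, 2.4,
       3.0, 3.1, 2.8, 3.3, 3.5, 3.0]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(columns, y):
    """Least-squares fit of y on the predictor columns plus an intercept.

    Returns (coefficients, SSE, R-squared); illustrative helper only.
    """
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(coef, r)) for r in rows]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    return coef, sse, 1.0 - sse / sst


lin_coef, lin_sse, lin_r2 = fit_ols([temp], yld)
quad_coef, quad_sse, quad_r2 = fit_ols([temp, [t ** 2 for t in temp]], yld)

# Pure error: variation of the replicates around their own temperature-group mean.
groups = {}
for t, yv in zip(temp, yld):
    groups.setdefault(t, []).append(yv)
ss_pure = sum(sum((yv - sum(g) / len(g)) ** 2 for yv in g) for g in groups.values())
df_pure = len(yld) - len(groups)        # 15 observations - 5 temperature levels
ss_lof = quad_sse - ss_pure             # lack-of-fit sum of squares
df_lof = len(groups) - len(quad_coef)   # 5 temperature levels - 3 coefficients
f_lof = (ss_lof / df_lof) / (ss_pure / df_pure)

print("linear:   ", [round(b, 5) for b in lin_coef], f"R-sq = {lin_r2:.2%}")
print("quadratic:", [round(b, 6) for b in quad_coef], f"R-sq = {quad_r2:.2%}")
print(f"lack-of-fit F = {f_lof:.2f}")
```

Solving the normal equations is adequate for a problem this small, but a QR-based least-squares routine from a numerical library would be the more robust choice when the predictors are as highly correlated as Temp and Temp*Temp are here (VIF of 90.75).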

Example 9-7: Odor Data Set

An experiment is designed to relate three variables (temperature, ratio, and height) to a measure of odor in a chemical process. Each variable has three levels, but the design was not constructed as a full factorial design (i.e., it is not a \(3^{3}\) design). Nonetheless, we can still analyze the data using a response surface regression routine, which is essentially polynomial regression with multiple predictors. The data obtained (Odor data) was already coded and can be found in the table below.

Odor Temperature Ratio Height
66 -1 -1 0
58 -1 0 -1
65 0 -1 -1
-31 0 0 0
39 1 -1 0
17 1 0 -1
7 0 1 -1
-35 0 0 0
43 -1 1 0
-5 -1 0 1
43 0 -1 1
-26 0 0 0
49 1 1 0
-40 1 0 1
-22 0 1 1

First we will fit a response surface regression model consisting of all of the first-order and second-order terms. The summary of this fit is given below:

Model Summary

S        R-sq    R-sq(adj)  R-sq(pred)
18.7747  86.83%  76.95%     47.64%

Coefficients

Term      Coef    SE Coef  T-Value  P-Value  VIF
Constant  -30.7   10.8       -2.83    0.022
Temp      -12.13   6.64      -1.83    0.105  1.00
Ratio     -17.00   6.64      -2.56    0.034  1.00
Height    -21.37   6.64      -3.22    0.012  1.00
Temp^2     32.08   9.77       3.28    0.011  1.01
Ratio^2    47.83   9.77       4.90    0.001  1.01
Height^2    6.08   9.77       0.62    0.551  1.01

As you can see, the square of height is the least statistically significant, so we will drop that term and rerun the analysis. The summary of this new fit is given below:

Model Summary

S        R-sq    R-sq(adj)  R-sq(pred)
18.1247  86.19%  78.52%     56.19%

Coefficients

Term      Coef    SE Coef  T-Value  P-Value  VIF
Constant  -26.92   8.71      -3.09    0.013
Temp      -12.13   6.41      -1.89    0.091  1.00
Ratio     -17.00   6.41      -2.65    0.026  1.00
Height    -21.37   6.41      -3.34    0.009  1.00
Temp^2     31.62   9.40       3.36    0.008  1.01
Ratio^2    47.37   9.40       5.04    0.001  1.01

The temperature main effect (i.e., the first-order temperature term) is not significant at the usual 0.05 significance level. However, the square of temperature is statistically significant. To adhere to the hierarchy principle, we'll retain the temperature main effect in the model.
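
Both odor fits can be reproduced from the coded data table. The sketch below uses plain Python and the normal equations; the helper names `solve` and `fit_ols` are hypothetical and are not part of any response surface routine. It fits the full second-order model (squared terms only, no cross-products, matching the output above) and then the reduced model without the squared height term, and compares R-sq and adjusted R-sq.

```python
# Coded odor data from the table above.
odor = [66, 58, 65, -31, 39, 17, 7, -35, 43, -5, 43, -26, 49, -40, -22]
temp = [-1, -1, 0, 0, 1, 1, 0, 0, -1, -1, 0, 0, 1, 1, 0]
ratio = [-1, 0, -1, 0, -1, 0, 1, 0, 1, 0, -1, 0, 1, 0, 1]
height = [0, -1, -1, 0, 0, -1, -1, 0, 0, 1, 1, 0, 0, 1, 1]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(columns, y):
    """Least-squares fit with an intercept.

    Returns (coefficients, R-squared, adjusted R-squared); illustrative helper only.
    """
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(coef, r)) for r in rows]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    return coef, r2, adj_r2


def squared(values):
    return [v ** 2 for v in values]


# Full model: Temp, Ratio, Height and their squares.
full_coef, full_r2, full_adj = fit_ols(
    [temp, ratio, height, squared(temp), squared(ratio), squared(height)], odor)

# Reduced model: squared height dropped; the Temp main effect is kept for hierarchy.
red_coef, red_r2, red_adj = fit_ols(
    [temp, ratio, height, squared(temp), squared(ratio)], odor)

print("full:   ", [round(b, 2) for b in full_coef],
      f"R-sq = {full_r2:.2%}, R-sq(adj) = {full_adj:.2%}")
print("reduced:", [round(b, 2) for b in red_coef],
      f"R-sq = {red_r2:.2%}, R-sq(adj) = {red_adj:.2%}")
```

Dropping the squared height term lowers R-sq only slightly while adjusted R-sq increases, which is consistent with the decision to remove it.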