7.3 - Log-transforming Both the Predictor and Response for SLR

In this section, we learn how to build and use a model by transforming both the response y and the predictor x. You might have to do this when everything seems wrong — when the regression function is not linear and the error terms are not normal and have unequal variances. In general (although not always!):

  • Transforming the y values corrects problems with the error terms (and may help the non-linearity).
  • Transforming the x values primarily corrects the non-linearity.

Again, keep in mind that although we're focussing on a simple linear regression model here, the essential ideas apply more generally to multiple linear regression models too.

As before, let's learn about transforming both the x and y values by way of example.

Building the model

An example. Many different interest groups — such as the lumber industry, ecologists, and foresters — benefit from being able to predict the volume of a tree just by knowing its diameter. One classic data set (shortleaf.txt) — reported by C. Bruce and F. X. Schumacher in 1935 — concerned the diameter (x, in inches) and volume (y, in cubic feet) of n = 70 shortleaf pines. Let's use the data set to learn not only about the relationship between the diameter and volume of shortleaf pines, but also about the benefits of simultaneously transforming both the response y and the predictor x.

Although the r2 value is quite high (89.3%), the fitted line plot suggests that the relationship between tree volume and tree diameter is not linear:

volume vs diameter plot

The residuals vs. fits plot also suggests that the relationship is not linear:

residuals vs fitted values plot

Because the lack of linearity dominates the plot, we can not use the plot to evaluate whether the error variances are equal. We have to fix the non-linearity problem before we can assess the assumption of equal variances.

The normal probability plot suggests that the error terms are not normal. The plot is not quite linear and the Anderson-Darling P-value is 0.012. There is sufficient evidence to conclude that the error terms are not normally distributed:

normal probability plot

The plot actually has the classical appearance of residuals that are predominantly normal but have one outlier. This illustrates how a data point can be deemed an "outlier" just because of poor model fit.

In summary, it appears as if the relationship between tree diameter and volume is not linear. Furthermore, it appears as if the error terms are not normally distributed.

Let's see if we get anywhere by transforming only the x values. In particular, let's take the natural logarithm of the tree diameters to obtain the new predictor x = lnDiam:

Diameter
Volume
lnDiam
4.4
2.0
1.48160
4.6
2.2
1.52606
5.0
3.0
1.60944
5.1
4.3
1.62924
5.1
3.0
1.62924
5.2
2.9
1.64866
5.2
3.5
1.64866
5.5
3.4
1.70475
5.5
5.0
1.70475
5.6
7.2
1.72277
5.9
6.4
1.77495
5.9
5.6
1.77495
7.5
7.7
2.01490
7.6
10.3
2.02815
… and so on …

For example, ln(5.0) = 1.60944 and ln(7.6) = 2.02815. How well does transforming only the x values work? Not very!

The fitted line plot with y = volume as the response and x = lnDiam as the predictor suggests that the relationship is still not linear:

Volume vs lnDiameter plot

Transforming only the x values didn't change the non-linearity at all. The residuals vs. fits plot also still suggests a non-linear relationship ...

residuals vs fitted values plot

... and there is little improvement in the normality of the error terms:

normal probability plot

The pattern is not linear and the Anderson-Darling P-value is less than 0.005. There is sufficient evidence to conclude that the error terms are not normally distributed.

So, transforming x alone didn't help much. Let's also try transforming the response (y) values. In particular, let's take the natural logarithm of the tree volumes to obtain the new response y = lnVol:

Diameter
Volume
lnDiam
lnVol
4.4
2.0
1.48160
0.69315
4.6
2.2
1.52606
0.78846
5.0
3.0
1.60944
1.09861
5.1
4.3
1.62924
1.45862
5.1
3.0
1.62924
1.09861
5.2
2.9
1.64866
1.06471
5.2
3.5
1.64866
1.25276
5.5
3.4
1.70475
1.22378
5.5
5.0
1.70475
1.60944
5.6
7.2
1.72277
1.97408
5.9
6.4
1.77495
1.85630
5.9
5.6
1.77495
1.72277
7.5
7.7
2.01490
2.04122
7.6
10.3
2.02815
2.33214
... and so on ...

Let's see if transforming both the x and y values does it for us. Wow! The fitted line plot should give us hope! The relationship between the natural log of the diameter and the natural log of the volume looks linear and strong (r2 = 97.4%):

lnVolume vs lnDiameter plot

The residuals vs. fits plot provides yet more evidence of a linear relationship between lnVol and lnDiam:

residuals vs fitted values plot

Generally, speaking the residuals bounce randomly around the residual = 0 line. You might be a little concerned that some "funneling" exists. If it does, it doesn't appear to be too severe, as the negative residuals do follow the desired horizontal band.

The normal probability plot has improved substantially:

normal probability plot

The trend is generally linear and the Anderson-Darling P-value is 0.169. There is insufficient evidence to conclude that the error terms are not normal.

In summary, it appears as if the model with the natural log of tree volume as the response and the natural log of tree diameter as the predictor works well. The relationship appears to be linear and the error terms appear independent and normally distributed with equal variances.

Using the model

Let's now use our linear regression model for the shortleaf pine data—with y = lnVol as the response and x = lnDiam as the predictor—to answer four different research questions.

Research Question #1: What is the nature of the association between diameter and volume of shortleaf pines?

lnVolume vs lnDiameter plot

Again, to answer this research question, we just describe the nature of the relationship. That is, the natural logarithm of tree volume is positively linearly related to the natural logarithm of tree diameter. That is, as the natural log of tree diameters increases, the average natural logarithm of the tree volume also increases.

Research Question #2: Is there an association between diameter and volume of shortleaf pines?

Again, in answering this research question, no modification to the standard procedure is necessary. We merely test the null hypothesis H0: β1 = 0 using either the F-test or the equivalent t-test:

minitab output

As the software output illustrates, the P-value is < 0.001. There is significant evidence at the 0.01 level to conclude that there is a linear association between the natural logarithm of tree volume and the natural logarithm of tree diameter.

Research Question #3: What is the "average" volume of all shortleaf pine trees that are 10" in diameter?

In answering this research question, if we are only interested in a point estimate, we put x = ln(10) = 2.303 into the estimated regression function:

ln(Vol) = -2.8718 + 2.56442 ln(Diam)

to obtain:

ln(Vol) = -2.8718 + 2.56442 × ln(10) = 3.034

That is, we estimate the average of the natural log of the volumes of all 10"-diameter shortleaf pines to be 3.034 log-cubic feet. Of course, this is not a very helpful conclusion. We have to take advantage of the fact, as we showed before, that the average of the natural log of the volumes approximately equals the natural log of the median of the volumes. Exponentiating both sides of the previous equation:

Vol = eln(Vol) = e3.034 = 20.8 cubic feet

we estimate the median volume of all shortleaf pines with a 10" diameter to be 20.8 cubic feet. Helpful, but not sufficient! A 95% confidence interval for the average of the natural log of the volumes of all 10"-diameter shortleaf pines is:

minitab output

Exponentiating both endpoints of the interval, we get:

e2.9922 = 19.9 and e3.0738 = 21.6.

We can be 95% confident that the median volume of all shortleaf pines, 10" in diameter, is between 19.9 and 21.6 cubic feet.

Research Question #4: What is expected change in volume for a two-fold increase in diameter?

Figuring out how to answer this research question also takes a little bit of work. The end result is:

  • In general, the median changes by a factor of \(k^{\beta_1}\) for each k-fold increase in the predictor x.
  • Therefore, the median changes by a factor of \(2^{\beta_1}\) for each two-fold increase in the predictor x.
  • As always, we won't know the slope of the population line, β1. We have to use b1 to estimate it.

Again, you won't be required to duplicate the derivation, shown below, of this result, but it may help you to understand it and therefore remember it.

For the shortleaf pine data, the software output tells us that b1 = 2.56442:

minitab output

and therefore:

\[2^{b_1}=2^{2.56442}=5.92\]

The result tells us that the estimated median volume changes by a factor of 5.92 for each two-fold increase in diameter. For example, the median volume of a 20"-diameter tree is estimated to be 5.92 times the median volume of a 10" diameter tree. And, the median volume of a 10"-diameter tree is estimated to be 5.92 times the median volume of a 5"-diameter tree.

So far, we've only calculated a point estimate for the expected change. Of course, a 95% confidence interval for β1 is:

2.56442 ± 1.9955(0.05120) = (2.462, 2.667)

Because:

22.462 = 5.51 and 22.667 = 6.35

we can be 95% confident that the median volume will increase by a factor between 5.51 and 6.35 for each two-fold increase in diameter.