5.3 - The Multiple Linear Regression Model

Notation for the Population Model

The population model for a multiple linear regression that relates a y-variable to p-1 x-variables is written as

\(\begin{equation} y_{i}=\beta_{0}+\beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\ldots+\beta_{p-1}x_{i,p-1}+\epsilon_{i}. \end{equation} \)

We assume that the \(\epsilon_{i}\) have a normal distribution with mean 0 and constant variance \(\sigma^{2}\). These are the same assumptions that we used in simple regression with one x-variable.
The subscript i refers to the \(i^{\textrm{th}}\) individual or unit in the population. In the notation for the x-variables, the subscript following i simply denotes which x-variable it is.
The word "linear" in "multiple linear regression" refers to the fact that the model is linear in the parameters, \(\beta_0, \beta_1, \ldots, \beta_{p-1}\). This simply means that each parameter multiplies an x-variable, while the regression function is a sum of these "parameter times x-variable" terms. Each x-variable can be a predictor variable or a transformation of predictor variables (such as the square of a predictor variable or two predictor variables multiplied together). Allowing non-linear transformation of predictor variables like this enables the multiple linear regression model to represent non-linear relationships between the response variable and the predictor variables. We'll explore predictor transformations further in Lesson 9. Note that even \(\beta_0\) represents a "parameter times x-variable" term if you think of the x-variable that is multiplied by \(\beta_0\) as being the constant function "1."
The model includes p-1 x-variables, but p regression parameters (the \(\beta\)'s) because of the intercept term \(\beta_0\).
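To make "linear in the parameters" concrete, here is a minimal sketch in Python with NumPy that simulates data from a population model of this form. The predictor ranges, the parameter values, the error standard deviation, and the seed are all made up for illustration. Two of the x-variables are transformations of the predictors (a square and a product), yet the regression function is still a sum of "parameter times x-variable" terms.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility
n = 50

# Two hypothetical predictor variables.
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 5, size=n)

# x-variables of the model: the constant "1" (multiplied by beta_0), the raw
# predictors, and two transformations (a square and a product).
X = np.column_stack([np.ones(n), x1, x2, x1**2, x1 * x2])
p = X.shape[1]  # number of beta parameters, intercept included

# Made-up parameter values and error standard deviation.
beta = np.array([2.0, 1.5, -0.8, 0.1, 0.3])
sigma = 1.0

# Population model: a sum of "parameter times x-variable" terms plus
# independent normal errors with mean 0 and constant variance sigma^2.
mean_response = X @ beta
y = mean_response + rng.normal(0.0, sigma, size=n)

print(f"p = {p} parameters, p - 1 = {p - 1} x-variables")
```

The column of ones plays the role of the x-variable that multiplies \(\beta_0\), which is why the count of parameters is one more than the count of x-variables.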

Estimates of the Model Parameters

The estimates of the \(\beta\) parameters are the values that minimize the sum of squared errors for the sample. The exact formula for this is given in the next section on matrix notation.
The letter b is used to represent a sample estimate of a \(\beta\) parameter. Thus \(b_{0}\) is the sample estimate of \(\beta_{0}\), \(b_{1}\) is the sample estimate of \(\beta_{1}\), and so on.
\(\textrm{MSE}=\frac{\textrm{SSE}}{n-p}\) estimates \(\sigma^{2}\), the variance of the errors. In the formula, n = sample size, p = number of \(\beta\) parameters in the model (including the intercept) and \(\textrm{SSE}\) = sum of squared errors. Notice that for simple linear regression p = 2. Thus, we get the formula for MSE that we introduced in the context of one predictor.
\(S=\sqrt{\textrm{MSE}}\) estimates \(\sigma\) and is known as the regression standard error or the residual standard error.
In the case of two predictors, the estimated regression equation yields a plane (as opposed to a line in the simple linear regression setting). For more than two predictors, the estimated regression equation yields a hyperplane.
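The quantities above can be sketched numerically. The example below is an illustration on simulated data (the sample size, true parameter values, and seed are assumptions), using NumPy's `np.linalg.lstsq` as one way to find the values that minimize the sum of squared errors. It then computes SSE, MSE, and S from the definitions in this section.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 40

# Simulated sample with two predictors (values are hypothetical).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])  # leading column of 1s for b_0
p = X.shape[1]                             # p = 3 parameters here
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0.0, 0.5, size=n)

# Least squares estimates b_0, b_1, b_2: the values minimizing the SSE.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ b
sse = float(residuals @ residuals)  # sum of squared errors
mse = sse / (n - p)                 # estimates sigma^2
s = float(np.sqrt(mse))             # regression (residual) standard error

print("b   =", b)
print("MSE =", mse)
print("S   =", s)
```

Because the errors were simulated with standard deviation 0.5, we would expect S to land in that neighborhood, though the exact value depends on the random sample.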

Interpretation of the Model Parameters

Each \(\beta\) parameter represents the change in the mean response, E(y), per unit increase in the associated predictor variable when all the other predictors are held constant.
For example, \(\beta_1\) represents the change in the mean response, E(y), per unit increase in \(x_1\) when \(x_2\), \(x_3\), ..., \(x_{p-1}\) are held constant. (The sample estimate \(b_1\) is the corresponding estimated change.)
The intercept term, \(\beta_0\), represents the mean response, E(y), when the predictors \(x_1\), \(x_2\), ..., \(x_{p-1}\) are all zero (which may or may not have any practical meaning).
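As a small worked illustration of these interpretations, consider a hypothetical model with two predictors. The parameter values below are invented for the example. Raising \(x_1\) by one unit while holding \(x_2\) fixed changes the mean response by exactly \(\beta_1\), and setting both predictors to zero returns \(\beta_0\).

```python
import numpy as np

# Hypothetical parameter values: beta_0, beta_1, beta_2.
beta = np.array([10.0, 2.0, -3.0])

def mean_response(x1, x2):
    """E(y) for the two-predictor model with the made-up betas above."""
    return beta[0] + beta[1] * x1 + beta[2] * x2

# Increase x1 by one unit (4 -> 5) while holding x2 constant at 7.
change = mean_response(5.0, 7.0) - mean_response(4.0, 7.0)

# Mean response when all predictors are zero.
at_zero = mean_response(0.0, 0.0)

print("change in E(y) per unit increase in x1:", change)
print("E(y) when x1 = x2 = 0:", at_zero)
```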

Predicted Values and Residuals

A predicted value is calculated as \(\hat{y}_{i}=b_{0}+b_{1}x_{i,1}+b_{2}x_{i,2}+\ldots+b_{p-1}x_{i,p-1}\), where the b values come from statistical software and the x-values are specified by us.
A residual (error) term is calculated as \(e_{i}=y_{i}-\hat{y}_{i}\), the difference between an actual and a predicted value of y.
A plot of residuals (vertical) versus predicted values (horizontal) ideally should resemble a horizontal random band. Departures from this form indicate difficulties with the model and/or data.
Other residual analyses can be done exactly as we did in simple regression. For instance, we might wish to examine a normal probability plot (NPP) of the residuals. Additional plots to consider are plots of residuals versus each x-variable separately. This might help us identify sources of curvature or nonconstant variance. We'll explore this further in Lesson 7.
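The calculations in this section can be sketched as follows, again on simulated data with made-up parameter values. The sketch computes predicted values and residuals for the sample, then a predicted value at x-values that we specify ourselves. Plotting is left out so that the example depends only on NumPy. The arrays `y_hat` and `e` are what we would put on the horizontal and vertical axes of a residual plot.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 30

# Simulated sample (hypothetical predictors and parameter values).
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(0.0, 1.0, size=n)

# Sample coefficients b_0, b_1, b_2 from least squares.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b   # predicted values for the sample
e = y - y_hat   # residuals: actual minus predicted

# Predicted value at x-values specified by us: x1 = 4, x2 = 6.
x_new = np.array([1.0, 4.0, 6.0])  # leading 1 pairs with b_0
y_new = float(x_new @ b)

print("first five residuals:", e[:5])
print("prediction at (x1, x2) = (4, 6):", y_new)
```

When the model includes an intercept, least squares residuals sum to zero and are uncorrelated with the predicted values, which is part of why a well-behaved residual plot looks like a horizontal band centered at zero.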

ANOVA Table

Source	df	SS	MS	F
Regression	p – 1	SSR	MSR = SSR / (p – 1)	MSR / MSE
Error	n – p	SSE	MSE = SSE / (n – p)
Total	n – 1	SSTO
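The entries of the ANOVA table can be computed directly from their definitions. The sketch below uses simulated data (sample size, parameter values, and seed are assumptions) and prints a table with the same layout as the one above.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 25

# Simulated sample with two predictors (hypothetical values).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1]
y = 5.0 + 1.0 * x1 + 2.0 * x2 + rng.normal(0.0, 1.0, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# Sums of squares.
ssr = float(np.sum((y_hat - y.mean()) ** 2))  # regression
sse = float(np.sum((y - y_hat) ** 2))         # error
ssto = float(np.sum((y - y.mean()) ** 2))     # total

# Degrees of freedom, mean squares, and the F-statistic.
df_reg, df_err, df_tot = p - 1, n - p, n - 1
msr = ssr / df_reg
mse = sse / df_err
f_stat = msr / mse

print(f"{'Source':<12}{'df':>4}{'SS':>12}{'MS':>12}{'F':>10}")
print(f"{'Regression':<12}{df_reg:>4}{ssr:>12.3f}{msr:>12.3f}{f_stat:>10.3f}")
print(f"{'Error':<12}{df_err:>4}{sse:>12.3f}{mse:>12.3f}")
print(f"{'Total':<12}{df_tot:>4}{ssto:>12.3f}")
```

Both the degrees of freedom and the sums of squares add up: (p – 1) + (n – p) = n – 1 and SSR + SSE = SSTO.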

Coefficient of Determination, R-squared, and Adjusted R-squared

As in simple linear regression, \(R^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}\), and represents the proportion of variation in \(y\) (about its mean) "explained" by the multiple linear regression model with predictors \(x_1, x_2, \ldots, x_{p-1}\).
If we start with a simple linear regression model with one predictor variable, \(x_1\), then add a second predictor variable, \(x_2\), \(SSE\) will decrease (or stay the same) while \(SSTO\) remains constant, and so \(R^2\) will increase (or stay the same). In other words, \(R^2\) always increases (or stays the same) as more predictors are added to a multiple linear regression model, even if the predictors added are unrelated to the response variable. Thus, by itself, \(R^2\) cannot be used to help us identify which predictors should be included in a model and which should be excluded.
An alternative measure, adjusted \(R^2\), does not necessarily increase as more predictors are added, and can be used to help us identify which predictors should be included in a model and which should be excluded. Adjusted \(R^2=1-\left(\frac{n-1}{n-p}\right)(1-R^2)\), and, while it has no practical interpretation, is useful for such model building purposes. Simply stated, when comparing two models used to predict the same response variable, we generally prefer the model with the higher value of adjusted \(R^2\) – see Lesson 10 for more details.
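To see this behavior of \(R^2\) numerically, the sketch below fits two models to the same simulated response: one with a single predictor, and one that adds a second predictor generated independently of the response. The helper name `r2_and_adjusted`, the data, and the seed are all assumptions made for the illustration.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for a least squares fit of y on X.

    X is assumed to already contain a leading column of 1s, so its
    number of columns is p, the number of beta parameters.
    """
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    ssto = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / ssto
    adj = 1.0 - (n - 1) / (n - p) * (1.0 - r2)
    return float(r2), float(adj)

rng = np.random.default_rng(4)  # arbitrary seed
n = 30
x1 = rng.normal(size=n)
unrelated = rng.normal(size=n)  # generated independently of y
y = 1.0 + 2.0 * x1 + rng.normal(0.0, 1.0, size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, unrelated])

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)

print(f"one predictor:  R^2 = {r2_small:.4f}, adjusted R^2 = {adj_small:.4f}")
print(f"two predictors: R^2 = {r2_big:.4f}, adjusted R^2 = {adj_big:.4f}")
```

\(R^2\) for the larger model cannot be smaller than for the smaller one. Adjusted \(R^2\) may move in either direction, depending on whether the drop in SSE outweighs the loss of one error degree of freedom.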

Significance Testing of Each Variable

Within a multiple regression model, we may want to know whether a particular x-variable is making a useful contribution to the model. That is, given the presence of the other x-variables in the model, does a particular x-variable help us predict or explain the y-variable? For instance, suppose that we have three x-variables in the model. The general structure of the model could be

\(\begin{equation} y=\beta _{0}+\beta _{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\epsilon. \end{equation}\)

As an example, to determine whether variable \(x_{1}\) is a useful predictor variable in this model, we could test

\(\begin{align*} \nonumber H_{0}&\colon\beta_{1}=0 \\ \nonumber H_{A}&\colon\beta_{1}\neq 0 \end{align*}\)

If the null hypothesis above were true, then a change in the value of \(x_{1}\) would not change the mean of y, so y and \(x_{1}\) are not linearly related (taking into account \(x_2\) and \(x_3\)). Also, we would still be left with variables \(x_{2}\) and \(x_{3}\) being present in the model. When we cannot reject the null hypothesis above, we should say that we do not need variable \(x_{1}\) in the model given that variables \(x_{2}\) and \(x_{3}\) will remain in the model. In general, the interpretation of a slope in multiple regression can be tricky. Correlations among the predictors can change the slope values dramatically from what they would be in separate simple regressions.

To carry out the test, statistical software will report p-values for all coefficients in the model. Each p-value will be based on a t-statistic calculated as

\(t^{*}=\dfrac{ (\text{sample coefficient} - \text{hypothesized value})}{\text{standard error of coefficient}}\)

For our example above, the t-statistic is:

\(\begin{equation*} t^{*}=\dfrac{b_{1}-0}{\textrm{se}(b_{1})}=\dfrac{b_{1}}{\textrm{se}(b_{1})}. \end{equation*}\)

Note that the hypothesized value is usually just 0, so this portion of the formula is often omitted.
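Statistical software carries out this calculation for us, but it can be sketched by hand. The example below simulates a three-predictor model in which the true coefficient of \(x_2\) is set to zero (all values and the seed are made up). The standard errors use the standard least squares result \(\textrm{se}(b_k)=\sqrt{\textrm{MSE}\cdot[(X^{T}X)^{-1}]_{kk}}\), which is not derived in this section and is taken here as given. Converting each t-statistic to a p-value would additionally need the t distribution with n – p degrees of freedom, which is omitted so that the sketch depends only on NumPy.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed
n = 50

# Three simulated predictors; the true coefficient of x2 is set to 0.
x1, x2, x3 = rng.normal(size=(3, n))
X = np.column_stack([np.ones(n), x1, x2, x3])
p = X.shape[1]
y = 1.0 + 2.0 * x1 + 0.0 * x2 - 1.0 * x3 + rng.normal(0.0, 1.0, size=n)

# Sample coefficients and MSE.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
mse = float(e @ e) / (n - p)

# Standard errors of the coefficients (standard OLS formula, assumed here).
cov_b = mse * np.linalg.inv(X.T @ X)
se_b = np.sqrt(np.diag(cov_b))

# t-statistics for H0: beta_k = 0, one per coefficient.
hypothesized = 0.0
t_stats = (b - hypothesized) / se_b

for name, coef, se, t in zip(["b0", "b1", "b2", "b3"], b, se_b, t_stats):
    print(f"{name}: coef = {coef:8.4f}, se = {se:.4f}, t* = {t:8.3f}")
```

Each t-statistic is compared against a t distribution with n – p degrees of freedom, here 50 – 4 = 46.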

Multiple linear regression, in contrast to simple linear regression, involves multiple predictors and so testing each variable can quickly become complicated. For example, suppose we apply two separate tests for two predictors, say \(x_1\) and \(x_2\), and both tests have high p-values. One test suggests \(x_1\) is not needed in a model with all the other predictors included, while the other test suggests \(x_2\) is not needed in a model with all the other predictors included. But this doesn't necessarily mean that we can drop both \(x_1\) and \(x_2\) from a model with all the other predictors included. It may well turn out that we would do better to omit either \(x_1\) or \(x_2\) from the model, but not both. How then do we determine what to do? We'll explore this issue further in Lesson 6.
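One situation in which this can happen is when two predictors carry nearly the same information. The sketch below is a simulated illustration only: \(x_2\) is constructed as a near copy of \(x_1\), and the helper name `sse_of`, the parameter values, and the seed are all assumptions. It compares the SSE of the full model with the SSE after dropping one predictor and after dropping both.

```python
import numpy as np

def sse_of(X, y):
    """Sum of squared errors from a least squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

rng = np.random.default_rng(6)  # arbitrary seed
n = 60

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0.0, 0.05, size=n)  # nearly a copy of x1
y = 1.0 + x1 + x2 + rng.normal(0.0, 1.0, size=n)

ones = np.ones(n)
sse_full = sse_of(np.column_stack([ones, x1, x2]), y)
sse_drop_x2 = sse_of(np.column_stack([ones, x1]), y)
sse_drop_x1 = sse_of(np.column_stack([ones, x2]), y)
sse_drop_both = sse_of(ones.reshape(-1, 1), y)  # intercept-only model

print(f"SSE, full model:    {sse_full:.2f}")
print(f"SSE, without x2:    {sse_drop_x2:.2f}")
print(f"SSE, without x1:    {sse_drop_x1:.2f}")
print(f"SSE, without both:  {sse_drop_both:.2f}")
```

In a setup like this we would expect dropping either predictor alone to raise the SSE only slightly, since the remaining one stands in for it, while dropping both leaves much more unexplained variation.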