4  Linear Regression

Key topics: linear regression and the Gauss-Markov theorem.

Overview

A quick review of regression, expectation, variance, and parameter estimation.

Input vector: \(X = (X_1, X_2, ... , X_p)\).

Output Y is real-valued.

Predict Y from X by f(X) so that the expected loss function \(E(L(Y, f(X)))\) is minimized.

Review: Expectation

Intuitively, the expectation of a random variable is its “average” value under its distribution.

Formally, the expectation of a random variable X, denoted E[X], is its Lebesgue integral with respect to its distribution.

If X takes values in some countable numeric set \(\mathcal{X}\), then

\[E(X) =\sum_{x \in \mathcal{X}}xP(X=x)\]

If \(X \in \mathbb{R}^m\) has a density p, then

\[E(X) =\int_{\mathbb{R}^m}xp(x)\,dx\]

Expectation is linear: \(E(aX +b)=aE(X) + b\)

Also, \(E(X+Y) = E(X) +E(Y)\)

The expectation is monotone: if X ≥ Y, then E(X) ≥ E(Y)
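As a quick numerical check of these properties, here is a minimal R sketch; the particular distribution and the constants a = 2, b = 1 are arbitrary choices for illustration.

# Check E(X) and linearity of expectation by simulation (illustrative values only)
set.seed(1)
x <- sample(1:6, size = 1e5, replace = TRUE)   # discrete uniform on {1,...,6}

mean(x)              # approximates E(X) = sum_x x * P(X = x) = 3.5
mean(2 * x + 1)      # approximates E(2X + 1) = 2 * E(X) + 1 = 8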

Review: Variance

The variance of a random variable X is defined as:

\(Var(X) = E[(X-E[X])^2]=E[X^2]-(E[X])^2\)

and the variance obeys the following for any \(a, b \in \mathbb{R}\):

\[Var(aX + b) =a^2Var(X)\]
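A quick simulation check of this rule, with arbitrary illustrative values a = 3, b = 5 and a normal X:

# Check Var(aX + b) = a^2 Var(X) by simulation (illustrative values only)
set.seed(1)
x <- rnorm(1e5, mean = 0, sd = 2)   # Var(X) = 4

a <- 3; b <- 5
var(a * x + b)                      # close to a^2 * Var(X) = 36
a^2 * var(x)                        # same quantity computed directly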

Review: Frequentist Basics

The data x1, … , xn is generally assumed to be independent and identically distributed (i.i.d.).

We would like to estimate some unknown value θ associated with the distribution from which the data was generated.

In general, our estimate will be a function of the data (i.e., a statistic) \[\hat{\theta} =f(x_1, x_2, ... , x_n)\]

Example: Given the results of n independent flips of a coin, determine the probability p with which it lands on heads.

Review: Parameter Estimation

In practice, we often seek to select a distribution (model) corresponding to our data.

If our model is parameterized by some set of values, then this problem is that of parameter estimation.

How can we obtain estimates in general? One answer: maximize the likelihood; the resulting estimate is called the maximum likelihood estimate (MLE).

\[ \begin{align} \hat{\theta} & = \operatorname{argmax}_{\theta} \prod_{i=1}^{n}p_{\theta}(x_i) \\ & = \operatorname{argmax}_{\theta} \sum_{i=1}^{n}\log p_{\theta}(x_i) \end{align} \]
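To make this concrete for the coin-flip example above, here is a small R sketch (the true probability 0.7 and the sample size 100 are made up for illustration): the numerical maximizer of the log-likelihood agrees with the closed-form MLE, the sample proportion of heads.

# MLE for n independent Bernoulli(p) coin flips
set.seed(1)
flips <- rbinom(100, size = 1, prob = 0.7)   # 1 = heads (true p chosen arbitrarily)

# log-likelihood: sum_i [ x_i log(p) + (1 - x_i) log(1 - p) ]
loglik <- function(p) sum(flips * log(p) + (1 - flips) * log(1 - p))

optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum  # numerical MLE
mean(flips)                                                             # closed-form MLE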

Discussion

Let’s look at the setup for linear regression. We have an input vector: \(X = \left( X _ { 1 } , X _ { 2 } , \dots , X _ { p }\right)\). This vector is p dimensional.

The output Y is real-valued and ordered.

We want to predict Y from X.

Before we actually do the prediction, we have to train the function f(X). By the end of training, we have a function f(X) that maps every X to an estimated Y. Then we need some way to measure how good this predictor function is. This is measured by the expectation of a loss.

Why do we have a loss in the estimation?

Y is actually a random variable given X. For instance, consider predicting someone’s weight based on the person’s height. People can have different weights at the same height. If you think of weight as Y and height as X, Y is random given X. We therefore cannot have a perfect prediction for every subject, because f(X) is a fixed function and cannot be correct all the time. The loss measures how different the true Y is from your prediction.

Why do we have the overall loss expressed as an expectation?

The loss may be different for different subjects. In statistics, a common thing to do is to average the losses over the entire population.

Squared loss:

\[L ( Y , f ( X ) ) = ( Y - f ( X ) ) ^ { 2 }\]

We simply measure the difference between the two quantities and square it so that positive and negative differences are treated symmetrically.

Suppose the distribution of Y given X is known; then the optimal predictor is:

\[f ^ {*} ( X ) = \operatorname{argmin}_{f(X)} E\left[( Y - f ( X ) )^2\right] = E ( Y \mid X )\]

This is the conditional expectation of Y given X. The function E(Y | X) is called the regression function.
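A small simulation illustrates this; the linear data-generating model below is an assumption made purely for illustration. Any other predictor incurs a larger expected squared loss than the regression function E(Y | X).

# The conditional mean minimizes expected squared loss (simulated illustration)
set.seed(1)
x <- runif(1e5)
y <- 2 + 3 * x + rnorm(1e5)     # so E(Y | X) = 2 + 3 X, noise variance 1

mean((y - (2 + 3 * x))^2)       # regression function: approx. 1, the minimum
mean((y - (2 + 2.5 * x))^2)     # any other predictor: strictly larger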

Example 3-1

We want to predict the number of physicians in a metropolitan area.

Problem: The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y, is expected to be related to total population (X1, measured in thousands), land area (X2, measured in square miles), and total personal income (X3, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.

i      1      2      3      ...    139    140    141
X1     9387   7031   7017   ...    233    232    231
X2     1348   4069   3719   ...    1011   813    654
X3     72100  52737  54542  ...    1337   1589   1148
Y      25627  15389  13326  ...    264    371    140

Our Goal: To predict Y from \(X _ { 1 } , X _ { 2 } , \text { and } X _ { 3 }\).

This is a typical regression problem.

Objectives

Upon successful completion of this lesson, you should be able to:


  • Review the linear regression model with a focus on prediction.
  • Use least squares estimation for linear regression.
  • Apply a model developed on training data to an independent test data set.
  • Set the context for more complex supervised prediction methods.

4.1 Linear Methods

The linear regression model:

\[ f(X)=\beta_{0} + \sum_{j=1}^{p}X_{j}\beta_{j}\]

This is just a linear combination of the measurements that are used to make predictions, plus a constant (the intercept term). This is a simple approach. However, the true regression function may be quite close to a linear function, in which case the model is a good approximation.

What if the model is not true?

  • It still might be a good approximation - the best we can do.
  • Sometimes because of the lack of training data or smarter algorithms, this is the most we can estimate robustly from the data.

Comments on \(X_j\):

  • We assume that these are quantitative inputs [or dummy indicator variables representing levels of a qualitative input]
  • We can also perform transformations of the quantitative inputs, e.g., log(•), √(•). The model is then still a linear function of the coefficients to be estimated; instead of using the original \(X_{j}\), we have replaced them or augmented them with the transformed values. Regardless of the transformations performed on \(X_{j}\), \(f(X)\) is still a linear function of the unknown parameters (see the sketch after this list).
  • Some basic expansions: \(X _ { 2 } = X _ { 1 } ^ { 2 } , X _ { 3 } = X _ { 1 } ^ { 3 } , X _ { 4 } = X _ { 1 } \cdot X _ { 2 }\).
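As a sketch of the transformation idea above, the following R snippet fits a model with log, square, and interaction terms; the variables x1, x2, y are simulated here only to make the example self-contained. The design matrix gains new columns, but the fit is still linear in the coefficients.

# Transformed inputs keep the model linear in the coefficients (simulated data)
set.seed(1)
x1 <- runif(50, 1, 10); x2 <- runif(50, 1, 10)
y  <- 1 + 2 * log(x1) + 0.5 * x1^2 + x1 * x2 + rnorm(50)

fit <- lm(y ~ log(x1) + I(x1^2) + x1:x2)   # log, square, and interaction terms
coef(fit)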

Below is a geometric interpretation of a linear regression.

For instance, if we have two variables, \(X_{1}\) and \(X_{2}\), and we predict Y by a linear combination of \(X_{1}\) and \(X_{2}\), the predictor function corresponds to a plane (hyperplane) in the three-dimensional space of \(X_{1}\), \(X_{2}\), Y. Given a pair of values for \(X_{1}\) and \(X_{2}\), we find the predicted Y by starting from that point in the plane spanned by the two predictor variables and moving perpendicular to it (parallel to the Y-axis) until we reach the fitted hyperplane.

[Figure: fitted regression plane in the \((X_1, X_2, Y)\) space, with the data points shown in red]

For accurate prediction, hopefully, the data will lie close to this hyperplane, but they won’t lie exactly in the hyperplane (unless perfect prediction is achieved). In the plot above, the red points are the actual data points. They do not lie on the plane but are close to it.

How should we choose this hyperplane?

We choose a plane such that the total squared distance from the red points (real data points) to the corresponding predicted points in the plane is minimized. Graphically, if we add up the squares of the lengths of the line segments drawn from the red points to the hyperplane, the optimal hyperplane should yield the minimum sum of squared lengths.

Estimation

The issue of finding the regression function \(E ( Y | X )\) is converted to estimating \(\beta _ { j } , j = 0,1 , \dots , p\).

Remember, in earlier discussions we talked about the trade-off between model complexity and accurate prediction on training data. In this case, we start with a linear model, which is relatively simple, so the model complexity issue is taken care of by using a simple linear function. In basic linear regression, there is no explicit action taken to restrict model complexity, although variable selection, which we cover in Lesson 5: Variable Selection, can be considered a way to control it.

With the model complexity under check, the next thing we want to do is to have a predictor that fits the training data well.

Let the training data be:

\[\left\{ \left( x _ { 1 } , y _ { 1 } \right) , \left( x _ { 2 } , y _ { 2 } \right) , \dots , \left( x _ { N } , y _ { N } \right) \right\} , \text { where } x _ { i } = \left( x _ { i 1 } , x _ { i 2 } , \ldots , x _ { i p } \right)\]

Denote \(\beta = \left( \beta _ { 0 } , \beta _ { 1 } , \ldots , \beta _ { p } \right) ^ { T }\).

Without knowing the true distribution for X and Y, we cannot directly minimize the expected loss.

Instead, the expected loss \(E ( Y - f ( X ) ) ^ { 2 }\) is approximated by the empirical loss \(R S S ( \beta ) / N\) :

\[ \begin{align} RSS(\beta) &= \sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2 \\ &= \sum_{i=1}^{N}\left(y_i - \beta_0 -\sum_{j=1}^{p}x_{ij}\beta_{j}\right)^2 \end{align} \]

This empirical loss is simply the squared error of the fit computed on the training data. It is called the residual sum of squares, RSS.

The x’s are known numbers from the training data.
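Written out as code, the empirical loss is a simple function of β. A minimal sketch, assuming X is the N × (p + 1) input matrix with a leading column of 1’s (introduced in the Notation below) and y is the output vector:

# Residual sum of squares as a function of the coefficient vector beta
rss <- function(beta, X, y) sum((y - X %*% beta)^2)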

Notation

Here is the input matrix X of dimension N × (p +1):

\[\begin{pmatrix} 1 & x_{1,1} &x_{1,2} & ... &x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & ... &x_{2,p} \\ ... & ... & ... & ... & ... \\ 1 & x_{N,1} &x_{N,2} &... & x_{N,p} \end{pmatrix}\]

Earlier we mentioned that our training data have N points. In the example where we were predicting the number of physicians, 141 metropolitan areas were investigated, so N = 141, and the dimension p = 3 in this example. The input matrix is augmented with a column of 1’s (for the intercept term), which is why the first column above contains all 1’s. Every row corresponds to one sample point, with the predictors indexed from 1 to p. Hence, the input matrix X is of dimension N × (p + 1).

Output vector y:

\[ y= \begin{pmatrix} y_{1}\\ y_{2}\\ ...\\ y_{N} \end{pmatrix} \]

Again, this is taken from the training data set.

The estimated \(\beta\) is \(\hat{\beta}\), which is also arranged as a column vector, \(\hat{\beta} = \left( \hat{\beta}_{0}, \hat{\beta}_{1}, \dots, \hat{\beta}_{p} \right)^{T}\).

The fitted values (not the same as the true values) at the training inputs are

\[\hat{y}_{i}=\hat{\beta}_{0}+\sum_{j=1}^{p}x_{ij}\hat{\beta}_{j}\]

and

\[ \hat{y}= \begin{pmatrix} \hat{y}_{1}\\ \hat{y}_{2}\\ ...\\ \hat{y}_{N} \end{pmatrix} \]

For instance, if you are talking about sample i, the fitted value for sample i would be to take all the values of the x’s for sample i, (denoted by \(x_{ij}\)) and do a linear summation for all of these \(x_{ij}\)’s with weights \(\hat{\beta}_{j}\) and the intercept term \(\hat{\beta}_{0}\).

4.2 Point Estimate

  • The least squares estimate of \(\beta\) is:

\[\hat{\beta} =(X^{T}X)^{-1}X^{T}y \]

  • The fitted value vector is:

\[\hat{y} =X\hat{\beta}=X(X^{T}X)^{-1}X^{T}y \]

  • Hat matrix:

\[H=X(X^{T}X)^{-1}X^{T} \]
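Here is a minimal R sketch of these formulas on simulated data (the data-generating model is an arbitrary choice for illustration); the closed-form estimate agrees with lm().

# Least squares via the normal equations, compared against lm() (simulated data)
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X        <- cbind(1, x1, x2)                   # N x (p + 1) input matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # (X^T X)^{-1} X^T y
H        <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
y_hat    <- H %*% y                            # fitted values

cbind(beta_hat, coef(lm(y ~ x1 + x2)))         # the two estimates agree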

Geometric Interpretation

Each column of X is a vector in an N-dimensional space (not the \((p + 1)\)-dimensional feature space). Here we work with the columns of the matrix X, which is why they live in an N-dimensional space: the values of one variable across all of the samples are stacked into a single vector. We can write the input matrix in terms of its column vectors:

\[X = \left( x _ { 0 } , x _ { 1 } , \ldots , x _ { p } \right)\]

Here \(x_0\) is the column of 1’s for the intercept term. It turns out that the fitted output vector \(\hat{y}\) is a linear combination of the column vectors \(x _ { j } , j = 0,1 , \dots , p\). Go back and look at the matrix and you will see this.

This means that \(\hat{y}\) lies in the subspace spanned by \(x _ { j } , j = 0,1 , \dots , p\).

The dimension of the column vectors is N, the number of samples. Usually, the number of samples is much bigger than the dimension p. The true y can be any point in this N-dimensional space. What we want to find is an approximation confined to the \((p+1)\)-dimensional subspace such that the distance between the true y and the approximation is minimized. It turns out that the residual sum of squares is equal to the square of the Euclidean distance between y and \(\hat{y}\).

\[RSS(\hat{\beta})=\parallel y - \hat{y}\parallel^2 \]

For the optimal solution, \(y-\hat{y}\) has to be perpendicular to the subspace, i.e., \(\hat{y}\) is the projection of y on the subspace spanned by \(x _ { j } , j = 0,1 , \dots , p\).
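The orthogonality condition is easy to verify numerically. A short sketch on simulated data (same construction as the sketch in the previous section):

# Residuals are orthogonal to the column space of X; RSS = || y - y_hat ||^2
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
X  <- cbind(1, x1, x2)

y_hat    <- X %*% solve(t(X) %*% X, t(X) %*% y)   # projection of y onto the column space
residual <- y - y_hat

crossprod(X, residual)   # essentially zero: residual is perpendicular to each column of X
sum(residual^2)          # RSS(beta_hat), the squared distance between y and y_hat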

Geometrically speaking let’s look at a really simple example. Take a look at the diagram below. What we want to find is a \(\hat{y}\) that lies in the hyperplane defined or spanned by \(x _ {1}\) and \(x _ {2}\). You would draw a perpendicular line from y to the plane to find \(\hat{y}\). This comes from a basic geometric fact. In general, if you want to find some point in a subspace to represent some point in a higher dimensional space, the best you can do is to project that point to your subspace.

[Figure: y projected onto the plane spanned by \(x_1\) and \(x_2\); \(\hat{y}\) is the foot of the perpendicular from y to that plane]

The difference between your approximation and the true vector has to be perpendicular to the subspace.

The geometric interpretation is very helpful for understanding coefficient shrinkage and subset selection (covered in Lessons 5 and 6).


4.3 Example Results

Let’s take a look at some results for our earlier example about the number of active physicians in a Standard Metropolitan Statistical Area (SMSA). Carrying out the least squares optimization using the equations above, we obtain the values below:

\[\hat{Y}_{i}= -143.89+0.341X_{i1}-0.019X_{i2}+0.254X_{i3} \]

\[RSS(\hat{\beta})=52,942,438 \]

Let’s take a look at some scatter plots. We plot one variable versus another. For instance, in the upper left-hand plot, we plot the pairs of \(x_{1}\) and y. These are two-dimensional plots, each variable plotted individually against any other variable.

[Figure: scatter plot matrix of Y, \(X_{1}\), \(X_{2}\), and \(X_{3}\)]

STAT 501 on Linear Regression goes deeper into which scatter plots are more helpful than others. These can be indicative of potential problems that exist in your data. For instance, in the plots above you can see that \(x_{3}\) is almost a perfectly linear function of \(x_{1}\). This can cause problems in the optimization: if \(x_{3}\) were a perfectly linear function of \(x_{1}\), then the linear system determining the \(\beta\)’s would have no unique solution. The scatter plots help to discover such potential problems.

In practice, because there is always measurement error, you rarely get a perfect linear relationship. However, you might get something very close. In this case, the matrix, \(X ^ { T } X\), will be close to singular, causing large numerical errors in computation. Therefore, we would like to have predictor variables that are not so strongly correlated.

4.4 Theoretical Justification

If the Linear Model Is True

Here is some theoretical justification for why we do parameter estimation using least squares.

If the linear model is true, i.e., if the conditional expectation of Y given X is indeed a linear function of the \(X_j\)’s,

\[ E(Y \mid X)=\beta_0+\sum_{j=1}^{p}X_{j}\beta_{j}, \]

then least squares estimation has the following properties.

  1. The least squares estimation of \(\beta\) is unbiased,

    \[E(\hat{\beta}_{j}) =\beta_j, j=0,1, ... , p \]

  2. To draw inferences about \(\beta\), further assume: \(Y = E(Y | X) + \epsilon\) where \(\epsilon \sim N(0,\sigma^2)\) and is independent of X.

    \(X_{ij}\) are regarded as fixed, \(Y_i\) are random due to \(\epsilon\).

The estimation accuracy of \(\hat{\beta}\), i.e., its variance, is given by:

\[Var(\hat{\beta})=(X^{T}X)^{-1}\sigma^2\]

You should see that the higher \(\sigma^2\) is, the higher the variance of \(\hat{\beta}\) will be. This is very natural: if the noise level is high, you are bound to have a large variance in your estimation. But, of course, it also depends on \(X^T X\). This is why, in experimental design, methods are developed to choose X so that the variance of \(\hat{\beta}\) tends to be small.

Note that \(\hat{\beta}\) is a vector and hence its variance is a covariance matrix of size (p + 1) × (p + 1). The covariance matrix not only tells the variance for every individual \(\beta_j\), but also the covariance for any pair of \(\beta_j\) and \(\beta_k\), \(j \ne k\).
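A numerical check of this formula on simulated data, with \(\sigma^2\) estimated by RSS / (N - p - 1); the hand-computed covariance matrix should match R’s vcov() for the same fit.

# Var(beta_hat) = (X^T X)^{-1} sigma^2, compared against vcov() (simulated data)
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 2)
X  <- cbind(1, x1, x2)

fit        <- lm(y ~ x1 + x2)
sigma2_hat <- sum(residuals(fit)^2) / (n - ncol(X))   # RSS / (N - p - 1)

solve(t(X) %*% X) * sigma2_hat   # (X^T X)^{-1} * estimated sigma^2
vcov(fit)                        # matches the matrix above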

Gauss-Markov Theorem

This theorem says that the least squares estimator is the best linear unbiased estimator.

Assume that the linear model is true. For any linear combination of the parameters \(\beta_0 , \cdots , \beta_p\) you get a new parameter denoted by \(\theta = a^{T}\beta\). Then \(a^{T}\hat{\beta}\) is just a weighted sum of \(\hat{\beta}_0, ..., \hat{\beta}_p\) and is an unbiased estimator since \(\hat{\beta}\) is unbiased.

We want to estimate \(θ\) and the least squares estimate of \(θ\) is:

\[ \begin{align} \hat{\theta} & = a^T\hat{\beta}\\ & = a^T(X^{T}X)^{-1}X^{T}y \\ & \doteq \tilde{a}^{T}y, \end{align} \]

which is linear in y. The Gauss-Markov theorem states that for any other linear unbiased estimator \(c^Ty\) of \(\theta\), the least squares estimator is guaranteed to have variance no larger than that of \(c^Ty\):

\[Var(\tilde{a}^{T}y) \le Var(c^{T}y).\]

Keep in mind that you’re only comparing with linear unbiased estimators. If the estimator is not linear, or is not unbiased, then it is possible to do better in terms of squared loss.

\(\beta_j\), j = 0, 1, …, p are special cases of \(a^T\beta\), where \(a^T\) only has one non-zero element that equals 1.
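A small Monte Carlo sketch of the theorem for the slope in a simple linear model (the data-generating model and sample sizes are arbitrary choices): both estimators below are linear in y and unbiased, but least squares on the full data has the smaller variance.

# Gauss-Markov illustration: two linear unbiased estimators of the slope
set.seed(1)
n <- 50
x <- runif(n)          # fixed design, as in the setup above

one_draw <- function() {
  y     <- 1 + 2 * x + rnorm(n)              # true slope is 2
  ls    <- coef(lm(y ~ x))[2]                # least squares on all n points
  half  <- 1:(n / 2)
  other <- coef(lm(y[half] ~ x[half]))[2]    # linear and unbiased, but uses half the data
  c(ls, other)
}

draws <- replicate(5000, one_draw())
rowMeans(draws)        # both are close to 2 (unbiased)
apply(draws, 1, var)   # the least squares estimator has the smaller variance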

4.5 R Scripts

1. Acquire Data

Diabetes data

The diabetes data set is taken from the UCI machine learning database on Kaggle: Pima Indians Diabetes Database

  • 768 samples in the dataset
  • 8 quantitative variables
  • 2 classes; with or without signs of diabetes

Save the data into your working directory for this course as “diabetes.data”. Then load the data into R as follows:

# comma delimited data and no header for each variable
RawData = read.table("diabetes.data",sep =",",header=FALSE)

In RawData, the response variable is its last column; and the remaining columns are the predictor variables.

responseY = RawData[,dim(RawData)[2]]
predictorX = RawData[,1:(dim(RawData)[2]-1)]

2. Fitting a Linear Model

To fit linear regression models in R, use lm, which specifies linear models symbolically. A typical model takes the form response~predictors, where response is the (numeric) response vector and predictors is a series of predictor variables.

Take the full model and the base model (no predictors used) as examples:

fullModel = lm(responseY~predictorX[,1]+predictorX[,2]
        +predictorX[,3]+predictorX[,4]+predictorX[,5]
        +predictorX[,6]+predictorX[,7]+predictorX[,8])
baseModel = lm(responseY~1)

For the full model, coefficients gives the least squares estimate \(\hat{\beta}\), and fitted.values gives the fitted values of the response variable.

fullModel$coefficients
fullModel$fitted.values

The results for the coefficients should be as follows:

    (Intercept) predictorX[, 1] predictorX[, 2] predictorX[, 3] predictorX[, 4] predictorX[, 5]
  -0.8538942665    0.0205918715    0.0059202729   -0.0023318790    0.0001545198   -0.0001805345
predictorX[, 6] predictorX[, 7] predictorX[, 8]
   0.0132440315    0.1472374386    0.0026213938

The fitted values should start with 0.6517572852.
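To connect back to the notation used earlier in the lesson, the residual sum of squares and the usual coefficient summaries can be pulled directly from the fitted object. A short follow-up, assuming fullModel from the code above:

sum(residuals(fullModel)^2)   # RSS(beta_hat) for the full model
summary(fullModel)            # coefficient estimates, standard errors, R^2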