Lesson 3: Linear Regression


Introduction to Regression

Key Learning Goals for this Lesson:

Review the linear regression model, focusing on prediction
Use least squares estimation for linear regression
Apply a model developed on training data to an independent test data set
Set the context for more complex supervised prediction methods

Textbook reading: Consult Course Schedule

A quick review of regression, expectation, variance, and parameter estimation.

Input vector: \( X = (X_1, X_2, ... , X_p)\).

Output Y is real-valued.

Predict Y from X by f(X) so that the expected loss \(E(L(Y, f(X)))\) is minimized.

Review: Expectation

Intuitively, the expectation of a random variable is its "average" value under its distribution.

Formally, the expectation of a random variable X, denoted E[X], is its Lebesgue integral with respect to its distribution.

If X takes values in some countable numeric set \(\chi\), then

\[E(X) =\sum_{x \in \chi}xP(X=x) \]

If \(X \in \mathbb{R}^m\) has a density p, then

\[E(X) =\int_{\mathbb{R}^m}xp(x)dx \]

Expectation is linear: \(E(aX +b)=aE(X) + b\).

Also, \(E(X+Y) = E(X) +E(Y)\).

Expectation is monotone: if X ≥ Y, then E(X) ≥ E(Y).
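The discrete formula and the linearity property can be checked numerically. The sketch below assumes a fair six-sided die as the distribution; the die, the helper name `expectation`, and the constants `a` and `b` are illustrative choices, not part of the lesson.

```python
import numpy as np

# Hypothetical discrete distribution: a fair six-sided die.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

def expectation(values, probs):
    """E(X) = sum over x of x * P(X = x)."""
    return float(np.sum(values * probs))

e_x = expectation(values, probs)            # expectation of X

# Linearity: E(aX + b) should equal a * E(X) + b.
a, b = 2.0, 3.0                             # arbitrary constants
e_lin = expectation(a * values + b, probs)  # expectation of aX + b
```

Comparing `e_lin` with `a * e_x + b` shows the two sides of the linearity identity agree up to floating-point rounding.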

Review: Variance

The variance of a random variable X is defined as:

\(Var(X) = E[(X-E[X])^2]=E[X^2]-(E[X])^2\)

and, for any constants \(a, b \in \mathbb{R}\), the variance obeys:

\(Var(aX + b) =a^2Var(X)\)
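Both variance identities can be checked on simulated data. The sketch below assumes a normal sample with arbitrary mean and scale; the distribution, the seed, and the constants `a` and `b` are illustrative only. With the population-style variance (`ddof=0`, NumPy's default), the identities hold for the sample itself up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # hypothetical sample

# Two equivalent forms of the variance.
var_x = x.var()                                   # E[(X - E[X])^2]
var_two_forms = np.mean(x**2) - np.mean(x) ** 2   # E[X^2] - (E[X])^2

# Shifting by b leaves the variance alone; scaling by a multiplies it by a^2.
a, b = 3.0, -5.0                                  # arbitrary constants
var_lin = (a * x + b).var()
```

Here `var_lin` should match `a**2 * var_x`, and `var_two_forms` should match `var_x`.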

Review: Frequentist Basics

The data x₁, ..., x_n are generally assumed to be independent and identically distributed (i.i.d.).

We would like to estimate some unknown value θ associated with the distribution from which the data was generated.

In general, our estimate will be a function of the data (i.e., a statistic):

\[\hat{\theta} =f(x_1, x_2, ... , x_n)\]

Example: Given the results of n independent flips of a coin, determine the probability p with which it lands on heads.
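For the coin example, a natural statistic is the fraction of heads. Because the estimate is a function of random data, it is itself random, and repeating the experiment shows how it varies. The sketch below is a simulation under assumed values (a true p of 0.3, 200 flips per experiment, 5000 repeated experiments); none of these numbers come from the lesson.

```python
import numpy as np

rng = np.random.default_rng(42)
p_true, n, n_repeats = 0.3, 200, 5000   # hypothetical settings

# Each row is one experiment of n i.i.d. coin flips (1 = heads).
flips = rng.binomial(1, p_true, size=(n_repeats, n))

# The statistic: fraction of heads in each experiment.
p_hat = flips.mean(axis=1)

mean_of_estimates = p_hat.mean()        # should sit near p_true
sd_of_estimates = p_hat.std()           # spread of the estimator
theoretical_sd = np.sqrt(p_true * (1 - p_true) / n)
```

The spread of `p_hat` across experiments should be close to `theoretical_sd`, the standard deviation of a sample proportion.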

Review: Parameter Estimation

In practice, we often seek to select a distribution (model) corresponding to our data.

If our model is parameterized by some set of values, then this problem is that of parameter estimation.

How can we obtain estimates in general? One answer: maximize the likelihood. The resulting estimate is called the maximum likelihood estimate (MLE).

\[ \begin{aligned} \hat{\theta} & = \arg\max_{\theta} \prod_{i=1}^{n} p_{\theta}(x_i) \\
& = \arg\max_{\theta} \sum_{i=1}^{n} \log p_{\theta}(x_i)
\end{aligned} \]
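As a concrete instance of the maximization above, the sketch below computes the log-likelihood of coin-flip data over a grid of candidate values of p and picks the maximizer. The simulated data, the true value 0.6, and the grid search itself are assumptions made for illustration. For a Bernoulli model the MLE has the closed form "fraction of heads", so the grid result can be compared against it.

```python
import numpy as np

rng = np.random.default_rng(1)
flips = rng.binomial(1, 0.6, size=500)   # hypothetical data, true p = 0.6

def log_likelihood(p, x):
    """Sum of log p_theta(x_i) for a Bernoulli(p) model."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Candidate values of p, spaced 0.001 apart (endpoints avoid log(0)).
grid = np.linspace(0.001, 0.999, 999)
ll = np.array([log_likelihood(p, flips) for p in grid])

p_mle_grid = grid[np.argmax(ll)]         # maximizer found on the grid
p_closed_form = flips.mean()             # known closed-form Bernoulli MLE
```

Working with the sum of logs rather than the product avoids numerical underflow when n is large, which is one practical reason for the second line of the formula.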

Discussion

Let's look at the setup for linear regression. We have an input vector X = (X₁, X₂, ..., X_p). This vector is p-dimensional.

The output Y is real-valued, so its values are ordered.

We want to predict Y from X.

Before we can predict, we have to train the function f(X). By the end of training, we have a function f(X) that maps every X to an estimated Y. We then need some way to measure how good this predictor is. That measure is the expectation of a loss.

Why do we have loss in the estimation?

Y is a random variable given X. For instance, consider predicting someone's weight from their height. People of the same height can have different weights. If we take weight as Y and height as X, Y is random given X. We therefore cannot predict perfectly for every subject: f(X) is a fixed function, so it cannot be correct all the time. The loss measures how far the true Y is from the prediction.

Why do we have the overall loss expressed as an expectation?

The loss may differ from subject to subject. A common approach in statistics is to average the loss over the entire population, which is what the expectation does.

Squared loss:

L(Y, f(X)) = (Y - f(X))².

We take the difference between the true value and the prediction and square it, so that positive and negative differences are treated symmetrically.

If the distribution of Y given X is known, the optimal predictor under squared loss is:

\[ f^*(X) = \arg\min_{f(X)} E\left[(Y - f(X))^2\right] = E(Y \mid X) \]

This is the conditional expectation of Y given X. The function E(Y | X) is called the regression function.
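A small simulation can make the optimality of the regression function concrete. The sketch below assumes a made-up data-generating model in which E(Y | X) = sin(2πX) with normal noise of standard deviation 0.5; the model and the two competing predictors are illustrative assumptions. It compares the average squared loss of the regression function with that of two other fixed predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical model: Y = sin(2*pi*X) + noise, so E(Y | X) = sin(2*pi*X).
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.5, size=n)

def mean_sq_loss(pred):
    """Sample average of the squared loss (Y - f(X))^2."""
    return float(np.mean((y - pred) ** 2))

loss_regression = mean_sq_loss(np.sin(2 * np.pi * x))      # f(X) = E(Y | X)
loss_shifted = mean_sq_loss(np.sin(2 * np.pi * x) + 0.2)   # biased predictor
loss_constant = mean_sq_loss(np.full(n, y.mean()))         # ignores X entirely
```

Even the regression function has a nonzero loss, roughly the noise variance 0.25 here. This is the part of Y that no fixed function of X can predict.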

Example

We want to predict the number of physicians in a metropolitan area.

Problem: The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y, is expected to be related to total population (X₁, measured in thousands), land area (X₂, measured in square miles), and total personal income (X₃, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.

i :	1	2	3	...	139	140	141
X₁	9387	7031	7017	...	233	232	231
X₂	1348	4069	3719	...	1011	813	654
X₃	72100	52737	54542	...	1337	1589	1148
Y	25627	15389	13326	...	264	371	140

Our Goal: To predict Y from X₁, X₂, and X₃.

This is a typical regression problem.
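To show the mechanics of a least squares fit for this problem, the sketch below uses only the six SMSAs visible in the table (i = 1, 2, 3, 139, 140, 141). The remaining 135 rows are elided in the lesson and are not filled in here, so the fitted coefficients are not meaningful estimates for the full data set. The use of NumPy's `lstsq` and the helper name `sse` are choices made for this illustration.

```python
import numpy as np

# Only the six SMSAs shown in the table; the elided rows are NOT included.
x1 = np.array([9387, 7031, 7017, 233, 232, 231], dtype=float)        # population (thousands)
x2 = np.array([1348, 4069, 3719, 1011, 813, 654], dtype=float)       # land area (sq. miles)
x3 = np.array([72100, 52737, 54542, 1337, 1589, 1148], dtype=float)  # income (millions of $)
y = np.array([25627, 15389, 13326, 264, 371, 140], dtype=float)      # active physicians

# Design matrix with an intercept column followed by X1, X2, X3.
X = np.column_stack([np.ones_like(x1), x1, x2, x3])

# Least squares: choose beta to minimize the sum of squared residuals.
beta_hat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

def sse(beta):
    """Sum of squared errors of the linear predictor X @ beta."""
    return float(np.sum((y - X @ beta) ** 2))

sse_fit = sse(beta_hat)
sse_mean_only = float(np.sum((y - y.mean()) ** 2))  # predictor that ignores X
```

With four coefficients and six observations the fit is nearly saturated, so a small `sse_fit` here says little about how well the model would predict new SMSAs. That question is what an independent test set addresses.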
