Lesson 3 : Linear Regression
Introduction to Regression
Key Learning Goals for this Lesson: 
Textbook reading: Consult Course Schedule 
A quick review of regression, expectation, variance, and parameter estimation.
Input vector: \( X = (X_1, X_2, ... , X_p)\).
Output Y is real-valued.
Predict Y from X by f(X) so that the expected loss function \(E(L(Y, f(X)))\) is minimized.
Review: Expectation
Intuitively, the expectation of a random variable is its "average" value under its distribution.
Formally, the expectation of a random variable X, denoted E[X], is its Lebesgue integral with respect to its distribution.
If X takes values in some countable numeric set \(\chi\), then
\[E(X) =\sum_{x \in \chi}xP(X=x) \]
If \(X \in \mathbb{R}^m\) has a density p, then
\[E(X) =\int_{\mathbb{R}^m}xp(x)dx \]
Expectation is linear: \(E(aX +b)=aE(X) + b\).
Also, \(E(X+Y) = E(X) +E(Y)\).
Expectation is monotone: if \(X \ge Y\), then \(E(X) \ge E(Y)\).
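To make the discrete formula and the linearity property concrete, here is a minimal sketch. The fair six-sided die and the constants a = 3, b = 2 are assumptions chosen for illustration; they do not come from the lesson.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum over x of x * P(X = x), for a finite pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Assumed example: a fair six-sided die (not from the lesson).
die = {x: Fraction(1, 6) for x in range(1, 7)}
e_x = expectation(die)

# Linearity: E[aX + b] should equal a * E[X] + b. The constants are illustrative.
a, b = 3, 2
transformed = {a * x + b: p for x, p in die.items()}
e_ax_b = expectation(transformed)

print(e_x, e_ax_b, a * e_x + b)
```

Exact fractions are used instead of floats so that the two sides of the linearity identity can be compared with `==`.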
Review: Variance
The variance of a random variable X is defined as:
\(Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2\)
and the variance obeys the following for \(a, b \in \mathbb{R}\):
\(Var(aX + b) =a^2Var(X)\)
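The two variance identities can be checked numerically. The sketch below again assumes a fair six-sided die and illustrative constants a = 3, b = 2; none of these come from the lesson.

```python
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a finite pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2], computed directly from the definition."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** 2)

# Assumed example: a fair six-sided die.
die = {x: Fraction(1, 6) for x in range(1, 7)}

var_definition = variance(die)
var_shortcut = expectation(die, lambda x: x ** 2) - expectation(die) ** 2

# Var(aX + b) = a^2 Var(X): the shift b drops out, the scale a is squared.
a, b = 3, 2
var_scaled = variance({a * x + b: p for x, p in die.items()})

print(var_definition, var_shortcut, var_scaled)
```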
Review: Frequentist Basics
The data \(x_1, ... , x_n\) are generally assumed to be independent and identically distributed (i.i.d.).
We would like to estimate some unknown value θ associated with the distribution from which the data was generated.
In general, our estimate will be a function of the data (i.e., a statistic)
\[\hat{\theta} =f(x_1, x_2, ... , x_n)\]
Example: Given the results of n independent flips of a coin, determine the probability p with which it lands on heads.
Review: Parameter Estimation
In practice, we often seek to select a distribution (model) corresponding to our data.
If our model is parameterized by some set of values, then this problem is that of parameter estimation.
How can we obtain estimates in general? One answer: maximize the likelihood. The resulting estimate is called the maximum likelihood estimate (MLE). Because the logarithm is increasing, maximizing the likelihood is equivalent to maximizing the log-likelihood:
\[ \begin{aligned} \hat{\theta} & = \arg\max_{\theta} \prod_{i=1}^{n}p_{\theta}(x_i) \\
& = \arg\max_{\theta} \sum_{i=1}^{n}\log (p_{\theta}(x_i))
\end{aligned} \]
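As a rough illustration of the MLE for the coin-flip example, the sketch below maximizes the Bernoulli log-likelihood over a grid of candidate values of p. The ten flips are made-up data, and the grid search is one simple way to do the maximization, chosen here only because it is easy to read.

```python
import math

# Made-up data for illustration: 1 = heads, 0 = tails (7 heads in 10 flips).
flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(p, data):
    """Sum of log p_theta(x_i) for i.i.d. Bernoulli(p) observations."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded to avoid log(0)).
grid = [k / 1000 for k in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, flips))

# For a Bernoulli model the MLE has a closed form: the sample proportion of heads.
sample_proportion = sum(flips) / len(flips)

print(p_hat, sample_proportion)
```

The grid maximizer agrees with the sample proportion of heads, which is the statistic \(\hat{\theta} = f(x_1, ..., x_n)\) one would naturally use for this problem.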
Discussion
Let's look at the setup for linear regression. We have an input vector \(X = (X_1, X_2, ... , X_p)\). This vector is p-dimensional.
The output Y is a real value and is ordered.
We want to predict Y from X.
Before we actually do the prediction, we have to train the function f(X). By the end of the training, we have a function f(X) that maps every X to an estimated Y. Then we need some way to measure how good this predictor function is. This is measured by the expectation of a loss.
Why do we have loss in the estimation?
Y is actually a random variable given X. For instance, consider predicting someone's weight based on the person's height. People can have different weights given the same height. If you think of the weight as Y and the height as X, Y is random given X. We therefore cannot have a perfect prediction for every subject, because f(X) is a fixed function and cannot be correct all the time. The loss measures how different the true Y is from your prediction.
Why do we have the overall loss expressed as an expectation?
The loss may be different for different subjects. In statistics, a common thing to do is to average the losses over the entire population.
Squared loss:
\(L(Y, f(X)) = (Y - f(X))^2\).
We simply take the difference between Y and the prediction f(X) and square it, so that negative and positive differences are treated symmetrically.
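The squared loss and its average over subjects can be sketched in a few lines. The weights below are made-up toy values for three people of the same height, so a fixed f(X) has to give all three the same prediction; the numbers are assumptions for illustration only.

```python
def squared_loss(y, prediction):
    """L(Y, f(X)) = (Y - f(X))^2 for a single subject."""
    return (y - prediction) ** 2

def average_loss(ys, predictions):
    """Average of the squared losses, a sample stand-in for the expected loss."""
    losses = [squared_loss(y, p) for y, p in zip(ys, predictions)]
    return sum(losses) / len(losses)

# Made-up toy values: three people with the same height (same X) but different weights.
weights = [60.0, 70.0, 80.0]
# A fixed function f gives the same prediction for the same X.
predictions = [70.0, 70.0, 70.0]

avg = average_loss(weights, predictions)
print(avg)
```

Under- and over-prediction by the same amount contribute the same loss, and the average is nonzero even though the prediction is exactly right for one subject.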
If the distribution of Y given X is known, the optimal predictor is:
\[ f^*(X) = \arg\min_{f(X)} E[(Y - f(X))^2] = E(Y \mid X) \]
This is the conditional expectation of Y given X. The function \(E(Y \mid X)\) is called the regression function.
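The claim that the conditional expectation minimizes the expected squared loss can be checked on a small example. The joint distribution below is hypothetical and chosen only so the arithmetic is exact; the check compares the regression function against shifted versions of itself, which is a spot check and not a proof.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 1): F(1, 4), (0, 3): F(1, 4),
    (1, 2): F(1, 8), (1, 6): F(3, 8),
}

def conditional_mean(joint, x):
    """E(Y | X = x) for a finite joint pmf."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / p_x

def expected_squared_loss(joint, f):
    """E[(Y - f(X))^2] under the joint pmf."""
    return sum(p * (y - f(x)) ** 2 for (x, y), p in joint.items())

x_values = sorted({x for x, _ in joint})
regression = {x: conditional_mean(joint, x) for x in x_values}
best_loss = expected_squared_loss(joint, lambda x: regression[x])

# Shift the predictor away from the conditional mean and recompute the loss.
shifted_losses = [
    expected_squared_loss(joint, lambda x, d=d: regression[x] + d)
    for d in (F(-1), F(-1, 2), F(1, 2), F(1))
]

print(regression, best_loss, shifted_losses)
```

Each shifted predictor has a strictly larger expected loss than the conditional mean, consistent with \(f^*(X) = E(Y \mid X)\).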
Example
We want to predict the number of physicians in a metropolitan area.
Problem: The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y, is expected to be related to total population (\(X_1\), measured in thousands), land area (\(X_2\), measured in square miles), and total personal income (\(X_3\), measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table.
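| i | 1 | 2 | 3 | ... | 139 | 140 | 141 |
|---|---|---|---|-----|-----|-----|-----|
| \(X_1\) | 9387 | 7031 | 7017 | ... | 233 | 232 | 231 |
| \(X_2\) | 1348 | 4069 | 3719 | ... | 1011 | 813 | 654 |
| \(X_3\) | 72100 | 52737 | 54542 | ... | 1337 | 1589 | 1148 |
| Y | 25627 | 15389 | 13326 | ... | 264 | 371 | 140 |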
Our Goal: To predict Y from \(X_1\), \(X_2\), and \(X_3\).
This is a typical regression problem.
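As a sketch of how this problem could be set up in code, the example below fits a linear function of \(X_1\), \(X_2\), and \(X_3\) by least squares using NumPy. Two assumptions should be kept in mind: the linear form with an intercept is assumed here (the lesson has not yet specified a model at this point), and only the six SMSAs visible in the table are used, since the other 135 rows are elided. The fitted coefficients are therefore not meaningful estimates; the point is the mechanics.

```python
import numpy as np

# Only the six SMSAs shown in the table (i = 1, 2, 3, 139, 140, 141).
# Columns: population (thousands), land area (sq. miles), income (millions of dollars).
X = np.array([
    [9387, 1348, 72100],
    [7031, 4069, 52737],
    [7017, 3719, 54542],
    [233, 1011, 1337],
    [232, 813, 1589],
    [231, 654, 1148],
], dtype=float)
y = np.array([25627, 15389, 13326, 264, 371, 140], dtype=float)

# Assumed model: Y is approximated by b0 + b1*X1 + b2*X2 + b3*X3.
A = np.column_stack([np.ones(len(X)), X])

# Least squares minimizes the sum of squared losses (y_i - f(x_i))^2 over the sample.
beta, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta

sse_fit = float(np.sum((y - fitted) ** 2))
sse_mean = float(np.sum((y - y.mean()) ** 2))  # loss of predicting the mean for everyone
print(beta, sse_fit, sse_mean)
```

Comparing `sse_fit` with `sse_mean` shows how much the sample squared loss drops when the three inputs are used instead of a single constant prediction.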