3.4 - Theoretical Justification


If the Linear Model Is True

Here is some theoretical justification for why we do parameter estimation using least squares.

If the linear model is true, i.e., if the conditional expectation of Y given X is indeed a linear function of the \(X_j\)'s,

\[ E(Y|X)=\beta_0+\sum_{j=1}^{p}X_{j}\beta_{j}, \]

and Y is the sum of that linear function and independent Gaussian noise, then least squares estimation has the following properties.

1. The least squares estimate of \(\beta\) is unbiased (illustrated in the simulation sketch following this list):

\(E(\hat{\beta}_{j}) =\beta_j, \quad j=0,1,\dots,p. \)

2. To draw inferences about \(\beta\), further assume: \(Y = E(Y | X) + \epsilon\) where \(\epsilon \sim N(0,\sigma^2)\) and is independent of X.

The \(X_{ij}\) are regarded as fixed, and the \(Y_i\) are random due to \(\epsilon\).

The estimation accuracy of \(\hat{\beta}\) is measured by its variance, which is given by:

\[Var(\hat{\beta})=(X^{T}X)^{-1}\sigma^2  \]

You should see that the higher \(\sigma^2\) is, the higher the variance of \(\hat{\beta}\) will be. This is very natural: if the noise level is high, you are bound to have a large variance in your estimate. But the variance also depends on \(X^T X\); this is why, in experimental design, methods have been developed to choose X so that the variance tends to be small.

Note that \(\hat{\beta}\) is a vector, and hence its variance is a covariance matrix of size (p + 1) × (p + 1). The covariance matrix gives not only the variance of each individual \(\hat{\beta}_j\), but also the covariance between any pair \(\hat{\beta}_j\) and \(\hat{\beta}_k\), \(j \ne k\).
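
Below is a minimal simulation sketch of both properties (Python/NumPy; the design matrix, true coefficients, and noise level are made-up values for illustration, not from the lesson): the least squares estimates average out to the true \(\beta\), and their empirical covariance is close to \((X^{T}X)^{-1}\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
# Fixed design matrix with an intercept column (X is treated as fixed).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])   # (p + 1) coefficients, made up
sigma = 1.5                                   # noise standard deviation, made up

# Generate many replicates of Y = X beta + epsilon and refit each time.
estimates = np.empty((5000, p + 1))
for i in range(estimates.shape[0]):
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    estimates[i] = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate

# Property 1: the average estimate is close to the true beta (unbiasedness).
print(estimates.mean(axis=0))
print(beta_true)

# Property 2: the empirical covariance of the estimates is close to
# the theoretical covariance matrix (X^T X)^{-1} sigma^2.
print(np.cov(estimates, rowvar=False).round(3))
print((np.linalg.inv(X.T @ X) * sigma**2).round(3))
```

With enough replicates the mean of the estimates matches beta_true to within simulation error, and the two printed covariance matrices agree closely.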

Gauss-Markov Theorem

This theorem says that the least squares estimator is the best linear unbiased estimator.

Assume that the linear model is true. Any linear combination of the parameters \(\beta_0, \dots, \beta_p\) defines a new parameter, denoted by \(\theta = a^{T}\beta\). Then \(a^{T}\hat{\beta}\) is simply a weighted sum of \(\hat{\beta}_0, \dots, \hat{\beta}_p\) and is an unbiased estimator of \(\theta\), since \(\hat{\beta}\) is unbiased.

We want to estimate \(\theta\), and the least squares estimate of \(\theta\) is:

\( \begin{align} \hat{\theta} & = a^T\hat{\beta}\\
& = a^T(X^{T}X)^{-1}X^{T}y \\
& \doteq \tilde{a}^{T}y, \\
\end{align} \)

which is linear in y. The Gauss-Markov theorem states that, for any other linear unbiased estimator \(c^{T}y\) of \(\theta\), the least squares estimator \(\tilde{a}^{T}y\) has a variance no larger than that of \(c^{T}y\):

\[Var(\tilde{a}^{T}y) \le  Var(c^{T}y). \]

Keep in mind that you're only comparing with linear unbiased estimators. If the estimator is not linear, or is not unbiased, then it is possible to do better in terms of squared loss.
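
To make the comparison concrete, here is a hedged sketch reusing X, sigma, and rng from the simulation above (the choice of \(a\) is arbitrary): the least squares estimator of \(\theta = a^{T}\beta\) is \(\tilde{a}^{T}y\) with \(\tilde{a} = X(X^{T}X)^{-1}a\), and adding any vector \(v\) with \(X^{T}v = 0\) to \(\tilde{a}\) produces another linear unbiased estimator \(c^{T}y\) whose variance is at least as large.

```python
# Reuses X, sigma, and rng from the simulation sketch above.
a = np.array([0.0, 1.0, 0.0, 0.0])              # theta = beta_1 (arbitrary choice)
a_tilde = X @ np.linalg.solve(X.T @ X, a)       # a_tilde^T y is the LS estimator of theta

# Build v orthogonal to the column space of X (so X^T v = 0); then
# c^T y = (a_tilde + v)^T y is still a linear unbiased estimator of theta.
z = rng.normal(size=X.shape[0])
v = z - X @ np.linalg.solve(X.T @ X, X.T @ z)
c = a_tilde + v

var_ls = sigma**2 * a_tilde @ a_tilde           # Var(a_tilde^T y) = sigma^2 ||a_tilde||^2
var_other = sigma**2 * c @ c                    # Var(c^T y) = sigma^2 ||c||^2
print(var_ls, var_other)                        # var_ls <= var_other, as Gauss-Markov states
```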

Each \(\beta_j\), \(j = 0, 1, \dots, p\), is a special case of \(a^{T}\beta\), where \(a\) has a single non-zero element, equal to 1, in the position corresponding to \(\beta_j\).
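
For example, taking \(a\) to be the unit vector that selects \(\beta_j\) gives \(\theta = \beta_j\), and the variance of its least squares estimate is the corresponding diagonal element of the covariance matrix given earlier:

\[ Var(\hat{\beta}_{j}) = a^{T}(X^{T}X)^{-1}a\,\sigma^2 = \left[(X^{T}X)^{-1}\right]_{jj}\sigma^2. \]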