Now that we have the idea of least squares behind us, let's make the method more practical by finding a formula for the intercept \(a_1\) and slope \(b\). We learned that in order to find the least squares regression line, we need to minimize the sum of the squared prediction errors, that is:

\(Q=\sum\limits_{i=1}^n (y_i-\hat{y}_i)^2\)

We just need to replace that \(\hat{y}_i\) with the formula for the equation of a line:

\(\hat{y}_i=a_1+bx_i\)

to get:

\(Q=\sum\limits_{i=1}^n (y_i-\hat{y}_i)^2=\sum\limits_{i=1}^n (y_i-(a_1+bx_i))^2\)
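To see concretely what is being minimized, here is a small Python sketch (with made-up data) that scores any candidate intercept \(a_1\) and slope \(b\) by its sum of squared prediction errors:

```python
# Made-up data: four (x, y) pairs.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]

def Q(a1, b):
    """Sum of squared prediction errors for the candidate line y-hat = a1 + b*x."""
    return sum((yi - (a1 + b * xi)) ** 2 for xi, yi in zip(x, y))

# Least squares chooses the (a1, b) pair that makes Q as small as possible;
# any other line leaves a larger Q.
```

Every candidate line can be scored this way; the least squares line is, by definition, the one with the smallest score.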

We could go ahead and minimize \(Q\) as such, but our textbook authors have opted to use a different form of the equation for a line, namely:

\(\hat{y}_i=a+b(x_i-\bar{x})\)

Each form of the equation for a line has its advantages and disadvantages. Statistical software, such as Minitab, will typically calculate the least squares regression line using the form:

\(\hat{y}_i=a_1+bx_i\)

Clearly a plus if you can get some computer to do the dirty work for you. A (minor) disadvantage of using this form of the equation, though, is that the intercept \(a_1\) is the predicted value of the response \(y\) when the predictor \(x=0\), which is typically not very meaningful. For example, if \(x\) is a student's height (in inches) and \(y\) is a student's weight (in pounds), then the intercept \(a_1\) is the predicted weight of a student who is 0 inches tall... errrr... you get the idea. On the other hand, if we use the equation:

\(\hat{y}_i=a+b(x_i-\bar{x})\)

then the intercept \(a\) is the predicted value of the response \(y\) when the predictor \(x_i=\bar{x}\), that is, the average of the \(x\) values. For example, if \(x\) is a student's height (in inches) and \(y\) is a student's weight (in pounds), then the intercept \(a\) is the predicted weight of a student who is average in height. Much better, much more meaningful! The good news is that it is easy enough to get statistical software, such as Minitab, to calculate the least squares regression line in this form as well.
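To make the two interpretations concrete, here is a minimal Python sketch (the heights, weights, and coefficients are all made up for illustration) showing that the centered form and the slope-intercept form describe the same line, related by \(a_1=a-b\bar{x}\):

```python
# Made-up numbers: heights in inches, weights in pounds.
x_bar = 68.0   # average height of the students
a = 150.0      # centered-form intercept: predicted weight at the AVERAGE height
b = 4.0        # slope: predicted pounds per additional inch

def predict_centered(x):
    """y-hat = a + b*(x - x_bar)."""
    return a + b * (x - x_bar)

# The equivalent slope-intercept form y-hat = a1 + b*x has a1 = a - b*x_bar.
a1 = a - b * x_bar

def predict_usual(x):
    """y-hat = a1 + b*x."""
    return a1 + b * x

# a1 is the (not very meaningful) predicted weight of a 0-inch-tall student,
# while a is the (meaningful) predicted weight of a student of average height.
```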

Okay, with that aside behind us, time to get to the punchline.

## Least Squares Estimates

The **least squares regression line** is:

\(\hat{y}_i=a+b(x_i-\bar{x})\)

**with least squares estimates:**

\(a=\bar{y}\) and \(b=\dfrac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum\limits_{i=1}^n (x_i-\bar{x})^2}\)
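As a quick sanity check, the estimates can be computed directly from these formulas; here is a small Python sketch with made-up data whose answer is easy to verify by hand:

```python
# Made-up data: four (x, y) pairs.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]

n = len(x)
x_bar = sum(x) / n   # 2.5
y_bar = sum(y) / n   # 4.0

# Least squares estimates for y-hat = a + b*(x - x_bar):
a = y_bar
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
# Here b = 7/5 = 1.4, so the fitted line is y-hat = 4.0 + 1.4*(x - 2.5).
```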

### Proof

In order to derive the formulas for the intercept \(a\) and slope \(b\), we need to minimize:

\(Q=\sum\limits_{i=1}^n (y_i-(a+b(x_i-\bar{x})))^2\)

Time to put on your calculus cap, as minimizing \(Q\) involves taking the partial derivative of \(Q\) with respect to each of \(a\) and \(b\), setting it to 0, and then solving for \(a\) and \(b\). Let's do that. Starting with the partial derivative of \(Q\) with respect to \(a\), and setting it to 0, we get:

\(\dfrac{\partial Q}{\partial a}=-2\sum\limits_{i=1}^n (y_i-a-b(x_i-\bar{x}))=0\)

Dividing through by \(-2\) and distributing the summation, while noting that \(\sum\limits_{i=1}^n (x_i-\bar{x})=0\), we are left with:

\(\sum\limits_{i=1}^n y_i-na=0\)

Solving for \(a\), we get \(a=\bar{y}\).

Now knowing that \(a\) is \(\bar{y}\), the average of the responses, let's replace \(a\) with \(\bar{y}\) in the formula for \(Q\):

\(Q=\sum\limits_{i=1}^n (y_i-(\bar{y}+b(x_i-\bar{x})))^2\)

and take the derivative of \(Q\) with respect to \(b\). Doing so, and setting the result to 0, we get:

\(\dfrac{\partial Q}{\partial b}=-2\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}-b(x_i-\bar{x}))=0\)

Dividing through by \(-2\) and distributing the summation, we have:

\(\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})-b\sum\limits_{i=1}^n (x_i-\bar{x})^2=0\)

Solving for \(b\), we get:

\(b=\dfrac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum\limits_{i=1}^n (x_i-\bar{x})^2}\)

As was to be proved.

By the way, you might want to note that the only assumption relied on for the above calculations is that the relationship between the response \(y\) and the predictor \(x\) is linear.

Another thing you might note is that the formula for the slope \(b\) is just fine provided you have statistical software to make the calculations. But what would you do if you were stranded on a desert island and needed to find the least squares regression line for the relationship between the depth of the tide and the time of day? You'd probably appreciate having a simpler calculation formula! You might also appreciate understanding the relationship between the slope \(b\) and the sample correlation coefficient \(r\).

With that lame motivation behind us, let's derive alternative calculation formulas for the slope \(b\).

An alternative formula for the slope \(b\) of the least squares regression line:

\(\hat{y}_i=a+b(x_i-\bar{x})\)

is:

\(b=\dfrac{\sum\limits_{i=1}^n (x_i-\bar{x})y_i}{\sum\limits_{i=1}^n (x_i-\bar{x})^2}=\dfrac{\sum\limits_{i=1}^n x_iy_i-\left(\dfrac{1}{n}\right) \left(\sum\limits_{i=1}^n x_i\right) \left(\sum\limits_{i=1}^n y_i\right)}{\sum\limits_{i=1}^n x^2_i-\left(\dfrac{1}{n}\right) \left(\sum\limits_{i=1}^n x_i\right)^2}\)
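Here is a quick numerical check, as a Python sketch with made-up data, that the deviation form and the raw-sums "desert island" form of this formula give the same slope:

```python
# Made-up data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
n = len(x)
x_bar = sum(x) / n

# Deviation form: deviations of x paired with the raw y values.
b_dev = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)

# Computational form: raw sums only -- no deviations needed.
b_raw = (sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n) / \
        (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n)

# Both equal 1.4 for these data, matching the original slope formula.
```

The computational form needs only the running totals \(\sum x_i\), \(\sum y_i\), \(\sum x_iy_i\), and \(\sum x_i^2\), which is exactly why it is handy for hand calculation.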

### Proof

The proof, which may or may not show up on a quiz or exam, is left for you as an exercise.