7.2 - Least Squares: The Idea

Example 7-1

Before delving into the theory of least squares, let's motivate the idea behind the method of least squares by way of example.

A student was interested in quantifying the (linear) relationship between height (in inches) and weight (in pounds), so she measured the height and weight of ten randomly selected students in her class. After taking the measurements, she created the adjacent scatterplot of the obtained heights and weights. Wanting to summarize the relationship between height and weight, she eyeballed what she thought were two good lines (solid and dashed), but couldn't decide between:

\(\text{weight} = −266.5 + 6.1\times \text{height}\)
\(\text{weight} = −331.2 + 7.1\times \text{height}\)

Which is the "best fitting line"?

Answer

In order to facilitate finding the best fitting line, let's define some notation. Recalling that an experimental unit is the thing being measured (in this case, a student):

let \(y_i\) denote the observed response for the \(i^{th}\) experimental unit
let \(x_i\) denote the predictor value for the \(i^{th}\) experimental unit
let \(\hat{y}_i\) denote the predicted response (or fitted value) for the \(i^{th}\) experimental unit

Therefore, for the data point circled in red, we have:

\(x_i=75\) and \(y_i=208\)

And, using the unrounded version of the solid line, the predicted weight of a randomly selected 75-inch-tall student is:

\(\hat{y}_i=-266.534+6.13758(75)=193.8\)

pounds. Now, of course, the estimated line does not predict the weight of a 75-inch tall student perfectly. In this case, the prediction is 193.8 pounds, when the reality is 208 pounds. We have made an error in our prediction. That is, in using \(\hat{y}_i\) to predict the actual response \(y_i\) we make a prediction error (or a residual error) of size:

\(e_i=y_i-\hat{y}_i\)
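As a quick check of the arithmetic, the following Python sketch computes the predicted weight and the prediction error for the circled data point. The variable names are illustrative and not part of the original example; the coefficients are the unrounded ones quoted above.

```python
# Unrounded coefficients of the proposed line, as quoted in the text.
intercept = -266.534
slope = 6.13758

# The circled data point: height in inches, weight in pounds.
x_i = 75
y_i = 208

y_hat_i = intercept + slope * x_i   # predicted weight
e_i = y_i - y_hat_i                 # prediction error (residual)

print(f"predicted weight: {y_hat_i:.1f} pounds")
print(f"prediction error: {e_i:.1f} pounds")
```

The error is positive here because the line underpredicts this student's weight.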

Now, a line that fits the data well will be one for which the \(n\) prediction errors (one for each of the \(n\) data points, with \(n=10\) in this case) are as small as possible in some overall sense. One way to make "as small as possible" precise is the "least squares criterion." In short, the least squares criterion tells us that in order to find the equation of the best fitting line:

\(\hat{y}_i=a_1+bx_i\)

we need to choose the values \(a_1\) and \(b\) that minimize the sum of the squared prediction errors. That is, find \(a_1\) and \(b\) that minimize:

\(Q=\sum\limits_{i=1}^n (y_i-\hat{y}_i)^2=\sum\limits_{i=1}^n (y_i-(a_1+bx_i))^2\)
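To make the formula for \(Q\) concrete, here is a minimal Python sketch of it. The function name `sum_squared_errors` and the three toy data points are hypothetical, chosen only so the arithmetic is easy to follow by hand; they are not the student's data.

```python
def sum_squared_errors(a1, b, xs, ys):
    """Return Q, the sum of squared prediction errors for the line a1 + b*x."""
    return sum((y - (a1 + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
toy_x = [1, 2, 3]
toy_y = [2, 3, 5]

# Candidate line y_hat = 0 + 1.5x gives prediction errors 0.5, 0.0, 0.5.
q_toy = sum_squared_errors(0.0, 1.5, toy_x, toy_y)
print(q_toy)  # 0.25 + 0.0 + 0.25 = 0.5
```

Different choices of `a1` and `b` give different values of \(Q\); the least squares criterion prefers the choice that makes \(Q\) smallest.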

So, using the least squares criterion to determine which of the two lines:

\(\text{weight} = −266.5 + 6.1 \times\text{height}\)
\(\text{weight} = −331.2 + 7.1 \times \text{height}\)

is the better fitting line, we just need to determine \(Q\), the sum of the squared prediction errors, for each of the two lines, and choose the line that has the smaller value of \(Q\). For the dashed line, that is, for the line:

\(\text{weight} = −331.2 + 7.1\times\text{height}\)

here's what the work would look like:

\(i\)	\(x_i\)	\(y_i\)	\(\hat{y}_i\)	\(y_i-\hat{y}_i\)	\((y_i-\hat{y}_i)^2\)
1	64	121	123.2	-2.2	4.84
2	73	181	187.1	-6.1	37.21
3	71	156	172.9	-16.9	285.61
4	69	162	158.7	3.3	10.89
5	66	142	137.4	4.6	21.16
6	69	157	158.7	-1.7	2.89
7	75	208	201.3	6.7	44.89
8	71	169	172.9	-3.9	15.21
9	63	127	116.1	10.9	118.81
10	72	165	180.0	-15.0	225.00
					------
					766.51

The first column labeled \(i\) just keeps track of the index of the data points, \(i=1, 2, \ldots, 10\). The columns labeled \(x_i\) and \(y_i\) contain the original data points. For example, the first student measured is 64 inches tall and weighs 121 pounds. The fourth column, labeled \(\hat{y}_i\), contains the predicted weight of each student. For example, the predicted weight of the first student, who is 64 inches tall, is:

\(\hat{y}_1=-331.2+7.1(64)=123.2\)

pounds. The fifth column contains the errors in using \(\hat{y}_i\) to predict \(y_i\). For the first student, the prediction error is:

\(e_1=121-123.2=-2.2\)

And, the last column contains the squared prediction errors. The squared prediction error for the first student is:

\(e^2_1=(-2.2)^2=4.84\)
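The column-by-column work for the dashed line can be reproduced with a short Python sketch. The data are the ten heights and weights from the table; the variable names and the print layout are illustrative assumptions.

```python
# Heights (inches) and weights (pounds) of the ten students, from the table.
heights = [64, 73, 71, 69, 66, 69, 75, 71, 63, 72]
weights = [121, 181, 156, 162, 142, 157, 208, 169, 127, 165]

# Dashed line: weight = -331.2 + 7.1 * height
a1, b = -331.2, 7.1

q = 0.0
print(" i   x_i  y_i  y_hat    e_i   e_i^2")
for i, (x, y) in enumerate(zip(heights, weights), start=1):
    y_hat = a1 + b * x      # predicted weight
    e = y - y_hat           # prediction error
    q += e ** 2             # running sum of squared prediction errors
    print(f"{i:2d}  {x:4d} {y:4d} {y_hat:6.1f} {e:6.1f} {e ** 2:7.2f}")

print(f"Q = {q:.2f}")
```

Each printed row corresponds to one row of the table, and the final line reports the total of the last column.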

By summing up the last column, that is, the column containing the squared prediction errors, we see that \(Q= 766.51\) for the dashed line. Now, for the solid line, that is, for the line:

\(\text{weight} = −266.5 + 6.1\times\text{height}\)

here's what the work would look like:

\(i\)	\(x_i\)	\(y_i\)	\(\hat{y}_i\)	\(y_i-\hat{y}_i\)	\((y_i-\hat{y}_i)^2\)
1	64	121	126.271	-5.3	28.09
2	73	181	181.509	-0.5	0.25
3	71	156	169.234	-13.2	174.24
4	69	162	156.959	5.0	25.00
5	66	142	138.546	3.5	12.25
6	69	157	156.959	0.0	0.00
7	75	208	193.784	14.2	201.64
8	71	169	169.234	-0.2	0.04
9	63	127	120.133	6.9	47.61
10	72	165	175.371	-10.4	108.16
					------
					597.28

The calculations for each column are just as described previously. In this case, the sum of the last column, that is, the sum of the squared prediction errors for the solid line is \(Q= 597.28\). Choosing the equation that minimizes \(Q\), we can conclude that the solid line, that is:

\(\text{weight} = −266.5 + 6.1\times\text{height}\)

is the best fitting line.
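The comparison of the two candidate lines can be sketched in Python as follows. This is an illustration under two assumptions: the solid line uses the unrounded coefficients quoted earlier (\(-266.534\) and \(6.13758\)), and each prediction error is rounded to one decimal place before squaring, as in the hand-worked tables. The function name `q_rounded` is hypothetical.

```python
# Heights (inches) and weights (pounds) of the ten students, from the tables.
heights = [64, 73, 71, 69, 66, 69, 75, 71, 63, 72]
weights = [121, 181, 156, 162, 142, 157, 208, 169, 127, 165]

def q_rounded(a1, b, xs, ys):
    """Sum of squared prediction errors, rounding each error to one decimal
    place before squaring, as the hand-worked tables do."""
    return sum(round(y - (a1 + b * x), 1) ** 2 for x, y in zip(xs, ys))

candidates = {
    "solid": (-266.534, 6.13758),   # unrounded coefficients quoted earlier
    "dashed": (-331.2, 7.1),
}

q_values = {name: q_rounded(a1, b, heights, weights)
            for name, (a1, b) in candidates.items()}

for name, q in q_values.items():
    print(f"{name}: Q = {q:.2f}")

best = min(q_values, key=q_values.get)
print("smaller Q:", best)
```

Skipping the rounding step changes the solid line's \(Q\) slightly but does not change which of the two lines has the smaller value.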

In the preceding example, there's one major problem with concluding that the solid line is the best fitting line! We've only considered two possible candidates. There are, in fact, infinitely many possible candidates for the best fitting line, so comparing candidates one at a time clearly won't work in practice. On the next page, we'll instead derive formulas for the slope and the intercept of the least squares regression line.
