To conclude this lesson, we'll digress slightly to consider the lack of fit test for linearity, the "L" in the "LINE" conditions. We consider it here because, like the ANOVA F-test discussed earlier, it is an F-test based on decomposing sums of squares.
However, before we "derive" the lack of fit F-test, it is important to note that the test requires repeat observations, called "replicates," for at least one of the values of the predictor x. That is, if each x value in the data set is unique, then the lack of fit test can't be conducted on the data set. Even when we do have replicates, we typically need quite a few for the test to have any power. As such, this test generally applies only to data sets with plenty of replicates.
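The replicate requirement is easy to check before attempting the test. The following is a minimal sketch, not part of the original lesson: the function names `replicate_counts` and `can_run_lack_of_fit` and the sample x values are hypothetical, chosen only for illustration.

```python
from collections import Counter


def replicate_counts(x_values):
    """Return {x: count} for every x value that occurs more than once."""
    counts = Counter(x_values)
    return {x: n for x, n in counts.items() if n > 1}


def can_run_lack_of_fit(x_values):
    """True when at least one x value has replicates."""
    return len(replicate_counts(x_values)) > 0


# Hypothetical predictor values, for illustration only.
x_with_reps = [10, 10, 20, 30, 30, 30]
x_all_unique = [10, 20, 30, 40]

print(replicate_counts(x_with_reps))
print(can_run_lack_of_fit(x_with_reps))
print(can_run_lack_of_fit(x_all_unique))
```

A check like this only confirms that the test can be carried out. It says nothing about whether there are enough replicates for the test to have reasonable power.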
As is often the case before we learn a new hypothesis test, we have to get some new notation under our belt. In doing so, we'll look at some (contrived) data that purports to describe the relationship between the size of the minimum deposit required when opening a new checking account at a bank (x) and the number of new accounts at the bank (y) (the New Accounts data). Suppose the trend in the data looks curved, but we fit a line through the data nonetheless.
The data set contains six distinct x values (75, 100, 125, 150, 175, and 200), and the standard notation used for the lack of fit F-test applies to each of them. Let's take the case where x = 75 dollars:
- \(y_{11}\) denotes the first measurement (28) made at the first x-value (x = 75) in the data set
- \(y_{12}\) denotes the second measurement (42) made at the first x-value (x = 75) in the data set
- \(\bar{y}_{1}\) denotes the average (35) of all of the y values at the first x-value (x = 75)
- \(\hat{y}_{11}\) denotes the predicted response (87.5) for the first measurement made at the first x-value (x = 75)
- \(\hat{y}_{12}\) denotes the predicted response (87.5) for the second measurement made at the first x-value (x = 75)
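As a quick check on the numbers above, the group average at x = 75 follows directly from the two measurements:

\[
\bar{y}_{1} = \frac{y_{11} + y_{12}}{2} = \frac{28 + 42}{2} = 35
\]

The two predicted responses are equal because the fitted line depends only on x, so both measurements taken at x = 75 receive the same predicted value:

\[
\hat{y}_{11} = \hat{y}_{12} = 87.5
\]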
The same notation carries over to the other x values (100, 125, and so on). In general:
- \(y_{ij}\) denotes the \(j^{th}\) measurement made at the \(i^{th}\) x-value in the data set
- \(\bar{y}_{i}\) denotes the average of all of the y values at the \(i^{th}\) x-value
- \(\hat{y}_{ij}\) denotes the predicted response for the \(j^{th}\) measurement made at the \(i^{th}\) x-value
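The general notation can be made concrete with a short sketch that groups the responses by x value and computes each quantity. This is an illustration under stated assumptions rather than the lesson's own procedure: the helper names `fit_line` and `lack_of_fit_notation` are hypothetical, the data below are made up (they are not the New Accounts data), and the line is fit by ordinary least squares.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def lack_of_fit_notation(x, y):
    """Return one record per distinct x value holding y_ij, ybar_i and yhat_ij."""
    intercept, slope = fit_line(x, y)

    # Collect the measurements y_ij made at each distinct x value.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    table = []
    for i, xi in enumerate(sorted(groups), start=1):
        y_i = groups[xi]
        table.append({
            "i": i,                             # index of the distinct x value
            "x": xi,
            "y_ij": y_i,                        # measurements j = 1, 2, ... at this x
            "ybar_i": sum(y_i) / len(y_i),      # average response at this x
            "yhat_ij": intercept + slope * xi,  # same prediction for every j at this x
        })
    return table


# Hypothetical data with two replicates at each x value, for illustration only.
x = [1, 1, 2, 2, 3, 3]
y = [1, 3, 7, 9, 4, 6]

for row in lack_of_fit_notation(x, y):
    print(row)
```

In this toy data set the group averages rise and then fall while the fitted line rises steadily, so the averages \(\bar{y}_{i}\) sit well away from the predictions \(\hat{y}_{ij}\). That kind of gap is what a curved trend fitted with a straight line produces.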