7.4 - The Model

What do \(a\) and \(b\) estimate?

So far, we've formulated the idea, as well as the theory, behind least squares estimation. But now we have a little problem. When we derived formulas for the least squares estimates of the intercept \(a\) and the slope \(b\), we never addressed which parameters \(a\) and \(b\) actually estimate. It is a crucial topic that deserves our attention. Let's investigate the answer by considering the (linear) relationship between high school grade point averages (GPAs) and scores on a college entrance exam, such as the ACT exam. Well, let's actually center the high school GPAs, so that if \(x\) denotes the high school GPA, then \(x-\bar{x}\) is the centered high school GPA. Here's what a plot of \(x-\bar{x}\), the centered high school GPA, against \(y\), the college entrance test score, might look like:

[Figure: plot of \(y\), the college entrance test score, against \(x-\bar{x}\), the centered high school GPA]

Okay, so that plot deserves some explanation. In short, we are assuming two things.

First, among the entire population of college students, there is some unknown linear relationship between \(\mu_Y\) (or alternatively \(E(Y)\)), the average college entrance test score, and \(x-\bar{x}\), the centered high school GPA. That is:

\(\mu_Y=E(Y)=\alpha+\beta(x-\bar{x})\)

Second, individual students deviate from the mean college entrance test score of the population of students having the same centered high school GPA by some unknown amount \(\epsilon_i\). That is, if \(Y_i\) denotes the college entrance test score for student \(i\), then:

\(Y_i=\alpha+\beta(x_i-\bar{x})+\epsilon_i\)

Unfortunately, we don't have the luxury of collecting data on all of the college students in the population. So, we can never know the population intercept \(\alpha\) or the population slope \(\beta\). The best we can do is estimate \(\alpha\) and \(\beta\) by taking a random sample from the population of college students. Suppose we randomly select fifteen students from the population, in which three students have a centered high school GPA of −2, three students have a centered high school GPA of −1, and so on. We can use those fifteen data points to determine the best fitting (least squares) line:

\(\hat{y}_i=a+b(x_i-\bar{x})\)

Now, our least squares line isn't going to be perfect, but it should do a pretty good job of estimating the true unknown population line:

[Figure: college entrance test scores plotted against centered high school GPA, with the unknown population line \(\mu_Y=E(Y)=\alpha+\beta(x-\bar{x})\) and the estimated least squares line \(\hat{y}=a+b(x-\bar{x})\)]

 

That's it in a nutshell. The intercept \(a\) and the slope \(b\) of the least squares regression line estimate, respectively, the intercept \(\alpha\) and the slope \(\beta\) of the unknown population line. The only assumption we make in doing so is that the relationship between the predictor \(x\) and the response \(y\) is linear.
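To make that concrete, here is a minimal Python sketch, using made-up (centered GPA, test score) pairs rather than real data, that computes the least squares estimates for the centered-predictor model:

import numpy as np
# Hypothetical sample: centered high school GPAs and college entrance test scores
# (made-up numbers for illustration only)
x = np.array([-2, -2, -2, -1, -1, -1, 0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=float)
y = np.array([7, 9, 8, 10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20], dtype=float)
xc = x - x.mean()                                   # centered predictor
a = y.mean()                                        # intercept estimate (a = y-bar)
b = np.sum(xc * (y - y.mean())) / np.sum(xc**2)     # slope estimate
print(a, b)                                         # a estimates alpha, b estimates beta

The estimates \(a\) and \(b\) would change from sample to sample; it is the unknown \(\alpha\) and \(\beta\) that stay fixed.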

 

Now, if we want to derive confidence intervals for \(\alpha\) and \(\beta\), as we are going to want to do on the next page, we are going to have to make a few more assumptions. That's where the simple linear regression model comes to the rescue.

The Simple Linear Regression Model

So that we can have properly drawn normal curves, let's borrow (steal?) an example from the textbook called Applied Linear Regression Models (4th edition, by Kutner, Nachtsheim, and Neter). Consider the relationship between \(x\), the number of bids contracting companies prepare, and \(y\), the number of hours it takes to prepare the bids:

[Figure: hours (\(Y\)) versus number of bids prepared (\(X\)), with the population line \(E(Y)=9.5+2.1X\) and normal curves drawn at \(X=25\) and \(X=45\); at \(X=45\), \(E(Y_i)=104\), \(Y_i=108\), and \(\epsilon_i=+4\)]

A couple of things to note about this graph. Note again that the mean number of hours, \(E(Y)\), is assumed to be linearly related to \(X\), the number of bids prepared. That's the first assumption. The textbook authors even go so far as to specify the values of the typically unknown \(\alpha\) and \(\beta\): in this case, \(\alpha\) is 9.5 and \(\beta\) is 2.1.

Note that if \(X=45\) bids are prepared, then the expected number of hours it takes to prepare the bids is:

\(\mu_Y=E(Y)=9.5+2.1(45)=104\)

In one case, it took a contracting company 108 hours to prepare 45 bids. In that case, the error \(\epsilon_i\) is 4. That is:

\(Y_i=108=E(Y)+\epsilon_i=104+4\)
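Here is a quick numerical check of that decomposition, using the population values \(\alpha=9.5\) and \(\beta=2.1\) assumed in the textbook example:

alpha, beta = 9.5, 2.1                # population values assumed in the textbook example
x_i, y_i = 45, 108                    # 45 bids prepared, 108 hours observed
mean_response = alpha + beta * x_i    # E(Y_i) = 9.5 + 2.1(45) = 104.0
epsilon_i = y_i - mean_response       # error term = 108 - 104 = 4.0
print(mean_response, epsilon_i)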

The normal curves drawn for each value of \(X\) are meant to suggest that the error terms \(\epsilon_i\), and therefore the responses \(Y_i\), are normally distributed. That's a second assumption.

Did you also notice that the two normal curves in the plot are drawn to have the same shape? That suggests that each population (as defined by \(X\)) has a common variance. That's a third assumption. That is, the errors, \(\epsilon_i\), and therefore the responses \(Y_i\), have equal variances for all \(x\) values.

There's one more assumption, one that is difficult to depict on a graph: it concerns the independence of the error terms. Let's summarize!

In short, the simple linear regression model states that the following four conditions must hold:

  • The mean of the responses, \(E(Y_i)\), is a Linear function of the \(x_i\).
  • The errors, \(\epsilon_i\), and hence the responses \(Y_i\), are Independent.
  • The errors, \(\epsilon_i\), and hence the responses \(Y_i\), are Normally distributed.
  • The errors, \(\epsilon_i\), and hence the responses \(Y_i\), have Equal variances (\(\sigma^2\)) for all \(x\) values.

Did you happen to notice that one word in each of the four conditions is capitalized, and that those capital letters spell L-I-N-E? Do you get it? We are investigating least squares regression lines, and the model effectively spells the word "line"! You might find this mnemonic an easy way to remember the four conditions.
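If it helps to see the four conditions in action, here is a short Python simulation sketch that generates responses under the model from the bids example, \(E(Y)=9.5+2.1X\), with an arbitrarily chosen common standard deviation:

import numpy as np
rng = np.random.default_rng(1)
alpha, beta, sigma = 9.5, 2.1, 5.0              # sigma is an arbitrary choice for illustration
x = np.repeat([25.0, 35.0, 45.0], 50)           # several sub-populations, one per x value
mean_y = alpha + beta * x                       # L: the mean response is Linear in x
errors = rng.normal(0.0, sigma, size=x.size)    # I, N, E: Independent, Normal, Equal-variance errors
y = mean_y + errors                             # responses generated under the model
for value in (25.0, 35.0, 45.0):                # each sub-population has its own mean, common spread
    group = y[x == value]
    print(value, round(group.mean(), 1), round(group.std(ddof=1), 1))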

Maximum Likelihood Estimators of \(\alpha\) and \(\beta\)

We know that \(a\) and \(b\):

\(\displaystyle{a=\bar{y} \text{ and } b=\dfrac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum\limits_{i=1}^n (x_i-\bar{x})^2}}\)

are ("least squares") estimators of \(\alpha\) and \(\beta\) that minimize the sum of the squared prediction errors. It turns out though that \(a\) and \(b\) are also maximum likelihood estimators of \(\alpha\) and \(\beta\) providing the four conditions of the simple linear regression model hold true.

Theorem

If the four conditions of the simple linear regression model hold true, then:

\(\displaystyle{a=\bar{y}\text{ and }b=\dfrac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum\limits_{i=1}^n (x_i-\bar{x})^2}}\)

are maximum likelihood estimators of \(\alpha\) and \(\beta\).

Answer

The simple linear regression model, in short, states that the errors \(\epsilon_i\) are independent and normally distributed with mean 0 and variance \(\sigma^2\). That is:

\(\epsilon_i \sim N(0,\sigma^2)\)

The linearity condition:

\(Y_i=\alpha+\beta(x_i-\bar{x})+\epsilon_i\)

therefore implies that:

\(Y_i \sim N(\alpha+\beta(x_i-\bar{x}),\sigma^2)\)

Therefore, the likelihood function is:

\(\displaystyle{L_{Y_i}(\alpha,\beta,\sigma^2)=\prod\limits_{i=1}^n \dfrac{1}{\sqrt{2\pi}\sigma} \text{exp}\left[-\dfrac{(Y_i-\alpha-\beta(x_i-\bar{x}))^2}{2\sigma^2}\right]}\)

which can be rewritten as:

\(\displaystyle{L=(2\pi)^{-n/2}(\sigma^2)^{-n/2}\text{exp}\left[-\dfrac{1}{2\sigma^2} \sum\limits_{i=1}^n (Y_i-\alpha-\beta(x_i-\bar{x}))^2\right]}\)

Taking the log of both sides, we get:

\(\displaystyle{\text{log}L=-\dfrac{n}{2}\text{log}(2\pi)-\dfrac{n}{2}\text{log}(\sigma^2)-\dfrac{1}{2\sigma^2} \sum\limits_{i=1}^n (Y_i-\alpha-\beta(x_i-\bar{x}))^2} \)

Now, that negative sign in front of that summation on the right hand side:

\(\color{black}\text{log}L=-\dfrac{n}{2} \text{log} (2\pi)-\dfrac{n}{2}\text{log}\left(\sigma^{2}\right)\color{blue}\boxed{\color{black}-}\color{black}\dfrac{1}{2\sigma^{2}} \color{blue}\boxed{\color{black}\sum\limits_{i=1}^{n}\left(Y_{i}-\alpha-\beta\left(x_{i}-\bar{x}\right)\right)^{2}}\)

tells us that the only way we can maximize \(\text{log}L(\alpha,\beta,\sigma^2)\) with respect to \(\alpha\) and \(\beta\) is if we minimize:

\(\displaystyle{\sum\limits_{i=1}^n (Y_i-\alpha-\beta(x_i-\bar{x}))^2}\)

with respect to \(\alpha\) and \(\beta\). Hey, but that's just the least squares criterion! Therefore, the ML estimators of \(\alpha\) and \(\beta\) must be the same as the least squares estimators \(a\) and \(b\). That is:

\(\displaystyle{a=\bar{y}\text{ and }b=\dfrac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum\limits_{i=1}^n (x_i-\bar{x})^2}}\)

are maximum likelihood estimators of \(\alpha\) and \(\beta\) under the assumption that the error terms are independent and normally distributed with mean 0 and variance \(\sigma^2\). As was to be proved!
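If you prefer a numerical check of that equivalence, the following Python sketch (simulated data with arbitrary true parameter values) maximizes the log likelihood directly and compares the result with the closed-form least squares estimates:

import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=40)                       # simulated predictor values
xc = x - x.mean()
y = 3.0 + 1.5 * xc + rng.normal(0, 2.0, size=40)      # simulated responses (arbitrary alpha, beta, sigma)
a = y.mean()                                          # closed-form least squares estimates
b = np.sum(xc * (y - y.mean())) / np.sum(xc**2)
def negloglik(theta):                                 # negative log likelihood in (alpha, beta, log sigma^2)
    alpha, beta, log_s2 = theta
    s2 = np.exp(log_s2)
    resid = y - alpha - beta * xc
    return 0.5 * len(y) * np.log(2 * np.pi * s2) + np.sum(resid**2) / (2 * s2)
fit = minimize(negloglik, x0=[0.0, 0.0, 0.0])
print(a, b)                                           # least squares estimates
print(fit.x[0], fit.x[1])                             # ML estimates -- essentially identical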

What about the (Unknown) Variance \(\sigma^2\)?

In short, the variance \(\sigma^2\) quantifies how much the responses (\(y\)) vary around the (unknown) mean regression line \(E(Y)\). Now, why should we care about the magnitude of the variance \(\sigma^2\)? The following example might help to illuminate the answer to that question.

Example 7-2


We know that there is a perfect relationship between degrees Celsius (C) and degrees Fahrenheit (F), namely:

\(F=\dfrac{9}{5}C+32\)

Suppose we are unfortunate, however, and therefore don't know the relationship. We might attempt to learn about the relationship by collecting some temperature data and calculating a least squares regression line. When all is said and done, which brand of thermometers do you think would yield more precise future predictions of the temperature in Fahrenheit? The one whose data are plotted on the left? Or the one whose data are plotted on the right?

[Regression plots of Fahrenheit versus Celsius. Left: fahrenheit = 34.1233 + 1.61538 celsius, S = 4.76923, R-Sq = 96.1%, R-Sq(adj) = 95.5%. Right: fahrenheit = 17.0709 + 2.30583 celsius, S = 29.7918, R-Sq = 70.6%, R-Sq(adj) = 66.4%.]

Answer

As you can see, for the plot on the left, the Fahrenheit temperatures do not vary or "bounce" much around the estimated regression line. For the plot on the right, on the other hand, the Fahrenheit temperatures do vary or "bounce" quite a bit around the estimated regression line. It seems reasonable to conclude then that the brand of thermometers on the left will yield more precise future predictions of the temperature in Fahrenheit.
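You can mimic the two brands with a short Python simulation, assuming the true conversion \(F=\frac{9}{5}C+32\) and error standard deviations chosen arbitrarily to be small for one brand and large for the other:

import numpy as np
rng = np.random.default_rng(0)
celsius = np.linspace(0, 40, 30)
true_f = 9 / 5 * celsius + 32                          # exact Celsius-to-Fahrenheit relationship
f_precise = true_f + rng.normal(0, 3, celsius.size)    # brand with small measurement error
f_noisy = true_f + rng.normal(0, 25, celsius.size)     # brand with large measurement error
def fit(x, y):
    # least squares fit on the original scale; returns intercept, slope, and S
    xc = x - x.mean()
    slope = np.sum(xc * (y - y.mean())) / np.sum(xc**2)
    intercept = y.mean() - slope * x.mean()
    resid = y - intercept - slope * x
    s = np.sqrt(np.sum(resid**2) / (len(y) - 2))       # residual standard error, akin to Minitab's S
    return intercept, slope, s
print(fit(celsius, f_precise))                         # small S: tight scatter, more precise predictions
print(fit(celsius, f_noisy))                           # large S: much noisier predictions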

Now, the variance \(\sigma^2\) is, of course, an unknown population parameter. The only way we can attempt to quantify the variance is to estimate it. In the case in which we had one population, say the (normal) population of IQ scores:

[Figure: normal probability density curve of IQ scores]

we would estimate the population variance \(\sigma^2\) using the sample variance:

\(s^2=\dfrac{\sum\limits_{i=1}^n (Y_i-\bar{Y})^2}{n-1}\)

We have learned that \(s^2\) is an unbiased estimator of \(\sigma^2\), the variance of the one population. But what if we no longer have just one population, but instead have many populations? In our bids and hours example, there is a population for every value of \(x\):

[Figure: hours versus number of bids prepared, with the population line \(E(Y)=9.5+2.1X\) and normal curves drawn at \(X=25\) and \(X=45\)]

In this case, we have to estimate \(\sigma^2\), the (common) variance of the many populations. There are two possibilities: one is a biased estimator, and one is an unbiased estimator.

Theorem

The maximum likelihood estimator of \(\sigma^2\) is:

\(\hat{\sigma}^2=\dfrac{\sum\limits_{i=1}^n (Y_i-\hat{Y}_i)^2}{n}\)

It is a biased estimator of \(\sigma^2\), the common variance of the many populations.

Answer

We have previously shown that the log of the likelihood function is:

\(\text{log}L=-\dfrac{n}{2}\text{log}(2\pi)-\dfrac{n}{2}\text{log}(\sigma^2)-\dfrac{1}{2\sigma^2} \sum\limits_{i=1}^n (Y_i-\alpha-\beta(x_i-\bar{x}))^2 \)

To maximize the log likelihood, we have to take the partial derivative of the log likelihood with respect to \(\sigma^2\). Doing so, we get:

\(\dfrac{\partial \text{log}L}{\partial \sigma^2}=-\dfrac{n}{2\sigma^2}-\dfrac{1}{2}\sum (Y_i-\alpha-\beta(x_i-\bar{x}))^2 \cdot \left(- \dfrac{1}{(\sigma^2)^2}\right)\)

Setting the derivative equal to 0, and multiplying through by \(2\sigma^4\):

\(\dfrac{\partial \text{log}L}{\partial \sigma^{2}}=\left[-\dfrac{n}{2 \sigma^{2}}-\dfrac{1}{2} \sum\left(Y_{i}-\alpha-\beta\left(x_{i}-\bar{x}\right)\right)^{2} \cdot\left(-\dfrac{1}{\left(\sigma^{2}\right)^{2}}\right) \overset{\text{set}}{=} 0\right] \cdot 2\left(\sigma^{2}\right)^{2}\)

we get:

\(-n\sigma^2+\sum (Y_i-\alpha-\beta(x_i-\bar{x}))^2 =0\)

And, solving for and putting a hat on \(\sigma^2\), as well as replacing \(\alpha\) and \(\beta\) with their ML estimators, we get:

\(\hat{\sigma}^2=\dfrac{\sum (Y_i-a-b(x_i-\bar{x}))^2 }{n}=\dfrac{\sum(Y_i-\hat{Y}_i)^2}{n}\)

As was to be proved!

Mean Square Error

The mean square error, on the other hand:

\(MSE=\dfrac{\sum\limits_{i=1}^n(Y_i-\hat{Y}_i)^2}{n-2}\)

is an unbiased estimator of \(\sigma^2\), the common variance of the many populations.
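Here is a small Python sketch, on simulated data with a known \(\sigma^2\), that computes both estimators from the same fitted line; they differ only in the denominator:

import numpy as np
rng = np.random.default_rng(3)
n = 25
x = rng.uniform(0, 10, size=n)
y = 4.0 + 2.0 * x + rng.normal(0, 3.0, size=n)   # simulated data with true sigma^2 = 9
xc = x - x.mean()
b = np.sum(xc * (y - y.mean())) / np.sum(xc**2)  # least squares slope
a = y.mean() - b * x.mean()                      # least squares intercept (uncentered form)
sse = np.sum((y - (a + b * x))**2)               # sum of squared residuals
sigma2_mle = sse / n                             # biased (maximum likelihood) estimator of sigma^2
mse = sse / (n - 2)                              # unbiased estimator (mean square error)
print(sigma2_mle, mse)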

We'll need to use these estimators of \(\sigma^2\) when we derive confidence intervals for \(\alpha\) and \(\beta\) on the next page.

