8: Regression (General Linear Models Part I)

Overview Section

Case-Study: Land development

Bob works for a local government and is responsible for approving development proposals. As part of his work, developers come to him with proposals for developing land in his town. His work involves assessing the amount of “critical areas” such as wetlands, rivers, streams, and landslide hazard areas in a given area of land. Lately he has noticed that the dollar amount of a proposal appears to increase as the number of critical areas in the land increases. However, he needs to test his theory. Let’s see if we can help him.

The first step is for Bob to look at his data. He wants to use the number of critical areas to predict the dollar amount in the proposal. Both of these variables are quantitative. Below are the descriptive statistics for Bob’s data.

Descriptive Statistics: Critical Areas, Cost
Variable         N  N*    Mean  SE Mean   StDev  Minimum      Q1  Median      Q3  Maximum
Critical Areas  95   0  4.7990   0.0979  0.9539   2.1296  4.0708  4.8344  5.4800   6.7657
Cost            95   0   99.53     1.03    9.99    72.71   92.56  100.71  106.17   118.04
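As a quick check on how to read this output, the SE Mean column is the standard error of the mean, which is the standard deviation divided by the square root of N. The sketch below recomputes it from the StDev and N values printed in the table. The names `se_mean`, `stdev`, and `reported_se_mean` are ours, chosen only for illustration. Because the printed StDev values are already rounded, the recomputed value for Cost can differ from the table in the last digit.

```python
import math

# Values copied from the descriptive statistics table above.
n = 95
stdev = {"Critical Areas": 0.9539, "Cost": 9.99}
reported_se_mean = {"Critical Areas": 0.0979, "Cost": 1.03}


def se_mean(sd, sample_size):
    """Standard error of the mean: StDev / sqrt(N)."""
    return sd / math.sqrt(sample_size)


# Recompute SE Mean for each variable and show it next to the reported value.
computed = {name: se_mean(sd, n) for name, sd in stdev.items()}
for name, se in computed.items():
    print(f"{name}: computed {se:.4f}, reported {reported_se_mean[name]}")
```

The computed and reported values should agree up to rounding, which confirms that SE Mean summarizes how precisely each sample mean is estimated.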

Bob is looking to predict the value of one quantitative variable based on what he knows about the value of another quantitative variable. Because he wants to predict, he needs to consider a regression.

Before we get started thinking about regression, let’s take a step back. A regression is a simple linear model. If that sounds strange to you, you can think about a linear model as an equation where we consider Y to be some function of X. In other words, Y = f(X). Both the topic in this unit (regression) and that in the next unit (ANOVA) are linear models. With advances in both computing power and the complexity of designs, separating regression and ANOVA is really more a matter of semantics than substance. That said, this unit will focus on regression, with ANOVA coming later in the course. Let’s get back to Bob.

Bob did a great job understanding correlations and scatterplots, so he creates a scatterplot of the data.

He recognizes that he has two quantitative variables, dollar amount and number of critical areas, and that they have a strong positive linear relationship. However, he learned that a limitation of correlation is that the technique cannot lead to insights about causality between variables. Now he needs a new statistical technique.

Regression analysis provides the evidence that Bob is seeking: it describes how a variable of interest changes with one or more other variables. In Bob’s example, he is using the number of critical areas to predict the dollar amount.

Before we get started with regression, it is important to distinguish between the variable of interest and the variable(s) we will use to predict the variable of interest.

Response Variable
Denoted Y, this is also called the variable of interest or dependent variable. In Bob’s example, this is the dollar amount.
Predictor Variable
Denoted X, this is also called the explanatory variable or independent variable. In Bob’s example, this is the number of critical areas.

When there is only one predictor variable, we refer to the regression model as a simple linear regression model.

In statistics, we can describe how variables are related using a mathematical function, which we described earlier as a linear model. With a single predictor, that function is the simple linear regression model.
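In the usual textbook notation (the symbols below are a conventional choice, not ones introduced in this overview), the simple linear regression model can be written as:

```latex
Y = \beta_0 + \beta_1 X + \varepsilon
```

Here \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon is the random error, the part of Y that the line does not explain. In Bob’s example, Y is the dollar amount and X is the number of critical areas.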

Objectives

Upon completion of this lesson, you should be able to:

  • Identify the slope, intercept, and coefficient of determination
  • Calculate predicted and residual (error) values
  • Test the significance of the slope, including statement of the null hypothesis for the slope
  • State the assumptions for regression
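
The quantities named in these objectives can be sketched in a few lines of Python. This is a minimal illustration using the standard least-squares formulas, not the software output used in this course. The five data points are hypothetical values made up for the example (they are not Bob’s data), and the function name `fit_simple_linear_regression` is our own.

```python
import math

# Hypothetical illustrative values (NOT Bob's actual data).
critical_areas = [2.0, 3.0, 4.0, 5.0, 6.0]    # predictor, X
cost = [75.0, 88.0, 97.0, 111.0, 119.0]       # response, Y


def fit_simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Predicted values and residuals (observed minus predicted).
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]

    # Coefficient of determination: share of variation in y explained by x.
    sse = sum(e ** 2 for e in residuals)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - sse / sst

    # t statistic for the null hypothesis that the slope equals 0.
    # It is compared against a t distribution with n - 2 degrees of freedom.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / se_slope

    return {
        "slope": slope,
        "intercept": intercept,
        "predicted": predicted,
        "residuals": residuals,
        "r_squared": r_squared,
        "t_stat": t_stat,
    }


fit = fit_simple_linear_regression(critical_areas, cost)
print(f"slope = {fit['slope']:.2f}, intercept = {fit['intercept']:.2f}")
print(f"R-squared = {fit['r_squared']:.4f}, t = {fit['t_stat']:.2f}")
```

The sketch stops at the t statistic. Turning it into a p-value requires the t distribution, which statistical software supplies.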