10: Logistic Regression

Overview

Case Study: Campaign Marketing

Elections are critical events in the United States, and there is always a plethora of polls predicting whether a candidate will win. In the lead-up to a recent gubernatorial election, Serena, the state’s first female candidate for governor, wanted to specifically target households that might be on the fence about voting for her. Her campaign hired a marketing firm to study a random sample of 1,000 voters, recording each voter’s annual household income and whether or not they would vote for Serena. The firm was able to identify the relationship between household income and voting for Serena, and to target marketing efforts toward households that might be less certain about voting for her. Let’s take a closer look at how the marketing firm made these predictions.

As a first step, let’s take a look at the data from the polling.

Descriptive Statistics: Household Income and Voting
Variable          Vote Yes    N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Household Income         0  505   0  58.192   0.0874  1.964   51.019  57.084  58.389  59.438   64.081
                         1  495   0  62.247   0.0936  2.083   52.529  60.817  61.983  63.551   70.585

We can see that Vote Yes has two categories (0 and 1), and we can compare the mean household income of the two groups. The important point is that the categories are actually the RESPONSE variable in this example! Unlike in linear regression, where the response is continuous, here the response is categorical and the continuous variable, household income, is the PREDICTOR.
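As a quick arithmetic check on the summary table, the SE Mean column should equal StDev divided by the square root of N. The following Python sketch recomputes it from the reported values and also takes the difference between the two group means. The variable and function names are illustrative only and are not part of the Minitab output.

```python
import math

# Values copied from the descriptive statistics table (Vote Yes = 0 and 1).
groups = {
    0: {"n": 505, "mean": 58.192, "stdev": 1.964, "se_mean": 0.0874},
    1: {"n": 495, "mean": 62.247, "stdev": 2.083, "se_mean": 0.0936},
}


def standard_error(stdev, n):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return stdev / math.sqrt(n)


# Recompute the SE Mean column and the gap between the two group means.
recomputed_se = {vote: standard_error(g["stdev"], g["n"]) for vote, g in groups.items()}
mean_difference = groups[1]["mean"] - groups[0]["mean"]

for vote, g in groups.items():
    print(f"Vote Yes = {vote}: reported SE {g['se_mean']:.4f}, "
          f"recomputed {recomputed_se[vote]:.4f}")
print(f"Difference in mean household income (1 minus 0): {mean_difference:.3f}")
```

The recomputed standard errors agree with the table to the precision reported, and the difference in means is small relative to the overall spread of incomes in each group, which is why some households are hard to classify from income alone.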

But let’s say the polling firm made a mistake and used a simple linear regression to model the election prediction problem. If you recall, the regression equation (\(y=b_0+b_1X_1\)) yields a predicted value of y. So in our example, voting for Serena would be modeled as a linear function of household income.

A partial scatterplot of the data looks like this:

Notice how there are only two values for Y (0 and 1), representing whether or not a voter votes for Serena. This clearly illustrates the binary nature of the response.

We can enter our data into Minitab and produce a regression equation along with predicted values for Y. Let’s take a look at the output for the predicted values.

C2        C3                C4
Vote Yes  Household Income  Predicted Value
0         51.0189           -0.64025
0         51.0639           -0.63468
0         51.4582           -0.58592
0         51.7075           -0.55510
0         52.4308           -0.46564
1         52.5291           -0.45349
1         52.5586           -0.44984
1         53.0822           -0.38510
1         53.1185           -0.38061
1         53.1301           -0.37916
0         53.3159           -0.35619
0         53.6321           -0.31709

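But wait, the scatterplot showed us that the response has to be either zero or one! The predicted values in this snippet of output range from -0.64 to -0.32. A negative value cannot be read as a vote (0 or 1), or even as a probability of voting for Serena, so these predictions do not make sense.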
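To see where the predictions go wrong, the fitted line can be recovered from the listing itself, because every predicted value lies on one straight line in household income. The sketch below is a minimal Python illustration, not the Minitab procedure: it fits a line through the twelve (income, predicted value) pairs in the listing and reports the incomes at which that line crosses 0 and 1. The function name `fit_line` is an assumption made for this example, and the recovered coefficients are only as precise as the rounded values in the listing.

```python
# (household income, predicted value) pairs copied from the Minitab listing.
rows = [
    (51.0189, -0.64025), (51.0639, -0.63468), (51.4582, -0.58592),
    (51.7075, -0.55510), (52.4308, -0.46564), (52.5291, -0.45349),
    (52.5586, -0.44984), (53.0822, -0.38510), (53.1185, -0.38061),
    (53.1301, -0.37916), (53.3159, -0.35619), (53.6321, -0.31709),
]


def fit_line(points):
    """Least-squares line through (x, y) points; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


b0, b1 = fit_line(rows)

# Incomes at which the straight line predicts exactly 0 and exactly 1.
income_at_0 = (0 - b0) / b1
income_at_1 = (1 - b0) / b1

print(f"Recovered line: predicted = {b0:.4f} + {b1:.4f} * income")
print(f"Prediction is below 0 for incomes under {income_at_0:.2f}")
print(f"Prediction is above 1 for incomes over {income_at_1:.2f}")
```

Because the slope is positive, any income below the first crossing gives a negative prediction and any income above the second gives a prediction greater than 1, so a straight line cannot keep its predictions within the 0 to 1 range.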

We need another tool to be able to model a “yes or no” (binary) outcome. This tool is logistic regression.
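As a preview of how logistic regression handles a 0/1 response, the sketch below passes a linear predictor through the logistic function, which always returns a value strictly between 0 and 1 that can be read as a probability. The coefficients `b0` and `b1` are hypothetical values chosen only for illustration; they are not estimates from the polling data.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for illustration only (NOT fitted to the polling data).
b0 = -60.0
b1 = 1.0

# Incomes spanning roughly the range seen in the descriptive statistics.
incomes = [51.0, 55.0, 58.0, 60.0, 62.0, 65.0, 70.0]
probabilities = [logistic(b0 + b1 * x) for x in incomes]

for x, p in zip(incomes, probabilities):
    print(f"income {x:5.1f} -> predicted probability of voting yes: {p:.4f}")
```

With these made-up coefficients the predicted probability rises smoothly with income but never leaves the 0 to 1 range, unlike the straight-line predictions.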

Objectives

Upon completion of this lesson, you should be able to:

  • Identify the difference between the response variable types in OLS and binary logistic regression
  • Correctly interpret the coefficient of a logistic regression model
  • Correctly interpret the probability of a response variable in a logistic regression model
  • Identify the indicators of model fit in logistic regression