10: Logistic Regression

Overview

 Case-Study: Campaign Marketing

Elections are critical events in the United States, and there is always a plethora of polls to predict whether a candidate will win. In the lead-up to a recent gubernatorial election, Serena, the state's first female candidate for governor, wanted to specifically target households that might be on the fence about voting for her. Her campaign hired a marketing firm to study a random sample of 1,000 voters, recording each household's annual income and whether or not they would vote for Serena. The firm identified the relationship between household income and voting for Serena, and used it to target marketing efforts toward households that might be less certain about voting for her. Let's take a closer look at how the marketing firm made these predictions.

As a first step, let’s take a look at the data from the polling.

Descriptive Statistics: Household Income and Voting

Variable          Vote Yes    N  N*    Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
Household Income         0  505   0  58.192   0.0874  1.964   51.019  57.084  58.389  59.438   64.081
                         1  495   0  62.247   0.0936  2.083   52.529  60.817  61.983  63.551   70.585

We can see that there are two categories, Vote Yes (0,1), and we can compare the mean household income for the two categories. The important point is that the categories are actually the RESPONSE variable in this example! Unlike linear regression, our PREDICTOR variable is the continuous one.

But let's say the polling firm made a mistake and used a simple linear regression to model the election prediction problem. If you recall, the regression equation (\(y=b_0+b_1X_1\)) yields a predicted value of y. So in our example, voting for Serena would be modeled as a function of household income.

A partial scatterplot of the data looks like this:

Notice how there are only two values for Y (0,1), representing whether or not a voter votes for Serena. This clearly illustrates the binary nature of the response.

We can enter our data into Minitab and produce a regression equation along with predicted values for Y. Let’s take a look at the output for the predicted values.

C2        C3                C4
Vote Yes  Household Income  Predicted Value
0         51.0189           -0.64025
0         51.0639           -0.63468
0         51.4582           -0.58592
0         51.7075           -0.55510
0         52.4308           -0.46564
1         52.5291           -0.45349
1         52.5586           -0.44984
1         53.0822           -0.38510
1         53.1185           -0.38061
1         53.1301           -0.37916
0         53.3159           -0.35619
0         53.6321           -0.31709

But wait, the scatterplot showed us that the response had to be a value of zero or one! The predicted values in this snippet of output range from about -0.64 to -0.32. These really do not make any sense at all.

We need another tool to be able to model a "yes or no" (binary) outcome. This tool is logistic regression.

Objectives

Upon completion of this lesson, you should be able to:

  • Identify the difference between an OLS and a binary logistic regression response variable
  • Correctly interpret the coefficient of a logistic regression model
  • Correctly interpret the probability of a response variable in a logistic regression model
  • Identify the indicators of model fit in logistic regression

10.1 - Linear Model

A typical linear regression model is presented as \(y=b_0+b_1X_1+ e\). However, as we pointed out, this model will not work to predict voting outcomes of yes or no. As an alternative, we can think about the odds of the event happening, as opposed to predicting the "value" of the event (as we did with OLS regression).

When we have a binary response, y, the expected value of y is \(E(y) = \pi\), where \(\pi\) denotes \(P\left(y = 1\right)\) (as a reminder, P denotes probability).

Let's think through this a bit. In the above definition, we are saying that the expected value of y is the probability of the event occurring. For example, if there is a 50% probability of voting for Serena, then the expected value of y is 0.50. That may make a bit more sense.

So now we can begin to think about using a linear model to model predicted values of probabilities instead of values. But how do we go from our observed data to a probability?

We need to return to the basic concept of odds. As a matter of review, the odds of an event is the ratio of the event occurring to the event not occurring. In our example, this would be the count of "yes" votes for Serena to the count of "no" votes for Serena.

We also learned that the odds ratio is the ratio of two odds. Since \(\pi = P\left(y = 1\right)\), then \(1 – \pi = P\left(y = 0\right)\). The ratio \(\frac{\pi}{1− \pi} = \frac{P\left(y = 1\right)}{P\left(y = 0\right)}\) is known as the odds of the event y = 1 occurring. For example, if \(\pi = 0.8\) then the odds of y = 1 occurring are \(\frac{0.80}{0.20} = 4\), or 4 to 1.
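As a quick sanity check on this arithmetic, here is a minimal Python sketch (the 0.8 is just the example probability from the text):

```python
pi = 0.8              # P(y = 1), the probability of voting for Serena
odds = pi / (1 - pi)  # odds of the event y = 1 occurring
print(odds)           # 4.0, i.e. odds of 4 to 1
```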

Hopefully, the odds ratio sounds familiar to you, as it is the basic principle of what is going on in logistic regression. There are a few other modeling parts that have to occur in order for the mathematics to make sense. First, we have been working with linear models in this course. Because probabilities are bounded between 0 and 1, the relationship we are modeling is nonlinear, so the software actually uses a "logit link" function. You do not need to fully understand what this is, just that it is different from the OLS method of fitting the model.

Because we have to use the logit link function, we also need to express the odds as a log odds. This is actually where the name "logistic regression" comes from.

The resulting log model is:

\(\ln \left(\dfrac{\pi}{1-\pi}\right)=B_{0}+B_{1} x_{1}+\cdots+B_{k} x_{k}\)

But interpreting a log odds is really hard. Fortunately, the relationship between the log odds and a probability is fairly easy to translate. So we can algebraically manipulate the log odds model into a probability form. When we do this, we can now state the outcome (i.e. "fitted" value) in terms of a probability of an event happening!

\(\pi=\dfrac{\exp \left(B_{0}+B_{1} x_{1}+\cdots+B_{k} x_{k}\right)}{1+\exp \left(B_{0}+B_{1} x_{1}+\cdots+B_{k} x_{k}\right)}\)
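Going the other direction is just as easy. A small sketch of the algebra above, converting a log odds back into a probability:

```python
import math

def log_odds_to_probability(log_odds: float) -> float:
    """Invert the logit: pi = exp(L) / (1 + exp(L))."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(log_odds_to_probability(0.0))          # 0.5 -- log odds of 0 is a 50/50 event
print(log_odds_to_probability(math.log(4)))  # ~0.8 -- odds of 4 to 1
```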

Looking at a plot of the logit link function is helpful as it comes closer to the scatterplot we produced for the polling data.

[Plot of the logit link function: an S-shaped curve of the probability (0 to 1, vertical axis) against x (-10 to 10, horizontal axis)]

From this graph, we can see that instead of a straight line fitting the binary response variable (in our example, whether a voter votes for Serena), we have an S-shaped curve representing the logit link function. Now our model has a minimum value of 0 and a maximum value of 1, solving the problem of values beyond that range that we observed when we incorrectly applied a simple linear regression to the voter prediction model. Also notice that the limits of 0 and 1 are exactly the appropriate values for probabilities! Problem solved!
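A few evaluations of the logistic function make those bounds concrete; a minimal sketch:

```python
import math

def logistic(x: float) -> float:
    """The S-shaped curve plotted above: always strictly between 0 and 1."""
    return math.exp(x) / (1 + math.exp(x))

for x in (-10, -2, 0, 2, 10):
    print(x, round(logistic(x), 4))
# -10 0.0 | -2 0.1192 | 0 0.5 | 2 0.8808 | 10 1.0 (rounded)
```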


10.2 - Binary Logistic Regression

Let's take a closer look at the binary logistic regression model. Similar to the linear regression model, the equation looks the same as Y is some function of X:

\(Y = f(X)\)

However, as stated previously, the function is different because we employ the logit link function. Without going into too much detail about how the logit link function is calculated in this class, the output it produces is in the form of a "log odds".

Notice in the "logistic regression table" that the log odds is actually listed as the "coefficient". The nomenclature is similar to that of the simple linear regression coefficient for the slope. Moving further down the row of the table, we can see that, just like the slope, the log odds has a significance test, only using a "z" test as opposed to a "t" test due to the categorical response variable. Fortunately, we interpret the log odds with logic very similar to that of the slope, specifically:

Interpreting a Log Odds

(Predictor Coefficient)

  • If \(\beta > 0\) then the log odds of observing the event become higher if X is higher.
  • If \(\beta < 0\) then the log odds of observing the event become lower if X is higher.
  • If \(\beta = 0\) then X does not tell us anything about the log odds of observing the event.

When we run a logistic regression on Serena's polling data, the output indicates a log odds coefficient of 1.2183 for household income. We look at the "Z-Value" and see a large value (15.47), which leads us to reject the null hypothesis that household income does not tell us anything about the log odds of voting for Serena. Because the coefficient is greater than zero, we can also conclude that greater household income increases the log odds of voting for Serena.

Coefficients

Predictor          Coef     SE Coef  95% CI             Z-Value  P-Value  VIF
Constant           -73.39   4.74     (-82.68, -64.09)   -15.47   0.000
Household Income   1.2183   0.0787   (1.0640, 1.3726)   15.47    0.000    1

But what is the log odds? Simply, this is the result of using the logit link function. Because it is not easily interpretable, we tend to focus on the output related to the odds. This returns us to a basic categorical statistic: as a reminder, the odds of an event is the ratio of the event occurring to the event not occurring, and an odds ratio is the ratio of two odds. An odds ratio of 1 indicates there is no difference in the frequency of the event occurring versus not occurring. So with the odds ratio in the output, we are comparing our results to an odds ratio of 1. Typically, these odds ratios are accompanied by a confidence interval, where again we look for the value "1" in the interval to conclude no relationship.

The polling output tells us the odds of voting for Serena are multiplied by 3.38 (the odds ratio, \(e^{1.2183}\)) with every one-unit increase in household income (measured in 1,000s of dollars).
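The reported odds ratio is just the exponentiated coefficient. As a quick check, here is a minimal Python sketch using the rounded values from the output above:

```python
import math

coef = 1.2183                     # log odds coefficient for household income
ci_low, ci_high = 1.0640, 1.3726  # 95% CI endpoints for the coefficient

print(round(math.exp(coef), 2))   # 3.38 -- the odds ratio
print(round(math.exp(ci_low), 2), round(math.exp(ci_high), 2))
# (2.9, 3.95) -- the interval for the odds ratio excludes 1, agreeing
# with the z test on the coefficient
```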

Helpfully, the conclusion from the log odds hypothesis test and from the odds ratio confidence interval will always agree!

From our example, we can reject the null hypothesis in both cases and conclude that household income significantly predicts whether a voter will vote for Serena!

Overall Model Significance

While we will not go into too much detail, a measure of model fit is represented in the Minitab output as the deviance. Like the F test in ANOVA, the chi-square statistic tests the null hypothesis that all the coefficients associated with predictors (i.e., the slopes) equal zero versus the alternative that they are not all equal to zero. In this example, Chi-Square = 732.71 with a p-value of 0.000, indicating that there is sufficient evidence the coefficient for household income is different from zero.

Deviance Table

Source              DF  Seq Dev  Contribution  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression           1    732.7        52.86%    732.7   732.712      732.71    0.000
Household Income     1    732.7        52.86%    732.7   732.712      732.71    0.000
Error              998    653.5        47.14%    653.5     0.655
Total              999   1386.2       100.00%
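The reported p-value can be reproduced from the chi-square statistic and its degrees of freedom; a small sketch (assuming SciPy is available):

```python
from scipy.stats import chi2

# Deviance chi-square test from the table above: 1 degree of freedom
chi_sq, df = 732.71, 1
p_value = chi2.sf(chi_sq, df)  # survival function: P(X^2 >= 732.71)
print(p_value)                 # effectively 0, reported by Minitab as 0.000
```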

Probability Model

The final question we can answer responds to the original goal of predicting the likelihood that a household will vote for Serena. The easiest interpretation of the logistic regression fitted values is as the predicted probability for each value of X (recall that the logistic regression model can be algebraically manipulated to take the form of a probability!). In Minitab we can request that the probabilities for each value of X be stored in the data. The result would look something like:

C2        C3                C4
Vote Yes  Household Income  FITS
0         51.0189           0.00001
0         51.0639           0.00001
0         51.4582           0.00002
0         51.7075           0.00003
0         52.4308           0.00007
1         52.5291           0.00008
1         52.5586           0.00009
1         53.0822           0.00016
1         53.1185           0.00017
1         53.1301           0.00017
0         53.3159           0.00022

From this output we can now see the predicted probability that a household will vote for Serena. Lower values in the "FITS" column represent lower probabilities of voting for Serena. For example, a household with an income of 52.5291 (in thousands) has a probability of 0.00008 of voting for Serena. Serena's campaign can take advantage of the ability to predict this probability and target marketing and outreach to those households "on the fence" (for example, between 40 and 60 percent likely) to vote for her.
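These fitted values can be reproduced directly from the coefficients table; a minimal sketch (using the rounded coefficients, so results are approximate):

```python
import math

# Fitted model from the Coefficients table: log odds = -73.39 + 1.2183 * income
b0, b1 = -73.39, 1.2183

def predicted_probability(income: float) -> float:
    """Convert the fitted log odds for a given income into a probability."""
    log_odds = b0 + b1 * income
    return math.exp(log_odds) / (1 + math.exp(log_odds))

print(round(predicted_probability(52.5291), 5))  # ~0.00008, matching FITS
```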

The marketing firm might make a recommendation to Serena’s campaign to focus on households that are in the 40-60% range. These households might be those who could be “convinced” that voting for Serena would be not only history in the making, but the right decision for leading the state for the next four years.
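Under the fitted model, that 40-60% band can be translated back into household incomes by inverting the logit. A sketch, again assuming the rounded coefficients above (`income_at_probability` is a hypothetical helper, not part of the lesson):

```python
import math

b0, b1 = -73.39, 1.2183  # rounded coefficients from the output above

def income_at_probability(pi: float) -> float:
    """Solve ln(pi / (1 - pi)) = b0 + b1 * x for the income x."""
    return (math.log(pi / (1 - pi)) - b0) / b1

print(round(income_at_probability(0.4), 1))  # ~59.9 (thousands of dollars)
print(round(income_at_probability(0.6), 1))  # ~60.6
# So the "on the fence" households sit near $60,000 of annual income.
```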

Minitab®

Minitab: Binary Logistic Regression

To perform the binary logistic regression in Minitab use the following:

Stat > Regression > Binary Logistic and enter 'Vote Yes' for Response and 'Household Income' in Model.

Note: the window for Factors refers to any variable(s) that are categorical.
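For readers working outside Minitab, a comparable fit can be sketched in Python with statsmodels. Since the lesson's polling dataset is not distributed with the text, the snippet below simulates stand-in data from the fitted model (the data and all names here are illustrative assumptions, not the lesson's actual data):

```python
import numpy as np
import statsmodels.api as sm

# Simulate stand-in polling data: incomes (in thousands) and 0/1 votes
rng = np.random.default_rng(0)
income = rng.normal(60, 3, size=1000)
true_prob = 1 / (1 + np.exp(-(-73.39 + 1.2183 * income)))
vote = rng.binomial(1, true_prob)

# Binary logistic regression: response = vote, predictor = income
model = sm.Logit(vote, sm.add_constant(income)).fit()
print(model.summary())  # coefficient, z-value, and CI comparable to Minitab's
```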


10.3 - Lesson Summary

 Case-Study: Campaign Marketing

Serena and her campaign staff now understand why an OLS regression would not be appropriate for modeling whether a voter will vote for her. She and her campaign can collect data and use a binary logistic regression model to identify households that might be less inclined to vote for her and propel her to victory!

