10.2 - Binary Logistic Regression

Let's take a closer look at the binary logistic regression model. As in the linear regression model, the equation expresses Y as some function of X:

\(Y = f(X)\)

However, as stated previously, the function is different because we employ the logit link function. Without going into too much detail in this class about how the logit link function is calculated, its output takes the form of a “log odds”.

Notice in the “logistic regression table” that the log odds is listed as the “coefficient”. The nomenclature matches that of the slope coefficient in simple linear regression. Moving further along the row of the table, we can see that, just like the slope, the log odds comes with a significance test, only using a “z” test as opposed to a “t” test due to the categorical response variable. Fortunately, we interpret the log odds with logic very similar to that used for the slope:

Interpreting a Log Odds

(Predictor Coefficient)

  • If \(\beta > 0\) then the log odds of observing the event become higher if X is higher.
  • If \(\beta < 0\) then the log odds of observing the event become lower if X is higher.
  • If \(\beta = 0\) then X does not tell us anything about the log odds of observing the event.
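To make the three cases concrete, here is a minimal Python sketch of the standard logit link and its inverse. The function names, the intercept of 0, and the slopes of 0.5, -0.5, and 0 are illustrative assumptions, not values from the polling data.

```python
import math

def logit(p):
    """Log odds of a probability p, where 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(log_odds):
    """Convert a log odds back to a probability."""
    return 1 / (1 + math.exp(-log_odds))

def log_odds_at(x, beta0, beta1):
    """Linear predictor (log odds) of a one-predictor logistic model."""
    return beta0 + beta1 * x

# Hypothetical slopes chosen only to show the three cases in the list above.
for beta1 in (0.5, -0.5, 0.0):
    low = log_odds_at(1, 0.0, beta1)   # log odds at X = 1
    high = log_odds_at(2, 0.0, beta1)  # log odds at X = 2
    print(f"beta = {beta1:+.1f}: log odds {low:+.2f} -> {high:+.2f}, "
          f"probability {inv_logit(low):.3f} -> {inv_logit(high):.3f}")
```

With a positive slope the log odds (and therefore the probability) rises as X rises, with a negative slope it falls, and with a slope of zero it does not change.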

When we run a logistic regression on Serena's polling data, the output indicates a log odds of 1.22. We look at the “Z-Value” and see a large value (15.47), which leads us to reject the null hypothesis that household income does not tell us anything about the log odds of voting for Serena. Because the coefficient is greater than zero, we can also conclude that greater household income increases the log odds of voting for Serena.

Coefficients
Predictor Coef SE Coef 95% CI Z-Value P-Value VIF
Constant -73.39 4.74 (-82.68, -64.09) -15.47 0.000  
Household Income 1.2183 0.0787 (1.0640, 1.3726) 15.47 0.000 1
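The Z-Value and the 95% CI in the table can be approximately reproduced from the coefficient and its standard error. The sketch below assumes the usual Wald calculations (z = coefficient / standard error, and coefficient ± 1.96 standard errors); whether Minitab uses exactly these formulas internally is an assumption, and small differences from the table come from the rounded inputs.

```python
coef = 1.2183   # Household Income coefficient (log odds) from the table
se = 0.0787     # its standard error from the table

# Wald z statistic: how many standard errors the coefficient is from zero.
z = coef / se

# Approximate 95% confidence interval on the log odds scale.
z_crit = 1.96   # approximate 97.5th percentile of the standard normal
ci_low = coef - z_crit * se
ci_high = coef + z_crit * se

print(f"z = {z:.2f}")
print(f"95% CI = ({ci_low:.4f}, {ci_high:.4f})")
```

Because the interval on the log odds scale does not contain 0, it agrees with the z test in rejecting the null hypothesis.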

But what is the log odds? Simply put, it is the result of using the logit link function. Because the log odds is not easily interpretable, we tend to focus on the output related to the odds. The odds returns us to a basic categorical statistic. As a reminder, the odds of an event is the ratio of the probability that the event occurs to the probability that it does not, and an odds ratio compares two such odds. An odds ratio of 1 indicates there is no difference in the odds of the event between the two conditions being compared. So with the odds ratio in the output, we are comparing our results to an odds ratio of 1. Typically, these odds ratios are accompanied by a confidence interval, and we again look for the value “1” in the interval: if it is included, we conclude there is no relationship.

The polling output tells us the odds of voting for Serena increase by a factor of 3.38 (the odds ratio, \(e^{1.2183} \approx 3.38\)) with every one-unit increase in household income (measured in 1,000s).
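The odds ratio is the exponentiated log odds, so it can be recovered from the coefficients table. This is a minimal sketch using the rounded table values; the variable names are illustrative.

```python
import math

coef = 1.2183                    # log odds for Household Income from the table
ci_log_odds = (1.0640, 1.3726)   # 95% CI for the log odds from the table

# Exponentiating moves from the log odds scale to the odds ratio scale.
odds_ratio = math.exp(coef)
ci_odds_ratio = tuple(math.exp(bound) for bound in ci_log_odds)

print(f"odds ratio = {odds_ratio:.2f}")
print(f"95% CI = ({ci_odds_ratio[0]:.2f}, {ci_odds_ratio[1]:.2f})")

# The effect is multiplicative: two units of income multiply the odds
# by the odds ratio squared, not by twice the odds ratio.
two_unit_factor = odds_ratio ** 2
```

The interval for the odds ratio lies entirely above 1, which matches the log odds interval lying entirely above 0.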

Helpfully, the log odds hypothesis test and the odds ratio confidence interval will always lead to the same conclusion!

In our example, we can reject the null hypothesis in both cases and conclude that household income significantly predicts a voter voting for Serena!

Overall Model Significance

While we will not go into too much detail, a measure of model fit is represented in the Minitab output as the deviance. Like the F test in ANOVA, the chi-square statistic tests the null hypothesis that all the coefficients associated with predictors (i.e., the slopes) equal zero versus the alternative that these coefficients are not all equal to zero. In this example, Chi-Square = 732.71 with a p-value of 0.000, indicating that there is sufficient evidence the coefficient for household income is different from zero.

Deviance Table
Source DF Seq Dev Contribution Adj Dev Adj Mean Chi-Square P-Value
Regression 1 732.7 52.86% 732.7 732.712 732.71 0.000
Household Income 1 732.7 52.86% 732.7 732.712 732.71 0.000
Error 998 653.5 47.14% 653.5 0.655    
Total 999 1386.2 100.00%        
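The entries in the deviance table can be checked with a few lines of arithmetic. The sketch below uses only the values printed in the table and relies on the fact that, for 1 degree of freedom, the chi-square upper-tail probability equals erfc(sqrt(x / 2)). The variable names are illustrative.

```python
import math

regression_dev = 732.7   # deviance explained by the model (Regression row)
error_dev = 653.5        # deviance left unexplained (Error row)
total_dev = 1386.2       # total deviance (Total row)
chi_square = 732.71      # chi-square statistic with 1 degree of freedom

# Share of the total deviance accounted for by the regression.
contribution = regression_dev / total_dev

# Upper-tail probability of a chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"contribution = {contribution:.2%}")
print(f"p-value = {p_value:.3g}")
```

The p-value is far smaller than 0.0005, which is why the table displays it as 0.000.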

Probability Model

The final question we can answer is the original one about predicting the likelihood that Serena will win. The easiest way to interpret the fitted values from a logistic regression is as the predicted probabilities for each value of X (recall that the logistic regression model can be algebraically manipulated to take the form of a probability!). In Minitab we can request that the probabilities for each value of X be stored in the data. The result would look something like:

C2 C3 C4
Vote Yes Household Income FITS
0 51.0189 0.00001
0 51.0639 0.00001
0 51.4582 0.00002
0 51.7075 0.00003
0 52.4308 0.00007
1 52.5291 0.00008
1 52.5586 0.00009
1 53.0822 0.00016
1 53.1185 0.00017
1 53.1301 0.00017
0 53.3159 0.00022

From this output we can now see the probability that a household will vote for Serena. Lower values in the “FITS” column represent lower probabilities of voting for Serena. For example, a household with an income of 52.5291 has a probability of 0.00008 of voting for Serena. Serena’s campaign can take advantage of the ability to predict this probability and target marketing and outreach to those households “on the fence” (for example, between 40 and 60 percent likely to vote for her).
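The fitted probabilities can be approximated by plugging each income into the fitted model and converting the log odds to a probability. This sketch assumes the rounded coefficients from the coefficients table, so the results may differ slightly from Minitab's stored FITS; the function name is illustrative.

```python
import math

b0 = -73.39    # Constant from the coefficients table (rounded)
b1 = 1.2183    # Household Income coefficient from the table (rounded)

def predicted_probability(income):
    """Fitted probability of voting for Serena at a given household income."""
    log_odds = b0 + b1 * income
    return 1 / (1 + math.exp(-log_odds))

# A few incomes taken from the stored data shown above.
for income in (51.0189, 52.5291, 53.3159):
    print(f"{income:.4f}  {predicted_probability(income):.5f}")
```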

The marketing firm might make a recommendation to Serena’s campaign to focus on households that are in the 40-60% range. These households might be those who could be “convinced” that voting for Serena would be not only history in the making, but the right decision for leading the state for the next four years.
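To see which incomes correspond to the 40-60% range, the fitted model can be solved for X at a given probability. This is a rough sketch that assumes the rounded coefficients from the coefficients table; the function name is hypothetical, and the resulting incomes are approximate.

```python
import math

b0 = -73.39    # Constant from the coefficients table (rounded)
b1 = 1.2183    # Household Income coefficient from the table (rounded)

def income_at_probability(p):
    """Household income (in 1,000s) at which the fitted probability equals p."""
    log_odds = math.log(p / (1 - p))
    return (log_odds - b0) / b1

low, mid, high = (income_at_probability(p) for p in (0.40, 0.50, 0.60))

print(f"40% likely at income of about {low:.2f}")
print(f"50% likely at income of about {mid:.2f}")
print(f"60% likely at income of about {high:.2f}")
```

Because the slope is steep, the “on the fence” households fall in a narrow band of incomes around the point where the fitted probability is 50%.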

Minitab®

Minitab: Binary Logistic Regression

To perform the binary logistic regression in Minitab use the following:

Stat > Regression > Binary Logistic and enter 'Vote Yes' for Response and 'Household Income' in Model.

Note: the window for Factors refers to any variable(s) which are categorical.

