10: Logistic Regression

Overview

Case-Study: Campaign Marketing

Elections are critical events in the United States, and there is always a plethora of polls predicting whether a candidate will win. In the lead-up to a recent gubernatorial election, Serena, the state’s first female candidate for governor, wanted to specifically target households that might be on the fence about voting for her. Her campaign hired a marketing firm to survey a random sample of 1,000 voters, recording each voter’s annual household income and whether or not they would vote for Serena. The firm was able to identify the relationship between household income and voting for Serena, and to target marketing efforts toward households that might be less certain about voting for her. Let’s take a closer look at how the marketing firm made these predictions.

As a first step, let’s take a look at the data from the polling.

Descriptive Statistics: Household Income and Voting

Variable	Vote Yes	N	N*	Mean	SE Mean	StDev	Minimum	Q1	Median	Q3	Maximum
Household Income	0	505	0	58.192	0.0874	1.964	51.019	57.084	58.389	59.438	64.081
	1	495	0	62.247	0.0936	2.083	52.529	60.817	61.983	63.551	70.585

We can see that there are two categories, Vote Yes (0, 1), and we can compare the mean household income for the two categories. The important point is that the categories are actually the RESPONSE variable in this example! Unlike in linear regression, where the response is continuous, here the response is categorical and the continuous variable (household income) is the PREDICTOR.
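As a quick check on the table above, the short Python sketch below recomputes the SE Mean column from StDev and N (using the standard relationship SE Mean = StDev / sqrt(N)) and the difference between the two group means. The numbers are copied from the table; the variable names are illustrative only.

```python
import math

# Summary statistics copied from the descriptive statistics table.
# Keys are the Vote Yes categories (0 = no, 1 = yes).
groups = {
    0: {"n": 505, "mean": 58.192, "stdev": 1.964},
    1: {"n": 495, "mean": 62.247, "stdev": 2.083},
}

# Standard error of the mean: StDev / sqrt(N)
se_mean = {k: g["stdev"] / math.sqrt(g["n"]) for k, g in groups.items()}

# Difference in mean household income between the two categories
mean_diff = groups[1]["mean"] - groups[0]["mean"]

for k, g in groups.items():
    print(f"Vote Yes = {k}: mean = {g['mean']:.3f}, SE mean = {se_mean[k]:.4f}")
print(f"Difference in means (1 minus 0): {mean_diff:.3f}")
```

The recomputed standard errors agree with the SE Mean column to the precision shown, and the Vote Yes = 1 group has the higher mean household income.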

But let’s say the polling firm made a mistake and used simple linear regression to model the election prediction problem. If you recall, the regression equation (\(y=b_0+b_1X_1\)) yields a predicted value of y. So in our example, voting for Serena would be modeled as a linear function of household income.

A partial scatterplot of the data looks like this:

Notice how there are only two values for Y (0, 1), representing whether or not a voter votes for Serena. This clearly illustrates the binary nature of the response.

We can enter our data into Minitab and produce a regression equation along with predicted values for Y. Let’s take a look at the output for the predicted values.

C2	C3	C4
Vote Yes	Household Income	Predicted Value
0	51.0189	-0.64025
0	51.0639	-0.63468
0	51.4582	-0.58592
0	51.7075	-0.55510
0	52.4308	-0.46564
1	52.5291	-0.45349
1	52.5586	-0.44984
1	53.0822	-0.38510
1	53.1185	-0.38061
1	53.1301	-0.37916
0	53.3159	-0.35619
0	53.6321	-0.31709
⋮	⋮	⋮

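But wait, the scatterplot showed us that the response had to be a value of zero or one! The predicted values in this snippet of output range from about -0.64 to -0.32. A negative value cannot be read as a vote (0 or 1), nor as a probability of voting for Serena, so these predictions really do not make any sense at all.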
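To see how far off the linear model can be, the sketch below recovers the fitted line implied by the twelve rows of output listed above and evaluates it at the largest household income in the descriptive statistics table (70.585). The slope and intercept here are back-calculated from the printed predicted values, not taken from Minitab output, so treat them as approximations subject to rounding.

```python
# Household Income and Predicted Value columns copied from the output above.
income = [51.0189, 51.0639, 51.4582, 51.7075, 52.4308, 52.5291,
          52.5586, 53.0822, 53.1185, 53.1301, 53.3159, 53.6321]
predicted = [-0.64025, -0.63468, -0.58592, -0.55510, -0.46564, -0.45349,
             -0.44984, -0.38510, -0.38061, -0.37916, -0.35619, -0.31709]

# Least-squares line through (income, predicted). Because the predicted
# values lie on the fitted regression line, this recovers b0 and b1
# up to rounding in the printed output.
n = len(income)
mean_x = sum(income) / n
mean_y = sum(predicted) / n
sxx = sum((x - mean_x) ** 2 for x in income)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, predicted))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Largest household income in the descriptive statistics table.
max_income = 70.585
pred_at_max = b0 + b1 * max_income

print(f"Implied line: y = {b0:.3f} + {b1:.4f} * income")
print(f"All listed predictions negative: {all(p < 0 for p in predicted)}")
print(f"Prediction at income {max_income}: {pred_at_max:.2f}")
```

Under this reconstruction, the line produces negative values at the low end of the income range and a value greater than 1 at the high end, so the predictions fall outside the 0 to 1 range in both directions.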

We need another tool to be able to model a “yes or no” (binary) outcome. This tool is logistic regression.
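As a preview of why logistic regression avoids the problem seen above, the sketch below evaluates the standard logistic function, 1 / (1 + e^(-z)), which binary logistic regression is built on. The input values of z are arbitrary illustrations, not estimates from the polling data.

```python
import math

def logistic(z: float) -> float:
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary illustrative inputs, including negative values like those
# the linear model produced.
z_values = (-6.0, -0.64, 0.0, 0.64, 6.0)
probabilities = [logistic(z) for z in z_values]

for z, p in zip(z_values, probabilities):
    print(f"z = {z:6.2f} -> logistic(z) = {p:.4f}")
```

Whatever value goes in, negative or positive, the result stays between 0 and 1, so it can be interpreted as a probability of a “yes” outcome.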

Objectives

Upon completion of this lesson, you should be able to:

Identify the difference between the response variable types in OLS regression and binary logistic regression
Correctly interpret the coefficient of a logistic regression model
Correctly interpret the probability of a response variable in a logistic regression model
Identify the indicators of model fit in logistic regression