Overview
Logistic Regression
Logistic regression can describe the relationship between a categorical outcome (response variable) and a set of covariates (predictor variables). The categorical outcome may be binary (e.g., presence or absence of disease) or ordinal (e.g., normal, mild and severe). The predictor variable(s) may be continuous or categorical. For example, consider modeling the presence or absence of coronary heart disease using age as a predictor variable.
Table 1. Age and coronary heart disease (CHD) status (1 = present, 0 = absent)

Age  CHD    Age  CHD    Age  CHD
22   0      40   0      54   0
23   0      41   1      55   1
24   0      46   0      58   1
27   0      47   0      60   1
28   0      48   0      60   0
30   0      49   1      62   1
30   0      49   0      65   1
32   0      50   1      67   1
33   0      51   0      71   1
35   1      51   1      77   1
38   0      52   0      81   1
Many students are familiar with linear regression. Would linear regression model these data well? Why or why not?
Answer:
No. With the responses limited to 0 or 1, the error terms are not normally distributed. Nor is the error variance constant. The graph of the data does not look like a regression line, but two lines, one at 0 and another at 1.
Error terms: If \(Y_i = 1 \Rightarrow \epsilon_i = 1 - \beta_0 - \beta_1 x_i\)
If \(Y_i = 0 \Rightarrow \epsilon_i = -\beta_0 - \beta_1 x_i\)
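This can be checked numerically: fitting an ordinary least-squares line to the 0/1 CHD responses and inspecting the residuals shows that, at any given age, the residual can take only one of two values that differ by exactly 1. A minimal sketch with numpy, using age 49 (which appears in Table 1 with both outcomes):

```python
import numpy as np

# Age and CHD data from Table 1
age = np.array([22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
                40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
                54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81], dtype=float)
chd = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
                0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
                0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares fit: chd ~ b0 + b1 * age
b1, b0 = np.polyfit(age, chd, 1)
residuals = chd - (b0 + b1 * age)

# Age 49 appears twice, once with CHD = 1 and once with CHD = 0,
# so the two residuals at that age differ by exactly 1 -- the error
# cannot be normally distributed with constant variance.
r = residuals[age == 49]
print(r.max() - r.min())
```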
Instead of using the 0/1 responses, let's consider the proportion of individuals with CHD by age group.
Table 2. Proportion with CHD by age group

Age group   # in group   # diseased   % diseased
20-29       5            0            0
30-39       6            1            17
40-49       7            2            29
50-59       7            4            57
60-69       5            4            80
70-79       2            2            100
80-89       1            1            100
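The grouped proportions in Table 2 can be reproduced directly from the raw data in Table 1. A minimal sketch with numpy:

```python
import numpy as np

# Raw age/CHD pairs from Table 1
age = np.array([22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
                40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
                54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81], dtype=float)
chd = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
                0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
                0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)

# Decade bins: 20-29, 30-39, ..., 80-89
bins = np.arange(20, 100, 10)
idx = np.digitize(age, bins) - 1   # 0 -> 20-29, 1 -> 30-39, ...

for i in range(7):
    in_group = idx == i
    n = int(in_group.sum())           # number in age group
    d = int(chd[in_group].sum())      # number diseased in that group
    print(f"{bins[i]}-{bins[i] + 9}: n={n}, diseased={d}, pct={round(100 * d / n)}")
```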
Plot of Data from Table 2
The plot of the proportions follows a curvilinear pattern which can be modeled using logistic regression. The logistic regression model satisfies the constraint
\(0 \le E(Y) = \pi \le 1\)
The binomial distribution, instead of the normal distribution, is used to describe the distribution of the errors in the logistic model.
\(\begin{align} \sigma^2\{\epsilon_i\} &= \pi_i(1-\pi_i) \\
&= E(Y_i)(1-E(Y_i)) \\
&= (\beta_0 + \beta_1 x_i)(1-\beta_0 - \beta_1 x_i) \end{align}\)
The logistic function models the conditional probability of the response.
\(P(y \mid x) =\frac{e^{\alpha+\beta x}}{1+ e^{\alpha+\beta x}}\)
\(\ln\left[\frac{P(y \mid x)}{1-P(y \mid x)} \right]=\alpha+\beta x\)
where \(\ln\left[\frac{P(y \mid x)}{1-P(y \mid x)} \right]\) is the logit of \(P(y \mid x)\).
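These two expressions are inverses of each other: applying the logit to the logistic function recovers the linear predictor \(\alpha + \beta x\). A small sketch, using made-up values of \(\alpha\) and \(\beta\) purely for illustration:

```python
import math

def logistic(x, alpha, beta):
    """P(y|x) under the logistic model."""
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

# Hypothetical coefficients, chosen only for illustration
alpha, beta = -5.3, 0.11

for x in (25, 50, 75):
    p = logistic(x, alpha, beta)
    assert 0 < p < 1                                     # probability stays in (0, 1)
    assert abs(logit(p) - (alpha + beta * x)) < 1e-9     # logit recovers alpha + beta*x
    print(x, round(p, 3))
```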
Taking the logarithm of the odds, the logit, yields an expression that has the form of a linear regression model.
Advantages of the Logit
- Allows properties of a linear regression model to be exploited
- The logit itself can take values between \(-\infty\) and \(+\infty\)
- The probability remains constrained between 0 and 1
- The logit can be directly related to the odds of disease
\(\ln\left( \frac{P}{1-P} \right)=\alpha+\beta x\)
\(\frac{P}{1-P}=e^{\alpha+\beta x}\)
Interpretation of coefficient \(\beta\)
The probabilities for an individual to fall into categories of exposure to a risk factor and presence or absence of disease are defined below:

                 Exposed (x = 1)             Unexposed (x = 0)
Disease (y)      \(P(y \mid x = 1)\)         \(P(y \mid x = 0)\)
No disease       \(1 - P(y \mid x = 1)\)     \(1 - P(y \mid x = 0)\)
\(\frac{P}{1-P}=e^{\alpha+\beta x}\)
\(\begin{align}Odds_{d \mid e} &= e^{\alpha + \beta}\\ Odds_{d \mid \bar{e}} &= e^{\alpha} \end{align}\)
\(\begin{align}OR &= \frac{e^{\alpha + \beta}}{e^{\alpha}} =e^{\beta} \\ ln(OR) &= \beta \end{align}\)
The odds of disease given exposure and the odds of disease among the unexposed are shown above. The odds ratio comparing the odds of disease among the exposed to the odds of disease among the unexposed simplifies to \(e^{\beta}\).
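This identity can be checked numerically: with a single binary covariate, the fitted logistic coefficient \(\hat\beta\) equals the log of the sample odds ratio from the 2×2 table. As an illustration (dichotomizing age at 50 is an assumption made here, not part of the original example), a minimal Newton-Raphson logistic fit in numpy:

```python
import numpy as np

# Age/CHD data from Table 1; "exposure" = age >= 50 (illustrative cutoff)
age = np.array([22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
                40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
                54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
              0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
              0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1], dtype=float)
x = (age >= 50).astype(float)

# 2x2 table: a = exposed diseased, b = exposed healthy,
#            c = unexposed diseased, d = unexposed healthy
a = int(((x == 1) & (y == 1)).sum())
b = int(((x == 1) & (y == 0)).sum())
c = int(((x == 0) & (y == 1)).sum())
d = int(((x == 0) & (y == 0)).sum())
table_or = (a * d) / (b * c)

# Newton-Raphson maximum likelihood fit of logit P(y|x) = alpha + beta*x
X = np.column_stack([np.ones_like(x), x])
coef = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ coef))       # fitted probabilities
    W = p * (1 - p)                       # binomial variances
    grad = X.T @ (y - p)                  # score vector
    hess = X.T @ (X * W[:, None])         # observed information
    coef = coef + np.linalg.solve(hess, grad)

alpha_hat, beta_hat = coef
print(np.exp(beta_hat), table_or)   # e^beta matches the 2x2-table odds ratio
```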
- \(\beta\) = increase in the logarithm of the odds ratio for a one-unit increase in x
- A Wald test can be used to test the hypothesis that \(\beta=0\):
\(\chi^2=\frac{\beta^2}{Variance(\beta)} \;\;\;\; (1 \, df)\)
- A confidence interval for the OR can be calculated:
\(95\% \; CI \;\; \text{for} \;\; \Theta = e^{\beta \pm 1.96 \, SE(\beta)}\)
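For a binary exposure, the variance of the estimated \(\beta\) can be computed from the 2×2 cell counts as \(1/a + 1/b + 1/c + 1/d\) (Woolf's method), which makes the Wald statistic and confidence interval easy to compute by hand. A sketch continuing the illustrative age-at-50 dichotomy of the CHD data (an assumption, not part of the original text):

```python
import math

# Illustrative 2x2 counts from Table 1 with "exposure" = age >= 50
a, b = 11, 4    # exposed: diseased, healthy
c, d = 3, 15    # unexposed: diseased, healthy

beta_hat = math.log((a * d) / (b * c))   # log odds ratio
var_beta = 1/a + 1/b + 1/c + 1/d         # Woolf variance estimate
se_beta = math.sqrt(var_beta)

wald_chi2 = beta_hat**2 / var_beta       # 1 df; compare to 3.84 at the 5% level
ci_low = math.exp(beta_hat - 1.96 * se_beta)
ci_high = math.exp(beta_hat + 1.96 * se_beta)

print(round(wald_chi2, 2), (round(ci_low, 2), round(ci_high, 2)))
```

Note how wide the interval is: with only 33 observations, the OR is estimated very imprecisely even though the Wald test is clearly significant.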
Objectives
- Increase your familiarity with statistical methods used in epidemiology, particularly logistic regression, Poisson regression, and effect modification.