Beyond Logistic Regression: Generalized Linear Models (GLM)

We saw this material at the end of Lesson 6, but as a Latin proverb says, "Repetition is the mother of study" (Repetitio est mater studiorum). Let's look at the basic structure of GLMs again before studying a specific example, Poisson regression.

The logistic regression model is an example of a broad class of models known as generalized linear models (GLMs). GLMs also include linear regression, ANOVA, Poisson regression, and others.

There are three components to a GLM:

  • Random Component – refers to the probability distribution of the response variable (Y); e.g., the binomial distribution for Y in binary logistic regression.
  • Systematic Component – refers to the explanatory variables (X1, X2, ..., Xk) combined linearly in the predictor; e.g., β0 + β1x1 + β2x2 as we have seen in logistic regression.
  • Link Function, η or g(μ) – specifies the link between the random and systematic components. It says how the expected value of the response relates to the linear predictor of explanatory variables; e.g., η = logit(π) for logistic regression.

For a more detailed discussion, refer to Agresti (2007), Ch. 3; Agresti (2002), Ch. 4 (pages 115–118, 132–135); Agresti (1996), Ch. 4; and/or McCullagh & Nelder (1989).

Simple Linear Regression

Models how the mean (expected value) of a continuous response variable depends on a set of explanatory variables.

Yi = β0 + β1xi + εi

or

E(Yi) = β0 + β1xi

  • Random component: Y is the response variable and has a normal distribution; generally we assume εi ~ N(0, σ²).
  • Systematic component: X is the explanatory variable (it can be continuous or discrete) and the model is linear in the parameters: β0 + β1xi
  • Link function: Identity Link η = g(E(Yi)) = E(Yi)
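As a minimal sketch of this special case (the data below are made up for illustration), the identity link means the fitted means equal the linear predictor, and the ML estimates under normal errors are just the ordinary least-squares estimates:

```python
import numpy as np

# Hypothetical data: one continuous predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimates; under normal errors these are also the MLEs
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = beta

# Identity link: the fitted means are the linear predictor itself
fitted = X @ beta
```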

Binary Logistic Regression

Models how a binary response variable depends on a set of explanatory variables.

  • Random component: The distribution of Y is Binomial
  • Systematic component: Xs are explanatory variables (they can be continuous, discrete, or both) and are linear in the parameters: β0 + β1x1 + ... + βkxk
  • Link function: Logit, η = logit(π) = log(π / (1 − π)), so that log(π / (1 − π)) = β0 + β1x1 + ... + βkxk
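As a minimal sketch (hypothetical data; the function name is ours), this model can be fitted by Newton-Raphson, the iterative scheme that standard software uses for logistic regression:

```python
import numpy as np

def logistic_newton(X, y, iters=25):
    """Fit binary logistic regression by Newton-Raphson.

    X: (n, p) design matrix with an intercept column; y: 0/1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta                   # systematic component
        pi = 1.0 / (1.0 + np.exp(-eta))  # inverse logit link
        W = pi * (1.0 - pi)              # binomial variance weights
        # Newton step: beta += (X' W X)^{-1} X' (y - pi)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - pi))
    return beta

# Hypothetical data: success becomes more likely as x grows
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])
b0, b1 = logistic_newton(X, y)   # b1 > 0: log-odds of success increase with x
```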

Loglinear Models

Models the expected cell counts as a function of the levels of categorical variables.

  • Random component: The distribution of counts is Poisson
  • Systematic component: Xs are discrete variables used in the cross-classification, and the model is linear in the parameters; e.g., for a two-way table, log(μij) = λ + λ_i^A + λ_j^B + λ_ij^AB
  • Link Function: Log η = log(μ)

They are related in the sense that loglinear models are more general than logit models, and some logit models are equivalent to certain loglinear models (e.g., consider the admissions data example or the Boy Scout example).

  • If you have a binary response variable in the loglinear model, you can construct the logits to help with the interpretation of the loglinear model.
  • Some logit models with only categorical variables have equivalent loglinear models.

On the next slide we will consider the Boy Scout data and the homogeneous association model (DS, BS, DB), and see once again how this ties in with the discussion in Section B of Lesson 5.

The loglinear model is also equivalent to the Poisson regression model when all explanatory variables are discrete. For more on Poisson regression models, see the next section of this lesson; Agresti (2007), Sec. 3.3; Agresti (2002), Sec. 4.3 (for counts), Sec. 9.2 (for rates), and Sec. 13.2 (for random effects); and Agresti (1996), Sec. 4.3.

Loglinear model:

log(μijk) = λ + λ_i^S + λ_j^B + λ_k^D + λ_ik^SD + λ_jk^BD + λ_ij^SB

If we focus on delinquent status

πij = Pr(Yes Delinquent | S = i, B = j)

and the logit model for a boy's delinquent status is

log(πij / (1 − πij)) = α + β_i^S + β_j^B

where α and the β terms are differences of the corresponding λ parameters involving D.

Compare this to model (4) in Section B of Lesson 5, where β1 and β2 correspond to β_i^S for the three levels of S, and β3 corresponds to β_j^B for the two levels of B.
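As a minimal illustration of the loglinear/Poisson-regression equivalence (using hypothetical counts and a simpler 2 × 2 independence model rather than the Boy Scout data), fitting a Poisson regression with dummy-coded factors reproduces the familiar row total × column total / grand total fitted counts:

```python
import numpy as np

def poisson_irls(X, y, iters=50):
    """Fit a Poisson loglinear model (log link) by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # start on the scale of the data
    for _ in range(iters):
        mu = np.exp(X @ beta)           # log link: mu = exp(eta)
        # Newton step: beta += (X' diag(mu) X)^{-1} X' (y - mu)
        beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
    return beta

# Hypothetical 2x2 table of counts, cells ordered (A,B) = (0,0),(0,1),(1,0),(1,1)
y = np.array([20.0, 10.0, 40.0, 30.0])
# Independence model (A, B): intercept + dummy for A + dummy for B
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
beta = poisson_irls(X, y)
mu = np.exp(X @ beta)   # fitted counts: row total * column total / grand total
```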

GLM Table based on Agresti (2002), pg. 118

Model                  Random       Link               Systematic
---------------------  -----------  -----------------  -----------
Linear Regression      Normal       Identity           Continuous
ANOVA                  Normal       Identity           Categorical
ANCOVA                 Normal       Identity           Mixed
Logistic Regression    Binomial     Logit              Mixed
Loglinear              Poisson      Log                Categorical
Poisson Regression     Poisson      Log                Mixed
Multinomial Response   Multinomial  Generalized Logit  Mixed

Advantages of GLMs over Traditional Regression

  • We do not need to transform the response Y to have a normal distribution.
  • The choice of link is separate from the choice of random component, so we have more flexibility in modeling.
  • If the link produces additive effects, then we do not need constant variance.
  • The models are fitted via maximum likelihood estimation, so the estimators have optimal (asymptotic) properties.
  • All the inference tools and model checking we discussed for logistic regression and loglinear models apply to other GLMs too; e.g., Wald and likelihood-ratio tests, deviance, residuals, confidence intervals, overdispersion.
  • Often a single procedure in a software package, e.g., PROC GENMOD in SAS or glm() in R, fits all of these models, with options to vary the three components.
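This "one procedure, three components" idea can be sketched in a few lines. The IRLS loop below is a simplified illustration (the function name and data are ours, not a real library's API): it takes the inverse link, its derivative, and the variance function as arguments, much as glm() takes a family= option. Plugging in the identity link with constant variance reproduces ordinary least squares:

```python
import numpy as np

def glm_irls(X, y, inv_link, dmu_deta, variance, iters=50):
    """Generic IRLS: swapping the link and variance functions changes the GLM.

    inv_link: eta -> mu;  dmu_deta: eta -> dmu/deta;  variance: mu -> Var(Y) (up to scale)
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        mu = inv_link(eta)
        g = dmu_deta(eta)
        W = g ** 2 / variance(mu)        # IRLS weights
        z = eta + (y - mu) / g           # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# Identity link + constant variance reproduces ordinary least squares
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])
beta_ols = glm_irls(X, y,
                    inv_link=lambda e: e,
                    dmu_deta=lambda e: np.ones_like(e),
                    variance=lambda m: np.ones_like(m))
```

Swapping in the inverse logit with binomial variance, or exp with Poisson variance, gives logistic and Poisson regression from the same loop.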

But there are some limitations too:

  • The systematic component must be a linear function of the parameters (a linear predictor)
  • Responses must be independent

There are ways around these restrictions; e.g. consider our analysis of matched data, or use NLMIXED in SAS, or consider other models and alternative software packages.

Some additional references:

  • Collett, D. (1991). Analysis of Binary Data.
  • Fey, M. (2002). Measuring a binary response's range of influence in logistic regression. American Statistician, 56, 5–9.
  • Hosmer, D.W. & Lemeshow, S. (1989). Applied Logistic Regression.
  • Fienberg, S.E. The Analysis of Cross-Classified Categorical Data, 2nd ed. Cambridge, MA.
  • McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models, 2nd ed.
  • Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705–724.
  • Rice, J.C. (1994). Logistic regression: An introduction. In B. Thompson, ed., Advances in Social Science Methodology, Vol. 3, 191–245. Greenwich, CT: JAI Press. A popular introduction.
  • SAS Institute (1995). Logistic Regression Examples Using the SAS System, Version 6.
  • Strauss, D. (1999). The many faces of logistic regression. American Statistician.

Next we will see more on Poisson regression...