6.1 - Introduction to GLMs

As we introduce the class of models known as the generalized linear model, we should clear up some potential misunderstandings about terminology. The term "general" linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors. It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only). The form is \(y_i\sim N(x_i^T\beta, \sigma^2),\) where \(x_i\) contains known covariates and \(\beta\) contains the coefficients to be estimated. These models are fit by least squares and weighted least squares using, for example, SAS's GLM procedure or R's lm() function.

The term "generalized" linear model (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1983, 2nd edition 1989). In these models, the response variable \(y_i\) is assumed to follow an exponential family distribution with mean \(\mu_i\), which is assumed to be some (often nonlinear) function of \(x_i^T\beta\). Some would call these "nonlinear" because \(\mu_i\) is often a nonlinear function of the covariates, but McCullagh and Nelder consider them to be linear because the covariates affect the distribution of \(y_i\) only through the linear combination \(x_i^T\beta\).

The first widely used software package for fitting these models was called GLIM. Because of this program, "GLIM" became a well-accepted abbreviation for generalized linear models, as opposed to "GLM", which is often used for general linear models. Today, GLIMs are fit by many packages, including SAS's GENMOD procedure and R's glm() function. Unfortunately, different authors and texts may use GLM to mean either "general" or "generalized" linear model, so it's best to rely on context to determine which is meant. In this course, we will use GLM to mean "generalized" linear model.

There are three components to any GLM:

Random Component - specifies the probability distribution of the response variable; e.g., normal distribution for \(Y\) in the classical regression model, or binomial distribution for \(Y\) in the binary logistic regression model. This is the only random component in the model; there is not a separate error term.

Systematic Component - specifies the explanatory variables \((x_1, x_2, \ldots, x_k)\) in the model, more specifically, their linear combination; e.g., \(\beta_0 + \beta_1x_1 + \beta_2x_2\), as we have seen in a linear regression, and as we will see in the logistic regression in this lesson.

Link Function, \(\eta\) or \(g(\mu)\) - specifies the link between the random and the systematic components. It indicates how the expected value of the response relates to the linear combination of explanatory variables; e.g., \(\eta = g(E(Y_i)) = E(Y_i)\) for classical regression, or \(\eta = \log(\dfrac{\pi}{1-\pi})=\mbox{logit}(\pi)\) for logistic regression.
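
The way the three components fit together can be sketched in a few lines of Python. This is only an illustration: the names LINKS and mean_from_predictor are made up for this example and do not come from any package. Each link \(g\) is stored with its inverse, so the mean is recovered from the linear predictor as \(\mu = g^{-1}(\eta)\).

```python
import math

# Each entry pairs a link g (mean -> linear predictor) with its inverse
# (linear predictor -> mean). The names here are illustrative only.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
}

def mean_from_predictor(link_name, beta0, beta1, x):
    """Return E(Y) for one observation under the chosen link."""
    _, g_inv = LINKS[link_name]
    eta = beta0 + beta1 * x  # systematic component
    return g_inv(eta)        # the link ties eta back to the mean
```

With the same linear predictor, changing only the link name changes the scale on which the mean lives: the whole real line for the identity link, the interval (0, 1) for the logit link, and the positive numbers for the log link.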

Assumptions

The data \(Y_1, Y_2, \ldots, Y_n\) are independently distributed, i.e., cases are independent.
The dependent variable \(Y_i\) does NOT need to be normally distributed, but it is typically assumed to follow a distribution from an exponential family (e.g., binomial, Poisson, multinomial, normal, etc.).
A GLM does NOT assume a linear relationship between the response variable and the explanatory variables, but it does assume a linear relationship between the link-transformed expected response and the explanatory variables; e.g., for binary logistic regression, \(\mbox{logit}(\pi) = \beta_0 + \beta_1x\).
Explanatory variables can be nonlinear transformations of some original variables.
The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure.
Errors need to be independent but do NOT need to be normally distributed.
Parameter estimation uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS).
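
To see why homogeneity of variance is often impossible, it may help to recall the standard variance formulas for two common random components, written here with success probability \(\pi_i\) and Poisson mean \(\lambda_i\):

\[\mbox{Var}(Y_i)=\pi_i(1-\pi_i)\ \mbox{ (binomial, single trial)},\qquad \mbox{Var}(Y_i)=\lambda_i\ \mbox{ (Poisson)}.\]

In both cases the variance is a function of the mean, so whenever the mean changes with the explanatory variables, the variance changes with it.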

The following are three popular examples of GLMs.

Simple Linear Regression

Simple linear regression (SLR) models how the mean of a continuous response variable \(Y\) depends on a single explanatory variable, where \(i\) indexes each observation:

\(\mu_i=\beta_0+\beta_1 x_i\)

Random component - \(Y\) has a normal distribution with mean \(\mu_i\) and constant variance \(\sigma^2\).
Systematic component - \(x\) is the explanatory variable (continuous or discrete), and it enters the model through \(\beta_0 + \beta_1x\), which is linear in the parameters. This can be extended to multiple linear regression where we may have more than one explanatory variable, e.g., \((x_1, x_2, \ldots, x_k)\). Also, the explanatory variables themselves could be transformed, e.g., \(x^2\), or \(\log(x)\), provided they are combined with the parameter coefficients in a linear fashion.
Link function - the identity link, \(\eta= g(E(Y)) = E(Y)\), is used; this is the simplest link function.
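
For a normal random component with the identity link, maximizing the likelihood over the coefficients gives the same estimates as least squares, so the fit has a closed form. The following is a minimal sketch for a single explanatory variable; the function name fit_slr is made up for this illustration.

```python
def fit_slr(x, y):
    """Least squares (equivalently, normal-theory ML) estimates of beta0, beta1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares about the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1
```

For example, fit_slr([0, 1, 2, 3], [1, 3, 5, 7]) returns (1.0, 2.0), since these points lie exactly on the line \(y = 1 + 2x\).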

Binary Logistic Regression

Binary logistic regression models how the odds of "success" for a binary response variable \(Y\) depend on a set of explanatory variables:

\(\mbox{logit}(\pi_i)=\log \left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_i\)

Random component - The distribution of the response variable is assumed to be binomial with a single trial and success probability \(E(Y)=\pi\).
Systematic component - \(x\) is the explanatory variable (continuous or discrete), and \(\beta_0 + \beta_1 x\) is linear in the parameters. As with the above example, this can be extended to multiple explanatory variables and to nonlinear transformations of them.
Link function - the log-odds or logit link, \(\eta_i= g(\pi_i) =\log \left(\dfrac{\pi_i}{1-\pi_i}\right)\), is used.
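
A small numerical sketch may help with reading the logit model. The coefficients below are hypothetical values chosen only for illustration, and the helper names are made up. Inverting the link gives the success probability, and the ratio of odds for a one-unit increase in \(x\) equals \(e^{\beta_1}\) regardless of the starting value of \(x\).

```python
import math

def success_probability(beta0, beta1, x):
    """Invert the logit link: pi = 1 / (1 + exp(-eta))."""
    eta = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds of success corresponding to probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients, not estimated from any data.
beta0, beta1 = -1.0, 0.5

p_at_2 = success_probability(beta0, beta1, 2.0)  # eta = 0, so pi = 0.5
p_at_3 = success_probability(beta0, beta1, 3.0)
odds_ratio = odds(p_at_3) / odds(p_at_2)         # equals exp(beta1)
```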

Poisson Regression

Poisson regression models how the mean of a discrete (count) response variable \(Y\) depends on a set of explanatory variables:

\(\log \lambda_i=\beta_0+\beta_1 x_i\)

Random component - The distribution of \(Y\) is Poisson with mean \(\lambda\).
Systematic component - \(x\) is the explanatory variable (continuous or discrete), and \(\beta_0 + \beta_1 x\) is linear in the parameters. As with the above examples, this can be extended to multiple explanatory variables and to nonlinear transformations of them.
Link function - the log link is used.
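
Exponentiating both sides of the model equation, with the slope written as \(\beta_1\), shows how the coefficients act on the mean count:

\[\lambda_i = e^{\beta_0}\,e^{\beta_1 x_i}, \qquad \dfrac{\lambda(x+1)}{\lambda(x)} = e^{\beta_1}.\]

So under the log link, each one-unit increase in \(x\) multiplies the expected count by \(e^{\beta_1}\), rather than adding a fixed amount to it.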

Summary of advantages of GLMs over traditional (OLS) regression

We do not need to transform the response to have a normal distribution.
The choice of link is separate from the choice of random component, giving us more flexibility in modeling.
The models are fitted via maximum likelihood estimation, so inference can rely on standard large-sample results: parameter estimates are asymptotically normal, and likelihood ratio statistics are asymptotically chi-square.
All the inference tools and model checking that we will discuss for logistic and Poisson regression models apply for other GLMs too; e.g., Wald and Likelihood ratio tests, deviance, residuals, confidence intervals, and overdispersion.
There is often one procedure in a software package that fits all the models listed above, e.g., PROC GENMOD in SAS or glm() in R, with options to vary the three components.
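
The idea of one procedure with options can be sketched in plain Python. This is a toy illustration, not how SAS or R implement their routines: the names FAMILIES and fit_glm are made up, only one explanatory variable is handled, and each family is paired with its canonical link (identity, logit, log). Under those assumptions, the same Fisher scoring loop fits all three examples from this lesson; only the inverse link and the variance function change.

```python
import math

# Canonical-link families: name -> (inverse link, variance function V(mu)).
# These names are illustrative and not taken from any package.
FAMILIES = {
    "normal": (lambda eta: eta, lambda mu: 1.0),
    "binomial": (lambda eta: 1.0 / (1.0 + math.exp(-eta)),
                 lambda mu: mu * (1.0 - mu)),
    "poisson": (lambda eta: math.exp(eta), lambda mu: mu),
}

def fit_glm(x, y, family, n_iter=50, tol=1e-10):
    """Fit g(mu_i) = beta0 + beta1 * x_i by Fisher scoring (canonical link)."""
    inv_link, variance = FAMILIES[family]
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0           # score vector
        h00 = h01 = h11 = 0.0   # information matrix (dispersion cancels)
        for xi, yi in zip(x, y):
            mu = inv_link(b0 + b1 * xi)
            w = variance(mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (information) * step = score
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1
```

A call such as fit_glm(x, y, "poisson") plays the same role as choosing a family in a general-purpose routine. The sketch omits safeguards a real implementation would need, such as handling perfectly separated binary data or a singular information matrix.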