6: Binary Logistic Regression
Overview
Thus far, our focus has been on describing interactions or associations between two or three categorical variables mostly via single summary statistics and with significance testing. From this lesson on, we will focus on modeling. Models can handle more complicated situations and analyze the simultaneous effects of multiple variables, including combinations of categorical and continuous variables. In the next two lessons, we study binomial logistic regression, a special case of a generalized linear model.
Logistic regression is applicable, for example, if we want to...
- model the probabilities of a response variable as a function of some explanatory variables, e.g., "success" of admission as a function of sex.
- perform descriptive discriminant analyses such as describing the differences between individuals in separate groups as a function of explanatory variables, e.g., students admitted and rejected as a function of sex.
- classify individuals into two categories based on explanatory variables, e.g., classify new students into "admitted" or "rejected" groups depending on sex.
As we'll see, there are two key differences between binomial (or binary) logistic regression and classical linear regression. One is that instead of a normal distribution, the logistic regression response has a binomial distribution (it can be either "success" or "failure"), and the other is that instead of relating the response directly to a set of predictors, the logistic model uses the log-odds of success---a transformation of the success probability called the logit. Among other benefits, working with the log-odds prevents any probability estimates from falling outside the range (0, 1).
We begin with two-way tables, then progress to three-way tables, where all explanatory variables are categorical. Then, continuing into the next lesson, we introduce binary logistic regression with continuous predictors as well. In the last part, we will focus on more model diagnostics and model selection.
6.1 - Introduction to GLMs
As we introduce the class of models known as the generalized linear model, we should clear up some potential misunderstandings about terminology. The term "general" linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors. It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only). The form is \(y_i\sim N(x_i^T\beta, \sigma^2),\) where \(x_i\) contains known covariates and \(\beta\) contains the coefficients to be estimated. These models are fit by least squares and weighted least squares using, for example, SAS's GLM procedure or R's lm() function.
The term "generalized" linear model (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1983; 2nd edition 1989). In these models, the response variable \(y_i\) is assumed to follow an exponential family distribution with mean \(\mu_i\), which is assumed to be some (often nonlinear) function of \(x_i^T\beta\). Some would call these “nonlinear” because \(\mu_i\) is often a nonlinear function of the covariates, but McCullagh and Nelder consider them to be linear because the covariates affect the distribution of \(y_i\) only through the linear combination \(x_i^T\beta\).
The first widely used software package for fitting these models was called GLIM. Because of this program, "GLIM" became a well-accepted abbreviation for generalized linear models, as opposed to "GLM" which often is used for general linear models. Today, GLIMs are fit by many packages, including SAS's Genmod procedure and R's glm() function. Unfortunately, different authors and texts may use GLM to mean either "general" or "generalized" linear model, so it's best to rely on context to determine which is meant. We will prefer to use GLM to mean "generalized" linear model in this course.
There are three components to any GLM:
- Random Component - specifies the probability distribution of the response variable; e.g., normal distribution for \(Y\) in the classical regression model, or binomial distribution for \(Y\) in the binary logistic regression model. This is the only random component in the model; there is not a separate error term.
- Systematic Component - specifies the explanatory variables \((x_1, x_2, \ldots, x_k)\) in the model, more specifically, their linear combination; e.g., \(\beta_0 + \beta_1x_1 + \beta_2x_2\), as we have seen in a linear regression, and as we will see in the logistic regression in this lesson.
- Link Function, \(\eta\) or \(g(\mu)\) - specifies the link between the random and the systematic components. It indicates how the expected value of the response relates to the linear combination of explanatory variables; e.g., \(\eta = g(E(Y_i)) = E(Y_i)\) for classical regression, or \(\eta = \log(\dfrac{\pi}{1-\pi})=\mbox{logit}(\pi)\) for logistic regression.
Assumptions
- The data \(Y_1, Y_2, \ldots, Y_n\) are independently distributed, i.e., cases are independent.
- The dependent variable \(Y_i\) does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal, etc.).
- A GLM does NOT assume a linear relationship between the response variable and the explanatory variables, but it does assume a linear relationship between the transformed expected response in terms of the link function and the explanatory variables; e.g., for binary logistic regression \(\mbox{logit}(\pi) = \beta_0 + \beta_1x\).
- Explanatory variables can be nonlinear transformations of some original variables.
- The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure.
- Errors need to be independent but NOT normally distributed.
- Parameter estimation uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS).
The following are three popular examples of GLMs.
Simple Linear Regression

SLR models how the mean of a continuous response variable \(Y\) depends on a set of explanatory variables, where \(i\) indexes each observation:

\(\mu_i=\beta_0+\beta_1 x_i\)

- Random component - The distribution of \(Y\) is normal with mean \(\mu\) and constant variance \(\sigma^2\).
- Systematic component - \(x\) is the explanatory variable (can be continuous or discrete) and is linear in the parameters \(\beta_0 + \beta_1x\). This can be extended to multiple linear regression, where we may have more than one explanatory variable, e.g., \((x_1, x_2, \ldots, x_k)\). The explanatory variables themselves may also be transformed, e.g., \(x^2\) or \(\log(x)\), provided they are combined with the parameter coefficients in a linear fashion.
- Link function - the identity link, \(\eta= g(E(Y)) = E(Y)\), is used; this is the simplest link function.
Binary Logistic Regression

Binary logistic regression models how the odds of "success" for a binary response variable \(Y\) depend on a set of explanatory variables:

\(\mbox{logit}(\pi_i)=\log \left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_i\)

- Random component - The distribution of the response variable is assumed to be binomial with a single trial and success probability \(E(Y)=\pi\).
- Systematic component - \(x\) is the explanatory variable (can be continuous or discrete) and is linear in the parameters. As with the example above, this can be extended to multiple explanatory variables and to non-linear transformations of them.
- Link function - the log-odds or logit link, \(\eta= g(\pi) =\log \left(\dfrac{\pi_i}{1-\pi_i}\right)\), is used.
Poisson Regression

Poisson regression models how the mean of a discrete (count) response variable \(Y\) depends on a set of explanatory variables:

\(\log \lambda_i=\beta_0+\beta_1 x_i\)

- Random component - The distribution of \(Y\) is Poisson with mean \(\lambda\).
- Systematic component - \(x\) is the explanatory variable (can be continuous or discrete) and is linear in the parameters. As with the examples above, this can be extended to multiple explanatory variables and to non-linear transformations of them.
- Link function - the log link is used.
Summary of advantages of GLMs over traditional (OLS) regression
- We do not need to transform the response to have a normal distribution.
- The choice of link is separate from the choice of random component, giving us more flexibility in modeling.
- The models are fitted via maximum likelihood estimation, so the parameter estimates and likelihood-based test statistics benefit from large-sample normal and chi-square approximations.
- All the inference tools and model checking that we will discuss for logistic and Poisson regression models apply for other GLMs too; e.g., Wald and Likelihood ratio tests, deviance, residuals, confidence intervals, and overdispersion.
- There is often one procedure in a software package to capture all the models listed above, e.g. PROC GENMOD in SAS or glm() in R, etc., with options to vary the three components.
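For instance, here is a minimal sketch in R (with simulated data; all object names are hypothetical) of how a single call to glm() covers the three examples above simply by changing the family (random component) and link:

set.seed(1)
x <- runif(100)
## simple linear regression: normal random component, identity link
y_norm <- 2 + 3*x + rnorm(100)
fit_lm <- glm(y_norm ~ x, family = gaussian(link = "identity"))
## binary logistic regression: binomial random component, logit link
y_bin <- rbinom(100, size = 1, prob = plogis(-1 + 2*x))
fit_logit <- glm(y_bin ~ x, family = binomial(link = "logit"))
## Poisson regression: Poisson random component, log link
y_cnt <- rpois(100, lambda = exp(0.5 + x))
fit_pois <- glm(y_cnt ~ x, family = poisson(link = "log"))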
6.2 - Single Categorical Predictor
Overview of Binary Logistic Regression
Binary logistic regression models the probability that a characteristic is present (i.e., "success"), given the values of explanatory variables \(x_1,\ldots,x_k\). We denote this by \(\pi(x_1,\ldots,x_k) = P(\mbox{success}|x_1,\ldots,x_k)\) or simply by \(\pi\) for convenience---but it should be understood that \(\pi\) will in general depend on one or more explanatory variables. For example, a physician may be interested in estimating the proportion of diabetic persons in a population. Naturally, she knows that all sections of the population do not have an equal probability of "success", i.e., being diabetic. Certain conditions, such as advanced age and hypertension, would likely contribute to a higher proportion.
Assumptions for logistic regression:
- The response variable \(Y\) is a binomial random variable with a single trial and success probability \(\pi\). Thus, \(Y=1\) corresponds to "success" and occurs with probability \(\pi\), and \(Y=0\) corresponds to "failure" and occurs with probability \(1-\pi\).
- The set of predictor or explanatory variables \(x = (x_1, x_2, \ldots, x_k)\) are fixed (not random) and can be discrete, continuous, or a combination of both. As with classical regression, two or more of these may be indicator variables to model the nominal categories of a single predictor, and others may represent interactions between two or more explanatory variables.
- Together, the data for the \(i\)th individual are collected in the vector \((x_{1i},\ldots,x_{ki},Y_i)\), for \(i=1,\ldots,n\). These are assumed independent by the sampling mechanism. This also allows us to combine or group the data, which we do below, by summing over trials for which \(\pi\) is constant. In this section of the notes, we focus on a single explanatory variable \(x\).
The model is expressed as
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)= \beta_0+\beta_1 x_i\)
Or, by solving for \(\pi_i\), we have the equivalent expression
\(\pi_i=\dfrac{\exp(\beta_0+\beta_1 x_i)}{1+\exp(\beta_0+\beta_1 x_i)}\)
To estimate the parameters, we substitute this expression for \(\pi_i\) into the joint pdf for \(Y_1,\ldots,Y_n\)
\(\prod_{i=1}^n\pi_i^{y_i}(1-\pi_i)^{1-y_i}\)
to give us the likelihood function \(L(\beta_0,\beta_1)\) of the regression parameters. By maximizing this likelihood over all possible \(\beta_0\) and \(\beta_1\), we have the maximum likelihood estimates (MLEs): \(\hat{\beta}_0\) and \(\hat{\beta}_1\). Extending this to include additional explanatory variables is straightforward.
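To see that this is what the software does behind the scenes, here is a rough sketch in R (simulated data; all object names are hypothetical) that maximizes the log of this likelihood numerically and compares the result to glm():

set.seed(2)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2*x))
## negative log-likelihood of (beta0, beta1) under the Bernoulli model
negloglik <- function(beta) {
  pi <- plogis(beta[1] + beta[2]*x)
  -sum(y*log(pi) + (1 - y)*log(1 - pi))
}
optim(c(0, 0), negloglik)$par        # numerical MLEs of beta0 and beta1
coef(glm(y ~ x, family = binomial))  # should agree closely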
Working with Grouped Data
It's worth noting that when the same value of the predictor \(x\) occurs more than once, the success probability is constant for all \(Y\) responses associated with that \(x\) value, and we can work with the sum of such \(Y\)s as a binomial with the number of trials equal to the number of times that \(x\) value occurred. Our example with parent and student smoking illustrates this. Recall the data for the two-way table:
|  | Student smokes | Student does not smoke |
|---|---|---|
| 1–2 parents smoke | 816 | 3203 |
| Neither parent smokes | 188 | 1168 |
We have two approaches to work with. The first is for the individual or ungrouped data: \(x_i=1\) if the \(i\)th student's parents smoke (0 otherwise), \(Y_i=1\) if the \(i\)th student smokes, and \(Y_i\sim\mbox{Binomial}(1,\pi_i)\), for \(i=1,\ldots,5375\). The second is for the grouped data: \(x_i=1\) if the \(i\)th table row corresponds to parents smoking (0 otherwise), \(Y_i=\) number of students who smoke for the \(i\)th row, and \(Y_i\sim\mbox{Binomial}(n_i,\pi_i)\), for \(i=1,2\) and \((n_1,n_2)=(4019,1356)\).
With either approach, \(\pi\) is the probability that a student smokes, and its estimate will be the same as well. Even though the index has a much greater range in the ungrouped approach, there are still only two unique values for \(\pi\): that for the group whose parents smoke and that for the group whose parents don't. The likelihood function is simplified for the grouped data, however, and becomes
\(\prod_{i=1}^2\dfrac{n_i!}{y_i!(n_i-y_i)!}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}\)
The range for \(i\) in the grouped approach is denoted by \(N\) to distinguish it from the individual total \(n\) (so \(N=2\) in this example). Whenever possible, we will generally prefer to work with the grouped data because, in addition to having a more compact likelihood function, the larger binomial counts can be approximated by a normal distribution and will lend themselves to more meaningful residual diagnostics and large-sample significance tests.
Grouping is not always possible, however. Unless two responses have the same predictor value (or combination of predictor values if two or more predictors are present), their success probabilities would not be the same, and they cannot be added together to give a binomial variable. This is a common issue when lots of predictors and/or interactions are included in the model or with continuous predictor types because their values tend to be unique.
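As a quick illustration (a sketch only; the object names are hypothetical), both forms of the smoking data can be fit in R, and the coefficient estimates agree:

## grouped form: one row per parent-smoking group
parsmoke <- factor(c(1, 0))
yes <- c(816, 188)
n <- c(4019, 1356)
fit.grouped <- glm(cbind(yes, n - yes) ~ parsmoke, family = binomial)
## ungrouped form: one row per student (5375 rows in all)
x <- rep(c(1, 1, 0, 0), times = c(816, 3203, 188, 1168))
y <- rep(c(1, 0, 1, 0), times = c(816, 3203, 188, 1168))
fit.ungrouped <- glm(y ~ factor(x), family = binomial)
coef(fit.grouped)    # same point estimates
coef(fit.ungrouped)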
Interpretation of Parameter Estimates
One source of complication when interpreting parameters in the logistic regression model is that they're on the logit or log-odds scale. We need to be careful to convert them back before interpreting the terms of the original variables.
- \(\exp(\beta_0) =\) the odds that the success characteristic is present for an individual for which \(x = 0\), i.e., at the baseline. If multiple predictors are involved, all would need to be set to 0 for this interpretation.
- \(\exp(\beta_1) =\) the multiplicative change in the odds of success for every 1-unit increase in \(x\). This is analogous to simple linear regression, but instead of an additive change in the mean, it is a multiplicative change in the odds. If multiple predictors are involved, the others would need to be held fixed for this interpretation.
- If \(\beta_1 > 0\), then \(\exp(\beta_1) > 1\), indicating a positive relationship between \(x\) and the probability and odds of the success event. If \(\beta_1 < 0\), then the opposite holds.
6.2.1 - Fitting the Model in SAS
There are different ways to fit the model depending on the format of the data. As before, for details, you should refer to the SAS or R help. Here are some general guidelines to keep in mind; note that we make a distinction based on the way the data are entered into SAS (or R).
If the data come in tabular (grouped) form, i.e., the response pattern is given with counts (as seen in the previous example):
- PROC GENMOD: We need a variable that specifies the number of cases, which equals the marginal frequency counts or number of trials (e.g., n), and the number of events (y):
model y/n = x1 x2 /link=logit dist=binomial;
- PROC LOGISTIC: We also need a variable that specifies the number of trials:
model y/n = x1 x2;
If the data come in matrix (ungrouped) form, i.e., a subject \(\times\) variables matrix with one line for each subject, like a database:
- PROC GENMOD: We need a variable that specifies the number of cases, which equals the marginal frequency counts or number of trials (e.g., n), and the number of events (y):
model y/n = x1 x2 / link = logit dist = binomial;
- PROC LOGISTIC: We do NOT need a variable that specifies the number of cases:
model y = x1 x2;
For both GENMOD and LOGISTIC, as before, include interaction terms with *, and make sure to include all lower-order terms.
We will follow the SAS output through to explain the different parts of model fitting. The output from R will be essentially the same.
Example 6-1: Student Smoking
Let's begin with collapsed \(2\times2\) table:
|  | Student smokes | Student does not smoke |
|---|---|---|
| 1–2 parents smoke | 816 | 3203 |
| Neither parent smokes | 188 | 1168 |
Let's look at one part of smoke.sas:
data smoke;
input s $ y n ;
cards;
smoke 816 4019
nosmoke 188 1356
;
proc logistic data=smoke descending;
class s (ref=first) / param=ref;
model y/n = s /scale=none;
run;
In the data step, the dollar sign ($), as before, indicates that S is a character-string variable.
In the logistic step, the option descending ensures that we are modeling the probability of the "event", which takes value 1; otherwise, by default, SAS models the probability of the "nonevent".
class S (ref=first) / param=ref;
This code says that S should be coded as a categorical variable using the first category as the reference or zero group. (The first category is "nosmoke," because it comes before "smoke" in alphabetical order). You can vary this order by additional options provided by class and/or by entering data in a different order. See SAS online help for details, and the rest of smoke.sas for more options.
model y/n = s /scale = none
Because we have grouped data (i.e. multiple trials per line of the data set), the model statement uses the "event/trial" syntax, in which y/n appears on the left-hand side of the equal sign. The predictors go on the right-hand side, separated by spaces if there are more than one. An intercept is added automatically by default.
scale=none
This option serves two purposes. One, it will give us the overall goodness-of-fit test statistics, the deviance \(G^2\) and the Pearson \(X^2\). It also enables us to specify a value for a dispersion parameter in order to correct for over- or under-dispersion. In this case, we are NOT controlling for either. Overdispersion is an important concept with discrete data. In the context of logistic regression, overdispersion occurs when the discrepancies between the observed responses \(y_i\) and their predicted values \(\hat{\mu}_i = n_{i}\hat{\pi}_i\) are larger than what the binomial model would predict.
If \(y_i \sim Binomial(n_i, \pi_i)\), the mean is \(\mu_i = n_i \pi_i\) and the variance is \(\dfrac{\mu_i(n_i - \mu_i)}{n_i}\). Overdispersion means that the data show evidence that the variance of the response \(y_i\) is greater than \(\dfrac{\mu_i(n_i - \mu_i)}{n_i}\).
If overdispersion is present in a dataset, the estimated standard errors and test statistics for individual parameters and the overall goodness-of-fit will be distorted and adjustments should be made. We will look at this briefly later when we look into continuous predictors.
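A rough check for overdispersion can be sketched in R (the R output is essentially the same as SAS's; here fit is a hypothetical name for a binomial glm fitted to grouped data): divide the Pearson statistic by its residual degrees of freedom, and refit with a quasi-likelihood family if the ratio is well above 1.

X2 <- sum(residuals(fit, type = "pearson")^2)     # Pearson goodness-of-fit statistic
phi <- X2 / df.residual(fit)                      # estimated dispersion; near 1 if no overdispersion
fit.quasi <- update(fit, family = quasibinomial)  # standard errors inflated by sqrt(phi)
summary(fit.quasi)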
Now let's review some of the output from the program smoke.sas that uses PROC LOGISTIC
Model Information
| Model Information | |
|---|---|
| Data Set | WORK.SMOKE |
| Response Variable (Events) | y |
| Response Variable (Trials) | n |
| Model | binary logit |
| Optimization Technique | Fisher's scoring |
This section, as before, tells us which dataset we are manipulating, the labels of the response and explanatory variables and what type of model we are fitting (e.g. binary logit), and the type of scoring algorithm for parameter estimation. Fisher scoring is a variant of the Newton-Raphson method for ML estimation. In logistic regression they are equivalent.
Response Profile
| Ordered Value | Binary Outcome | Total Frequency |
|---|---|---|
| 1 | Event | 1004 |
| 2 | Nonevent | 4371 |
This tells us how many observations are "events" (i.e., take value 1): the number of children who smoke, 816 + 188 = 1004; and how many are "nonevents": the number who do NOT smoke, (4019 - 816) + (1356 - 188) = 4371. Recall that the model is estimating the probability of the "event".
Class Level Information

| Class | Value | Design Variables |
|---|---|---|
| s | nosmoke | 0 |
|  | smoke | 1 |
From the explanatory categorical (i.e., class) variable S with 2 levels ("nosmoke", "smoke"), we created one dummy variable (i.e., design variable):
\(X_1 = 1\) ("smoke") if parent smoking = at least one,
\(X_1 = 0\) ("nosmoke") if parent smoking = neither
The group with parent smoking = "nosmoke" is coded 0 and serves as the baseline.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Since we are using an iterative procedure to fit the model, that is to find the ML estimates, we need some indication if the algorithm converged.
Overall goodness-of-fit testing
Test: \(H_0 \colon\) current model vs. \(H_A \colon\) saturated model
Deviance and Pearson Goodness-of-Fit Statistics

| Criterion | Value | DF | Value/DF | Pr > ChiSq |
|---|---|---|---|---|
| Deviance | 0.0000 | 0 | . | . |
| Pearson | 0.0000 | 0 | . | . |
The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zero because the model is saturated. However, suppose that we fit the intercept-only model. This is accomplished by removing the predictor from the model statement, like this:
model y/n = / scale=none;
The goodness-of-fit statistics are shown below.
Deviance and Pearson Goodness-of-Fit Statistics

| Criterion | Value | DF | Value/DF | Pr > ChiSq |
|---|---|---|---|---|
| Deviance | 29.1207 | 1 | 29.1207 | <.0001 |
| Pearson | 27.6766 | 1 | 27.6766 | <.0001 |
The Pearson statistic \(X^2 = 27.6766\) is precisely equal to the usual \(X^2\) for testing independence in the \(2\times2\) table, and the deviance \(G^2 = 29.1207\) is precisely equal to the \(G^2\) for testing independence in the \(2\times2\) table. The intercept-only (null) logistic regression model states that the student's smoking is unrelated to the parents' smoking (i.e., it assumes independence, or odds ratio = 1). But clearly, based on the values of the calculated statistics, this model (i.e., independence) does NOT fit well. This example shows that analyzing a \(2\times2\) table for association is equivalent to logistic regression with a single dummy variable. Later on, we will compare these tests to the loglinear model of independence.
The goodness-of-fit statistics, \(X^2\) and \(G^2\), are defined as before in the tests of independence and loglinear models (e.g. compare observed and fitted values). For the \(\chi^2\) approximation to work well, we need the \(n_i\)s sufficiently large so that the expected values, \(\hat{\mu}_i\ge5\), and \(n_i − \hat{\mu}_i \ge 5\) for most of the rows. We can afford to have about 20% of these values less than 5, but none of them should fall below 1.
With real data, we often find that the \(n_i\)s are not big enough for an accurate test, and there is no single best way to handle this; an appropriate approach will depend strongly on the data and the context of the problem.
Analysis of Maximum Likelihood Estimates
Once an appropriate model is fitted, the success probabilities need to be estimated using the model parameters. Note that success probabilities are now NOT simply the ratio of observed number of successes and the number of trials. A model fit introduces a structure on the success probabilities. The estimates will now be functions of model parameters. What are the parameter estimates? What is the fitted model?
Analysis of Maximum Likelihood Estimates

| Parameter |  | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept |  | 1 | -1.8266 | 0.0786 | 540.2949 | <.0001 |
| s | smoke | 1 | 0.4592 | 0.0878 | 27.3361 | <.0001 |
The fitted model is
\(\mbox{logit}(\hat{\pi}_i)=\log\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}=\hat{\beta}_0+\hat{\beta}_1 x_i=-1.827+0.459\mbox{ smoke}\)
where smoke is a dummy variable (e.g. design variable) that takes value 1 if at least one parent smokes and 0 if neither smokes as discussed above.
The estimated \(\hat{\beta}_0=-1.827\), with standard error 0.078, is statistically significant; it says that the log-odds of a child smoking versus not smoking when neither parent smokes (the baseline level) is \(-1.827\).
The estimated \(\hat{\beta}_1=0.459\), with standard error 0.088, is statistically significant; it says that the log odds ratio of a child smoking versus not smoking, comparing students with at least one smoking parent to students with neither parent smoking (the baseline level), is 0.459. The estimated odds ratio is \(\exp(0.459)=1.58\). Thus there is a strong association between the parents' and the child's smoking behavior, and the model fits well.
And the estimated probability of a student smoking given the explanatory variable (e.g. parent smoking behavior) is
\(\hat{\pi}_i=\dfrac{\exp(-1.826+0.4592x_i)}{1+\exp(-1.826+0.4592x_i)}\)
In particular, the predicted probability of a student smoking given that neither parent smokes is
\(P(Y_i=1|x_i=0)=\dfrac{\exp(-1.826+0.4592\times 0)}{1+\exp(-1.826+0.4592\times 0)}=0.14\)
and the predicted probability of a student being a smoker if at least one parent smokes is
\(P(Y_i=1|x_i=1)=\dfrac{\exp(-1.826+0.4592(1))}{1+\exp(-1.826+0.4592(1))}=0.20\)
By adding the statement output out=predict pred=prob to the PROC LOGISTIC call, SAS will save the predicted probabilities in a new dataset, which we can print:
| Obs | s | y | n | prob |
|---|---|---|---|---|
| 1 | smoke | 816 | 4019 | 0.20304 |
| 2 | nosmoke | 188 | 1356 | 0.13864 |
So, we do not need to do this calculation by hand, but it may be useful to try it out to be sure of what's going on. In this model, \(\beta_0\) is the log-odds of a child smoking when the parents do not smoke, \(x_i = 0\). Thus \(\exp(-1.8266)\) is the estimated odds that a student smokes when the parents do not smoke.
Looking at the \(2\times2\) table, the estimated log-odds for nonsmokers is
\(\log\left(\dfrac{188}{1168}\right)=\log(0.161)=-1.8266\)
which agrees with \(\hat{\beta}_0\) from the logistic model.
From the analysis of this example as a \(2\times2\) table, \(\exp(\beta_1)\) should be the estimated odds ratio. The estimated coefficient of the dummy variable,
\(\hat{\beta}_1=0.4592\)
agrees exactly with the log odds ratio from the \(2\times2\) table: \(\log\left(\dfrac{816\times 1168}{188\times 3203}\right)=\log(1.58)=0.459\). That is, \(\exp(0.459) = 1.58\) is the estimated odds ratio of a student smoking, comparing students with at least one smoking parent to students whose parents do not smoke.
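As a quick check, these quantities can be reproduced by hand in R (a sketch only; the coefficient values are taken from the output above, and plogis() is the inverse logit):

b0 <- -1.8266; b1 <- 0.4592
plogis(b0)                   # P(student smokes | neither parent smokes), about 0.139
plogis(b0 + b1)              # P(student smokes | at least one parent smokes), about 0.203
log(188/1168)                # intercept = log-odds in the baseline row
log((816*1168)/(188*3203))   # slope = log odds ratio from the 2x2 table
exp(b1)                      # estimated odds ratio, about 1.58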
To relate this to the interpretation of the coefficients in a linear regression, we could say that for every one-unit increase in the explanatory variable \(x_1\) (e.g., changing from no smoking parents to smoking parents), the odds of "success" \(\pi_i / (1 − \pi_i)\) will be multiplied by \(\exp(\beta_1)\), given that all the other variables are held constant.
This is not surprising, because in the logistic regression model \(\beta_1\) is the difference in the log-odds of children smoking as we move from "nosmoke" (neither parent smokes), \(x_i = 0\), to "smoke" (at least one parent smokes), \(x_i = 1\), and a difference in log-odds is a log odds ratio.
Testing Individual Parameters
Testing the hypothesis that the probability of the characteristic depends on the value of the \(j\)th variable.
Testing \(H_0\colon \beta_j = 0\) versus \(H_A\colon \beta_j \ne 0\).
The Wald chi-squared statistics \(z^2=(\hat{\beta}_j/\mbox{SE}(\hat{\beta}_j))^2\) for these tests are displayed along with the estimated coefficients in the "Analysis of Maximum Likelihood Estimates" section.
The standard error for \(\hat{\beta}_1\), 0.0878, agrees exactly with the standard error that we can calculate from the corresponding \(2\times2\) table.
Analysis of Maximum Likelihood Estimates

| Parameter |  | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept |  | 1 | -1.8266 | 0.0786 | 540.2949 | <.0001 |
| s | smoke | 1 | 0.4592 | 0.0878 | 27.3361 | <.0001 |
These values indicate a significant relationship between the log-odds of a student smoking and the parents' smoking behavior. Alternatively, we can use the information from the "Type 3 Analysis of Effects" section.
Type 3 Analysis of Effects

| Effect | DF | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|
| s | 1 | 27.3361 | <.0001 |
Again, this information indicates that parents' smoking behavior is a significant factor in the model. We could compare \(z^2\) to a chi-square with one degree of freedom; the p-value would then be the area to the right of \(z^2\) under the \(\chi^2_1\) density curve. A value of \(z^2\) (Wald statistic) bigger than 3.84 indicates that we can reject the null hypothesis \(\beta_j = 0\) at the .05-level.
\(\beta_1:\left(\dfrac{0.4592}{0.0878}\right)^2=27.354\)
Confidence Intervals of Individual Parameters
An approximate \((1 − \alpha) 100\%\) confidence interval for \(\beta_j\) is given by
\(\hat{\beta}_j \pm z_{(1-\alpha/2)}SE(\hat{\beta}_j)\)
For example, a 95% confidence interval for \(\beta_1\)
\(0.4592 \pm 1.96 (0.0878)= (0.287112, 0.63128)\)
where 0.0878 is the "Standard Error" from the "Analysis of Maximum Likelihood Estimates" section. Then, the 95% confidence interval for the odds-ratio of a child smoking if one parent is smoking in comparison to neither smoking is
\((\exp(0.287112), \exp(0.63128)) = (1.3325, 1.880)\)
Compare this with the output we get from PROC LOGISTIC:
Odds Ratio Estimates

| Effect | Point Estimate | 95% Wald Confidence Limits | |
|---|---|---|---|
| s smoke vs nosmoke | 1.583 | 1.332 | 1.880 |
When fitting a logistic regression, we need to evaluate the overall fit of the model and the significance of individual parameter estimates, and consider their interpretation. For assessing the fit of the model, we also need to consider the analysis of residuals. The definitions of the Pearson, deviance, and adjusted residuals are as before, and you should be able to carry out this analysis.
If we include the statement...
output out=predict pred=phat reschi=pearson resdev=deviance;
in the PROC LOGISTIC call, then SAS creates a new dataset called "predict" that includes all of the variables in the original dataset, the predicted probabilities \(\hat{\pi}_i\), the Pearson residuals, and the deviance residuals. Then we can add some code to calculate and print out the estimated expected number of successes \(\hat{\mu}_i=n_i\hat{\pi}_i\) and failures \(n_i-\hat{\mu}_i=n_i(1-\hat{\pi}_i)\).
6.2.2 - Fitting the Model in R
There are different ways to run logistic regression depending on the format of the data. Here are some general guidelines to keep in mind, with a simple example outlined in dataformats.R, where we created two binary random variables of length \(n\), e.g., \(n=100\).
##create two binary vectors of length 100
x=sample(c(0,1),100, replace=T)
y=sample(c(0,1),100, replace=T)
If the data come in a tabular form, i.e., response pattern is with counts (as seen in the previous example), the data are said to be "grouped".
> ##create a 2x2 table with counts
> xytab=table(x,y)
> xytab
y
x 0 1
0 24 29
1 26 21
glm(): Let \(Y\) be the response variable capturing the numbers of successes (\(Y = 1\)) and failures (\(Y = 0\)). We need to create a response table that has a count for both the "successes" and "failures" out of \(n\) trials in its columns. Notice that the count table below could also be built from the number of successes and a column computed as \(n-Y\).
> count=cbind(xytab[,2],xytab[,1])
> count
[,1] [,2]
0 29 24
1 21 26
We also need a categorical predictor variable.
> xfactor=factor(c("0","1"))
> xfactor
[1] 0 1
Levels: 0 1
We need to specify the response distribution and a link, which in this case is done by specifying family=binomial("logit").
> tmp3=glm(count~xfactor, family=binomial("logit"))
> tmp3
Call: glm(formula = count ~ xfactor, family = binomial("logit"))
Coefficients:
(Intercept) xfactor1
0.1892 -0.4028
Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
Null Deviance: 1.005
Residual Deviance: -4.441e-15 AIC: 12.72
If the data come in matrix form, i.e., a subject \(\times\) variables matrix with one line for each subject (like a database), the data are said to be "ungrouped".
> xydata=cbind(x,y)
> xydata ## 100 rows, we are showing first 7
x y
[1,] 0 1
[2,] 0 1
[3,] 0 0
[4,] 0 1
[5,] 1 0
[6,] 0 1
[7,] 0 0.....
glm(): We need a binary response variable \(Y\) and a predictor variable \(x\), which in this case is also binary. We need to specify the response distribution and a link, which in this case is done by specifying family=binomial("logit").
> tmp1=glm(y~x, family=binomial("logit"))
> tmp1
Call: glm(formula = y ~ x, family = binomial("logit"))
Coefficients:
(Intercept) x
0.1892 -0.4028
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138.6
Residual Deviance: 137.6 AIC: 141.6
We will follow the R output through to explain the different parts of model fitting. The output from SAS (or from many other software packages) will be essentially the same.
Example 6-2: Student Smoking
Let's begin with the collapsed \(2\times2\) table:
|  | Student smokes | Student does not smoke |
|---|---|---|
| 1–2 parents smoke | 816 | 3203 |
| Neither parent smokes | 188 | 1168 |
In R, we can use the glm()
function and specify the family = binomial(link = logit)
. See the files smoke.R and the output generated in smoke.out. That R code corresponds to the SAS code discussed in the previous section:
#### define the explanatory variable with two levels:
#### 1=one or more parents smoke, 0=no parents smoke
parentsmoke=as.factor(c(1,0))
#### NOTE: if we do parentsmoke=c(1,0) R will treat this as
#### a numeric and not categorical variable
#### need to create a response vector so that it has counts for both "success" and "failure"
response<-cbind(yes=c(816,188),no=c(3203,1168))
response
#### fit the logistic regression model
smoke.logistic<-glm(response~parentsmoke, family=binomial(link=logit))
#### OUTPUT
smoke.logistic
summary(smoke.logistic)
anova(smoke.logistic)
Here is the R output for the \(2\times2\) table that we will use for logistic regression:
> response
yes no
[1,] 816 3203
[2,] 188 1168
Please Note: the table above is different from the one given from the SAS program. We input \(y\) and \(n\) as data columns in SAS; here we just input data columns as yes and no.
summary(smoke.logistic)
gives the following model information:
call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals:
[1] 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.45918 0.08782 5.228 1.71e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2.9121e+01 on 1 degrees of freedom
Residual deviance: -3.7170e-13 on 0 degrees of freedom
AIC: 19.242
Number of Fisher Scoring iterations: 2
Model Information & Model Convergence Status
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
(Dispersion parameter for binomial family taken to be 1)
Number of Fisher Scoring iterations: 2
These sections tell us which dataset we are manipulating, the labels of the response and explanatory variables and what type of model we are fitting (e.g., binary logit), and the type of scoring algorithm for parameter estimation. Fisher scoring is a variant of Newton-Raphson method for ML estimation. In logistic regression they are equivalent. Since we are using an iterative procedure to fit the model, that is, to find the ML estimates, we need some indication if the algorithm converged.
Overdispersion is an important concept with discrete data. In the context of logistic regression, overdispersion occurs when the observed variance in the data tends to be larger than what the binomial model would predict. If \(Y_i\sim\mbox{Binomial}(n_i,\pi_i)\), the mean is \(\mu_i = n_i\pi_i\) and the variance is \(n_i\pi_i(1-\pi_i)\). Both of these rely on the parameter \(\pi_i\), which can be too restrictive. If overdispersion is present in a dataset, the estimated standard errors and test statistics for individual parameters and the overall goodness-of-fit will be distorted and adjustments should be made. We will look at this briefly later when we look into continuous predictors.
Table of Coefficients
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.45918 0.08782 5.228 1.71e-07 ***
--
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This information gives us the fitted model:
\(\log\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}=\hat{\beta}_0+\hat{\beta}_1 x_i=-1.827+0.459\mbox{ parentsmoke1}\)
where parentsmoke1 is a dummy variable (i.e., design variable) that takes value 1 if at least one parent smokes and 0 if neither smokes.
\(x_1 = 1\) ("parentsmoke1") if parent smoking = at least one,
\(x_1= 0\) ("parentsmoke0") if parent smoking = neither
The group for parent smoking = 0 is the baseline. We are modeling the probability of "children smoking".
The estimated \(\hat{\beta}_0=-1.827\), with a standard error of 0.078, is statistically significant; it says that the log-odds of a child smoking versus not smoking when neither parent smokes (the baseline level) is \(-1.827\).
The estimated \(\hat{\beta}_1=0.459\), with a standard error of 0.088, is statistically significant; it says that the log odds ratio of a child smoking versus not smoking, comparing students with at least one smoking parent to students with neither parent smoking (the baseline level), is 0.459. The estimated odds ratio is \(\exp(0.459)=1.58\).
Testing Individual Parameters
Testing the hypothesis that the probability of the characteristic depends on the value of the \(j\)th variable.
Testing \(H_0\colon\beta_j=0\) versus \(H_A\colon\beta_j\ne0\)
The Wald chi-squared statistics \(z^2=(\hat{\beta}_j/\text{SE}(\hat{\beta}_j))^2\) for these tests are displayed along with the estimated coefficients in the "Coefficients" section.
The standard error for \(\hat{\beta}_1\), 0.0878, agrees exactly with the standard error that we can calculate from the corresponding \(2\times2\) table. These values indicate a significant relationship between the log-odds of a student smoking and the parents' smoking behavior; that is, parents' smoking behavior is a significant factor in the model. We could compare \(z^2\) to a chi-square with one degree of freedom; the p-value would then be the area to the right of \(z^2\) under the \(\chi^2_1\) density curve.
A value of \(z^2\) (Wald statistic) bigger than 3.84 indicates that we can reject the null hypothesis \(\beta_j=0\) at the .05-level.
\(\beta_1:\left(\dfrac{0.4592}{0.0878}\right)^2=27.354\)
Confidence Intervals of Individual Parameters:
An approximate \((1 − \alpha) 100\%\) confidence interval for \( \beta_j \) is given by
\(\hat{\beta}_j \pm z_{(1-\alpha/2)} \times SE(\hat{\beta}_j)\)
For example, a 95% confidence interval for \(\beta_1\)
\(0.4592\pm 1.96 (0.0878) = (0.287112, 0.63128)\)
where 0.0878 is the "Standard Error" from the "Coefficients" section. Then, the 95% confidence interval for the odds-ratio of a child smoking if one parent is smoking in comparison to neither smoking is
\((\exp(0.287112), \exp(0.63128)) = (1.3325, 1.880)\)
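These intervals can also be pulled directly from the fitted object; a brief sketch (using the smoke.logistic fit from above; confint.default() returns Wald-type intervals):

confint.default(smoke.logistic)        # Wald intervals on the log-odds scale
exp(confint.default(smoke.logistic))   # intervals for the baseline odds and the odds ratio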
Thus, there is a strong association between the parents' and the children's smoking behavior. But does this model fit?
Overall goodness-of-fit testing
Test: \(H_0\colon\) current model vs. \(H_A\colon\) saturated model
Residual deviance: -3.7170e-13 on 0 degrees of freedom
AIC: 19.242
The goodness-of-fit statistic (the deviance \(G^2\)) from this model is zero because the model is saturated. The fit of the intercept-only model is provided by
Test: \(H_0\colon\) intercept only model vs. \(H_A\colon\) saturated model
Null deviance: 2.9121e+01 on 1 degrees of freedom
Suppose that we fit the intercept-only model. This is accomplished by removing the predictor from the model statement, like this:
glm(response~1, family=binomial(link=logit))
The goodness-of-fit statistics are shown below.
Null deviance: 29.121 on 1 degrees of freedom
Residual deviance: 29.121 on 1 degrees of freedom
AIC: 46.362
The deviance \(G^2 = 29.1207\) is precisely equal to the \(G^2\) for testing independence in this \(2\times2\) table. The intercept-only (null) logistic regression model states that the student's smoking is unrelated to the parents' smoking (i.e., it assumes independence, or odds ratio = 1). But clearly, based on the values of the calculated statistics, this model (i.e., independence) does NOT fit well.
Analysis of deviance table
In R, we can test factors’ effects with the anova
function to give an analysis of deviance table. We only include one factor in this model, so R compares the model with parentsmoke to the intercept-only model, giving the same statistic as above, i.e., the deviance \(G^2 = 29.121\).
> anova(smoke.logistic)
Analysis of Deviance Table
Model: binomial, link: logit
Response: response
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 1 29.121
parentsmoke 1 29.121 0 0.000
This example shows that analyzing a \(2\times2\) table for association is equivalent to logistic regression with a single dummy variable. We can further compare these tests to the loglinear model of independence in Lesson 10.
The goodness-of-fit statistics, \(X^2\) and \(G^2\), are defined as before in the tests of independence and loglinear models (e.g., compare observed and fitted values). For the \(\chi^2\) approximation to work well, we need the \(n_i\)s to be sufficiently large so that the expected values, \(\hat{\mu}_i \ge 5\), and \(n_i - \hat{\mu}_i\ge 5\) for most of the rows. We can afford to have about 20% of these values less than 5, but none of them should fall below 1.
With real data, we often find that the \(n_i\)s are not big enough for an accurate test, and there is no single best way to handle this; an appropriate approach will depend strongly on the data and the context of the problem.
6.2.3 - More on Model-fitting
Suppose two models are under consideration, where one model is a special case or "reduced" form of the other, obtained by setting \(k\) of the regression coefficients (parameters) equal to zero. The larger model is considered the "full" model, and the hypotheses would be
\(H_0\): reduced model versus \(H_A\): full model
Equivalently, the null hypothesis can be stated as the \(k\) predictor terms associated with the omitted coefficients have no relationship with the response, given the remaining predictor terms are already in the model. If we fit both models, we can compute the likelihood-ratio test (LRT) statistic:
\(G^2 = -2 (\log L_0 - \log L_1)\)
where \(L_0\) and \(L_1\) are the maximized likelihood values for the reduced and full models, respectively. The degrees of freedom would be \(k\), the number of coefficients in question. The p-value is the area under the \(\chi^2_k\) curve to the right of \(G^2\).
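In R, the same comparison can be carried out by fitting both models and passing them to anova(); here is a sketch using the response and parentsmoke objects created in the previous section:

fit0 <- glm(response ~ 1, family = binomial(link = logit))            # reduced (intercept-only) model
fit1 <- glm(response ~ parentsmoke, family = binomial(link = logit))  # full model
anova(fit0, fit1, test = "Chisq")   # likelihood-ratio (deviance) test; G^2 = 29.12 on 1 df here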
To perform the test in SAS, we can look at the "Model Fit Statistics" section and examine the value of "−2 Log L" for "Intercept and Covariates." Here, the reduced model is the "intercept-only" model (i.e., no predictors), and "intercept and covariates" is the full model. For our running example, this would be equivalent to testing "intercept-only" model vs. full (saturated) model (since we have only one predictor).
Model Fit Statistics

| Criterion | Intercept Only | Intercept and Covariates (Log Likelihood) | Intercept and Covariates (Full Log Likelihood) |
|---|---|---|---|
| AIC | 5178.510 | 5151.390 | 19.242 |
| SC | 5185.100 | 5164.569 | 32.421 |
| -2 Log L | 5176.510 | 5147.390 | 15.242 |
Larger differences in the "-2 Log L" values lead to smaller p-values and more evidence against the reduced model in favor of the full model. For our example, \(G^2 = 5176.510 - 5147.390 = 29.1207\) with \(2 - 1 = 1\) degree of freedom. Notice that this matches the deviance we got in the earlier output above.
Also, notice that the \(G^2\) we calculated for this example is equal to 29.1207 with 1df and p-value <.0001 from "Testing Global Hypothesis: BETA=0" section (the next part of the output, see below).
Testing the Joint Significance of All Predictors
Testing the null hypothesis that the set of coefficients is simultaneously zero. For example, consider the full model
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\cdots+\beta_k x_k\)
and the null hypothesis \(H_0\colon \beta_1=\beta_2=\cdots=\beta_k=0\) versus the alternative that at least one of the coefficients is not zero. This is like the overall F−test in linear regression. In other words, this is testing the null hypothesis of the intercept-only model:
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)
versus the alternative that the current (full) model is correct. This corresponds to the test in our example because we have only a single predictor term, and the reduced model that removes the coefficient for that predictor is the intercept-only model.
In the SAS output, three different chi-square statistics for this test are displayed in the section "Testing Global Null Hypothesis: Beta=0," corresponding to the likelihood ratio, score, and Wald tests. Recall our brief encounter with them in our discussion of binomial inference in Lesson 2.
Testing Global Null Hypothesis: BETA=0

| Test | Chi-Square | DF | Pr > ChiSq |
|---|---|---|---|
| Likelihood Ratio | 29.1207 | 1 | <.0001 |
| Score | 27.6766 | 1 | <.0001 |
| Wald | 27.3361 | 1 | <.0001 |
Large chi-square statistics lead to small p-values and provide evidence against the intercept-only model in favor of the current model. The Wald test is based on asymptotic normality of ML estimates of \(\beta\)s. Rather than using the Wald, most statisticians would prefer the LR test. If these three tests agree, that is evidence that the large-sample approximations are working well and the results are trustworthy. If the results from the three tests disagree, most statisticians would tend to trust the likelihood-ratio test more than the other two.
In our example, the "intercept only" model or the null model says that student's smoking is unrelated to parents' smoking habits. Thus the test of the global null hypothesis \(\beta_1=0\) is equivalent to the usual test for independence in the \(2\times2\) table. We will see that the estimated coefficients and standard errors are as we predicted before, as well as the estimated odds and odds ratios.
The residual deviance is the difference between \(-2 \log L\) for the saturated model and \(-2 \log L\) for the currently fit model; a large residual deviance indicates that the current model does not fit well. The null deviance is the difference between \(-2 \log L\) for the saturated model and \(-2 \log L\) for the intercept-only model; a large null deviance indicates that the intercept-only model does not fit.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.45918 0.08782 5.228 1.71e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In our \(2\times2\) table smoking example, the residual deviance is almost 0 because the model we built is the saturated model (notice that its degrees of freedom are 0 too). The null deviance corresponds to the likelihood-ratio entry in the "Testing Global Null Hypothesis: BETA=0" section of the SAS output.
For our example, Null deviance = 29.1207 with df = 1. Notice that this matches the deviance we got in the earlier text above.
The Hosmer-Lemeshow Statistic
An alternative statistic for measuring overall goodness-of-fit is the Hosmer-Lemeshow statistic.
This is a Pearson-like chi-square statistic that is computed after the data are grouped by having similar predicted probabilities. It is more useful when there is more than one predictor and/or continuous predictors in the model too. We will see more on this later.
\(H_0\): the current model fits well
\(H_A\): the current model does not fit well
To calculate this statistic:
- Group the observations according to model-predicted probabilities ( \(\hat{\pi}_i\))
- The number of groups is typically determined such that there is roughly an equal number of observations per group
- The Hosmer-Lemeshow (HL) statistic, a Pearson-like chi-square statistic, is computed on the grouped data but does NOT have a limiting chi-square distribution because the observations in groups are not from identical trials. Simulations have shown that this statistic can be approximated by a chi-squared distribution with \(g − 2\) degrees of freedom, where \(g\) is the number of groups.
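Putting these steps together, here is a minimal sketch of the computation in R (assuming fit is a hypothetical binary-response glm fitted to ungrouped data, y is the 0/1 response, and \(g = 10\) groups):

phat <- fitted(fit)                               # model-predicted probabilities
grp <- cut(phat,
           breaks = unique(quantile(phat, probs = seq(0, 1, length.out = 11))),
           include.lowest = TRUE)                 # groups of similar predicted probability
obs <- tapply(y, grp, sum)                        # observed successes per group
ex  <- tapply(phat, grp, sum)                     # expected successes per group
ng  <- tapply(phat, grp, length)                  # group sizes
HL  <- sum((obs - ex)^2 / (ex*(1 - ex/ng)))       # Pearson-like statistic on the grouped data
pchisq(HL, df = length(obs) - 2, lower.tail = FALSE)   # approximate p-value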
Warning about the Hosmer-Lemeshow goodness-of-fit test:
- It is a conservative statistic, i.e., its value is smaller than what it should be, and therefore the rejection probability of the null hypothesis is smaller.
- It has low power in predicting certain types of lack of fit such as nonlinearity in explanatory variables.
- It is highly dependent on how the observations are grouped.
- If too few groups are used (e.g., 5 or less), it almost always fails to reject the current model fit. This means that it's usually not a good measure if only one or two categorical predictor variables are involved, and it's best used for continuous predictors.
In the model statement, the option lackfit tells SAS to compute the HL statistic and print the partitioning. For our example, because we have a small number of groups (i.e., 2), this statistic gives a perfect fit (HL = 0, p-value = 1). Instead of deriving the diagnostics, we will look at them from a purely applied viewpoint. Recall the definitions and introductions to the regression residuals and Pearson and Deviance residuals.
Residuals
The Pearson residuals are defined as
\(r_i=\dfrac{y_i-\hat{\mu}_i}{\sqrt{\hat{V}(\hat{\mu}_i)}}=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}}\)
The contribution of the \(i\)th row to the Pearson statistic is
\(\dfrac{(y_i-\hat{\mu}_i)^2}{\hat{\mu}_i}+\dfrac{((n_i-y_i)-(n_i-\hat{\mu}_i))^2}{n_i-\hat{\mu}_i}=r^2_i\)
and the Pearson goodness-of fit statistic is
\(X^2=\sum\limits_{i=1}^N r^2_i\)
which we would compare to a \(\chi^2_{N-p}\) distribution. The deviance test statistic is
\(G^2=2\sum\limits_{i=1}^N \left\{ y_i\text{log}\left(\dfrac{y_i}{\hat{\mu}_i}\right)+(n_i-y_i)\text{log}\left(\dfrac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right\}\)
which we would again compare to \(\chi^2_{N-p}\), and the contribution of the \(i\)th row to the deviance is
\(2\left\{ y_i\log\left(\dfrac{y_i}{\hat{\mu}_i}\right)+(n_i-y_i)\log\left(\dfrac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right\}\)
We will note how these quantities are derived through appropriate software and how they provide useful information to understand and interpret the models.
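In R, these quantities can be extracted from a fitted glm without hand computation; a short sketch (assuming fit is a grouped binomial fit such as smoke.logistic above):

residuals(fit, type = "pearson")          # Pearson residuals r_i
residuals(fit, type = "deviance")         # deviance residuals
sum(residuals(fit, type = "pearson")^2)   # Pearson X^2
deviance(fit)                             # deviance G^2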
6.2.4 - Multi-level Predictor
The concepts discussed with binary predictors extend to predictors with multiple levels. In this lesson, we consider \(Y_i\) a binary response and \(x_i\) a discrete explanatory variable (with \(k = 3\) levels), and make connections to the analysis of \(2\times3\) tables. But the basic ideas extend to any \(2\times J\) table.
We begin by replicating the analysis of the original \(3\times2\) table with logistic regression.
| How many parents smoke? | Student smokes? Yes (Z = 1) | Student smokes? No (Z = 2) |
|---|---|---|
| Both (Y = 1) | 400 | 1380 |
| One (Y = 2) | 416 | 1823 |
| Neither (Y = 3) | 188 | 1168 |
First, we re-express the data in terms of \(Y_i=\) number of smoking students, and \(n_i=\) number of students for the three groups based on parents behavior (we can think of \(i\) as an index for the rows in the table).
| How many parents smoke? | \(y_i\) | \(n_i\) |
|---|---|---|
| Both | 400 | 1780 |
| One | 416 | 2239 |
| Neither | 188 | 1356 |
Then we decide on a baseline level for the explanatory variable \(x\) and create \(k - 1\) indicators if \(x\) is a categorical variable with \(k\) levels. For our example, we set "Neither" as the baseline for the parent smoking predictor and define a pair of indicators that take one of two values, specifically,
\(x_{1i}=1\) if parent smoking = "One" ,
\(x_{1i}=0\) otherwise
\(x_{2i}=1\) if parent smoking = "Both",
\(x_{2i}=0\) otherwise.
If these two indicator variables were added as columns in the table above, \(x_{1i}\) would have values \((0,1,0)\), and \(x_{2i}\) would have values \((1,0,0)\). Next, we let \(\pi_i\) denote the probability of student smoking for the \(i\)th row group so that finally, the model is
\(\log\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}\)
for \(i=1,\ldots,3\). Thus, the log-odds of a student smoking is \(\beta_0\) for students whose parents don't smoke ("neither" group), \(\beta_0+\beta_1\) for students with one smoking parent ("one" group), and \(\beta_0+\beta_2\) for students with both smoking parents. In particular, \(\beta_1\) is the difference in log odds or, equivalently, the log odds ratio for smoking when comparing students with one smoking parent against students with neither smoking parent. Similarly, \(\beta_2\) is the log odds ratio comparing students with both parents smoking against students with neither smoking parent.
Based on the tabulated data, where we saw \(1.42=(416\times 1168)/(188\times 1823)\) and \(1.80=(400\times 1168)/(188\times 1380)\), we're not surprised to see \(\hat{\beta}_1=\log(1.42)=0.351\) and \(\hat{\beta}_2=\log(1.80)=0.588\). The estimated intercept should be \(\hat{\beta}_0=\log(188/1168)=-1.826\).
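A quick arithmetic check of these values (a sketch in R, computed directly from the table counts):

log(188/1168)                 # estimated intercept, about -1.826
log((416*1168)/(188*1823))    # log odds ratio for "one" vs "neither", about 0.35
log((400*1168)/(188*1380))    # log odds ratio for "both" vs "neither", about 0.59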
Fitting A Multi-level Predictor Model in SAS and R
There are different models fit within the smoke.sas SAS code. Here is one where '0' = neither parent smokes, '1' = one smokes, and '2' = both smoke, and we use PROC LOGISTIC; notice we could use proc GENMOD too.
data smoke;
input s $ y n ;
cards;
2 400 1780
1 416 2239
0 188 1356
;
proc logistic data=smoke descending;
class s (ref=first)/ param=ref;
model y/n = s / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
title 'Logistic regression for 2x3 tables with residuals';
run;
data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;
proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
title 'Logistic regression diagnostics for 2x3 table';
run;
The option param=ref tells SAS to create a set of two dummy variables to distinguish among the three categories, where '0' = neither is the baseline because of the option ref=first (see the previous section for details).
Another way of doing the same in smoke.sas, but using character labels (e.g., "both", "one", "neither") rather than numbers for the categories:
data smoke;
input s $ y n ;
cards;
both 400 1780
one 416 2239
neither 188 1356
;
proc logistic data=smoke descending;
class s (ref='neither') / order=data param=ref;
model y/n = s /scale=none lackfit;
output out=predict pred=prob;
title 'Logistic regression for 2x3 table';
run;
proc print data=predict;
run;
In the class
statement, the option order=data
tells SAS to sort the categories of S by the order in which they appear in the dataset rather than alphabetical order. The option ref='neither'
makes neither the reference group (i.e. the group for which both dummy variables are zero). Let's look at some relevant portions of the output that differ from the analysis of the corresponding \(2\times2\) table from the previous section of the notes.
| Model Information | |
|---|---|
| Data Set | WORK.SMOKE |
| Response Variable (Events) | y |
| Response Variable (Trials) | n |
| Model | binary logit |
| Optimization Technique | Fisher's scoring |
| Number of Observations Read | 3 |
Fisher scoring is a variant of Newton-Raphson method for ML estimation. In logistic regression they are equivalent. Notice how there are 3 observations since we have 3 groupings by the levels of the explanatory variable.
| Class | Value | Design Variables | |
|---|---|---|---|
| s | 0 | 0 | 0 |
|  | 1 | 1 | 0 |
|  | 2 | 0 | 1 |
From an explanatory variable S with 3 levels (0,1,2), we created two indicator variables:
\(x_1=1\) if parent smoking = One,
\(x_1=0\) otherwise,
\(x_2=1\) if parent smoking=Both,
\(x_2=0\) otherwise.
Since parent smoking = Neither is equal to 0 for both indicator variables, it serves as the baseline.
| Class | Value | Design Variables | |
|---|---|---|---|
| s | both | 1 | 0 |
|  | one | 0 | 1 |
|  | neither | 0 | 0 |
Here, we specified "neither" to be the reference level. We used the option order=data so that "both" takes the design values (1,0) and "one" takes (0,1). If we didn't use that option, the levels would be ordered alphabetically, but in this case the two orderings coincide.
The parameter estimates are reported in the output below.
Analysis of Maximum Likelihood Estimates | | | | | | |
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | -1.8266 | 0.0786 | 540.2949 | <.0001 |
s | 1 | 1 | 0.3491 | 0.0955 | 13.3481 | 0.0003 |
s | 2 | 1 | 0.5882 | 0.0970 | 36.8105 | <.0001 |
Here, \(s_1\) and \(s_2\) correspond to \(x_1\) and \(x_2\). The saturated model is
\(\mbox{logit}(\pi)=-1.8266+0.3491 x_1+0.5882 x_2\)
and the estimated probability of a child smoking, given the explanatory variable is
\(\hat{\pi}_i=\dfrac{\exp(-1.8266+0.3491 x_1+0.5882 x_2)}{1+\exp(-1.8266+0.3491 x_1+0.5882 x_2)}\)
For example, the predicted probability of a student smoking, given that only one parent is smoking is
\(P(Y=1|x_1=1,x_2=0)= \dfrac{\exp(-1.8266+0.3491)}{1+\exp(-1.8266+0.3491)}= 0.1858\)
Residuals
In SAS, if we include the statement
output out=predict pred=prob reschi=pearson resdev=deviance;
then SAS creates a new dataset called "predict" that includes all of the variables in the original dataset, the predicted probabilities \(\hat{\pi}_i\), and the Pearson and deviance residuals. We can add some code to calculate and print out the estimated expected numbers of successes \(\hat{\mu}_i=n_i\hat{\pi}_i\) ("shat") and failures \(n_i-\hat{\mu}_i=n_i(1-\hat{\pi}_i)\) ("fhat").
data diagnostics;
set predict;
shat = n*prob;
fhat = n*(1-prob);
run;
proc print data=diagnostics;
var s y n prob shat fhat pearson deviance;
title 'Logistic regression diagnostics for 2x3 table';
run;
Running this program gives a new output section:
Obs | s | y | n | prob | shat | fhat | pearson | deviance |
---|---|---|---|---|---|---|---|---|
1 | 2 | 400 | 1780 | 0.22472 | 400.000 | 1380.00 | -.000000031 | 0 |
2 | 1 | 416 | 2239 | 0.18580 | 416.000 | 1823.00 | -3.0886E-15 | 0 |
3 | 0 | 188 | 1356 | 0.13864 | 188.000 | 1168.00 | -.000001291 | -.000001196 |
Here "s" are the levels of the categorical predictor for parents' smoking behavior, "y" as before the number of students smoking for each level of the predictor, "n" the marginal counts for each level of the predictor", "prob" is the estimated probability of "success" (e.g. a student smoking given the level of the predictor), "s-hat" and "f-hat" expected number of successes and failures respectively, and "pearson" and "deviance" are Pearson and Deviance residuals.
All of the "s-hat" and "f-hat" values, that is predicted number of successes and failures are greater than 5.0, so the chi-square approximation is trustworthy.
Below is the R code (from smoke.R) that replicates the analysis of the original \(2\times3\) table with logistic regression.
#### Fitting logistic regression for a 2x3 table #####
#### Here is one way to read the data from the table and use glm()
#### Notice how we need to use family=binomial (link=logit)
#### while with log-linear models we used family=poisson(link=log)
parentsmoke=as.factor(c(2,1,0))
response<-cbind(c(400,416,188),c(1380,1823,1168))
response
smoke.logistic<-glm(response~parentsmoke, family=binomial(link=logit))
First, let’s see the table we created for the analysis.
> response
[,1] [,2]
[1,] 400 1380
[2,] 416 1823
[3,] 188 1168
You should again notice the difference in data input. In SAS, we entered y (successes) and n (totals) as data columns; in R, the response matrix has one column of "yes" counts and one column of "no" counts.
Next, the model information is displayed.
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals:
[1] 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.34905 0.09554 3.654 0.000259 ***
parentsmoke2 0.58823 0.09695 6.067 1.3e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.8366e+01 on 2 degrees of freedom
Residual deviance: -3.7215e-13 on 0 degrees of freedom
AIC: 28.165
Number of Fisher Scoring iterations: 2
If we don’t specify the baseline explicitly (e.g., with relevel()), R sets the first level of the factor as the default; here, it’s "0". The parentsmoke1 and parentsmoke2 variables correspond to the \(x_1\) and \(x_2\) indicators. The saturated model is
\(\mbox{logit}(\pi)=-1.8266+0.3491 x_1+0.5882 x_2\)
and the estimated probability of a child smoking given the explanatory variable:
\(\hat{\pi}_i=\dfrac{\exp(-1.8266+0.3491 x_1+0.5882 x_2)}{1+\exp(-1.8266+0.3491 x_1+0.5882 x_2)}\)
For example, the predicted probability of a student smoking given that only one parent is smoking is
\(P(Y=1|x_1=1,x_2=0)= \dfrac{\exp(-1.8266+0.3491)}{1+\exp(-1.8266+0.3491)}= 0.1858\)
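As a quick check, the same probability can be computed in R with the inverse-logit function plogis():
plogis(-1.8266 + 0.3491)   # about 0.186, matching the fitted value above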
Residuals
In R, deviance residuals are given directly in the output by summary() function.
> smoke.logistic<-glm(response~parentsmoke, family=binomial(link=logit))
> summary(smoke.logistic)
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals:
[1] 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.34905 0.09554 3.654 0.000259 ***
parentsmoke2 0.58823 0.09695 6.067 1.3e-09 ***
---
We can also obtain the deviance and Pearson residuals with the residuals.glm() function.
To obtain the deviance residuals:
> residuals(smoke.logistic)
[1] 0 0 0
To obtain the Pearson residuals:
> residuals(smoke.logistic, type="pearson")
1 2 3
-2.440787e-13 -4.355932e-13 -1.477321e-11
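For grouped binomial data, the Pearson residual for row \(i\) is \((y_i-n_i\hat{\pi}_i)/\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}\). A small sketch that reproduces these (essentially zero) values from the objects defined above; y, n, and phat are names introduced here:
y    <- response[, 1]                     # observed successes
n    <- rowSums(response)                 # group totals
phat <- fitted(smoke.logistic)            # fitted success probabilities
(y - n*phat) / sqrt(n*phat*(1 - phat))    # essentially zero: the model is saturated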
To obtain the predicted values, use the predict.glm() function, which lets us specify the type of predicted values we want through the type argument. The default is on the scale of the linear predictor; the alternative type = "response" is on the scale of the response variable. Thus, for a binomial model, the default predictions are log-odds (probabilities on the logit scale), and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
For example, the code below gives predicted probabilities.
> predict(smoke.logistic, type="response")
1 2 3
0.2247191 0.1857972 0.1386431
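The expected numbers of successes and failures (the "shat" and "fhat" columns in the SAS diagnostics) can then be computed directly; a sketch:
phat <- predict(smoke.logistic, type = "response")
n    <- rowSums(response)
cbind(shat = n*phat, fhat = n*(1 - phat))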
All of the predicted numbers of successes and failures are greater than 5.0, so the chi-square approximation is trustworthy.
Overall goodness-of-fit
The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zeroes because the model is saturated. Suppose that we fit the intercept-only model as before by removing the predictors from the model statement:
proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
title 'Logistic regression for intercept-only model';
run;
The goodness-of-fit statistics are shown below.
Deviance and Pearson Goodness-of-Fit Statistics | ||||
---|---|---|---|---|
Criterion | Value | DF | Value/DF | Pr > ChiSq |
Deviance | 38.3658 | 2 | 19.1829 | <.0001 |
Pearson | 37.5663 | 2 | 18.7832 | <.0001 |
Number of events/trials observations: 3
The Pearson statistic \(X^2= 37.5663\) and the deviance \(G^2 = 38.3658\) are precisely equal to the usual \(X^2\) and \(G^2\) for testing independence in the \(2\times 3\) table. In this example, the saturated model fits perfectly (as always), but the independence model does not fit well.
We can use the Hosmer-Lemeshow test to assess the overall fit of the model. As in the previous example, the Hosmer and Lemeshow statistic is not very meaningful since the number of groups is small.
Partition for the Hosmer and Lemeshow Test | | | | | |
---|---|---|---|---|---|
Group | Total | Event Observed | Event Expected | Nonevent Observed | Nonevent Expected |
1 | 1356 | 188 | 188.00 | 1168 | 1168.00 |
2 | 2239 | 416 | 416.00 | 1823 | 1823.00 |
3 | 1780 | 400 | 400.00 | 1380 | 1380.00 |
Hosmer and Lemeshow Goodness-of-Fit Test | ||
---|---|---|
Chi-Square | DF | Pr > ChiSq |
0.0000 | 1 | 1.0000 |
In R, the corresponding goodness-of-fit information appears at the bottom of the summary() output:
Null deviance: 3.8366e+01 on 2 degrees of freedom
Residual deviance: -3.7215e-13 on 0 degrees of freedom
AIC: 28.165
The residual deviance is almost 0 because the model is saturated. The null deviance gives \(G^2= 38.366\), which is precisely the usual \(G^2\) for testing independence in the \(2\times 3\) table. Since this statistic is large, it yields a small p-value and provides significant evidence against the intercept-only model in favor of the current model. The same result can be obtained from the anova() function. Clearly, in this example, the saturated model fits perfectly (as always), but the independence model does not fit well.
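Here is a sketch of that comparison with anova(); smoke.null is a name introduced here for the intercept-only fit:
smoke.null <- glm(response ~ 1, family = binomial(link = logit))
anova(smoke.null, smoke.logistic, test = "Chisq")   # deviance difference 38.366 on 2 df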
We can use the Hosmer-Lemeshow test (available in the R file HosmerLemeshow.R) to assess the overall fit of the model. As in the previous example, the Hosmer and Lemeshow statistic is not very meaningful since the number of groups is small.
> ## hosmerlem() takes the vector of successes, predicted vector of success and g=# of groups as input
> ## produce the vector of predicted success "yhat"
> yhat=rowSums(response)*predict(smoke.logistic, type="response")
> yhat
1 2 3
400 416 188
> hosmerlem(response[,1], yhat, g=3) ## here run 3 groups
X^2 Df P(>Chi)
"-1.00593633284243e-24" "1" "."
We have shown that analyzing a \(2\times 3\) table for associations is equivalent to a binary logistic regression with two dummy variables as predictors. For \(2\times J\) tables, we would fit a binary logistic regression with \(J − 1\) indicator variables.
Testing the Joint Significance of All Predictors
Starting with the (full) model
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2\)
the null hypothesis \(H_0\colon\beta_1=\beta_2=0\) specifies the intercept-only (reduced) model:
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)
In general, this test has degrees of freedom equal to the number of slope parameters, which is 2 in this case. Large chi-square statistics lead to small p-values and provide evidence to reject the intercept-only model in favor of the full model.
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 38.3658 | 2 | <.0001 |
Score | 37.5663 | 2 | <.0001 |
Wald | 37.0861 | 2 | <.0001 |
Testing for an Arbitrary Group of Coefficients
The likelihood-ratio statistic is
\(G^2 = (-2 \log\mbox{ likelihood from reduced model})-(-2 \log\mbox{ likelihood from full model})\)
and the degrees of freedom is \(k=\) the number of parameters differentiating the two models. The p-value is \(P(\chi^2_k \geq G^2)\).
Model Fit Statistics | | | |
---|---|---|---|
Criterion | Intercept Only | Intercept and Covariates (Log Likelihood) | Intercept and Covariates (Full Log Likelihood) |
AIC | 5178.510 | 5144.144 | 28.165 |
SC | 5185.100 | 5163.913 | 47.933 |
-2 Log L | 5176.510 | 5138.144 | 22.165 |
For our example, \(G^2 = 5176.510 − 5138.144 = 38.3658\) with \(3 − 1 = 2\) degrees of freedom. Notice that this matches
Likelihood Ratio 38.3658 2 <.0001
from the "Testing Global Hypothesis: BETA=0" section. Here is the model we just looked at in SAS.
data smoke;
input s1 s2 $ y n ;
cards;
0 1 400 1780
1 0 416 2239
0 0 188 1356
;
proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = s1 s2 / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;
proc logistic data=smoke descending;
class s1 s2 (ref=first)/ param=ref;
model y/n = s1 / scale=none lackfit;
output out=predict pred=prob reschi=pearson resdev=deviance;
run;
The second PROC LOGISTIC step fits the reduced model
model y/n = s1 / scale=none lackfit;
which can be compared with the full model (containing both s1 and s2) using the likelihood-ratio statistic described above.
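A rough R analogue of this comparison is sketched below; x1 and smoke.reduced are names introduced here, and the reduced fit keeps only the indicator for one smoking parent:
x1 <- as.numeric(parentsmoke == "1")                  # indicator for one smoking parent
smoke.reduced <- glm(response ~ x1, family = binomial(link = logit))
anova(smoke.reduced, smoke.logistic, test = "Chisq")  # tests the dropped coefficient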
Testing Individual Parameters
For the test of the significance of a single variable \(x_j\)
\(H_0\colon\beta_j=0\) versus \(H_A\colon\beta_j\ne0\)
we can use the (Wald) test statistic and p-value. A large value of \(z\) (relative to standard normal) or \(z^2\) (relative to chi-square with 1 degree of freedom) indicates that we can reject the null hypothesis and conclude that \(\beta_j\) is not 0.
From the second row of this part of the output,
Analysis of Maximum Likelihood Estimates | | | | | | |
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | -1.8266 | 0.0786 | 540.2949 | <.0001 |
s | 1 | 1 | 0.3491 | 0.0955 | 13.3481 | 0.0003 |
s | 2 | 1 | 0.5882 | 0.0970 | 36.8105 | <.0001 |
\(z^2=\left(\dfrac{0.3491}{0.0955}\right)^2=13.3481\)
The corresponding R output reports the z value, whose square is the Wald chi-square:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.34905 0.09554 3.654 0.000259 ***
parentsmoke2 0.58823 0.09695 6.067 1.3e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
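The Wald chi-square reported by SAS is simply the square of R's z value; a quick check using the fit above:
b <- coef(summary(smoke.logistic))
z <- b["parentsmoke1", "Estimate"] / b["parentsmoke1", "Std. Error"]
c(z = z, z.squared = z^2)   # about 3.65 and 13.35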
A large value of \(z\) (relative to standard normal) indicates that we can reject the null hypothesis and conclude that \(\beta_j\) is not 0.
Confidence Intervals: An approximate \((1 − \alpha)100\)% confidence interval for \(\beta_j\) is given by
\(\hat{\beta}_j \pm z_{(1-\alpha/2)} \times SE(\hat{\beta}_j)\)
For example, a 95% CI for \(\beta_1\) is
\(0.3491 \pm 1.96 (0.0955) = (0.1619, 0.5363)\)
Then, the 95% CI for the odds ratio of a student smoking, if one parent is smoking in comparison with neither smoking, is
\((\exp(0.1619), \exp(0.5363)) = (1.176, 1.710)\)
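In R, Wald intervals on both scales can be obtained with confint.default(); a sketch:
ci <- confint.default(smoke.logistic)   # Wald confidence limits for the coefficients
ci
exp(ci)                                 # the same limits on the odds-ratio scale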
Since this interval does not include the value 1, we can conclude that student and parents' smoking behaviors are associated. Furthermore,
Odds Ratio Estimates | | | |
---|---|---|---|
Effect | Point Estimate | 95% Wald Confidence Limits | |
s 1 vs 0 | 1.418 | 1.176 | 1.710 |
s 2 vs 0 | 1.801 | 1.489 | 2.178 |
- The estimated conditional odds ratio of a student smoking between one parent smoking and neither smoking is \(\exp(\beta_1) = \exp(0.3491) = 1.418\).
- The estimated conditional odds ratio of a student smoking between both parents smoking and neither smoking is \(\exp(\beta_2) = \exp(0.5882) = 1.801\).
- The estimated conditional odds ratio of a student smoking between both parents smoking and one smoking is \(\exp(\beta_2)/\exp(\beta_1) = 1.801/1.418 = 1.27 = \exp(\beta_2-\beta_1) = \exp(0.5882 − 0.3491) = 1.27\). That is, compared with a student who has only one parent smoking, a student who has both parents smoking has an odds of smoking 1.27 times as high.
6.3 - Correspondence to k-way tables
Just as three-way tables allowed us to fix or condition on one variable while focusing on the relationship between two others, the logistic regression model can easily accommodate more than a single predictor. Each slope coefficient is a measure of the conditional relationship between its predictor and the response, given the other predictors are held constant. In other words, all relationships are conditional ones.
We start with the familiar example of the boy scouts and rates of delinquency and see how the results we calculated from the three-way table earlier (and more) can be produced from the logistic regression model.
6.3.1 - Estimating Odds Ratios
Recall the \(3\times2\times2\) table that we examined in Lesson 5 that classifies 800 boys according to S = socioeconomic status, B = whether a boy is a scout, and D = juvenile delinquency status.
Socioeconomic status | Boy scout | Delinquent: Yes | Delinquent: No |
---|---|---|---|
Low | Yes | 11 | 43 |
Low | No | 42 | 169 |
Medium | Yes | 14 | 104 |
Medium | No | 20 | 132 |
High | Yes | 8 | 196 |
High | No | 2 | 59 |
Because the outcome variable D is binary, we can express many models of interest using binary logistic regression. Before handling the full three-way table, let us consider the \(2\times2\) marginal table for B and D as we did in Lesson 5. We concluded that the boy scout status (B) and the delinquent status (D) are associated and that (from the table below)
Boy scout | Delinquent: Yes | Delinquent: No |
---|---|---|
Yes | 33 | 343 |
No | 64 | 360 |
the estimated log-odds ratio is
\(\log\left(\dfrac{33\times 360}{64\times 343}\right)=-0.6140\)
with a standard error of \(\sqrt{\dfrac{1}{33}+\dfrac{1}{343}+\dfrac{1}{64}+\dfrac{1}{360}}=0.2272\). That is, we estimate that being a boy scout is associated with a log-odds of delinquency that is lower by 0.614; the corresponding odds ratio is 0.541. Now let’s fit the logistic regression model
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x\)
where \(x\) is a dummy variable
\(x\) = 0 if non-scout,
\(x\) = 1 if scout.
See the SAS code in the program scout.sas below:
options nocenter nodate nonumber linesize=72;
data scout1;
input S $ y n;
cards;
scout 33 376
nonscout 64 424
;
proc logistic data=scout1;
class S / param=ref ref=first;
model y/n = S / scale=none;
run;
Note that SAS will set the baseline to "non-scout" because it comes before "scout" in alphabetical order. Some output from this part of the program:
Analysis of Maximum Likelihood Estimates | | | | | | |
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | -1.7272 | 0.1357 | 162.1110 | <.0001 |
S | scout | 1 | -0.6140 | 0.2272 | 7.3032 | 0.0069 |
The estimated coefficient of \(x\)
\(\hat{\beta}_1=-0.6140\)
is identical to the log-odds ratio from the analysis of the \(2\times2\) table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the \(2\times2\) table analysis. Also, in this model, \(\beta_0\) is the log-odds of delinquency for non-scouts (\(x=0\)). Looking at the \(2\times2\) table, the estimated log-odds for non-scouts is
\(\log\left(\dfrac{64}{360}\right)=-1.7272\)
which agrees with \(\hat{\beta}_0\) from the logistic model.
The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zeroes because the model is saturated. Let's fit the intercept-only model by removing the predictor from the model statement:
model y/n = / scale=none;
The goodness-of-fit statistics are shown below.
Deviance and Pearson Goodness-of-Fit Statistics | ||||
---|---|---|---|---|
Criterion | Value | DF | Value/DF | Pr > ChiSq |
Deviance | 7.6126 | 1 | 7.6126 | 0.0058 |
Pearson | 7.4652 | 1 | 7.4652 | 0.0063 |
Number of events/trials observations: 2
The Pearson statistic \(X^2=7.4652\) is identical to the ordinary chi-square statistic for testing independence in the \(2\times2\) table. And the deviance \(G^2= 7.6126\) is identical to the \(G^2\) for testing independence in the \(2\times2\) table.
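As a check, the same Pearson statistic comes from the usual chi-square test of independence on the \(2\times2\) table; a sketch in R (BD is a name introduced here):
BD <- matrix(c(33, 343, 64, 360), nrow = 2, byrow = TRUE,
             dimnames = list(B = c("scout", "nonscout"), D = c("yes", "no")))
chisq.test(BD, correct = FALSE)   # X-squared = 7.4652 on 1 df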
The R code for this model and other models in this section for the boy scout example is in the R program scout.R. Here is part of the code that works with the \(2\times2\) (marginal) table between scouting and delinquency.
#### corresponds to the data scout1 in scout.SAS
#### Fits a Logistic regression with S="scout" or "nonscout"
S=factor(c("scout","nonscout"))
Sscout=(S=="scout")
Snonscout=(S=="nonscout")
y=c(33,64)
n=c(376,424)
count=cbind(y,n-y)
result=glm(count~Sscout+Snonscout,family=binomial("logit"))
summary(result)
Part of the R output includes:
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7272 0.1357 -12.732 < 2e-16 ***
SscoutTRUE -0.6140 0.2272 -2.702 0.00688 **
SnonscoutTRUE NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the baseline here is effectively "non-scout": since Sscout and Snonscout are redundant, R drops the second one (Snonscout), so the intercept corresponds to non-scouts. The estimated coefficient for scout is
\(\hat{\beta}_1=−0.6140\)
and is identical to the log-odds ratio from the analysis of the \(2\times2\) table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the \(2\times2\) table analysis.
Also, in this model, \(\beta_0\) is the log-odds of delinquency for nonscouts (\(x_i=0\)). Looking at the \(2\times2\) table, the estimated log-odds for non-scouts is
\(\log(64/360) = −1.7272\)
which agrees with \(\hat{\beta}_0\) from the logistic model.
The goodness-of-fit statistics are shown below:
Null deviance: 7.6126e+00 on 1 degrees of freedom
Residual deviance: 4.0190e-14 on 0 degrees of freedom
AIC: 15.083
As we have seen before, the residual deviance is essentially 0 because the model is saturated. The null deviance corresponds to the deviance goodness-of-fit statistics in the SAS output. Here, the deviance \(G^2 = 7.6126\) is identical to the \(G^2\) for testing independence in the \(2\times2\) table.
Now let us do a similar analysis for the \(3\times2\) table that classifies subjects by S and D:
Socioeconomic status | Delinquent: Yes | Delinquent: No |
---|---|---|
Low | 53 | 212 |
Medium | 34 | 236 |
High | 10 | 255 |
Two odds ratios of interest are
\((53\times 236) / (34\times 212) = 1.735\)
\((53\times 255) / (10\times 212) = 6.375\)
We estimate that the odds of delinquency for the S = low group are 1.735 times as high as for the S = medium group, and 6.375 times as high as for the S = high group. The estimated log odds ratios are
\(\log 1.735 = .5512\) and \(\log 6.375 = 1.852\)
and the standard errors are
\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{34}+\dfrac{1}{236}}=0.2392\)
\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{10}+\dfrac{1}{255}}=0.3571\)
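A quick R check of these odds ratios and standard errors:
(53*236)/(34*212)                   # 1.735
(53*255)/(10*212)                   # 6.375
sqrt(1/53 + 1/212 + 1/34 + 1/236)   # 0.239
sqrt(1/53 + 1/212 + 1/10 + 1/255)   # 0.357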
Now let us replicate this analysis using logistic regression. First, we re-express the data in terms of \(y_i\) = number of delinquents and \(n_i\) = number of boys for the \(i\)th row (SES group):
\(y_i\) | \(n_i\) | |
---|---|---|
Low | 53 | 265 |
Medium | 34 | 270 |
High | 10 | 265 |
Then we define a pair of dummy indicators,
\(x_1=1\) if S=medium,
\(x_1=0\) otherwise,
\(x_2=1\) if S=high,
\(x_2=0\) otherwise.
Let \(\pi\) = probability of delinquency. Then, the model
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2\)
says that the log-odds of delinquency are \(\beta_0\) for S = low, \(\beta_0+\beta_1\) for S = medium, and \(\beta_0+\beta_2\) for S = high.
The SAS code for fitting this model is shown below (see scout.sas).
options nocenter nodate nonumber linesize=72;
data new;
input S $ y n;
cards;
low 53 265
medium 34 270
high 10 265
;
proc logistic data=new;
class S / order=data param=ref ref=first;
model y/n = S / scale=none;
run;
Some relevant portions of the output are shown below.
Deviance and Pearson Goodness-of-Fit Statistics | ||||
---|---|---|---|---|
Criterion | Value | DF | Value/DF | Pr > ChiSq |
Deviance | 0.0000 | 0 | . | . |
Pearson | 0.0000 | 0 | . | . |
Number of events/trials observations: 3
Why are the goodness-of-fit statistics zero here?
Since we are fitting the saturated model here, we are using all three lines of data; our current model IS the saturated model. So, in comparing the current model with the saturated model, the deviance has to be zero.
As for the information about boy scout status, we have collapsed over it. This is similar to collapsing a three-way table into a two-way table, where there is no information left about boy scout status.
Model Fit Statistics | | | |
---|---|---|---|
Criterion | Intercept Only | Intercept and Covariates (Log Likelihood) | Intercept and Covariates (Full Log Likelihood) |
AIC | 593.053 | 560.801 | 20.942 |
SC | 597.738 | 574.855 | 34.995 |
-2 Log L | 591.053 | 554.801 | 14.942 |
Testing Global Null Hypothesis: BETA=0 | |||
---|---|---|---|
Test | Chi-Square | DF | Pr > ChiSq |
Likelihood Ratio | 36.2523 | 2 | <.0001 |
Score | 32.8263 | 2 | <.0001 |
Wald | 27.7335 | 2 | <.0001 |
Type 3 Analysis of Effects | | | |
---|---|---|---|
Effect | DF | Wald Chi-Square | Pr > ChiSq |
S | 2 | 27.7335 | <.0001 |
Analysis of Maximum Likelihood Estimates | | | | | | |
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | -1.3863 | 0.1536 | 81.4848 | <.0001 |
S | medium | 1 | -0.5512 | 0.2392 | 5.3080 | 0.0212 |
S | high | 1 | -1.8524 | 0.3571 | 26.9110 | <.0001 |
Odds Ratio Estimates | | | |
---|---|---|---|
Effect | Point Estimate | 95% Wald Confidence Limits | |
S medium vs low | 0.576 | 0.361 | 0.921 |
S high vs low | 0.157 | 0.078 | 0.316 |
In this case, the "intercept only" model says that delinquency is unrelated to socioeconomic status, so the test of the global null hypothesis \(\beta_1=\beta_2=0\) is equivalent to the usual test for independence in the \(3\times2\) table. The estimated coefficients and SEs are as we predicted, and the estimated odds ratios are
\(\exp(−.5512) = 0.576 = 1/1.735\),
\(\exp(−1.852) = 0.157 = 1/6.375\)
Here is how we can fit this model in R (from scout.R, corresponds to the scout2 data in the scout.sas program).
S=factor(c("low","medium","high"))
y=c(53,34,10)
n=c(265,270,265)
count=cbind(y,n-y)
Smedium=(S=="medium")
Shigh=(S=="high")
result=glm(count~Smedium+Shigh,family=binomial("logit"))
summary(result)
Notice that we only use “Smedium” and “Shigh” in the glm() formula, so the baseline is "low" for this model.
R output:
Call:
glm(formula = count ~ Smedium + Shigh, family = binomial("logit"))
Deviance Residuals:
[1] 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3863 0.1536 -9.027 < 2e-16 ***
SmediumTRUE -0.5512 0.2392 -2.304 0.0212 *
ShighTRUE -1.8524 0.3571 -5.188 2.13e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.6252e+01 on 2 degrees of freedom
Residual deviance: 6.8390e-14 on 0 degrees of freedom
AIC: 20.942
Number of Fisher Scoring iterations: 3
The null deviance is the \(G^2\) that corresponds to the deviance goodness-of-fit statistic in the SAS output. Here, the deviance is \(G^2 = 36.252\), so we can conclude that delinquency is related to socioeconomic status. The test of the global null hypothesis \(\beta_1=\beta_2=0\) is equivalent to the usual test for independence in the \(3\times2\) table. The estimated coefficients and SEs are as we predicted, and the estimated odds ratios are
\(\exp(−.5512) = 0.576 = 1/1.735\),
\(\exp(−1.852) = 0.157 = 1/6.375\).
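These quantities can be read off the fitted R object directly (result, from the code above):
exp(coef(result))         # baseline odds and the two odds ratios vs. low
1/exp(coef(result))[-1]   # inverted: about 1.735 and 6.375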
Notice that we did not say anything here about the scout status. We have "ignored" that information because we collapsed over that variable. Next, we will see how this plays out with logistic regression.
6.3.2 - Collapsing Tables
In this example, we could have also arranged the input data like this:
S | B | \(y_i\) | \(n_i\) |
---|---|---|---|
low | scout | 11 | 54 |
low | non-scout | 42 | 211 |
medium | scout | 14 | 118 |
medium | non-scout | 20 | 152 |
high | scout | 8 | 204 |
high | non-scout | 2 | 61 |
A SAS program for fitting the same model is shown below.
data new;
input S $ B $ y n;
cards;
low scout 11 54
low nonscout 42 211
medium scout 14 118
medium nonscout 20 152
high scout 8 204
high nonscout 2 61
;
proc logistic data=new;
class S / order=data param=ref ref=first;
model y/n = S / scale=none;
run;
The parameter estimates from this new program are exactly the same as before:
Analysis of Maximum Likelihood Estimates | | | | | | |
---|---|---|---|---|---|---|
Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
Intercept | | 1 | -1.3863 | 0.1536 | 81.4848 | <.0001 |
S | medium | 1 | -0.5512 | 0.2392 | 5.3080 | 0.0212 |
S | high | 1 | -1.8524 | 0.3571 | 26.9110 | <.0001 |
But the overall fit statistics are different! Before, we had \(X^2=0\) and \(G^2=0\) because the model was saturated (there were three parameters and \(N = 3\) lines of data). But now, the fit statistics are:
Deviance and Pearson Goodness-of-Fit Statistics | ||||
---|---|---|---|---|
Criterion | Value | DF | Value/DF | Pr > ChiSq |
Deviance | 0.1623 | 3 | 0.0541 | 0.9834 |
Pearson | 0.1602 | 3 | 0.0534 | 0.9837 |
Number of events/trials observations: 6
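The same thing happens in R. Here is a sketch using the six disaggregated rows; the object names (S6, y6, n6, fit6) are introduced here:
S6 <- factor(rep(c("low", "medium", "high"), each = 2),
             levels = c("low", "medium", "high"))
y6 <- c(11, 42, 14, 20, 8, 2)
n6 <- c(54, 211, 118, 152, 204, 61)
fit6 <- glm(cbind(y6, n6 - y6) ~ S6, family = binomial)
coef(summary(fit6))   # same estimates and standard errors as the three-row fit
deviance(fit6)        # about 0.16 on 3 df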
The model appears to fit very well, but it is no longer saturated. What happened? Recall that \(X^2\) and \(G^2\) are testing the null hypothesis that the current model is correct, versus the alternative of a saturated model. When we disaggregated the data by levels of B, using six input lines rather than three, the current model did not change but the saturated model did; the saturated model was enlarged to six parameters.
It is very important for you to understand how you entered the data and what model you are fitting. If you understand the basic concepts, then you can apply model comparisons with any statistical software application.
Another way to interpret the overall \(X^2\) and \(G^2\) goodness-of-fit tests is that they are testing the significance of all omitted covariates. If we collapse the data over B and use only three lines of data, then SAS is unaware of the existence of B. But if we disaggregate the data by levels of B and do not include it in the model, then SAS has the opportunity to test the fit of the current model—in which the probability of delinquency varies by S alone—against the saturated alternative in which the probability of delinquency varies by each combination of the levels of S and B. When the data are disaggregated, the goodness-of-fit tests are actually testing the hypothesis that D is unrelated to B once S has been taken into account—i.e., that D and B are conditionally independent given S.
Here’s another way to think about it. The current model has three parameters:
- an intercept, and
- two indicators for S.
But the alternative has six parameters. We can think of these six parameters as an intercept and five dummies to distinguish among the six rows of data, but then it's not easy to see how the current model becomes a special case of it. So, we think of the six parameters as
- an intercept,
- two dummies for S,
- one dummy for B, and
- two interaction terms for SB.
Now it has become clear that the current model is a special case of this model in which the coefficients for B and the SB interactions are zero. The overall \(X^2\) and \(G^2\) statistics for the disaggregated data are testing the joint significance of the B dummy and the SB interactions.
So, should we aggregate, or should we not? If the current model is true, then it doesn’t matter; we get exactly the same estimated coefficients and standard errors either way. But dis-aggregating gives us the opportunity to test the significance of the omitted terms for B and SB.
Therefore, it often makes sense to dis-aggregate your dataset by variables that are not included in the model, because it gives you the opportunity to test the overall fit of your model. But that strategy has limits. If you dis-aggregate the data too much, the \(n_i\)s may become too small to reliably test the fit by \(X^2\) and \(G^2\).
6.3.3 - Different Logistic Regression Models for Three-way Tables
In this part of the lesson we will consider different binary logistic regression models for three-way tables and their link to log-linear models. Let us return to the \(3\times2\times2\) table:
Socioeconomic status | Boy scout | Delinquent: Yes | Delinquent: No |
---|---|---|---|
Low | Yes | 11 | 43 |
Low | No | 42 | 169 |
Medium | Yes | 14 | 104 |
Medium | No | 20 | 132 |
High | Yes | 8 | 196 |
High | No | 2 | 59 |
As we discussed in Lesson 5, there are many different models that we could fit to this table. But if we think of D as a response and B and S as potential predictors, we can focus on a subset of models that make sense.
Let \(\pi\) be the probability of delinquency. The simplest model to consider is the null or intercept-only model,
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\) (1)
in which D is unrelated to B or S. If we were to fit this model in PROC LOGISTIC using the disaggregated data (all six lines), we would find that the \(X^2\) and \(G^2\) statistics are identical to those we obtained in Lesson 5 from testing the null hypothesis "S and B are independent of D". That is, testing the overall fit of model (1), i.e., the intercept-only model, is equivalent to testing the fit of the log-linear model (D, SB), because (1) says that D is unrelated to S and B but makes no assumptions about whether S and B are related.
After (1), we may want to fit the logit model that has a main effect for B,
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1\) (2)
where
\(x_1=1\) if B=scout,
\(x_1=0\) otherwise.
If the data are disaggregated into six lines, the goodness-of-fit tests for model (2) will be equivalent to the test for (DB, SB) log-linear model, which says that D and S are conditionally independent given B. This makes sense, because (2) says that S has no effect on D once B has been taken into account.
The model that has main effects for S,
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_2 x_2+\beta_3 x_3\) (3)
where
\(x_2=1\) if S = medium,
\(x_2=0\) otherwise,
\(x_3=1\) if S = high,
\(x_3=0\) otherwise,
says that B has no effect on D once S has been taken into account. The goodness-of-fit tests for (3) are equivalent to testing the null hypothesis that (DS, BS) fits, i.e. that D and B are conditionally independent given S.
The logit model
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_3\) (4)
has main effects for both B and S and corresponds to the model of homogeneous association that we discussed in Lesson 5. We could not fit that model at the time because its ML estimates have no closed-form solution, so we computed the CMH and Breslow-Day statistics instead. But with logistic regression software, fitting this model is no more difficult than fitting any other model.
This model says that the effect of B on D, when expressed in terms of odds ratios, is identical across the levels of S. Equivalently, it says that the odds ratios describing the relationship between S and D are identical across the levels of B. If this model does not fit, we have evidence that the effect of B on D varies across the levels of S, or that the effect of S on D varies across the levels of B.
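A sketch of fitting model (4) in R, reusing the six-row objects (S6, y6, n6) introduced in the collapsing example above; B6 and fit.hom are names introduced here:
B6 <- factor(rep(c("scout", "nonscout"), times = 3), levels = c("nonscout", "scout"))
fit.hom <- glm(cbind(y6, n6 - y6) ~ B6 + S6, family = binomial)
summary(fit.hom)
deviance(fit.hom)   # goodness of fit on 6 - 4 = 2 df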
Finally, the saturated model can be written as
\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3x_3+\beta_4x_1x_2+\beta_5 x_1x_3\)
which has the main effects for B and S and their interactions. This model has \(X^2=G^2=0\) with zero degrees of freedom (see scout.sas).
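And the saturated model, which adds the \(B\times S\) interaction, reproduces the table exactly (continuing the sketch above):
fit.sat <- glm(cbind(y6, n6 - y6) ~ B6 * S6, family = binomial)
deviance(fit.sat)   # essentially zero, on 0 degrees of freedom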
6.4 - Lesson 6 Summary
The logit model represents how a binary (or multinomial) response variable is related to a set of explanatory variables, which can be discrete and/or continuous. In this lesson, we focused on binary logistic regression and categorical predictors, which allowed direct comparisons to the results we saw earlier for two and three-way tabulated data.
We continue the discussion in the next lesson by focusing on continuous predictors. While most of the concepts from classical regression will carry over to the logistic model, there are some additional challenges now. For example, recall how for categorical predictors we were able to group the data into a table with \(N\) rows, where each row represented a unique predictor combination? That may not be possible when the predictor is continuous because every value may be unique, and this will complicate how the goodness-of-fit tests work.
Additional References
- Collett, D. (1991). Analysis of Binary Data.
- Fey, M. (2002). Measuring a binary response's range of influence in logistic regression. The American Statistician, 56, 5-9.
- Fienberg, S. E. The Analysis of Cross-Classified Categorical Data, 2nd ed. Cambridge, MA.
- Hosmer, D. W. & Lemeshow, S. (1989). Applied Logistic Regression.
- McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, 2nd ed.
- Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705-724.
- Rice, J. C. (1994). Logistic regression: An introduction. In B. Thompson (ed.), Advances in Social Science Methodology, Vol. 3, 191-245. Greenwich, CT: JAI Press.
- SAS Institute (1995). Logistic Regression Examples Using the SAS System, Version 6.
- Strauss, D. (1999). The many faces of logistic regression. The American Statistician.