Thus far our focus has been on describing interactions or associations between two or three categorical variables, mostly via single summary statistics and with significance testing. From this lesson on, we will focus on modeling. Models can handle more complicated situations and analyze the simultaneous effects of multiple variables, including mixtures of categorical and continuous variables. In Lesson 6 and Lesson 7 [1], we study binary logistic regression, which we will see is an example of a generalized linear model [2].
Binary logistic regression is a special type of regression in which a binary response variable is related to a set of explanatory variables, which can be discrete and/or continuous. The important point to note here is that in linear regression, the expected values of the response variable are modeled based on a combination of values taken by the predictors. In logistic regression, the probability (or odds) of the response taking a particular value is modeled based on the combination of values taken by the predictors. Like regression (and unlike the loglinear models [3] that we will see later), we make an explicit distinction between a response variable and one or more predictor (explanatory) variables. We begin with two-way tables, then progress to three-way tables, where all explanatory variables are categorical. Then we introduce binary logistic regression with continuous predictors as well. In the last part we will focus on model diagnostics and model selection.
Logistic regression is applicable, for example, if:
Key Concepts
Objectives

Thus far our focus has been on describing interactions or associations between two or three categorical variables, mostly via single summary statistics and with significance testing. Models can handle more complicated situations and analyze the simultaneous effects of multiple variables, including mixtures of categorical and continuous variables. For example, the Breslow-Day statistic only works for 2 × 2 × K tables, while loglinear models will allow us to test for homogeneous associations in I × J × K and higher-dimensional tables. We will focus on a special class of models known as the generalized linear models (GLIMs, or GLMs in Agresti).
The structural form of the model describes the patterns of interactions and associations. The model parameters provide measures of strength of associations. In models, the focus is on estimating the model parameters. The basic inference tools (e.g., point estimation, hypothesis testing, and confidence intervals) will be applied to these parameters. When discussing models, we will keep in mind:
For example, recall a simple linear regression model
For a review, if you wish, see a handout labeled LinRegExample.doc [6] on modeling average water usage given the amount of bread production; e.g., estimated water usage is positively related to bread production:
Water = 2273 + 0.0799 Production
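As a quick sketch (the coefficients are taken from the fitted line above; the production figure plugged in below is made up for illustration), prediction is just evaluation of the line:

```python
# Fitted line from the handout: Water = 2273 + 0.0799 * Production
def predicted_water(production):
    """Predicted average water usage for a given bread production level."""
    return 2273 + 0.0799 * production

# e.g., at a hypothetical production level of 10,000 units:
estimate = predicted_water(10000)
```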
First, let’s clear up some potential misunderstandings about terminology. The term general linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors. It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only). The form is $y_i\sim N(x_i^T\beta, \sigma^2),$ where $x_i$ contains known covariates and $\beta$ contains the coefficients to be estimated. These models are fit by least squares and weighted least squares using, for example: SAS Proc GLM or R functions lsfit() (older, uses matrices) and lm() (newer, uses data frames).
The term generalized linear model (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1983; 2nd edition 1989). In these models, the response variable $y_i$ is assumed to follow an exponential family distribution with mean $\mu_i$, which is assumed to be some (often nonlinear) function of $x_i^T\beta$. Some would call these "nonlinear" because $\mu_i$ is often a nonlinear function of the covariates, but McCullagh and Nelder consider them to be linear, because the covariates affect the distribution of $y_i$ only through the linear combination $x_i^T\beta$. The first widely used software package for fitting these models was called GLIM. Because of this program, "GLIM" became a well-accepted abbreviation for generalized linear models, as opposed to "GLM", which is often used for general linear models. Today, GLIMs are fit by many packages, including SAS Proc Genmod and the R function glm(). Notice, however, that Agresti uses the shorthand GLM instead of GLIM, and we will use GLM.
The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, loglinear models etc. The table below provides a good summary of GLMs following Agresti (ch. 4, 2013):
| Model | Random | Link | Systematic |
| --- | --- | --- | --- |
| Linear Regression | Normal | Identity | Continuous |
| ANOVA | Normal | Identity | Categorical |
| ANCOVA | Normal | Identity | Mixed |
| Logistic Regression | Binomial | Logit | Mixed |
| Loglinear | Poisson | Log | Categorical |
| Poisson Regression | Poisson | Log | Mixed |
| Multinomial response | Multinomial | Generalized Logit | Mixed |
There are three components to any GLM:
Assumptions:
For a more detailed discussion refer to Agresti(2007), Ch. 3, Agresti (2013), Ch.4, and/or McCullagh & Nelder (1989).
Following are examples of GLM components for models that we are already familiar with, such as linear regression, and for some of the models that we will cover in this class, such as logistic regression and loglinear models.
Simple Linear Regression models how the expected value of a continuous response variable depends on a set of explanatory variables, where the index i stands for each data point:
\(Y_i=\beta_0+\beta_1 x_i+ \epsilon_i\)
or
\(E(Y_i)=\beta_0+\beta_1 x_i\)
Binary Logistic Regression models how a binary response variable Y depends on a set of k explanatory variables, X = (X_{1}, X_{2}, ... X_{k}).
\(\text{logit}(\pi)=\text{log} \left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\ldots+\beta_k x_k\)
which models the log odds of probability of "success" as a function of explanatory variables.
\(\eta=\text{logit}(\pi)=\text{log} \left(\dfrac{\pi}{1-\pi}\right)\)
More generally, the logit link models the log odds of the mean, and the mean here is π. Binary logistic regression models are also known as logit models when the predictors are all categorical.
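To make the logit link concrete, here is a minimal sketch in Python (the function names are our own):

```python
import math

def logit(p):
    """Log odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse logit: maps a log odds value back to a probability."""
    return math.exp(eta) / (1 + math.exp(eta))

# The two functions undo each other:
p = inv_logit(logit(0.2))
```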
Loglinear Model models the expected cell counts as a function of levels of categorical variables; e.g., for a two-way table the saturated model is
\(\text{log}(\mu_{ij})=\lambda+\lambda^A_i+\lambda^B_j+\lambda^{AB}_{ij}\)
where μ_{ij} = E(n_{ij}), as before, are the expected cell counts (the mean in each cell of the two-way table), A and B represent the two categorical variables, the λ's are model parameters, and we are modeling the natural log of the expected counts.
The loglinear models are more general than logit models, and some logit models are equivalent to certain loglinear models [3]. A loglinear model is also equivalent to a Poisson regression model when all explanatory variables are discrete. For additional details see Agresti (2007), Sec. 3.3, and Agresti (2013), Section 4.3 (for counts), Section 9.2 (for rates), and Section 13.2 (for random effects).
But there are some limitations of GLMs too, such as:
There are ways around these restrictions; e.g., consider analysis for matched data, or use NLMIXED [8] in SAS, or {nlme} [9] package in R, or consider other models, other software packages.
Key Concepts
Objectives

Binary logistic regression estimates the probability that a characteristic is present (e.g., the probability of "success") given the values of explanatory variables, in this case a single categorical variable: π = Pr(Y = 1 | X = x). Suppose a physician is interested in estimating the proportion of diabetic persons in a population. Naturally she knows that not all sections of the population have an equal probability of 'success', i.e., being diabetic. Older individuals, those with hypertension, and those with a family history of diabetes are more likely to have diabetes. Let the predictor variable X be any of the risk factors that might contribute to the disease. The probability of success will then depend on the level of the risk factor.
Variables:
Y_{i} = 1 if the trait is present in observation (person, unit, etc...) i
Y_{i} = 0 if the trait is NOT present in observation i
Model:
\[\pi_i=Pr(Y_i=1\mid X_i=x_i)=\dfrac{\text{exp}(\beta_0+\beta_1 x_i)}{1+\text{exp}(\beta_0+\beta_1 x_i)}\]
or,
\[\begin{align}
\text{logit}(\pi_i)&=\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)\\
&= \beta_0+\beta_1 x_i
\end{align}\]
or, with k predictors,
\[\text{logit}(\pi_i)= \beta_0+\beta_1 x_{i1} + \ldots + \beta_k x_{ik}\]
Assumptions:
Model Fit:
Parameter Estimation:
The maximum likelihood estimator (MLE) for (β_{0}, β_{1}) is obtained by finding \((\hat{\beta}_0,\hat{\beta}_1)\) that maximizes:
\(L(\beta_0,\beta_1)=\prod\limits_{i=1}^N \pi_i^{y_i}(1-\pi_i)^{n_i-y_i}=\prod\limits_{i=1}^N \dfrac{\text{exp}\{y_i(\beta_0+\beta_1 x_i)\}}{\{1+\text{exp}(\beta_0+\beta_1 x_i)\}^{n_i}}\)
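A minimal numerical sketch (using the grouped smoking data analyzed later in this lesson as an assumed example): the log of this likelihood can be evaluated directly, and it is larger at the ML estimates reported later than at other candidate values:

```python
import math

# Grouped data (x_i, y_i, n_i); x_i = 1 means at least one parent smokes.
data = [(1, 816, 4019), (0, 188, 1356)]

def loglik(b0, b1):
    """Binomial loglikelihood, up to an additive constant:
    l(b0,b1) = sum_i [ y_i*eta_i - n_i*log(1 + exp(eta_i)) ], eta_i = b0 + b1*x_i."""
    ll = 0.0
    for x, y, n in data:
        eta = b0 + b1 * x
        ll += y * eta - n * math.log(1 + math.exp(eta))
    return ll
```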
In general, there are no closed-form solutions, so the ML estimates are obtained by using iterative algorithms such as Newton-Raphson (NR) or iteratively reweighted least squares (IRWLS). In Agresti (2013), see Section 4.6.1 for GLMs, and for logistic regression, see Sections 5.5.4-5.5.5.
More advanced material (not required): Brief overview of Newton-Raphson (NR). Suppose we want to maximize a loglikelihood $l(\theta)$ with respect to a parameter $\theta=(\theta_1,\ldots,\theta_p)^T$. At each step of NR, the current estimate $\theta^{(t)}$ is updated as \[\theta^{(t+1)}=\theta^{(t)}-\left[l''\left(\theta^{(t)}\right)\right]^{-1}l'\left(\theta^{(t)}\right),\] where $l'(\theta)$ is the vector of first derivatives (the score) and $l''(\theta)$ is the matrix of second derivatives (the Hessian). Applying NR to logistic regression: for the logit model with $p-1$ predictors and $\beta=(\beta_0, \beta_1, \ldots, \beta_{p-1})^T$, the likelihood is \[L(\beta)=\prod_{i=1}^N \pi_i^{y_i}(1-\pi_i)^{n_i-y_i},\] so the loglikelihood is, up to an additive constant, \[l(\beta)=\sum_{i=1}^N \left[y_i x_i^T\beta - n_i \log\left\{1+\exp\left(x_i^T\beta\right)\right\}\right].\] The first derivative of $x_i^T\beta$ with respect to $\beta_j$ is $x_{ij}$, thus \[\frac{\partial l}{\partial \beta_j}=\sum_{i=1}^N (y_i-n_i\pi_i)x_{ij}.\] The second derivatives, used in computing the standard errors of the parameter estimates $\hat{\beta}$, are \[\frac{\partial^2 l}{\partial\beta_j\partial\beta_k}=-\sum_{i=1}^N n_i\pi_i(1-\pi_i)x_{ij}x_{ik}.\] For reasons including numerical stability and speed, it is generally advisable to avoid computing matrix inverses directly. Thus in many implementations, clever methods are used to obtain the required information without directly constructing the inverse, or even the Hessian.
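As an illustrative sketch only (using the grouped smoking data analyzed later in this lesson as an assumed example), Newton-Raphson for a two-parameter logit model can be coded directly:

```python
import math

# Grouped data (x_i, y_i, n_i); x_i = 1 means at least one parent smokes.
data = [(1, 816, 4019), (0, 188, 1356)]

def newton_raphson(steps=25):
    b0, b1 = 0.0, 0.0  # starting values
    for _ in range(steps):
        # Score vector U and negative Hessian H of the binomial loglikelihood
        u0 = u1 = h00 = h01 = h11 = 0.0
        for x, y, n in data:
            eta = b0 + b1 * x
            pi = math.exp(eta) / (1 + math.exp(eta))
            w = n * pi * (1 - pi)
            u0 += y - n * pi
            u1 += (y - n * pi) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system H d = U, then update beta <- beta + d
        det = h00 * h11 - h01 * h01
        d0 = (h11 * u0 - h01 * u1) / det
        d1 = (h00 * u1 - h01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
    return b0, b1

b0_hat, b1_hat = newton_raphson()
```

The iterates converge to the same estimates that SAS and R report for these data.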
Interpretation of Parameter Estimates:
\(\dfrac{\text{exp}(\beta_0+\beta_1(x_{i1}+1))}{\text{exp}(\beta_0+\beta_1 x_{i1})}=\text{exp}(\beta_1)\)
In general, the logistic model stipulates that the effect of a covariate on the chance of "success" is linear on the logodds scale, or multiplicative on the odds scale.
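A tiny sketch with made-up coefficients (b0 and b1 below are hypothetical) illustrates this: the odds at x + 1 divided by the odds at x is exp(b1), whatever the starting value of x:

```python
import math

b0, b1 = -1.0, 0.5  # hypothetical coefficients, for demonstration only

def odds(x):
    """Odds of success at covariate value x under the logistic model."""
    return math.exp(b0 + b1 * x)

ratio = odds(3) / odds(2)  # a one-unit increase multiplies the odds by exp(b1)
```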
Inference for Logistic Regression:
The table below classifies 5375 high school students according to the smoking behavior of the student (Z) and the smoking behavior of the student's parents (Y). We saw this example in Lesson 3 (Measuring Associations [7] in I × J tables, smokeindep.sas [8] and smokeindep.R [9]).
| How many parents smoke? | Student smokes? Yes (Z = 1) | Student smokes? No (Z = 2) |
| --- | --- | --- |
| Both (Y = 1) | 400 | 1380 |
| One (Y = 2) | 416 | 1823 |
| Neither (Y = 3) | 188 | 1168 |
The test for independence yields X^{2} = 37.6 and G^{2} = 38.4 with 2 df (p-values are essentially zero), so Y and Z are related. It is natural to think of Z as a response and Y as a predictor in this example. We might be interested in exploring the dependency of student's smoking behavior on neither parent smoking versus at least one parent smoking. Thus we can combine, or collapse, the first two rows of our 3 × 2 table and look at a new 2 × 2 table:
|  | Student smokes | Student does not smoke |
| --- | --- | --- |
| 1–2 parents smoke | 816 | 3203 |
| Neither parent smokes | 188 | 1168 |
For the chi-square test of independence, this table has X^{2} = 27.7, G^{2} = 29.1, p-value ≈ 0, and estimated odds ratio \(\hat{\theta} = 1.58\). Therefore, we estimate that a student is 58% more likely, on the odds scale, to smoke if he or she has at least one smoking parent than none.
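These statistics can be reproduced directly from the cell counts; a small sketch in Python:

```python
import math

# Collapsed 2x2 table: rows = parents' smoking, columns = student smokes / not
observed = [[816, 3203], [188, 1168]]
row = [sum(r) for r in observed]            # row totals: 4019, 1356
col = [sum(c) for c in zip(*observed)]      # column totals: 1004, 4371
total = sum(row)                            # 5375 students

X2 = G2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row[i] * col[j] / total          # expected count under independence
        X2 += (observed[i][j] - expected) ** 2 / expected
        G2 += 2 * observed[i][j] * math.log(observed[i][j] / expected)

odds_ratio = (816 * 1168) / (188 * 3203)
```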
But what if:
These are just some of the possibilities of logistic regression, which cannot be handled within a framework of goodnessoffit only.
Consider the simplest case:
Arrange the data in our running example like this,
|  | y_{i} | n_{i} |
| --- | --- | --- |
| 1–2 parents smoke | 816 | 4019 |
| Neither parent smokes | 188 | 1356 |
where y_{i} is the number of children who smoke, n_{i} is the number of children for a given level of parents' smoking behavior, and π_{i} = P(Y_{i} = 1 | x_{i}) is the probability of a randomly chosen student i smoking given the parents' smoking status. Here i takes values 1 and 2. Thus, we claim that all students who have at least one parent smoking will have the same probability of "success", and all students who have neither parent smoking will have the same probability of "success", though for the two groups the success probabilities will be different. Then the response variable Y has a binomial distribution:
\(y_i \sim Bin(n_i,\pi_i)\)
Explanatory variable X is a dummy variable such that
X_{i} = 0 if neither parent smokes,
X_{i} = 1 if at least one parent smokes.
Understanding the use of dummy variables is important in logistic regression.
Think about the following question, then click on the icon to the left to display an answer. Can you explain to someone what is meant by a "dummy variable"?
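As a tiny illustration (the function and category labels below are our own), a dummy variable just recodes a categorical level as 0 or 1:

```python
# Hypothetical indicator coding for parents' smoking status:
# 1 if at least one parent smokes, 0 if neither does.
def smoke_dummy(parent_status):
    return 0 if parent_status == "neither" else 1

codes = [smoke_dummy(s) for s in ["neither", "one", "both"]]
```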
Then the logistic regression model can be expressed as:
\(\text{logit}(\pi_i)=\text{log}\dfrac{\pi_i}{1-\pi_i}=\beta_0+\beta_1 X_i\) (1) or
\(\pi_i=\dfrac{\text{exp}(\beta_0+\beta_1 x_i)}{1+\text{exp}(\beta_0+\beta_1 x_i)}\) (2)
and it says that the log odds of a randomly chosen student smoking is β_{0} when neither parent smokes, and β_{0} + β_{1} when at least one parent smokes.
There are different ways to do this depending on the format of the data. As before, for details you need to refer to SAS or R help. Here are some general guidelines to keep in mind. Please note that we make a distinction about the way the data are entered into SAS (or R).
If data come in tabular form, i.e., the response pattern is given with counts (as seen in the previous example), use the events/trials syntax:
In PROC GENMOD: model y/n = x1 x2 /link=logit dist=binomial [any other options you may want];
In PROC LOGISTIC: model y/n = x1 x2 / [put any other options you may want here];
If data come in matrix form, i.e., a subject × variables matrix with one line for each subject, like a database:
In PROC GENMOD: model y = x1 x2 / link = logit dist = binomial [put any other options you may want here];
In PROC LOGISTIC: model y = x1 x2 / [any other options you may want];
For both GENMOD and LOGISTIC, as before, include interaction terms with *, and make sure to include all lower order terms.
We will follow the SAS output through to explain the different parts of model fitting. The output from R will be essentially the same.
Let's begin with collapsed 2x2 table:
|  | Student smokes | Student does not smoke |
| --- | --- | --- |
| 1–2 parents smoke | 816 | 3203 |
| Neither parent smokes | 188 | 1168 |
Let's look at one part of smoke.sas [11]:
In the data step, the dollar sign $, as before, indicates that S is a character-string variable.
In the logistic step, the statement:
If y_{i} ~ Bin(n_{i}, π_{i}), the mean is μ_{i} = n_{i} π_{i} and the variance is μ_{i}(n_{i} − μ_{i})/n_{i}. Overdispersion means that the data show evidence that the variance of the response y_{i} is greater than μ_{i}(n_{i} − μ_{i})/n_{i}.
If overdispersion is present in a dataset, the estimated standard errors and test statistics for individual parameters and the overall goodness-of-fit will be distorted and adjustments should be made. We will look at this briefly later when we look into continuous predictors.
Now let's review some of the output from the program smoke.sas [11] that uses PROC LOGISTIC ('jumpdown links' below):
This section, as before, tells us which dataset we are manipulating, the labels of the response and explanatory variables, what type of model we are fitting (e.g., binary logit), and the type of scoring algorithm for parameter estimation. Fisher scoring is a variant of the Newton-Raphson method for ML estimation. In logistic regression the two are equivalent.
This information tells us how many observations were "successful", e.g., how many "event" observations take value 1, that is how many children smoke, 816 + 188 = 1004 and "nonevent", i.e., how many do NOT smoke, (4019 − 816) + (1356 − 188) = 4371. Recall that the model is estimating the probability of the "event".
Class Level Information Design
From an explanatory categorical (i.e, class) variable S with 2 levels (0,1), we created one dummy variable (e.g. design variable):
X_{1} = 1 ("smoke") if parent smoking = at least one,
X_{1} = 0 ("nosmoke") if parent smoking = neither. The level "nosmoke" (parent smoking = neither) is coded 0 and is the baseline.
Since we are using an iterative procedure to fit the model, that is to find the ML estimates, we need some indication if the algorithm converged.
Overall goodness-of-fit testing
Test: H_{0} : current model vs. H_{A} : saturated model
The goodness-of-fit statistics X^{2} and G^{2} from this model are both zero, because the model is saturated. However, suppose that we fit the intercept-only model. This is accomplished by removing the predictor from the model statement, like this:
model y/n = / scale=none;
The goodness-of-fit statistics are shown below.
The Pearson statistic X^{2} = 27.6766 is precisely equal to the usual X^{2} for testing independence in the 2 × 2 table, and the deviance G^{2} = 29.1207 is precisely equal to the G^{2} for testing independence in the 2 × 2 table. Thus the intercept-only model, or null logistic regression model, states that student's smoking is unrelated to parents' smoking (e.g., it assumes independence, or odds ratio = 1). But clearly, based on the values of the calculated statistics, this model (i.e., independence) does NOT fit well. This example shows that analyzing a 2 × 2 table for association is equivalent to logistic regression with a single dummy variable. Later on we will compare these tests to the loglinear model of independence; see smokelog.sas [12] and smokelog.lst [13].
The goodness-of-fit statistics, X^{2} and G^{2}, are defined as before in the tests of independence and loglinear models (e.g., compare observed and fitted values). For the χ^{2} approximation to work well, we need the n_{i}'s sufficiently large so that the expected values \(\hat{\mu}_i\) ≥ 5, and n_{i} − \(\hat{\mu}_i\) ≥ 5 for most of the rows i. We can afford to have about 20% of these values less than 5, but none of them should fall below 1.
With real data, we often find that the n_{i}'s are not big enough for an accurate test, and there is no single best solution to handle this but a possible solution may rely strongly on the data and context of the problem.
Once an appropriate model is fitted, the success probabilities need to be estimated using the model parameters. Note that success probabilities are now NOT simply the ratio of observed number of successes and the number of trials. A model fit introduces a structure on the success probabilities. The estimates will now be functions of model parameters.
The fitted model is \(\text{logit}(\hat{\pi}_i)=\text{log}\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}=\hat{\beta}_0+\hat{\beta}_1 X_i=-1.827+0.459\,\text{smoke}\)
where smoke is a dummy variable (e.g., design variable) that takes value 1 if at least one parent is smoking and 0 if neither is smoking, as discussed above.
The estimated \(\hat{\beta}_0=-1.827\), with a standard error of 0.078, is significant: the log odds of a child smoking versus not smoking when neither parent smokes (the baseline level) is −1.827, and it is statistically significant.
The estimated \(\hat{\beta}_1=0.459\), with a standard error of 0.088, is significant: the log odds ratio of a child smoking versus not smoking for at least one parent smoking versus neither parent smoking (the baseline level) is 0.459, and it is statistically significant. \(\exp(0.459)=1.58\) is the estimated odds ratio; compare with our analysis in Section 6.2.
Thus there is a strong association between parents' and children's smoking behavior, and the model fits well.
And the estimated probability of a student smoking given the explanatory variable (e.g. parent smoking behavior):
\(\hat{\pi}_i=\dfrac{exp(-1.826+0.4592X_i)}{1+exp(-1.826+0.4592X_i)}\)
For example, the predicted probability of a student smoking given that neither parent smokes is
\(P(Y_i=1\mid X_i=0)=\dfrac{exp(-1.826+0.4592\times 0)}{1+exp(-1.826+0.4592\times 0)}=0.14\)
and the predicted probability of a student being a smoker if at least one parent smokes is
\(P(Y_i=1\mid X_i=1)=\dfrac{exp(-1.826+0.4592\times 1)}{1+exp(-1.826+0.4592\times 1)}=0.20\)
By including the statement output out=predict pred=prob in the PROC LOGISTIC call, the predicted probabilities are written to the output dataset, so you do NOT need to do this calculation by hand; but it may be useful to try it to see if you understand what's going on. In this model, β_{0} is the log odds of children smoking for non-smoking parents (X_{i} = 0). Thus exp(−1.8266) is the odds that a student smokes when the parents do not smoke.
Looking at the 2 × 2 table, the estimated logodds for nonsmokers is
\(\text{log}\left(\dfrac{188}{1168}\right)=\text{log}(0.161)=-1.8266\)
which agrees with \(\hat{\beta}_0\) from the logistic model.
From the analysis of this example as a 2 × 2 table, exp(β_{1}) should be the estimated odds ratio. The estimated coefficient of the dummy variable,
\(\hat{\beta}_1=0.4592\)
agrees exactly with the log odds ratio from the 2 × 2 table: log((816 × 1168) / (188 × 3203)) = log(1.58) = 0.459. That is, exp(0.459) = 1.58, which is the estimated odds ratio of a student smoking.
To relate this to interpretation of the coefficients in a linear regression, you could say that for every oneunit increase in the explanatory variable X_{1} (e.g. changing from no smoking parents to smoking parents), the odds of "success" π_{i }/ (1 − π_{i}) will be multiplied by exp(β_{1}), given that all the other variables are held constant.
This is not surprising, because in the logistic regression model β_{1} is the difference in the logodds of children smoking as we move from "nosmoke" (i.e. neither parent smokes) (X_{i} = 0) to "smoke" (i.e. at least one parent smokes) (X_{i} = 1), and the difference in logodds is a logodds ratio.
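The agreement described above is easy to verify numerically; a sketch in Python using the cell counts from the 2 × 2 table:

```python
import math

# b0 is the baseline log odds (neither parent smokes); b1 is the log odds ratio.
b0 = math.log(188 / 1168)
b1 = math.log((816 * 1168) / (188 * 3203))

# Fitted probabilities pi = exp(eta) / (1 + exp(eta)) for each group:
pi_neither = math.exp(b0) / (1 + math.exp(b0))
pi_smoke = math.exp(b0 + b1) / (1 + math.exp(b0 + b1))
```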
Testing the hypothesis that the probability of the characteristic depends on the value of the jth variable.
Testing H_{0} : β_{j} = 0 versus H_{1} : β_{j} ≠ 0.
The Wald chi-squared statistic \(z^2=(\hat{\beta}_j/\text{SE}(\hat{\beta}_j))^2\) for these tests is displayed along with the estimated coefficients in the "Analysis of Maximum Likelihood Estimates" section.
The standard error for \(\hat{\beta}_1\), 0.0878, agrees exactly with the standard error that you can calculate from the corresponding 2 × 2 table.
The values indicate a significant relationship between student smoking and the parents' smoking behavior.
Or, we can use the information from "Type III Analysis of Effects" section.
Again, this information indicates that parents' smoking behavior is a significant factor in the model. We could compare z^{2} to a chisquare with one degree of freedom; the pvalue would then be the area to the right of z^{2} under the \(\chi^2_1\) density curve.
A value of z^{2} (the Wald statistic) bigger than 3.84 indicates that we can reject the null hypothesis β_{j} = 0 at the 0.05 level.
\(\beta_1:\left(\dfrac{0.4592}{0.0878}\right)^2=27.354\)
An approximate (1 − α) × 100% confidence interval for β_{j} is given by
\(\hat{\beta}_j \pm z_{(1\alpha/2)} \times SE(\hat{\beta}_j)\)
For example, a 95% confidence interval for β_{1} is
0.4592 ± 1.96 × 0.0878 = (0.287112, 0.63128)
where 0.0878 is the "Standard Error" from the "Analysis of Maximum Likelihood Estimates" section. Then, the 95% confidence interval for the odds ratio of a child smoking if one parent is smoking, in comparison to neither smoking, is
(exp(0.287112), exp(0.63128)) = (1.3325, 1.880)
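A quick numerical check of the Wald statistic and this interval (a sketch in Python; the rounded estimate 0.4592 and standard error 0.0878 are taken from the output quoted above):

```python
import math

# Standard error of the log odds ratio from the 2x2 cell counts:
se = math.sqrt(1/816 + 1/3203 + 1/188 + 1/1168)

b1 = 0.4592                  # estimate from the output above
wald = (b1 / 0.0878) ** 2    # Wald chi-square statistic

# 95% Wald confidence interval for b1, then for the odds ratio:
lower, upper = b1 - 1.96 * 0.0878, b1 + 1.96 * 0.0878
or_lower, or_upper = math.exp(lower), math.exp(upper)
```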
Compare this with the output we get from PROC LOGISTIC:
When fitting a logistic regression, we need to evaluate the overall fit of the model and the significance of the individual parameter estimates, and consider their interpretation. For assessing the fit of the model, we also need to consider the analysis of residuals. The definitions of Pearson, deviance, and adjusted residuals are as before, and you should be able to carry out this analysis.
If we include the statement
output out=predict pred=phat reschi=pearson resdev=deviance;
in the PROC LOGISTIC call, then SAS creates a new dataset called "predict" that includes all of the variables in the original dataset, the predicted probabilities \(\hat{\pi}_i\), the Pearson residuals, and the deviance residuals. Then we can add some code to calculate and print out the estimated expected number of successes \(\hat{\mu}_i=n_i\hat{\pi}_i\) and failures \(n_i-\hat{\mu}_i=n_i(1-\hat{\pi}_i)\).
For now you can try this on your own, or see the next sections for more analysis and code.
There are different ways to run logistic regression depending on the format of the data. As before, since there are many different options, for details you need to refer to R help. Here are some general guidelines to keep in mind, with a simple example outlined in dataformats.R [14] where we create two binary random variables with n trials each, e.g., n = 100:
##create two binary vectors of length 100
x=sample(c(0,1),100, replace=T)
y=sample(c(0,1),100, replace=T)
If data come in tabular form, i.e., the response pattern is given with counts (as seen in the previous example), that is, the data are grouped.
> ##create a 2x2 table with counts
> xytab=table(x,y)
> xytab
y
x 0 1
0 24 29
1 26 21
glm(): For grouped data, the response passed to glm() is a two-column object (e.g., count) holding the counts of "successes" (y = 1) and "failures" (y = 0) out of the n trials. Notice that the count matrix below could also be built from the number of successes y and a column computed as n − y.
> count=cbind(xytab[,2],xytab[,1])
> count
[,1] [,2]
0 29 24
1 21 26
We also need a categorical predictor variable.
> xfactor=factor(c("0","1"))
> xfactor
[1] 0 1
Levels: 0 1
We need to specify the response distribution and a link, which in this case is done by specifying family=binomial("logit")
> tmp3=glm(count~xfactor, family=binomial("logit"))
> tmp3
Call: glm(formula = count ~ xfactor, family = binomial("logit"))
Coefficients:
(Intercept) xfactor1
0.1892 -0.4028
Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
Null Deviance: 1.005
Residual Deviance: 4.441e-15 AIC: 12.72
If data come in matrix form, i.e., a subject × variables matrix with one line for each subject, like a database, the data are ungrouped.
> xydata=cbind(x,y)
> xydata ## 100 rows, we are showing first 7
x y
[1,] 0 1
[2,] 0 1
[3,] 0 0
[4,] 0 1
[5,] 1 0
[6,] 0 1
[7,] 0 0.....
glm(): We need a binary response variable y and a predictor variable x, which in this case is also binary. We need to specify the response distribution and a link, which in this case is done by specifying family=binomial("logit")
> tmp1=glm(y~x, family=binomial("logit"))
> tmp1
Call: glm(formula = y ~ x, family = binomial("logit"))
Coefficients:
(Intercept) x
0.1892 -0.4028
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138.6
Residual Deviance: 137.6 AIC: 141.6
For more options see the course examples and homework. We will follow the R output through to explain the different parts of model fitting. The outputs from SAS (or from many other software) will be essentially the same.
Let's begin with the collapsed 2 × 2 table:
|  | Student smokes | Student does not smoke |
| --- | --- | --- |
| 1–2 parents smoke | 816 | 3203 |
| Neither parent smokes | 188 | 1168 |
In R, we can use the glm() function and specify family = binomial(link = logit). See the files smoke.R [15] and the output generated in smoke.out [16]. That R code corresponds to the SAS code discussed in the previous section:
Here is the R output for the 2 × 2 table that we will use in R for logistic regression:
Please note: the table above is different from the one given by the SAS program. In SAS we input y and n as data columns; here we input the data columns as yes and no.
summary(smoke.logistic) gives the following model information:
Model Information & model convergence status
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
(Dispersion parameter for binomial family taken to be 1)
Number of Fisher Scoring iterations: 2
These sections tell us which dataset we are manipulating, the labels of the response and explanatory variables, what type of model we are fitting (e.g., binary logit), and the type of scoring algorithm for parameter estimation. Fisher scoring is a variant of the Newton-Raphson method for ML estimation; in logistic regression the two are equivalent. Since we are using an iterative procedure to fit the model, that is, to find the ML estimates, we need some indication of whether the algorithm converged.
Overdispersion is an important concept with discrete data. In the context of logistic regression, overdispersion occurs when the discrepancies between the observed responses y_{i} and their predicted values \(\hat{\mu}_i = n_{i}\hat{\pi}_i\) are larger than what the binomial model would predict.
If y_{i} ~ Bin(n_{i}, π_{i}), the mean is μ_{i} = n_{i} π_{i} and the variance is μ_{i}(n_{i} − μ_{i})/n_{i}. Overdispersion means that the data show evidence that the variance of the response y_{i} is greater than μ_{i}(n_{i} − μ_{i})/n_{i}.
If overdispersion is present in a dataset, the estimated standard errors and test statistics for individual parameters and the overall goodness-of-fit will be distorted and adjustments should be made. We will look at this briefly later when we look into continuous predictors.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
parentsmoke1 0.45918 0.08782 5.228 1.71e-07 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This information gives us the fitted model:
\(\text{logit}(\hat{\pi}_i)=\text{log}\dfrac{\hat{\pi}_i}{1-\hat{\pi}_i}=\hat{\beta}_0+\hat{\beta}_1 X_i=-1.827+0.459\,\text{parentsmoke1}\)
where parentsmoke1 is a dummy variable (e.g., design variable) that takes value 1 if at least one parent is smoking and 0 if neither is smoking.
X_{1} = 1 ("parentsmoke1") if parent smoking = at least one,
X_{1} = 0 ("parentsmoke0") if parent smoking = neither.
Parent smoking = neither is the baseline. We are modeling here the probability of "children smoking".
The estimated \(\hat{\beta}_0=-1.827\), with a standard error of 0.078, is significant: the log odds of a child smoking versus not smoking when neither parent smokes (the baseline level) is −1.827, and it is statistically significant.
The estimated \(\hat{\beta}_1=0.459\), with a standard error of 0.088, is significant: the log odds ratio of a child smoking versus not smoking for at least one parent smoking versus neither parent smoking (the baseline level) is 0.459, and it is statistically significant. \(\exp(0.459)=1.58\) is the estimated odds ratio; compare with our analysis in Section 6.2.
Testing the hypothesis that the probability of the characteristic depends on the value of the jth variable.
Testing H_{0} : β_{j} = 0 versus H_{1} : β_{j} ≠ 0.
The Wald chi-squared statistic \(z^2=(\hat{\beta}_j/\text{SE}(\hat{\beta}_j))^2\) for these tests is displayed along with the estimated coefficients in the "Coefficients" section.
The standard error for \(\hat{\beta}_1\), 0.0878, agrees exactly with the standard error that you can calculate from the corresponding 2 × 2 table. The values indicate a significant relationship between student smoking and the parents' smoking behavior; that is, parents' smoking behavior is a significant factor in the model. We could compare z^{2} to a chi-square with one degree of freedom; the p-value would then be the area to the right of z^{2} under the \(\chi^2_1\) density curve.
A value of z^{2} (the Wald statistic) bigger than 3.84 indicates that we can reject the null hypothesis β_{j} = 0 at the 0.05 level.
\(\beta_1:\left(\dfrac{0.4592}{0.0878}\right)^2=27.354\)
An approximate (1 − α) × 100% confidence interval for β_{j} is given by
\(\hat{\beta}_j \pm z_{(1-\alpha/2)} \times SE(\hat{\beta}_j)\)
For example, a 95% confidence interval for β_{1} is
0.4592 ± 1.96 × 0.0878 = (0.2871, 0.6313)
where 0.0878 is the "Standard Error" from the "Coefficients" section. Then, the 95% confidence interval for the odds ratio of a child smoking when at least one parent smokes, in comparison to neither smoking, is
(exp(0.2871), exp(0.6313)) = (1.3325, 1.880)
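The same Wald interval and its odds-ratio version can be computed in R; a minimal sketch, assuming the fitted glm object `smoke.logistic` from this lesson's R code and the coefficient name `parentsmoke1` shown in the output:

```r
## Wald 95% CI for beta_1 and the corresponding odds-ratio CI.
## Assumes `smoke.logistic` is the fitted glm object from this lesson.
est <- coef(summary(smoke.logistic))["parentsmoke1", ]
ci  <- est["Estimate"] + c(-1, 1) * qnorm(0.975) * est["Std. Error"]
ci       # roughly (0.287, 0.631)
exp(ci)  # roughly (1.33, 1.88), the CI for the odds ratio
```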
Thus there is a strong association between the parents' and children's smoking behavior. But does this model fit?
Overall goodness-of-fit testing
Test: H_{0} : current model vs. H_{A} : saturated model
Residual deviance: 3.7170e-13 on 0 degrees of freedom
AIC: 19.242
The goodness-of-fit statistic, the deviance G^{2}, from this model is zero because the model is saturated. The fit of the intercept-only model is assessed by the test
Test: H_{0} : intercept-only model vs. H_{A} : saturated model
Null deviance: 2.9121e+01 on 1 degrees of freedom
Suppose that we fit the intercept-only model. This is accomplished by removing the predictor from the model statement, like this:
glm(response ~ 1, family = binomial(link = logit))
The goodness-of-fit statistics are shown below.
Null deviance: 29.121 on 1 degrees of freedom
Residual deviance: 29.121 on 1 degrees of freedom
AIC: 46.362
The deviance G^{2} = 29.1207 is precisely equal to the G^{2} for testing independence in this 2 × 2 table. The intercept-only (null) logistic regression model states that the student's smoking is unrelated to the parents' smoking (i.e., it assumes independence, or an odds ratio of 1). But clearly, based on the values of the calculated statistics, this model (i.e., independence) does NOT fit well.
Analysis of deviance table [17]: In R, we can test a factor's effect with the anova() function, which gives an analysis of deviance table. We include only one factor in this model, so R drops this factor (parentsmoke) and fits the intercept-only model, yielding the same statistic as above, i.e., the deviance G^{2} = 29.121.
This example shows that analyzing a 2 × 2 table for association is equivalent to logistic regression with a single dummy variable. We can further compare these tests to the loglinear model of independence, e.g. see smokelog.R [18] in Lesson 10.
The goodness-of-fit statistics, X^{2} and G^{2}, are defined as before in the tests of independence and loglinear models (i.e., they compare observed and fitted values). For the χ^{2} approximation to work well, we need the n_{i}'s sufficiently large so that the expected values satisfy \(\hat{\mu}_i \geq 5\) and \(n_i - \hat{\mu}_i \geq 5\) for most of the rows i. We can afford to have about 20% of these values less than 5, but none of them should fall below 1.
With real data, we often find that the n_{i}'s are not big enough for an accurate test. There is no single best way to handle this; a suitable remedy depends strongly on the data and the context of the problem.
Suppose two alternative models are under consideration, where one model is simpler or more parsimonious than the other. More often than not, one of the models is the saturated model. Another common situation is to consider 'nested' models, where one model is obtained from the other by setting some of the parameters to zero. Suppose now we test
H_{0}: reduced model is true vs. H_{A}: current model is true
Notice the difference in the null and alternative hypotheses from the section above. Here, to test the null hypothesis that an arbitrary group of k coefficients from the model is equal to zero (i.e., has no relationship with the response), we need to fit two models: the reduced model, which omits those k terms, and the current model, which includes them.
The likelihood-ratio statistic is
ΔG^{2} = (−2 log L from reduced model) − (−2 log L from current model)
and the degrees of freedom is k (the number of coefficients in question). The p-value is \(P(\chi^2_k \geq \Delta G^2)\).
To perform the test, we must look at the "Model Fit Statistics" section and examine the value of "−2 Log L" for "Intercept and Covariates." Here, the reduced model is the "intercept-only" model (i.e., no predictors) and "intercept and covariates" is the current model we fitted. For our running example, this is equivalent to testing the "intercept-only" model vs. the full (saturated) model (since we have only one covariate).
Larger values of ΔG^{2} ("−2 Log L") lead to small p-values, which provide evidence against the reduced model in favor of the current model; you can explore AIC (Akaike Information Criterion) and SC (Schwarz Criterion) on your own through SAS help files, or see Lesson 5 for AIC.
For our example, ΔG^{2} = 5176.510 − 5147.390 = 29.1207 with df = 2 − 1 = 1. Notice that this matches the deviance we obtained earlier.
Another way to calculate the test statistic is
ΔG^{2} = (G^{2} from reduced model) − (G^{2} from current model),
where the G^{2}'s are the overall goodness-of-fit statistics.
This value of −2 log L is useful for comparing two nested models which differ by an arbitrary set of coefficients.
Also notice that the ΔG^{2} we calculated for this example equals
Likelihood Ratio 29.1207 1 <.0001
from "Testing Global Hypothesis: BETA=0" section (the next part of the output, see below).
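In R, the same statistic can be assembled by hand from two nested fits; a sketch with hypothetical objects `fit0` (reduced) and `fit1` (current):

```r
## Delta G^2 for nested glms, computed two equivalent ways.
## `fit0` (reduced) and `fit1` (current) are hypothetical nested fits.
dG2     <- deviance(fit0) - deviance(fit1)              # difference of G^2's
dG2.alt <- as.numeric(-2 * (logLik(fit0) - logLik(fit1)))
df      <- df.residual(fit0) - df.residual(fit1)        # k coefficients tested
pchisq(dG2, df, lower.tail = FALSE)                     # p-value
```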
Testing the null hypothesis that the set of coefficients is simultaneously zero. For example,
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\ldots+\beta_k X_k\)
test H_{0} : β_{1} = β_{2} = ... = 0 versus the alternative that at least one of the coefficients β_{1}, . . . , β_{k} is not zero.
This is like the overall F-test in linear regression. In other words, this is testing the null hypothesis that an intercept-only model is correct,
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)
versus the alternative that the current model is correct
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\ldots+\beta_k X_k\)
In our example, we are testing the null hypothesis that an intercept-only model is correct,
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)
versus the alternative that the current model (in this case, the saturated model) is correct:
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1\)
In the SAS output, three different chi-square statistics for this test are displayed in the section "Testing Global Null Hypothesis: BETA=0," corresponding to the likelihood-ratio, score, and Wald tests. Recall their definitions from the very first lessons.
This test has k degrees of freedom (i.e., the number of dummy indicator (design) variables, which is the number of β-parameters excluding the intercept).
Large chi-square statistics lead to small p-values and provide evidence against the intercept-only model in favor of the current model.
The Wald test is based on asymptotic normality of ML estimates of β's. Rather than using the Wald, most statisticians would prefer the LR test.
If these three tests agree, that is evidence that the large-sample approximations are working well and the results are trustworthy. If the results from the three tests disagree, most statisticians would tend to trust the likelihood-ratio test more than the other two.
In our example, the "intercept only" model or the null model says that student's smoking is unrelated to parents' smoking habits. Thus the test of the global null hypothesis β_{1} = 0 is equivalent to the usual test for independence in the 2 × 2 table. We will see that the estimated coefficients and SE's are as we predicted before, as well as the estimated odds and odds ratios.
The residual deviance is the difference in G^{2} = −2 log L between the saturated model and the fitted model. A high residual deviance shows that the model cannot be accepted.
The null deviance is the difference in G^{2} = −2 log L between the saturated model and the intercept-only model. A high null deviance shows that the intercept-only model does not fit.
In our 2 × 2 table smoking example, the residual deviance is almost 0 because the model we built is the saturated model; notice that its degrees of freedom is 0, too. The null deviance is equivalent to the likelihood-ratio entry in the "Testing Global Null Hypothesis: BETA=0" section of the SAS output.
For our example, the null deviance = 29.1207 with df = 1. Notice that this matches the deviance we obtained earlier.
An alternative statistic for measuring overall goodness of fit is the Hosmer-Lemeshow statistic.
Note: we use the one-predictor model here, where the predictor indicates whether at least one parent smokes.
This is a Pearson-like χ^{2} statistic that is computed after the data are grouped by similar predicted probabilities. It is most useful when the model contains more than one predictor and/or continuous predictors. We will see more on this later, and in your homework.
H_{0 }: the current model fits well
H_{A }: the current model does not fit well
To calculate this statistic:
Warning about the Hosmer-Lemeshow goodness-of-fit test:
In the model statement, the option lackfit tells SAS to compute HL statistics and print the partitioning.
For our example, because we have a small number of groups (i.e., 2), this statistic gives a perfect fit (HL = 0, p-value = 1).
Instead of deriving the diagnostics we will look at them from a purely applied viewpoint. Recall the definitions and introductions to the regression residuals and Pearson and Deviance residuals.
The Pearson residuals are defined as
\(r_i=\dfrac{y_i-\hat{\mu}_i}{\sqrt{\hat{V}(\hat{\mu}_i)}}=\dfrac{y_i-n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}}\)
The contribution of the ith row to the Pearson statistic X^{2} is
\(\dfrac{(y_i-\hat{\mu}_i)^2}{\hat{\mu}_i}+\dfrac{\left((n_i-y_i)-(n_i-\hat{\mu}_i)\right)^2}{n_i-\hat{\mu}_i}=r^2_i\)
and the Pearson goodness-of-fit statistic is
\(X^2=\sum\limits_{i=1}^N r^2_i\)
which we would compare to a \(\chi^2_{N-p}\) distribution. The deviance test statistic is
\(G^2=2\sum\limits_{i=1}^N \left\{ y_i\text{log}\left(\dfrac{y_i}{\hat{\mu}_i}\right)+(n_i-y_i)\text{log}\left(\dfrac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right\}\)
which we would again compare to \(\chi^2_{N-p}\), and the contribution of the ith row to the deviance is
\(2\left\{ y_i\text{log}\left(\dfrac{y_i}{\hat{\mu}_i}\right)+(n_i-y_i)\text{log}\left(\dfrac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right\}\)
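In R these residuals, and the statistics built from them, come directly from a fitted binomial glm; a sketch, where `fit` stands for any grouped-data fit such as the ones in this lesson:

```r
## Pearson and deviance residuals from a fitted binomial glm, and the
## corresponding overall statistics. `fit` is a hypothetical grouped-data fit.
r.p <- residuals(fit, type = "pearson")
r.d <- residuals(fit, type = "deviance")
X2  <- sum(r.p^2)  # Pearson goodness-of-fit statistic
G2  <- sum(r.d^2)  # equals deviance(fit)
```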
We will note how these quantities are derived through appropriate software and how they provide useful information to understand and interpret the models. For an example see the SAS or R analysis in the next section [19].
The concepts discussed for binary predictors extend to predictors with multiple levels. In this lesson we consider a categorical predictor with three levels, but the basic ideas easily extend to any 2 × J table.
We begin by replicating the analysis of the original 3 × 2 tables with logistic regression.
| How many parents smoke? | Student smokes? Yes (Z = 1) | Student smokes? No (Z = 2) |
| --- | --- | --- |
| Both (Y = 1) | 400 | 1380 |
| One (Y = 2) | 416 | 1823 |
| Neither (Y = 3) | 188 | 1168 |
First, we re-express the data in terms of y_{i} = number of smoking students and n_{i} = number of students for the three groups based on the parents' behavior:
| How many parents smoke? | y_{i} | n_{i} |
| --- | --- | --- |
| Both | 400 | 1780 |
| One | 416 | 2239 |
| Neither | 188 | 1356 |
Then we decide on a baseline level for the explanatory variable X, and create k − 1 dummy indicators if X is a categorical variable with k levels.
For our example, let parent smoking = Neither be the baseline, and define a pair of dummy indicators (or design variables), each taking one of two values:
X_{1} = 1 if parent smoking = One, X_{1} = 0 otherwise;
X_{2} = 1 if parent smoking = Both, X_{2} = 0 otherwise.
Let \(\dfrac{\pi}{1-\pi}=\) odds of student smoking; notice that we dropped the index i denoting each individual student to simplify the notation. We are still modeling each student, or rather a randomly chosen student from the same classification cell. Then the model
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2\)
says that the log-odds of a student smoking is β_{0} for parents smoking = neither, β_{0} + β_{1} for parents smoking = one, and β_{0} + β_{2} for parents smoking = both, and exp(β_{j}) is the conditional odds ratio for level j of the predictor X versus the baseline. Therefore,
\begin{array}{rcl}
\beta_1 &=& \text{log odds for one} - \text{log odds for neither} \\
\exp(\beta_1) &=& \dfrac{\text{odds of smoking} \mid \text{one}}{\text{odds of smoking} \mid \text{neither}} \\
\beta_2 &=& \text{log odds for both} - \text{log odds for neither} \\
\exp(\beta_2) &=& \dfrac{\text{odds of smoking} \mid \text{both}}{\text{odds of smoking} \mid \text{neither}} \\
\dfrac{\exp(\beta_1)}{\exp(\beta_2)} &=& \exp(\beta_1-\beta_2)=\dfrac{\text{odds of smoking} \mid \text{one}}{\text{odds of smoking} \mid \text{both}}
\end{array}
and we expect to get \(\hat{\beta}_1=\log(1.42)=0.351\), where 1.42 = (416 × 1168)/(188 × 1823), and \(\hat{\beta}_2=\log(1.80)=0.588\), where 1.80 = (400 × 1168)/(188 × 1380). The estimated intercept should be \(\hat{\beta}_0=\log(188/1168)=-1.827\).
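These anticipated values can be checked directly from the cell counts, for example in R:

```r
## Anticipated estimates computed straight from the table counts above.
log((416 * 1168) / (188 * 1823))  # beta1-hat: 0.3490
log((400 * 1168) / (188 * 1380))  # beta2-hat: 0.5882
log(188 / 1168)                   # beta0-hat: -1.8266
```

These agree with the fitted coefficients reported in the output below.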
There are different versions of the SAS program on the course web site for fitting these data. Here is one from smoke.sas [11], where '0' = neither parent smokes, '1' = one smokes, and '2' = both smoke, and we use PROC LOGISTIC; notice we could use PROC GENMOD too.
The option param=ref tells SAS to create a set of two dummy variables to distinguish among the three categories, where '0'=neither is a baseline because of option descending and ref=first (see the previous section for details).
Another way of doing the same in smoke.sas, but using character labels (e.g., "both", "one", "neither") rather than numbers for the categories:
In the class statement, the option order=data tells SAS to sort the categories of S by the order in which they appear in the dataset rather than alphabetical order. The option ref='neither' makes neither the reference group (i.e. the group for which both dummy variables are zero). Let's look at some relevant portions of the output of smoke.lst [20] that differ from the analysis of the corresponding 2 × 2 table from the previous section of the notes.
Fisher scoring is a variant of the Newton-Raphson method for ML estimation; in logistic regression the two are equivalent. Notice that there are now 3 observations, since we have 3 groupings by the levels of the explanatory variable.
From an explanatory variable S with 3 levels (0, 1, 2), we created two dummy variables, i.e., design variables:
X_{1} = 1 if parent smoking = One, X_{1} = 0 otherwise;
X_{2} = 1 if parent smoking = Both, X_{2} = 0 otherwise.
Since parent smoking = Neither takes the value 0 on both dummy variables, it is the baseline.
Here, we specified neither to be the reference level. We used the option order=data so that both takes values (1,0) and one takes (0,1). Without that option, the levels would be ordered alphabetically, but in this case the two orderings coincide.
Here s_{1} and s_{2} correspond to the X_{1} and X_{2} dummy variables. The saturated model is
\(\text{logit}(\pi)=-1.8266+0.3491 X_1+0.5882 X_2\)
and the estimated probability of a child smoking given the explanatory variable:
\(\hat{\pi}_i=\dfrac{\text{exp}(-1.8266+0.3491 X_1+0.5882 X_2)}{1+\text{exp}(-1.8266+0.3491 X_1+0.5882 X_2)}\)
For example, the predicted probability of a student smoking given that only one parent is smoking is:
\begin{align}
P(Y_i=1 \mid \text{neither}=0,\text{one}=1,\text{both}=0) &= P(Y_i=1 \mid X_1=1,X_2=0)\\
&= \dfrac{\text{exp}(-1.8266+0.3491)}{1+\text{exp}(-1.8266+0.3491)}\\
&= 0.1858\\
\end{align}
In SAS, if we include the statement
output out=predict pred=phat reschi=pearson resdev=deviance;
then SAS creates a new dataset called "predict" that includes all of the variables in the original dataset, the predicted probabilities \(\hat{\pi}_i\), and the Pearson and deviance residuals. We can add some code to calculate and print the estimated expected numbers of successes \(\hat{\mu}_i=n_i\hat{\pi}_i\) (e.g., "shat") and failures \(n_i-\hat{\mu}_i=n_i(1-\hat{\pi}_i)\) (e.g., "fhat").
Running this program gives a new output section:
Here "s" denotes the levels of the categorical predictor for parents' smoking behavior, "y" is, as before, the number of students smoking at each level of the predictor, "n" is the marginal count for each level of the predictor, "prob" is the estimated probability of "success" (i.e., a student smoking given the level of the predictor), "shat" and "fhat" are the expected numbers of successes and failures respectively, and "pearson" and "deviance" are the Pearson and deviance residuals.
All of the "shat" and "fhat" values, that is, the predicted numbers of successes and failures, are greater than 5.0, so the χ^{2} approximation is trustworthy.
Below is the R code (smoke.R [15]) that replicates the analysis of the original 2 × 3 table with logistic regression.
First, let’s see the table we created for the analysis.
You should again notice the difference in data input. In SAS, we input y and n as data columns; here we just input data columns as yes and no.
Next, the model information is displayed.
If you don’t specify the baseline using the “class” option, R will set the first level as the default; here, it’s “0”. The parentsmoke1 and parentsmoke2 terms correspond to the X_{1} and X_{2} dummy variables. The saturated model is:
\(\text{logit}(\pi)=-1.8266+0.3491 X_1+0.5882 X_2\)
and the estimated probability of a child smoking given the explanatory variable:
\(\hat{\pi}_i=\dfrac{\text{exp}(-1.8266+0.3491 X_1+0.5882 X_2)}{1+\text{exp}(-1.8266+0.3491 X_1+0.5882 X_2)}\)
For example, the predicted probability of a student smoking given that only one parent is smoking is:
\begin{align}
P(Y_i=1 \mid \text{neither}=0,\text{one}=1,\text{both}=0) &= P(Y_i=1 \mid X_1=1,X_2=0)\\
&= \dfrac{\text{exp}(-1.8266+0.3491)}{1+\text{exp}(-1.8266+0.3491)}\\
&= 0.1858\\
\end{align}
In R, deviance residuals are given directly in the output by summary() function.
> smoke.logistic <- glm(response~parentsmoke, family=binomial(link=logit))
> summary(smoke.logistic)
Call:
glm(formula = response ~ parentsmoke, family = binomial(link = logit))
Deviance Residuals:
[1] 0 0 0
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
parentsmoke1  0.34905    0.09554   3.654 0.000259 ***
parentsmoke2  0.58823    0.09695   6.067  1.3e-09 ***

We can also obtain the deviance and Pearson residuals by using the residuals.glm() function. Run help(residuals.glm) to study this function further.
To obtain deviance residuals
> residuals(smoke.logistic)
[1] 0 0 0
To obtain Pearson residuals:
> residuals(smoke.logistic, type="pearson")
1 2 3
2.440787e-13 4.355932e-13 1.477321e-11
To obtain the predicted values, use function predict.glm() with which we can specify the type of predicted values we want.
type is the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus, for a binomial model the default predictions are log-odds (probabilities on the logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
For example, the code below gives predicted probabilities.
> predict(smoke.logistic, type="response")
1 2 3
0.2247191 0.1857972 0.1386431
All of the predicted numbers of successes and failures are greater than 5.0, so the χ^{2} approximation is trustworthy.
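The expected counts themselves (SAS's "shat" and "fhat") can be recovered from the R fit; a sketch, assuming `smoke.logistic` above and the group totals from the table (1780, 2239, 1356):

```r
## Expected successes ("shat") and failures ("fhat") for each group.
n    <- c(1780, 2239, 1356)  # group totals from the table above
phat <- predict(smoke.logistic, type = "response")
shat <- n * phat             # expected numbers of successes
fhat <- n * (1 - phat)       # expected numbers of failures
```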
The goodness-of-fit statistics Χ^{2} and G^{2} from this model are both zero, because the model is saturated. Suppose that we fit the intercept-only model as before by removing the predictors from the model statement:
The goodness-of-fit statistics are shown below.
The Pearson statistic Χ^{2} = 37.5663 and the deviance G^{2} = 38.3658 are precisely equal to the usual Χ^{2} and G^{2} for testing independence in the 2 × 3 table.
Clearly, in this example, the saturated model fits perfectly and the independence model does not fit well.
We can use the HosmerLemeshow test to assess the overall fit of the model. As in the previous example, the Hosmer and Lemeshow statistic is not very meaningful since the number of groups is small.
The residual deviance is almost 0 because the model is saturated. The null deviance gives G^{2} = 38.366, which is precisely equal to the usual G^{2} for testing independence in the 2 × 3 table. Since this statistic is large, leading to a small p-value, it provides evidence against the intercept-only model in favor of the current model. You can also find the same result in the output from the anova() part.
Here is a link to a program in R that produces this Hosmer and Lemeshow statistic (ROCandHL.R [22])
> ### To use the Hosmer-Lemeshow statistic (although not relevant here since the number of groups is small)
> ## First run ROCandHL.R script which has function hosmerlem() in it
> ## hosmerlem() takes the vector of successes, predicted vector of success and g=# of groups as input
> ## produce the vector of predicted success "yhat"
> yhat=rowSums(response)*predict(smoke.logistic, type="response")
> yhat
1 2 3
400 416 188
>
> hosmerlem(response[,1], yhat, g=3) ## here run 3 groups
X^2 Df P(>Chi)
"1.00593633284243e-24" "1" "."
We have shown that analyzing a 2 × 3 table for associations is equivalent to a binary logistic regression with two dummy variables as predictors. For 2 × J tables, we would fit a binary logistic regression with J − 1 dummy variables.
In our model
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2\)
Test H_{0} : β_{1} = β_{2} = 0 versus the alternative that at least one of the coefficients β_{1}, β_{2} is not zero.
In other words, this is testing the null hypothesis that an intercept-only model is correct,
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)
versus the alternative that the current model is correct:
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2\)
This test has k degrees of freedom (i.e., the number of dummy or indicator variables, which is the number of β-parameters excluding the intercept).
Large chi-square statistics lead to small p-values and provide evidence against the intercept-only model in favor of the current model.
For the previous test this would be equivalent to testing the "intercept-only" model vs. the full (saturated) model.
The likelihood-ratio statistic is:
ΔG^{2} = (−2 log L from reduced model) − (−2 log L from current model)
and the degrees of freedom is k (the number of coefficients in question). The p-value is \(P(\chi^2_k \geq \Delta G^2)\).
For our example, ΔG^{2} = 5176.510 − 5138.144 = 38.3658 with df = 3 − 1 = 2.
Notice that this matches:
Likelihood Ratio 38.3658 2 <.0001
from the "Testing Global Hypothesis: BETA=0" section.
How would you test for X_{2} only? Which two models can you compare? How do you interpret this case?
For example, here is the saturated model we just looked at in SAS.
Compare this to
model y/n = s1 /scale=none lackfit;
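The analogous comparison in R fits the model with and without X_{2} and compares deviances; a sketch with hypothetical columns `y`, `n`, `s1`, and `s2` matching the dummy variables above:

```r
## Testing beta_2 alone: compare the model without X_2 to the model with it.
## `y`, `n`, `s1`, `s2` are hypothetical vectors matching the dummies above.
fit.reduced <- glm(cbind(y, n - y) ~ s1,      family = binomial)
fit.full    <- glm(cbind(y, n - y) ~ s1 + s2, family = binomial)
anova(fit.reduced, fit.full, test = "Chisq")  # Delta G^2 on 1 df
```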
Testing a hypothesis that the probability of the characteristic depends on the value of the jth variable.
Testing H_{0} : β_{j} = 0 versus H_{1} : β_{j} ≠ 0.
A value of z^{2} bigger than 3.84 indicates that we can reject the null hypothesis β_{j} = 0 at the .05 level.
From the second row of this part of the output:
\(\beta_1:\left(\dfrac{0.3491}{0.0955}\right)^2=13.3481\)
A value of |z| bigger than 1.96, or a small p-value, indicates that we can reject the null hypothesis β_{j} = 0 at the .05 level.
Confidence Intervals: An approximate (1 − α) × 100% confidence interval for β_{j} is given by
\(\hat{\beta}_j \pm z_{(1-\alpha/2)} \times SE(\hat{\beta}_j)\)
For example, a 95% CI for β_{1} is
0.3491 ± 1.96 × 0.0955 = (0.1619, 0.5363)
Then, the 95% CI for the odds ratio of a student smoking if one parent is smoking, in comparison to neither smoking, is:
(exp(0.1619), exp(0.5363)) = (1.176, 1.710)
Student and parents' smoking behaviors are dependent.
This section will focus on the following:
Recall the 3 × 2 × 2 table that we examined in Lesson 5 [23] that classifies 800 boys according to S = socioeconomic status, B = whether a boy is a scout, and D = juvenile delinquency status.
| Socioeconomic status | Boy scout | Delinquent: Yes | Delinquent: No |
| --- | --- | --- | --- |
| Low | Yes | 11 | 43 |
| Low | No | 42 | 169 |
| Medium | Yes | 14 | 104 |
| Medium | No | 20 | 132 |
| High | Yes | 8 | 196 |
| High | No | 2 | 59 |
Because the outcome variable D is binary, we can express many models of interest using binary logistic regression.
Before handling the full threeway table, let us consider the 2 × 2 marginal table for B and D as we did in Lesson 5 [23]. We concluded that the boy scout status (B) and the delinquent status (D) are dependent and that
| Boy scout | Delinquent: Yes | Delinquent: No |
| --- | --- | --- |
| Yes | 33 | 343 |
| No | 64 | 360 |
the estimated logodds ratio is
\(\text{log}\left(\dfrac{33\times 360}{64 \times 343}\right)=-0.6140\)
with a standard error of \(\sqrt{\dfrac{1}{33}+\dfrac{1}{343}+\dfrac{1}{64}+\dfrac{1}{360}}=0.2272\). That is, we estimate that being a boy scout lowers the log-odds of delinquency by 0.614; the odds ratio is 0.541.
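These quantities follow directly from the 2 × 2 counts, for example in R:

```r
## Log odds ratio, its SE, and the odds ratio from the 2x2 counts.
logOR <- log((33 * 360) / (64 * 343))       # -0.6140
se    <- sqrt(1/33 + 1/343 + 1/64 + 1/360)  # 0.2272
exp(logOR)                                  # odds ratio 0.541
```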
Now let’s fit a logistic regression model,
\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_i\)
where X_{i} is a dummy variable
X_{i} = 0 if nonscout,
X_{i} = 1 if scout.
See the SAS code in the program scout.sas [24] below:
The first category is "nonscout", because it comes before "scout" in the alphabetical order.
Some output from this part of the program:
The estimated coefficient of the dummy variable,
\(\hat{\beta}_1=-0.6140\)
is identical to the logodds ratio from the analysis of the 2 × 2 table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the 2 × 2 table analysis.
Also, in this model, β_{0} is the logodds of delinquency for nonscouts (X_{i} = 0). Looking at the 2 × 2 table, the estimated logodds for nonscouts is:
\(\text{log}\left(\dfrac{64}{360}\right)=-1.7272\)
which agrees with \(\hat{\beta}_0\) from the logistic model.
The goodness-of-fit statistics X^{2} and G^{2} from this model are both zero, because the model is saturated. Let’s fit the intercept-only model by removing the predictor from the model statement, like this:
model y/n = / scale=none;
The goodness-of-fit statistics are shown below.
The Pearson statistic X^{2} = 7.4652 is identical to the ordinary X^{2} for testing independence in the 2 × 2 table (see notes Lesson 5 and boys.sas files). And the deviance G^{2} = 7.6126 is identical to the G^{2 }for testing independence in the 2 × 2 table.
For the R code for this model and the other models in this section (the boy scout example), see the R program scout.R [25].
Here is the corresponding R code for scout1.sas that will produce the 2 × 2 table:
Part of the R output includes:
The baseline for this model is “nonscout” because it comes before “scout” in the alphabetical order. The estimated coefficient of the dummy variable,
\(\hat{\beta}_1=−0.6140\)
is identical to the logodds ratio from the analysis of the 2 × 2 table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the 2 × 2 table analysis.
Also, in this model, β_{0} is the logodds of delinquency for nonscouts (X_{i} = 0). Looking at the 2 × 2 table, the estimated logodds for nonscouts is:
log(64/360) = −1.7272
which agrees with \(\hat{\beta}_0\) from the logistic model.
The goodnessoffit statistics are shown below:
As we discussed for the earlier R output, the residual deviance is almost 0 because the model is saturated. The null deviance is the G^{2} that corresponds to the deviance goodness-of-fit statistic in the SAS output. Here, the deviance G^{2} = 7.6126 is identical to the G^{2} for testing independence in the 2 × 2 table.
Note: Thus we have shown again that analyzing a 2 × 2 table for association is equivalent to logistic regression with a dummy variable. Next let us look at the rest of the data and generalize these analyses to I × 2 tables and I × J × 2 tables.
Now let us do a similar analysis for the 3 × 2 table that classifies subjects by S and D:
| Socioeconomic status | Delinquent: Yes | Delinquent: No |
| --- | --- | --- |
| Low | 53 | 212 |
| Medium | 34 | 236 |
| High | 10 | 255 |
Two odds ratios of interest are
(53 × 236) / (34 × 212) = 1.735,
(53 × 255) / (10 × 212) = 6.375.
We estimate that the odds of delinquency for the S = low group are 1.735 times as high as for the S = medium group, and 6.375 times as high as for the S = high group. The estimated log odds ratios are
log 1.735 = .5512 and log 6.375 = 1.852,
and the standard errors are
\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{34}+\dfrac{1}{236}}=0.2392\)
\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{10}+\dfrac{1}{255}}=0.3571\)
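These values can be verified from the 3 × 2 counts, for example in R:

```r
## The two odds ratios and their standard errors from the 3x2 counts.
(53 * 236) / (34 * 212)            # 1.735 (low vs. medium)
(53 * 255) / (10 * 212)            # 6.375 (low vs. high)
sqrt(1/53 + 1/212 + 1/34 + 1/236)  # 0.2392, SE of log(1.735)
sqrt(1/53 + 1/212 + 1/10 + 1/255)  # 0.3571, SE of log(6.375)
```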
Now let us replicate this analysis using logistic regression. First, we re-express the data in terms of y_{i} = number of delinquents and n_{i} = number of boys for the three S-groups:

| S | y_{i} | n_{i} |
| --- | --- | --- |
| Low | 53 | 265 |
| Medium | 34 | 270 |
| High | 10 | 265 |
Then we define a pair of dummy indicators,
X_{1} = 1 if S = medium, X_{1} = 0 otherwise;
X_{2} = 1 if S = high, X_{2} = 0 otherwise.
Let π = probability of delinquency. Then the model
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2\)
says that the log-odds of delinquency are β_{0} for S = low, β_{0} + β_{1} for S = medium, and β_{0} + β_{2} for S = high.
The SAS code for fitting this model is shown below (see scout.sas [24]).
Some relevant portions of the output are shown below.
Think about the following question, then click on the icon to the left to display an answer: Why are G^{2} and X^{2} = 0? What happened with the information on boy scout status?
In this case, the "intercept only" model says that delinquency is unrelated to socioeconomic status, so the test of the global null hypothesis β_{1} = β_{2} = 0 is equivalent to the usual test for independence in the 3 × 2 table. The estimated coefficients and SE’s are as we predicted, and the estimated odds ratios are
exp(−.5512) = 0.576 = 1/1.735,
exp(−1.852) = 0.157 = 1/6.375.
Here is the part in the R program file scout.R [25] that corresponds to scout2.sas program.
Notice that we only use “Smedium” and “Shigh” in the model statement in glm(). So we set the baseline as “low” for this model.
R output:
The null deviance is the G^{2} that corresponds to the deviance goodness-of-fit statistic found in the SAS output. Here, the deviance G^{2} = 36.252, so we can conclude that delinquency is related to socioeconomic status. The test of the global null hypothesis β_{1} = β_{2} = 0 is equivalent to the usual test for independence in the 3 × 2 table. The estimated coefficients and SE’s are as we predicted, and the estimated odds ratios are
exp(−.5512) = 0.576 = 1/1.735,
exp(−1.852) = 0.157 = 1/6.375.
Notice that we did not say anything here about the scout status. We have "ignored" that information because we collapsed over that variable. Next we will see how this plays out with logistic regression.
In this example, we could have also arranged the input data like this:
| S | B | y_{i} | n_{i} |
| --- | --- | --- | --- |
| low | scout | 11 | 54 |
| low | nonscout | 42 | 211 |
| medium | scout | 14 | 118 |
| medium | nonscout | 20 | 152 |
| high | scout | 8 | 204 |
| high | nonscout | 2 | 61 |
A SAS program for fitting the same model is shown below.
The parameter estimates from this new program are exactly the same as before:
But the overall fit statistics are different! Before, we had X^{2} = 0 and G^{2} = 0 because the model was saturated (there were three parameters and N = 3 lines of data). But now, the fit statistics are:
The model appears to fit very well, but it is no longer saturated. What happened? Recall that X^{2} and G^{2} are testing the null hypothesis that the current model is correct, versus the alternative of a saturated model. When we disaggregated the data by levels of B, using six input lines rather than three, the current model did not change but the saturated model did; the saturated model was enlarged to six parameters.
Please note: It is very important for you to understand how you entered the data and what model you are fitting. If you understand the basic concepts, then you can apply model comparisons with any statistical software application. 
Another way to interpret the overall X^{2} and G^{2} goodness-of-fit tests is that they are testing the significance of all omitted covariates. If we collapse the data over B and use only three lines of data, then SAS is unaware of the existence of B. But if we disaggregate the data by levels of B and do not include it in the model, then SAS has the opportunity to test the fit of the current model (in which the probability of delinquency varies by S alone) against the saturated alternative, in which the probability of delinquency varies by each combination of the levels of S and B. When the data are disaggregated, the goodness-of-fit tests are actually testing the hypothesis that D is unrelated to B once S has been taken into account, i.e., that D and B are conditionally independent given S.
Here’s another way to think about it. The current model has three parameters:
But the alternative has six parameters. We can think of these six parameters as an intercept and five dummies to distinguish among the six rows of data, but then it’s not easy to see how the current model becomes a special case of it. So, we think of the six parameters as
Now it has become clear that the current model is a special case of this model in which the coefficients for B and the SB interactions are zero. The overall X^{2} and G^{2} statistics for the disaggregated data are testing the joint significance of the B dummy and the SB interactions.
So, should we aggregate, or should we not? If the current model is true, then it doesn’t matter; we get exactly the same estimated coefficients and standard errors either way. But disaggregating gives us the opportunity to test the significance of the omitted terms for B and SB.
Therefore, it often makes sense to disaggregate your dataset by variables that are not included in the model, because it gives you the opportunity to test the overall fit of your model. But that strategy has limits. If you disaggregate the data too much, the n_{i}’s may become too small to reliably test the fit by X^{2} and G^{2}.
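To make this concrete, here is a small numeric check (a Python illustration, standing in for the SAS/R output): we fit the S-only model to the six disaggregated lines by hand. Because S is the only predictor and is fully dummy-coded, the ML fitted probability in each S level is simply the pooled observed proportion for that level, so G^{2} can be computed directly; it tests exactly the omitted B and SB terms.

```python
import math

# Disaggregated data: (S, B, delinquent y, total n), from the six-line table above
rows = [
    ("low", "scout", 11, 54), ("low", "non-scout", 42, 211),
    ("medium", "scout", 14, 118), ("medium", "non-scout", 20, 152),
    ("high", "scout", 8, 204), ("high", "non-scout", 2, 61),
]

# Under the S-only model, the fitted probability in each S level is the pooled proportion
pooled = {}
for s in ("low", "medium", "high"):
    y = sum(r[2] for r in rows if r[0] == s)
    n = sum(r[3] for r in rows if r[0] == s)
    pooled[s] = y / n

# Deviance G^2 = 2 * sum of obs * log(obs/fitted), over both cells of each line
G2 = 0.0
for s, b, y, n in rows:
    fit = n * pooled[s]
    G2 += 2 * (y * math.log(y / fit) + (n - y) * math.log((n - y) / (n - fit)))

print(round(G2, 2))  # ~0.16 on 3 df: B and SB add essentially nothing once S is in the model
```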
In this part of the lesson we will consider different binary logistic regression models for threeway tables and their link to loglinear models. Let us return to the 3 × 2 × 2 table:
Socioeconomic status   Boy scout   Delinquent
                                   Yes     No
Low                    Yes          11     43
Low                    No           42    169
Medium                 Yes          14    104
Medium                 No           20    132
High                   Yes           8    196
High                   No            2     59
As we discussed in Lesson 5 [23], there are many different models that we could fit to this table. But if we think of D as a response and B and S as potential predictors, we can focus on a subset of models that make sense.
Let π be the probability of delinquency. The simplest model worth considering is the null or interceptonly model,
\(\text{log}\left(\dfrac{\pi}{1\pi}\right)=\beta_0\) (1)
in which D is unrelated to B or S. If we were to fit this model in PROC LOGISTIC using the disaggregated data (all six lines), we would find that the X^{2} and G^{2} statistics are identical to those we obtained in Lesson 5 [23] from testing the null hypothesis "S and B are independent of D."
That is, testing the overall fit of model (1), the intercept-only model, is equivalent to testing the fit of the log-linear model (D, SB), because (1) says that D is unrelated to S and B but makes no assumptions about whether S and B are related.
After (1), we may want to fit the logit model that has a main effect for B,
\(\text{log}\left(\dfrac{\pi}{1\pi}\right)=\beta_0+\beta_1 X_1\) (2)
where
X_{1} = 1 if B=scout,
X_{1} = 0 otherwise.
If the data are disaggregated into six lines, the goodness-of-fit tests for model (2) will be equivalent to the test for the (DB, SB) log-linear model, which says that D and S are conditionally independent given B.
This makes sense, because (2) says that S has no effect on D once B has been taken into account.
The model that has main effects for S,
\(\text{log}\left(\dfrac{\pi}{1\pi}\right)=\beta_0+\beta_2 X_2+\beta_3 X_3\) (3)
where
X_{2} = 1 if S = medium,
X_{2} = 0 otherwise,
X_{3} = 1 if S = high,
X_{3} = 0 otherwise,
says that B has no effect on D once S has been taken into account. The goodness-of-fit tests for (3) are equivalent to testing the null hypothesis that (DS, BS) fits, i.e., that D and B are conditionally independent given S.
The logit model
\(\text{log}\left(\dfrac{\pi}{1\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3\) (4)
has main effects for both B and S and corresponds to the model of homogeneous association that we discussed in Lesson 5 [23]. We could not fit that model at the time, because the ML estimates have no closed-form solution, but we did calculate the CMH and Breslow-Day statistics. With logistic regression software, however, fitting this model is no more difficult than fitting any other.
See scout.sas [24].
This model says that the effect of B on D, when expressed in terms of odds ratios, is identical across the levels of S. Equivalently, it says that the odds ratios describing the relationship between S and D are identical across the levels of B. If this model does not fit, we have evidence that the effect of B on D varies across the levels of S, or that the effect of S on D varies across the levels of B.
Finally, the saturated model can be written as:
\(\text{log}\left(\dfrac{\pi}{1\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3X_3+\beta_4X_1X_2+\beta_5 X_1X_3\)
which has the main effects for B and S and their interactions. This model has X^{2} = G^{2 }= 0 with zero degrees of freedom.
See scout.sas [24]. If time permits, you should try to fit these in R.
So how do we choose between these five models? Which would be the best fit, most parsimonious model? That's next.
In this part of the lesson we will focus on model selection. Fitting all the models from the previous section to the delinquency data, we obtain the following fit statistics.
Model                    X^{2}    G^{2}    df
Saturated                 0.00     0.00     0
S + B                     0.15     0.15     2
S                         0.16     0.16     3
B                        27.38    28.80     4
Null (intercept only)    32.96    36.41     5
The above table is an example of an "analysis-of-deviance" table. This is like the ANOVA table you have seen in linear regression and similar models, where we look at the change in the fit statistic (e.g., the F-statistic) due to dropping or adding a parameter. Here, we check whether the change in deviance is significant.
If a pair of models is nested (i.e. the smaller model is a special case of the larger one) then we can test
H_{0} : smaller model is true
versus
H_{1} : larger model is true
by a likelihood-ratio test, comparing
ΔG^{2} = G^{2} for smaller model − G^{2} for larger model
or
ΔX^{2} = X^{2} for smaller model − X^{2} for larger model
to a χ^{2} distribution with degrees of freedom equal to
Δdf = df for smaller model − df for larger model.
This is analogous to testing whether a reduced model is true versus whether the full model is true in linear regression. Recall that the full model has more parameters, and setting some of them equal to zero gives the reduced model. Such models are nested, or hierarchical. The method described here holds ONLY for nested models.
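As an illustration (hypothetical Python arithmetic, using the G^{2} values reported in the tables for these data), we can carry out two of these nested comparisons with nothing more than the χ^{2} tail probability, which has simple closed forms for 1 and 2 degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Right-tail probability of a chi-square variate; closed forms for df = 1, 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("only df = 1 or 2 are handled in this sketch")

# S vs. S + B: Delta G^2 = 0.16 - 0.15 = 0.01 on 1 df -> dropping B loses nothing
p_drop_B = chi2_sf(0.16 - 0.15, df=1)

# Null vs. S: Delta G^2 = 36.41 - 0.16 = 36.25 on 2 df -> S clearly matters
p_drop_S = chi2_sf(36.41 - 0.16, df=2)

print(round(p_drop_B, 2), p_drop_S)  # p ~ 0.92 for dropping B; p near zero for dropping S
```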
For example, any model can be compared to the saturated model. The Null (interceptonly) model can be compared to any model above it. The S model or B model may be compared to S + B. But the S model cannot be directly compared to the B model because they are not nested.
The goal is to find a simple model that fits the data. For this purpose, it helps to add p-values to the analysis-of-deviance table. Here are the p-values associated with the G^{2} statistics:
Model                    G^{2}    df    p
Saturated                 0.00     0
S + B                     0.15     2    .928
S                         0.16     3    .984
B                        28.80     4    .000
Null (intercept only)    36.41     5    .000
From this table, we may conclude that the saturated, S + B, and S models all fit the data well, whereas the B and null models do not. Based on this, we should choose to represent our data by the S model, because it is the simplest one that fits well; thus D is conditionally independent of B given S.
But let us temporarily turn our attention to the saturated model to see how we can interpret the parameters. We can write this model as
\(\text{log}\left(\dfrac{\pi}{1\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\beta_4 X_1 X_2+\beta_5 X_1 X_3\) (5)
where π is the probability of delinquency,
X_{1} = 1 if B = scout,
X_{1} = 0 otherwise
is the main effect for B, and
X_{2} = 1 if S = medium,
X_{2} = 0 otherwise,
X_{3} = 1 if S = high,
X_{3} = 0 otherwise
are the main effects for S.
Let us fit this model in SAS (see scout.sas [24]):
The table of coefficients is shown below:
Here is the R code in the R program scout.R [25] that corresponds to scout4.sas:
The coefficient estimates from the output are shown below:
How do we interpret these effects? Referring back to equation (5),
\(\text{log}\left(\dfrac{\pi}{1\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\beta_4 X_1 X_2+\beta_5 X_1 X_3\)
we see that the logodds of delinquency for each S × B group are:
S        B           log-odds
low      scout       β_{0} + β_{1}
low      non-scout   β_{0}
medium   scout       β_{0} + β_{1} + β_{2} + β_{4}
medium   non-scout   β_{0} + β_{2}
high     scout       β_{0} + β_{1} + β_{3} + β_{5}
high     non-scout   β_{0} + β_{3}
Therefore,
β_{1} = log-odds for S = low, B = scout
 − log-odds for S = low, B = non-scout.
In other words, β_{1} gives the effect of scouting on delinquency when S = low. The estimate of β_{1} in the SAS output agrees with the B × D odds ratio for S = low,
\(\text{log}\left(\dfrac{11\times 169}{42 \times 43}\right)=0.0289\)
The effect of scouting for S = medium, however, is
β_{1} + β_{4} = log-odds for S = medium, B = scout
 − log-odds for S = medium, B = non-scout,
and the effect of scouting for S = high is
β_{1} + β_{5} = log-odds for S = high, B = scout
 − log-odds for S = high, B = non-scout.
Estimates of these latter two effects do not directly appear in the SAS or R output. How can we get them?
One way is to simply calculate them from the individual \(\hat{\beta}_j's\), and then get their standard errors from the elements of the estimated covariance matrix. For example, the estimated standard error for \(\hat{\beta}_1+\hat{\beta}_4\) is:
\(\sqrt{\hat{V}(\hat{\beta}_1)+\hat{V}(\hat{\beta}_4)+2 \hat{Cov}(\hat{\beta}_1,\hat{\beta}_4)}\)
Adding the covb option to the model statement in PROC LOGISTIC will cause SAS to print out the estimated covariance matrix. In R, one can use the summary() function and extract its cov.scaled component (see the scout.R code).
For example, since this is the saturated model, we know that the odds ratio for scouting at the S = medium level is
\(\left(\dfrac{14/104}{20 / 132}\right)=0.8885\)
and on the log scale, log(0.8885) = −0.1183. Reading the output from R or SAS, \(\hat{\beta}_1+\hat{\beta}_4=0.02892-0.14719=-0.11827\), which is what we expected. To get the standard error, we extract the appropriate elements from the estimated covariance matrix, i.e.,
\(\sqrt{\hat{V}(\hat{\beta}_1)+\hat{V}(\hat{\beta}_4)+2 \hat{Cov}(\hat{\beta}_1,\hat{\beta}_4)}=\sqrt{0.14389159+0.28251130+2\times(-0.14389159)}=0.3723167\)
(In practice, we would report these values to about four decimal places.) The values above were extracted from the R output for the estimated covariance matrix, i.e.,
> summary(result)$cov.scaled
                         (Intercept)  BscoutTRUE  SmediumTRUE    ShighTRUE  BscoutTRUE:SmediumTRUE  BscoutTRUE:ShighTRUE
(Intercept)               0.02972668 -0.02972668  -0.02972668  -0.02972668              0.02972668            0.02972668
BscoutTRUE               -0.02972668  0.14389159   0.02972668   0.02972668             -0.14389159           -0.14389159
SmediumTRUE              -0.02972668  0.02972668   0.08730244   0.02972668             -0.08730244           -0.02972668
ShighTRUE                -0.02972668  0.02972668   0.02972668   0.54667583             -0.02972668           -0.54667583
BscoutTRUE:SmediumTRUE    0.02972668 -0.14389159  -0.08730244  -0.02972668              0.28251130            0.14389159
BscoutTRUE:ShighTRUE      0.02972668 -0.14389159  -0.02972668  -0.54667583              0.14389159            0.79094277
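Because the model is saturated, \(\hat{\beta}_1+\hat{\beta}_4\) must equal the sample log odds ratio in the S = medium stratum, and its standard error has the familiar 2 × 2 form \(\sqrt{1/n_{11}+1/n_{12}+1/n_{21}+1/n_{22}}\). A quick check in Python (an illustration, not the course code):

```python
import math

# S = medium stratum: scouts (14 delinquent, 104 not), non-scouts (20 delinquent, 132 not)
n11, n12, n21, n22 = 14, 104, 20, 132

log_or = math.log((n11 / n12) / (n21 / n22))   # equals beta1-hat + beta4-hat
se = math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)  # its standard error

print(round(log_or, 4), round(se, 7))  # approximately -0.1183 and 0.3723167
```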
Another way is to recode the model so that the estimates of interest and their standard errors appear directly in the table of coefficients. Suppose that we define the following dummy variables:
X_{1} = 1 if S = low, 0 otherwise
X_{2} = 1 if S = low and B = scout, 0 otherwise
X_{3} = 1 if S = medium, 0 otherwise
X_{4} = 1 if S = medium and B = scout, 0 otherwise
X_{5} = 1 if S = high, 0 otherwise
X_{6} = 1 if S = high and B = scout, 0 otherwise
Then we fit the model
\(\text{log}\left(\dfrac{\pi}{1\pi}\right)=\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\beta_4 X_4 +\beta_5 X_5+\beta_6 X_6\)
Notice that this new model does not include an intercept; an intercept would cause a collinearity problem, because X_{1} + X_{3} + X_{5} = 1. Under this new coding scheme, the logodds of delinquency for each S × B group are:
S        B           log-odds
low      scout       β_{1} + β_{2}
low      non-scout   β_{1}
medium   scout       β_{3} + β_{4}
medium   non-scout   β_{3}
high     scout       β_{5} + β_{6}
high     non-scout   β_{5}
Therefore,
β_{2} = effect of scouting when S = low,
β_{4} = effect of scouting when S = medium,
β_{6} = effect of scouting when S = high.
SAS code for fitting this new model is shown below (see scout.sas [24]).
In the model statement, notice the use of the noint option to remove the intercept. The estimated table of coefficients is shown below.
Here is the R code from the R program scout.R [25] that corresponds to scout5.sas:
In R, you can exclude the intercept by including "-1" in the formula, as seen in the code above. The estimated table of coefficients is shown below.
I will leave it to you to verify that the estimates and standard errors for β_{2}, β_{4} and β_{6} correspond to the logodds ratios and standard errors that we get from analyzing the B × D tables for S = low, S = medium and S = high.
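If you want to do that verification without rerunning SAS or R, the stratum-specific log odds ratios and their large-sample standard errors come straight from the three 2 × 2 tables (a Python sketch, with counts taken from the 3 × 2 × 2 table earlier in this section):

```python
import math

# For each S level: scout row (delinquent yes, no) and non-scout row (yes, no)
strata = {
    "low": ((11, 43), (42, 169)),
    "medium": ((14, 104), (20, 132)),
    "high": ((8, 196), (2, 59)),
}

results = {}
for s, ((a, b), (c, d)) in strata.items():
    log_or = math.log((a * d) / (b * c))     # estimates of beta2, beta4, beta6
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)    # large-sample standard errors
    results[s] = (log_or, se)
    print(s, round(log_or, 4), round(se, 4))
```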
These types of analyses, models, and parameter interpretations extend to any k-way table, that is, tables with a binary response and any number of other categorical variables, each with any number of levels.
This is a special case of multiple logistic regression, where we have more than one explanatory variable, but all of them are categorical.
Whether the predictor variables are categorical, continuous, or a combination, model selection becomes important when dealing with multiple predictors. With logistic regression, as in ordinary multiple linear regression, we can use automated procedures such as stepwise selection or backward elimination. These are analogous to those in ordinary multiple regression, but with a change in the statistic used.
However, a better approach for logistic regression is to use the likelihood-ratio test (LRT) to decide whether variables can be eliminated, as we described with the Boy Scout example. If there are many categorical predictors, sparseness can be a problem for these automated algorithms. You can also use AIC and BIC for model selection; these will be discussed along with the log-linear models later. For an example with a larger dataset, and more on model selection, see the handouts for the Water Level Study data (water.sas and water.txt):
Go to Case Studies: The Water Level Study [28]; or you can also see relevant files on the SAS supplemental pages [29] (e.g., water_level1.sas, water_level2.sas). For corresponding R files see R supplemental pages [30] (e.g., water.R)
In the next lesson we will deal with logistic regression with continuous covariates and other advanced topics.
Logit models represent how a binary (or multinomial) response variable is related to a set of explanatory variables, which can be discrete and/or continuous. In this lesson we focused on binary logistic regression. Below is a brief summary and links to log-linear and probit models.
For a more detailed discussion refer to Agresti (2007), Ch.3, Agresti (2013), Ch.4, (pages 115118, 135132), and/or McCullagh & Nelder (1989).
They are related in the sense that log-linear models are more general than logit models, and some logit models are equivalent to certain log-linear models (e.g., recall the admissions data example or the Boy Scout example).
Both the logit and probit models describe how a binary response variable depends on a set of explanatory variables; they differ in the link function.
The logistic regression model
\(\text{logit}(\pi(x))=\text{log} \left(\dfrac{\pi(x)}{1\pi(x)}\right)=\beta_0+\beta x\)
uses the logistic cumulative distribution function (cdf).
The probit model
\(\text{probit}(\pi(x))=\beta_0+\beta x\)
uses the normal cdf; that is,
\(\text{probit}(\pi)=\Phi^{-1}(\pi)\)
where \(\Phi^{-1}\) is the inverse of the standard normal cdf. For example, probit(0.975) = 1.96, probit(0.950) = 1.64, and probit(0.5) = 0.
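These probit values are just inverse-normal quantiles, so you can reproduce them with any statistics library; for instance, with Python's standard library (an illustration, not part of the course materials):

```python
from statistics import NormalDist

# The probit function is the inverse cdf (quantile function) of the standard normal
probit = NormalDist().inv_cdf

print(round(probit(0.975), 2), round(probit(0.950), 2), probit(0.5))
# 1.96 1.64 0.0
```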
Fitted values between these two models are often very similar. Rarely does one of these models fit substantially better (or worse) than the other, although more difference can be observed with sparse data.
Think back to introductory statistics classes and approximating the binomial distribution with the normal.
For more on probit models, see Agresti (2007), Section 3.2.4, or Agresti (2013), Section 6.6.
Links:
[1] https://onlinecourses.science.psu.edu/stat504/node/217
[2] https://onlinecourses.science.psu.edu/stat504/node/216
[3] https://onlinecourses.science.psu.edu/stat504/node/117
[4] https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_logistic_sect003.htm
[5] https://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/logistic_sect35.htm#%20stat_logistic_logisticod
[6] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson05/LinRegExample.doc
[7] https://onlinecourses.science.psu.edu/stat504/node/84
[8] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson03/smokeindep.sas
[9] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson03/smokeindep.R
[10] https://www.dynamicdrive.com
[11] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smoke.sas
[12] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smokelog.sas
[13] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smokelog.lst
[14] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/dataformats.R
[15] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smoke.R
[16] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smoke.out
[17] https://onlinecourses.science.psu.edu/stat504/node/157
[18] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smokelog.R
[19] https://onlinecourses.science.psu.edu/stat504/node/152#multitab
[20] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/smoke.lst
[21] https://onlinecourses.science.psu.edu/stat504/#
[22] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/ROCandHL.R
[23] https://onlinecourses.science.psu.edu/stat504/node/102
[24] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/scout.sas
[25] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/scout.R
[26] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/WaterStudyLogisticRegression.pdf
[27] https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson06/WaterStudyModelSelection.pdf
[28] https://onlinecourses.science.psu.edu/stat504/node/197
[29] https://onlinecourses.science.psu.edu/stat504/node/206
[30] https://onlinecourses.science.psu.edu/stat504/node/207