6.3.1 - Connecting Logistic Regression to the Analysis of Two- and Three-way Tables


Recall the 3 × 2 × 2 table that we examined in Lesson 5 that classifies 800 boys according to S = socioeconomic status, B = whether a boy is a scout, and D = juvenile delinquency status.

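| Socioeconomic status | Boy scout | Delinquent: Yes | Delinquent: No |
|---|---|---|---|
| Low | Yes | 11 | 43 |
| Low | No | 42 | 169 |
| Medium | Yes | 14 | 104 |
| Medium | No | 20 | 132 |
| High | Yes | 8 | 196 |
| High | No | 2 | 59 |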

Because the outcome variable D is binary, we can express many models of interest using binary logistic regression.

Before handling the full three-way table, let us consider the 2 × 2 marginal table for B and D, as we did in Lesson 5:

| Boy scout | Delinquent: Yes | Delinquent: No |
|---|---|---|
| Yes | 33 | 343 |
| No | 64 | 360 |

We concluded that boy scout status (B) and delinquency status (D) are dependent and that the estimated log-odds ratio is

$\text{log}\left(\dfrac{33\times 360}{64 \times 343}\right)=-0.6140$

with a standard error of $\sqrt{\dfrac{1}{33}+\dfrac{1}{343}+\dfrac{1}{64}+\dfrac{1}{360}}=0.2272$. That is, we estimate that being a boy scout lowers the log-odds of delinquency by 0.614; the odds-ratio is 0.541.
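The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are illustrative and are not taken from scout.sas or scout.R.

```python
import math

# Marginal 2 x 2 table of boy scout status (rows) by delinquency (columns).
# Variable names are illustrative only.
scout_delinq, scout_not = 33, 343
nonscout_delinq, nonscout_not = 64, 360

# Sample odds ratio: odds of delinquency for scouts relative to non-scouts
odds_ratio = (scout_delinq * nonscout_not) / (nonscout_delinq * scout_not)
log_odds_ratio = math.log(odds_ratio)

# Large-sample standard error of the log odds ratio:
# square root of the sum of the reciprocal cell counts
se_log_or = math.sqrt(
    1 / scout_delinq + 1 / scout_not + 1 / nonscout_delinq + 1 / nonscout_not
)

print(f"odds ratio     = {odds_ratio:.3f}")
print(f"log odds ratio = {log_odds_ratio:.4f}")
print(f"standard error = {se_log_or:.4f}")
```

The printed values should match the 0.541, -0.6140, and 0.2272 quoted in the text.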

Now let’s fit a logistic regression model,

$\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_i$

where $X_i$ is a dummy variable:

$X_i = 0$ if non-scout,
$X_i = 1$ if scout.

See the SAS code in the program scout.sas below:

The first category is "nonscout", because it comes before "scout" in the alphabetical order.

Some output from this part of the program:

The estimated coefficient of the dummy variable,

$\hat{\beta}_1=-0.6140$

is identical to the log-odds ratio from the analysis of the 2 × 2 table. The standard error for $\hat{\beta}_1$, 0.2272, is identical to the standard error that came from the 2 × 2 table analysis.

Also, in this model, $\beta_0$ is the log-odds of delinquency for non-scouts ($X_i = 0$). Looking at the 2 × 2 table, the estimated log-odds for non-scouts is:

$\text{log}\left(\dfrac{64}{360}\right)=-1.7272$

which agrees with  $\hat{\beta}_0$ from the logistic model.
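To see where these estimates come from, the sketch below maximizes the grouped binomial likelihood with a hand-written Newton-Raphson loop. It is an illustration under our own assumptions, not the algorithm used internally by SAS or R; the function name `fit_logistic_one_dummy` is hypothetical.

```python
import math

def fit_logistic_one_dummy(x, y, n, iterations=25):
    """Newton-Raphson for logit(pi) = b0 + b1 * x with grouped binomial counts.

    x: dummy value per group, y: number of events, n: group size.
    Hypothetical helper written for this example only.
    """
    b0 = b1 = 0.0
    for _ in range(iterations + 1):
        # Score vector (u0, u1) and Fisher information (i00, i01, i11)
        u0 = u1 = i00 = i01 = i11 = 0.0
        for xi, yi, ni in zip(x, y, n):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)
            u0 += yi - ni * p
            u1 += xi * (yi - ni * p)
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} U (2 x 2 inverse written out)
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    # Standard errors from the inverse information computed in the last pass;
    # by then the iterations have converged, so the last step is negligible.
    se0 = math.sqrt(i11 / det)
    se1 = math.sqrt(i00 / det)
    return b0, b1, se0, se1

# Group 1: non-scouts (X = 0), group 2: scouts (X = 1)
b0, b1, se0, se1 = fit_logistic_one_dummy(x=[0, 1], y=[64, 33], n=[424, 376])
print(f"beta0 = {b0:.4f}, beta1 = {b1:.4f}, SE(beta1) = {se1:.4f}")
```

Because the model is saturated, the fitted log-odds reproduce the sample log-odds exactly, which is why the result coincides with the 2 × 2 table calculation.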

The goodness-of-fit statistics $X^2$ and $G^2$ from this model are both zero because the model is saturated. Let’s fit the intercept-only model by removing the predictor from the model statement, like this:

model y/n = / scale=none;

The goodness-of-fit statistics are shown below.

The Pearson statistic $X^2 = 7.4652$ is identical to the ordinary $X^2$ for testing independence in the 2 × 2 table (see the Lesson 5 notes and the boys.sas file). Likewise, the deviance $G^2 = 7.6126$ is identical to the $G^2$ for testing independence in the 2 × 2 table.
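The equivalence can be verified directly: the fitted counts of the intercept-only model are the expected counts under independence. The following sketch computes both statistics from the table; the function name `independence_statistics` is our own and only illustrative.

```python
import math

def independence_statistics(table):
    """Return (X2, G2) for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G2
                g2 += 2 * observed * math.log(observed / expected)
    return x2, g2

# Rows: scout, non-scout; columns: delinquent yes, no
x2, g2 = independence_statistics([[33, 343], [64, 360]])
print(f"X2 = {x2:.4f}, G2 = {g2:.4f}")
```

The two values should agree with the 7.4652 and 7.6126 reported above.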

For the R code for this model and the other models in this section's boy scout example, see the R program scout.R.

Here is the corresponding R code for scout1.sas that will produce the 2 × 2 table:

Part of the R output includes:

The baseline for this model is “nonscout” because it comes before “scout” in the alphabetical order. The estimated coefficient of the dummy variable,

$\hat{\beta}_1=-0.6140$

is identical to the log-odds ratio from the analysis of the 2 × 2 table. The standard error for $\hat{\beta}_1$, 0.2272, is identical to the standard error that came from the 2 × 2 table analysis.

Also, in this model, $\beta_0$ is the log-odds of delinquency for non-scouts ($X_i = 0$). Looking at the 2 × 2 table, the estimated log-odds for non-scouts is:

$\text{log}\left(\dfrac{64}{360}\right)=-1.7272$

which agrees with $\hat{\beta}_0$ from the logistic model.

The goodness-of-fit statistics are shown below:

As we discussed earlier for other R output, the residual deviance is almost 0 because the model is saturated. The null deviance is the $G^2$ that corresponds to the deviance goodness-of-fit statistic for the intercept-only model in the SAS output. Here, the deviance $G^2 = 7.6126$ is identical to the $G^2$ for testing independence in the 2 × 2 table.

Note: Thus we have shown again that analyzing a 2 × 2 table for association is equivalent to logistic regression with a dummy variable. Next let us look at the rest of the data and generalize these analyses to I × 2 tables and I × J × 2 tables.

Now let us do a similar analysis for the 3 × 2 table that classifies subjects by S and D:

| Socioeconomic status | Delinquent: Yes | Delinquent: No |
|---|---|---|
| Low | 53 | 212 |
| Medium | 34 | 236 |
| High | 10 | 255 |

Two odds ratios of interest are

(53 × 236) / (34 × 212) = 1.735,
(53 × 255) / (10 × 212) = 6.375.

We estimate that the odds of delinquency for the S = low group are 1.735 times as high as for the S = medium group, and 6.375 times as high as for the S = high group. The estimated log odds ratios are

log 1.735 = 0.5512 and log 6.375 = 1.852,

and the standard errors are

$\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{34}+\dfrac{1}{236}}=0.2392$

$\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{10}+\dfrac{1}{255}}=0.3571$
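As a quick check of these figures, the sketch below compares each group against S = low. The dictionary layout and the helper name `compare_to_low` are illustrative choices of ours.

```python
import math

# Counts of (delinquent, not delinquent) for each socioeconomic group
counts = {"low": (53, 212), "medium": (34, 236), "high": (10, 255)}

def compare_to_low(group):
    """Odds ratio of delinquency (low vs. group), its log, and the SE of the log."""
    a, b = counts["low"]
    c, d = counts[group]
    odds_ratio = (a * d) / (c * b)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, math.log(odds_ratio), se

results = {group: compare_to_low(group) for group in ("medium", "high")}
for group, (odds_ratio, log_or, se) in results.items():
    print(f"low vs {group}: OR = {odds_ratio:.3f}, log OR = {log_or:.4f}, SE = {se:.4f}")
```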

Now let us replicate this analysis using logistic regression. First, we re-express the data in terms of $y_i$ = number of delinquents and $n_i$ = number of boys for the three S-groups:

| Socioeconomic status | $y_i$ | $n_i$ |
|---|---|---|
| Low | 53 | 265 |
| Medium | 34 | 270 |
| High | 10 | 265 |

Then we define a pair of dummy indicators,

X1 = 1 if S=medium,
X1 = 0 otherwise,

X2 = 1 if S=high,
X2 = 0 otherwise.

Let π = probability of delinquency. Then the model

$\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2$

says that the log-odds of delinquency are β0 for S = low, β0 + β1 for S = medium, and β0 + β2 for S = high.
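Since this model has three parameters for three groups, it is saturated, and the maximum likelihood estimates can be read off from the sample log-odds without iteration. The sketch below illustrates that mapping; it is not the fitting routine used by SAS or R, and the names are our own.

```python
import math

# Grouped data: y = number of delinquents, n = number of boys per S-group
y = {"low": 53, "medium": 34, "high": 10}
n = {"low": 265, "medium": 270, "high": 265}

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

# Sample log-odds of delinquency in each group
log_odds = {s: logit(y[s] / n[s]) for s in y}

# With S = low as the baseline (X1 = X2 = 0):
beta0 = log_odds["low"]                 # log-odds for S = low
beta1 = log_odds["medium"] - beta0      # medium vs. low
beta2 = log_odds["high"] - beta0        # high vs. low

print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}, beta2 = {beta2:.4f}")
print(f"exp(beta1) = {math.exp(beta1):.3f}, exp(beta2) = {math.exp(beta2):.3f}")
```

Note that beta1 and beta2 are negative here because they compare medium and high against low, the reverse of the direction used for the odds ratios 1.735 and 6.375.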

The SAS code for fitting this model is shown below (see scout.sas).

Some relevant portions of the output are shown below.

Think about the following questions: Why are $G^2$ and $X^2$ equal to 0? What happened to the information on boy scout status?

In this case, the "intercept only" model says that delinquency is unrelated to socioeconomic status, so the test of the global null hypothesis β1 = β2 = 0 is equivalent to the usual test for independence in the 3 × 2 table. The estimated coefficients and standard errors are as we predicted. Because S = low is the baseline, the coefficients are the negatives of the log odds ratios computed above, and the estimated odds ratios are

exp(−.5512) = 0.576 = 1/1.735,
exp(−1.852) = 0.157 = 1/6.375.

Here is the part of the R program scout.R that corresponds to the scout2.sas program.

Notice that we only use “Smedium” and “Shigh” in the model statement in glm(). So we set the baseline as “low” for this model.

R output:

Think about the following questions: As happened in SAS, we get a residual deviance of almost zero in this case. Why? What happened to the information on boy scout status?

The null deviance is the $G^2$ that corresponds to the deviance goodness-of-fit statistic found in the SAS output. Here, the deviance is $G^2 = 36.252$, so we can conclude that delinquency is related to socioeconomic status. The test of the global null hypothesis β1 = β2 = 0 is equivalent to the usual test for independence in the 3 × 2 table. The estimated coefficients and standard errors are as we predicted, and the estimated odds ratios are

exp(−.5512) = 0.576 = 1/1.735,
exp(−1.852) = 0.157 = 1/6.375.
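The null and residual deviances can be reproduced from the grouped binomial likelihood. The sketch below is a hand calculation under our own naming (`binomial_deviance` is a hypothetical helper), not output from glm().

```python
import math

# Grouped data for S = low, medium, high
y = [53, 34, 10]      # number of delinquents
n = [265, 270, 265]   # number of boys

def binomial_deviance(y, n, fitted_p):
    """Deviance G2 comparing observed counts with counts fitted by a model."""
    g2 = 0.0
    for yi, ni, pi in zip(y, n, fitted_p):
        # Each group contributes a "delinquent" and a "not delinquent" cell
        for observed, expected in ((yi, ni * pi), (ni - yi, ni * (1 - pi))):
            if observed > 0:
                g2 += 2 * observed * math.log(observed / expected)
    return g2

# Intercept-only model: one common probability for all three groups
p_hat = sum(y) / sum(n)
null_deviance = binomial_deviance(y, n, [p_hat] * len(y))

# Saturated model: each group gets its own sample proportion
residual_deviance = binomial_deviance(y, n, [yi / ni for yi, ni in zip(y, n)])

print(f"null deviance     = {null_deviance:.3f}")
print(f"residual deviance = {residual_deviance:.3g}")
```

The residual deviance is zero up to floating-point rounding because the fitted counts of the saturated model equal the observed counts.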

Notice that we did not say anything here about the scout status. We have "ignored" that information because we collapsed over that variable.  Next we will see how this plays out with logistic regression.