6.3.1 - Connecting Logistic Regression to the Analysis of Two- and Three-way Tables


Recall the 3 × 2 × 2 table that we examined in Lesson 5 that classifies 800 boys according to S = socioeconomic status, B = whether a boy is a scout, and D = juvenile delinquency status.

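| Socioeconomic status | Boy scout | Delinquent: Yes | Delinquent: No |
|---|---|---|---|
| Low | Yes | 11 | 43 |
| Low | No | 42 | 169 |
| Medium | Yes | 14 | 104 |
| Medium | No | 20 | 132 |
| High | Yes | 8 | 196 |
| High | No | 2 | 59 |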

Because the outcome variable D is binary, we can express many models of interest using binary logistic regression.

Before handling the full three-way table, let us consider the 2 × 2 marginal table for B and D, as we did in Lesson 5:

| Boy scout | Delinquent: Yes | Delinquent: No |
|---|---|---|
| Yes | 33 | 343 |
| No | 64 | 360 |

We concluded that boy scout status (B) and delinquency status (D) are dependent and that the estimated log-odds ratio is

\(\text{log}\left(\dfrac{33\times 360}{64 \times 343}\right)=-0.6140\)

with a standard error of \(\sqrt{\dfrac{1}{33}+\dfrac{1}{343}+\dfrac{1}{64}+\dfrac{1}{360}}=0.2272\). That is, we estimate that being a boy scout lowers the log-odds of delinquency by 0.614; the estimated odds ratio is \(\exp(-0.614)=0.541\).
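The arithmetic above can be checked with a minimal Python sketch. Python is used here only for illustration (it is not part of the course's scout.sas or scout.R programs), and the function name `log_odds_ratio` is an assumption of this sketch.

```python
import math

# 2 x 2 marginal table from the text: rows = scout (yes, no),
# columns = delinquent (yes, no).
table = [[33, 343],
         [64, 360]]

def log_odds_ratio(t):
    """Return the sample log-odds ratio and its large-sample standard error."""
    (a, b), (c, d) = t
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

log_or, se = log_odds_ratio(table)
odds_ratio = math.exp(log_or)
print(f"log-odds ratio = {log_or:.4f}, SE = {se:.4f}, odds ratio = {odds_ratio:.3f}")
```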

Now let’s fit a logistic regression model,

\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_i\)

where \(X_i\) is a dummy variable

\(X_i\) = 0 if non-scout,
\(X_i\) = 1 if scout.

See the SAS code in the program scout.sas below:

SAS program

The first category is "nonscout", because it comes before "scout" in the alphabetical order.

Some output from this part of the program:

SAS output

The estimated coefficient of the dummy variable,

\(\hat{\beta}_1=-0.6140\)

is identical to the log-odds ratio from the analysis of the 2 × 2 table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the 2 × 2 table analysis.

Also, in this model, \(\beta_0\) is the log-odds of delinquency for non-scouts (\(X_i\) = 0). Looking at the 2 × 2 table, the estimated log-odds for non-scouts is:

\(\text{log}\left(\dfrac{64}{360}\right)=-1.7272\)

which agrees with  \(\hat{\beta}_0\) from the logistic model.
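To see where these estimates come from, here is a hedged Python sketch that fits the same two-parameter model by Newton-Raphson on the grouped binomial data. It is an illustration only, not what PROC LOGISTIC or glm() run internally in detail; the function name `fit_logistic` and the fixed iteration count are assumptions of this sketch.

```python
import math

# Grouped binomial data from the 2 x 2 table: (x, y, n) per group, where
# x = 1 for scouts, y = number delinquent, n = group size.
groups = [(0, 64, 424), (1, 33, 376)]

def fit_logistic(groups, iterations=25):
    """Newton-Raphson for logit(pi) = b0 + b1 * x with grouped binomial data."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[i00, i01], [i01, i11]].
        g0 = g1 = i00 = i01 = i11 = 0.0
        for x, y, n in groups:
            pi = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = n * pi * (1 - pi)
            g0 += y - n * pi
            g1 += x * (y - n * pi)
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        # Newton step: beta += I^{-1} g, using the explicit 2 x 2 inverse.
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    # Standard errors from the diagonal of I^{-1}; after convergence the
    # final step is negligible, so the last information matrix is used.
    se0 = math.sqrt(i11 / det)
    se1 = math.sqrt(i00 / det)
    return b0, b1, se0, se1

b0, b1, se0, se1 = fit_logistic(groups)
print(f"beta0 = {b0:.4f} (SE {se0:.4f}), beta1 = {b1:.4f} (SE {se1:.4f})")
```

Because the model is saturated, the fitted values reproduce the sample proportions, which is why the slope matches the log-odds ratio from the table.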

The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zero, because the model is saturated. Let’s fit the intercept-only model by removing the predictor from the model statement, like this:

model y/n = / scale=none;

The goodness-of-fit statistics are shown below.

SAS output

The Pearson statistic \(X^2\) = 7.4652 is identical to the ordinary \(X^2\) for testing independence in the 2 × 2 table (see the Lesson 5 notes and the boys.sas file). Likewise, the deviance \(G^2\) = 7.6126 is identical to the \(G^2\) for testing independence in the 2 × 2 table.
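The two independence statistics can be reproduced directly from the table. The following Python sketch is illustrative only; the helper name `independence_stats` is an assumption, and it requires every cell count to be positive.

```python
import math

# 2 x 2 marginal table from the text: rows = scout (yes, no),
# columns = delinquent (yes, no).
table = [[33, 343],
         [64, 360]]

def independence_stats(t):
    """Pearson X^2 and deviance G^2 for independence in a two-way table."""
    row_totals = [sum(row) for row in t]
    col_totals = [sum(col) for col in zip(*t)]
    total = sum(row_totals)
    x2 = g2 = 0.0
    for i, row in enumerate(t):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            x2 += (observed - expected) ** 2 / expected
            g2 += 2 * observed * math.log(observed / expected)
    return x2, g2

x2, g2 = independence_stats(table)
print(f"X2 = {x2:.4f}, G2 = {g2:.4f}")
```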

For the R code for this model and the other models in this section's boy scout example, see the R program scout.R.

Here is the R code that corresponds to scout1.sas and produces the 2 × 2 table:

 r code

Part of the R output includes:

R output

The baseline for this model is “nonscout” because it comes before “scout” in the alphabetical order. The estimated coefficient of the dummy variable,

\(\hat{\beta}_1=-0.6140\)

is identical to the log-odds ratio from the analysis of the 2 × 2 table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the 2 × 2 table analysis.

Also, in this model, \(\beta_0\) is the log-odds of delinquency for non-scouts (\(X_i\) = 0). Looking at the 2 × 2 table, the estimated log-odds for non-scouts is:

\(\text{log}\left(\dfrac{64}{360}\right)=-1.7272\)

which agrees with \(\hat{\beta}_0\) from the logistic model.

The goodness-of-fit statistics are shown below:

R output

As we discussed earlier for other R output, the residual deviance is almost 0 because the model is saturated. The null deviance is the deviance of the intercept-only model, so it corresponds to the deviance goodness-of-fit statistic \(G^2\) in the SAS output. Here, the deviance \(G^2\) = 7.6126 is identical to the \(G^2\) for testing independence in the 2 × 2 table.

Note: Thus we have shown again that analyzing a 2 × 2 table for association is equivalent to logistic regression with a dummy variable. Next let us look at the rest of the data and generalize these analyses to I × 2 tables and I × J × 2 tables.

Now let us do a similar analysis for the 3 × 2 table that classifies subjects by S and D:

| Socioeconomic status | Delinquent: Yes | Delinquent: No |
|---|---|---|
| Low | 53 | 212 |
| Medium | 34 | 236 |
| High | 10 | 255 |

Two odds ratios of interest are

(53 × 236) / (34 × 212) = 1.735,
(53 × 255) / (10 × 212) = 6.375.

We estimate that the odds of delinquency for the S = low group are 1.735 times as high as for the S = medium group, and 6.375 times as high as for the S = high group. The estimated log odds ratios are

log 1.735 = 0.5512 and log 6.375 = 1.852,

and the standard errors are

\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{34}+\dfrac{1}{236}}=0.2392\)

\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{10}+\dfrac{1}{255}}=0.3571\)
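As a check on these figures, the sketch below computes each odds ratio of the low group relative to another group, together with its log and standard error. It is a minimal Python illustration; the dictionary layout and the name `compare_to` are assumptions of this sketch.

```python
import math

# 3 x 2 table from the text: S level -> (delinquent, not delinquent).
counts = {"low": (53, 212), "medium": (34, 236), "high": (10, 255)}

def compare_to(reference, other):
    """Odds ratio of `reference` relative to `other`, its log, and the SE of the log."""
    a, b = counts[reference]
    c, d = counts[other]
    odds_ratio = (a * d) / (c * b)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, math.log(odds_ratio), se

for level in ("medium", "high"):
    odds_ratio, log_or, se = compare_to("low", level)
    print(f"low vs {level}: OR = {odds_ratio:.3f}, log OR = {log_or:.4f}, SE = {se:.4f}")
```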

Now let us replicate this analysis using logistic regression. First, we re-express the data in terms of \(y_i\) = number of delinquents and \(n_i\) = number of boys for the three S-groups:

| S | \(y_i\) | \(n_i\) |
|---|---|---|
| Low | 53 | 265 |
| Medium | 34 | 270 |
| High | 10 | 265 |

Then we define a pair of dummy indicators,

\(X_1\) = 1 if S = medium,
\(X_1\) = 0 otherwise,

\(X_2\) = 1 if S = high,
\(X_2\) = 0 otherwise.

Let π = probability of delinquency. Then the model

\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2\)

says that the log-odds of delinquency are \(\beta_0\) for S = low, \(\beta_0+\beta_1\) for S = medium, and \(\beta_0+\beta_2\) for S = high.
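The dummy coding can be made concrete with a short Python sketch. It relies on the fact that this model is saturated (three groups, three parameters), so the maximum likelihood coefficients can be read off the sample log-odds without iteration. The helper names `dummies` and `logit` are assumptions of this sketch.

```python
import math

# y = number of delinquents, n = number of boys in each S group (from the text).
data = {"low": (53, 265), "medium": (34, 270), "high": (10, 265)}

def dummies(level):
    """Return (X1, X2): X1 flags S = medium, X2 flags S = high; low is the baseline."""
    return (int(level == "medium"), int(level == "high"))

def logit(p):
    return math.log(p / (1 - p))

# Saturated model: fitted log-odds equal the sample log-odds in each group.
sample_logit = {level: logit(y / n) for level, (y, n) in data.items()}
beta0 = sample_logit["low"]
beta1 = sample_logit["medium"] - beta0
beta2 = sample_logit["high"] - beta0

for level in data:
    x1, x2 = dummies(level)
    eta = beta0 + beta1 * x1 + beta2 * x2
    print(f"{level:>6}: X1={x1}, X2={x2}, log-odds = {eta:.4f}")
print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}, beta2 = {beta2:.4f}")
```

With low as the baseline, the two slopes are the negatives of the log odds ratios computed from the 3 × 2 table.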

The SAS code for fitting this model is shown below (see scout.sas).

SAS program

Some relevant portions of the output are shown below.

SAS output

Think about the following question: Why are \(G^2\) and \(X^2\) equal to 0? What happened to the information on boy scout status?

SAS output

SAS output

In this case, the "intercept only" model says that delinquency is unrelated to socioeconomic status, so the test of the global null hypothesis \(H_0: \beta_1=\beta_2=0\) is equivalent to the usual test for independence in the 3 × 2 table. The estimated coefficients and standard errors are as we predicted: because S = low is the baseline, the coefficients are the negatives of the log odds ratios computed above, and the estimated odds ratios are

exp(−0.5512) = 0.576 = 1/1.735,
exp(−1.852) = 0.157 = 1/6.375.

Here is the part of the R program file scout.R that corresponds to the scout2.sas program.

r code

Notice that only “Smedium” and “Shigh” appear in the model formula passed to glm(), which makes “low” the baseline for this model.

R output:

R output

Think about the following question: as happened in SAS, we get a residual deviance of almost zero in this case. Why? What happened to the information on boy scout status?

The null deviance is the deviance of the intercept-only model and corresponds to the deviance goodness-of-fit statistic \(G^2\) found in the SAS output. Here, \(G^2\) = 36.252, so we conclude that delinquency is related to socioeconomic status. The test of the global null hypothesis \(H_0: \beta_1=\beta_2=0\) is equivalent to the usual test for independence in the 3 × 2 table. The estimated coefficients and standard errors are as we predicted, and the estimated odds ratios are

exp(−0.5512) = 0.576 = 1/1.735,
exp(−1.852) = 0.157 = 1/6.375.
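The null deviance quoted above can be recomputed from the grouped data. This Python sketch is an illustration under the stated setup, not the course's code: it fits the intercept-only model in closed form (one pooled probability) and compares it with the saturated model. The p-value uses the exact upper tail of the chi-square distribution with 2 degrees of freedom, which is exp(−x/2).

```python
import math

# y = number of delinquents, n = number of boys in each S group (from the text).
groups = [(53, 265), (34, 270), (10, 265)]

# Intercept-only model: a single delinquency probability for every group.
p_hat = sum(y for y, _ in groups) / sum(n for _, n in groups)

# Deviance relative to the saturated model, which reproduces each y / n exactly.
deviance = 0.0
for y, n in groups:
    deviance += 2 * (y * math.log(y / (n * p_hat))
                     + (n - y) * math.log((n - y) / (n * (1 - p_hat))))

df = len(groups) - 1
# For 2 degrees of freedom the chi-square upper tail has the closed form exp(-x / 2).
p_value = math.exp(-deviance / 2)
print(f"G2 = {deviance:.3f} on {df} df, p-value = {p_value:.2e}")
```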

Notice that we did not say anything here about the scout status. We have "ignored" that information because we collapsed over that variable.  Next we will see how this plays out with logistic regression.
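Collapsing over a variable simply means summing the counts across its levels. The Python sketch below rebuilds both marginal tables used in this section from the full 3 × 2 × 2 table; the dictionary layout and the function name `collapse` are assumptions made for illustration.

```python
# Full 3 x 2 x 2 table from the text:
# (S, B) -> (delinquent, not delinquent), with B = "scout" or "nonscout".
full = {
    ("low", "scout"): (11, 43),     ("low", "nonscout"): (42, 169),
    ("medium", "scout"): (14, 104), ("medium", "nonscout"): (20, 132),
    ("high", "scout"): (8, 196),    ("high", "nonscout"): (2, 59),
}

def collapse(table, keep):
    """Sum the counts over the factor that is not kept (keep = 0 for S, 1 for B)."""
    marginal = {}
    for key, (yes, no) in table.items():
        level = key[keep]
        old_yes, old_no = marginal.get(level, (0, 0))
        marginal[level] = (old_yes + yes, old_no + no)
    return marginal

by_status = collapse(full, keep=0)  # S x D table, scout status ignored
by_scout = collapse(full, keep=1)   # B x D table, socioeconomic status ignored
print(by_status)
print(by_scout)
```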