6.3.1  Connecting Logistic Regression to the Analysis of Two- and Three-Way Tables
Recall the 3 × 2 × 2 table that we examined in Lesson 5 that classifies 800 boys according to S = socioeconomic status, B = whether a boy is a scout, and D = juvenile delinquency status.
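| Socioeconomic status | Boy scout | Delinquent: Yes | Delinquent: No |
|---|---|---|---|
| Low | Yes | 11 | 43 |
| Low | No | 42 | 169 |
| Medium | Yes | 14 | 104 |
| Medium | No | 20 | 132 |
| High | Yes | 8 | 196 |
| High | No | 2 | 59 |
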
Because the outcome variable D is binary, we can express many models of interest using binary logistic regression.
Before handling the full three-way table, let us consider the 2 × 2 marginal table for B and D as we did in Lesson 5. We concluded that the boy scout status (B) and the delinquent status (D) are dependent and that
| Boy scout | Delinquent: Yes | Delinquent: No |
|---|---|---|
| Yes | 33 | 343 |
| No | 64 | 360 |

the estimated log-odds ratio is
\(\text{log}\left(\dfrac{33\times 360}{64 \times 343}\right)=-0.6140\)
with a standard error of \(\sqrt{\dfrac{1}{33}+\dfrac{1}{343}+\dfrac{1}{64}+\dfrac{1}{360}}=0.2272\). That is, we estimate that being a boy scout lowers the log-odds of delinquency by 0.614; the odds ratio is \(\exp(-0.6140)=0.541\).
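The arithmetic above can be checked directly. The following is a minimal Python sketch (it is not part of scout.sas or scout.R, and the variable names are illustrative) that recomputes the log-odds ratio, its standard error, and the odds ratio from the four cell counts.

```python
import math

# Marginal 2 x 2 table: scout status by delinquency, counts from the table above
scout_yes, scout_no = 33, 343
nonscout_yes, nonscout_no = 64, 360

# Log-odds ratio of delinquency, scouts versus non-scouts
log_or = math.log((scout_yes * nonscout_no) / (nonscout_yes * scout_no))

# Large-sample standard error: square root of the summed reciprocal cell counts
se = math.sqrt(1 / scout_yes + 1 / scout_no + 1 / nonscout_yes + 1 / nonscout_no)

# Back-transform to the odds-ratio scale
odds_ratio = math.exp(log_or)

print(f"log-odds ratio = {log_or:.4f}, SE = {se:.4f}, odds ratio = {odds_ratio:.3f}")
```

The negative sign reflects the direction of the comparison: swapping the two rows would give +0.6140 and an odds ratio of 1/0.541.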
Now let’s fit a logistic regression model,
\(\text{log}\left(\dfrac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1 X_i\)
where \(X_i\) is a dummy variable:
\(X_i = 0\) if non-scout,
\(X_i = 1\) if scout.
See the SAS code in the program scout.sas below:
The first category is "nonscout", because it comes before "scout" in the alphabetical order.
Some output from this part of the program:
The estimated coefficient of the dummy variable,
\(\hat{\beta}_1=-0.6140\)
is identical to the log-odds ratio from the analysis of the 2 × 2 table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the 2 × 2 table analysis.
Also, in this model, \(\beta_0\) is the log-odds of delinquency for non-scouts (\(X_i = 0\)). Looking at the 2 × 2 table, the estimated log-odds for non-scouts is
\(\text{log}\left(\dfrac{64}{360}\right)=-1.7272\)
which agrees with \(\hat{\beta}_0\) from the logistic model.
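To see this agreement without SAS or R, one can fit the model by hand. The sketch below is an illustration in plain Python, not the course's program: the hypothetical helper `fit_grouped_logistic` runs Newton-Raphson on the grouped binomial likelihood, assuming a single predictor and a starting value of zero for both coefficients.

```python
import math

def fit_grouped_logistic(x, y, n, iterations=25):
    """Newton-Raphson fit of logit(pi_i) = b0 + b1 * x_i to grouped binomial data.

    Returns the estimates (b0, b1) and their standard errors (se0, se1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (u0, u1) and information matrix [[i00, i01], [i01, i11]]
        u0 = u1 = i00 = i01 = i11 = 0.0
        for xi, yi, ni in zip(x, y, n):
            pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = yi - ni * pi          # observed minus fitted count
            w = ni * pi * (1.0 - pi)      # binomial variance weight
            u0 += resid
            u1 += resid * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} u, using the explicit 2 x 2 inverse
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    # Standard errors from the diagonal of I^{-1}, using the information
    # evaluated in the final iteration (the last step is negligible by then)
    se0 = math.sqrt(i11 / det)
    se1 = math.sqrt(i00 / det)
    return (b0, b1), (se0, se1)

# Two groups: non-scouts (x = 0) and scouts (x = 1); y = delinquents, n = boys
(b0, b1), (se0, se1) = fit_grouped_logistic(x=[0, 1], y=[64, 33], n=[424, 376])
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SE(b1) = {se1:.4f}")
```

With only two groups and two parameters the model is saturated, so the fitted coefficients coincide with the table-based quantities: the intercept is the non-scout log-odds and the slope is the log-odds ratio.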
The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zero, because the model is saturated. Let’s fit the intercept-only model by removing the predictor from the model statement, like this:
model y/n = / scale=none;
The goodness-of-fit statistics are shown below.
The Pearson statistic \(X^2 = 7.4652\) is identical to the ordinary \(X^2\) for testing independence in the 2 × 2 table (see the Lesson 5 notes and the boys.sas file). And the deviance \(G^2 = 7.6126\) is identical to the \(G^2\) for testing independence in the 2 × 2 table.
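As a cross-check on these two values, here is a short Python sketch (an illustration only, not taken from boys.sas or scout.sas) that computes the usual independence statistics for the 2 × 2 table from observed and expected counts.

```python
import math

observed = [[33, 343],   # scouts: delinquent yes, no
            [64, 360]]   # non-scouts: delinquent yes, no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

pearson_x2 = 0.0
deviance_g2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence of B and D
        expected = row_totals[i] * col_totals[j] / total
        pearson_x2 += (obs - expected) ** 2 / expected
        deviance_g2 += 2 * obs * math.log(obs / expected)

print(f"X^2 = {pearson_x2:.4f}, G^2 = {deviance_g2:.4f}")
```

Both statistics have 1 degree of freedom here, matching the one parameter dropped when the scout indicator is removed from the model.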
For the R code for this model and the other models in this section for the boy scout example, see the R program scout.R.
Here is the corresponding R code for scout1.sas that will produce the 2 × 2 table:
Part of the R output includes:
The baseline for this model is “nonscout” because it comes before “scout” in the alphabetical order. The estimated coefficient of the dummy variable,
\(\hat{\beta}_1=-0.6140\)
is identical to the log-odds ratio from the analysis of the 2 × 2 table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the 2 × 2 table analysis.
Also, in this model, \(\beta_0\) is the log-odds of delinquency for non-scouts (\(X_i = 0\)). Looking at the 2 × 2 table, the estimated log-odds for non-scouts is
\(\text{log}\left(\dfrac{64}{360}\right)=-1.7272\)
which agrees with \(\hat{\beta}_0\) from the logistic model.
The goodness-of-fit statistics are shown below:
As we discussed earlier for other R output, the residual deviance is essentially 0 because the model is saturated. The null deviance is the deviance of the intercept-only model, so it corresponds to the deviance goodness-of-fit statistic \(G^2\) in the SAS output for that model. Here, the deviance \(G^2 = 7.6126\) is identical to the \(G^2\) for testing independence in the 2 × 2 table.
Note: Thus we have shown again that analyzing a 2 × 2 table for association is equivalent to logistic regression with a dummy variable. Next let us look at the rest of the data and generalize these analyses to I × 2 tables and I × J × 2 tables.
Now let us do a similar analysis for the 3 × 2 table that classifies subjects by S and D:
| Socioeconomic status | Delinquent: Yes | Delinquent: No |
|---|---|---|
| Low | 53 | 212 |
| Medium | 34 | 236 |
| High | 10 | 255 |

Two odds ratios of interest are
(53 × 236) / (34 × 212) = 1.735,
(53 × 255) / (10 × 212) = 6.375.
We estimate that the odds of delinquency for the S = low group are 1.735 times as high as for the S = medium group, and 6.375 times as high as for the S = high group. The estimated log odds ratios are
log 1.735 = .5512 and log 6.375 = 1.852,
and the standard errors are
\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{34}+\dfrac{1}{236}}=0.2392\)
\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{10}+\dfrac{1}{255}}=0.3571\)
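The two comparisons follow the same recipe, so they can be wrapped in one small function. This is a hedged Python sketch with a hypothetical helper name, `log_odds_ratio`, intended only to reproduce the numbers above.

```python
import math

# Delinquency counts (yes, no) by socioeconomic status, from the 3 x 2 table
counts = {"low": (53, 212), "medium": (34, 236), "high": (10, 255)}

def log_odds_ratio(group_a, group_b):
    """Log-odds ratio of delinquency for group_a versus group_b, with its SE."""
    a_yes, a_no = counts[group_a]
    b_yes, b_no = counts[group_b]
    log_or = math.log((a_yes * b_no) / (b_yes * a_no))
    se = math.sqrt(1 / a_yes + 1 / a_no + 1 / b_yes + 1 / b_no)
    return log_or, se

low_vs_medium = log_odds_ratio("low", "medium")
low_vs_high = log_odds_ratio("low", "high")

for label, (log_or, se) in [("low vs medium", low_vs_medium),
                            ("low vs high", low_vs_high)]:
    print(f"{label}: OR = {math.exp(log_or):.3f}, log OR = {log_or:.4f}, SE = {se:.4f}")
```

Reversing the arguments, for example `log_odds_ratio("medium", "low")`, flips the sign of the log-odds ratio and leaves the standard error unchanged.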
Now let us replicate this analysis using logistic regression. First, we re-express the data in terms of \(y_i\) = number of delinquents and \(n_i\) = number of boys for the three S groups:
| S | \(y_i\) | \(n_i\) |
|---|---|---|
| Low | 53 | 265 |
| Medium | 34 | 270 |
| High | 10 | 265 |

Then we define a pair of dummy indicators,
\(X_1 = 1\) if S = medium, \(X_1 = 0\) otherwise,
\(X_2 = 1\) if S = high, \(X_2 = 0\) otherwise.
Let \(\pi\) = probability of delinquency. Then the model
\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2\)
says that the log-odds of delinquency are \(\beta_0\) for S = low, \(\beta_0+\beta_1\) for S = medium, and \(\beta_0+\beta_2\) for S = high.
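Because this model has three parameters for three groups, it is saturated, so the fitted log-odds reproduce the observed ones. As a worked check using the counts in the 3 × 2 table (a sketch of what the fit should return, not program output):
\(\hat{\beta}_0=\text{log}\left(\dfrac{53}{212}\right)=-1.3863\)
\(\hat{\beta}_1=\text{log}\left(\dfrac{34}{236}\right)-\text{log}\left(\dfrac{53}{212}\right)=-0.5512\)
\(\hat{\beta}_2=\text{log}\left(\dfrac{10}{255}\right)-\text{log}\left(\dfrac{53}{212}\right)=-1.8524\)
The last two values are the negatives of the log odds ratios computed earlier, because those compared low with medium and with high, whereas the dummy variables compare medium and high with low.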
The SAS code for fitting this model is shown below (see scout.sas).
Some relevant portions of the output are shown below.
Think about the following questions: Why are \(G^2\) and \(X^2\) equal to 0? What happened to the information on boy scout status?
In this case, the "intercept only" model says that delinquency is unrelated to socioeconomic status, so the test of the global null hypothesis \(\beta_1 = \beta_2 = 0\) is equivalent to the usual test for independence in the 3 × 2 table. The estimated coefficients and SEs are as we predicted, and the estimated odds ratios are
exp(−.5512) = 0.576 = 1/1.735,
exp(−1.852) = 0.157 = 1/6.375.
Here is the part of the R program file scout.R that corresponds to the scout2.sas program.
Notice that we only use “Smedium” and “Shigh” in the model statement in glm(). So we set the baseline as “low” for this model.
R output:
The null deviance is the deviance of the intercept-only model, so it corresponds to the deviance goodness-of-fit statistic \(G^2\) found in the SAS output for that model. Here, the deviance is \(G^2 = 36.252\) on 2 degrees of freedom, so we can conclude that delinquency is related to socioeconomic status. The test of the global null hypothesis \(\beta_1 = \beta_2 = 0\) is equivalent to the usual test for independence in the 3 × 2 table. The estimated coefficients and SEs are as we predicted, and the estimated odds ratios are
exp(−.5512) = 0.576 = 1/1.735,
exp(−1.852) = 0.157 = 1/6.375.
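The null deviance quoted above can be reproduced from the grouped data alone. The following Python sketch (illustrative, not part of scout.R) writes it as the binomial deviance of the intercept-only model, whose single fitted probability is the pooled delinquency rate.

```python
import math

# Grouped data: delinquents y_i out of n_i boys in each S group
groups = {"low": (53, 265), "medium": (34, 270), "high": (10, 265)}

total_y = sum(y for y, _ in groups.values())
total_n = sum(n for _, n in groups.values())
pooled = total_y / total_n  # fitted probability under the intercept-only model

null_deviance = 0.0
for y, n in groups.values():
    fitted_yes = n * pooled
    fitted_no = n * (1 - pooled)
    # Binomial deviance contribution: 2 * sum of observed * log(observed / fitted)
    null_deviance += 2 * (y * math.log(y / fitted_yes)
                          + (n - y) * math.log((n - y) / fitted_no))

df = len(groups) - 1  # two dummy coefficients are set to zero

# For a chi-square variable with exactly 2 df, P(X > x) = exp(-x / 2)
p_value = math.exp(-null_deviance / 2)

print(f"G^2 = {null_deviance:.3f} on {df} df, p = {p_value:.2e}")
```

The closed-form p-value holds only because the reference distribution has 2 degrees of freedom; other cases would need a chi-square routine from a statistics library.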
Notice that we did not say anything here about the scout status. We have "ignored" that information because we collapsed over that variable. Next we will see how this plays out with logistic regression.