6.3.1 - Estimating Odds Ratios

Recall the \(3\times2\times2\) table that we examined in Lesson 5 that classifies 800 boys according to S = socioeconomic status, B = whether a boy is a scout, and D = juvenile delinquency status.

Socioeconomic status	Boy scout	Delinquent
Socioeconomic status	Boy scout	Yes	No
Low	Yes	11	43
Low	No	42	169
Medium	Yes	14	104
Medium	No	20	132
High	Yes	8	196
High	No	2	59

Because the outcome variable D is binary, we can express many models of interest using binary logistic regression. Before handling the full three-way table, let us consider the \(2\times2\) marginal table for B and D as we did in Lesson 5. We concluded that the boy scout status (B) and the delinquent status (D) are associated and that (from the table below)

Boy scout	Delinquent
Boy scout	Yes	No
Yes	33	343
No	64	360

the estimated log-odds ratio is

\(\log\left(\dfrac{33( 360)}{64 (343)}\right)=-0.6140\)

with a standard error of \(\sqrt{\dfrac{1}{33}+\dfrac{1}{343}+\dfrac{1}{64}+\dfrac{1}{360}}=0.2272\). That is, we estimate that being a boy scout is associated a lower log-odds of delinquency by 0.614; the odds-ratio is 0.541. Now let’s fit the logistic regression model

\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x\)

where \(x\) is a dummy variable

\(x\) = 0 if non-scout,
\(x\) = 1 if scout.

See the SAS code in the program scout.sas below:

options nocenter nodate nonumber linesize=72;

data scout1;
   input S $ y n;
   cards;
scout     33 376
nonscout  64 424
;

proc logist data=scout1;
  class S / param=ref ref=first;
  model y/n = S / scale=none;
  run;

Note that SAS will set the baseline to "non-scout" because it comes before "scout" in alphabetical order. Some output from this part of the program:


Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-1.7272	0.1357	162.1110	<.0001
S	scout	1	-0.6140	0.2272	7.3032	0.0069

The estimated coefficient of \(x\)

\(\hat{\beta}_1=-0.6140\)

is identical to the log-odds ratio from the analysis of the \(2\times2\) table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the \(2\times2\) table analysis. Also, in this model, \(\beta_0\) is the log-odds of delinquency for non-scouts (\(x=0\)). Looking at the \(2\times2\) table, the estimated log-odds for non-scouts is

\(\log\left(\dfrac{64}{360}\right)=-1.7272\)

which agrees with \(\hat{\beta}_0\) from the logistic model.

The goodness-of-fit statistics \(X^2\) and \(G^2\) from this model are both zeroes because the model is saturated. Let's fit the intercept-only model by removing the predictor from the model statement:

model y/n = / scale=none;

The goodness-of-fit statistics are shown below.


Deviance and Pearson Goodness-of-Fit Statistics
Criterion	Value	DF	Value/DF	Pr > ChiSq
Deviance	7.6126	1	7.6126	0.0058
Pearson	7.4652	1	7.4652	0.0063

Number of events/trials observations: 2

The Pearson statistic \(X^2=7.4652\) is identical to the ordinary chi-square statistic for testing independence in the \(2\times2\) table. And the deviance \(G^2= 7.6126\) is identical to the \(G^2\)for testing independence in the \(2\times2\) table.

The R code for this model and other models in this section for the boy scout example is in the R program scout.R. Here is part of the code that works with the \(2\times2\) (marginal) table between scouting and delinquency.

#### corresponds to the data scout1 in scout.SAS
#### Fits a Logistic regression with S="scout" or "nonscout"

S=factor(c("scout","nonscout"))
Sscout=(S=="scout")
Snonscout=(S=="nonscout")
y=c(33,64)
n=c(376,424)
count=cbind(y,n-y)
result=glm(count~Sscout+Snonscout,family=binomial("logit"))
summary(result)

Part of the R output includes:

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -1.7272     0.1357 -12.732  < 2e-16 ***
SscoutTRUE     -0.6140     0.2272  -2.702  0.00688 ** 
SnonscoutTRUE       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that R sets the baseline to "non-scout" because it comes before "scout" in alphabetical order. The estimated coefficient for scout is

\(\hat{\beta}_1=−0.6140\)

and is identical to the log-odds ratio from the analysis of the \(2\times2\) table. The standard error for \(\hat{\beta}_1\), 0.2272, is identical to the standard error that came from the \(2\times2\) table analysis.

Also, in this model, \(\beta_0\) is the log-odds of delinquency for nonscouts (\(x_i=0\)). Looking at the \(2\times2\) table, the estimated log-odds for non-scouts is

\(\log(64/360) = −1.7272\)

which agrees with \(\hat{\beta}_0\) from the logistic model.

The goodness-of-fit statistics are shown below:

   Null deviance: 7.6126e+00  on 1  degrees of freedom
Residual deviance: 4.0190e-14  on 0  degrees of freedom
AIC: 15.083

As we have seen before, the residual deviance is essentially 0 because the model is saturated. The null deviance corresponds to the deviance goodness-of-fit statistics in the SAS output. Here, the deviance \(G^2 = 7.6126\) is identical to the \(G^2\) for testing independence in the \(2\times2\) table.

NOTE! We have shown again that analyzing a \(2\times2\) table for association is equivalent to logistic regression with an indicator variable. Next, let us look at the rest of the data and generalize these analyses to \(I\times 2\) tables and \(I\times J\times 2\) tables.

Now let us do a similar analysis for the \(3\times2\) table that classifies subjects by S and D:

Socioeconomic status	Delinquent
Socioeconomic status	Yes	No
Low	53	212
Medium	34	236
High	10	255

Two odds ratios of interest are

\((53(236)) / (34( 212)) = 1.735\)
\((53( 255)) / (10 (212)) = 6.375\)

We estimate that the odds of delinquency for the S = low group are 1.735 times as high as for the S = medium group, and 6.375 times as high as for the S = high group. The estimated log odds ratios are

\(\log 1.735 = .5512\) and \(\log 6.375 = 1.852\)

and the standard errors are

\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{34}+\dfrac{1}{236}}=0.2392\)

\(\sqrt{\dfrac{1}{53}+\dfrac{1}{212}+\dfrac{1}{10}+\dfrac{1}{255}}=0.3571\)

Now let us replicate this analysis using logistic regression. First, we re-express the data in terms of \y_i\) = number of delinquents and \(n_i\) = number of boys for the \(i\)th row (SES group):

	\(y_i\)	\(n_i\)
Low	53	265
Medium	34	270
High	10	265

Then we define a pair of dummy indicators,

\(x_1=1\) if S=medium,
\(x_1=0\) otherwise,

\(x_2=1\) if S=high,
\(x_2=0\) otherwise.

Let \(\pi\) = probability of delinquency. Then, the model

\(\log\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2\)

says that the log-odds of delinquency are \(\beta_0\) for S = low, \(\beta_0+\beta_1\) for S = medium, and \(\beta_0+\beta_2\) for S = high.

The SAS code for fitting this model is shown below (see scout.sas).

options nocenter nodate nonumber linesize="72";
data new;
	input S $ y n;
    cardsl
low		53 265
medium		34 270
hight		10 265
;

proc logist data=new;
class S / order data param=ref ref=first;
model y/n = S scale=none;
run;

Some relevant portions of the output are shown below.


Deviance and Pearson Goodness-of-Fit Statistics
Criterion	Value	DF	Value/DF	Pr > ChiSq
Deviance	0.0000	0	.	.
Pearson	0.0000	0	.	.

Number of events/trials observations: 3

Think about the following question, then click on the button to show the answer.

Stop and Think!

Why are \(G^2\) and \(X^2= 0\)? What happened with information on boy scout status?

Since we are fitting the saturated model here, we are using all three lines of data. So, our current model IS the saturated model.

So, in comparing the current model with the saturated model, the deviance has to be zero.

As far the information about boy scout status, we have collapsed over this status. This is similar to when we collapsed down from a 3-way table down into a 2-way table and their is no information about boy scout status.


Model Fit Statistics
Criterion	Intercept Only	Intercept and Covariates
Criterion	Intercept Only	Log Likelihood	Full Log Likelihood
AIC	593.053	560.801	20.942
SC	597.738	574.855	34.995
-2 Log L	591.053	554.801	14.942


Testing Global Null Hypothesis: BETA=0
Test	Chi-Square	DF	Pr > ChiSq
Likelihood Ratio	36.2523	2	<.0001
Score	32.8263	2	<.0001
Wald	27.7335	2	<.0001


Type 3 Analysis of Effects
Effect	DF	Wald Chi-Square	Pr > ChiSq
S	2	27.7335	<.0001


Analysis of Maximum Likelihood Estimates
Parameter		DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept		1	-1.3863	0.1536	81.4848	<.0001
S	medium	1	-0.5512	0.2392	5.3080	0.0212
S	high	1	-1.8524	0.3571	26.9110	<.0001


Odds Ratio Estimates
Effect	Point Estimate	95% Wald Confidence Limits
S medium vs low	0.576	0.361	0.921
S high vs low	0.157	0.078	0.316

In this case, the "intercept only" model says that delinquency is unrelated to socioeconomic status, so the test of the global null hypothesis \(\beta_1=\beta_2=0\) is equivalent to the usual test for independence in the \(3\times2\) table. The estimated coefficients and SEs are as we predicted, and the estimated odds ratios are

\(\exp(−.5512) = 0.576 = 1/1.735\),
\(\exp(−1.852) = 0.157 = 1/6.375\)

Here is how we can fit this model in R (from scout.R, corresponds to the scout2 data in the scout.sas program).

S=factor(c("low","medium","high"))
y=c(53,34,10)
n=c(265,270,265)
count=cbind(y,n-y)
Smedium=(S=="medium")
Shigh=(S=="high")
result=glm(count~Smedium+Shigh,family=binomial("logit"))
summary(result)

Notice that we only use “Smedium” and “Shigh” in the model statement in glm(). So we set the baseline as "low" for this model.

R output:

Call:
glm(formula = count ~ Smedium + Shigh, family = binomial("logit"))

Deviance Residuals: 
[1]  0  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.3863     0.1536  -9.027  < 2e-16 ***
SmediumTRUE  -0.5512     0.2392  -2.304   0.0212 *  
ShighTRUE    -1.8524     0.3571  -5.188 2.13e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.6252e+01  on 2  degrees of freedom
Residual deviance: 6.8390e-14  on 0  degrees of freedom
AIC: 20.942

Number of Fisher Scoring iterations: 3

Stop and Think!

As happened in SAS, we get the residual deviance of almost zero in this case. Why? What happened with information on boy scout status?

Since we are fitting the saturated model here, we are using all three lines of data. So, our current model IS the saturated model.

So, in comparing the current model with the saturated model, the deviance has to be zero.

The null deviance is the \(G^2\) that corresponds to the deviance goodness-of-fit statistics found in the SAS output. Here, the deviance \(G^2 = 36.252\), so we can conclude that the delinquency is related to socioeconomic status. The test of the global null hypothesis \(\beta_1=\beta_2=0\) is equivalent to the usual test for independence in the \(3\times2\) table. The estimated coefficients and SEs are as we predicted, and the estimated odds ratios are

\(\exp(−.5512) = 0.576 = 1/1.735\),
\(\exp(−1.852) = 0.157 = 1/6.375\).

Notice that we did not say anything here about the scout status. We have "ignored" that information because we collapsed over that variable. Next, we will see how this plays out with logistic regression.