The simplest model that one might propose is that ALL variables are independent of one another. Graphically, if we have three random variables, \(X\), \(Y\), and \(Z\) we can express this model as:
In this graph, the lack of connections between the nodes indicates no relationships exist among \(X\), \(Y\), and \(Z{,}\) or there is mutual independence. In the notation of loglinear models, which will learn about later as well, this model is expressed as (\(X\), \(Y\), \(Z\)). We will use this notation from here on. The notation indicates which aspects of the data we consider in the model. In this case, the independence model means we care only about the marginals totals \(n_{i++}, n_{+j+}, n_{++k}\) for each variable separately, and these pieces of information will be sufficient to fit a model and compute the expected counts. Alternatively, as we will see later on, we could keep track of the number of times some joint outcome of two variables occurs.
In terms of odds ratios, the model (\(X\), \(Y\), \(Z\)) implies that if we look at the marginal tables \(X \times Y\), \(X \times Z\) , and \(Y\times Z\) , then all of the odds ratios in these marginal tables are equal to 1. In other words, mutual independence implies marginal independence (i.e., there is independence in the marginal tables). All variables are really independent of one another.
Under this model the following must hold:
\(P(X=i,Y=j,Z=k)=P(X=i)P(Y=j)P(Z=k)\)
for all \(i, j, k\). That is, the joint probabilities are the product of the marginal probabilities. This is a simple extension from the model of independence in twoway tables where it was assumed:
\(P(X=i,Y=j)=P(X=i)P(Y=j)\)
We denote the marginal probabilities
\(\pi_{i++} = P(X = i), i = 1, 2, \ldots , I \)
\(\pi_{+j+} = P(Y = j), j = 1, 2, \ldots, J\)
\(\pi_{++k} = P(Z = k), k = 1, 2, \ldots , K\)
so that \(\pi_{ijk} = [\pi_{i++}\pi_{+j+}\pi_{++k}\) for all \(i, j, k\). Then the unknown parameters of the model of independence are
\(\pi_{i++} = (\pi_{1++}, \pi_{2++},\ldots, \pi_{I++})\)
\(\pi_{+j+} = (\pi_{+1+}, \pi_{+2+},\ldots , \pi_{+J+})\)
\(\pi_{++k} = (\pi_{++1}, \pi_{++2},\ldots , \pi_{++K})\)
Under the assumption that the model of independence is true, once we know the marginal probability values, we have enough information to estimate all unknown cell probabilities. Because each of the marginal probability vectors must add up to one, the number of free parameters in the model is \((I − 1) + (J − 1) + (K −I )\). This is exactly like the twoway table, but now one more set of additional parameter(s) needs to be taken care of for the additional random variable.
Consider the Berkeley admissions example, where under the independence model, the number of free parameters is \((21) + (21) + (61) = 7\), and the marginal distributions are
\((n_{1++}, n_{2++}, \ldots , n_{I++}) \sim Mult(n, \pi_{i++})\)
\((n_{+1+}, n_{+2+}, \ldots , n_{+J+}) \sim Mult(n, \pi_{+j+})\)
\((n_{++1}, n_{++2}, \ldots , n_{++K}) \sim Mult(n, \pi_{++k})\)
and these three vectors are mutually independent. Thus, the three parameter vectors \(\pi_{i++}, \pi_{+j+}\), and \(\pi_{++k}\) can be estimated independently of one another. The ML estimates are the sample proportions in the margins of the table,
\(\hat{\pi}_{i++}=p_{i++}=n_{i++}/n,\quad i=1,2,\ldots,I\)
\(\hat{\pi}_{+j+}=p_{+j+}=n_{+j+}/n,\quad j=1,2,\ldots,J\)
\(\hat{\pi}_{++k}=p_{++k}=n_{++k}/n,\quad k=1,2,\ldots,K\)
It then follows that the estimates of the expected cell counts are
\(E(n_{ijk})=n\hat{\pi}_{i++}\hat{\pi}_{+j+}\hat{\pi}_{++k}=\dfrac{n_{i++}n_{+j+}n_{++k}}{n^2}\)
Compare this to a twoway table, where the expected counts were:
\(E(n_{ij})=n_{i+}n_{+j}/n\).
For the Berkeley admission example, the marginal tables, i.e., counts are \(X=(2691, 1835)\) (for sex), \(Y=(1755 2771)\) (for admission status), and \(Z=(933, 585, 918, 792, 584, 714)\) (for department). The first (estimated) expected count would be \(E(n_{111})=2691(1755)(933)/4526^2=215.1015\), for example. Then, we compare these expected counts with the corresponding observed counts. That is, 215.1015 would be compared with \(n_{111}=512\) and so on.
ChiSquared Test of Independence Section
The hypothesis of independence can be tested using the general method described earlier:
\(H_0 \colon\) the independence model is true vs. \(H_A \colon\) the saturated model is true
In other words, we can check directly \(H_0\colon \pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}\) for all \(i, j, k\), vs. \(H_A\colon\) the saturated model
 Estimate the unknown parameters of the independence model, i.e.., the marginal probabilities.
 Calculate estimated cell probabilities and expected cell frequencies \(E_{ijk}=E(n_{ijk})\) under the model of independence.
 Calculate \(X^2\) and/or \(G^2\) by comparing the expected and observed values, and compare them to the appropriate chisquare distribution.
\(X^2=\sum\limits_i \sum\limits_j \sum\limits_k \dfrac{(E_{ijk}n_{ijk})^2}{E_{ijk}}\)
\(G^2=2\sum\limits_i \sum\limits_j \sum\limits_k n_{ijk} \log\left(\dfrac{n_{ijk}}{E_{ijk}}\right)\)
The degrees of freedom (DF) for this test are \(\nu = (IJK − 1) − [ (I − 1) + (J − 1) + (K − 1)]\). As before, this is a difference between the number of free parameters for the saturated model \((IJK1)\) and the number of free parameters in the current model of independence, \((I1)+(J1)+(K1)\). In the case of the Berkeley data, we have \(237 = 16\) degrees of freedom for the mutual independence test.
Recall that we also said that mutual independence implies marginal independence. So, if we reject marginal independence for any pair of variables, we can immediately reject mutual independence overall. For example, recall the significant result when testing for the marginal association between sex and admission status (\(X^2=92.205\)). Since we rejected marginal independence in that case, it would expect to reject mutual (complete) independence as well.
Example: Boys Scouts and Juvenile Delinquency Section
This is a \(2 \times 2 \times 3\) table. It classifies \(n = 800\) boys according to socioeconomic status (S), whether they participated in boy scouts (B), and whether they have been labeled as a juvenile delinquent (D):
Socioeconomic status  Boy scout  Delinquent  

Yes  No  
Low  Yes  11  43 
No  42  169  
Medium  Yes  14  104 
No  20  132  
High  Yes  8  196 
No  2  59 
To fit the mutual independence model, we need to find the marginal totals for B,
\(n_{1++} = 11 + 43 + 14 + 104 + 8 + 196 = 376\)
\(n_{2++} = 42 + 169 + 20 + 132 + 2 + 59 = 424\)
for D,
\(n_{+1+} = 11 + 42 + 14 + 20 + 8 + 2 = 97\)
\(n_{+2+} = 43 + 169 + 104 + 132 + 196 + 59 = 703\)
and for S,
\(n_{++1} = 11 + 43 + 42 + 169 = 265\)
\(n_{++2} = 14 + 104 + 20 + 132 = 270\)
\(n_{++3} = 8 + 196 + 2 + 59 = 265\)
Next, we calculate the expected counts for each cell,
\(E_{ijk}=\dfrac{n_{i++}n_{+j+}n_{++k}}{n^2}\)
and then calculate the chisquare statistic. The degrees of freedom for this test are \((2\times 2\times 31)[(21)+(21)+(31)]=7\) so the \(p\)value can be found as \(P(\chi^2_7 \geq X^2)\) and \(P(\chi^2_7 \geq G^2)\).
In SAS, we can use the following lines of code:
data answers;
a1=1probchi(218.6622,7);
;
proc print data=answers;
run;
In R, we can use 1pchisq(218.6622, 7).
The pvalues are essentially zero, indicating that the mutual independence model does not fit. Remember, in order for the chisquared approximation to work well, the \(E_{ijk}\) needs to be sufficiently large. Sufficiently large means that most of them (e.g., about at least 80%) should be at least five, and none should be less than one. We should examine the \(E_{ijk}\) to see if they are large enough.
Fortunately, both R and SAS will give the expected counts, (in parentheses) and the observed counts:
Socioeconomic status  Boy scout  Delinquent  

Yes  No  
Low  Yes  11 (15.102) 
43 (109.448) 
No  42 (17.030) 
169 (123.420) 

Medium  Yes  14 (15.387 ) 
104 
No  20 (17.351) 
132 (125.749) 

High  Yes 
8 
196 (109.448) 
No  2 (17.030) 
59 
Here, the expected counts are sufficiently large for the chisquare approximation to work well, and thus we must conclude that the variables B (boys scout), D (delinquent), and S (socioeconomic status) are not mutually independent.
There is no single function or a call in SAS nor R that will directly test the mutual independence model; see will see how to fit this model via a loglinear model. However, we can test this by relying on our understanding of twoway tables, and of marginal and partial tables and related odds ratios. For mutual independence to hold, all of the tests for independence in marginal tables must hold. Thus, we can do the analysis of all twoway marginal tables using the chisquared test of independence in each. Alternatively, we can consider the oddsratios in each twoway table. In this case, for example, the estimated odds ratio for the \(B \times D\) table, is 0.542, and it is not equal to 1; i.e., 1 is not in covered by the 95% oddsratio confidence interval, (0.347, 0.845). Therefore, we have sufficient evidence to reject the null hypothesis that boy scout status and delinquent status are independent of one another and thus that \(B\), \(D\), and \(S\) are not mutually independent.
The following SAS or R code supports the above analysis by testing independence of twoway marginal tables. Again, we will see later in the course that this is done more efficiently via loglinear models.
Here we will use the SAS code found in the program boys.sas as shown below.
data boys;
input SES $scout $ delinquent $ count @@;
datalines;
low yes yes 11 low yes no 43
low no yes 42 low no no 169
med yes yes 14 med yes no 104
med no yes 20 med no no 132
high yes yes 8 high yes no 196
high no yes 2 high no no 59
;
proc freq;
weight count;
tables SES;
tables scout;
tables delinquent;
tables SES*scout/all nocol nopct;
tables SES*delinquent/all nocol nopct;
tables scout*delinquent/all nocol nopct;
tables SES*scout*delinquent/chisq cmh expected nocol nopct;
run;
The output for this program can be found in this file: boys.lst.
For R, we will use the R program file boys.R to create contingency tables and carry out the test of mutual independence.
#### Test of conditional independence
#### Use of oddsratio(), confit() from {vcd}
#### CochranMantelHaenszel test
#### Also testing via Loglinear models
############################################
#### Input the table
deliquent=c("no","yes")
scout=c("no", "yes")
SES=c("low", "med","high")
table=expand.grid(deliquent=deliquent,scout=scout,SES=SES)
count=c(169,42,43,11,132,20,104,14,59,2,196,8)
table=cbind(table,count=count)
table
temp=xtabs(count~deliquent+scout+SES,table)
temp
#### Create "flat" contigency tables
ftable(temp)
#### Let's see how we can create various subtables
#### Oneway Table SES
Frequency=as.vector(margin.table(temp,3))
CumFrequency=cumsum(Frequency)
cbind(SES,Frequency=Frequency,Percentage=Frequency/sum(Frequency),CumFrequency=CumFrequency,CumPercentage=CumFrequency/sum(Frequency))
#### Oneway Table scout
Frequency=as.vector(margin.table(temp,2))
CumFrequency=cumsum(Frequency)
cbind(scout,Frequency=Frequency,Percentage=Frequency/sum(Frequency),CumFrequency=CumFrequency,CumPercentage=CumFrequency/sum(Frequency))
#### Oneway Table deliquent
Frequency=as.vector(margin.table(temp,1))
CumFrequency=cumsum(Frequency)
cbind(deliquent,Frequency=Frequency,Percentage=Frequency/sum(Frequency),CumFrequency=CumFrequency,CumPercentage=CumFrequency/sum(Frequency))
#### Test the Mutual Independence, step by step
#### compute the expected frequencies
E=array(NA,dim(temp))
for (i in 1:dim(temp)[1]) {
for (j in 1:dim(temp)[2]) {
for (k in 1:dim(temp)[3]) {
E[i,j,k]=(margin.table(temp,3)[k]*margin.table(temp,2)[j]*margin.table(temp,1))[i]/(sum(temp))^2
}}}
E
#### compute the X^2, and G^2
df=(length(temp)1)(sum(dim(temp))3)
X2=sum((tempE)^2/E)
X2
1pchisq(X2,df)
G2=2*sum(temp*log(temp/E))
G2
1pchisq(G2,df)
#### Test for Mutual indpendence (and other models) by considering analysis of all twoway tables
#### Twoway Table SES*scout
#### This is a test of Marginal Independence for SES and Scout
SES_Scout=margin.table(temp,c(3,2))
result=chisq.test(SES_Scout)
result
result$expected
result=chisq.test(SES_Scout,correct = FALSE)
result
SES_Scout=list(Frequency=SES_Scout,RowPercentage=prop.table(SES_Scout,2))
#### Twoway Table SES*deliquent
#### temp=xtabs(count~scout+SES+deliquent,table)
#### SES_deliquent=addmargins(temp)[1:2,1:3,3]
#### This is a test of Marginal Independence for SES and deliquent status
SES_deliquent=margin.table(temp,c(3,1))
result=chisq.test(SES_deliquent)
result
result$expected
result=chisq.test(SES_deliquent,correct = FALSE)
result
SES_deliquent=list(Frequency=SES_deliquent,RowPercentage=prop.table(SES_deliquent,2))
#### Twoway Table deliquent*scout
#### This is a test of Marginal Independence for Deliquent status and the Scout status
deliquent_scout=margin.table(temp,c(1,2))
result=chisq.test(deliquent_scout)
result$expected
result=chisq.test(deliquent_scout,correct = FALSE)
result
deliquent_scout1=list(Frequency=deliquent_scout,RowPercentage=prop.table(deliquent_scout,1))
#### compute the log(oddsraio), oddsratio and its 95% CI using {vcd} package
lor=oddsratio(deliquent_scout)
lor
OR=exp(lor)
OR
OR=oddsratio(deliquent_scout, log=FALSE)
OR
CI=exp(confint(lor))
CI
CI=confint(OR)
CI
#### Table of deliquent*scout at different level of SES
temp
#### Test for Joint Independence of (D,BS)
#### creating 6x2 table, BS x D
SESscout_deliquent=ftable(temp, row.vars=c(3,2))
result=chisq.test(SESscout_deliquent)
result
#### Test for Marginal Independence (see above analysis of twoway tables)
#### Test for conditional independence
#### To get partial tables of DB for each level of S
temp[,,1]
chisq.test(temp[,,1], correct=FALSE)
temp[,,2]
chisq.test(temp[,,2], correct=FALSE)
temp[,,3]
chisq.test(temp[,,3], correct=FALSE)
X2=sum(chisq.test(temp[,,1], correct=FALSE)$statistic+chisq.test(temp[,,2], correct=FALSE)$statistic+chisq.test(temp[,,3], correct=FALSE)$statistic)
1pchisq(X2,df=3)
#### CochranMantelHaenszel test
mantelhaen.test(temp)
mantelhaen.test(temp,correct=FALSE)
#### BreslowDay test
#### make sure to first source/run breslowday.test_.R
breslowday.test(temp)
################################################################
#### Loglinear Models
#### Let's see how we would do this via Log_linear models
#### We will look at the details later in the course
#### test of conditional independence
#### via loglinear model, but the object has to be a table!
is.table(temp) #### to check if table
temp<as.table(temp) #### to save as table
temp.condind<loglin(temp, list(c(1,3), c(2,3)), fit=TRUE, param=TRUE) #### fit the cond.indep. model
temp.condind
1pchisq(temp.condind$lrt, temp.condind$df)
#### There is no way to do BreslowDay stats in R;
#### you need to write your own function or fit the homogeneous association model
#### to test for identity of oddsratios
#### test of homogenous association model
temp.hom<loglin(temp, list(c(1,3), c(2,3), c(1,2)), fit=TRUE, param=TRUE)
temp.hom
1pchisq(temp.hom$lrt, temp.hom$df)
#### Here is how to do the same but with the glm() function in R
#### test of conditional independence
#### via loglinear mode, but using glm() funcion
#### the object now needs to be dataframe
temp.data<as.data.frame(temp)
temp.data
temp.condind<glm(temp.data$Freq~temp.data$scout*temp.data$SES+temp.data$deliquent*temp.data$SES, family=poisson())
summary(temp.condind)
fitted(temp.condind)
#### test of homogenous association model
temp.hom<glm(temp.data$Freq~temp.data$scout*temp.data$SES+temp.data$deliquent*temp.data$SES+temp.data$scout*temp.data$deliquent, family=poisson())
summary(temp.hom)
fitted(temp.hom)
The output file, boys.out contains the results for these procedures.