5.3.1 - Mutual (Complete) Independence

The simplest model that one might propose is that ALL variables are independent of one another. Graphically, if we have three random variables, \(X\), \(Y\), and \(Z\) we can express this model as:

In this graph, the lack of connections between the nodes indicates no relationships exist among \(X\), \(Y\), and \(Z{,}\) or there is mutual independence. In the notation of log-linear models, which will learn about later as well, this model is expressed as (\(X\), \(Y\), \(Z\)). We will use this notation from here on. The notation indicates which aspects of the data we consider in the model. In this case, the independence model means we care only about the marginals totals \(n_{i++}, n_{+j+}, n_{++k}\) for each variable separately, and these pieces of information will be sufficient to fit a model and compute the expected counts. Alternatively, as we will see later on, we could keep track of the number of times some joint outcome of two variables occurs.

In terms of odds ratios, the model (\(X\), \(Y\), \(Z\)) implies that if we look at the marginal tables \(X \times Y\), \(X \times Z\) , and \(Y\times Z\) , then all of the odds ratios in these marginal tables are equal to 1. In other words, mutual independence implies marginal independence (i.e., there is independence in the marginal tables). All variables are really independent of one another.

Under this model the following must hold:

\(P(X=i,Y=j,Z=k)=P(X=i)P(Y=j)P(Z=k)\)

for all \(i, j, k\). That is, the joint probabilities are the product of the marginal probabilities. This is a simple extension from the model of independence in two-way tables where it was assumed:

\(P(X=i,Y=j)=P(X=i)P(Y=j)\)

We denote the marginal probabilities

\(\pi_{i++} = P(X = i), i = 1, 2, \ldots , I \)

\(\pi_{+j+} = P(Y = j), j = 1, 2, \ldots, J\)

\(\pi_{++k} = P(Z = k), k = 1, 2, \ldots , K\)

so that \(\pi_{ijk} = [\pi_{i++}\pi_{+j+}\pi_{++k}\) for all \(i, j, k\). Then the unknown parameters of the model of independence are

\(\pi_{i++} = (\pi_{1++}, \pi_{2++},\ldots, \pi_{I++})\)

\(\pi_{+j+} = (\pi_{+1+}, \pi_{+2+},\ldots , \pi_{+J+})\)

\(\pi_{++k} = (\pi_{++1}, \pi_{++2},\ldots , \pi_{++K})\)

Under the assumption that the model of independence is true, once we know the marginal probability values, we have enough information to estimate all unknown cell probabilities. Because each of the marginal probability vectors must add up to one, the number of free parameters in the model is \((I − 1) + (J − 1) + (K −I )\). This is exactly like the two-way table, but now one more set of additional parameter(s) needs to be taken care of for the additional random variable.

Consider the Berkeley admissions example, where under the independence model, the number of free parameters is \((2-1) + (2-1) + (6-1) = 7\), and the marginal distributions are

\((n_{1++}, n_{2++}, \ldots , n_{I++}) \sim Mult(n, \pi_{i++})\)

\((n_{+1+}, n_{+2+}, \ldots , n_{+J+}) \sim Mult(n, \pi_{+j+})\)

\((n_{++1}, n_{++2}, \ldots , n_{++K}) \sim Mult(n, \pi_{++k})\)

and these three vectors are mutually independent. Thus, the three parameter vectors \(\pi_{i++}, \pi_{+j+}\), and \(\pi_{++k}\) can be estimated independently of one another. The ML estimates are the sample proportions in the margins of the table,

\(\hat{\pi}_{i++}=p_{i++}=n_{i++}/n,\quad i=1,2,\ldots,I\)

\(\hat{\pi}_{+j+}=p_{+j+}=n_{+j+}/n,\quad j=1,2,\ldots,J\)

\(\hat{\pi}_{++k}=p_{++k}=n_{++k}/n,\quad k=1,2,\ldots,K\)

It then follows that the estimates of the expected cell counts are

\(E(n_{ijk})=n\hat{\pi}_{i++}\hat{\pi}_{+j+}\hat{\pi}_{++k}=\dfrac{n_{i++}n_{+j+}n_{++k}}{n^2}\)

Compare this to a two-way table, where the expected counts were:

\(E(n_{ij})=n_{i+}n_{+j}/n\).

For the Berkeley admission example, the marginal tables, i.e., counts are \(X=(2691, 1835)\) (for sex), \(Y=(1755 2771)\) (for admission status), and \(Z=(933, 585, 918, 792, 584, 714)\) (for department). The first (estimated) expected count would be \(E(n_{111})=2691(1755)(933)/4526^2=215.1015\), for example. Then, we compare these expected counts with the corresponding observed counts. That is, 215.1015 would be compared with \(n_{111}=512\) and so on.

Chi-Squared Test of Independence Section

The hypothesis of independence can be tested using the general method described earlier:

\(H_0 \colon\) the independence model is true vs. \(H_A \colon\) the saturated model is true

In other words, we can check directly \(H_0\colon \pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}\) for all \(i, j, k\), vs. \(H_A\colon\) the saturated model

Estimate the unknown parameters of the independence model, i.e.., the marginal probabilities.
Calculate estimated cell probabilities and expected cell frequencies \(E_{ijk}=E(n_{ijk})\) under the model of independence.
Calculate \(X^2\) and/or \(G^2\) by comparing the expected and observed values, and compare them to the appropriate chi-square distribution.

\(X^2=\sum\limits_i \sum\limits_j \sum\limits_k \dfrac{(E_{ijk}-n_{ijk})^2}{E_{ijk}}\)

\(G^2=2\sum\limits_i \sum\limits_j \sum\limits_k n_{ijk} \log\left(\dfrac{n_{ijk}}{E_{ijk}}\right)\)

The degrees of freedom (DF) for this test are \(\nu = (IJK − 1) − [ (I − 1) + (J − 1) + (K − 1)]\). As before, this is a difference between the number of free parameters for the saturated model \((IJK-1)\) and the number of free parameters in the current model of independence, \((I-1)+(J-1)+(K-1)\). In the case of the Berkeley data, we have \(23-7 = 16\) degrees of freedom for the mutual independence test.

Recall that we also said that mutual independence implies marginal independence. So, if we reject marginal independence for any pair of variables, we can immediately reject mutual independence overall. For example, recall the significant result when testing for the marginal association between sex and admission status (\(X^2=92.205\)). Since we rejected marginal independence in that case, it would expect to reject mutual (complete) independence as well.

Stop and Think!

What is your conclusion? Would you reject the model of complete independence? Are these three variables mutually independent?

Example: Boys Scouts and Juvenile Delinquency Section

This is a \(2 \times 2 \times 3\) table. It classifies \(n = 800\) boys according to socioeconomic status (S), whether they participated in boy scouts (B), and whether they have been labeled as a juvenile delinquent (D):

Socioeconomic status	Boy scout	Delinquent
Socioeconomic status	Boy scout	Yes	No
Low	Yes	11	43
Low	No	42	169
Medium	Yes	14	104
Medium	No	20	132
High	Yes	8	196
High	No	2	59

To fit the mutual independence model, we need to find the marginal totals for B,

\(n_{1++} = 11 + 43 + 14 + 104 + 8 + 196 = 376\)

\(n_{2++} = 42 + 169 + 20 + 132 + 2 + 59 = 424\)

for D,

\(n_{+1+} = 11 + 42 + 14 + 20 + 8 + 2 = 97\)

\(n_{+2+} = 43 + 169 + 104 + 132 + 196 + 59 = 703\)

and for S,

\(n_{++1} = 11 + 43 + 42 + 169 = 265\)

\(n_{++2} = 14 + 104 + 20 + 132 = 270\)

\(n_{++3} = 8 + 196 + 2 + 59 = 265\)

Next, we calculate the expected counts for each cell,

\(E_{ijk}=\dfrac{n_{i++}n_{+j+}n_{++k}}{n^2}\)

and then calculate the chi-square statistic. The degrees of freedom for this test are \((2\times 2\times 3-1)-[(2-1)+(2-1)+(3-1)]=7\) so the \(p\)-value can be found as \(P(\chi^2_7 \geq X^2)\) and \(P(\chi^2_7 \geq G^2)\).

In SAS, we can use the following lines of code:

data answers;
a1=1-probchi(218.6622,7);
;
proc print data=answers;
run;

In R, we can use 1-pchisq(218.6622, 7).

The p-values are essentially zero, indicating that the mutual independence model does not fit. Remember, in order for the chi-squared approximation to work well, the \(E_{ijk}\) needs to be sufficiently large. Sufficiently large means that most of them (e.g., about at least 80%) should be at least five, and none should be less than one. We should examine the \(E_{ijk}\) to see if they are large enough.

Fortunately, both R and SAS will give the expected counts, (in parentheses) and the observed counts:

Socioeconomic status	Boy scout	Delinquent
Socioeconomic status	Boy scout	Yes	No
Low	Yes	11 (15.102)	43 (109.448)
Low	No	42 (17.030)	169 (123.420)
Medium	Yes	14 (15.387 )	104 (111.513)
Medium	No	20 (17.351)	132 (125.749)
High	Yes	8 (15.102)	196 (109.448)
High	No	2 (17.030)	59 (123.420)

Here, the expected counts are sufficiently large for the chi-square approximation to work well, and thus we must conclude that the variables B (boys scout), D (delinquent), and S (socioeconomic status) are not mutually independent.

Note! Most software packages should give you a warning if more than 20% of the expected cells are less than 5, and this may have an influence on large sample approximations.

There is no single function or a call in SAS nor R that will directly test the mutual independence model; see will see how to fit this model via a log-linear model. However, we can test this by relying on our understanding of two-way tables, and of marginal and partial tables and related odds ratios. For mutual independence to hold, all of the tests for independence in marginal tables must hold. Thus, we can do the analysis of all two-way marginal tables using the chi-squared test of independence in each. Alternatively, we can consider the odds-ratios in each two-way table. In this case, for example, the estimated odds ratio for the \(B \times D\) table, is 0.542, and it is not equal to 1; i.e., 1 is not in covered by the 95% odds-ratio confidence interval, (0.347, 0.845). Therefore, we have sufficient evidence to reject the null hypothesis that boy scout status and delinquent status are independent of one another and thus that \(B\), \(D\), and \(S\) are not mutually independent.

The following SAS or R code supports the above analysis by testing independence of two-way marginal tables. Again, we will see later in the course that this is done more efficiently via log-linear models.

Here we will use the SAS code found in the program boys.sas as shown below.

data boys;
input SES $scout $ delinquent $ count @@;
datalines;
low yes yes 11 low yes no 43
low no yes 42 low no no 169
med yes yes 14 med yes no 104
med no yes 20 med no no 132
high yes yes 8 high yes no 196
high no yes 2 high no no 59
;
proc freq;
 weight count;
 tables SES;
 tables scout;
 tables delinquent;
 tables SES*scout/all nocol nopct;
 tables SES*delinquent/all nocol nopct;
 tables scout*delinquent/all nocol nopct;
 tables SES*scout*delinquent/chisq cmh expected nocol nopct;
run;

The output for this program can be found in this file: boys.lst.

For R, we will use the R program file boys.R to create contingency tables and carry out the test of mutual independence.

#### Test of conditional independence
#### Use of oddsratio(), confit() from {vcd}
#### Cochran-Mantel-Haenszel test
#### Also testing via Log-linear models
############################################

#### Input the table

deliquent=c("no","yes")
scout=c("no", "yes")
SES=c("low", "med","high")
table=expand.grid(deliquent=deliquent,scout=scout,SES=SES)
count=c(169,42,43,11,132,20,104,14,59,2,196,8)
table=cbind(table,count=count)
table
temp=xtabs(count~deliquent+scout+SES,table)
temp

#### Create "flat" contigency tables

ftable(temp)

#### Let's see how we can create various subtables
#### One-way Table SES

Frequency=as.vector(margin.table(temp,3))
CumFrequency=cumsum(Frequency)
cbind(SES,Frequency=Frequency,Percentage=Frequency/sum(Frequency),CumFrequency=CumFrequency,CumPercentage=CumFrequency/sum(Frequency))

#### One-way Table scout

Frequency=as.vector(margin.table(temp,2))
CumFrequency=cumsum(Frequency)
cbind(scout,Frequency=Frequency,Percentage=Frequency/sum(Frequency),CumFrequency=CumFrequency,CumPercentage=CumFrequency/sum(Frequency))

#### One-way Table deliquent

Frequency=as.vector(margin.table(temp,1))
CumFrequency=cumsum(Frequency)
cbind(deliquent,Frequency=Frequency,Percentage=Frequency/sum(Frequency),CumFrequency=CumFrequency,CumPercentage=CumFrequency/sum(Frequency))

#### Test the Mutual Independence, step by step 
#### compute the expected frequencies

E=array(NA,dim(temp))
for (i in 1:dim(temp)[1]) {
for (j in 1:dim(temp)[2]) {
for (k in 1:dim(temp)[3]) {
E[i,j,k]=(margin.table(temp,3)[k]*margin.table(temp,2)[j]*margin.table(temp,1))[i]/(sum(temp))^2
}}}
E

#### compute the X^2, and G^2

df=(length(temp)-1)-(sum(dim(temp))-3)
X2=sum((temp-E)^2/E)
X2
1-pchisq(X2,df)
G2=2*sum(temp*log(temp/E))
G2
1-pchisq(G2,df)


#### Test for Mutual indpendence (and other models) by considering analysis of all two-way tables
#### Two-way Table SES*scout
#### This is a test of Marginal Independence for SES and Scout

SES_Scout=margin.table(temp,c(3,2))
result=chisq.test(SES_Scout)
result
result$expected
result=chisq.test(SES_Scout,correct = FALSE)
result
SES_Scout=list(Frequency=SES_Scout,RowPercentage=prop.table(SES_Scout,2))


#### Two-way Table SES*deliquent
#### temp=xtabs(count~scout+SES+deliquent,table)
#### SES_deliquent=addmargins(temp)[1:2,1:3,3]
#### This is a test of Marginal Independence for SES and deliquent status

SES_deliquent=margin.table(temp,c(3,1))
result=chisq.test(SES_deliquent)
result
result$expected
result=chisq.test(SES_deliquent,correct = FALSE)
result
SES_deliquent=list(Frequency=SES_deliquent,RowPercentage=prop.table(SES_deliquent,2))

#### Two-way Table deliquent*scout
#### This is a test of Marginal Independence for Deliquent status and the Scout status

deliquent_scout=margin.table(temp,c(1,2))
result=chisq.test(deliquent_scout)
result$expected
result=chisq.test(deliquent_scout,correct = FALSE)
result
deliquent_scout1=list(Frequency=deliquent_scout,RowPercentage=prop.table(deliquent_scout,1))

#### compute the log(oddsraio), oddsratio and its 95% CI using {vcd} package

lor=oddsratio(deliquent_scout)
lor
OR=exp(lor)
OR
OR=oddsratio(deliquent_scout, log=FALSE)
OR
CI=exp(confint(lor))
CI
CI=confint(OR)
CI


#### Table of deliquent*scout at different level of SES

temp

#### Test for Joint Independence of (D,BS)
#### creating 6x2 table, BS x D

SESscout_deliquent=ftable(temp, row.vars=c(3,2))
result=chisq.test(SESscout_deliquent)
result

#### Test for Marginal Independence (see above analysis of two-way tables)

#### Test for conditional independence
#### To get partial tables of DB for each level of S

temp[,,1]
chisq.test(temp[,,1], correct=FALSE)
temp[,,2]
chisq.test(temp[,,2], correct=FALSE)
temp[,,3]
chisq.test(temp[,,3], correct=FALSE)
X2=sum(chisq.test(temp[,,1], correct=FALSE)$statistic+chisq.test(temp[,,2], correct=FALSE)$statistic+chisq.test(temp[,,3], correct=FALSE)$statistic)
1-pchisq(X2,df=3)

#### Cochran-Mantel-Haenszel test

mantelhaen.test(temp)
mantelhaen.test(temp,correct=FALSE) 

#### Breslow-Day test
#### make sure to first source/run breslowday.test_.R

breslowday.test(temp)

################################################################
#### Log-linear Models
#### Let's see how we would do this via Log_linear models 
#### We will look at the details later in the course
#### test of conditional independence
#### via loglinear model, but the object has to be a table!

is.table(temp)  #### to check if table
temp<-as.table(temp) #### to save as table
temp.condind<-loglin(temp, list(c(1,3), c(2,3)), fit=TRUE, param=TRUE) #### fit the cond.indep. model
temp.condind
1-pchisq(temp.condind$lrt, temp.condind$df)

#### There is no way to do Breslow-Day stats in R; 
#### you need to write your own function or fit the homogeneous association model
#### to test for identity of odds-ratios

#### test of homogenous association model

temp.hom<-loglin(temp, list(c(1,3), c(2,3), c(1,2)), fit=TRUE, param=TRUE)
temp.hom
1-pchisq(temp.hom$lrt, temp.hom$df)


#### Here is how to do the same but with the glm() function in R
#### test of conditional independence
#### via loglinear mode, but using glm() funcion
#### the object now needs to be dataframe

temp.data<-as.data.frame(temp)
temp.data
temp.condind<-glm(temp.data$Freq~temp.data$scout*temp.data$SES+temp.data$deliquent*temp.data$SES, family=poisson())
summary(temp.condind)
fitted(temp.condind)

#### test of homogenous association model

temp.hom<-glm(temp.data$Freq~temp.data$scout*temp.data$SES+temp.data$deliquent*temp.data$SES+temp.data$scout*temp.data$deliquent, family=poisson())
summary(temp.hom)
fitted(temp.hom)

The output file, boys.out contains the results for these procedures.

Stop and Think!

If two or more variables in a \(k\)-way table are not independent, then where is this difference coming from? That is, what are some other possible relationships that hold? What are some other models that can capture this data? Maybe \(S\) and \(B\) are jointly independent of \(D\)?