3.3  Test for Independence
Two-way contingency tables show the joint behavior (distribution) of two discrete variables. Given an \(I\times J\) table, it is therefore natural to ask whether, and how, \(Y\) and \(Z\) are related.
Suppose for the moment that there is no relationship between \(Y\) and \(Z\), i.e., they are independent.
 Statistical Independence

By statistical independence, we mean that the joint probabilities are equal to the product of the marginal probabilities:
\(\pi_{ij}=P(Y=i,Z=j)=P(Y=i)P(Z=j)\)
\(\pi_{ij}=\pi_{i+}\pi_{+j}\) for all pairs \((i, j)\). If the variables are independent and we know the marginal totals, then we can determine the exact probabilities for each of the cells. Independence may be expressed in other forms as well, but for now we will use this definition. Intuitively, independence means that the behavior of one variable, say \(Y\), is not affected by the behavior of \(Z\). In the Vitamin C example, independence means that whether or not a skier takes vitamin C has nothing to do with whether or not he/she catches a cold.
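As a small numeric illustration with hypothetical marginal probabilities (not the Vitamin C data): suppose \(\pi_{1+} = 0.5\) and \(\pi_{+1} = 0.2\) in a \(2\times2\) table. Under independence,
\(\pi_{11}=\pi_{1+}\pi_{+1}=0.5\times0.2=0.10,\qquad \pi_{12}=\pi_{1+}\pi_{+2}=0.5\times0.8=0.40\)
and similarly \(\pi_{21}=0.10\) and \(\pi_{22}=0.40\), so the marginals completely determine the joint distribution.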
Chi-Squared Test of Independence
The hypothesis of independence can be tested using the general method of the goodness-of-fit test described earlier.
\(H_0\): the independence model is true, i.e., \(\pi_{ij} = \pi_{i+}\pi_{+j}\) for all pairs \((i, j)\)
versus the alternative
\(H_A\): the saturated model is true, i.e., \(\pi_{ij} \ne \pi_{i+}\pi_{+j}\) for at least one pair \((i, j)\)
 Step 1
Calculate the expected counts under the independence model.
 Step 2
Compare the expected counts \(E_{ij}\) to the observed counts \(O_{ij}\).
 Step 3
Calculate \(X^2\) and/or \(G^2\) for testing the hypothesis of independence, and compare the values to the appropriate chi-squared distribution with the correct df, \((I-1)(J-1)\).
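The three steps can be sketched in a few lines of code. The lesson's own examples use R and SAS; the following is just a minimal language-neutral sketch in Python that mirrors the formulas, applied to the Vitamin C counts used later in this lesson:

```python
# Minimal sketch of the independence test for an I x J table.
# Counts are from the Vitamin C example in this lesson:
# rows = treatment (placebo, ascorbic), columns = response (cold, nocold).
from math import log

O = [[31, 109],
     [17, 122]]

I, J = len(O), len(O[0])
row = [sum(O[i]) for i in range(I)]                        # n_{i+}
col = [sum(O[i][j] for i in range(I)) for j in range(J)]   # n_{+j}
n = sum(row)                                               # n_{++}

# Step 1: expected counts under independence, E_ij = n_{i+} n_{+j} / n_{++}
E = [[row[i] * col[j] / n for j in range(J)] for i in range(I)]

# Steps 2-3: compare O to E via the Pearson and deviance statistics
X2 = sum((O[i][j] - E[i][j]) ** 2 / E[i][j] for i in range(I) for j in range(J))
G2 = 2 * sum(O[i][j] * log(O[i][j] / E[i][j]) for i in range(I) for j in range(J))
df = (I - 1) * (J - 1)

print(round(X2, 4), round(G2, 4), df)   # compare with the SAS output below
```

The values agree with the SAS and R results reported for this example later in the section.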
Before we see how to do this in R and SAS, let's look more closely at the saturated and independence models in \(I\times J\) tables.
Saturated Model
Suppose that the cell counts \(n_{11}, \dots , n_{IJ}\) have a multinomial distribution with index \(n_{++} = n\) and parameters
\(\pi=(\pi_{11},\ldots,\pi_{IJ})\)
(these results also work for Poisson or product-multinomial sampling). If the saturated model is true, the number of unknown parameters we need to estimate is maximal (just like in one-way tables) and equal to the number of cells in the table. But because the elements of \(\pi\) must sum to one, the saturated model actually has \(IJ - 1\) free parameters. For example, if we had a \(3\times5\) table, we would have \(15 - 1 = 14\) unique parameters \(\pi_{ij}\) to estimate.
Independence Model
If the two variables Y and Z are independent, then \(\pi\) has a special form. Let
\(\pi_{i+} = P(Y = i), i = 1, 2, \ldots , I \)
\(\pi_{+j} = P(Z = j), j = 1, 2, \ldots, J\)
Note that \(\sum\limits_{i=1}^I \pi_{i+}=\sum\limits_{j=1}^J \pi_{+j}=1\), so the vectors \((\pi_{1+}, \pi_{2+}, \ldots, \pi_{I+})\) and \((\pi_{+1}, \pi_{+2}, \ldots, \pi_{+J})\), representing the marginal distributions (row probabilities and column probabilities), contain \(I - 1\) and \(J - 1\) free parameters, respectively.
If \(Y\) and \(Z\) are independent, then any element of \(\pi\) can be expressed as
\(\pi_{ij}=P(Y=i)P(Z=j)=\pi_{i+}\pi_{+j}\)
Thus, under independence, \(\pi\) is a function of \((I − 1) + (J − 1)\) unknown parameters.
The parameters under the independence model can be estimated as follows. Note that the vector of row sums (observed marginal counts for the row variable \(Y\)), \((n_{1+},n_{2+},\ldots,n_{I+})\) has a multinomial distribution with index \(n\) and parameter \(\pi_{i+}\). The vector of column sums (observed marginal counts for the column variable \(Z\)), \((n_{+1},n_{+2},\ldots,n_{+J})\) has a multinomial distribution with index \(n = n_{++}\) and parameter \(\pi_{+j}\). The elements of \(\pi_{i+}\) and \(\pi_{+j}\) can thus be estimated by the sample proportions in each margin:
\(\hat{\pi}_{i+}=n_{i+}/n_{++},\qquad i=1,2,\ldots,I\)
and
\(\hat{\pi}_{+j}=n_{+j}/n_{++},\qquad j=1,2,\ldots,J\)
respectively.
Then, the estimated expected cell frequencies under the independence model are
\(E_{ij}=n\hat{\pi}_{ij}=n\hat{\pi}_{i+}\hat{\pi}_{+j}=\dfrac{n_{i+}n_{+j}}{n_{++}}\)
which can be remembered as
expected frequency = (row total × column total) / grand total.
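For example, using the Vitamin C counts given later in this lesson (\(n_{1+}=140\) placebo skiers, \(n_{+1}=48\) skiers with colds, and \(n_{++}=279\) in total), the expected count for the (placebo, cold) cell is
\(E_{11}=\dfrac{n_{1+}n_{+1}}{n_{++}}=\dfrac{140\times 48}{279}\approx 24.09\)
compared with the observed count \(O_{11}=31\).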
Pearson and Deviance Statistics
Since for jointly observed \((Y,Z)\) the two-way table counts can be viewed as a single multinomial distribution with \(IJ\) categories, we can apply the chi-square approximation in the same way we applied it for the goodness-of-fit tests; we just need to adapt to the double index and sum over all cells in both dimensions. That is, the quantity
\(\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(n_{ij}-n\pi_{ij})^2}{n\pi_{ij}}\)
is approximately chi-squared with degrees of freedom equal to \(\nu=IJ-1\). And if we have a null hypothesis in mind, say independence, we can use the estimated probabilities under that hypothesis to construct both the Pearson and likelihood-ratio test statistics. The degrees of freedom are then reduced by the number of estimated parameters. In what follows, we assume the null hypothesis (reduced model) to be tested is that of independence, but other hypotheses could also be considered.
 Pearson goodness-of-fit statistic
 The Pearson goodness-of-fit statistic is
\(X^2=\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(n_{ij}-n\hat{\pi}_{ij})^2}{n\hat{\pi}_{ij}}\)
where \(\hat{\pi}_{ij}=(n_{i+}/n)(n_{+j}/n)\) under the independence model. Note that this expression still corresponds to
\(X^2=\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}\)
where \(O_{ij} = n_{ij}\) is the observed count and \(E_{ij} = E(n_{ij})\) is the expected count under the null hypothesis.
The deviance statistic can similarly be calculated for two-way tables:
\(G^2=2\sum\limits_{i=1}^I \sum\limits_{j=1}^J n_{ij}\log\left(\dfrac{n_{ij}}{n\hat{\pi}_{ij}}\right) = 2\sum\limits_{i=1}^I \sum\limits_{j=1}^J O_{ij}\log\left(\dfrac{O_{ij}}{E_{ij}}\right)\)
\(G^2\) is also called the likelihood-ratio test statistic or likelihood-ratio chi-squared statistic. Recall from the discussion of one-way tables that we are comparing the likelihoods of the assumed model under \(H_0\) and some alternative model \(H_A\), typically the saturated model (i.e., the observed data) by default. More generally, the likelihood-ratio test statistic can be described as follows:
Let \(\max L(H_0)\) be the maximum of the likelihood when the parameters satisfy \(H_0\); \(H_0\) usually has more restrictions on the parameters.
Let \(\max L(H_A)\) be the maximum of the likelihood when the parameters satisfy \(H_A\); \(H_A\) usually has no or fewer restrictions on the parameters.
Then the likelihood-ratio statistic is:
\(\Lambda=\dfrac{\max L(H_0)}{\max L(H_A)}\)
and the deviance \(G^2 = −2\log(\Lambda)\).
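To see why \(G^2 = -2\log(\Lambda)\) reduces to the deviance formula above, note that under multinomial sampling the log-likelihood is \(\ell(\pi)=\sum_{i,j} n_{ij}\log\pi_{ij}\) (up to an additive constant). The saturated model maximizes this at \(\hat{\pi}_{ij}=n_{ij}/n\), while the null model maximizes it at the estimate \(\hat{\pi}_{ij}\) under \(H_0\) (for independence, \(\hat{\pi}_{i+}\hat{\pi}_{+j}\)). Therefore
\(G^2=-2\log\Lambda=2\sum\limits_{i=1}^I \sum\limits_{j=1}^J n_{ij}\log\dfrac{n_{ij}/n}{\hat{\pi}_{ij}}=2\sum\limits_{i=1}^I \sum\limits_{j=1}^J O_{ij}\log\dfrac{O_{ij}}{E_{ij}}\)
since \(E_{ij}=n\hat{\pi}_{ij}\).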
The smaller the likelihood under \(H_0\) (i.e., the less plausible the restricted model is, given the data), the more evidence there is against \(H_0\); that is, the smaller \(\Lambda\) and the greater \(G^2\).
What are the degrees of freedom for this test?
The general rule for DF
DF are equal to the number of parameters specified (estimated) under the alternative model (hypothesis) minus the number of parameters estimated under the null model (hypothesis).
Computing Degrees of Freedom
Recall that under the saturated model, \(\pi\) contains \(IJ-1\) free (unique) parameters, while under the independence model, \(\pi\) is a function of \((I-1)+(J-1)\) parameters, since each joint probability \(\pi_{ij}\) can be written as the product of the marginals \(\pi_{i+}\pi_{+j}\), each of which has a sum-to-one constraint. The degrees of freedom are therefore
\(\nu=(IJ-1)-[(I-1)+(J-1)]=(I-1)(J-1)\)
For \(2\times2\) tables, this reduces to \((2-1)(2-1) = 1\).
A large value of \(X^2\) or \(G^2\) indicates that the independence model is not plausible and that \(Y\) and \(Z\) are related. Under the null hypothesis, \(X^2\) and \(G^2\) are approximately distributed as chi-squared with \(\nu = (I - 1)(J - 1)\) degrees of freedom, provided that (a) the \(n\) sample units are iid (i.e., there is no clustering), and (b) the expected counts \(E_{ij}\) are sufficiently large (recall the discussion for one-way tables). At a minimum, we would like all \(E_{ij}\ge1\), with at least 80% of them 5 or more.
For \(2\times2\) tables, the 95th percentile of \(\chi^2_1\) is 3.84, so an observed value of \(X^2\) or \(G^2\) greater than 3.84 means that we can reject the null hypothesis of independence at the .05 level.
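As a quick numerical check of this cutoff: since \(\chi^2_1\) is the square of a standard normal variable, its survival function can be written with the complementary error function. The short Python sketch below (not part of the course code) reproduces the 3.84 cutoff and the \(p\)-value SAS reports for the Vitamin C example:

```python
# P(chi^2_1 > x) via the complementary error function:
# if Z ~ N(0,1), then Z^2 ~ chi^2_1, so
# P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x/2)).
from math import erfc, sqrt

def chi2_sf_df1(x):
    """Survival function of the chi-squared distribution with 1 df."""
    return erfc(sqrt(x / 2))

print(round(chi2_sf_df1(3.8416), 3))   # ~0.05: the 95th percentile is 1.96^2
print(round(chi2_sf_df1(4.8114), 4))   # p-value for the Vitamin C X^2
```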
Example: Vitamin C
How can we do the test of independence computationally? Let's illustrate this first using the Vitamin C example, which is the \(2\times2\) case. We're interested in whether treatment type and contracting a cold are associated.
Here is an example in SAS using the program code VitaminC.sas; see the R tab for the R code.
data ski;
input treatment $ response $ count;
datalines;
placebo cold 31
placebo nocold 109
ascorbic cold 17
ascorbic nocold 122
;
run;
proc freq;
weight count;
tables treatment*response/ chisq relrisk riskdiff expected;
exact or;
run;
Let's look at the different parts of the SAS output for this program.
Heading I:
Table of treatment by response section produces the table with observed, expected values, sample proportions, and conditional probabilities.
The FREQ Procedure


Heading II:
Statistics for Table of treatment by response section produces various test statistics, such as \(X^2\) and \(G^2\).
The FREQ Procedure
Statistics for Table of treatment by response
Statistic                     DF   Value    Prob
------------------------------------------------
Chi-Square                     1   4.8114   0.0283
Likelihood Ratio Chi-Square    1   4.8717   0.0273
Continuity Adj. Chi-Square     1   4.1407   0.0419
Mantel-Haenszel Chi-Square     1   4.7942   0.0286
Phi Coefficient                    0.1313
Contingency Coefficient            0.1302
Cramer's V                         0.1313
\(X^2 = 4.8114\) and \(G^2 = 4.8717\), with df = 1, indicate strong evidence for rejecting the independence model. Continuity Adj. Chi-Square = 4.1407, with \(p\)-value = 0.0419, is Pearson's \(X^2\) adjusted slightly for small cell counts. The adjustment is to subtract 0.5 from the absolute difference between the observed and expected counts in the formula for the \(X^2\) statistic, i.e., to use \(|O_{ij}-E_{ij}|-0.5\) in place of \(O_{ij}-E_{ij}\). The \(X^2\) statistic with this adjustment gives conservative inference; that is, it gives a bigger \(p\)-value than the usual Pearson \(X^2\) statistic without the correction. But since SAS can produce exact tests, we won't need to consider this statistic.
Here is the test of independence for Vitamin C example, also found in the section with R files VitaminC.R and its output, VitaminC.out.
### Enter the 2x2 table from the Vitamin C example (see VitaminC.R)
ski <- matrix(c(31, 109, 17, 122), nrow=2, byrow=TRUE,
              dimnames=list(treatment=c("placebo", "ascorbic"), response=c("cold", "nocold")))
### Pearson's Chi-squared test with Yates' continuity correction
result <- chisq.test(ski)
result
### Let's look at the observed, expected values and the residuals
result$observed
result$expected
result$residuals
### Pearson's Chi-squared test withOUT Yates' continuity correction
result <- chisq.test(ski, correct=FALSE)
result
result$observed
result$expected
result$residuals
Notice that by default the chisq.test() function in R gives the \(X^2\) statistic with a slight adjustment for small cell counts. The adjustment is to subtract 0.5 from the absolute difference between the observed and expected counts in the formula for the \(X^2\) statistic, i.e., to use \(|O_{ij}-E_{ij}|-0.5\) in place of \(O_{ij}-E_{ij}\). The \(X^2\) statistic with this adjustment gives conservative inference; that is, it gives a bigger \(p\)-value than the usual Pearson \(X^2\) statistic without the correction. To get the usual \(X^2\), we need to invoke the option correct = FALSE.
To compute the deviance statistic for two-way tables, we can use the function LRstats.R or one of the R packages, such as vcd.
Example: Coronary Heart Disease
Next, we return to the Coronary Heart Disease example from the introduction and ask, "Is having coronary heart disease independent of the cholesterol level in one's body? Is there any evidence of a relationship/association between cholesterol and heart disease?"
Test of Independence in SAS  Coronary Heart Disease Example
Let's see the same calculation using the SAS code below: HeartDisease.sas
data chd;
input CHD $ serum $ count @@;
datalines;
chd 0-199 12 chd 200-219 8 chd 220-259 31 chd 260+ 41
nochd 0-199 307 nochd 200-219 246 nochd 220-259 439 nochd 260+ 245
;
proc freq; weight count;
tables CHD*serum /chisq expected deviation cmh cellchi2 measures;
/*exact fisher or / alpha=.05;*/
run;
See the complete SAS output: HeartDisease SAS Output.
Here is a portion of the output from SAS with the Pearson chi-square statistic and the deviance (likelihood-ratio chi-square) statistic:
Statistics for Table of CHD by serum
Statistic                     DF   Value     Prob
-------------------------------------------------
Chi-Square                     3   35.0285   <.0001
Likelihood Ratio Chi-Square    3   31.9212   <.0001
Mantel-Haenszel Chi-Square     1   26.1475   <.0001
Phi Coefficient                    0.1623
Contingency Coefficient            0.1603
Cramer's V                         0.1623
Test of Independence in R  Coronary Heart Disease Example
Two different computations are done in HeartDisease.R file using the function chisq.test(). Here is the first:
heart <- c(12,8,31,41,307,246,439,245)
heart <- matrix(heart,4,2)
heart <- t(heart)
## run the chi-squared test of independence & save it into a new object
result <- chisq.test(heart)
result
## Let's look at the observed, expected values and the residuals
result$observed
result$expected
result$residuals
### Example: Heart Disease Example Lesson 3 ##
### Simple line by line R code
### Nice R code that corresponds to SAS code and output
#######################################################
## enter data
heart <- c(12,8,31,41,307,246,439,245)
heart <- matrix(heart,4,2)
heart <- t(heart)
## run the chi-squared test of independence & save it into a new object
result <- chisq.test(heart)
result
## Let's look at the observed, expected values and the residuals
result$observed
result$expected
result$residuals
### Likelihood Ratio Test
LR <- 2*sum(heart*log(heart/result$expected))
LR
LRchisq <- 1 - pchisq(LR, df=(4-1)*(2-1))
LRchisq
##make sure you have function LRstats()
LRstats(heart)
## Let's calculate the conditional probabilities
## the following function gives the desired marginal, in this case, the counts for the serum groups
serum <- margin.table(heart,2)
serum
## let's look at the counts for the four groups with CHD
heart[1,]
## then counts for the four groups with NOCHD, which is the second row of the matrix we created above
heart[2,]
### conditional probabilities are:
heart[2,]/serum
heart[1,]/serum
########################################
### Nice R code that corresponds to SAS code and output
#######################################################
heart <- matrix(c(12,307,8,246,31,439,41,245), ncol=4, dimnames=list(CHD=c("chd", "nochd"), serum=c("0-199", "200-219", "220-259", "260+")))
heart
count=heart
### Chi-Square Independence Test
result=chisq.test(count)
result$expected
### Let us look at the Percentage, Row Percentage and Column Percentage
### of the total observations contained in each cell.
Contingency_Table <- list(Frequency=count, Expected=result$expected, Deviation=count-result$expected, Percentage=prop.table(count), RowPercentage=prop.table(count,1), ColPercentage=prop.table(count,2))
Contingency_Table
###### Computing various measures of association
library(vcd)
assocstats(heart)
### For the Pearson correlation coefficient
### and Mantel-Haenszel statistic,
### for IxJ tables, you can also use
### the pears.cor() function.
### Make sure you run this function first!
### c(1,2) and c(1,2,3,4) are the vectors of score values
pears.cor(heart, c(1,2), c(1,2,3,4))
### this should give r = 0.14, M2 = 26.1475
##Gamma
Gamma.f(heart)
Output
You will notice in the file that, unlike for \(2\times 2\) tables where we had to worry about the continuity correction in R, there is no such correction for \(I \times J\) tables. In this example it doesn't matter whether you call chisq.test(heart, correct=TRUE) or chisq.test(heart, correct=FALSE), because R applies the correction only to \(2\times2\) tables and the results are the same.
\(X^2= 35.0285\), df = 3, \(p\)value = 1.202e07
and the likelihoodratio test statistic is
\(G^2= 31.9212\), df = 3, \(p\)value = 5.43736e07
Notice that the chisq.test() function does not compute \(G^2\), but we included extra code to do that. For the complete output file, see the HeartDisease.out file. Here is also a more general function, LRstats.R, for computing \(G^2\) for two-way tables. The results are discussed further below.
Conclusion
We reject the null hypothesis of independence because of the large values of the chi-square statistics. Notice that the degrees of freedom equal \(3 = (4-1)(2-1)\), and thus the \(p\)-value is very small. Through the \(X^2\) test for independence, we therefore have strong evidence that a relationship exists between cholesterol and CHD.
A good statistical analysis, however, should not end with the rejection of a null hypothesis. Once we have demonstrated that a relationship exists, we may be interested in which cells were particularly revealing. To do this we consider computing and evaluating the residuals.
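As a preview of that idea, the Pearson residuals \(r_{ij}=(O_{ij}-E_{ij})/\sqrt{E_{ij}}\) (what R's chisq.test() returns in result$residuals) can be computed directly from the counts. Here is a short illustrative Python sketch using the heart-disease table from this lesson:

```python
# Pearson residuals r_ij = (O_ij - E_ij)/sqrt(E_ij) for the CHD table.
# Rows: chd, nochd; columns: the four serum-cholesterol groups.
from math import sqrt

O = [[12, 8, 31, 41],
     [307, 246, 439, 245]]

row = [sum(r) for r in O]                                  # n_{i+}
col = [sum(O[i][j] for i in range(2)) for j in range(4)]   # n_{+j}
n = sum(row)                                               # n_{++}

E = [[row[i] * col[j] / n for j in range(4)] for i in range(2)]
resid = [[(O[i][j] - E[i][j]) / sqrt(E[i][j]) for j in range(4)] for i in range(2)]

# Cells with |r_ij| much larger than about 2 contribute most to X^2;
# here the largest residual is in the (chd, 260+) cell.
for r in resid:
    print([round(x, 2) for x in r])
```

Squaring and summing these residuals recovers the Pearson \(X^2\) statistic, so they show exactly which cells drive the rejection of independence.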