3.3 - Test for Independence


Two-way contingency tables show the joint behavior (distribution) of two discrete variables. Given an \(I\times J\) table, it is therefore natural to ask whether, and how, \(Y\) and \(Z\) are related.

Suppose for the moment that there is no relationship between \(Y\) and \(Z\), i.e., they are independent.

Statistical Independence

By statistical independence, we mean that the joint probabilities are equal to the product of the marginal probabilities:

\(\pi_{ij}=P(Y=i,Z=j)=P(Y=i)P(Z=j)\)
\(\pi_{ij}=\pi_{i+}\pi_{+j}\)

for all pairs of \(i, j\). If the two variables are independent and we know the marginal totals, then we can determine the exact probabilities for each of the cells. Independence can be expressed in other forms as well, but for now, we will use this definition. Intuitively, independence means that the behavior of one variable, say \(Y\), is not affected by the behavior of \(Z\). In the Vitamin C example, independence means that whether or not a skier takes vitamin C has nothing to do with whether or not he/she has a cold.
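
To make the definition concrete, here is a minimal R sketch (with assumed marginal probabilities, not data from any example in this lesson) showing that a joint distribution built as the product of its marginals satisfies \(\pi_{ij}=\pi_{i+}\pi_{+j}\) and returns those same marginals when summed across rows or columns.

## assumed marginal probabilities, for illustration only
pY <- c(0.3, 0.7)            # P(Y = i), i = 1, 2
pZ <- c(0.2, 0.5, 0.3)       # P(Z = j), j = 1, 2, 3
joint <- outer(pY, pZ)       # pi_ij = pi_i+ * pi_+j under independence
joint
rowSums(joint)               # recovers pY
colSums(joint)               # recovers pZ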

Chi-Squared Test of Independence

The hypothesis of independence can be tested using the general method of goodness-of-fit testing described earlier.

\(H_0\): the independence model is true, i.e., \(\pi_{ij} = \pi_{i+}\pi_{+j}\) for all pairs \((i, j)\)

versus the alternative

\(H_A\): the saturated model is true, i.e., \(\pi_{ij} \ne \pi_{i+}\pi_{+j}\) for at least one pair \((i, j)\)

  1. Calculate the expected counts \(E_{ij}\) under the independence model.

  2. Compare the expected counts \(E_{ij}\) to the observed counts \(O_{ij}\).

  3. Calculate \(X^2\) and/or \(G^2\) for testing the hypothesis of independence, and compare the values to the appropriate chi-squared distribution with \((I-1)(J-1)\) degrees of freedom.

Before we see how to do this in R and SAS, let's see more about the saturated model and independence model in \(I\times J\) tables.

Saturated Model

Suppose that the cell counts \(n_{11}, \dots , n_{IJ}\) have a multinomial distribution with index \(n_{++} = n\) and parameters

\(\pi=(\pi_{11},\ldots,\pi_{IJ})\)

(these results also hold for Poisson or product-multinomial sampling). If the saturated model is true, the number of unknown parameters we need to estimate is maximal (just like in one-way tables) and equal to the number of cells in the table. But because the elements of \(\pi\) must sum to one, the saturated model actually has \(IJ - 1\) free parameters. For example, if we had a \(3\times5\) table, we would have \(15 - 1 = 14\) unique parameters \(\pi_{ij}\) to estimate.

Independence Model

If the two variables \(Y\) and \(Z\) are independent, then \(\pi\) has a special form. Let

\(\pi_{i+} = P(Y = i), i = 1, 2, \ldots , I \)
\(\pi_{+j} = P(Z = j), j = 1, 2, \ldots, J\)

Note that \(\sum\limits_{i=1}^I \pi_{i+}=\sum\limits_{j=1}^J \pi_{+j}=1\), so the vectors \((\pi_{1+}, \pi_{2+}, \ldots, \pi_{I+})\) and \((\pi_{+1}, \pi_{+2}, \ldots, \pi_{+J})\), which represent the marginal distributions (row probabilities and column probabilities), contain \(I - 1\) and \(J - 1\) unknown parameters, respectively.

If \(Y\) and \(Z\) are independent, then any element of \(\pi\) can be expressed as

\(\pi_{ij}=P(Y=i)P(Z=j)=\pi_{i+}\pi_{+j}\)

Thus, under independence, \(\pi\) is a function of only \((I - 1) + (J - 1)\) unknown parameters. For the \(3\times5\) table above, that is \(2 + 4 = 6\) parameters, compared with 14 under the saturated model.

The parameters under the independence model can be estimated as follows. Note that the vector of row sums (observed marginal counts for the row variable \(Y\)), \((n_{1+},n_{2+},\ldots,n_{I+})\), has a multinomial distribution with index \(n = n_{++}\) and parameter vector \((\pi_{1+},\ldots,\pi_{I+})\). Similarly, the vector of column sums (observed marginal counts for the column variable \(Z\)), \((n_{+1},n_{+2},\ldots,n_{+J})\), has a multinomial distribution with index \(n = n_{++}\) and parameter vector \((\pi_{+1},\ldots,\pi_{+J})\). The marginal probabilities can thus be estimated by the sample proportions in each margin:

\(\hat{\pi}_{i+}=n_{i+}/n_{++},\qquad i=1,2,\ldots,I\)

and

\(\hat{\pi}_{+j}=n_{+j}/n_{++},\qquad j=1,2,\ldots,J\)

respectively.

Then, the estimated expected cell frequencies under the independence model are

\(E_{ij}=n\hat{\pi}_{ij}=n\hat{\pi}_{i+}\hat{\pi}_{+j}=\dfrac{n_{i+}n_{+j}}{n_{++}}\)

which can be remembered as

expected frequency = row total × column total / grand total .
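
As a quick illustration, here is a minimal R sketch that applies this rule to the Vitamin C counts used later in this lesson (31, 109, 17, 122); the resulting values match the "Expected" entries in the SAS output below.

## observed 2x2 table for the Vitamin C example
O <- matrix(c(31, 109, 17, 122), nrow = 2, byrow = TRUE,
            dimnames = list(treatment = c("placebo", "ascorbic"),
                            response = c("cold", "nocold")))
## expected counts under independence: row total x column total / grand total
E <- outer(rowSums(O), colSums(O)) / sum(O)
E    # e.g., E for (ascorbic, cold) = 139 * 48 / 279 = 23.914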

The independence model is a sub-model of (i.e., a special case of) the saturated model. To test if the independence model fits, we compare the expected to the observed counts, and more formally, we do this by computing \(X^2\) or \(G^2\) statistics.

Pearson and Deviance Statistics

Since for jointly observed \((Y,Z)\) the two-way table counts can be viewed as a single multinomial distribution with \(IJ\) categories, we can apply the chi-square approximation in the same way we applied it for the goodness-of-fit tests; we just need to adapt to the double index and sum over all cells in both dimensions. That is, the quantity

\(\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(n_{ij}-n\pi_{ij})^2}{n\pi_{ij}}\)

is approximately chi-squared with degrees of freedom equal to \(\nu=IJ-1\). And if we have a null hypothesis in mind, say for independence, we can use the estimated probabilities under that hypothesis to construct both the Pearson and likelihood ratio test statistics. The degrees of freedom would be reduced by the number of estimated parameters as well. In what follows below, we assume the null hypothesis (reduced model) to be tested is that of independence, but other hypotheses could also be considered.
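
The claim that this quantity is approximately chi-squared can be checked by simulation. Below is a small R sketch (with assumed cell probabilities, for illustration only) that simulates multinomial tables with known \(\pi\), computes the statistic, and compares its simulated 95th percentile with the chi-squared percentile on \(IJ-1\) degrees of freedom.

set.seed(1)
pi.true <- c(0.1, 0.2, 0.3, 0.4)   # assumed cell probabilities for a 2x2 table
n <- 500
stat <- replicate(5000, {
  counts <- rmultinom(1, n, pi.true)
  sum((counts - n * pi.true)^2 / (n * pi.true))
})
quantile(stat, 0.95)   # simulated 95th percentile
qchisq(0.95, df = 3)   # chi-squared percentile with IJ - 1 = 3 df; the two should be close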

Pearson goodness-of-fit statistic
The Pearson goodness-of-fit statistic is

\(X^2=\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(n_{ij}-n\hat{\pi}_{ij})^2}{n\hat{\pi}_{ij}}\)

where \(\hat{\pi}_{ij}=(n_{i+}/n)(n_{+j}/n)\) under the independence model. Note that this expression still corresponds to

\(X^2=\sum\limits_{i=1}^I \sum\limits_{j=1}^J \dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}\)

where \(O_{ij} = n_{ij}\) is the observed count and \(E_{ij} = E(n_{ij})\) is the expected count under the null hypothesis.

The deviance statistic can similarly be calculated for two-way tables:

\(G^2=2\sum\limits_{i=1}^I \sum\limits_{j=1}^J n_{ij}\log\left(\dfrac{n_{ij}}{n\hat{\pi}_{ij}}\right) = 2\sum\limits_{i=1}^I \sum\limits_{j=1}^J O_{ij}\log\left(\dfrac{O_{ij}}{E_{ij}}\right)\)

\(G^2\) is also called the likelihood-ratio test statistic or likelihood-ratio chi-squared test statistic. Recall from the discussion on one-way tables that we are comparing the likelihoods of the assumed model under \(H_0\) and some alternative model, \(H_A\), typically the saturated model (i.e., the observed data) by default. More generally, the likelihood-ratio test statistic can be described as follows:

Let \(\max L(H_0)\) be the maximum of the likelihood when the parameters satisfy \(H_0\); \(H_0\) usually places more restrictions on the parameters.

Let \(\max L(H_A)\) be the maximum of the likelihood when the parameters satisfy \(H_A\); \(H_A\) usually places no or fewer restrictions on the parameters.

Then the likelihood-ratio statistic would be:

\(\Lambda=\dfrac{\max L(H_0)}{\max L(H_A)}\)

and the deviance \(G^2 = −2\log(\Lambda)\).

The smaller the maximized likelihood under \(H_0\) relative to that under \(H_A\) (i.e., the less plausible the restricted model is given the data), the more evidence there is against \(H_0\); that is, the smaller \(\Lambda\) and the larger \(G^2\).
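
For a concrete check, here is a minimal R sketch (using the Vitamin C counts from this lesson) that maximizes the multinomial log-likelihood under \(H_0\) (independence) and under the saturated alternative, forms \(-2\log(\Lambda)\), and confirms it equals the \(G^2\) formula above. The multinomial coefficient is common to both likelihoods, so it cancels in the ratio and is omitted.

O <- matrix(c(31, 109, 17, 122), nrow = 2, byrow = TRUE)
n <- sum(O)
pihat.H0 <- outer(rowSums(O), colSums(O)) / n^2    # MLEs of pi_ij under independence
pihat.HA <- O / n                                  # MLEs of pi_ij under the saturated model
loglik.H0 <- sum(O * log(pihat.H0))
loglik.HA <- sum(O * log(pihat.HA))
-2 * (loglik.H0 - loglik.HA)           # -2 log(Lambda), about 4.87
2 * sum(O * log(O / (n * pihat.H0)))   # the G^2 formula gives the same value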

What are the degrees of freedom for this test?

The general rule for DF

DF are equal to the number of parameters specified (estimated) under the alternative model (hypothesis) minus the number of parameters estimated under the null model (hypothesis).

Computing Degrees of Freedom

Recall, under the saturated model, \(\pi\) contains \(IJ-1\) free (unique) parameters. And under the independence model, \(\pi\) is a function of \((I-1)+(J-1)\) parameters since each joint probability \(\pi_{ij}\) can be written as the product of the marginals \(\pi_{i+}\pi_{+j}\), each of which has the sum-to-one constraint. The degrees of freedom are therefore

\(\nu=(IJ-1)-(I-1)-(J-1)=(I-1)(J-1)\)

For \(2\times2\) tables, this reduces to \((2 - 1)(2 - 1) = 1\).

A large value of \(X^2\) or \(G^2\) indicates that the independence model is not plausible and that \(Y\) and \(Z\) are related. Under the null hypothesis, \(X^2\) and \(G^2\) are approximately distributed as chi-squared with \(\nu= (I - 1)(J - 1)\) degrees of freedom, provided that (a) the \(n\) sample units are iid (i.e., there is no clustering), and (b) the expected counts \(E_{ij}\) are sufficiently large (recall the discussion for one-way tables). At a minimum, we would like all \(E_{ij}\ge1\), with at least 80% of them 5 or more.

For \(2\times2\) tables, the 95th percentile of \(\chi^2_1\) is 3.84, so an observed value of \(X^2\) or \(G^2\) greater than 3.84 means that we can reject the null hypothesis of independence at the .05 level.
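
These reference values are easy to verify in R; the sketch below shows the 95th percentile of \(\chi^2_1\) and the \(p\)-value corresponding to the Pearson statistic reported in the Vitamin C example that follows.

qchisq(0.95, df = 1)         # 3.84, the 0.05-level critical value for a 2x2 table
1 - pchisq(4.8114, df = 1)   # about 0.028, the p-value for the Vitamin C X^2 below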

Example: Vitamin C

How can we do the test of independence computationally? Let's illustrate this first using the Vitamin C example, which is the \(2\times2\) case. We're interested in whether the treatment type and contracting a cold are associated.

Here is an example in SAS using the program code VitaminC.sas; see the R tab for the R code.

data ski;
input treatment $ response $ count;
datalines;
placebo cold 31
placebo nocold 109
ascorbic cold 17
ascorbic nocold 122
;
run;
proc freq;
weight count;
tables treatment*response/ chisq relrisk riskdiff expected;
exact or;
run;

Let's look at different parts of the SAS output for this program.

Heading I:

Table of treatment by response section produces the table with observed, expected values, sample proportions, and conditional probabilities.

The FREQ Procedure

Table of treatment by response

Frequency
Expected                         response
Percent
Row Pct      treatment       cold     nocold      Total
Col Pct
             ascorbic          17        122        139
                           23.914     115.09
                             6.09      43.73      49.82
                            12.23      87.77
                            35.42      52.81

             placebo           31        109        140
                           24.086     115.91
                            11.11      39.07      50.18
                            22.14      77.86
                            64.58      47.19

             Total             48        231        279
                            17.20      82.80     100.00
Heading II:

Statistics for Table of treatment by response section produces various test statistics, such as \(X^2\) and \(G^2\).

The FREQ Procedure

Statistics for Table of treatment by response

 
Statistic                        DF     Value      Prob
Chi-Square                        1    4.8114    0.0283
Likelihood Ratio Chi-Square       1    4.8717    0.0273
Continuity Adj. Chi-Square        1    4.1407    0.0419
Mantel-Haenszel Chi-Square        1    4.7942    0.0286
Phi Coefficient                       -0.1313
Contingency Coefficient                0.1302
Cramer's V                            -0.1313

\(X^2 = 4.8114\) and \(G^2 = 4.8717\), each with df = 1, indicate evidence for rejecting the independence model at the 0.05 level. Continuity Adj. Chi-Square = 4.1407 with \(p\)-value = 0.0419 is Pearson's \(X^2\) adjusted slightly for small cell counts. The adjustment is to subtract 0.5 from the absolute difference between the observed and the expected counts in the formula for the \(X^2\) statistic, i.e., to use \(|O_{ij}-E_{ij}|-0.5\). The \(X^2\) statistic with this adjustment gives conservative inference; that is, it gives a bigger \(p\)-value than the usual Pearson \(X^2\) statistic without the correction. But since SAS can produce exact tests, we won't need to consider this statistic.
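
Here is a minimal R sketch of this adjustment, using the Vitamin C counts: the corrected statistic subtracts 0.5 from \(|O_{ij}-E_{ij}|\) in every cell before squaring, and reproduces the Continuity Adj. Chi-Square value above.

O <- matrix(c(17, 122, 31, 109), nrow = 2, byrow = TRUE)
E <- outer(rowSums(O), colSums(O)) / sum(O)
sum((abs(O - E) - 0.5)^2 / E)   # about 4.14, the continuity-adjusted statistic
sum((O - E)^2 / E)              # about 4.81, the unadjusted Pearson X^2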

Here is the test of independence for the Vitamin C example, also found in the section with R files VitaminC.R and its output, VitaminC.out.

### Enter the Vitamin C counts as a 2x2 table (same data as the SAS datalines above)
ski<-matrix(c(31,109,17,122), nrow=2, byrow=TRUE,
            dimnames=list(treatment=c("placebo","ascorbic"), response=c("cold","nocold")))

### Pearson's Chi-squared test with Yates' continuity correction
result<-chisq.test(ski)
result

### Let's look at the observed, expected values and the residuals
result$observed
result$expected
result$residuals

### Pearson's Chi-squared test withOUT Yates' continuity correction
result<-chisq.test(ski, correct=FALSE)
result
result$observed
result$expected
result$residuals

Notice that, by default, the chisq.test() function in R gives the \(X^2\) statistic with a slight adjustment for small cell counts. The adjustment is to subtract 0.5 from the absolute difference between the observed and the expected counts in the formula for the \(X^2\) statistic, i.e., to use \(|O_{ij}-E_{ij}|-0.5\). The \(X^2\) statistic with this adjustment gives conservative inference; that is, it gives a bigger \(p\)-value than the usual Pearson \(X^2\) statistic without the correction. To get the usual \(X^2\), we need to use the option correct = FALSE.

To compute the deviance statistic for two-way tables, we can use the function LRstats.R or one of the R packages, such as vcd.
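
If LRstats.R is not handy, a minimal sketch of such a function (call it G2test; this is an illustration of the same calculation, not the course's LRstats.R) is below; it returns \(G^2\), the degrees of freedom, and the corresponding \(p\)-value for any two-way table of counts.

G2test <- function(tab) {
  E  <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # expected counts under independence
  G2 <- 2 * sum(tab * log(tab / E), na.rm = TRUE)      # zero cells contribute 0 by convention
  df <- (nrow(tab) - 1) * (ncol(tab) - 1)
  c(G2 = G2, df = df, p.value = 1 - pchisq(G2, df))
}
## for example: G2test(ski) for the Vitamin C table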

Example: Coronary Heart Disease

Next, we return to the Coronary Heart Disease example from the introduction and ask "Is having coronary heart disease independent of the cholesterol level in one's body? Is there any evidence of a relationship/association between cholesterol and heart disease?"

Test of Independence in SAS - Coronary Heart Disease Example

Let's see the same calculation using the SAS code below: HeartDisease.sas

data chd;
input CHD $ serum $ count @@;
datalines;
chd 0-199 12 chd 200-219 8 chd 220-259 31 chd 260+ 41
nochd 0-199 307 nochd 200-219 246 nochd 220-259 439 nochd 260+ 245
;
proc freq;  weight count;
   tables CHD*serum /chisq expected deviation cmh cellchi2 measures;
   /*exact fisher or / alpha=.05;*/
run;

See the complete SAS output: HeartDisease SAS Output.

Here is a portion of the output from SAS with the Pearson chi-square statistic and Deviance (likelihood-ratio chi-square) statistic:

Statistics for Table of CHD by serum

 
Statistic                        DF      Value      Prob
Chi-Square                        3    35.0285    <.0001
Likelihood Ratio Chi-Square       3    31.9212    <.0001
Mantel-Haenszel Chi-Square        1    26.1475    <.0001
Phi Coefficient                         0.1623
Contingency Coefficient                 0.1603
Cramer's V                              0.1623

Test of Independence in R - Coronary Heart Disease Example

Two different computations are done in the HeartDisease.R file using the function chisq.test(). Here is the first, a simple line-by-line version:

heart <-c(12,8,31,41,307,246,439,245) 
heart<-matrix(heart,4,2) 
heart=t(heart)

## run the chi-squared test of independence & save it into a new object
result<-chisq.test(heart) 
result

## Let's look at the observed, expected values and the residuals
result$observed
result$expected
result$residuals

### Likelihood Ratio Test
LR=2*sum(heart*log(heart/result$expected))   ## deviance statistic G^2
LR
LRchisq=1-pchisq(LR,df=(4-1)*(2-1))          ## p-value using (I-1)(J-1) = 3 df
LRchisq


 ##make sure you have function LRstats()
 LRstats(heart)

 
 ## Let's calculate the conditional probabilities
 ## the following function gives the desired marginal, in this case, the counts for the serum groups
serum<-margin.table(heart,2)
serum 

## let's look at the counts for the four groups with CHD
heart[1,]

## then counts for the four groups with NOCHD, which is the second row of the matrix we created above
heart[2,]

### conditional probabilities are:
heart[2,]/serum
heart[1,]/serum

Here is the second computation, organized to correspond to the SAS code and output:

#######################################################
### Nice R code that corresponds to SAS code and output
#######################################################
heart=matrix(c(12,307,8,246,31,439,41,245), ncol=4, dimnames=list(CHD=c("chd", "nochd"), serum=c("0-199", "200-219","220-259","260+")))
heart
count=heart

### Chi-Square Independence Test
result=chisq.test(count)
result$expected

### Let us look at the Percentage, Row Percentage and Column Percentage 
### of the total observations contained in each cell.

Contingency_Table=list(Frequency=count,
                       Expected=result$expected,
                       Deviation=count-result$expected,
                       Percentage=prop.table(count),
                       RowPercentage=prop.table(count,1),
                       ColPercentage=prop.table(count,2))
Contingency_Table

###### Computing various measures of association
library(vcd)
assocstats(heart)

### For the Pearson correlation coefficient 
### and the Mantel-Haenszel statistic, 
### for IxJ tables, you can also use 
### the pears.cor() function. 
### Make sure you run this function first!
### c(1,2) and c(1,2,3,4) are the vectors of score values 
pears.cor(heart, c(1,2),c(1,2,3,4)) 
### this should give r = -0.14 and M2 = 26.1475

##Gamma 
Gamma.f(heart)

Output

You will notice in the output file that, unlike for \(2\times 2\) tables where we had to worry about the continuity correction in R, there is no such correction for larger \( I \times J\) tables. It doesn't matter, in this example, whether you call the function chisq.test(heart, correct=TRUE) or chisq.test(heart, correct=FALSE) because the results are the same.

\(X^2= 35.0285\), df = 3, \(p\)-value = 1.202e-07

and the likelihood-ratio test statistic is

\(G^2= 31.9212\), df = 3, \(p\)-value = 5.43736e-07

Notice that the chisq.test() function does not compute \(G^2\), but we included extra code to do that. For the complete output file, see the HeartDisease.out file. Here is also a more general function LRstats.R for computing the \(G^2\) for two-way tables. The results are discussed further below.

Conclusion

We reject the null hypothesis of independence because of the large values of the chi-square statistics. Notice that the degrees of freedom are equal to \(3 = (4-1)(2-1)\), and the \(p\)-values are very small. Therefore, through the \(X^2\) test for independence, we have demonstrated beyond a reasonable doubt that a relationship exists between cholesterol and CHD.

A good statistical analysis, however, should not end with the rejection of a null hypothesis. Once we have demonstrated that a relationship exists, we may be interested in which cells were particularly revealing. To do this we consider computing and evaluating the residuals.
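
One common choice is the Pearson residual \((O_{ij}-E_{ij})/\sqrt{E_{ij}}\), which is what result$residuals from chisq.test() returns; here is a minimal sketch for the heart disease table.

O <- matrix(c(12, 8, 31, 41, 307, 246, 439, 245), nrow = 2, byrow = TRUE)
E <- outer(rowSums(O), colSums(O)) / sum(O)
(O - E) / sqrt(E)   # Pearson residuals; large absolute values flag cells that deviate most from independence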

