5.4.1 - Mutual (Complete) Independence


The simplest model that one might propose is that ALL variables are independent of one another.

Graphically, if we have three random variables, A, B, and C, we can express this model as:

[Figure: graph with three nodes, A, B, and C, and no edges between them]

In this graph, the lack of connections between the nodes indicates that no relationships exist among A, B, and C; that is, there is mutual independence. In the notation of log-linear models, which we will learn about later as well, this model is expressed as (A, B, C). We will use this notation from here on. The notation indicates which aspects of the data we consider in the model (which sufficient statistics we compute). In this case, the independence model means we only care about the marginal totals \(n_{i++}, n_{+j+}, n_{++k}\) for each variable separately, and these pieces of information are sufficient to fit the model and compute the expected counts. Alternatively, as we will see later on, we could keep track of the number of times some joint outcome of two variables occurs.

In terms of odds ratios, the model (A, B, C) implies that if we look at the marginal tables A × B, B × C, and A × C, then all of the odds ratios in these marginal tables are equal to 1. In other words, mutual independence implies marginal independence (i.e., there is independence in each of the marginal tables). All variables are independent of one another.

Under this model the following must hold:

\(P(A=i,B=j,C=k)=P(A=i)P(B=j)P(C=k)\)

for all i, j, k. That is, the joint probabilities are the product of the marginal probabilities. This is a simple extension from the model of independence in two-way tables where it was assumed:

\(P(A=i,B=j)=P(A=i)P(B=j)\)

Define the marginal probabilities,

\(\pi_{i++}=P(A=i),\quad i=1,2,\ldots,I\)
\(\pi_{+j+}=P(B=j),\quad j=1,2,\ldots,J\)
\(\pi_{++k}=P(C=k),\quad k=1,2,\ldots,K\)

so that \(\pi_{ijk}=\pi_{i++}\pi_{+j+}\pi_{++k}\) for all i, j, k.
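As a small illustration of this factorization, the sketch below builds the joint probabilities from three marginal vectors and checks that they sum to one and that summing over j and k recovers the A margin. The marginal probabilities used here are hypothetical values chosen for illustration; they do not come from any data set in this lesson.

```python
# Hypothetical marginal probabilities (illustrative values only)
pi_A = [0.5, 0.5]
pi_B = [0.7, 0.3]
pi_C = [0.1, 0.9]

# Under mutual independence: pi_ijk = pi_i++ * pi_+j+ * pi_++k
pi = [[[a * b * c for c in pi_C] for b in pi_B] for a in pi_A]

# The joint probabilities must sum to one
total = sum(p for plane in pi for row in plane for p in row)

# Summing over j and k recovers the marginal distribution of A
margin_A = [sum(p for row in plane for p in row) for plane in pi]

print(round(total, 10), [round(m, 10) for m in margin_A])
```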

Then the unknown parameters of the model of independence are

\(\pi_{i++}=(\pi_{1++},\pi_{2++},\ldots,\pi_{I++})\),
\(\pi_{+j+}=(\pi_{+1+},\pi_{+2+},\ldots,\pi_{+J+})\),
\(\pi_{++k}=(\pi_{++1},\pi_{++2},\ldots,\pi_{++K})\).

Under the assumption that the model of independence is true, once we know the marginal probability values, we have enough information to estimate all unknown cell probabilities. Because each of the marginal probability vectors must add up to one, the number of free parameters in the model is (I − 1) + (J − 1) + (K − 1). This is exactly like the two-way table, but now one more set of parameters needs to be accounted for because of the additional random variable. In the Death Penalty example, the number of free parameters is (2 − 1) + (2 − 1) + (2 − 1) = 3.

Notice that under the independence model, marginal distributions are

\((n_{1++},n_{2++},\ldots,n_{I++})\sim \text{Mult}(n,\pi_{i++})\),
\((n_{+1+},n_{+2+},\ldots,n_{+J+})\sim \text{Mult}(n,\pi_{+j+})\),
\((n_{++1},n_{++2},\ldots,n_{++K})\sim \text{Mult}(n,\pi_{++k})\),

and these three vectors are mutually independent. Thus the three parameter vectors \(\pi_{i++}\), \(\pi_{+j+}\), and \(\pi_{++k}\) can be estimated independently of one another. The ML estimates are the sample proportions in the margins of the table,

\(\hat{\pi}_{i++}=p_{i++}=n_{i++}/n,\quad i=1,2,\ldots,I\)
\(\hat{\pi}_{+j+}=p_{+j+}=n_{+j+}/n,\quad j=1,2,\ldots,J\)
\(\hat{\pi}_{++k}=p_{++k}=n_{++k}/n,\quad k=1,2,\ldots,K\)

It then follows that the estimates of the expected cell counts are

\(E_{ijk}=n\hat{\pi}_{i++}\hat{\pi}_{+j+}\hat{\pi}_{++k}=\dfrac{n_{i++}n_{+j+}n_{++k}}{n^2}\)

Again, compare this to a two-way table, where the expected counts were: \(E(n_{ij})=n_{i+}n_{+j}/n\).

For the death penalty example, the marginal counts are A = [160, 166], B = [214, 112], and C = [36, 290]. The estimated expected count for the first cell is \(E_{111}=(160\times 214\times 36)/326^2=11.60\), and the other cells are computed in the same way. We then compare these expected counts with the corresponding observed counts, e.g., 11.60 with \(n_{111}=19\).
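The arithmetic for all eight cells can be sketched in a few lines of Python. This is only an illustration that uses the marginal counts quoted above; the function name `expected_counts` is a hypothetical helper and is not part of the course's death.sas or death.R programs.

```python
from itertools import product

# Marginal totals quoted in the text for the death penalty example
n_A = [160, 166]
n_B = [214, 112]
n_C = [36, 290]

def expected_counts(n_A, n_B, n_C):
    """E_ijk = n_i++ * n_+j+ * n_++k / n^2, keyed by 1-based (i, j, k)."""
    n = sum(n_A)
    cells = product(range(len(n_A)), range(len(n_B)), range(len(n_C)))
    return {(i + 1, j + 1, k + 1): n_A[i] * n_B[j] * n_C[k] / n**2
            for i, j, k in cells}

E = expected_counts(n_A, n_B, n_C)
for cell in sorted(E):
    print(cell, round(E[cell], 2))
```

The expected counts add up to the sample size n = 326, which is a quick sanity check on the calculation.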

Chi-Squared Test of Independence

The hypothesis of independence can be tested using the general method described earlier in Lesson 3 (and 2). To test

H0 : the independence model is true     vs.     HA : the saturated model is true

In other words, we can directly check \(H_0: \pi_{ijk}=\pi_{i++}\pi_{+j+}\pi_{++k}\) for all i, j, k, vs. \(H_A\): the saturated model, as follows:

  • Estimate the unknown parameters of the independence model, e.g., the marginal probabilities.
  • Calculate estimated cell probabilities and expected cell frequencies \(E_{ijk}\) under the model of independence.
  • Calculate \(X^2\) and/or \(G^2\) by comparing the expected and observed values, and compare them to the appropriate chi-square distribution.

\(X^2=\sum\limits_i \sum\limits_j \sum\limits_k \dfrac{(E_{ijk}-n_{ijk})^2}{E_{ijk}}\)

\(G^2=2\sum\limits_i \sum\limits_j \sum\limits_k n_{ijk} \text{log }\left(\dfrac{n_{ijk}}{E_{ijk}}\right)\)

The degrees of freedom (DF) for this test are ν = (IJK − 1) − [ (I − 1) + (J − 1) + (K − 1) ]. As before, this is the difference between the number of free parameters for the saturated model, \(IJK-1\), and the number of free parameters in the current model of independence, \((I-1)+(J-1)+(K-1)\).

For example, for the death penalty example, DF = 7-3 = 4.
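The parameter counting can be written as a tiny helper. This is a sketch only, and the function names are hypothetical.

```python
def free_params_independence(I, J, K):
    """Free parameters of the mutual independence model (A, B, C)."""
    return (I - 1) + (J - 1) + (K - 1)

def df_mutual_independence(I, J, K):
    """Saturated-model parameters (IJK - 1) minus independence-model parameters."""
    return (I * J * K - 1) - free_params_independence(I, J, K)

# Death penalty example: a 2 x 2 x 2 table
print(free_params_independence(2, 2, 2), df_mutual_independence(2, 2, 2))
```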

Recall that we also said that mutual independence implies marginal independence. So, if we reject marginal independence for any pair of variables, we can immediately reject mutual independence overall. For example, consider the estimated marginal odds ratios and their confidence intervals for the death penalty example (see death.sas (or death.R) and death.lst (or death.out) in the previous section); the estimates are \(\hat{\theta}_{AC}=1.18\), \(\hat{\theta}_{AB}=27.43\), and \(\hat{\theta}_{BC}=2.88\), each for a 2 × 2 marginal table, with df = 1.

Discuss: What is your conclusion? Would you reject the model of complete independence? Are these three variables mutually independent?

Example - Boy Scouts and Juvenile Delinquency

This is a 3 × 2 × 2 table. It classifies n = 800 boys according to socioeconomic status (S), whether they are a boy scout (B), and whether they have been labeled as a juvenile delinquent (D):

Socioeconomic status   Boy scout   Delinquent: Yes   Delinquent: No
Low                    Yes         11                43
Low                    No          42                169
Medium                 Yes         14                104
Medium                 No          20                132
High                   Yes         8                 196
High                   No          2                 59

To fit the full independence model, we need to find the marginal totals for B,

\(n_{1++}=11+43+14+104+8+196=376\),
\(n_{2++}=42+169+20+132+2+59=424\),

for D,

\(n_{+1+}=11+42+14+20+8+2=97\),
\(n_{+2+}=43+169+104+132+196+59=703\),

and for S,

\(n_{++1}=11+43+42+169=265\),
\(n_{++2}=14+104+20+132=270\),
\(n_{++3}=8+196+2+59=265\).
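These marginal totals can be checked with a short script. The sketch below is in Python rather than the course's SAS or R, and the nested-dictionary layout is just one convenient (assumed) way to hold the table.

```python
# counts[status][boy_scout] = [delinquent yes, delinquent no], from the table above
counts = {
    "Low":    {"Yes": [11, 43],  "No": [42, 169]},
    "Medium": {"Yes": [14, 104], "No": [20, 132]},
    "High":   {"Yes": [8, 196],  "No": [2, 59]},
}

# Marginal totals for B (boy scout), D (delinquent), and S (socioeconomic status)
n_B = {b: sum(sum(counts[s][b]) for s in counts) for b in ("Yes", "No")}
n_D = {d: sum(counts[s][b][idx] for s in counts for b in counts[s])
       for idx, d in enumerate(("Yes", "No"))}
n_S = {s: sum(sum(counts[s][b]) for b in counts[s]) for s in counts}
n = sum(n_S.values())

print(n_B, n_D, n_S, n)
```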

Calculate the expected counts for each cell, \(E_{ijk}=\dfrac{n_{i++}n_{+j+}n_{++k}}{n^2}\) and then calculate the chi-square statistics.

The degrees of freedom for this test are \((2\times 2\times 3-1)-[(2-1)+(2-1)+(3-1)]=7\) so p-values can be found as \(P(\chi^2_7 \geq X^2)\) and \(P(\chi^2_7 \geq G^2)\).
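A minimal sketch of the whole calculation for the boy scout data, from observed counts to \(X^2\) and \(G^2\), is given below. It is written in Python for illustration and is not the course's boys.sas or boys.R; the helper `margin` is a hypothetical name. The \(G^2\) sum as written assumes every observed count is positive, which holds for this table.

```python
from math import log

# Observed counts keyed by (boy scout, delinquent, socioeconomic status)
observed = {
    ("Yes", "Yes", "Low"): 11,    ("Yes", "No", "Low"): 43,
    ("No", "Yes", "Low"): 42,     ("No", "No", "Low"): 169,
    ("Yes", "Yes", "Medium"): 14, ("Yes", "No", "Medium"): 104,
    ("No", "Yes", "Medium"): 20,  ("No", "No", "Medium"): 132,
    ("Yes", "Yes", "High"): 8,    ("Yes", "No", "High"): 196,
    ("No", "Yes", "High"): 2,     ("No", "No", "High"): 59,
}
n = sum(observed.values())

def margin(position):
    """One-way marginal totals for the variable at the given key position."""
    totals = {}
    for key, count in observed.items():
        totals[key[position]] = totals.get(key[position], 0) + count
    return totals

n_B, n_D, n_S = margin(0), margin(1), margin(2)

# Expected counts under mutual independence: E_ijk = n_i++ n_+j+ n_++k / n^2
expected = {(b, d, s): n_B[b] * n_D[d] * n_S[s] / n**2 for (b, d, s) in observed}

# Pearson and deviance statistics
X2 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
G2 = 2 * sum(observed[c] * log(observed[c] / expected[c]) for c in observed)

print(round(X2, 2), round(G2, 2))  # roughly 214.9 and 218.7
```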

Recall that a single line of SAS code is enough to get the p-value:

[Image: SAS program]

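To get the p-value in R, use 1-pchisq(218.6622, 7), where 218.6622 is the value of \(G^2\) for these data; the same call with the value of \(X^2\) gives the other p-value.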
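For readers without SAS or R at hand, the same upper-tail probability can be sketched in Python using only the standard library. The function below is a hypothetical helper that relies on the closed-form expression for the chi-square tail when the degrees of freedom are odd (as they are here, with df = 7); it is an illustration, not a general-purpose replacement for pchisq.

```python
from math import erfc, exp, pi, sqrt

def chi2_sf_odd_df(x, df):
    """P(chi-square_df >= x) for odd df, using the closed-form series."""
    if df < 1 or df % 2 != 1:
        raise ValueError("this sketch handles odd degrees of freedom only")
    chi = sqrt(x)
    two_sided_normal_tail = erfc(chi / sqrt(2))   # 2 * P(Z >= chi)
    normal_density = exp(-x / 2) / sqrt(2 * pi)   # standard normal pdf at chi
    series, denominator = 0.0, 1.0
    for r in range(1, (df - 1) // 2 + 1):
        denominator *= 2 * r - 1                  # 1, 1*3, 1*3*5, ...
        series += chi ** (2 * r - 1) / denominator
    return two_sided_normal_tail + 2 * normal_density * series

p_value = chi2_sf_odd_df(218.6622, 7)
print(p_value)  # a very small positive number
```

As a check on the helper, `chi2_sf_odd_df(14.067, 7)` should be close to 0.05, since 14.067 is the usual 5% critical value for 7 degrees of freedom.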

The p-values are essentially zero, indicating that the mutual independence model does not fit. Remember, in order for the chi-squared approximation to work well, the \(E_{ijk}\) need to be sufficiently large. Sufficiently large means that most of them (e.g., at least about 80%) should be at least five, and none should be less than one. We should examine the \(E_{ijk}\) to see if they are large enough.

As you can verify by running the code provided below, both R and SAS give the following observed counts, with the expected counts in parentheses:

Socioeconomic status   Boy scout   Delinquent: Yes   Delinquent: No
Low                    Yes         11 (15.102)       43 (109.448)
Low                    No          42 (17.030)       169 (123.420)
Medium                 Yes         14 (15.387)       104 (111.513)
Medium                 No          20 (17.351)       132 (125.749)
High                   Yes         8 (15.102)        196 (109.448)
High                   No          2 (17.030)        59 (123.420)

Here, the expected counts are sufficiently large for the chi-square approximation to work well, and thus we conclude that the variables B (boy scout), D (delinquent), and S (socioeconomic status) are not mutually independent.
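The rule of thumb for the expected counts can be checked mechanically. The sketch below recomputes the twelve expected counts from the marginal totals found earlier and applies the "at least about 80% are five or more, none below one" guideline; the variable names are assumptions made for this illustration.

```python
from itertools import product

# Marginal totals for the boy scout data
n_B = [376, 424]        # boy scout: yes, no
n_D = [97, 703]         # delinquent: yes, no
n_S = [265, 270, 265]   # socioeconomic status: low, medium, high
n = 800

# Expected counts under mutual independence
expected = [b * d * s / n**2 for b, d, s in product(n_B, n_D, n_S)]

share_at_least_5 = sum(e >= 5 for e in expected) / len(expected)
smallest = min(expected)
rule_ok = share_at_least_5 >= 0.8 and smallest >= 1

print(round(smallest, 3), share_at_least_5, rule_ok)
```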

Note: Most software packages will give a warning if more than 20% of the expected cell counts are less than 5, since this may affect the large-sample approximations.

There is no single function or call in SAS or R that will directly test the mutual independence model; we will see in Lesson 10 how to fit this model via a log-linear model. However, we can test it by relying on our understanding of two-way tables, and of marginal and partial tables and the related odds ratios. For mutual independence to hold, independence must hold in every two-way marginal table. Thus, we can analyze all of the two-way marginal tables (see the SAS and R code). We can do the chi-squared test of independence in each two-way table. Alternatively, we can consider the odds ratios in each two-way table. In this case, for example, the estimated odds ratio for the B × D table is 0.541, and it is not equal to 1; i.e., 1 is not covered by the 95% confidence interval for the odds ratio, (0.347, 0.845). Therefore, we have sufficient evidence to reject the null hypothesis that boy scout status and delinquent status are independent of one another, and thus to conclude that B, D, and S are not mutually independent.
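The B × D marginal odds ratio and its interval can be reproduced, up to rounding, with the usual Wald interval on the log scale. The sketch below collapses the three-way table over socioeconomic status by hand and assumes a normal critical value of 1.96; it is an illustration in Python and not the course's SAS or R code.

```python
from math import exp, log, sqrt

# B x D marginal table, obtained by summing over socioeconomic status
scout_delinquent = 11 + 14 + 8          # boy scout yes, delinquent yes
scout_not_delinquent = 43 + 104 + 196   # boy scout yes, delinquent no
nonscout_delinquent = 42 + 20 + 2       # boy scout no,  delinquent yes
nonscout_not_delinquent = 169 + 132 + 59  # boy scout no, delinquent no

odds_ratio = (scout_delinquent * nonscout_not_delinquent) / (
    scout_not_delinquent * nonscout_delinquent)

# Wald standard error of the log odds ratio
se_log_or = sqrt(1 / scout_delinquent + 1 / scout_not_delinquent
                 + 1 / nonscout_delinquent + 1 / nonscout_not_delinquent)

z = 1.96  # assumed critical value for a 95% interval
lower = exp(log(odds_ratio) - z * se_log_or)
upper = exp(log(odds_ratio) + z * se_log_or)

print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
```

Because the interval excludes 1, the marginal independence of B and D is rejected, which is enough to reject mutual independence.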

The following SAS or R code supports the above analysis by testing independence of two-way marginal tables. Again, we will see later in the course that this is done more efficiently via log-linear models.

For SAS, we will use the code found in the program boys.sas, as shown below.

[Image: SAS program boys.sas]

The output for this program can be found in this file: boys.lst.

For R, we will use the R program file boys.R to create contingency tables.

[Image: R code for the boys tables]

Then, the R code computes the test for mutual independence:

[Image: R code for the boys mutual independence test]

The output file, boys.out, contains the results for these procedures.

If two or more variables in a k-way table are not independent, then where does the dependence come from? That is, what other relationships might hold? What other models could describe these data? Maybe S and B are jointly independent of D?