5.3.2 - Joint Independence

The joint independence model specifies that two variables are jointly independent of a third. For example, suppose that \(Z\) is jointly independent of \(X\) and \(Y\). In log-linear notation, this model is denoted (\(XY\), \(Z\)). Here is a graphical representation of this model:

[Figure: nodes \(X\) and \(Y\) joined by a line, with \(Z\) as a separate, unconnected node]

which indicates that \(X\) and \(Y\) are jointly independent of \(Z\). The line linking \(X\) and \(Y\) allows for a possible association between them but does not require one. Therefore, the model of complete independence is a special case of this one.

For three variables, three different joint independence models may be considered. If the model of complete independence (\(X\), \(Y\), \(Z\)) fits a data set, then the model (\(XY\), \(Z\)) will also fit, as will (\(XZ\), \(Y\)) and (\(YZ\), \(X\)). In that case, we will prefer to use (\(X\), \(Y\), \(Z\)) because it is more parsimonious; our goal is to find the simplest model that fits the data. Note that joint independence does not imply mutual independence. This is one of the reasons why we start with the overall model of mutual independence before we collapse and look at models of joint independence.
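To see why complete independence implies joint independence, note that under (\(X\), \(Y\), \(Z\)) we have \(\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}\); summing over \(k\) gives \(\pi_{ij+} = \pi_{i++}\pi_{+j+}\), and substituting back yields \(\pi_{ijk} = \pi_{ij+}\pi_{++k}\), which is exactly the (\(XY\), \(Z\)) model.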

Assuming that the (\(XY\), \(Z\)) model holds, the cell probabilities would be equal to the product of the marginal probabilities from the \(XY\) margin and the \(Z\) margin:

\(\pi_{ijk} = P(X = i, Y = j)\,P(Z = k) = \pi_{ij+}\pi_{++k},\)

where \(\sum_i \sum_j \pi_{ij+} = 1\) and \(\sum_k \pi_{++k} = 1\). Thus, if we know the counts in the \(XY\) table and the \(Z\) table, we can compute the expected counts in the \(XYZ\) table. The number of free parameters, that is, the number of unknown parameters that we need to estimate, is \((IJ − 1) + (K − 1)\), and their ML estimates are \(\hat{\pi}_{ij+}=n_{ij+}/n\) and \(\hat{\pi}_{++k}=n_{++k}/n\). The estimated expected cell frequencies are:

\(\hat{E}_{ijk}=\dfrac{n_{ij+}n_{++k}}{n}\) (1)

Notice the similarity between this formula and the one for the model of independence in a two-way table, \(\hat{E}_{ij}=n_{i+}n_{+j}/n\). If we view \(X\) and \(Y\) as a single categorical variable with \(IJ\) levels, then the goodness-of-fit test for (\(XY\), \(Z\)) is equivalent to the test of independence between the combined variable \(XY\) and \(Z\). The implication of this is that you can re-write the 3-way table as a 2-way table and do its analysis.
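As a quick numerical illustration (a minimal sketch with made-up counts; the array n and its dimensions are purely illustrative), the expected counts from (1) reproduce the ordinary two-way chi-squared test on the flattened table:

# Toy I x J x K = 2 x 2 x 2 table of counts (values are made up)
n = array(c(10, 20, 30, 40, 50, 60, 70, 80), dim = c(2, 2, 2))
N = sum(n)

n_ij = margin.table(n, c(1, 2))   # XY margin, n_{ij+}
n_k  = margin.table(n, 3)         # Z margin,  n_{++k}

E = outer(n_ij, n_k) / N          # E-hat_{ijk} = n_{ij+} n_{++k} / n

# Pearson statistic computed directly from (1) ...
sum((n - E)^2 / E)
# ... agrees with the test of independence on the flattened (IJ) x K table
chisq.test(matrix(n, nrow = 4))$statistic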

To determine the degrees of freedom for evaluating this model and constructing the chi-squared and deviance statistics, we subtract the number of free parameters in the assumed model from the number of free parameters in the saturated model: DF \(= (IJK-1)-[(IJ-1)+(K-1)] = (IJ-1)(K-1)\). This ties in with the statement above that the test is equivalent to the test of independence in a two-way table of the combined variable (\(XY\)) versus \(Z\).
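For instance, in the Boy Scout example below, the combined explanatory variable \(B \times S\) has \(2 \times 3 = 6\) levels and the response \(D\) has \(2\) levels, so DF \(= (6-1)(2-1) = 5\), which is exactly the degrees of freedom reported in the SAS and R output.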

 Stop and Think!
Can you write out this two-way table for the Berkeley admissions example? Do you think that \(X=\) sex and \(Y=\) admission status are jointly independent of \(Z=\) department? Answer by conducting the chi-square test of independence for the \(XY\times Z\) table.
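If you would like to check your work in R, one possible setup (a minimal sketch; the object name admitsex_dept is illustrative) uses the built-in UCBAdmissions data, a \(2\times2\times6\) table of Admit \(\times\) Gender \(\times\) Dept:

# Flatten Admit x Gender x Dept into a 4 x 6 two-way table:
# rows = (Admit, Gender) combinations, columns = Dept
admitsex_dept = ftable(UCBAdmissions, row.vars = c(1, 2))
chisq.test(admitsex_dept)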

Example: Boy Scouts and Juvenile Delinquency

Going back to the Boy Scout example, let's think of juvenile delinquency D as the response variable, and the other two, boy scout status B and socioeconomic status S, as explanatory variables or predictors. It would therefore be useful to test the null hypothesis that D is independent of B and S, that is, the model (D, BS).

Because we used \(i\), \(j\), and \(k\) to index B, D, and S, respectively, the estimated expected cell counts under the model "B and S are independent of D" are a slight rearrangement of (1) above,

\(\hat{E}_{ijk}=\dfrac{n_{i+k}n_{+j+}}{n}.\)

We can calculate the \(E_{ijk}\)s by this formula or in SAS or R by entering the data as a two-way table with one dimension corresponding to D and the other dimension corresponding to B \(\times\) S. In fact, looking at the original table, we see that the data are already arranged in the correct fashion; the two columns correspond to the two levels of D, and the six rows correspond to the six levels of a new combined variable B \(\times\) S. To test the hypothesis "D is independent of B and S," we simply enter the data as a \(6\times 2\) matrix, where we have combined the socioeconomic status (SES) with scout status, essentially collapsing the 3-way table into a 2-way table.
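For instance, here is a minimal R version of that direct entry (the labels are illustrative; the counts are those given in the data below):

# 6 x 2 table: rows = combinations of SES and scout status, columns = delinquent
tab = matrix(c(169, 42,     # low,  non-scout
                43, 11,     # low,  scout
               132, 20,     # med,  non-scout
               104, 14,     # med,  scout
                59,  2,     # high, non-scout
               196,  8),    # high, scout
             nrow = 6, byrow = TRUE,
             dimnames = list(c("low_no", "low_yes", "med_no", "med_yes",
                               "high_no", "high_yes"),
                             c("no", "yes")))
chisq.test(tab)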

The SAS code for this example, boys.sas, is shown below:

data boys;
input SES $ scout $ delinquent $ count @@;
datalines;
low yes yes 11 low yes no 43
low no yes 42 low no no 169
med yes yes 14 med yes no 104
med no yes 20 med no no 132
high yes yes 8 high yes no 196
high no yes 2 high no no 59
;
proc freq;
 weight count;
 tables SES;
 tables scout;
 tables delinquent;
 tables SES*scout/all nocol nopct;
 tables SES*delinquent/all nocol nopct;
 tables scout*delinquent/all nocol nopct;
 tables SES*scout*delinquent/chisq cmh expected nocol nopct;
run;

Expected counts are printed below in the SAS output as the second entry in each cell:

The SAS System

The FREQ Procedure

Table of SES_scout by delinquent
(each cell shows Frequency / Expected / Row Pct)

SES_scout    delinquent = no          delinquent = yes        Total
high_no       59 /  53.604 / 96.72      2 /  7.3963 /  3.28     61
high_yes     196 / 179.27  / 96.08      8 / 24.735  /  3.92    204
low_no       169 / 185.42  / 80.09     42 / 25.584  / 19.91    211
low_yes       43 /  47.453 / 79.63     11 /  6.5475 / 20.37     54
med_no       132 / 133.57  / 86.84     20 / 18.43   / 13.16    152
med_yes      104 / 103.69  / 88.14     14 / 14.308  / 11.86    118
Total        703                       97                      800

Statistics for Table of SES_scout by delinquent

Statistic                      DF    Value     Prob
Chi-Square                      5   32.9576   <.0001
Likelihood Ratio Chi-Square     5   36.4147   <.0001
Mantel-Haenszel Chi-Square      1    5.5204   0.0188
Phi Coefficient                      0.2030
Contingency Coefficient              0.1989
Cramer's V                           0.2030

This SAS output is essentially the analysis of a two-way table, which we have already done before.

We will use the R code in boys.R: look for the "Joint Independence" comment heading and the use of the ftable() function, as shown below.

#### Test for Joint Independence of (D, BS)
#### creating the 6x2 table, BS x D
#### (temp is the 3-way table of counts created earlier in boys.R)

SESscout_delinquent = ftable(temp, row.vars = c(3, 2))
results = chisq.test(SESscout_delinquent)
results

Here is the output that R gives us:

Pearson's Chi-squared test
  
  data:  SESscout_delinquent
  X-squared = 32.958, df = 5, p-value = 3.837e-06

Our conclusions are based on the following evidence: the chi-squared statistics (e.g., \(X^2=32.9576\) and \(G^2=36.4147\), with \(df=5\)) are very large, and the p-value is essentially zero, indicating that B and S are not independent of D. The expected cell counts are all greater than five, so this p-value is reliable. The hypothesis of joint independence does not hold; we reject this model.
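In R, the expected counts can be inspected directly from the object returned by chisq.test(); continuing with the results object from above:

results$expected          # estimated expected counts under (D, BS)
any(results$expected < 5) # FALSE here, so the chi-squared approximation is safe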

Notice that we can also test the other joint independence models: (BD, S), that is, scout status and delinquency are jointly independent of SES, which involves analyzing a \(4\times 3\) table; and (SD, B), that is, SES and delinquency are jointly independent of scout status, which involves analyzing a \(6\times 2\) table. These are just additional analyses of two-way tables.
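Here is a sketch of these two additional tests in R, assuming temp is the same 3-way table used above, with its dimensions ordered as delinquent \(\times\) scout \(\times\) SES (as the row.vars = c(3, 2) call and the SES_scout row labels suggest):

# (BD, S): scout status and delinquency jointly independent of SES (4 x 3 table)
scoutdelinq_SES = ftable(temp, row.vars = c(2, 1))
chisq.test(scoutdelinq_SES)

# (SD, B): SES and delinquency jointly independent of scout status (6 x 2 table)
SESdelinq_scout = ftable(temp, row.vars = c(3, 1))
chisq.test(SESdelinq_scout)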

It is worth noting that while many different models are possible, one must keep their interpretation in mind. For example, taking delinquency to be the response, the model (D, BS) has a natural interpretation: if the model holds, then B and S cannot predict D. If the model does not hold, either B or S may be associated with D. However, (BD, S) and (SD, B) may not lend themselves to easy interpretation, although statistically we can perform the tests of independence.

