The joint independence model implies that two variables are jointly independent of a third. For example, suppose that \(X\) and \(Y\) are jointly independent of \(Z\). In loglinear notation, this model is denoted as (\(XY\), \(Z\)). Here is a graphical representation of this model:
which indicates that \(X\) and \(Y\) are jointly independent of \(Z\). The line linking \(X\) and \(Y\) indicates that \(X\) and \(Y\) are possibly related, but not necessarily so. Therefore, the model of complete independence is a special case of this one.
For three variables, three different joint independence models may be considered. If the model of complete independence (\(X\), \(Y\), \(Z\)) fits a data set, then the model (\(XY\), \(Z\)) will also fit, as will (\(XZ\), \(Y\)) and (\(YZ\), \(X\)). In that case, we will prefer to use (\(X\), \(Y\), \(Z\)) because it is more parsimonious; our goal is to find the simplest model that fits the data. Note that joint independence does not imply mutual independence. This is one of the reasons why we start with the overall model of mutual independence before we collapse and look at models of joint independence.
Assuming that the (\(XY\), \(Z\)) model holds, the cell probabilities would be equal to the product of the marginal probabilities from the \(XY\) margin and the \(Z\) margin:
\(\pi_{ijk} = P(X = i, Y = j)\,P(Z = k) = \pi_{ij+} \pi_{++k},\)
where \(\sum_i \sum_j \pi_{ij+} = 1\) and \(\sum_k \pi_{++k} = 1\). Thus if we know the counts in the \(XY\) table and the \(Z\) table, we can compute the expected counts in the \(XYZ\) table. The number of free parameters, that is, the number of unknown parameters we need to estimate, is \((IJ − 1) + (K − 1)\), and their ML estimates are \(\hat{\pi}_{ij+}=n_{ij+}/n\) and \(\hat{\pi}_{++k}=n_{++k}/n\). The estimated expected cell frequencies are:
\(\hat{E}_{ijk}=\dfrac{n_{ij+}n_{++k}}{n}\) (1)
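As a small illustration of formula (1), here is a Python sketch (not part of the lesson's SAS/R code; the counts are hypothetical) that computes the estimated expected counts for a \(2\times2\times2\) table from its \(XY\) and \(Z\) margins:

```python
# Hypothetical 2x2x2 table of counts, indexed as n[i][j][k] for (X, Y, Z).
n = [[[10, 20], [30, 40]],
     [[5, 15], [25, 35]]]

N = sum(n[i][j][k] for i in range(2) for j in range(2) for k in range(2))  # grand total

# XY margin n_{ij+} and Z margin n_{++k}
n_ij = [[sum(n[i][j]) for j in range(2)] for i in range(2)]
n_k = [sum(n[i][j][k] for i in range(2) for j in range(2)) for k in range(2)]

# Estimated expected counts under (XY, Z): E_{ijk} = n_{ij+} * n_{++k} / N
E = [[[n_ij[i][j] * n_k[k] / N for k in range(2)] for j in range(2)] for i in range(2)]
```

By construction, the \(\hat{E}_{ijk}\) values reproduce the \(XY\) and \(Z\) margins and sum to the grand total.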
Notice the similarity between this formula and the one for the model of independence in a two-way table, \(\hat{E}_{ij}=n_{i+}n_{+j}/n\). If we view \(X\) and \(Y\) as a single categorical variable with \(IJ\) levels, then the goodness-of-fit test for (\(XY\), \(Z\)) is equivalent to the test of independence between the combined variable \(XY\) and \(Z\). The implication of this is that we can rewrite the three-way table as a two-way table and analyze it as such.
To determine the degrees of freedom for evaluating this model and constructing the chi-squared and deviance statistics, we find the number of free parameters in the saturated model and subtract from it the number of free parameters in the new model. Taking the difference between the number of parameters fitted in the saturated model and in the assumed model gives DF = \((IJK-1)-[(IJ-1)+(K-1)]\), which equals \((IJ-1)(K-1)\). This ties in with the statement above that the test is equivalent to the test of independence in a two-way table of the combined variable (\(XY\)) vs. the variable \(Z\).
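The algebra behind this degrees-of-freedom count can be spot-checked numerically; the following Python snippet (illustrative only) verifies the identity for a few table dimensions:

```python
# Verify that (IJK - 1) - [(IJ - 1) + (K - 1)] = (IJ - 1)(K - 1)
# for several illustrative choices of table dimensions I, J, K.
for I, J, K in [(2, 2, 2), (3, 2, 2), (2, 3, 4)]:
    df_model = (I * J * K - 1) - ((I * J - 1) + (K - 1))  # saturated minus assumed
    df_two_way = (I * J - 1) * (K - 1)                    # two-way independence test
    assert df_model == df_two_way
```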
Example: Boy Scouts and Juvenile Delinquency
Going back to the boy scout example, let’s think of juvenile delinquency D as the response variable, and the other two, boy scout status B and socioeconomic status S, as explanatory variables or predictors. Therefore, it would be useful to test the null hypothesis that D is independent of B and S, or (D, BS).
Because we used \(i\), \(j\), and \(k\) to index B, D, and S, respectively, the estimated expected cell counts under the model "B and S are independent of D" are a slight rearrangement of (1) above,
\(\hat{E}_{ijk}=\dfrac{n_{i+k}n_{+j+}}{n}\)
We can calculate the \(E_{ijk}\)s by this formula, or in SAS or R by entering the data as a two-way table with one dimension corresponding to D and the other dimension corresponding to B \(\times\) S. In fact, looking at the original table, we see that the data are already arranged in the correct fashion; the two columns correspond to the two levels of D, and the six rows correspond to the six levels of a new combined variable B \(\times\) S. To test the hypothesis "D is independent of B and S," we simply enter the data as a \(6\times 2\) matrix, where we have combined the socioeconomic status (SES) with scout status, essentially collapsing the three-way table into a two-way table.
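To make the collapsing concrete, here is a plain Python sketch (the lesson itself uses SAS and R) that forms the \(6\times 2\) table from the counts in this example and computes the Pearson chi-squared statistic by hand:

```python
# Collapse the boy scout data into a 6x2 table:
# rows are SES x scout combinations, columns are delinquent (yes, no).
table = [
    [11, 43],    # low SES, scout
    [42, 169],   # low SES, non-scout
    [14, 104],   # medium SES, scout
    [20, 132],   # medium SES, non-scout
    [8, 196],    # high SES, scout
    [2, 59],     # high SES, non-scout
]

n = sum(sum(row) for row in table)                        # grand total
row_tot = [sum(row) for row in table]                     # n_{i+k} margins (B x S)
col_tot = [sum(row[j] for row in table) for j in (0, 1)]  # n_{+j+} margin (D)

# Pearson chi-squared statistic for independence in the collapsed two-way table
x2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
         for i in range(6) for j in (0, 1))
df = (6 - 1) * (2 - 1)
print(round(x2, 4), df)  # close to the SAS/R value, X-squared = 32.958 with df = 5
```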
The SAS code for this example is boys.sas, shown below:
data boys;
input SES $ scout $ delinquent $ count @@;
datalines;
low yes yes 11 low yes no 43
low no yes 42 low no no 169
med yes yes 14 med yes no 104
med no yes 20 med no no 132
high yes yes 8 high yes no 196
high no yes 2 high no no 59
;
proc freq;
weight count;
tables SES;
tables scout;
tables delinquent;
tables SES*scout/all nocol nopct;
tables SES*delinquent/all nocol nopct;
tables scout*delinquent/all nocol nopct;
tables SES*scout*delinquent/chisq cmh expected nocol nopct;
run;
Expected counts are printed below in the SAS output as the second entry in each cell:
The FREQ Procedure

Statistics for Table of SES_scout by delinquent

Statistic                      DF   Value     Prob
--------------------------------------------------
Chi-Square                      5   32.9576   <.0001
Likelihood Ratio Chi-Square     5   36.4147   <.0001
Mantel-Haenszel Chi-Square      1    5.5204   0.0188
Phi Coefficient                     0.2030
Contingency Coefficient             0.1989
Cramer's V                          0.2030
The SAS output from this program is essentially evaluating a two-way table, which we have already done before.
We will use the R code boys.R, look for the "Joint Independence" comment heading, and use the ftable() function as shown below.
#### Test for Joint Independence of (D,BS)
#### creating 6x2 table, BS X D
SESscout_delinquent = ftable(temp, row.vars = c(3, 2))
results = chisq.test(SESscout_delinquent)
results
Here is the output that R gives us:
Pearson's Chi-squared test

data: SESscout_delinquent
X-squared = 32.958, df = 5, p-value = 3.837e-06
Our conclusions are based on the following evidence: the chi-squared statistics (e.g., \(X^2, G^2\)) are very large (e.g., \(X^2=32.9576\), \(df=5\)), and the p-value is essentially zero, indicating that B and S are not independent of D. The expected cell counts are all greater than five, so this p-value is reliable, and our hypothesis does not hold; we reject this model of joint independence.
Notice that we can also test the other joint independence models: (BD, S), that is, scout status and delinquency are jointly independent of SES, which involves analyzing a \(4\times3\) table; and (SD, B), that is, SES and delinquency are jointly independent of boy scout status, which involves analyzing a \(6\times2\) table. These are just additional analyses of two-way tables.
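As a sketch of how the same counts are rearranged for one of these alternatives, the following Python snippet (illustrative only) forms the \(4\times3\) table for the (BD, S) model:

```python
# Rearranging the same counts for the (BD, S) model: rows are the four
# scout-by-delinquency combinations, columns are the three SES levels.
#                 low  med  high
bd_by_s = [
    [11, 14, 8],      # scout, delinquent
    [43, 104, 196],   # scout, not delinquent
    [42, 20, 2],      # non-scout, delinquent
    [169, 132, 59],   # non-scout, not delinquent
]

n = sum(sum(row) for row in bd_by_s)  # grand total, same 800 subjects
df = (4 - 1) * (3 - 1)                # degrees of freedom for the 4x3 independence test
```

The chi-squared test of independence then proceeds exactly as for the \(6\times2\) table above, just with these margins.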
It is worth noting that while many different varieties of models are possible, one must keep in mind the interpretation of the models. For example, assuming delinquency to be the response, the model (D, BS) has a natural interpretation; if the model holds, it means that B and S cannot predict D. But if the model does not hold, either B or S may be associated with D. However, (BD, S) and (SD, B) may not lend themselves to easy interpretation, although statistically we can perform the tests of independence.