11.1.3  Modeling Incomplete Tables
Structural zeros (and partial cross-classifications) are not as common as sampling zeros, but methods for incomplete tables can be useful for a variety of problems. These methods can be found in various chapters of Agresti (1996, 2002, 2007), Bishop, Fienberg and Holland (1975), and Fienberg (1980).
- Dealing with anomalous cells.
- Excluding problem sampling zeros from an analysis.
- Checking collapsibility across categories of a variable.
- Quasi-independence.
- Symmetry and quasi-symmetry.
- Marginal homogeneity.
- Bradley-Terry model for paired comparisons.
- (Guttman) scaling of response patterns.
- Estimating missing cells.
- Estimating population size.
- Others...
General ideas:
We remove such cell(s) from the model building and do the analysis by fitting models only to the cells that are not structural zeros.
We do this by filling in any number for a structural zero (the same value for the observed and the expected count); generally we just put in 0.
Then, to account for fixing these values and to remove these cells from the modeling, we create an indicator variable. To remove the (i, j) cell,
\begin{align}
I_{ij} &= 1 & \text{if the cell is the structural zero}\\
&= 0 & \text{for all other cells}\\
\end{align}
When this indicator is included in a loglinear model as a (numerical) explanatory variable, a single parameter is estimated for the structural zero, which uses up 1 df, and the cell is fitted perfectly.
Since structural zeros are fitted perfectly, they have 0 weight in the fit statistics.
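As a concrete sketch of this setup, using the health-concerns table from the example that follows (Python for illustration only; the course files do this with SAS GENMOD and R glm, and the variable names here are my own): the structural zero is filled in as 0 and flagged with a numeric indicator.

```python
# One row per cell, as passed to a modeling routine. The (Menstrual
# problems, Male) cell is the structural zero: its count is filled in
# as 0 and its numeric (not categorical!) indicator "delta" is 1.
cells = [("Sex/Reproduction", "Male", 6), ("Sex/Reproduction", "Female", 16),
         ("Menstrual problems", "Male", 0),   # structural zero, filled in
         ("Menstrual problems", "Female", 12),
         ("How healthy am I?", "Male", 49), ("How healthy am I?", "Female", 29),
         ("None", "Male", 77), ("None", "Female", 102)]
rows = [{"health": h, "gender": g, "count": n,
         # indicator is 1 only for the structural-zero cell
         "delta": 1 if (h, g) == ("Menstrual problems", "Male") else 0}
        for h, g, n in cells]
```

Because `delta` is numeric, it contributes a single parameter that fits that one cell perfectly, rather than a full set of dummy-coded levels.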
Saturated model
For example, consider the teens and health concerns data (see health.sas, health.lst and/or health.R).
                            Gender
Health Concern         Male    Female
Sex/Reproduction          6        16
Menstrual problems        —        12
How healthy am I?        49        29
None                     77       102

(The — marks the structural zero.)
We can express the saturated loglinear model as:
\begin{align}
\mu_{ij} &= 0 & \text{for the (2,1) cell (the filled-in constant)}\\
\text{log}(\mu_{ij}) &= \lambda+\lambda_i^H+\lambda_j^G+\lambda_{ij}^{HG} & \text{for the rest}\\
\end{align}
A single equation for the saturated model:
\(\text{log}(\mu_{ij})=\lambda+\lambda_i^H+\lambda_j^G+\lambda_{ij}^{HG}+\delta_{21}I(2,1)\)
The \(\delta_{21}\) is a parameter that will equal whatever it needs to equal so that the (2,1) cell is fit perfectly (i.e., the fitted value will be exactly equal to whatever arbitrary constant you filled in for it).
If we just replace the missing value with a zero, GENMOD will treat this as a sampling-zero problem and will not properly adjust the degrees of freedom.
If we also include the indicator variable in the model, GENMOD will still see a zero count, but the fit and the degrees of freedom will be correct. We need to make sure that the indicator is a numerical variable and not a categorical one.
In the health.lst output, notice that the parameter estimates of the model and the fitted values are the same for the two cases above, but the chi-square statistics and the degrees of freedom for the significance of the individual parameters are not. For the calculation of the degrees of freedom, you can use the same formula as the one given in the Sparse Table section of this lesson.
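The degrees-of-freedom bookkeeping can be sketched as follows (my own helper function, not from the course files): each structural zero removes one cell from the fit, and its delta indicator absorbs one parameter, so for the independence model in an I × J table with z structural zeros, df = (I − 1)(J − 1) − z.

```python
# df for the independence model in an incomplete I x J table with
# z structural zeros (illustrative helper; names are my own).
def df_independence(I, J, z):
    cells = I * J - z                  # cells actually fitted
    params = 1 + (I - 1) + (J - 1)     # intercept + row + column terms
    return cells - params              # equals (I-1)*(J-1) - z

print(df_independence(4, 2, 1))  # 2 for the health-concerns table
```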
Here is the R code and output for the saturated model without the indicator variable:
And, here is the R code and output with the indicator variable:
The procedure above is a general approach to this problem that can be applied with any software package.
Recall that different software packages, and even different functions within the same package, may implement different default treatments of missing values and/or sampling zeros.
For example, if missing values are present, GENMOD by default will treat this as the case of a structural zero (e.g., health1.sas). Under "Model Information" it will clearly indicate the number of missing observations. Compare health.sas and health.lst to health1.sas and health1.lst.
Independence model
What would happen for an independence model?
How about degrees of freedom?
From health.lst:
What do you learn from the overall goodness of fit of this model? Are gender and health concerns independent?
What is the expected value of the structural-zero cell? Based on health.lst,
\(\hat{\mu}_{21}=\text{exp}(4.5466-2.0671-0.1076-22.9986)\approx 0\)
and based on health1.lst it is a large value, 10.78. Here is a part of the output with the prediction from health.lst:
and from health1.lst:
Here is the R output for the independence model:
Here is the output from the independence model with delta:
And, here is the output from the independence model with missing data:
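The quasi-independence fit behind these outputs can be reproduced with a short iterative proportional fitting (IPF) sketch (Python for illustration; the course files use SAS GENMOD and R glm, and the names below are mine). Filling in 0 at the structural zero and scaling to the row and column margins keeps that cell at 0; the sketch also recovers the extrapolated expected count 10.78 from health1.lst and checks that the \(\delta\) estimate in health.lst drives the fitted (2,1) count to the filled-in zero.

```python
import math

# Rows: Sex/Reproduction, Menstrual, Healthy, None; columns: Male, Female.
# The (2,1) cell (Menstrual, Male) is the structural zero, filled in as 0.
obs = [[6.0, 16.0],
       [0.0, 12.0],
       [49.0, 29.0],
       [77.0, 102.0]]
fit = [[0.0 if (i, j) == (1, 0) else 1.0 for j in range(2)] for i in range(4)]
for _ in range(200):
    for i in range(4):                           # match row totals
        s = sum(fit[i])
        fit[i] = [f * sum(obs[i]) / s for f in fit[i]]
    for j in range(2):                           # match column totals
        s = sum(fit[i][j] for i in range(4))
        t = sum(obs[i][j] for i in range(4))
        for i in range(4):
            fit[i][j] *= t / s

# G^2 over the 7 included cells -- the structural zero gets zero weight.
G2 = 2 * sum(o * math.log(o / f)
             for row_o, row_f in zip(obs, fit)
             for o, f in zip(row_o, row_f) if o > 0)
print(round(G2, 2))    # about 12.6 on 2 df, so independence is rejected

# Expected count for the structural-zero cell had it been an ordinary cell
# (what health1.lst reports): the row effect from cell (2,2) times the
# Male/Female ratio shared by all rows under independence.
mu21 = fit[1][1] * fit[0][0] / fit[0][1]
print(round(mu21, 2))  # 10.78, matching health1.lst

# The delta estimate in health.lst drives the fitted (2,1) count to zero:
print(math.exp(4.5466 - 2.0671 - 0.1076 - 22.9986))  # ~1e-9, effectively 0
```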
Incompleteness in higher dimensions
By analogy with the two-dimensional situation, we can consider loglinear models applied only to the non-missing cells; that is, create an indicator variable for each missing cell.
                            Male               Female
Health Concern         12–15   16–17       12–15   16–17
Sex/Reproduction           4       2           9       7
Menstrual problems         —       —           4       8
How healthy am I?         42       7          19      10
None                      57      20          71      31
The table below gives likelihood-ratio (G^2) and Pearson chi-square (X^2) statistics for various loglinear models. Note that the degrees of freedom are reduced by two from the usual degrees of freedom for a complete table, unless the gender-by-health margin is fitted.
See health3d.sas (health3.R) or health13d.sas.
Let H=health, G=gender, and A=age.
Model            df     G^2     X^2
(HG, HA, GA)      2    2.03    2.03
(HG, HA)          3    4.86    4.98
(HA, GA)          4   13.45   13.12
(HG, GA)          5    9.43    9.62
(HA, G)           5   17.46   17.05
(HG, A)           6   15.46   15.95
(H, GA)           7   22.03   22.59
(H, G, A)         8   28.24   30.53
Based on the overall goodness of fit, the models in which \(\lambda_{ij}^{HG}=0\) are quite poor. The models (a) \(\lambda_{ijk}^{HGA}=0\), (b) \(\lambda_{jk}^{GA}=\lambda_{ijk}^{HGA}=0\), and (c) \(\lambda_{ik}^{HA}=\lambda_{ijk}^{HGA}=0\) each fit acceptably at the 0.05 alpha level.
By further comparing these nested models, we see that the difference between (a) and (b) is small (\(\Delta G^2\) = 4.86 − 2.03 = 2.83, df = 3 − 2 = 1), while the difference between (a) and (c) is much larger (\(\Delta G^2\) = 9.43 − 2.03 = 7.40, df = 5 − 2 = 3, p ≈ 0.06). This indicates that model (b) is appropriate for these data, with the interpretation that, given a particular health concern (other than menstrual problems), there is no relationship between age and gender.
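Both claims can be checked numerically (a Python sketch with my own names; the course files use SAS/R). Model (HG, HA) is decomposable, so even with the two structural zeros its fitted values have the closed form \(\hat{\mu}_{hga} = n_{hg} n_{ha} / n_h\), and the nested comparisons can be given approximate p-values from closed-form chi-square survival functions.

```python
import math

# counts[h] = [[Male 12-15, Male 16-17], [Female 12-15, Female 16-17]]
counts = {"SexRepro":  [[4, 2], [9, 7]],
          "Menstrual": [[0, 0], [4, 8]],   # structural zeros for males
          "Healthy":   [[42, 7], [19, 10]],
          "None":      [[57, 20], [71, 31]]}

# Closed-form fit of (HG, HA): mu = n_hg * n_ha / n_h within each concern.
G2 = 0.0
for h, t in counts.items():
    n_h = sum(map(sum, t))
    for g in range(2):
        for a in range(2):
            if t[g][a] > 0:
                mu = sum(t[g]) * (t[0][a] + t[1][a]) / n_h
                G2 += 2 * t[g][a] * math.log(t[g][a] / mu)
print(round(G2, 2))  # 4.86, matching the (HG, HA) row of the table

# Chi-square survival function, closed forms for df = 1 and df = 3.
def chi2_sf(x, df):
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1 or 3 here")

print(round(chi2_sf(4.86 - 2.03, 1), 3))  # (a) vs (b): about 0.09
print(round(chi2_sf(9.43 - 2.03, 3), 3))  # (a) vs (c): about 0.06
```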
Anomalous Cells
Consider a situation where a model fits a table well overall, except for one or a few cells. The methodology for incomplete tables can be used to show that the model really does fit well except for these few cells. But then, of course, you need to discuss the anomalous cells, e.g., speculate about why they are not fitted well.
Example: Table 8-10 from Fienberg (1980). Mothers of children under the age of 19 were asked whether boys, girls, or both should be required to shovel snow off sidewalks. The responses were cross-classified according to the year in which the question was asked (1953, 1971) and the religion of the mother. There are only two response categories, since none of the mothers said girls only.
                      1953             1971
Religion          Boys   Both     Boys   Both
Protestant         104     42      165    142
Catholic            65     44      100    130
Jewish               4      3        5      6
Other               13      6       32     23
Let G = Gender, Y = Year, and R = Religion. Since we treat gender as the response variable and year and religion as explanatory variables, all models should include the \(\lambda_{ij}^{RY}\) terms in the loglinear model. Keep this in mind; we will see later that this corresponds to a logistic regression model.
The table below lists four loglinear models, all including \(\lambda_{ij}^{RY}\), and their fit statistics.
Model            df     G^2      p      X^2      p
(RY, G)           7    31.67  <.001    31.06  <.001
(RY, GY)          6    11.25   .08     11.25   .08
(RY, GR)          4    21.49  <.001    21.12  <.001
(RY, GY, GR)      3     0.36   .95      0.36   .95

Inference:
The homogeneous association model fits well.
How about the other models?
Compare the fit of (RY, GY) and the homogeneous association model: G^2[(RY, GY) | (RY, GY, GR)] = 11.25 − 0.36 = 10.89, df = 6 − 3 = 3, p-value ≈ 0.01.
However, let's take a closer look at (RY, GY). Maybe the effect of religion on the response can be accounted for by a single religious category?
The Pearson residuals from the (RY, GY) loglinear model:
                       1953               1971
Religion           Boys    Both      Boys    Both
Protestant          .75   −1.05       .91    −.91
Catholic           −.84    1.18     −1.42    1.42
Jewish             −.29     .41      −.22     .22
Other               .12    −.17       .85    −.85
The three largest residuals correspond to mothers who are Catholic. The model underpredicts Both and overpredicts Boys.
If we do not include Catholic mothers, would the model (RY, GY) (or the corresponding logit model) fit?
Since the fit of this model is G^{2} = 1.35 with df = 4, we can conclude that the (RY, GY) model fits well when we disregard the rows with the data from Catholic mothers. Thus these seem to be anomalous cells.
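These numbers can be reproduced directly (a Python sketch for illustration; names are mine). The (RY, GY) model is decomposable, so its fitted values have the closed form \(\hat{\mu}_{gyr} = n_{+yr}\, n_{gy+} / n_{++y}\), from which the Pearson residuals and G^2 follow, both for the full table and with the Catholic rows removed.

```python
import math

# data[religion][year] = [Boys count, Both count]
data = {"Protestant": {"1953": [104, 42], "1971": [165, 142]},
        "Catholic":   {"1953": [65, 44],  "1971": [100, 130]},
        "Jewish":     {"1953": [4, 3],    "1971": [5, 6]},
        "Other":      {"1953": [13, 6],   "1971": [32, 23]}}

def fit_ry_gy(d):
    """Closed-form (RY, GY) fit: mu = n_yr * n_gy / n_y within each year."""
    years = ["1953", "1971"]
    n_y = {y: sum(sum(d[r][y]) for r in d) for y in years}
    n_gy = {y: [sum(d[r][y][g] for r in d) for g in range(2)] for y in years}
    resid, G2 = {}, 0.0
    for r in d:
        for y in years:
            n_yr = sum(d[r][y])
            for g in range(2):          # g = 0 Boys, g = 1 Both
                mu = n_yr * n_gy[y][g] / n_y[y]
                obs = d[r][y][g]
                resid[(r, y, g)] = (obs - mu) / math.sqrt(mu)
                if obs > 0:
                    G2 += 2 * obs * math.log(obs / mu)
    return resid, G2

resid, G2 = fit_ry_gy(data)
print(round(G2, 2))                              # 11.25, the (RY, GY) row
print(round(resid[("Catholic", "1971", 1)], 2))  # 1.42 for Both in 1971

# Refit with Catholic mothers removed:
_, G2_sub = fit_ry_gy({r: v for r, v in data.items() if r != "Catholic"})
print(round(G2_sub, 2))                          # 1.35 on df = 4
```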
What may be some other ways of dealing with this data? Do you think a different way of classification would help? Or a different model?