
Structural zeros (and partial cross-classifications) are not as common as sampling zeros. However, methods for incomplete tables are useful for a variety of problems, including the following. These methods can be found in various chapters of Agresti (1996, 2002, 2007), Bishop, Fienberg, and Holland (1975), and Fienberg (1980).

  1. Dealing with anomalous cells.
  2. Excluding problem sampling zeros from an analysis.
  3. Checking collapsibility across categories of a variable.
  4. Quasi-independence.
  5. Symmetry and quasi-symmetry.
  6. Marginal homogeneity.
  7. Bradley-Terry model for paired comparisons.
  8. (Guttman) scaling of response patterns.
  9. Estimating missing cells.
  10. Estimation of population size.
  11. Others...

General ideas:

We remove the structural-zero cell(s) from model building and carry out the analysis by fitting models only to the cells that are not structural zeros.

We do this by filling in any number for a structural zero (the same value for the observed and the expected count); generally we just put in 0.

Then, to account for fixing these values and to remove these cells from the modeling, we create an indicator variable. To remove the (i, j) cell,

\begin{align}
I_{ij} &= 1 & \text{if the cell is the structural zero}\\
&= 0 & \text{for all other cells}\\
\end{align}

When this indicator is included in a loglinear model as a (numerical) explanatory variable, a single parameter is estimated for the structural zero, which uses up 1 df, and the cell is fitted perfectly.

Since structural zeros are fitted perfectly, they have 0 weight in the fit statistics.
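As a minimal sketch of this construction (using a small hypothetical 3 x 2 table and made-up variable names, not the data analyzed below), the indicator approach could be set up in R roughly as follows:

# Hypothetical 3 x 2 table with a structural zero in cell (2,1);
# the structural zero is filled in with an arbitrary value (0 here).
count <- c(12, 5,
            0, 8,     # cell (2,1) is the structural zero
           20, 15)
row   <- factor(rep(1:3, each = 2))
col   <- factor(rep(1:2, times = 3))
delta <- as.numeric(row == 2 & col == 1)   # numeric indicator, not a factor

# Loglinear (independence) model fit only to the non-structural-zero cells;
# delta uses 1 df and forces a perfect fit for the filled-in (2,1) cell.
fit <- glm(count ~ row + col + delta, family = poisson)
fitted(fit)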

Saturated model

For example, consider the teens and health concerns data (see health.sas, health.lst and/or health.R).

                         Gender
Health Concern          Male    Female
Sex/Reproduction           6        16
Menstrual problems         -        12
How healthy am I?         49        29
None                      77       102

We can express the saturated loglinear model as:

\begin{align}
\text{log}(\mu_{ij}) &= 0 & \text{for the (2,1) cell}\\
&= \lambda+\lambda_i^H+\lambda_j^G+\lambda_{ij}^{HG} & \text{for the rest}\\
\end{align}

A single equation for the saturated model:

\(\text{log}(\mu_{ij})=\lambda+\lambda_i^H+\lambda_j^G+\lambda_{ij}^{HG}+\delta_{21}I(2,1)\)

The \(\delta_{21}\) is a parameter that will equal whatever it needs to so that the (2,1) cell is fitted perfectly (i.e., the fitted value will be exactly equal to whatever arbitrary constant you filled in for it).

If we simply replace the missing value with a zero, GENMOD will treat this as a sampling zero problem and will not properly adjust the degrees of freedom.

If we also include the indicator variable in the model, GENMOD will still treat the filled-in zero as an observed count, but it will give the proper degrees of freedom. We need to make sure that the indicator is a numerical variable and not a categorical (CLASS) variable.

In the health.lst output, notice that the parameter estimates of the model and the fitted values are the same for the two cases above, but the chi-square statistics and the degrees of freedom for the significance of the individual parameters are not. For the calculation of the degrees of freedom, you can use the same formula as the one given in the Sparse Table section of this lesson.

The R code and output for the saturated model, both without and with the indicator variable, are given in health.R.
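The course files themselves are not reproduced here, but a sketch along the following lines (not necessarily identical to health.R) enters the data from the table above, fills in 0 for the structural zero, and fits the saturated model without and with the indicator:

# Health concerns data; the (menstrual problems, male) cell is a structural
# zero, filled in with 0.
health_df <- data.frame(
  count  = c(6, 16, 0, 12, 49, 29, 77, 102),
  health = factor(rep(c("sex/repro", "menstrual", "healthy", "none"), each = 2)),
  gender = factor(rep(c("male", "female"), times = 4))
)
health_df$delta <- as.numeric(health_df$health == "menstrual" &
                              health_df$gender == "male")

# Saturated model without the indicator: the filled-in 0 is treated as a
# sampling zero, and its interaction parameter is driven to a very large
# negative value (fitted count essentially 0).
sat1 <- glm(count ~ health * gender, family = poisson, data = health_df)

# Saturated model with the numeric indicator.  Because the full interaction
# already has a parameter for every cell, the indicator is redundant here and
# R reports one aliased (NA) coefficient; the fitted counts are unchanged.
sat2 <- glm(count ~ delta + health * gender, family = poisson, data = health_df)

cbind(observed = health_df$count, fitted = round(fitted(sat2), 2))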

The procedure described above is a general approach to this problem and can be applied with any software program.

Recall that different software packages, and even different functions within the same package, implement different default treatments of missing values and/or sampling zeros.

For example, if missing values are present, GENMOD by default will assume that they are structural zeros (e.g., health1.sas). Under Model Information it will clearly indicate the number of missing observations. Compare health.sas and health.lst to health1.sas and health1.lst.

Independence model

What would happen for an independence model?

How about degrees of freedom?

The overall goodness-of-fit statistics for this model are given in health.lst.

What do you learn from the overall goodness of fit of this model? Are gender and health concerns independent?

What is the expected value for the (2,1) cell? Based on health.lst,

\(\hat{\mu}_{21}=\text{exp}(4.5466-2.0671-0.1076-22.9986)\approx 0\)

and based on health1.lst it is a large value, 10.78. The relevant parts of the output with the predicted values can be found in health.lst and health1.lst.

The R code and output for the independence model, for the independence model with delta, and for the independence model with missing data are given in health.R.
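Continuing with the health_df data frame and delta indicator (assumed names) from the sketch above, the independence and quasi-independence fits might look like the following; the last two lines show how to obtain the two different fitted values for the (menstrual problems, male) cell discussed above:

# Reusing health_df and delta from the saturated-model sketch above.

# Independence model treating the filled-in 0 as a sampling zero:
ind0 <- glm(count ~ health + gender, family = poisson, data = health_df)

# Quasi-independence: independence for all cells except the structural zero,
# which is removed from the fit through the numeric indicator (1 df used).
qind <- glm(count ~ health + gender + delta, family = poisson, data = health_df)
c(G2 = deviance(qind), df = df.residual(qind))

# Equivalent fit obtained by entering the cell as missing instead of using an
# indicator (this mirrors health1.sas, where GENMOD drops the missing cell):
qind_na <- glm(count ~ health + gender, family = poisson,
               data = transform(health_df,
                                count = replace(count, delta == 1, NA)))

# Fitted value for the (menstrual problems, male) cell under each treatment:
fitted(qind)[health_df$delta == 1]                    # essentially 0
predict(qind_na, newdata = subset(health_df, delta == 1),
        type = "response")                            # the larger value, as in health1.lst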

Incompleteness in higher dimensions

By analogy with the two-dimensional situation, we can consider log-linear models applied only to the non-missing cells; that is, we create an indicator variable for each missing cell.

                              Gender / Age
                           Male              Female
Health                 12-15    16-17    12-15    16-17
Sex/Reproduction           4        2        9        7
Menstrual problems         -        -        4        8
How healthy am I?         42        7       19       10
None                      57       20       71       31

The table below gives likelihood-ratio (G2) and Pearson chi-square (X2) statistics for various log-linear models. Note that the degrees of freedom are reduced by two from the usual degrees of freedom for a complete table, unless the gender-by-health margin is fitted; in that case the (menstrual problems, male) margin is entirely zero, its HG parameter is not estimable, and the net reduction is only one degree of freedom.

See health3d.sas (health3.R) or health13d.sas.

Let H=health, G=gender, and A=age.

Model            df      G2       X2
(HG, HA, GA)      2     2.03     2.03
(HG, HA)          3     4.86     4.98
(HA, GA)          4    13.45    13.12
(HG, GA)          5     9.43     9.62
(HA, G)           5    17.46    17.05
(HG, A)           6    15.46    15.95
(H, GA)           7    22.03    22.59
(H, G, A)         8    28.24    30.53

Based on the overall goodness of fit, the models in which \(\lambda_{ij}^{HG}=0\) fit quite poorly. The fits of models (a) \(\lambda_{ijk}^{HGA}=0\), (b) \(\lambda_{jk}^{GA}=\lambda_{ijk}^{HGA}=0\), and (c) \(\lambda_{ik}^{HA}=\lambda_{ijk}^{HGA}=0\) are each acceptable at the 0.05 alpha level.

By further comparison of these nested models, we notice that the difference between (a) and (b) is small (G2 difference 4.86 − 2.03 = 2.83, df = 3 − 2 = 1), while the difference between (a) and (c) is relatively large (9.43 − 2.03 = 7.40, df = 5 − 2 = 3). This indicates that model (b) is appropriate for these data, with the interpretation that, given a particular health concern (other than menstrual problems), there is no relationship between age and gender.
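A sketch of how these fits and comparisons might be reproduced in R (not necessarily identical to health3.R): the two structural zeros are filled in with 0 and given their own numeric indicators, and the nested models are compared by differencing the G2 values.

# Three-way health data from the table above; the two (menstrual problems,
# male) cells are structural zeros, filled in with 0.
d3 <- expand.grid(
  age    = factor(c("12-15", "16-17")),
  gender = factor(c("male", "female")),
  health = factor(c("sex/repro", "menstrual", "healthy", "none"))
)
d3$count <- c( 4,  2,  9,  7,    # sex/reproduction
               0,  0,  4,  8,    # menstrual problems (two structural zeros)
              42,  7, 19, 10,    # how healthy am I?
              57, 20, 71, 31)    # none
# One numeric indicator per structural zero:
d3$z1 <- as.numeric(d3$health == "menstrual" & d3$gender == "male" & d3$age == "12-15")
d3$z2 <- as.numeric(d3$health == "menstrual" & d3$gender == "male" & d3$age == "16-17")

# (a) (HG, HA, GA), (b) (HG, HA), and (c) (HG, GA), each with the indicators:
m_a <- glm(count ~ z1 + z2 + health*gender + health*age + gender*age,
           family = poisson, data = d3)
m_b <- glm(count ~ z1 + z2 + health*gender + health*age,
           family = poisson, data = d3)
m_c <- glm(count ~ z1 + z2 + health*gender + gender*age,
           family = poisson, data = d3)
# (Because the (menstrual problems, male) margin is entirely zero, one
#  health:gender coefficient is aliased with z1 + z2; R reports it as NA,
#  and the residual df already reflect this.)

# Overall fit (G2 and df) and the nested comparisons discussed above:
cbind(G2 = sapply(list(a = m_a, b = m_b, c = m_c), deviance),
      df = sapply(list(a = m_a, b = m_b, c = m_c), df.residual))
anova(m_b, m_a, test = "Chisq")   # (b) vs (a)
anova(m_c, m_a, test = "Chisq")   # (c) vs (a)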

Anomalous Cells

Consider a situation where, overall, a model fits a table well except for one or a few cells. The methodology for incomplete tables can be used to show that the model really does fit well except for these few cells. But then, of course, you need to talk about the anomalous cells; for example, speculate about why they are not being fitted well.

Example: Table 8-10 from Fienberg (1980). Mothers of children under the age of 19 were asked whether boys, girls, or both should be required to shovel snow off sidewalks. The responses were cross-classified according to the year in which the question was asked (1953, 1971) and the religion of the mother. There are only two response categories, since none of the mothers said just girls.

 
                      1953              1971
Religion           Boys    Both     Boys    Both
Protestant          104      42      165     142
Catholic             65      44      100     130
Jewish                4       3        5       6
Other                13       6       32      23

Let G = Gender, Y = Year, and R = Religion. Since we treat gender as the response variable and year and religion as explanatory variables, all models should include the \(\lambda_{ij}^{RY}\) term in the loglinear model. Keep this in mind; we will see later that this corresponds to a logistic regression model.

The table below lists four log-linear models, all including \(\lambda_{ij}^{RY}\), and their fit statistics.

Model            df      G2       p       X2       p
(RY, G)           7    31.67    <.001   31.06    <.001
(RY, GY)          6    11.25     .08    11.25     .08
(RY, GR)          4    21.49    <.001   21.12    <.001
(RY, GY, GR)      3     0.36     .95     0.36     .95

Inference:

The homogeneous association model fits well.

How about the other models?

Compare the fit of (RY, GY) and the homogeneous association model: G2[(RY, GY)|(RY, GY, GR)] = 10.89, df = 3, p-value = 0.01.

However, let's take a closer look at (RY, GY). Maybe the effect of religion on the response can be accounted for by a single religious category?

The Pearson residuals from the (RY, GY) loglinear model:

 
                      1953              1971
Religion           Boys    Both     Boys    Both
Protestant          .75   -1.05      .91    -.91
Catholic           -.84    1.18    -1.42    1.42
Jewish             -.29     .41     -.22     .22
Other               .12    -.17      .85    -.85

The three largest residuals correspond to mothers who are Catholic. The model under-predicts Both and over-predicts Boys.

If we do not include Catholic mothers, would the model (RY, GY) (or the corresponding logit model) fit?

Since the fit of this model is G2 = 1.35 with df = 4, we can conclude that the (RY, GY) model fits well when we disregard the row with the data from Catholic mothers. Thus, these appear to be anomalous cells.
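A sketch of how this check might be carried out in R (the counts are entered from the table above; this is not code from the course files): fit (RY, GY), inspect the Pearson residuals, and then refit after removing the Catholic cells, which is essentially equivalent to giving each of those four cells its own indicator as in the incomplete-table approach.

# Snow-shoveling data (Fienberg, Table 8-10), entered from the table above.
snow <- expand.grid(
  gender   = factor(c("boys", "both")),
  year     = factor(c("1953", "1971")),
  religion = factor(c("Protestant", "Catholic", "Jewish", "Other"))
)
snow$count <- c(104, 42, 165, 142,   # Protestant
                 65, 44, 100, 130,   # Catholic
                  4,  3,   5,   6,   # Jewish
                 13,  6,  32,  23)   # Other

# Loglinear model (RY, GY) and its Pearson residuals:
fit_gy <- glm(count ~ religion*year + gender*year, family = poisson, data = snow)
snow$pearson <- residuals(fit_gy, type = "pearson")
round(xtabs(pearson ~ religion + gender + year, data = snow), 2)

# Refit (RY, GY) with the Catholic cells removed from the fit:
snow_noC <- droplevels(subset(snow, religion != "Catholic"))
fit_noC  <- glm(count ~ religion*year + gender*year, family = poisson, data = snow_noC)
c(G2 = deviance(fit_noC), df = df.residual(fit_noC))   # compare with G2 = 1.35, df = 4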

What might be some other ways of dealing with these data? Do you think a different classification would help? Or a different model?