11.1.1  Sparse Tables
As we consider higher-dimensional contingency tables, we are more likely to encounter sparseness. A sparse table is one with many cells that have small counts and/or zeros. How many is 'many' and how small is 'small' are relative to
 the sample size n
 the table size N
We can have a small sample size, or a large sample size spread over a large number of cells. A large number of cells can be due either to a large number of classification variables, or to a small number of variables with many levels.
Types of zeros (or empty cells)
Mostly when we talk about sparseness we are talking about zeros, since this is the problem most frequently encountered. There are different types of zeros, and we need to differentiate between them. We have discussed the types of zeros earlier in passing; now we will examine the situations in more detail.
Sampling zeros occur when there is no observation in a cell, i.e., n_{ij} = 0, but there is still a positive probability of observing a value there: P(observation in the cell) = π_{ij} > 0.
For example, if you increase the sample size n, you might get n_{ij} > 0. Consider sampling graduate students under 18 years old: they exist, but we didn't sample them. We can think of these values as random zeros; they could be missing at random, or there may be a more structural missingness mechanism due to the study design.
Structural zeros are cells in which it is theoretically impossible to observe a value, i.e., n_{ij} = 0 and π_{ij} = 0.
Tables with structural zeros are structurally incomplete. They are also known as incomplete tables.
This is different from a partial classification, where an incomplete table results from not being able to completely cross-classify all individuals or units.
Example of a structurally incomplete table
Survey of teenagers regarding their health concerns (Fienberg (1980)):
Health Concern           Male   Female
Sex/Reproduction            6       16
Menstrual problems          -       12
How healthy am I?          49       29
None                       77      102
The reasonable assumption here is that males will not have concerns with menstrual problems; therefore, the chance of observing this type of concern is P(male with menstrual problems) = 0.
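A quasi-independence model can be fit to a table like this while holding the structural zero at zero. A minimal sketch in base R, using the health-concern counts from the table above: loglin()'s documented start argument marks structural-zero cells with starting value 0, which iterative proportional fitting then leaves at 0.

```r
# Health-concern table; the (menstrual, male) cell is a structural zero
tab <- matrix(c(6, 16,
                0, 12,
                49, 29,
                77, 102),
              nrow = 4, byrow = TRUE,
              dimnames = list(concern = c("sex/repro", "menstrual",
                                          "healthy", "none"),
                              gender = c("male", "female")))

# Starting values: a 0 in the structural-zero cell keeps it at 0 under IPF
start <- matrix(1, nrow = 4, ncol = 2)
start[2, 1] <- 0

# Quasi-independence: fit both one-way margins, skipping the empty cell
fit <- loglin(tab, margin = list(1, 2), start = start,
              fit = TRUE, print = FALSE)
fit$fit[2, 1]   # the structural zero remains exactly 0 in the fitted table
```

Note that the df reported by loglin() does not account for the empty cell, so it still has to be adjusted by hand.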
Example of a partial classification:
Data from a study in which elderly people were asked whether they took tranquillizers (Agresti, 1990). Some individuals were interviewed in 1979, some in 1985, and some in both years.
                       1985
1979            yes      no    not sampled    total
yes             175     190        230          595
no              139    1518        982         2639
not sampled      64     596          -          659
total           378    2303       1212         3893
In this case, there were some possible subjects that simply were not sampled.
Recognizing and understanding different types of incompleteness has implications for statistical analysis and inference.
For example, if there are structural zeros or incomplete classification, you should NOT:
 fill in the cells with zeros
 collapse the table until there are no zeros in the table
 quit the analysis
We will focus more on fitting models with sampling zeros. Some good references on modeling incomplete tables are Fienberg (1980), Chapter 8, and Bishop, Fienberg, and Holland (1975), Chapters 5, 6, and 8.
Effects of sampling zeros
There are two major effects of sampling zeros on modeling contingency tables:
 Maximum likelihood estimates (MLEs) may NOT exist for loglinear/logit models
 if MLEs do exist, they may be biased
Recall that terms in a loglinear model correspond to margins of the table. Existence of the MLEs depends on the position of zeros in the table and on which effects are included in the model.
If all n_{ij} > 0, then the MLEs of the parameters are finite.
If a table has a zero marginal frequency, and there is a term in the model corresponding to that margin, the MLEs are infinite. Recent research that sheds more light on the relationship between the pattern of zeros and the existence of MLEs uses tools from Algebraic Statistics (Rinaldo, 2005).
Hypothetical Example
Reference: Fienberg (1980), Ch. 8; ZerosEx.sas and ZerosEx.R. There are two examples in these files: the first with a nonzero margin but a bad pattern of zeros, and the second with a zero margin.
If there is a single sampling zero in a table, the MLEs are always positive.
If there are nonzero margins but a bad pattern of zeros, some MLEs of the cell counts will be negative (although you might not observe this in a computer program)!
If there is a zero margin, you cannot estimate the corresponding term in the model, and the estimated partial odds ratios involving those cells will equal 0 or ∞. For example, n_{+11} = 0, so there is no finite MLE for λ_{11}^{YZ}, and any model including this term will not fit properly.
We can force λ_{11}^{YZ} = 0, but then we must adjust the degrees of freedom of the model, and no software will do this automatically (look at the last part of ZerosEx.sas and ZerosEx.lst (or ZerosEx.R) and the homogeneous model)!
A general formula for adjusting DF
df = (T_{e} - Z_{e}) - (T_{p} - Z_{p})
where T_{e} = number of cells fitted, Z_{e} = number of zero cells, T_{p} = number of parameters in the model, and Z_{p} = number of parameters that cannot be estimated because of zero margins.
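This adjustment is easy to script. A trivial helper (the name adj_df is ours, not standard); the example call uses a 2 × 2 × 2 table fit with a 7-parameter homogeneous association model, with 2 zero cells and 1 parameter lost to a zero margin:

```r
# Adjusted degrees of freedom for a table with non-estimable cells/parameters
adj_df <- function(Te, Ze, Tp, Zp) (Te - Ze) - (Tp - Zp)

# (8 - 2) - (7 - 1) = 6 - 6 = 0
adj_df(Te = 8, Ze = 2, Tp = 7, Zp = 1)
```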
Let's see these examples. First consider the following 2 × 2 × 2 table, shown for each level of the Z variable:

Z=1:
        Y=1   Y=2
X=1       0     5
X=2       0    16

Z=2:
        Y=1   Y=2
X=1       6     9
X=2       5     7
Now suppose we want to fit the homogeneous association model (XY, XZ, YZ). Consider the marginal tables that correspond to the highest-order terms in the model and see if any of them have 0 counts:
XY:
        Y=1   Y=2
X=1       6    14
X=2       5    23

XZ:
        Z=1   Z=2
X=1       5    15
X=2      16    12

YZ:
        Z=1   Z=2
Y=1       0    11
Y=2      21    16
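The marginal tables above can be reproduced directly in R. A sketch (an R array fills its first index fastest, so the counts below list X within Y within Z, matching the two partial tables):

```r
# 2x2x2 table: entries ordered (X=1,Y=1,Z=1), (X=2,Y=1,Z=1), (X=1,Y=2,Z=1), ...
tab <- array(c(0, 0, 5, 16,    # Z = 1 slice
               6, 5, 9, 7),    # Z = 2 slice
             dim = c(2, 2, 2),
             dimnames = list(X = 1:2, Y = 1:2, Z = 1:2))

margin.table(tab, c(1, 2))   # XY margin
margin.table(tab, c(1, 3))   # XZ margin
margin.table(tab, c(2, 3))   # YZ margin: the (Y=1, Z=1) cell is 0
```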
Here, n_{+11} = 0, so there is no finite MLE for λ_{11}^{YZ}, and any model including this term will not fit properly. Next, notice that software does not usually detect that there is a problem; you need to pay attention to this! See below for hints on what to look for. Sometimes you will just get "NaN" instead of parameter estimates, i.e., the software won't be able to estimate them (e.g., loglin() in R). For now, let's look at the output from R's glm() and from SAS PROC GENMOD:
From the SAS output, notice the huge standard errors for some of the terms, such as Y*Z y1 z1 below!
Analysis Of Parameter Estimates
Standard Wald 95% Chi
Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq
Intercept 1 1.9459 0.3780 1.2051 2.6867 26.51 <.0001
X x1 1 0.2513 0.5040 -0.7364 1.2390 0.25 0.6180
X x2 0 0.0000 0.0000 0.0000 0.0000 . .
Y y1 1 -0.3365 0.5855 -1.4841 0.8112 0.33 0.5655
Y y2 0 0.0000 0.0000 0.0000 0.0000 . .
Z z1 1 0.8267 0.4532 -0.0615 1.7149 3.33 0.0681
Z z2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x1 y1 1 -0.0690 0.7878 -1.6131 1.4751 0.01 0.9302
X*Y x1 y2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x2 y1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x2 y2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x1 z1 1 -1.4145 0.7187 -2.8230 -0.0059 3.87 0.0490
X*Z x1 z2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x2 z1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x2 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y1 z1 1 -26.0000 115148.2 -225712 225660.3 0.00 0.9998
Y*Z y1 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y2 z1 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y2 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Homogeneous Association with zero margin: (XY,XZ,YZ)
As expected, overall df = 1, but it really ought to be 0!
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 1 0.0000 0.0000
Here, if we apply the above formula, the df should be:
df = (T_{e} - Z_{e}) - (T_{p} - Z_{p}) = (8 - 2) - (7 - 1) = 6 - 6 = 0
However, the estimates are huge, and a model like this is useless. The best approach is to remove the cells with zero counts and refit the model; this also produces the correct degrees of freedom. In SAS, one way of doing this is:
/*delete zero values*/
data zeros1;
set zeros;
if count=0 then delete;
run;
/* homogeneous associations */
proc genmod order=data ;
class X Y Z;
model count = X Y Z X*Y X*Z Z*Y /link=log dist=poi obstats;
title 'Homogeneous Association with zero margin: (XY,XZ,YZ)';
run;
Now the output looks like:
Analysis Of Parameter Estimates
Standard Wald 95% Chi
Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq
Intercept 1 2.7726 0.2500 2.2826 3.2626 123.00 <.0001
X x1 1 -1.1632 0.5123 -2.1673 -0.1590 5.15 0.0232
X x2 0 0.0000 0.0000 0.0000 0.0000 . .
Y y1 1 -0.3365 0.5855 -1.4841 0.8112 0.33 0.5655
Y y2 0 0.0000 0.0000 0.0000 0.0000 . .
Z z2 1 -0.8267 0.4532 -1.7149 0.0615 3.33 0.0681
Z z1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x1 y1 1 -0.0690 0.7878 -1.6131 1.4751 0.01 0.9302
X*Y x1 y2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x2 y1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x2 y2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x1 z2 1 1.4145 0.7187 0.0059 2.8230 3.87 0.0490
X*Z x1 z1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x2 z2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x2 z1 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y1 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y2 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y2 z1 0 0.0000 0.0000 0.0000 0.0000 . .
Scale 0 1.0000 0.0000 1.0000 1.0000
Homogeneous Association with zero margin: (XY,XZ,YZ)
Also, notice that if you fit a model that does not require the YZ margin (which has the zero count), then there are no issues in fitting the model and getting the correct degrees of freedom. For example, fit the loglinear model of independence: the results will be the same whether you use the original 2 × 2 × 2 table or the one where you remove the zero counts.
From the R output, notice the huge standard errors for some of the terms, such as Yy2:Zz2 below!
Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log))
Deviance Residuals:
1 2 3 4 5 6 7 8
-8.988e-06 0.000e+00 0.000e+00 2.581e-08 -1.664e-05 0.000e+00 0.000e+00 1.490e-08
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.393e+01 4.535e+04 -0.001 1.000
Xx2 1.232e+00 9.397e-01 1.311 0.190
Yy2 2.554e+01 4.535e+04 0.001 1.000
Zz2 2.572e+01 4.535e+04 0.001 1.000
Xx2:Yy2 -6.899e-02 7.878e-01 -0.088 0.930
Xx2:Zz2 -1.414e+00 7.187e-01 -1.968 0.049 *
Yy2:Zz2 -2.514e+01 4.535e+04 -0.001 1.000

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3.7197e+01 on 7 degrees of freedom
Residual deviance: 3.5779e-10 on 1 degrees of freedom
AIC: 37.101
As expected, df = 1, but it really ought to be 0! Here, if we apply the above formula, the df should be:
df = (T_{e} - Z_{e}) - (T_{p} - Z_{p}) = (8 - 2) - (7 - 1) = 6 - 6 = 0
However, the estimates are huge, and a model like this is useless. The best approach is to remove the cells with zero counts and refit the model; this also produces the correct degrees of freedom. In R, one way of doing this is:
count = c(0, 6, 5, 9, 0, 5, 16, 7)
keep = which(count != 0)  # index the nonzero cells BEFORE overwriting count
count = count[keep]
X = X[keep]
Y = Y[keep]
Z = Z[keep]
### Homogeneous Association with zero margin: (XY,XZ,YZ)
model=glm(count~X+Y+Z+X*Y+X*Z+Z*Y,family=poisson(link=log))
summary(model)
And the output looks like:
Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log))
Deviance Residuals:
[1] 0 0 0 0 0 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7918 0.4082 4.389 1.14e-05 ***
Xx2 0.9808 0.4787 2.049 0.0405 *
Yy2 0.4055 0.5270 0.769 0.4417
Zz2 -0.1823 0.6055 -0.301 0.7633
Xx2:Yy2 NA NA NA NA
Xx2:Zz2 -0.6444 0.7563 -0.852 0.3942
Yy2:Zz2 -0.4055 0.8233 -0.493 0.6224

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 9.5791e+00 on 5 degrees of freedom
Residual deviance: 1.3323e-15 on 0 degrees of freedom
AIC: 35.101
Again, notice that if you fit a model that does not require the YZ margin (which has the zero count), then there are no issues in fitting the model and getting the correct degrees of freedom. For example, fit the loglinear model of independence: the results will be the same whether you use the original 2 × 2 × 2 table or the one where you remove the zero counts.
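To illustrate this point, a small check with base R's loglin() on the 2 × 2 × 2 table from above. The mutual independence model (X, Y, Z) uses only one-way margins, none of which contain a zero, so the fit converges with the usual degrees of freedom:

```r
tab <- array(c(0, 0, 5, 16, 6, 5, 9, 7), dim = c(2, 2, 2),
             dimnames = list(X = 1:2, Y = 1:2, Z = 1:2))

# Mutual independence model (X, Y, Z): fits only the one-way margins
fit <- loglin(tab, margin = list(1, 2, 3), fit = TRUE, print = FALSE)
fit$df    # 8 cells - 4 free parameters = 4, no adjustment needed
```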
Detecting a problem
The iterative algorithm for computing the MLE estimates of the model parameters may not converge, and the software output will typically give you a warning.
For example, in the .lst SAS file for PROC GENMOD, look for a note under Criteria For Assessing Goodness Of Fit, and check the .log SAS file for convergence warnings. Look at the fit of the saturated model in ZerosEx.lst and ZerosEx.log.
Here is another way to detect that there is a problem:
The estimated standard errors of parameters involving zero counts, and of the corresponding fitted counts, are huge in comparison to the rest of the estimates.
Notice the huge standard error, 139605.2, for λ_{111}^{XYZ} and for the corresponding fitted count.
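In R, this check is easy to automate. A sketch using the 2 × 2 × 2 example from above (the factor coding mimics ZerosEx.R; the threshold of 100 is an arbitrary flag value of ours, not a standard cutoff):

```r
# Cell counts in the order (Z fastest, then Y, then X), as in ZerosEx.R
d <- expand.grid(Z = factor(1:2), Y = factor(1:2), X = factor(1:2))
d$count <- c(0, 6, 5, 9, 0, 5, 16, 7)

# Homogeneous association model; the YZ margin contains a zero
fit <- glm(count ~ X * Y + X * Z + Y * Z, family = poisson, data = d)

# Flag coefficients whose standard error is suspiciously large
se <- summary(fit)$coefficients[, "Std. Error"]
names(se)[se > 100]   # the Y:Z term and other terms touching the zero cell
```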
The bias problem due to sparseness
 The odds ratio estimates can be severely biased
 The sampling distribution of fit statistics may be poorly approximated by the chi-squared distribution (this is related to the huge standard errors you are seeing).
An ad hoc solution: add .5 to each cell in the table
Adding .5 shrinks estimated odds ratios that are infinite down to finite values, and pulls estimates that are 0 up to positive values. However, for unsaturated models, adding .5 can oversmooth the data.
Be cautious! Keep in mind that an infinite estimate of a model parameter may be OK, but an infinite estimate of a true odds ratio is unsatisfactory.
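For a single 2 × 2 table, the effect of the correction is easy to see. A sketch using the YZ margin from the example above (the or() helper is our own shorthand for the sample odds ratio):

```r
# Sample odds ratio of a 2x2 table
or <- function(t) (t[1, 1] * t[2, 2]) / (t[1, 2] * t[2, 1])

yz <- matrix(c(0, 21, 11, 16), nrow = 2)   # YZ margin; one sampling zero
or(yz)         # exactly 0: the estimate sits on the boundary
or(yz + 0.5)   # pulled to a small positive, finite value
```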
Suggested Solutions:
 When a model does not converge, try adding a tiny number, such as 0.000000001, to all of the zero cells. Some argue you should add this value to all of the cells. Essentially, adding very small values is what software programs like SAS do when fitting these types of models.
 Extend this by doing a sensitivity analysis: add constants of varying sizes to the cells, and examine the fit statistics and parameter estimates to see if they change very much.
 Try alternative modeling procedures and techniques, such as random effects models, Bayesian methods, etc.
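The first two suggestions can be combined into a quick sensitivity check. A sketch using the same 2 × 2 × 2 data and coding as above (the particular set of constants is arbitrary; here we track one of the stable interaction estimates):

```r
# Same cell ordering as ZerosEx.R: Z fastest, then Y, then X
d <- expand.grid(Z = factor(1:2), Y = factor(1:2), X = factor(1:2))
count <- c(0, 6, 5, 9, 0, 5, 16, 7)

# Refit the homogeneous association model after adding constants of
# varying sizes, and watch how one estimate behaves across them
res <- sapply(c(1e-8, 1e-3, 0.1, 0.5), function(eps) {
  f <- suppressWarnings(   # non-integer counts trigger a harmless warning
    glm(count + eps ~ X * Y + X * Z + Y * Z, family = poisson, data = d))
  coef(f)["X2:Z2"]
})
res   # if these vary wildly, the estimate is sensitive to the smoothing
```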
Software solutions
The devil is in the details! When you have special situations, you have to pay special attention. Software solutions differ depending on whether a given procedure treats a zero as a sampling zero or a structural zero; this can even differ between versions of the same software! For a sampling zero, the most common solution is to add a very small value (e.g., 1E-20, i.e., 10 to the power of -20) to these cells.
You have to consult the manuals.
Some notes for SAS:
For PROC CATMOD: see the SAS links in the first part of this lesson. Note that the treatment of zeros with CATMOD also depends on which iterative algorithm is used for estimating the parameters (e.g. weighted least squares versus maximum likelihood).
In PROC GENMOD: if all possible combinations of categories of the independent variables are listed, with the count variable taking the value zero for the empty cells, then the zeros are treated as sampling zeros. If only the nonzero cells are included in the data set (that is, we delete the empty cells), the empty cells are treated as structural zeros (see the end of ZerosEx.sas).
Again, you must check the help/description sections for implemented procedures in the help manuals!