As we consider higherdimensional contingency tables, we are more likely to encounter sparseness. A sparse table is one where there are many cells with small counts and/or zeros. How many is 'many' and how small is 'small' are relative to
 the sample size \(n\)
 the table size \(N\)
We can have a small sample size, or a large sample size with a large number of cells. The large number of cells can be either due to the large number of classification variables, or small number of variables but with lots of levels.
Types of zeros (or empty cells) Section
Usually, when we talk about sparseness we are talking about zeros. This is a problem that is frequently encountered. There are different types of zeros, and we need to differentiate between the two.
Sampling Zeros occur when there is no observation in a cell, but there is a positive probability of observing an observation. That is, \(n_{ij} = 0\), but \(\pi_{ij} > 0\), for some \(i\) and \(j\). And increasing the sample size will eventually lead to an observation there.
Structural Zeros are are theoretically impossible to observe. That is, both \(n_{ij} = 0\), and \(\pi_{ij} = 0\). Tables with structural zeros are structurally incomplete. They are also known as incomplete tables. This is different from a partial classification where an incomplete table results from not being able to completely crossclassify all individuals or units.
Example of a structurally incomplete table Section
Sample of adults on whether they got an infection once or twice:
Infection

Reinfection


Yes  No  
Yes  87 
16

No    116 
Total  87  132 
The explanation here is that one cannot get an infection the second time until one gets it the first time. Thus, \(n_{21}\) is a structural zero.
Hypothetical Example Section
If there is a single sampling zero in a table, the MLE estimates are always positive. If there are nonzero margins, but a bad pattern of zeros, some MLE estimates of the cells counts can be negative.
If there are zero margins, we cannot estimate the corresponding term in the model, and the partial odds ratios involving this cell will equal 0 or 1. For example, if \(n_{+11} = 0\), there is no MLE estimate for \(\lambda_{11}^{YZ} = 0\), and any model including this term will not fit properly. We can force \(\lambda_{11}^{YZ} = 0\) but then must adjust for the degrees of freedom in the model, and no software will do this automatically.
A general formula for adjusting the degrees of freedom:
\(df = \left(T_e  Z_e\right)  \left(T_p  Z_p\right)\)
where \(T_e \) is the number of cells fitted, \(Z_e\) is the number of zero cells, \(T_p\) is the number of parameters in the model, and \(Z_p \) is the number of parameters that cannot be estimated because of zero margins.
Let's see how this works in the example below. The data, also available in Zeros.sas and Zeros.R, are from Fienberg (1980) and contain a marginal total if we sum over X:
Z=1:
Y=1  Y=2  
X=1  0  5 
X=2  0  16 
Z=2
Y=1  Y=2  
X=1  6  9 
X=2  5  7 
The three marginal tables are
XY:
Y=1  Y=2  
X=1  6  14 
X=2  5  23 
XZ:
Z=1  Z=2  
X=1  5  15 
X=2  16  12 
YZ:
Z=1  Z=2  
Y=1  0  11 
Y=2  21  16 
Here, \(n_{+11} = 0\), so there is no MLE estimate for \(\lambda_{11}^{YZ} = 0\), and any model including this term will not fit properly. Let's take a look at these examples in R and SAS.
The following code fits the model of homogeneous association (XY, XZ, YZ), which allows all conditional pairwise associations to be unique but assumes they do not depend on the level of the third variable conditioned on.
data zero_margin;
input X $ Y $ Z $ count @@;
datalines;
x1 y1 z1 0 x1 y1 z2 6 x1 y2 z1 5 x1 y2 z2 9
x2 y1 z1 0 x2 y1 z2 5 x2 y2 z1 16 x2 y2 z2 7
;
/* homogeneous associations */
proc genmod data=zero_margin order=data ;
class X Y Z;
model count = X Y Z X*Y X*Z Z*Y /link=log dist=poi;
title 'Homogeneous Association with zero margin';
run;
Notice the large standard errors for some of the terms such as Y*Z, y1z1 below.
Homogeneous Association with zero margin 8
The GENMOD Procedure
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Wald
Parameter DF Estimate Error Confidence Limits ChiSquare Pr > ChiSq
Intercept 1 1.9459 0.3780 1.2051 2.6867 26.51 <.0001
X x1 1 0.2513 0.5040 0.7364 1.2390 0.25 0.6180
X x2 0 0.0000 0.0000 0.0000 0.0000 . .
Y y1 1 0.3365 0.5855 1.4841 0.8112 0.33 0.5655
Y y2 0 0.0000 0.0000 0.0000 0.0000 . .
Z z1 1 0.8267 0.4532 0.0615 1.7149 3.33 0.0681
Z z2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x1 y1 1 0.0690 0.7878 1.6131 1.4751 0.01 0.9302
X*Y x1 y2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x2 y1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Y x2 y2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x1 z1 1 1.4145 0.7187 2.8230 0.0059 3.87 0.0490
X*Z x1 z2 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x2 z1 0 0.0000 0.0000 0.0000 0.0000 . .
X*Z x2 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y1 z1 1 25.5664 92704.17 181722 181671.3 0.00 0.9998
Y*Z y1 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y2 z1 0 0.0000 0.0000 0.0000 0.0000 . .
Y*Z y2 z2 0 0.0000 0.0000 0.0000 0.0000 . .
Scale 0 1.0000 0.0000 1.0000 1.0000
As expected, the overall the deviance df is reported to be 1 since exactly one term is omitted from this model: the threeway interaction. But with the zero counts, there is no room for improved fit, and we should have 0 degrees of freedom. If we apply the above formula, we get
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 1 0.0000 0.0000
Scaled Deviance 1 0.0000 0.0000
Pearson ChiSquare 1 0.0000 0.0000
Scaled Pearson X2 1 0.0000 0.0000
\(df = \left(T_e  Z_e\right)  \left(T_p  Z_p\right) = \left(8  2\right)  \left(7  1\right) = 6  6 = 0\)
As it is, the estimates and the model overall are not very useful. If however we remove the cells with zero counts and refit the model, we should get the correct degrees of freedom. Thus, we have
/*delete zero values*/
data zero_del;
set zero_margin;
if count=0 then delete;
run;
/* homogeneous associations */
proc genmod data=zero_del order=data ;
class X Y Z;
model count = X Y Z X*Y X*Z Z*Y /link=log dist=poi;
title 'Homogeneous Association with zero margin';
run;
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 0 0.0000 .
Scaled Deviance 0 0.0000 .
Pearson ChiSquare . 0.0000 .
Scaled Pearson X2 . 0.0000 .
Log Likelihood 56.6027
Full Log Likelihood 11.5503
AIC (smaller is better) 35.1007
AICC (smaller is better) .
BIC (smaller is better) 33.8512
Also, notice that if we fit a model that does not require the YZ margin, such as the model of complete independence (X, Y, Z), then \(Z_p=0\) since all parameters in the model can be estimated. For this example, the df would be
\((8  2)  (4  0) = 2\)
/* complete independence */
proc genmod data=zero_del order=data;
class X Y Z;
model count = X Y Z /link=log dist=poi;
title 'Complete independence with zeros deleted';
run;
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 2 5.0616 2.5308
Scaled Deviance 2 5.0616 2.5308
Pearson ChiSquare 2 4.9058 2.4529
Scaled Pearson X2 2 4.9058 2.4529
Log Likelihood 54.0720
Full Log Likelihood 14.0811
AIC (smaller is better) 36.1622
AICC (smaller is better) 76.1622
BIC (smaller is better) 35.3293
The following code fits the model of homogeneous association (XY, XZ, YZ), which allows all conditional pairwise associations to be unique but assumes they do not depend on the level of the third variable conditioned on.
# Zero Margin
X=factor(c(rep("x1",4),rep("x2",4)))
Y=factor(c(rep(c("y1","y1","y2","y2"),2)))
Z=factor(c(rep(c("z1","z2"),4)))
count=c(0,6,5,9,0,5,16,7)
table=xtabs(count~X+Y+Z)
# Homogeneous Association (XY,XZ,YZ) with zero margin
model=glm(count~X+Y+Z+X*Y+X*Z+Z*Y,family=poisson(link=log))
Notice the large standard errors for several of the estimates below.
> summary(model)
Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log))
Deviance Residuals:
1 2 3 4 5 6
8.988e06 0.000e+00 0.000e+00 0.000e+00 1.664e05 0.000e+00
7 8
5.960e08 1.490e08
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 2.393e+01 4.535e+04 0.001 1.000
Xx2 1.232e+00 9.397e01 1.311 0.190
Yy2 2.554e+01 4.535e+04 0.001 1.000
Zz2 2.572e+01 4.535e+04 0.001 1.000
Xx2:Yy2 6.899e02 7.878e01 0.088 0.930
Xx2:Zz2 1.414e+00 7.187e01 1.968 0.049 *
Yy2:Zz2 2.514e+01 4.535e+04 0.001 1.000

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3.7197e+01 on 7 degrees of freedom
Residual deviance: 3.5779e10 on 1 degrees of freedom
AIC: 37.101
As expected, the overall the deviance df is reported to be 1 since exactly one term is omitted from this model: the threeway interaction. But with the zero counts, there is no room for improved fit, and we should have 0 degrees of freedom. If we apply the above formula, we get
\(df=(T_e−Z_e)−(T_p−Z_p)=(8−2)−(7−1)=6−6=0\)
As it is, the estimates and the model overall are not very useful. If however we remove the cells with zero counts and refit the model, we should get the correct degrees of freedom. Thus, we have
# Delete Zero Values, "subset" used to remove 0s
# Homogeneous Association with zero margin: (XY,XZ,YZ)
model=glm(count~X+Y+Z+X*Y+X*Z+Z*Y,family=poisson(link=log), subset=count>0)
And the output looks like:
> summary(model)
Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log),
subset = count > 0)
Deviance Residuals:
[1] 0 0 0 0 0 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>z)
(Intercept) 1.20397 0.69121 1.742 0.0815 .
Xx2 1.23214 0.93975 1.311 0.1898
Yy2 0.40547 0.52705 0.769 0.4417
Zz2 0.58779 0.55777 1.054 0.2920
Xx2:Yy2 0.06899 0.78780 0.088 0.9302
Xx2:Zz2 1.41447 0.71866 1.968 0.0490 *
Yy2:Zz2 NA NA NA NA

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 9.5791e+00 on 5 degrees of freedom
Residual deviance: 5.3291e15 on 0 degrees of freedom
AIC: 35.101
Also, notice that if we fit a model that does not require the YZ margin, such as the model of complete independence (X, Y, Z), then \(Z_p=0\) since all parameters in the model can be estimated. For this example, the df would be
\((8  2)  (4  0) = 2\)
# Delete Zero Values, "subset" used to remove 0s
# Complete Independence with zero margin: (X,Y,Z)
model=glm(count~X+Y+Z,family=poisson(link=log), subset=count>0)
> summary(model)
Call:
glm(formula = count ~ X + Y + Z, family = poisson(link = log),
subset = count > 0)
Deviance Residuals:
2 3 4 6 7 8
0.6314 1.3798 0.8575 0.5820 1.0228 0.7994
Coefficients:
Estimate Std. Error z value Pr(>z)
(Intercept) 1.7944 0.4798 3.740 0.000184 ***
Xx2 0.3365 0.2928 1.149 0.250443
Yy2 0.3747 0.3917 0.957 0.338746
Zz2 0.2719 0.3318 0.819 0.412519

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 9.5791 on 5 degrees of freedom
Residual deviance: 5.0616 on 2 degrees of freedom
AIC: 36.162
Effects of sampling zeros Section
There are two major effects on modeling contingency tables with sampling zeros:
 Maximum likelihood estimates (MLEs) may not exist for loglinear/logit models
 If MLE estimates exist, they may be biased
Recall that terms in a loglinear model correspond to the margins of the tables. Existence of MLE estimates depends on the positions of zeros in the table and what effects are included in a model.
If all \(n_{ij} > 0\) than MLE estimates of parameters are finite. If a table has a 0 marginal frequency, and there is a term in the model corresponding to that margin, the MLE estimates are infinite.
The bias problem due to sparseness Section
As mentioned above, the ML estimates of odds ratio can be severely biased. Also, the sampling distributions of fit statistics may be poorly approximated by the chisquared distribution, which is why the standard errors are so large.
One suggested ad hoc solution is to add .5 to each cell in the table. Doing so shrinks the estimated odds ratios that are 1 to finite values and increases estimates that would otherwise be 0.
Other suggested solutions:
 When a model does not converge, try adding a tiny number to all zero cells in the table, such as 0.000000001. Some argue you should add this value to all of the cells. Essentially, adding very small values is what the software programs like SAS and R are doing when they fit these types of models.
 Extend this by doing a sensitivity analysis by adding different numbers of varying sizes to the cells. Examine fit statistics and parameter estimates to see if they change very much.
Effect of Sparseness on \(X^2\) and \(G^2\) Section
Unfortunately there is no single rule that covers all situations. We have to be careful about application of these situationspecific rules.
So while the \(G^2\) statistic for model fit may not be well approximated by the chisquared distribution, the difference between \(G^2\)s for two nested models will be closer to chisquared than the \(G^2\) for either model versus the saturated one. Chisquared comparison tests depend more on the size of marginals than on cell sizes in the joint table. So if margins have cells exceeding 5, the chisquared approximation of \(G^{2}\left(\mbox{reduced model}\right)  G^{2}\left(\mbox{full model}\right)\) should be reasonable.
 When \(df > 1\), it is OK to have the \(\hat{\mu}\) as small as 1 as long as fewer than 20% of the cells have \(\hat{\mu} < 5\)
 The permissible size of \(\hat{\mu}\) decreases as the size of the table \(N\) increases.
 The chisquared distribution of \(X^{2}\) and \(G^{2}\) can be poor for sparse tables with both very small and very large \({\hat{\mu}}'s\) (relative to \(n/N\)).
 \(G^2\) is usually poorly approximated by the chisquared distribution when \(\dfrac{n}{N} < 5\).The pvalues for \(G^2\) may be too large or too small (it depends on n/N).
 For fixed \(n\) and \(N\), chisquared approximations are better for smaller df than for larger df.
12.5  Summary Section
In this lesson, we saw an alternative way to estimate parameters and their standard errors for a GLM that doesn't rely on full specification of the likelihood function. This "generalized estimating equation" approach needs only the marginal distribution of the responses, along with a correlation structure to describe pairwise associations between observations within the same subject.
Additionally, we saw different ways in which 0s can appear in categorical data, depending on whether an observation could have been observed but wasn't or whether an observation could not have been observed at all due to its nature.