
As we consider higher-dimensional contingency tables, we are more likely to encounter sparseness. A sparse table is one where there are many cells with small counts and/or zeros. How many is 'many' and how small is 'small' are relative to

  • the sample size n
  • the table size N

We can have a small sample size, or a large sample size spread over a large number of cells. A large number of cells can result either from a large number of classification variables or from a small number of variables with many levels.

Types of zeros (or empty cells)

Mostly when we talk about sparseness we are talking about zeros, since this is the problem encountered most frequently. There are two different types of zeros, and we need to differentiate between them. We discussed the different types of zeros earlier in passing; now we will examine them in more detail.

Sampling Zeros occur when there is no observation in the cell, i.e., \(n_{ij} = 0\), but probabilistically you still have a chance of observing this value: P(observation in a cell) = \(\pi_{ij} > 0\).

For example, if you increase the sample size n, you might get \(n_{ij} > 0\). Consider sampling graduate students under 18 years old. They exist, but we didn't sample them. We can think of these values as random zeros; they could be missing at random, or there may be a more structural missingness mechanism due to the study design.
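To see how a sampling zero depends on the sample size, note that under multinomial sampling a cell with probability π is empty with probability (1 − π)^n. The following is a minimal R sketch of that calculation; the cell probability 0.01 and the sample sizes are made-up values chosen only for illustration.

```r
# Probability that a cell with true probability pi_cell is empty
# in a multinomial sample of size n: (1 - pi_cell)^n.
# pi_cell and the sample sizes below are hypothetical illustration values.
prob_sampling_zero <- function(n, pi_cell) {
  (1 - pi_cell)^n
}

pi_cell <- 0.01
n <- c(10, 50, 100, 500, 1000)
p_zero <- prob_sampling_zero(n, pi_cell)

# The chance of a sampling zero shrinks as n grows, but is never exactly 0
print(data.frame(n = n, p_zero = round(p_zero, 4)))
```

For a structural zero the same calculation gives probability 1 for every n, since π = 0.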

Structural Zeros are cells in which it is theoretically impossible to observe a value, i.e., \(n_{ij} = 0\) and \(\pi_{ij} = 0\).

Tables with structural zeros are structurally incomplete. They are also known as incomplete tables.

This is different from a partial classification where an incomplete table results from not being able to completely cross-classify all individuals or units.

Example of a structurally incomplete table

Survey of teenagers regarding their health concerns (Fienberg (1980)):

                          Gender
Health Concern          Male   Female
Sex/Reproduction           6       16
Menstrual problems         -       12
How healthy am I?         49       29
None                      77      102

The reasonable assumption here is that males will not have concerns about menstrual problems, so the chance of observing this type of concern is P(male with menstrual problems) = 0.

Example of a partial classification:

Data from a study in which elderly people were asked whether they took tranquillizers (Agresti, 1990). Some individuals were interviewed in 1979, some in 1985, and some in both 1979 and 1985.

                            1985
1979             yes      no   not sampled   total
yes              175     190           230     595
no               139    1518           982    2639
not sampled       64     596             -     659
total            378    2303          1212    3893

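In this case, some possible subjects simply were not sampled in one of the two years, and the "not sampled" by "not sampled" cell is empty because a subject who was interviewed in neither year is not part of the study.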

Recognizing and understanding different types of incompleteness has implications for statistical analysis and inference.

For example, if there are structural zeros or incomplete classification you should NOT:

  • fill in the cells with zeros
  • collapse the table until there are no zeros in the table
  • quit the analysis

We will focus more on fitting models with sampling zeros. Some good references on modeling incomplete tables are Fienberg (1980), Chapter 8, and Bishop, Holland and Fienberg (1975), Chapters 5, 6 and 8.

Effects of sampling zeros

There are two major effects on modeling contingency tables with sampling zeros:

  • Maximum likelihood estimates (MLEs) may NOT exist for loglinear/logit models
  • If MLEs exist, they may be biased

Recall that terms in a log-linear model correspond to the margins of the table. Existence of the MLEs depends on the position of the zeros in the table and on which effects are included in the model.

If all \(n_{ij} > 0\), then the MLEs of the parameters are finite.

If a table has a 0 marginal frequency, and there is a term in the model corresponding to that margin, the MLEs are infinite. More recent research, which may shed more light on the relationship between the pattern of zeros and the existence of the MLE, uses tools from algebraic statistics (Rinaldo, 2005).
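Why a zero margin forces an infinite estimate can be sketched from the likelihood equations. For a log-linear model under Poisson or multinomial sampling, the fitted counts reproduce every observed margin that corresponds to a term in the model. Taking a three-way table and a model that includes the YZ term as the illustration (the notation \(\hat{\mu}_{ijk}\) for fitted counts is an assumption of this sketch):

```latex
\hat{\mu}_{+jk} \;=\; \sum_i \hat{\mu}_{ijk} \;=\; n_{+jk}
\qquad\Longrightarrow\qquad
n_{+11} = 0 \;\Rightarrow\; \hat{\mu}_{i11} = 0 \ \text{for all } i
\;\Rightarrow\; \log \hat{\mu}_{i11} = -\infty .
```

Since \(\log \hat{\mu}_{i11}\) is a finite sum of model parameters, at least one of those parameters would have to be \(-\infty\), so no finite MLE exists.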

Hypothetical Example

Reference: Fienberg, 1980, Ch.8, ZerosEx.sas, and ZerosEx.R. There are two examples in these files; first with a non-zero margin, but a bad pattern of zeros, and the second one with a zero margin.

If there is a single sampling zero in a table, the MLEs of the cell counts are always positive.

If there are non-zero margins, but a bad pattern of zeros, some MLEs of the cell counts will be negative (although you might not observe this in a computer program)!

If there are zero margins, you cannot estimate the corresponding term in the model, and the partial odds ratios involving this cell will equal 0 or ∞. For example, if \(n_{+11} = 0\), there is no MLE for \(\lambda_{11}^{YZ}\), and any model including this term will not fit properly.

We can force \(\lambda_{11}^{YZ} = 0\), but then we must adjust the degrees of freedom of the model, and no software will do this automatically (look at the last part of ZerosEx.sas and ZerosEx.lst (or ZerosEx.R) and the homogeneous model)!

A general formula for adjusting DF

\(df = (T_e - Z_e) - (T_p - Z_p)\)

where \(T_e\) = number of cells fitted, \(Z_e\) = number of zero cells, \(T_p\) = number of parameters in the model, and \(Z_p\) = number of parameters that cannot be estimated because of zero margins.
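The adjustment is simple arithmetic, so it can be wrapped in a small helper. The sketch below is only an illustration; the function name adjusted_df is hypothetical, and the inputs in the usage line are those of a 2 × 2 × 2 table (8 cells) with 2 zero cells and a 7-parameter model in which 1 parameter cannot be estimated.

```r
# Adjusted degrees of freedom for a table with zero cells:
#   df = (Te - Ze) - (Tp - Zp)
# Te: number of cells fitted        Ze: number of zero cells
# Tp: number of model parameters    Zp: parameters not estimable (zero margins)
# (adjusted_df is a hypothetical helper name, not part of any package)
adjusted_df <- function(Te, Ze, Tp, Zp) {
  (Te - Ze) - (Tp - Zp)
}

# 8 cells, 2 zero cells, 7 parameters, 1 non-estimable parameter
print(adjusted_df(Te = 8, Ze = 2, Tp = 7, Zp = 1))
```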

Let's look at these examples. First, consider the following 2 × 2 × 2 table, shown as a partial X × Y table for each level of Z:

Z=1:

      Y=1  Y=2
X=1     0    5
X=2     0   16

Z=2:

      Y=1  Y=2
X=1     6    9
X=2     5    7

Now suppose that we want to fit the homogeneous association model (XY, XZ, YZ). Let's consider the marginal tables that correspond to the highest-order terms in the model and see if any of them have 0 counts:

XY:

      Y=1  Y=2
X=1     6   14
X=2     5   23

XZ:

      Z=1  Z=2
X=1     5   15
X=2    16   12

YZ:

      Z=1  Z=2
Y=1     0   11
Y=2    21   16
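These marginal tables can also be computed, and checked for zeros, directly from the three-way table. The following R sketch uses base R only; the object names tab, margins and has_zero are illustrative choices.

```r
# Cell counts of the 2 x 2 x 2 table, stored as tab[X, Y, Z].
# array() fills with X changing fastest, then Y, then Z.
tab <- array(c(0, 0, 5, 16,     # Z = 1: (x1,y1), (x2,y1), (x1,y2), (x2,y2)
               6, 5, 9, 7),     # Z = 2: same cell order
             dim = c(2, 2, 2),
             dimnames = list(X = c("x1", "x2"),
                             Y = c("y1", "y2"),
                             Z = c("z1", "z2")))

# Two-way margins matching the highest-order terms of (XY, XZ, YZ)
margins <- list(
  XY = apply(tab, c(1, 2), sum),
  XZ = apply(tab, c(1, 3), sum),
  YZ = apply(tab, c(2, 3), sum)
)

# Flag every margin that contains a zero count
has_zero <- sapply(margins, function(m) any(m == 0))

print(margins$YZ)
print(has_zero)
```

Only the YZ margin is flagged, which is the margin that causes trouble for any model containing the YZ term.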

Here, \(n_{+11} = 0\), so there is no MLE for \(\lambda_{11}^{YZ}\), and any model including this term will not fit properly. Notice that software does not usually detect that there is a problem, so you need to pay attention to this! See below for hints on what to look for. Sometimes you will just get "NaN" instead of parameter estimates; that is, the software is not able to estimate them (e.g., loglin() in R). For now, let's look at the SAS and R output from PROC GENMOD and glm(), respectively:

From SAS: notice the huge standard errors for some of the terms, such as Y*Z y1 z1 below!

                         Analysis Of Parameter Estimates

                                  Standard      Wald 95%          Chi-
Parameter           DF  Estimate     Error   Confidence Limits  Square  Pr > ChiSq

Intercept            1    1.9459    0.3780    1.2051    2.6867   26.51      <.0001
X          x1        1    0.2513    0.5040   -0.7364    1.2390    0.25      0.6180
X          x2        0    0.0000    0.0000    0.0000    0.0000       .           .
Y          y1        1   -0.3365    0.5855   -1.4841    0.8112    0.33      0.5655
Y          y2        0    0.0000    0.0000    0.0000    0.0000       .           .
Z          z1        1    0.8267    0.4532   -0.0615    1.7149    3.33      0.0681
Z          z2        0    0.0000    0.0000    0.0000    0.0000       .           .
X*Y        x1   y1   1   -0.0690    0.7878   -1.6131    1.4751    0.01      0.9302
X*Y        x1   y2   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Y        x2   y1   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Y        x2   y2   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Z        x1   z1   1   -1.4145    0.7187   -2.8230   -0.0059    3.87      0.0490
X*Z        x1   z2   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Z        x2   z1   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Z        x2   z2   0    0.0000    0.0000    0.0000    0.0000       .           .
Y*Z        y1   z1   1  -26.0000  115148.2   -225712  225660.3    0.00      0.9998
Y*Z        y1   z2   0    0.0000    0.0000    0.0000    0.0000       .           .
Y*Z        y2   z1   0    0.0000    0.0000    0.0000    0.0000       .           .
Y*Z        y2   z2   0    0.0000    0.0000    0.0000    0.0000       .           .
Homogeneous Association with zero margin: (XY,XZ,YZ)

The reported overall df = 1, but it really ought to be 0!

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 1 0.0000 0.0000

Here, if we apply the above formula, the df should be:

\(df = (T_e - Z_e) - (T_p - Z_p) = (8 - 2) - (7 - 1) = 6 - 6 = 0\)

However, the estimates are huge, and a model like this is useless. The best approach is to remove the cells with zero counts and refit the model; this should give the correct degrees of freedom too. In SAS, one way of doing this is:

/* delete the cells with zero counts */
data zeros1;
set zeros;
if count=0 then delete;
run;

/* homogeneous association model fitted to the non-zero cells only */
proc genmod data=zeros1 order=data;
class X Y Z;
model count = X Y Z X*Y X*Z Z*Y / link=log dist=poi obstats;
title 'Homogeneous Association with zero margin: (XY,XZ,YZ)';
run;

Now the output looks like:


                         Analysis Of Parameter Estimates

                                  Standard      Wald 95%          Chi-
Parameter           DF  Estimate     Error   Confidence Limits  Square  Pr > ChiSq

Intercept            1    2.7726    0.2500    2.2826    3.2626  123.00      <.0001
X          x1        1   -1.1632    0.5123   -2.1673   -0.1590    5.15      0.0232
X          x2        0    0.0000    0.0000    0.0000    0.0000       .           .
Y          y1        1   -0.3365    0.5855   -1.4841    0.8112    0.33      0.5655
Y          y2        0    0.0000    0.0000    0.0000    0.0000       .           .
Z          z2        1   -0.8267    0.4532   -1.7149    0.0615    3.33      0.0681
Z          z1        0    0.0000    0.0000    0.0000    0.0000       .           .
X*Y        x1   y1   1   -0.0690    0.7878   -1.6131    1.4751    0.01      0.9302
X*Y        x1   y2   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Y        x2   y1   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Y        x2   y2   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Z        x1   z2   1    1.4145    0.7187    0.0059    2.8230    3.87      0.0490
X*Z        x1   z1   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Z        x2   z2   0    0.0000    0.0000    0.0000    0.0000       .           .
X*Z        x2   z1   0    0.0000    0.0000    0.0000    0.0000       .           .
Y*Z        y1   z2   0    0.0000    0.0000    0.0000    0.0000       .           .
Y*Z        y2   z2   0    0.0000    0.0000    0.0000    0.0000       .           .
Y*Z        y2   z1   0    0.0000    0.0000    0.0000    0.0000       .           .
Scale                0    1.0000    0.0000    1.0000    1.0000
Homogeneous Association with zero margin: (XY,XZ,YZ)

Also, notice that if you fit a model that does not require the YZ margin which has zero counts, then there are no issues in fitting the model and getting correct degrees of freedom. For example, fit a log-linear model of independence -- the results will be the same if you use the original 2 × 2 × 2 table or the one where you remove the zero counts.

From R: notice the huge standard errors for some of the terms, such as Yy2:Zz2 below!

Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log))

Deviance Residuals:
         1           2           3           4           5           6           7           8 
-8.988e-06   0.000e+00   0.000e+00  -2.581e-08  -1.664e-05   0.000e+00   0.000e+00  -1.490e-08 

Coefficients:
              Estimate Std. Error z value Pr(>|z|) 
(Intercept) -2.393e+01  4.535e+04  -0.001    1.000 
Xx2          1.232e+00  9.397e-01   1.311    0.190 
Yy2          2.554e+01  4.535e+04   0.001    1.000 
Zz2          2.572e+01  4.535e+04   0.001    1.000 
Xx2:Yy2     -6.899e-02  7.878e-01  -0.088    0.930 
Xx2:Zz2     -1.414e+00  7.187e-01  -1.968    0.049 *
Yy2:Zz2     -2.514e+01  4.535e+04  -0.001    1.000 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3.7197e+01  on 7  degrees of freedom
Residual deviance: 3.5779e-10  on 1  degrees of freedom
AIC: 37.101

The reported df = 1, but it really ought to be 0! Here, if we apply the above formula, the df should be:

\(df = (T_e - Z_e) - (T_p - Z_p) = (8 - 2) - (7 - 1) = 6 - 6 = 0\)

However, the estimates are huge, and a model like this is useless. The best approach is to remove the cells with zero counts and refit the model; this should give the correct degrees of freedom too. In R, one way of doing this is:

count=c(0,6,5,9,0,5,16,7)
### factors in the same cell order as count (Z changes fastest, then Y, then X)
X=factor(rep(c("x1","x2"),each=4))
Y=factor(rep(rep(c("y1","y2"),each=2),times=2))
Z=factor(rep(c("z1","z2"),times=4))

### index the non-zero cells once, before count itself is overwritten,
### so that count, X, Y and Z are all subset by the same rows
keep=which(count!=0)
count=count[keep]
X=X[keep]
Y=Y[keep]
Z=Z[keep]

### Homogeneous Association with zero margin: (XY,XZ,YZ)

model=glm(count~X+Y+Z+X*Y+X*Z+Z*Y,family=poisson(link=log))
summary(model)

And the output looks like:

Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log))

Deviance Residuals:
[1]  0  0  0  0  0  0

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.20397    0.69121   1.742   0.0815 .
Xx2          1.23214    0.93975   1.311   0.1898  
Yy2          0.40547    0.52705   0.769   0.4417  
Zz2          0.58779    0.55777   1.054   0.2920  
Xx2:Yy2     -0.06899    0.78780  -0.088   0.9302  
Xx2:Zz2     -1.41447    0.71866  -1.968   0.0490 *
Yy2:Zz2           NA         NA      NA       NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 9.5791e+00  on 5  degrees of freedom
Residual deviance: 1.3323e-15  on 0  degrees of freedom
AIC: 35.101


Detecting a problem

The iterative algorithm for computing the MLEs of the model parameters may not converge, and the software output will typically give you a warning.

For example, in the .lst SAS file for PROC GENMOD under Criteria For Assessing Goodness Of Fit:

SAS output

and, for example, in the .log SAS file for PROC GENMOD:

SAS output

Look at the fit of the saturated model in ZerosEx.lst and ZerosEx.log

Here is another way to detect that there is an error...

The estimated standard errors of the parameters that involve the zero cells, and of the corresponding fitted counts, are huge in comparison to the rest of the estimates.

Notice the huge standard error, 139605.2, for \(\lambda_{111}^{XYZ}\) and for the corresponding fitted count.

SAS output
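The same check can be done programmatically. The sketch below refits the homogeneous association model (not the saturated model shown in the SAS listings) to the hypothetical 2 × 2 × 2 table in R and lists the coefficients whose standard errors exceed a cutoff. The cutoff of 10 on the log scale is an arbitrary value assumed for illustration, not a recommended rule.

```r
# Hypothetical 2 x 2 x 2 example, cells ordered with Z changing fastest
count <- c(0, 6, 5, 9, 0, 5, 16, 7)
X <- factor(rep(c("x1", "x2"), each = 4))
Y <- factor(rep(rep(c("y1", "y2"), each = 2), times = 2))
Z <- factor(rep(c("z1", "z2"), times = 4))

# Homogeneous association model (XY, XZ, YZ); glm() may warn about
# fitted rates that are numerically 0, so the warnings are suppressed here
fit <- suppressWarnings(
  glm(count ~ X * Y + X * Z + Y * Z, family = poisson(link = "log"))
)

# Standard errors of the estimated coefficients
se <- summary(fit)$coefficients[, "Std. Error"]

# Flag coefficients with implausibly large standard errors
# (threshold is an assumed, arbitrary cutoff)
threshold <- 10
flagged <- names(se)[se > threshold]
print(flagged)
```

Any flagged term is a hint that the corresponding margin should be inspected for zeros before the estimates are interpreted.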


The bias problem due to sparseness

  • The odds ratio estimates can be severely biased
  • The sampling distribution of fit statistics may be poorly approximated by the chi-squared distribution (this is why you are getting the huge standard errors).

An ad hoc solution: add .5 to each cell in the table

Adding .5 shrinks the estimated odds ratios that are ∞ to finite values and increases estimates that are 0. However, for unsaturated models, adding .5 will over-smooth the data.
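The effect of adding .5 on a sample odds ratio can be checked by hand. The R sketch below uses the YZ marginal table of the hypothetical example; the function name odds_ratio is an illustrative choice.

```r
# Sample odds ratio of a 2 x 2 table, optionally adding a constant to each cell
# (odds_ratio is a hypothetical helper written for this illustration)
odds_ratio <- function(tab, add = 0) {
  tab <- tab + add
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

# YZ marginal table of the hypothetical example (rows Y, columns Z);
# matrix() fills column by column
yz <- matrix(c(0, 21, 11, 16), nrow = 2,
             dimnames = list(Y = c("y1", "y2"), Z = c("z1", "z2")))

# Swapping the columns turns the odds ratio of 0 into its reciprocal, Inf
zy <- yz[, 2:1]

print(c(raw = odds_ratio(yz), smoothed = odds_ratio(yz, add = 0.5)))
print(c(raw = odds_ratio(zy), smoothed = odds_ratio(zy, add = 0.5)))
```

In both orientations the smoothed estimate is finite and non-zero, which is exactly the shrinkage described above.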

Be cautious! Keep in mind that an infinite estimate of a model parameter may be OK, but an infinite estimate of a true odds ratio is unsatisfactory.

Suggested Solutions:

  • When a model does not converge, try adding a tiny number, such as 0.000000001, to all of the zero cells. Some argue you should add this value to all of the cells in the table. Essentially, adding very small values is what software programs like SAS do when they fit these types of models.
  • Extend this with a sensitivity analysis: add numbers of varying sizes to the cells, and examine the fit statistics and parameter estimates to see if they change very much.
  • Try alternative modeling procedures and techniques, such as random effects models, Bayesian methods, etc.
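The sensitivity analysis suggested above might be sketched in R as follows, using the hypothetical 2 × 2 × 2 table and the homogeneous association model. The constants 0.5, 0.1 and 0.01 and the helper name fit_with_constant are assumptions made for illustration; the constant is added to every cell here, which is only one of the variants mentioned in the list.

```r
# Hypothetical 2 x 2 x 2 example, cells ordered with Z changing fastest
count <- c(0, 6, 5, 9, 0, 5, 16, 7)
X <- factor(rep(c("x1", "x2"), each = 4))
Y <- factor(rep(rep(c("y1", "y2"), each = 2), times = 2))
Z <- factor(rep(c("z1", "z2"), times = 4))

# Refit (XY, XZ, YZ) after adding a constant eps to every cell and
# return the deviance and two of the interaction estimates
fit_with_constant <- function(eps) {
  y <- count + eps
  # non-integer counts make the Poisson family emit warnings; the
  # estimates are still computed, so the warnings are suppressed
  fit <- suppressWarnings(
    glm(y ~ X * Y + X * Z + Y * Z, family = poisson(link = "log"))
  )
  c(eps = eps,
    deviance = deviance(fit),
    YZ = unname(coef(fit)["Yy2:Zz2"]),
    XZ = unname(coef(fit)["Xx2:Zz2"]))
}

eps_values <- c(0.5, 0.1, 0.01)   # assumed values for the sensitivity check
sensitivity <- t(sapply(eps_values, fit_with_constant))
print(round(sensitivity, 4))
```

If an estimate such as the Y:Z interaction moves a lot as the constant changes while the others stay put, that estimate is being driven by the added constant rather than by the data.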

Software solutions

The devil is in the details! When you have special situations, you have to pay special attention. The software solutions will differ depending on whether the called procedure treats the zero as a sampling or a structural zero. This can even differ depending on the version of the software that is used! For a sampling zero, the most common solution is to add a very small value (e.g., 1E-20, that is, 10 to the power of -20) to these cells.

You have to consult the manuals.

Some notes for SAS:

For PROC CATMOD: see the SAS links in the first part of this lesson. Note that the treatment of zeros with CATMOD also depends on which iterative algorithm is used for estimating the parameters (e.g. weighted least squares versus maximum likelihood).

In PROC GENMOD: If all possible combinations of categories of independent variables are listed, with the count variable taking value zero for the empty cells, then the zeros will be treated as sampling zeros. If only non-zero cells are included in the data set (that is we delete the empty cells from the data set), then the empty cells are treated as structural zeros (see the end of ZerosEx.sas).
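The same distinction can be illustrated in R with glm(), used here as a stand-in for PROC GENMOD (an assumption of this sketch, since the note above is specific to SAS). Keeping the zero cells as rows treats them as sampling zeros; deleting the rows treats them as structural zeros, and the residual degrees of freedom change accordingly.

```r
# Hypothetical 2 x 2 x 2 example, one row per cell (Z changes fastest)
cells <- data.frame(
  X = factor(rep(c("x1", "x2"), each = 4)),
  Y = factor(rep(rep(c("y1", "y2"), each = 2), times = 2)),
  Z = factor(rep(c("z1", "z2"), times = 4)),
  count = c(0, 6, 5, 9, 0, 5, 16, 7)
)

# Homogeneous association model (XY, XZ, YZ) fitted to a given set of cells
fit_model <- function(d) {
  suppressWarnings(
    glm(count ~ X * Y + X * Z + Y * Z,
        family = poisson(link = "log"), data = d)
  )
}

fit_sampling   <- fit_model(cells)                     # zero cells kept
fit_structural <- fit_model(cells[cells$count > 0, ])  # zero cells deleted

print(c(sampling = df.residual(fit_sampling),
        structural = df.residual(fit_structural)))
```

With the zero cells kept, the residual df is 1; with them deleted, it is 0, which agrees with the adjusted df formula.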

Again, you must check the help/description sections for implemented procedures in the help manuals!