12.4 - Inference for Log-linear Models: Sparse Data

As we consider higher-dimensional contingency tables, we are more likely to encounter sparseness. A sparse table is one where there are many cells with small counts and/or zeros. How many is 'many' and how small is 'small' are relative to

  • the sample size \(n\)
  • the table size \(N\) (the number of cells)

We can have a small sample size, or a large sample size with a large number of cells. A large number of cells can be due either to a large number of classification variables or to a small number of variables with many levels each.

Types of zeros (or empty cells)

Usually, when we talk about sparseness, we are talking about zeros, a frequently encountered problem. There are two types of zeros, and we need to differentiate between them.

Sampling Zeros occur when there is no observation in a cell, but there is a positive probability of observing one. That is, \(n_{ij} = 0\) but \(\pi_{ij} > 0\), for some \(i\) and \(j\); increasing the sample size will eventually lead to an observation in that cell.

Structural Zeros are theoretically impossible to observe. That is, both \(n_{ij} = 0\) and \(\pi_{ij} = 0\). Tables with structural zeros are structurally incomplete; they are also known as incomplete tables. This is different from a partial classification, where an incomplete table results from not being able to completely cross-classify all individuals or units.

Example of a structurally incomplete table

Sample of adults on whether they got an infection once or twice:

 
            Reinfection
Infection    Yes     No
Yes           87     16
No             -    116
Total         87    132

The explanation here is that one cannot get an infection the second time until one gets it the first time. Thus, \(n_{21}\) is a structural zero.

Hypothetical Example

If there is a single sampling zero in a table, the MLEs of the cell counts are always positive. If the margins are non-zero but the pattern of zeros is bad, some MLEs of the cell counts can be negative.

If there are zero margins, we cannot estimate the corresponding term in the model, and the estimated partial odds ratios involving this margin will equal 0 or \(\infty\). For example, if \(n_{+11} = 0\), there is no finite MLE for \(\lambda_{11}^{YZ}\), and any model including this term will not fit properly. We can force \(\lambda_{11}^{YZ} = 0\), but we must then adjust the degrees of freedom of the model, and no software will do this automatically.

A general formula for adjusting the degrees of freedom:

\(df = \left(T_e - Z_e\right) - \left(T_p - Z_p\right)\)

where \(T_e \) is the number of cells fitted, \(Z_e\) is the number of zero cells, \(T_p\) is the number of parameters in the model, and \(Z_p \) is the number of parameters that cannot be estimated because of zero margins.
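
To make the bookkeeping concrete, here is a minimal R sketch of this formula (the helper name adj_df is ours, not from the source); the values in the call match the worked example later in this section.

# Adjusted df for a table with zero cells and/or zero margins:
# T_e = cells fitted, Z_e = zero cells,
# T_p = parameters in the model, Z_p = inestimable parameters
adj_df <- function(T_e, Z_e, T_p, Z_p) (T_e - Z_e) - (T_p - Z_p)
adj_df(T_e = 8, Z_e = 2, T_p = 7, Z_p = 1)  # returns 0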

Let's see how this works in the example below. The data, also available in Zeros.sas and Zeros.R, are from Fienberg (1980) and contain a zero marginal total when we sum over X:

Z=1:

  Y=1 Y=2
X=1 0 5
X=2 0 16

Z=2:

  Y=1 Y=2
X=1 6 9
X=2 5 7

The three two-way marginal tables are:

XY:

  Y=1 Y=2
X=1 6 14
X=2 5 23

XZ:

  Z=1 Z=2
X=1 5 15
X=2 16 12

YZ:

  Z=1 Z=2
Y=1 0 11
Y=2 21 16

Here, \(n_{+11} = 0\), so there is no finite MLE for \(\lambda_{11}^{YZ}\), and any model including this term will not fit properly. Let's take a look at these examples in R and SAS.
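
To verify the zero YZ margin directly in R, here is a quick self-contained sketch (the same counts are entered again in the modeling code further below):

# Build the 2x2x2 table and inspect the YZ margin
X <- factor(rep(c("x1", "x2"), each = 4))
Y <- factor(rep(rep(c("y1", "y2"), each = 2), 2))
Z <- factor(rep(c("z1", "z2"), 4))
count <- c(0, 6, 5, 9, 0, 5, 16, 7)
tab <- xtabs(count ~ X + Y + Z)
margin.table(tab, c("Y", "Z"))  # the (y1, z1) cell of the YZ margin is 0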

The following SAS code fits the model of homogeneous association (XY, XZ, YZ), which allows all conditional pairwise associations to be unique but assumes they do not depend on the level of the third variable being conditioned on.

data zero_margin;
input X $ Y $ Z $ count @@;
datalines;
x1 y1 z1 0 x1 y1 z2 6 x1 y2 z1 5 x1 y2 z2 9
x2 y1 z1 0 x2 y1 z2 5 x2 y2 z1 16 x2 y2 z2 7
;
/* homogeneous associations */ 
proc genmod data=zero_margin order=data ;
class X Y Z;
model count = X Y Z  X*Y X*Z Z*Y /link=log dist=poi;
title 'Homogeneous Association with zero margin';
run;

Notice the large standard errors for some of the terms, such as the Y*Z (y1, z1) term below.

Homogeneous Association with zero margin

The GENMOD Procedure

                 Analysis Of Maximum Likelihood Parameter Estimates

                                 Standard       Wald 95%             Wald
Parameter          DF  Estimate     Error   Confidence Limits  Chi-Square  Pr > ChiSq
Intercept           1    1.9459    0.3780    1.2051    2.6867       26.51      <.0001
X          x1       1    0.2513    0.5040   -0.7364    1.2390        0.25      0.6180
X          x2       0    0.0000    0.0000    0.0000    0.0000         .         .
Y          y1       1   -0.3365    0.5855   -1.4841    0.8112        0.33      0.5655
Y          y2       0    0.0000    0.0000    0.0000    0.0000         .         .
Z          z1       1    0.8267    0.4532   -0.0615    1.7149        3.33      0.0681
Z          z2       0    0.0000    0.0000    0.0000    0.0000         .         .
X*Y        x1  y1   1   -0.0690    0.7878   -1.6131    1.4751        0.01      0.9302
X*Y        x1  y2   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Y        x2  y1   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Y        x2  y2   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Z        x1  z1   1   -1.4145    0.7187   -2.8230   -0.0059        3.87      0.0490
X*Z        x1  z2   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Z        x2  z1   0    0.0000    0.0000    0.0000    0.0000         .         .
X*Z        x2  z2   0    0.0000    0.0000    0.0000    0.0000         .         .
Y*Z        y1  z1   1  -25.5664  92704.17   -181722  181671.3        0.00      0.9998
Y*Z        y1  z2   0    0.0000    0.0000    0.0000    0.0000         .         .
Y*Z        y2  z1   0    0.0000    0.0000    0.0000    0.0000         .         .
Y*Z        y2  z2   0    0.0000    0.0000    0.0000    0.0000         .         .
Scale               0    1.0000    0.0000    1.0000    1.0000

             Criteria For Assessing Goodness Of Fit

Criterion                     DF           Value        Value/DF
Deviance                       1          0.0000          0.0000
Scaled Deviance                1          0.0000          0.0000
Pearson Chi-Square             1          0.0000          0.0000
Scaled Pearson X2              1          0.0000          0.0000

As expected, the overall deviance df is reported to be 1, since exactly one term is omitted from this model: the three-way interaction. But with the zero counts, there is no room for improved fit, and we should have 0 degrees of freedom. If we apply the above formula, we get

\(df = \left(T_e - Z_e\right) - \left(T_p - Z_p\right) = \left(8 - 2\right) - \left(7 - 1\right) = 6 - 6 = 0\)

As it is, the estimates and the model overall are not very useful. If, however, we remove the cells with zero counts and refit the model, we should get the correct degrees of freedom. Thus, we have

/*delete zero values*/
data zero_del;
set zero_margin;
if count=0 then delete;
run;
/* homogeneous associations */ 
proc genmod data=zero_del order=data ;
class X Y Z;
model count = X Y Z  X*Y X*Z Z*Y /link=log dist=poi;
title 'Homogeneous Association with zeros deleted';
run;

             Criteria For Assessing Goodness Of Fit

Criterion                     DF           Value        Value/DF
Deviance                       0          0.0000           .
Scaled Deviance                0          0.0000           .
Pearson Chi-Square             .          0.0000           .
Scaled Pearson X2              .          0.0000           .
Log Likelihood                           56.6027
Full Log Likelihood                     -11.5503
AIC (smaller is better)                  35.1007
AICC (smaller is better)                   .
BIC (smaller is better)                  33.8512

Also, notice that if we fit a model that does not require the YZ margin, such as the model of complete independence (X, Y, Z), then \(Z_p=0\) since all parameters in the model can be estimated. For this example, the df would be

\((8 - 2) - (4 - 0) = 2\)

/* complete independence */
proc genmod data=zero_del order=data;
class X Y Z;
model count = X Y Z /link=log dist=poi;
title 'Complete independence with zeros deleted';
run;

             Criteria For Assessing Goodness Of Fit

Criterion                     DF           Value        Value/DF
Deviance                       2          5.0616          2.5308
Scaled Deviance                2          5.0616          2.5308
Pearson Chi-Square             2          4.9058          2.4529
Scaled Pearson X2              2          4.9058          2.4529
Log Likelihood                           54.0720
Full Log Likelihood                     -14.0811
AIC (smaller is better)                  36.1622
AICC (smaller is better)                 76.1622
BIC (smaller is better)                  35.3293

The following R code fits the same model of homogeneous association (XY, XZ, YZ), which allows all conditional pairwise associations to be unique but assumes they do not depend on the level of the third variable being conditioned on.

# Zero Margin
X <- factor(c(rep("x1", 4), rep("x2", 4)))
Y <- factor(rep(c("y1", "y1", "y2", "y2"), 2))
Z <- factor(rep(c("z1", "z2"), 4))
count <- c(0, 6, 5, 9, 0, 5, 16, 7)
tab <- xtabs(count ~ X + Y + Z)  # 2x2x2 table ("tab" avoids masking base::table)
# Homogeneous Association (XY, XZ, YZ) with zero margin
model <- glm(count ~ X + Y + Z + X*Y + X*Z + Z*Y, family = poisson(link = log))

Notice the large standard errors for several of the estimates below.

> summary(model)

Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log))

Deviance Residuals: 
         1           2           3           4           5           6  
-8.988e-06   0.000e+00   0.000e+00   0.000e+00  -1.664e-05   0.000e+00  
         7           8  
 5.960e-08  -1.490e-08  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -2.393e+01  4.535e+04  -0.001    1.000  
Xx2          1.232e+00  9.397e-01   1.311    0.190  
Yy2          2.554e+01  4.535e+04   0.001    1.000  
Zz2          2.572e+01  4.535e+04   0.001    1.000  
Xx2:Yy2     -6.899e-02  7.878e-01  -0.088    0.930  
Xx2:Zz2     -1.414e+00  7.187e-01  -1.968    0.049 *
Yy2:Zz2     -2.514e+01  4.535e+04  -0.001    1.000  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3.7197e+01  on 7  degrees of freedom
Residual deviance: 3.5779e-10  on 1  degrees of freedom
AIC: 37.101

As expected, the overall deviance df is reported to be 1, since exactly one term is omitted from this model: the three-way interaction. But with the zero counts, there is no room for improved fit, and we should have 0 degrees of freedom. If we apply the above formula, we get

\(df = \left(T_e - Z_e\right) - \left(T_p - Z_p\right) = \left(8 - 2\right) - \left(7 - 1\right) = 6 - 6 = 0\)

As it is, the estimates and the model overall are not very useful. If, however, we remove the cells with zero counts and refit the model, we should get the correct degrees of freedom. Thus, we have

# Delete zero cells: "subset" removes rows with count = 0
# Homogeneous Association (XY, XZ, YZ) with zeros deleted
model <- glm(count ~ X + Y + Z + X*Y + X*Z + Z*Y,
             family = poisson(link = log), subset = count > 0)

And the output looks like:

> summary(model)

Call:
glm(formula = count ~ X + Y + Z + X * Y + X * Z + Z * Y, family = poisson(link = log), 
    subset = count > 0)

Deviance Residuals: 
[1]  0  0  0  0  0  0

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.20397    0.69121   1.742   0.0815 .
Xx2          1.23214    0.93975   1.311   0.1898  
Yy2          0.40547    0.52705   0.769   0.4417  
Zz2          0.58779    0.55777   1.054   0.2920  
Xx2:Yy2     -0.06899    0.78780  -0.088   0.9302  
Xx2:Zz2     -1.41447    0.71866  -1.968   0.0490 *
Yy2:Zz2           NA         NA      NA       NA  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 9.5791e+00  on 5  degrees of freedom
Residual deviance: 5.3291e-15  on 0  degrees of freedom
AIC: 35.101

Also, notice that if we fit a model that does not require the YZ margin, such as the model of complete independence (X, Y, Z), then \(Z_p=0\) since all parameters in the model can be estimated. For this example, the df would be

\((8 - 2) - (4 - 0) = 2\)

# Delete zero cells: "subset" removes rows with count = 0
# Complete Independence (X, Y, Z) with zeros deleted
model <- glm(count ~ X + Y + Z, family = poisson(link = log), subset = count > 0)

 

> summary(model)

Call:
glm(formula = count ~ X + Y + Z, family = poisson(link = log), 
    subset = count > 0)

Deviance Residuals: 
      2        3        4        6        7        8  
 0.6314  -1.3798   0.8575  -0.5820   1.0228  -0.7994  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.7944     0.4798   3.740 0.000184 ***
Xx2           0.3365     0.2928   1.149 0.250443    
Yy2           0.3747     0.3917   0.957 0.338746    
Zz2          -0.2719     0.3318  -0.819 0.412519    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 9.5791  on 5  degrees of freedom
Residual deviance: 5.0616  on 2  degrees of freedom
AIC: 36.162

Effects of sampling zeros

Sampling zeros have two major effects on modeling contingency tables:

  • Maximum likelihood estimates (MLEs) may not exist for log-linear/logit models
  • If MLEs exist, they may be biased

Recall that terms in a log-linear model correspond to margins of the table. The existence of MLEs depends on the positions of zeros in the table and on which effects are included in the model.

If all \(n_{ij} > 0\), then the MLEs of the parameters are finite. If the table has a zero marginal frequency and the model contains a term corresponding to that margin, the MLE for that term is infinite (it does not exist).

The bias problem due to sparseness

As mentioned above, the ML estimates of odds ratios can be severely biased in sparse tables. Also, the sampling distributions of the fit statistics may be poorly approximated by the chi-squared distribution. (The enormous standard errors in the output above are a related symptom: the MLEs of those terms do not exist.)

One suggested ad hoc solution is to add 0.5 to each cell in the table. Doing so shrinks estimated odds ratios that would otherwise be \(\infty\) down to finite values and increases estimates that would otherwise be 0.
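
As a minimal sketch of this adjustment in R, reusing the count, X, Y, Z vectors created earlier (glm() will warn about non-integer counts, which is harmless here):

# Ad hoc fix: add 0.5 to every cell, then refit the homogeneous association model
count_adj <- count + 0.5
fit_half <- glm(count_adj ~ X + Y + Z + X:Y + X:Z + Y:Z,
                family = poisson(link = log))
coef(summary(fit_half))["Yy2:Zz2", ]  # now finite, with a finite standard error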

Other suggested solutions:

  • When a model does not converge, try adding a tiny number, such as 0.000000001, to all zero cells in the table. Some argue you should add this value to all of the cells. Adding very small values is essentially what programs like SAS and R do when they fit these types of models.
  • Extend this with a sensitivity analysis: add constants of varying sizes to the cells, and examine whether the fit statistics and parameter estimates change very much (see the sketch after this list).
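
A sketch of such a sensitivity analysis, with illustrative constants of our own choosing (again reusing the vectors defined earlier):

# Refit after adding constants of different sizes to every cell and
# watch how the fit and the previously inestimable YZ term respond
for (eps in c(1e-8, 1e-4, 0.1, 0.5)) {
  count_eps <- count + eps
  fit <- glm(count_eps ~ X + Y + Z + X:Y + X:Z + Y:Z,
             family = poisson(link = log))
  cat("added", eps, ": deviance =", round(deviance(fit), 4),
      " Yy2:Zz2 =", round(coef(fit)[["Yy2:Zz2"]], 3), "\n")
}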

Effect of Sparseness on \(X^2\) and \(G^2\)

Unfortunately, there is no single rule that covers all situations, so we must be careful when applying the situation-specific rules below.

While the \(G^2\) statistic for overall model fit may not be well approximated by the chi-squared distribution, the difference between \(G^2\) values for two nested models is typically closer to chi-squared than the \(G^2\) comparing either model to the saturated one. Chi-squared comparison tests depend more on the size of the marginal totals than on the cell sizes in the joint table. So if the margins have cells exceeding 5, the chi-squared approximation of \(G^{2}\left(\mbox{reduced model}\right) - G^{2}\left(\mbox{full model}\right)\) should be reasonable.
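
For example, a sketch of such a nested comparison in R (using the vectors defined earlier; the particular pair of models is our choice for illustration):

# Compare nested models by the drop in G^2 (deviance); neither model
# involves the zero YZ margin, so the MLEs exist for both
fit_indep <- glm(count ~ X + Y + Z, family = poisson(link = log))
fit_xz <- glm(count ~ X + Y + Z + X:Z, family = poisson(link = log))
anova(fit_indep, fit_xz, test = "Chisq")  # G^2(reduced) - G^2(full) on 1 df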

  • When \(df > 1\), it is OK for \(\hat{\mu}\) to be as small as 1, as long as fewer than 20% of the cells have \(\hat{\mu} < 5\).
  • The permissible size of \(\hat{\mu}\) decreases as the size of the table \(N\) increases.
  • The chi-squared approximation for \(X^{2}\) and \(G^{2}\) can be poor for sparse tables with both very small and very large \(\hat{\mu}\)'s (relative to \(n/N\)).
  • \(G^2\) is usually poorly approximated by the chi-squared distribution when \(n/N < 5\). The p-values for \(G^2\) may be too large or too small, depending on \(n/N\).
  • For fixed \(n\) and \(N\), chi-squared approximations are better for smaller df than for larger df.

12.5 - Summary

In this lesson, we saw an alternative way to estimate parameters and their standard errors for a GLM that doesn't rely on full specification of the likelihood function. This "generalized estimating equation" approach needs only the marginal distribution of the responses, along with a correlation structure to describe pairwise associations between observations within the same subject.

Additionally, we saw the different ways in which zeros can appear in categorical data, depending on whether an observation could have been observed but wasn't (a sampling zero) or could not have been observed at all due to its nature (a structural zero).

