
A goodness-of-fit test, in general, measures how well the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking the model fit. As in linear regression, in essence, the goodness-of-fit test compares the observed values to the expected (fitted or predicted) values.

A goodness-of-fit statistic tests the following hypothesis:

H0: the model M0 fits
vs.
HA: the model M0 does not fit (or, some other model MA fits)

Most often the observed data represent the fit of the saturated model, the most complex model possible with the given data. Thus, most often the alternative hypothesis (HA) will represent the saturated model MA, which fits perfectly because each observation has a separate parameter. Later in the course we will see that MA could be a model other than the saturated one. Let us now consider the simplest example of the goodness-of-fit test with categorical data.

In the setting for one-way tables, we measure how well an observed variable X corresponds to a Mult (n, π) model for some vector of cell probabilities, π. We will consider two cases:

  1. when vector π is known, and
  2. when vector π is unknown.

In other words, we assume that under the null hypothesis data come from a Mult (n, π) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodness-of-fit test is applied to corroborate our assumption.

Consider our Dice Example from the Introduction. We want to test the hypothesis that the six sides are equally probable; that is, we compare the observed frequencies to the assumed model: X ∼ Mult (n = 30, π0 = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)). You can think of this as simultaneously testing whether the probability in each cell equals a specified value, e.g.

H0: ( π1, π2, π3, π4, π5, π6) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

vs. 

HA: ( π1, π2, π3, π4, π5, π6) ≠ (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

Most software packages will already have built-in functions that will do this for you; see the next section for examples in SAS and R. Here is a step-by-step procedure to help you conceptually understand this test better and what is going on behind these functions.

Step 1: If the vector π is unknown, estimate the unknown parameters, then proceed to Step 2. If the vector π is known, proceed directly to Step 2.

Step 2: Calculate the estimated (fitted) cell probabilities, $\hat{\pi}_j$'s, and the expected cell frequencies, Ej's, under H0.

Step 3: Calculate the Pearson goodness-of-fit statistic, X2, and/or the deviance statistic, G2, and compare them to appropriate chi-squared distributions to make a decision.

Step 4: If the decision is borderline or if the null hypothesis is rejected, further investigate which observations may be influential by looking, for example, at residuals.

Pearson and deviance test statistics

The Pearson goodness-of-fit statistic is

\(X^2=\sum\limits_{j=1}^k \dfrac{(X_j-n\hat{\pi}_j)^2}{n\hat{\pi}_j}\)

An easy way to remember it is

\(X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}\)

where Oj = Xj is the observed count in cell j, and $E_j=E(X_j)=n\hat{\pi}_j$ is the expected count in cell j under the assumption that the null hypothesis is true, i.e., that the assumed model is a good one. Notice that \(\hat{\pi}_j\) is the estimated (fitted) cell proportion πj under H0.
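As a minimal sketch in Python, here is the computation of X2 for the dice setting with n = 30 and π0j = 1/6. The observed counts below are hypothetical, chosen only for illustration; they are not from the course's dice data:

```python
# Pearson goodness-of-fit statistic X^2 for the dice setting.
observed = [3, 7, 5, 10, 2, 3]      # O_j: hypothetical counts for 30 rolls
n = sum(observed)                   # 30
pi0 = [1.0 / 6.0] * 6               # cell probabilities under H0
expected = [n * p for p in pi0]     # E_j = n * pi_j = 5 for every cell

X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(X2)  # 9.2
```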

The deviance statistic is

\(G^2=2\sum\limits_{j=1}^k X_j \text{log}\left(\dfrac{X_j}{n\hat{\pi}_j}\right)\)

where "log" means natural logarithm. An easy way to remember it is

\(G^2=2\sum\limits_j O_j \text{log}\left(\dfrac{O_j}{E_j}\right)\)

In some texts, G2 is also called the likelihood-ratio test statistic, since it compares the likelihoods (l0 and l1) of two models; that is, it compares the loglikelihood under H0 (the loglikelihood of the fitted model, L0) and the loglikelihood under HA (the loglikelihood of the larger, less restricted, or saturated model, L1): G2 = −2 log(l0/l1) = −2(L0 − L1). A common mistake in calculating G2 is to leave out the factor of 2 at the front.
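As an illustration in Python (the counts below are hypothetical, not from the course's dice data), we can compute G2 directly and verify the likelihood-ratio identity G2 = −2(L0 − L1); the multinomial coefficient is common to both loglikelihoods and cancels in the difference:

```python
import math

observed = [3, 7, 5, 10, 2, 3]      # hypothetical O_j for 30 rolls (all nonzero)
n = sum(observed)
pi0 = [1.0 / 6.0] * 6               # cell probabilities under H0

# Deviance statistic: G^2 = 2 * sum O_j * log(O_j / E_j), with E_j = n * pi_j
G2 = 2 * sum(o * math.log(o / (n * p)) for o, p in zip(observed, pi0))

# Likelihood-ratio form: L0 uses pi0, L1 uses the saturated fit p_j = O_j / n;
# the multinomial coefficient cancels, so it is omitted from both.
L0 = sum(o * math.log(p) for o, p in zip(observed, pi0))
L1 = sum(o * math.log(o / n) for o in observed)
assert abs(G2 - (-2) * (L0 - L1)) < 1e-10

print(round(G2, 2))  # 8.78
```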

Note that X2 and G2 are both functions of the observed data X and a vector of probabilities π. For this reason, we will sometimes write them as X2(x, π) and G2(x, π), respectively; when there is no ambiguity, however, we will simply use X2 and G2. We will be dealing with these statistics throughout the course; in the analysis of 2-way and k-way tables, and when assessing the fit of log-linear and logistic regression models.

Testing the Goodness-of-Fit

X2 and G2 both measure how closely the model, in this case Mult (n, π), "fits" the observed data.

  • If the sample proportions pj = Xj /n (i.e., saturated model) are exactly equal to the model's πj for cells j = 1, 2, . . . , k, then Oj = Ej for all j, and both X2 and G2 will be zero. That is, the model fits perfectly.
  • If the sample proportions pj deviate from the $\hat{\pi}$'s computed under H0, then X2 and G2 are both positive. Large values of X2 and G2 mean that the data do not agree well with the assumed/proposed model M0.
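A quick numerical check of the first point, using a hypothetical, perfectly balanced sample in which every observed count equals its expected count:

```python
import math

observed = [5, 5, 5, 5, 5, 5]       # hypothetical: exactly n * pi_j in every cell
n, pi0 = sum(observed), [1.0 / 6.0] * 6
expected = [n * p for p in pi0]

# With O_j = E_j for all j, every term of both sums is zero.
X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
G2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
print(X2, G2)  # 0.0 0.0
```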

How can we judge the sizes of X2 and G2?

The answer is provided by this result:

If x is a realization of X ∼ Mult(n, π), then as n becomes large, the sampling distributions of both X2(x, π) and G2(x, π) approach a chi-squared distribution with df = k − 1, denoted χ2k−1, where k is the number of cells.

This means that we can easily test a null hypothesis H0: π = π0 against the alternative H1: π ≠ π0 for some pre-specified vector π0. An approximate α-level test of H0 versus H1 is:

Reject H0 if the computed X2(x, π0) or G2(x, π0) exceeds the theoretical value χ2k−1(1 − α).

Here, χ2k−1(1 − α) denotes the (1 − α)th quantile of the χ2k−1 distribution, i.e., the value for which the probability that a χ2k−1 random variable is less than or equal to it is 1 − α. The p-value for this test is the area to the right of the computed X2 or G2 under the χ2k−1 density curve. For a simple numerical example, consider a chi-squared distribution with df = 10, and suppose the computed test statistic is X2 = 21. For α = 0.05, the theoretical value is 18.31, so we would reject H0.

Useful functions in SAS and R to remember for computing the p-values from the chi-square distribution are:

  • In R, p-value=1-pchisq(test statistic, df), e.g., 1-pchisq(21,10)=0.021
  • In SAS, p-value=1-probchi(test statistic, df), e.g., 1-probchi(21,10)=0.021
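The same tail area can also be computed from scratch. This Python sketch uses only the standard library, via the power series for the regularized lower incomplete gamma function; it is an illustration of what pchisq/probchi do, not a vetted library routine:

```python
import math

def chisq_sf(x, df):
    """Upper-tail probability P(chi^2_df > x), computed from the power
    series for the regularized lower incomplete gamma function P(a, z)."""
    a, z = df / 2.0, x / 2.0
    # First series term: z^a * exp(-z) / Gamma(a + 1), via logs for stability
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    cdf = term
    k = 0
    while term > 1e-16 * cdf:   # add terms until they stop contributing
        k += 1
        term *= z / (a + k)
        cdf += term
    return 1.0 - cdf

# The example from the text: X^2 = 21 with df = 10
print(round(chisq_sf(21, 10), 3))  # 0.021
```

This agrees with 1-pchisq(21, 10) in R up to rounding.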

You can quickly review the chi-squared distribution in Lesson 0, or check out https://www.statsoft.com/textbook/stathome.html and https://www.ruf.rice.edu/~lane/stat_sim/chisq_theor/. The STATSOFT link also has brief reviews of many other statistical concepts and methods.

Here are a few more comments on this test.

  • When n is large and the model is true, X2 and G2 tend to be approximately equal. For large samples, the results of the X2 and G2 tests will be essentially the same.
  • An old-fashioned rule of thumb is that the χ2 approximation for X2 and G2 works well provided that n is large enough to have Ej = nπj ≥ 5 for every j. Nowadays, most agree that we can have Ej< 5 for some of the cells (say, 20% of them). Some of the Ej's can be as small as 2, but none of them should fall below 1. If this happens, then the χ2 approximation isn't appropriate, and the test results are not reliable.

  • In practice, it's a good idea to compute both X2 and G2 to see if they lead to similar results. If the resulting p-values are close, then we can be fairly confident that the large-sample approximation is working well.

  • If it is apparent that one or more of the Ej's are too small, we can sometimes get around the problem by collapsing or combining cells until all the Ej's are large enough. Alternatively, we can perform small-sample or exact inference; we will see more on this in Lesson 3. Please note that small-sample inference can be conservative for discrete distributions; that is, it may give a larger p-value than the actual one (e.g., for more details see Agresti (2007), Sec. 1.4.3–1.4.5 and 2.6; Agresti (2013), Sec. 3.5, and for Bayesian inference Sec. 3.6).

  • In most applications, we will reject the null hypothesis X ∼ Mult (n, π) for large values of X2 or G2. On rare occasions, however, we may want to reject the null hypothesis for unusually small values of X2 or G2. That is, we may want to define the p-value as P(χ2k−1 ≤ X2) or P(χ2k−1 ≤ G2). Very small values of X2 or G2 suggest that the model fits the data too well, i.e., the data may have been fabricated or altered in some way to fit the model closely. This is how R.A. Fisher figured out that some of Mendel's experimental data must have been fraudulent (e.g., see Agresti (2007), page 327; Agresti (2013), page 19).
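The expected-count rule of thumb above can be turned into a quick screening helper. In this Python sketch, the function name and thresholds are simply my own encoding of the rule (at most ~20% of cells with Ej below 5, and no cell below 1):

```python
def check_expected_counts(expected):
    """Screen expected cell counts E_j against the rule of thumb:
    at most ~20% of cells may have E_j < 5, and none may have E_j < 1."""
    n_small = sum(1 for e in expected if e < 5)   # cells with E_j below 5
    n_tiny = sum(1 for e in expected if e < 1)    # cells with E_j below 1
    return (n_small <= 0.2 * len(expected)) and n_tiny == 0

print(check_expected_counts([5, 5, 5, 5, 5, 5]))       # True
print(check_expected_counts([12, 9, 4, 3, 1.5, 0.5]))  # False (one E_j < 1)
```

When this returns False, the χ2 approximation may be unreliable, and collapsing cells or exact inference should be considered.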