# 2.4 - Goodness-of-Fit Test

A **goodness-of-fit test**, in general, refers to measuring how well the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking the model fit. In essence, just as in linear regression, the goodness-of-fit test compares the observed values to the expected (fitted or predicted) values.

A goodness-of-fit statistic tests the following hypothesis:

*H*_{0}: the model M_{0} fits

vs. *H*_{A}: the model M_{0} does not fit (or, some other model M_{A} fits)

Most often the observed data represent the fit of the *saturated model*, the most complex model possible with the given data. Thus, most often the alternative hypothesis (H_{A}) will represent the saturated model M_{A}, which fits perfectly because each observation has a separate parameter. Later in the course we will see that M_{A} could be a model other than the saturated one. Let us now consider the simplest example of the goodness-of-fit test with categorical data.

In the setting for one-way tables, we measure how well an observed variable *X* corresponds to a Mult (*n*, **π**) model for some vector of cell probabilities, **π**. We will consider two cases:

- when the vector **π** is known, and
- when the vector **π** is unknown.

In other words, we assume that under the null hypothesis data come from a Mult (*n*, π) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodness-of-fit test is applied to corroborate our assumption.

Consider our Dice Example from the Introduction. We want to test the hypothesis that there is an equal probability of six sides; that is, compare the observed frequencies to the assumed model: *X* ∼ Mult (*n* = 30, π_{0} = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)). You can think of this as simultaneously testing whether the probability in each cell equals a specified value, e.g.

*H*_{0}: ( π_{1}, π_{2}, π_{3}, π_{4}, π_{5}, π_{6}) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

vs.

*H*_{A}: ( π_{1}, π_{2}, π_{3}, π_{4}, π_{5}, π_{6}) ≠ (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

Most software packages have built-in functions that will do this for you; see the next section for examples in SAS and R. Here is a step-by-step procedure to help you understand this test conceptually and see what is going on behind these functions.

Step 1: If the vector π is unknown, estimate the unknown parameters; then proceed to Step 2. If the vector π is known, proceed directly to Step 2.

Step 2: Calculate the estimated (fitted) cell probabilities $\hat{\pi}_j$'s and the expected cell frequencies $E_j=n\hat{\pi}_j$ under H_{0}.

Step 3: Calculate the test statistic, compare it to the appropriate chi-squared reference distribution, and make a decision about H_{0}.

Step 4: If the decision is borderline, or if the null hypothesis is rejected, investigate further which observations may be influential by looking, for example, at residuals.

### Pearson and deviance test statistics

The **Pearson goodness-of-fit statistic** is

\(X^2=\sum\limits_{j=1}^k \dfrac{(X_j-n\hat{\pi}_j)^2}{n\hat{\pi}_j}\)

An easy way to remember it is

\(X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}\)

where *O*_{j} = *X*_{j} is the **observed count** in cell *j*, and $E_j=E(X_j)=n\hat{\pi}_j$ is the **expected count** in cell *j* under the assumption that the null hypothesis is true, i.e., the assumed model is a good one. Notice that \(\hat{\pi}_j\) is the estimated (fitted) cell proportion π_{j} under H_{0}.

The **deviance statistic** is

\(G^2=2\sum\limits_{j=1}^k X_j \text{log}\left(\dfrac{X_j}{n\hat{\pi}_j}\right)\)

where "log" means natural logarithm. An easy way to remember it is

\(G^2=2\sum\limits_j O_j \text{log}\left(\dfrac{O_j}{E_j}\right)\)
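As a worked illustration of these two formulas, here is a short Python sketch for the dice example, with *n* = 30 and π_{0} = (1/6, ..., 1/6); the observed counts are hypothetical, chosen only for illustration:

```python
import math

def pearson_x2(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O_j - E_j)^2 / E_j."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def deviance_g2(observed, expected):
    """Deviance statistic: 2 * sum of O_j * log(O_j / E_j).
    A cell with O_j = 0 contributes 0 (the limit of x*log(x) as x -> 0)."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical counts from n = 30 rolls of a die.  Since pi_j = 1/6 is fully
# specified under H0, the expected count is E_j = n * pi_j = 5 in every cell.
observed = [3, 7, 5, 10, 2, 3]
expected = [30 * (1 / 6)] * 6

x2 = pearson_x2(observed, expected)   # 9.2
g2 = deviance_g2(observed, expected)  # about 8.78
print(x2, g2)
```

Note that the two statistics are close but not equal here; both would be compared to a χ^{2} distribution with *df* = *k* − 1 = 5, as described below.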

In some texts, *G*^{2} is also called the **likelihood-ratio test statistic**, for comparing the likelihoods (*l*_{0} and *l*_{1}) of two models, that is, comparing the loglikelihood under *H*_{0} (i.e., the loglikelihood of the fitted model, *L*_{0}) and the loglikelihood under *H*_{A} (i.e., the loglikelihood of the larger, less restricted, or saturated model, *L*_{1}):

\(G^2 = -2\log(l_0/l_1) = -2(L_0 - L_1)\)

A common mistake in calculating *G*^{2} is to leave out the factor of 2 at the front.

Note that *X*^{2} and *G*^{2} are both functions of the observed data *X* and a vector of probabilities **π**. For this reason, we will sometimes write them as *X*^{2}(*x*, **π**) and *G*^{2}(*x*, **π**), respectively; when there is no ambiguity, however, we will simply use *X*^{2} and *G*^{2}. We will be dealing with these statistics throughout the course: in the analysis of 2-way and k-way tables, and when assessing the fit of log-linear and logistic regression models.

### Testing the Goodness-of-Fit

*X*^{2} and *G*^{2} both measure how closely the model, in this case Mult (*n*, **π**), "fits" the observed data.

- If the sample proportions *p*_{j} = *X*_{j}/*n* (i.e., the saturated model) are exactly equal to the model's π_{j} for cells *j* = 1, 2, . . . , *k*, then *O*_{j} = *E*_{j} for all *j*, and both *X*^{2} and *G*^{2} will be zero. That is, the model fits perfectly.

- If the sample proportions *p*_{j} deviate from the $\hat{\pi}_j$'s computed under H_{0}, then *X*^{2} and *G*^{2} are both positive. Large values of *X*^{2} and *G*^{2} mean that the data do not agree well with the assumed/proposed model M_{0}.

#### How can we judge the sizes of *X*^{2} and *G*^{2}?

The answer is provided by this result:

If *x* is a realization of *X* ∼ Mult(*n*, **π**), then as *n* becomes large, the **sampling distributions** of both *X*^{2}(*x*, **π**) and *G*^{2}(*x*, **π**) approach a chi-squared distribution with *df* = *k* − 1, where *k* is the number of cells: χ^{2}_{k−1}.
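This limiting result can be checked informally by Monte Carlo simulation. The Python sketch below (the sample size, number of simulations, and seed are arbitrary choices for illustration) draws repeated Mult(*n* = 30, **π**_{0}) samples for a fair die and confirms that the simulated mean of *X*^{2} is close to *df* = *k* − 1 = 5, which is in fact the exact mean of *X*^{2} under H_{0}:

```python
import random

random.seed(504)  # fixed seed for reproducibility; the value is arbitrary

k, n, n_sims = 6, 30, 2000
pi0 = [1 / k] * k  # fair die: each cell has probability 1/6

def simulate_x2():
    """Draw one Mult(n, pi0) sample and return its Pearson X^2 under H0."""
    counts = [0] * k
    for _ in range(n):
        counts[random.randrange(k)] += 1
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, pi0))

stats = [simulate_x2() for _ in range(n_sims)]
mean_x2 = sum(stats) / n_sims
# E[X^2] = sum_j (1 - pi_j) = k - 1 = 5 exactly, so the simulated mean
# should be near 5; the full histogram of stats resembles a chi-squared
# density with 5 degrees of freedom.
print(round(mean_x2, 2))
```

A histogram of `stats` overlaid with the χ^{2}_{5} density would make the approximation visible; here we only check the mean.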

This means that we can easily test a null hypothesis *H*_{0}: **π** = **π**_{0} against the alternative *H*_{1}: **π** ≠ **π**_{0} for some pre-specified vector **π**_{0}. An approximate α-level test of *H*_{0} versus *H*_{1} is:

Reject *H*_{0} if the computed *X*^{2}(*x*, **π**_{0}) or *G*^{2}(*x*, **π**_{0}) exceeds the theoretical value χ^{2}_{k−1}(1 − α).

Here, χ^{2}_{k−1}(1 − α) denotes the (1 − α)th quantile of the χ^{2}_{k−1} distribution, the value for which the probability that a χ^{2}_{k−1} random variable is less than or equal to it is 1 − α. The **p-value** for this test is the area to the right of the computed *X*^{2} or *G*^{2} under the χ^{2}_{k−1} density curve. As a simple example, consider a chi-squared distribution with *df* = 10, and suppose the computed test statistic is *X*^{2} = 21. For α = 0.05, the theoretical value is 18.31, so we would reject *H*_{0}.

Useful functions in SAS and R for computing *p*-values from the chi-squared distribution are:

- In R: p-value = `1-pchisq(test statistic, df)`, e.g., `1-pchisq(21,10)` = 0.021
- In SAS: p-value = `1-probchi(test statistic, df)`, e.g., `1-probchi(21,10)` = 0.021
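For intuition, the same upper-tail area can be computed from scratch. The following is only an illustrative Python sketch (in practice a library routine such as `scipy.stats.chi2.sf` would be the standard choice); it evaluates the chi-squared CDF via the standard series expansion of the regularized lower incomplete gamma function with *a* = *df*/2 and *x* = statistic/2:

```python
import math

def chisq_pvalue(stat, df, terms=200):
    """Upper-tail p-value P(chi2_df >= stat).

    Uses the series gamma_lower(a, x) = x^a e^{-x} * sum_{i>=0} x^i / (a(a+1)...(a+i)),
    normalized by Gamma(a), with a = df/2 and x = stat/2."""
    a, x = df / 2.0, stat / 2.0
    if x <= 0:
        return 1.0
    term = 1.0 / a        # i = 0 term of the series
    total = term
    for i in range(1, terms):
        term *= x / (a + i)
        total += term
    # cdf = gamma_lower(a, x) / Gamma(a), computed on the log scale for stability
    cdf = total * math.exp(a * math.log(x) - x - math.lgamma(a))
    return 1.0 - cdf

print(round(chisq_pvalue(21, 10), 3))  # matches 1-pchisq(21,10) = 0.021
```

Since 0.021 < α = 0.05, this reproduces the rejection decision from the example above.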

You can quickly review the chi-squared distribution in Lesson 0, or check out https://www.statsoft.com/textbook/stathome.html and https://www.ruf.rice.edu/~lane/stat_sim/chisq_theor/. The STATSOFT link also has brief reviews of many other statistical concepts and methods.

Here are a few more comments on this test.

- When *n* is large and the model is true, *X*^{2} and *G*^{2} tend to be approximately equal. For large samples, the results of the *X*^{2} and *G*^{2} tests will be essentially the same.

- An old-fashioned rule of thumb is that the χ^{2} approximation for *X*^{2} and *G*^{2} works well provided that *n* is large enough to have *E*_{j} = *n*π_{j} ≥ 5 for every *j*. Nowadays, most agree that we can have *E*_{j} < 5 for some of the cells (say, 20% of them). Some of the *E*_{j}'s can be as small as 2, but none of them should fall below 1. If this happens, then the χ^{2} approximation isn't appropriate, and the test results are not reliable.

- In practice, it's a good idea to compute both *X*^{2} and *G*^{2} to see if they lead to similar results. If the resulting *p*-values are close, then we can be fairly confident that the large-sample approximation is working well.

- If it is apparent that one or more of the *E*_{j}'s are too small, we can sometimes get around the problem by collapsing or combining cells until all the *E*_{j}'s are large enough. But we can also perform *small-sample inference* or *exact inference*. We will see more on this in Lesson 3. Please note that small-sample inference can be *conservative* for discrete distributions, that is, it may give a larger *p*-value than it should be (e.g., for more details see Agresti (2007), Sec. 1.4.3-1.4.5 and 2.6; Agresti (2013), Sec. 3.5, and for Bayesian inference Sec. 3.6).

- In most applications, we will reject the null hypothesis *X* ∼ Mult (*n*, **π**) for large values of *X*^{2} or *G*^{2}. On rare occasions, however, we may want to reject the null hypothesis for unusually *small* values of *X*^{2} or *G*^{2}. That is, we may want to define the *p*-value as *P*(χ^{2}_{k−1} ≤ *X*^{2}) or *P*(χ^{2}_{k−1} ≤ *G*^{2}). Very small values of *X*^{2} or *G*^{2} suggest that the model fits the data too well, i.e., the data may have been fabricated or altered in some way to fit the model closely. This is how R.A. Fisher figured out that some of Mendel's experimental data must have been fraudulent (e.g., see Agresti (2007), page 327; Agresti (2013), page 19).