6.2.3 - More on Goodness-of-Fit and Likelihood ratio tests

Suppose two alternative models are under consideration, where one model is simpler or more parsimonious than the other. More often than not, one of the models is the saturated model. Another common situation is to consider 'nested' models, where one model is obtained from the other by setting some of the parameters to zero. Suppose now we test

\(H_0\): reduced model is true vs. \(H_A\): current model is true

Notice that the null and alternative hypotheses differ from those in the section above. Here, to test the null hypothesis that an arbitrary group of k coefficients from the model is equal to zero (i.e., those predictors have no relationship with the response), we need to fit two models:

  • the reduced model which omits the k predictors in question, and
  • the current model which includes them.

The likelihood-ratio statistic is

\(\Delta G^2 = (-2 \log L \text{ from reduced model}) - (-2 \log L \text{ from current model})\)

and the degrees of freedom are k (the number of coefficients in question). The p-value is \(P(\chi^2_k \geq \Delta G^2)\).

To perform the test, we look at the "Model Fit Statistics" section and examine the values of "−2 Log L" in the "Intercept Only" and "Intercept and Covariates" columns. Here, the reduced model is the "intercept-only" model (i.e., no predictors) and "intercept and covariates" is the current model we fitted. For our running example, this is equivalent to testing the "intercept-only" model vs. the full (saturated) model, since we have only one covariate.

SAS output

Larger values of \(\Delta G^2\) (the difference in "−2 Log L") lead to small p-values, which provide evidence against the reduced model in favor of the current model. You can explore AIC (Akaike Information Criterion) and SC (Schwarz Criterion) on your own through SAS help files, or see Lesson 5 for AIC.

For our example, \(\Delta G^2\) = 5176.510 − 5147.390 = 29.1207 with df = 2 − 1 = 1. Notice that this matches the deviance we got in the earlier text above.
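The arithmetic above can be checked with a minimal sketch. It uses only the two "−2 Log L" values quoted in the text; the helper `chi2_sf_df1` is a hypothetical name introduced here, and it relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so it is valid for df = 1 only. The difference of the printed values is 29.120; the 29.1207 shown in the output presumably comes from unrounded values.

```python
import math

# Values of "-2 Log L" quoted in the text above.
neg2logl_reduced = 5176.510   # intercept-only model
neg2logl_current = 5147.390   # intercept and covariates

delta_g2 = neg2logl_reduced - neg2logl_current
df = 2 - 1  # parameters in the current model minus parameters in the reduced model


def chi2_sf_df1(x):
    """Upper-tail probability P(chi^2_1 >= x); valid for 1 degree of freedom only."""
    # If Z is standard normal, Z^2 is chi-square with 1 df, so
    # P(Z^2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


p_value = chi2_sf_df1(delta_g2)
print(f"Delta G^2 = {delta_g2:.3f}, df = {df}, p-value = {p_value:.2e}")
```

For models that differ by more than one coefficient, a general chi-square tail function from a statistics library would be needed in place of this df = 1 shortcut.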

Another way to calculate the test statistic is

\(\Delta G^2 = G^2 \text{ from reduced model} - G^2 \text{ from current model}\),

where the \(G^2\)'s are the overall goodness-of-fit statistics.

The value of −2 Log L is useful for comparing two nested models that differ by an arbitrary set of coefficients.

Also notice that the \(\Delta G^2\) we calculated for this example is equal to

Likelihood Ratio 29.1207 1 <.0001

from the "Testing Global Null Hypothesis: BETA=0" section (the next part of the output, see below).

Testing the Joint Significance of All Predictors

Here we test the null hypothesis that all of the coefficients are simultaneously zero. For example, in the model

\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\ldots+\beta_k X_k\)

we test \(H_0 : \beta_1 = \beta_2 = \ldots = \beta_k = 0\) versus the alternative that at least one of the coefficients \(\beta_1, \ldots, \beta_k\) is not zero.

This is like the overall F-test in linear regression. In other words, this is testing the null hypothesis that an intercept-only model is correct,

\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)

versus the alternative that the current model is correct

\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1+\beta_2 X_2+\ldots+\beta_k X_k\)

In our example, we are testing the null hypothesis that an intercept-only model is correct,

\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0\)

versus the alternative that the current model (in this case saturated model) is correct

\(\text{log}\left(\dfrac{\pi}{1-\pi}\right)=\beta_0+\beta_1 X_1\)

In the SAS output, three different chi-square statistics for this test are displayed in the section "Testing Global Null Hypothesis: Beta=0," corresponding to the likelihood ratio, score, and Wald tests. Recall their definitions from the very first lessons.

SAS output

This test has k degrees of freedom, i.e., the number of dummy indicators (design variables), which is the number of \(\beta\)-parameters excluding the intercept.

Large chi-square statistics lead to small p-values and provide evidence against the intercept-only model in favor of the current model.

The Wald test is based on the asymptotic normality of the ML estimates of the \(\beta\)'s. Rather than using the Wald test, most statisticians would prefer the LR test.

If these three tests agree, that is evidence that the large-sample approximations are working well and the results are trustworthy. If the results from the three tests disagree, most statisticians would tend to trust the likelihood-ratio test more than the other two.

In our example, the "intercept only" model or the null model says that student's smoking is unrelated to parents' smoking habits. Thus the test of the global null hypothesis \(\beta_1 = 0\) is equivalent to the usual test for independence in the 2 × 2 table. We will see that the estimated coefficients and SE's are as we predicted before, as well as the estimated odds and odds ratios.
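The equivalence with the test of independence can be sketched numerically. The counts below are an assumption of this sketch: the 2 × 2 table is not reproduced in this section, and these values were picked so that the resulting −2 log L values agree closely with the ones quoted earlier. The function name `neg2_log_lik` is hypothetical. Because the model with one binary covariate is saturated, its fitted probabilities are just the observed proportions in each row, so no iterative fitting is needed.

```python
import math

# ASSUMED counts (not given in this section): each row is a parent group,
# recorded as (students who smoke, students who do not smoke).
table = {
    "at_least_one_parent_smokes": (816, 3203),
    "neither_parent_smokes": (188, 1168),
}


def neg2_log_lik(groups, probs):
    """-2 log L for binary outcomes, given one fitted P(smoke) per group."""
    total = 0.0
    for (yes, no), p in zip(groups, probs):
        total += yes * math.log(p) + no * math.log(1.0 - p)
    return -2.0 * total


groups = list(table.values())
total_yes = sum(yes for yes, _ in groups)
total_n = sum(yes + no for yes, no in groups)

# Intercept-only model: one common probability for every student.
p_null = total_yes / total_n
neg2logl_null = neg2_log_lik(groups, [p_null] * len(groups))

# Model with the single binary covariate (saturated): one probability per row.
p_sat = [yes / (yes + no) for yes, no in groups]
neg2logl_full = neg2_log_lik(groups, p_sat)

delta_g2 = neg2logl_null - neg2logl_full

# G^2 for independence in the 2 x 2 table: 2 * sum(O * log(O / E)).
col_totals = [total_yes, total_n - total_yes]
g2_indep = 0.0
for yes, no in groups:
    row_total = yes + no
    for observed, col_total in zip((yes, no), col_totals):
        expected = row_total * col_total / total_n
        g2_indep += 2.0 * observed * math.log(observed / expected)

print(f"-2 log L, intercept only:        {neg2logl_null:.3f}")
print(f"-2 log L, intercept + covariate: {neg2logl_full:.3f}")
print(f"Delta G^2 from the two models:   {delta_g2:.4f}")
print(f"G^2 for independence:            {g2_indep:.4f}")
```

The last two printed values agree up to floating-point error, which is the equivalence described above.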

The residual deviance is the difference in \(G^2 = −2 \text{logL}\) between the saturated model and the fitted model. A high residual deviance indicates that the fitted model does not fit the data well.

The null deviance is the difference in \(G^2 = −2 \text{logL}\) between the saturated model and the intercept-only model. A high null deviance indicates that the intercept-only model does not fit.

R output

In our 2 × 2 table smoking example, the residual deviance is almost 0 because the model we built is the saturated model. Notice that the degrees of freedom are 0, too. The null deviance is equivalent to the likelihood ratio statistic in the "Testing Global Null Hypothesis: Beta=0" section of the SAS output.

For our example, the null deviance is 29.1207 with df = 1. Notice that this matches the deviance we got in the earlier text above.

The Hosmer-Lemeshow Statistic

An alternative statistic for measuring overall goodness-of-fit is the Hosmer-Lemeshow statistic.

NOTE! We use a one-predictor model here, where the predictor indicates whether at least one parent smokes.

This is a Pearson-like \(\chi^2\) statistic that is computed after the data are grouped by similar predicted probabilities. It is more useful when there is more than one predictor and/or there are continuous predictors in the model. We will see more on this later, and in your homework.

\(H_0\) : the current model fits well
\(H_A\) : the current model does not fit well

To calculate this statistic:

  1. Group the observations according to model-predicted probabilities ( \(\hat{\pi}_i\))
  2. The number of groups is typically determined such that there is roughly an equal number of observations per group
  3. The Hosmer-Lemeshow (HL) statistic, a Pearson-like chi-square statistic, is computed on the grouped data, but it does NOT have a limiting chi-square distribution because the observations within groups are not from identical trials. Simulations have shown that this statistic can be approximated by a chi-square distribution with df = g − 2, where g is the number of groups.
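The steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure SAS uses internally: the function name `hosmer_lemeshow`, the simulated data, and the equal-size slicing rule are all hypothetical choices, and the "predicted" probabilities here come from a made-up logistic curve rather than from a fitted model. The statistic is written in the common form \(\sum_g (O_g - E_g)^2 / \{E_g (1 - E_g/n_g)\}\), where \(O_g\) and \(E_g\) are the observed and expected numbers of events in group g.

```python
import math
import random


def hosmer_lemeshow(probs, outcomes, n_groups=10):
    """Return (HL statistic, df) for 0/1 outcomes and predicted probabilities."""
    # Step 1: order the observations by predicted probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    stat = 0.0
    for g in range(n_groups):
        # Step 2: slice into groups of roughly equal size.
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        n_g = len(idx)
        observed = sum(outcomes[i] for i in idx)   # observed events in group
        expected = sum(probs[i] for i in idx)      # expected events in group
        # Step 3: Pearson-like contribution of the group.
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / n_g))
    return stat, n_groups - 2


# Hypothetical data: one continuous predictor and a made-up logistic curve.
rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(500)]
probs = [1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * x))) for x in xs]
outcomes = [1 if rng.random() < p else 0 for p in probs]

hl, df = hosmer_lemeshow(probs, outcomes, n_groups=10)
print(f"HL = {hl:.3f} on df = {df}")
```

Changing `n_groups` or the slicing rule changes the value of the statistic, which is one way to see its dependence on the grouping.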

Warning about Hosmer-Lemeshow goodness-of-fit test:

  • It is a conservative statistic, i.e. its value is smaller than what it ought to be, and therefore the probability of rejecting the null hypothesis is smaller.
  • It has low power for detecting certain types of lack of fit, such as nonlinearity in an explanatory variable.
  • It is highly dependent on how the observations are grouped.
  • If too few groups are used (e.g. 5 or fewer), it almost always indicates that the model fits the data; this means that it's usually not a good measure if you only have one or two categorical predictor variables, and it's best used for continuous predictors.

In the model statement, the option lackfit tells SAS to compute HL statistics and print the partitioning.

For our example, because we have a small number of groups (i.e., 2), this statistic gives a perfect fit (HL = 0, p-value = 1).

Instead of deriving the diagnostics we will look at them from a purely applied viewpoint. Recall the definitions and introductions to the regression residuals and Pearson and Deviance residuals.

Residuals

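The Pearson residuals are defined as

\(r_i=\dfrac{y_i-\hat{\mu}_i}{\sqrt{\hat{\mu}_i(n_i-\hat{\mu}_i)/n_i}}\)

where \(\hat{\mu}_i=n_i\hat{\pi}_i\) is the fitted count for the ith row. The contribution of the ith row to the Pearson statistic \(X^2\) is

\(\dfrac{(y_i-\hat{\mu}_i)^2}{\hat{\mu}_i}+\dfrac{((n_i-y_i)-(n_i-\hat{\mu}_i))^2}{n_i-\hat{\mu}_i}=r^2_i\)
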
and the Pearson goodness-of-fit statistic is

\(X^2=\sum\limits_{i=1}^N r^2_i\)

which we would compare to a \(\chi^2_{N-p}\) distribution. The deviance test statistic is

\(G^2=2\sum\limits_{i=1}^N \left\{ y_i\text{log}\left(\dfrac{y_i}{\hat{\mu}_i}\right)+(n_i-y_i)\text{log}\left(\dfrac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right\}\)

which we would again compare to \(\chi^2_{N-p}\), and the contribution of the ith row to the deviance is

\(2\left\{ y_i\text{log}\left(\dfrac{y_i}{\hat{\mu}_i}\right)+(n_i-y_i)\text{log}\left(\dfrac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right\}\)
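A minimal sketch of these row-wise quantities follows. The grouped counts are an assumption of this sketch (they are not given in this section), the fitted values come from an intercept-only model so that N = 2 and p = 1, and the Pearson residual is written in the standard binomial form \((y_i-\hat{\mu}_i)/\sqrt{\hat{\mu}_i(n_i-\hat{\mu}_i)/n_i}\). The function names are hypothetical.

```python
import math

# ASSUMED grouped data, one row per covariate pattern: (y_i, n_i).
rows = [(816, 4019), (188, 1356)]

# Fitted counts under an intercept-only model: mu_i = n_i * pi_hat.
pi_hat = sum(y for y, _ in rows) / sum(n for _, n in rows)
fitted = [n * pi_hat for _, n in rows]


def pearson_residual(y, n, mu):
    """Pearson residual for a binomial row with fitted count mu."""
    return (y - mu) / math.sqrt(mu * (n - mu) / n)


def deviance_contribution(y, n, mu):
    """Contribution of one row to G^2; requires 0 < y < n."""
    return 2.0 * (y * math.log(y / mu) + (n - y) * math.log((n - y) / (n - mu)))


pearson = [pearson_residual(y, n, mu) for (y, n), mu in zip(rows, fitted)]
deviance = [deviance_contribution(y, n, mu) for (y, n), mu in zip(rows, fitted)]

x2 = sum(r ** 2 for r in pearson)   # Pearson goodness-of-fit statistic
g2 = sum(deviance)                  # deviance statistic
df = len(rows) - 1                  # N rows minus p = 1 fitted parameter

for (y, n), r, d in zip(rows, pearson, deviance):
    print(f"y = {y}, n = {n}: Pearson residual = {r:.3f}, deviance part = {d:.3f}")
print(f"X^2 = {x2:.3f}, G^2 = {g2:.3f}, df = {df}")
```

Rows with a large squared Pearson residual or a large deviance contribution are the ones the model fits worst.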

We will see how these quantities are obtained from appropriate software and how they provide useful information for understanding and interpreting the models. For an example, see the SAS or R analysis in the next section.