A goodness-of-fit test, in general, measures how well the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking the model fit. As in linear regression, the goodness-of-fit test in essence compares the observed values to the expected (fitted or predicted) values.
A goodness-of-fit statistic tests the following hypothesis:
\(H_0\colon\) the model \(M_0\) fits
vs.
\(H_A\colon\) the model \(M_0\) does not fit (or, some other model \(M_A\) fits)
Most often the observed data represent the fit of the saturated model, the most complex model possible with the given data. Thus, most often the alternative hypothesis \(\left(H_A\right)\) will represent the saturated model \(M_A\), which fits perfectly because each observation has a separate parameter. Later in the course, we will see that \(M_A\) could be a model other than the saturated one. Let us now consider the simplest example of the goodness-of-fit test with categorical data.
In the setting for one-way tables, we measure how well an observed variable \(X\) corresponds to a \(Mult\left(n, \pi\right)\) model for some vector of cell probabilities, \(\pi\). We will consider two cases:
 when vector \(\pi\) is known, and
 when vector \(\pi\) is unknown.
In other words, we assume that under the null hypothesis the data come from a \(Mult\left(n, \pi\right)\) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodness-of-fit test is applied to corroborate our assumption.
Consider our dice example from Lesson 1. We want to test the hypothesis that the six faces are equally probable by comparing the observed frequencies to those expected under the assumed model: \(X \sim Mult(n = 30, \pi_0)\), where \(\pi_0=(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)\). We can think of this as simultaneously testing whether the probability in each cell equals a specified value:
\(H_0:\pi =\pi_0\)
where the alternative hypothesis is that any of these elements differ from the null value. There are two statistics available for this test.
Test Statistics Section
Pearson Goodness-of-fit Test Statistic
The Pearson goodness-of-fit statistic is
\(X^2=\sum\limits_{j=1}^k \dfrac{(X_j-n\pi_{0j})^2}{n\pi_{0j}}\)
An easy way to remember it is
\(X^2=\sum\limits_{j=1}^k \dfrac{(O_j-E_j)^2}{E_j}\)
where \(O_j = X_j\) is the observed count in cell \(j\), and \(E_j=E(X_j)=n\pi_{0j}\) is the expected count in cell \(j\) under the assumption that the null hypothesis is true.
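As a quick numerical check of the formula, here is a sketch in Python (an aside; the lesson's own code uses SAS and R) with the dice counts that appear in the example later in this section:

```python
# Pearson goodness-of-fit statistic: X^2 = sum_j (O_j - E_j)^2 / E_j
# Observed counts are from the dice example in this lesson; under H0
# each face has probability 1/6, so E_j = n * (1/6) for every cell.
observed = [3, 7, 5, 10, 2, 3]
n = sum(observed)                     # 30 rolls in total
expected = [n * (1 / 6)] * 6          # E_j = 5 for each of the 6 cells
X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(X2, 4))  # 9.2
```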
Deviance Test Statistic
The deviance statistic is
\(G^2=2\sum\limits_{j=1}^k X_j \log\left(\dfrac{X_j}{n\pi_{0j}}\right) =2\sum\limits_j O_j \log\left(\dfrac{O_j}{E_j}\right)\)
In some texts, \(G^2\) is also called the likelihood-ratio test (LRT) statistic, for comparing the log-likelihoods \(L_0\) and \(L_1\) of two models under \(H_0\) (reduced model) and \(H_A\) (full model), respectively:
Likelihood-ratio Test Statistic
\(G^2 = -2\log\left(\dfrac{\ell_0}{\ell_1}\right) = -2\left(L_0 - L_1\right)\)
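To see that the deviance form and the LRT form agree, one can compute \(G^2\) both ways. A Python sketch (an aside, not part of the lesson's SAS/R code) using the dice counts from the example below; the multinomial coefficient is dropped from both log-likelihoods since it is identical under both models and cancels in the difference:

```python
from math import log

observed = [3, 7, 5, 10, 2, 3]
n = sum(observed)
pi0 = [1 / 6] * 6

# Deviance form: G^2 = 2 * sum_j O_j log(O_j / (n pi_0j))
G2 = 2 * sum(o * log(o / (n * p)) for o, p in zip(observed, pi0))

# LRT form: G^2 = -2 (L_0 - L_1), with the multinomial coefficient dropped
L0 = sum(o * log(p) for o, p in zip(observed, pi0))   # reduced model H0
L1 = sum(o * log(o / n) for o in observed)            # saturated model
G2_lrt = -2 * (L0 - L1)

print(round(G2, 2), round(G2_lrt, 2))  # 8.78 8.78
```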
Note that \(X^2\) and \(G^2\) are both functions of the observed data \(X\) and a vector of probabilities \(\pi_0\). For this reason, we will sometimes write them as \(X^2\left(x, \pi_0\right)\) and \(G^2\left(x, \pi_0\right)\), respectively; when there is no ambiguity, however, we will simply use \(X^2\) and \(G^2\). We will be dealing with these statistics throughout the course in the analysis of two-way and \(k\)-way tables and when assessing the fit of log-linear and logistic regression models.
Testing the Goodness-of-Fit Section
\(X^2\) and \(G^2\) both measure how closely the model, in this case \(Mult\left(n,\pi_0\right)\), "fits" the observed data. Both have an approximate chi-square distribution with \(k-1\) degrees of freedom when \(H_0\) is true. This allows us to use the chi-square distribution to find critical values and \(p\)-values for establishing statistical significance.
 If the sample proportions \(\hat{\pi}_j\) (i.e., saturated model) are exactly equal to the model's \(\pi_{0j}\) for cells \(j = 1, 2, \dots, k,\) then \(O_j = E_j\) for all \(j\), and both \(X^2\) and \(G^2\) will be zero. That is, the model fits perfectly.
 If the sample proportions \(\hat{\pi}_j\) deviate from the \(\pi_{0j}\)s, then \(X^2\) and \(G^2\) are both positive. Large values of \(X^2\) and \(G^2\) mean that the data do not agree well with the assumed/proposed model \(M_0\).
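A useful byproduct of the Pearson statistic is the set of cell-wise Pearson residuals \(r_j=(O_j-E_j)/\sqrt{E_j}\), since \(X^2=\sum_j r_j^2\); the residuals show which cells drive the lack of fit. A small Python illustration (an aside, not part of the lesson's SAS/R code) with the dice counts used in the example below:

```python
from math import sqrt

observed = [3, 7, 5, 10, 2, 3]
expected = [5] * 6  # n * pi_0j under the fair-die H0

# Pearson residual for each cell; squaring and summing recovers X^2
residuals = [(o - e) / sqrt(e) for o, e in zip(observed, expected)]
X2 = sum(r ** 2 for r in residuals)

print([round(r, 3) for r in residuals])  # face 4 (residual 2.236) contributes most
print(round(X2, 4))  # 9.2
```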
Example: Dice Rolls Section
Suppose that we roll a die 30 times and observe the following table showing the number of times each face ends up on top.
Face   Count
1      3
2      7
3      5
4      10
5      2
6      3
Total  30
We want to test the null hypothesis that the die is fair. Under this hypothesis, \(X \sim Mult\left(n = 30, \pi_0\right)\) where \(\pi_{0j} = 1/6\), for \(j=1,\ldots,6\). This is our assumed model, and under this \(H_0\), the expected counts are \(E_j = 30/6= 5\) for each cell. We now have what we need to calculate the goodnessoffit statistics:
\begin{eqnarray*} X^2 &= & \dfrac{(3-5)^2}{5}+\dfrac{(7-5)^2}{5}+\dfrac{(5-5)^2}{5}\\ & & +\dfrac{(10-5)^2}{5}+\dfrac{(2-5)^2}{5}+\dfrac{(3-5)^2}{5}\\ &=& 9.2 \end{eqnarray*}
\begin{eqnarray*} G^2 &=& 2\left(3\log\dfrac{3}{5}+7\log\dfrac{7}{5}+5\log\dfrac{5}{5}\right.\\ & & \left.+ 10\log\dfrac{10}{5}+2\log\dfrac{2}{5}+3\log\dfrac{3}{5}\right)\\ &=& 8.8 \end{eqnarray*}
Note that even though both have the same approximate chi-square distribution, the realized numerical values of \(X^2\) and \(G^2\) can be different. The \(p\)-values are \(P\left(\chi^{2}_{5} \ge 9.2\right) = .10\) and \(P\left(\chi^{2}_{5} \ge 8.8\right) = .12\). Given these \(p\)-values, with a significance level of \(\alpha=0.05\), we fail to reject the null hypothesis. But rather than concluding that \(H_0\) is true, we simply don't have enough evidence to conclude it's false. That is, the fair-die model doesn't fit the data exactly, but the fit isn't bad enough to conclude that the die is unfair, given our significance threshold of 0.05.
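The chi-square reference distribution is only an approximation, so one way to double-check the \(p\)-value is to simulate fair-die rolls directly. This Python sketch (an aside; the lesson's own code uses SAS and R) estimates \(P(X^2 \ge 9.2)\) under \(H_0\) by Monte Carlo:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed for reproducibility
n, k, reps = 30, 6, 10000
expected = n / k   # E_j = 5 under the fair-die model
x2_obs = 9.2       # observed Pearson statistic from the table above

exceed = 0
for _ in range(reps):
    # simulate 30 rolls of a fair die and compute the Pearson statistic
    counts = Counter(random.randrange(k) for _ in range(n))
    x2 = sum((counts.get(j, 0) - expected) ** 2 / expected for j in range(k))
    if x2 >= x2_obs - 1e-9:  # small tolerance to count exact ties
        exceed += 1

print(exceed / reps)  # close to the chi-square p-value of about 0.10
```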
Next, we show how to do this in SAS and R.
The following SAS code will perform the goodnessoffit test for the example above.
/*
 Example: die
*/
data die;
input face $ count;
datalines;
1 3
2 7
3 5
4 10
5 2
6 3
;
run;
proc freq;
weight count;
tables face/ chisq;
/*tables face /nocum all chisq testp=(16.67 16.67 16.67 16.67 16.67 16.67);*/
run;
/******** computation of residuals, deviance residuals, X2 and G2 ****/
data cal;
set die;
pi=1/6;
ecount=30*pi;
res=(count-ecount)/sqrt(ecount);
devres=sqrt(abs(2*count*log(count/ecount)))*sign(count-ecount);
run;
proc print;run;
proc sql;
select sum((count-ecount)**2/ecount) as X2,
1-probchi(calculated X2,5) as pval1,
2*sum(count*log(count/ecount)) as G2,
1-probchi(calculated G2,5) as pval2
from cal;
quit;
Output
The FREQ Procedure
face  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1     3          10.00    3                     10.00
2     7          23.33    10                    33.33
3     5          16.67    15                    50.00
4     10         33.33    25                    83.33
5     2          6.67     27                    90.00
6     3          10.00    30                    100.00
Chi-Square Test for Equal Proportions
Chi-Square    9.2000
DF            5
Pr > ChiSq    0.1013
Sample Size = 30
Obs  face  count  pi       ecount  res       devres
1    1     3      0.16667  5       -0.89443  -1.75070
2    2     7      0.16667  5        0.89443   2.17039
3    3     5      0.16667  5        0.00000   0.00000
4    4     10     0.16667  5        2.23607   3.72330
5    5     2      0.16667  5       -1.34164  -1.91446
6    6     3      0.16667  5       -0.89443  -1.75070
X2   pval1     G2        pval2
9.2  0.101348  8.778485  0.118233
Notice that PROC FREQ only computes the Pearson chi-square statistic and not the deviance statistic; the PROC SQL step above computes both.
Stop and Think!
Can you identify the relevant statistics and the \(p\)value in the output? What does the column labeled "Percent" represent?
The following R code, dice_rolls.R, will perform the same analysis as in SAS. The output will be saved into two files, dice_rolls.out and dice_rolls_Results.
###### Dice Rolls from Lesson 2: oneway tables & GOF
##### Line by line calculations in R
##### Nice R code that corresponds to SAS code and output
##########################################################
### if you want all output into a file use: sink("dice_roll.out")
sink("dice_roll.out")
### run a goodness of fit test
dice <- chisq.test(c(3,7,5,10,2,3))
dice
########OUTPUT gives Pearson chisquared
# Chisquared test for given probabilities
#
# data: c(3, 7, 5, 10, 2, 3)
# Xsquared = 9.2, df = 5, pvalue = 0.1013
########
### to get observed values
dice$observed
### to get expected values
dice$expected
### to get Pearson residuals
dice$residuals
#####Make the output print into a nice table ######
#### creating a table and giving labels to the columns
out <- round(cbind(1:6, dice$observed, dice$expected, dice$residuals),3)
out <- as.data.frame(out)
names(out) <- c("cell_j", "O_j", "E_j", "res_j")
### printing your table of results into a text file with tab separation
write.table(out, "dice_rolls_Results", row.names=FALSE, col.names=TRUE, sep="\t")
######### To get the deviance statistic and its p-value
G2=2*sum(dice$observed*log(dice$observed/dice$expected))
G2
1-pchisq(G2,5)
##deviance residuals
devres=sign(dice$observed-dice$expected)*sqrt(abs(2*dice$observed*log(dice$observed/dice$expected)))
devres
## to show that G2 is the sum of the (signed) squared deviance residuals
G2=sum(sign(dice$observed-dice$expected)*(sqrt(abs(2*dice$observed*log(dice$observed/dice$expected))))^2)
G2
########## If you want to specify explicitly the vector of probabilities
dice1 <- chisq.test(c(3,7,5,10,2,3), p=c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6))
dice1
############################################################
#### Nice R code that corresponds to SAS code and its output
## vector "face" records the face of the die obtained on each roll.
face=c(rep(1,3),rep(2,7),rep(3,5),rep(4,10),rep(5,2),rep(6,3))
## Freq Procedure
Percentage=100*as.vector(table(face))/sum(table(face))
CumulativeFrequency=cumsum(c(3,7,5,10,2,3))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(face),Percentage,CumulativeFrequency,CumulativePercentage)
row.names(Freq)=NULL
Freq
## ChiSquare Test for Equal Proportions
chisq.test(table(face))
### if you used sink("dice_roll.out"), and now want to see the output inside the console use: sink()
sink()
Output
[1] 3 7 5 10 2 3
[1] 5 5 5 5 5 5
[1] -0.8944272  0.8944272  0.0000000  2.2360680 -1.3416408 -0.8944272
[1] 8.778485
[1] 0.1182326
Chisquared test for given probabilities
data: c(3, 7, 5, 10, 2, 3)
Xsquared = 9.2, df = 5, pvalue = 0.1013
face Freq Percentage CumulativeFrequency CumulativePercentage
1 1 3 10.000000 3 10.00000
2 2 7 23.333333 10 33.33333
3 3 5 16.666667 15 50.00000
4 4 10 33.333333 25 83.33333
5 5 2 6.666667 27 90.00000
6 6 3 10.000000 30 100.00000
Stop and Think!
Can you identify the relevant statistics and the \(p\)value in the output? What does the column labeled "Percentage" in dice_rolls.out represent?
Example: Tomato Phenotypes Section
Tall cut-leaf tomatoes were crossed with dwarf potato-leaf tomatoes, and n = 1611 offspring were classified by their phenotypes.
Phenotypes         Count
tall cut-leaf      926
tall potato-leaf   288
dwarf cut-leaf     293
dwarf potato-leaf  104
Genetic theory says that the four phenotypes should occur with relative frequencies 9 : 3 : 3 : 1, and thus are not all equally likely to be observed; the dwarf potato-leaf phenotype is expected to be the least common. Do the observed data support this theory?
Under the null hypothesis, the probabilities are
\(\pi_1 = 9/16 , \pi_2 = \pi_3 = 3/16 , \pi_4 = 1/16\)
and the expected frequencies are
\(E_1 = 1611(9/16) = 906.2, E_2 = E_3 = 1611(3/16) = 302.1,\text{ and }E_4 = 1611(1/16) = 100.7\).
We calculate the fit statistics and find that \(X^2 = 1.47\) and \(G^2 = 1.48\), which are nearly identical. The \(p\)-values based on the \(\chi^2\) distribution with 3 degrees of freedom are approximately equal to 0.69. Therefore, we fail to reject the null hypothesis; the data are consistent with the genetic theory.
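The arithmetic can be verified in a few lines; here is a Python sketch (an aside from the lesson's SAS/R code) reproducing both statistics:

```python
from math import log

# Observed phenotype counts: tall cut-leaf, tall potato-leaf,
# dwarf cut-leaf, dwarf potato-leaf
observed = [926, 288, 293, 104]
n = sum(observed)                          # 1611 offspring
pi0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]     # 9 : 3 : 3 : 1 genetic theory
expected = [n * p for p in pi0]            # 906.2, 302.1, 302.1, 100.7

X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
G2 = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
print(round(X2, 3), round(G2, 3))  # 1.469 1.478
```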
Here is the above analysis done in SAS.
/*
 Example: cutleaf
*/
data leaf;
input type $ count;
datalines;
tallc 926
tallp 288
dwarf 293
dwarfp 104
;
proc freq order=data;
weight count;
tables type /all testp=(56.25, 18.75, 18.75, 6.25);
run;
/******** computation of residuals, deviance residuals, X2 and G2 ****/
data pi;
input freq;
pi=freq/16;
cards;
9
3
3
1
;
run;
data cal;
merge leaf pi;
ecount=1611*pi;
res=(count-ecount)/sqrt(ecount);
devres=sqrt(abs(2*count*log(count/ecount)))*sign(count-ecount);
run;
proc print;run;
proc sql;
select sum((count-ecount)**2/ecount) as X2,
1-probchi(calculated X2, 3) as pval1,
2*sum(count*log(count/ecount)) as G2,
1-probchi(calculated G2,3) as pval2
from cal;
quit;
/*************plot observed values and expected values**********/
goption reset=all ;
symbol1 v=circle c=blue I=none;
symbol2 v=circle c=red I=none;
Legend1 label=(' ') value=('observed count' 'expected count')
across=1 down=2
position = (top right inside)
mode=protect;
title "Tomato Phenotypes";
proc gplot data=cal;
plot count*type ecount*type /overlay legend=legend1 ;
run;
quit;
title;
Output
The FREQ Procedure
type    Frequency  Percent  Test Percent  Cumulative Frequency  Cumulative Percent
tallc   926        57.48    56.25         926                   57.48
tallp   288        17.88    18.75         1214                  75.36
dwarf   293        18.19    18.75         1507                  93.54
dwarfp  104        6.46     6.25          1611                  100.00
Chi-Square Test for Specified Proportions
Chi-Square    1.4687
DF            3
Pr > ChiSq    0.6895
Sample Size = 1611
Obs  type    count  freq  pi      ecount   res       devres
1    tallc   926    9     0.5625  906.188   0.65816   6.32891
2    tallp   288    3     0.1875  302.063  -0.80912  -5.24022
3    dwarf   293    3     0.1875  302.063  -0.52143  -4.22497
4    dwarfp  104    1     0.0625  100.688   0.33012   2.59476
X2        pval1     G2        pval2
1.468722  0.689508  1.477587  0.687453
Here is how to do the computations in R using the following code:
#### Lesson 2 CutLeaf example
#### (I) Basic GOF line by line calculation
#### (II) Doing GOF with chisq.test()
#### (III) Nice R code that corresponds to SAS code and output
#########################################################
##(I) Basic GOF line by line computation to demonstrate formulas
ob=c(926,288,293,104) ## data
ex=1611*c(9,3,3,1)/16 ## expected counts under the assumed model
X2=sum((ob-ex)^2/ex) ## X2 statistic
X2
#### Output so you check against your output
#[1] 1.468722
1-pchisq(X2,3) ## p-value
#### Output
#[1] 0.6895079
G2=2*sum(ob*log(ob/ex)) ## deviance statistic
G2
#### Output
#[1] 1.477587
1-pchisq(G2,3) ## p-value
#### Output
#[1] 0.6874529
########################################################
#### (II) Using the builtin chisq.test function in R
tomato=chisq.test(c(926, 288,293,104), p=c(9/16, 3/16, 3/16, 1/16))
tomato
tomato$statistic
tomato$p.value
tomato$residuals
#### To get G2
G2=2*sum(tomato$observed*log(tomato$observed/tomato$expected))
G2
1-pchisq(G2,3)
##deviance residuals
devres=sign(tomato$observed-tomato$expected)*sqrt(abs(2*tomato$observed*log(tomato$observed/tomato$expected)))
devres
##### Creating a nice output
### Cell id, Observed count, Expected count, Pearson residuals, Deviance residual
out <- round(cbind(1:4, tomato$observed, tomato$expected, tomato$residuals, devres),3)
out <- as.data.frame(out)
names(out) <- c("cell_j", "O_j", "E_j", "res_j", "dev_j")
out
#### printing your table of results into a text file with tab separation
write.table(out, "tomato_Results", row.names=FALSE, col.names=TRUE, sep="\t")
#### Plotting expected and observed values
plot(c(1:4), tomato$observed, xlab="cell index", ylab="counts", xlim=c(0,5))
points(tomato$expected, pch=3, col="red")
legend(3,700, c("observed", "expected"), col=c(1,"red"), pch=c(1,3))
########################################################
#### (III) Nice R code that corresponds to SAS code and output
type=c(rep("tallc",926),rep("tallp",288),rep("dwarf",293),rep("dwarfp",104))
##Please note the table R provides has different order of rows
##from that provided by SAS
table(type)
## Freq Procedure
Percentage=100*as.vector(table(type))/sum(table(type))
CumulativeFrequency=cumsum(c(926,288,293,104))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(type),Percentage,CumulativeFrequency,CumulativePercentage)
Freq
## ChiSquare Test for Specified Proportions
p=c(18.75,6.25,56.25,18.75)
chisq.test(table(type),p=p/sum(p))
########################################################
This has stepbystep calculations and also uses chisq.test() to produce output with Pearson and deviance residuals.
Stop and Think!
Do you recall what the residuals are from linear regression? How would you define them in this context? What do they tell you about the tomato example?