A goodness-of-fit test, in general, measures how well the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking the model fit. As in linear regression, the goodness-of-fit test in essence compares the observed values to the expected (fitted or predicted) values.
A goodness-of-fit statistic tests the following hypothesis:
\(H_0\colon\) the model \(M_0\) fits
vs.
\(H_A\colon\) the model \(M_0\) does not fit (or, some other model \(M_A\) fits)
Most often the observed data represent the fit of the saturated model, the most complex model possible with the given data. Thus, most often the alternative hypothesis \(\left(H_A\right)\) will represent the saturated model \(M_A\), which fits perfectly because each observation has a separate parameter. Later in the course, we will see that \(M_A\) could be a model other than the saturated one. Let us now consider the simplest example of the goodness-of-fit test with categorical data.
In the setting for one-way tables, we measure how well an observed variable \(X\) corresponds to a \(Mult\left(n, \pi\right)\) model for some vector of cell probabilities, \(\pi\). We will consider two cases:
 when vector \(\pi\) is known, and
 when vector \(\pi\) is unknown.
In other words, we assume that under the null hypothesis the data come from a \(Mult\left(n, \pi\right)\) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodness-of-fit test is applied to corroborate our assumption.
Consider our dice example from Lesson 1. We want to test the hypothesis that the six faces are equally probable by comparing the observed frequencies to those expected under the assumed model: \(X \sim Mult(n = 30, \pi_0)\), where \(\pi_0=(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)\). We can think of this as simultaneously testing whether the probability in each cell equals a specified value:
\(H_0:\pi =\pi_0\)
where the alternative hypothesis is that any of these elements differ from the null value. There are two statistics available for this test.
Test Statistics Section
Pearson Goodness-of-fit Test Statistic
The Pearson goodness-of-fit statistic is
\(X^2=\sum\limits_{j=1}^k \dfrac{(X_j-n\pi_{0j})^2}{n\pi_{0j}}\)
An easy way to remember it is
\(X^2=\sum\limits_{j=1}^k \dfrac{(O_j-E_j)^2}{E_j}\)
where \(O_j = X_j\) is the observed count in cell \(j\), and \(E_j=E(X_j)=n\pi_{0j}\) is the expected count in cell \(j\) under the assumption that the null hypothesis is true.
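As a quick numerical check of the formula, here is a sketch in Python (an aside; the lesson's own code uses SAS and R) with the dice counts that appear in the example later in this section:

```python
# Pearson goodness-of-fit statistic: X^2 = sum_j (O_j - E_j)^2 / E_j
# Observed counts are from the dice example in this lesson; under H0
# each face has probability 1/6, so E_j = n * (1/6) for every cell.
observed = [3, 7, 5, 10, 2, 3]
n = sum(observed)                     # 30 rolls in total
expected = [n * (1 / 6)] * 6          # E_j = 5 for each of the 6 cells
X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(X2, 4))  # 9.2
```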
Deviance Test Statistic
The deviance statistic is
\(G^2=2\sum\limits_{j=1}^k X_j \log\left(\dfrac{X_j}{n\pi_{0j}}\right) =2\sum\limits_j O_j \log\left(\dfrac{O_j}{E_j}\right)\)
In some texts, \(G^2\) is also called the likelihood-ratio test (LRT) statistic, for comparing the log-likelihoods \(L_0\) and \(L_1\) of two models under \(H_0\) (reduced model) and \(H_A\) (full model), respectively:
Likelihood-ratio Test Statistic
\(G^2 = -2\log\left(\dfrac{\ell_0}{\ell_1}\right) = -2\left(L_0 - L_1\right)\)
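To see that the deviance form and the LRT form agree, one can compute \(G^2\) both ways. A Python sketch (an aside, not part of the lesson's SAS/R code) using the dice counts from the example below; the multinomial coefficient is dropped from both log-likelihoods since it is identical under both models and cancels in the difference:

```python
from math import log

observed = [3, 7, 5, 10, 2, 3]
n = sum(observed)
pi0 = [1 / 6] * 6

# Deviance form: G^2 = 2 * sum_j O_j log(O_j / (n pi_0j))
G2 = 2 * sum(o * log(o / (n * p)) for o, p in zip(observed, pi0))

# LRT form: G^2 = -2 (L_0 - L_1), with the multinomial coefficient dropped
L0 = sum(o * log(p) for o, p in zip(observed, pi0))   # reduced model H0
L1 = sum(o * log(o / n) for o in observed)            # saturated model
G2_lrt = -2 * (L0 - L1)

print(round(G2, 2), round(G2_lrt, 2))  # 8.78 8.78
```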
Note that \(X^2\) and \(G^2\) are both functions of the observed data \(X\) and a vector of probabilities \(\pi_0\). For this reason, we will sometimes write them as \(X^2\left(x, \pi_0\right)\) and \(G^2\left(x, \pi_0\right)\), respectively; when there is no ambiguity, however, we will simply use \(X^2\) and \(G^2\). We will be dealing with these statistics throughout the course in the analysis of two-way and \(k\)-way tables and when assessing the fit of log-linear and logistic regression models.
Testing the Goodness-of-Fit Section
\(X^2\) and \(G^2\) both measure how closely the model, in this case \(Mult\left(n,\pi_0\right)\), "fits" the observed data. Both have an approximate chi-square distribution with \(k-1\) degrees of freedom when \(H_0\) is true. This allows us to use the chi-square distribution to find critical values and \(p\)-values for establishing statistical significance.
 If the sample proportions \(\hat{\pi}_j\) (i.e., saturated model) are exactly equal to the model's \(\pi_{0j}\) for cells \(j = 1, 2, \dots, k,\) then \(O_j = E_j\) for all \(j\), and both \(X^2\) and \(G^2\) will be zero. That is, the model fits perfectly.
 If the sample proportions \(\hat{\pi}_j\) deviate from the \(\pi_{0j}\)s, then \(X^2\) and \(G^2\) are both positive. Large values of \(X^2\) and \(G^2\) mean that the data do not agree well with the assumed/proposed model \(M_0\).
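A useful byproduct of the Pearson statistic is the set of cell-wise Pearson residuals \(r_j=(O_j-E_j)/\sqrt{E_j}\), since \(X^2=\sum_j r_j^2\); the residuals show which cells drive the lack of fit. A small Python illustration (an aside, not part of the lesson's SAS/R code) with the dice counts used in the example below:

```python
from math import sqrt

observed = [3, 7, 5, 10, 2, 3]
expected = [5] * 6  # n * pi_0j under the fair-die H0

# Pearson residual for each cell; squaring and summing recovers X^2
residuals = [(o - e) / sqrt(e) for o, e in zip(observed, expected)]
X2 = sum(r ** 2 for r in residuals)

print([round(r, 3) for r in residuals])  # face 4 (residual 2.236) contributes most
print(round(X2, 4))  # 9.2
```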
Example: Dice Rolls Section
Suppose that we roll a die 30 times and observe the following table showing the number of times each face ends up on top.
Face   Count
1      3
2      7
3      5
4      10
5      2
6      3
Total  30
We want to test the null hypothesis that the die is fair. Under this hypothesis, \(X \sim Mult\left(n = 30, \pi_0\right)\) where \(\pi_{0j} = 1/6\), for \(j=1,\ldots,6\). This is our assumed model, and under this \(H_0\), the expected counts are \(E_j = 30/6= 5\) for each cell. We now have what we need to calculate the goodnessoffit statistics:
\begin{eqnarray*} X^2 &= & \dfrac{(3-5)^2}{5}+\dfrac{(7-5)^2}{5}+\dfrac{(5-5)^2}{5}\\ & & +\dfrac{(10-5)^2}{5}+\dfrac{(2-5)^2}{5}+\dfrac{(3-5)^2}{5}\\ &=& 9.2 \end{eqnarray*}
\begin{eqnarray*} G^2 &=& 2\left(3\log\dfrac{3}{5}+7\log\dfrac{7}{5}+5\log\dfrac{5}{5}\right.\\ & & \left.+ 10\log\dfrac{10}{5}+2\log\dfrac{2}{5}+3\log\dfrac{3}{5}\right)\\ &=& 8.8 \end{eqnarray*}
Note that even though both have the same approximate chi-square distribution, the realized numerical values of \(X^2\) and \(G^2\) can be different. The \(p\)-values are \(P\left(\chi^{2}_{5} \ge 9.2\right) = .10\) and \(P\left(\chi^{2}_{5} \ge 8.8\right) = .12\). Given these \(p\)-values, with a significance level of \(\alpha=0.05\), we fail to reject the null hypothesis. But rather than concluding that \(H_0\) is true, we simply don't have enough evidence to conclude it's false. That is, the fair-die model doesn't fit the data exactly, but the fit isn't bad enough to conclude that the die is unfair, given our significance threshold of 0.05.
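The chi-square reference distribution is only an approximation, so one way to double-check the \(p\)-value is to simulate fair-die rolls directly. This Python sketch (an aside; the lesson's own code uses SAS and R) estimates \(P(X^2 \ge 9.2)\) under \(H_0\) by Monte Carlo:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed for reproducibility
n, k, reps = 30, 6, 10000
expected = n / k   # E_j = 5 under the fair-die model
x2_obs = 9.2       # observed Pearson statistic from the table above

exceed = 0
for _ in range(reps):
    # simulate 30 rolls of a fair die and compute the Pearson statistic
    counts = Counter(random.randrange(k) for _ in range(n))
    x2 = sum((counts.get(j, 0) - expected) ** 2 / expected for j in range(k))
    if x2 >= x2_obs - 1e-9:  # small tolerance to count exact ties
        exceed += 1

print(exceed / reps)  # close to the chi-square p-value of about 0.10
```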
Next, we show how to do this in SAS and R.
The following SAS code will perform the goodnessoffit test for the example above.
/*
 Example: die
*/
data die;
input face $ count;
datalines;
1 3
2 7
3 5
4 10
5 2
6 3
;
run;
proc freq;
weight count;
tables face/ chisq;
/*tables face /nocum all chisq testp=(16.67 16.67 16.67 16.67 16.67 16.67);*/
run;
/******** computation of residuals, deviance residuals, X2 and G2 ****/
data cal;
set die;
pi=1/6;
ecount=30*pi;
res=(count-ecount)/sqrt(ecount);
devres=sqrt(abs(2*count*log(count/ecount)))*sign(count-ecount);
run;
proc print;run;
proc sql;
select sum((count-ecount)**2/ecount) as X2,
1-probchi(calculated X2,5) as pval1,
2*sum(count*log(count/ecount)) as G2,
1-probchi(calculated G2,5) as pval2
from cal;
quit;
Output
The FREQ Procedure
face  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1     3          10.00    3                     10.00
2     7          23.33    10                    33.33
3     5          16.67    15                    50.00
4     10         33.33    25                    83.33
5     2          6.67     27                    90.00
6     3          10.00    30                    100.00
Chi-Square Test for Equal Proportions
Chi-Square    9.2000
DF            5
Pr > ChiSq    0.1013
Sample Size = 30
Obs  face  count  pi       ecount  res       devres
1    1     3      0.16667  5       -0.89443  -1.75070
2    2     7      0.16667  5        0.89443   2.17039
3    3     5      0.16667  5        0.00000   0.00000
4    4     10     0.16667  5        2.23607   3.72330
5    5     2      0.16667  5       -1.34164  -1.91446
6    6     3      0.16667  5       -0.89443  -1.75070
X2   pval1     G2        pval2
9.2  0.101348  8.778485  0.118233
Notice that PROC FREQ only computes the Pearson chi-square statistic and not the deviance statistic; the PROC SQL step above computes both.
Stop and Think!
Can you identify the relevant statistics and the \(p\)value in the output? What does the column labeled "Percent" represent?
The following R code, dice_rolls.R, will perform the same analysis as in SAS. The output will be saved into two files, dice_rolls.out and dice_rolls_Results.
###### Dice Rolls from Lesson 2: oneway tables & GOF
##### Line by line calculations in R
##### Nice R code that corresponds to SAS code and output
##########################################################
### if you want all output into a file use: sink("dice_roll.out")
sink("dice_roll.out")
### run a goodness of fit test
dice <- chisq.test(c(3,7,5,10,2,3))
dice
########OUTPUT gives Pearson chisquared
# Chisquared test for given probabilities
#
# data: c(3, 7, 5, 10, 2, 3)
# Xsquared = 9.2, df = 5, pvalue = 0.1013
########
### to get observed values
dice$observed
### to get expected values
dice$expected
### to get Pearson residuals
dice$residuals
#####Make the output print into a nice table ######
#### creating a table and giving labels to the columns
out <- round(cbind(1:6, dice$observed, dice$expected, dice$residuals),3)
out <- as.data.frame(out)
names(out) <- c("cell_j", "O_j", "E_j", "res_j")
### printing your table of results into a text file with tab separation
write.table(out, "dice_rolls_Results", row.names=FALSE, col.names=TRUE, sep="\t")
######### To get the deviance statistic and its p-value
G2=2*sum(dice$observed*log(dice$observed/dice$expected))
G2
1-pchisq(G2,5)
##deviance residuals
devres=sign(dice$observed-dice$expected)*sqrt(abs(2*dice$observed*log(dice$observed/dice$expected)))
devres
## to show that G2 is the sum of the (signed) squared deviance residuals
G2=sum(sign(dice$observed-dice$expected)*(sqrt(abs(2*dice$observed*log(dice$observed/dice$expected))))^2)
G2
########## If you want to specify explicitly the vector of probabilities
dice1 <- chisq.test(c(3,7,5,10,2,3), p=c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6))
dice1
############################################################
#### Nice R code that corresponds to SAS code and its output
## vector "face" records the face of the die obtained on each roll.
face=c(rep(1,3),rep(2,7),rep(3,5),rep(4,10),rep(5,2),rep(6,3))
## Freq Procedure
Percentage=100*as.vector(table(face))/sum(table(face))
CumulativeFrequency=cumsum(c(3,7,5,10,2,3))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(face),Percentage,CumulativeFrequency,CumulativePercentage)
row.names(Freq)=NULL
Freq
## ChiSquare Test for Equal Proportions
chisq.test(table(face))
### if you used sink("dice_roll.out"), and now want to see the output inside the console use: sink()
sink()
Output
[1] 3 7 5 10 2 3
[1] 5 5 5 5 5 5
[1] -0.8944272  0.8944272  0.0000000  2.2360680 -1.3416408 -0.8944272
[1] 8.778485
[1] 0.1182326
Chisquared test for given probabilities
data: c(3, 7, 5, 10, 2, 3)
Xsquared = 9.2, df = 5, pvalue = 0.1013
face Freq Percentage CumulativeFrequency CumulativePercentage
1 1 3 10.000000 3 10.00000
2 2 7 23.333333 10 33.33333
3 3 5 16.666667 15 50.00000
4 4 10 33.333333 25 83.33333
5 5 2 6.666667 27 90.00000
6 6 3 10.000000 30 100.00000
Stop and Think!
Can you identify the relevant statistics and the \(p\)value in the output? What does the column labeled "Percentage" in dice_rolls.out represent?
Example: Tomato Phenotypes Section
Tall cut-leaf tomatoes were crossed with dwarf potato-leaf tomatoes, and n = 1611 offspring were classified by their phenotypes.
Phenotypes         Count
tall cut-leaf      926
tall potato-leaf   288
dwarf cut-leaf     293
dwarf potato-leaf  104
Genetic theory says that the four phenotypes should occur with relative frequencies 9 : 3 : 3 : 1, and thus are not all equally likely to be observed; the dwarf potato-leaf phenotype is expected to be the least common. Do the observed data support this theory?
Under the null hypothesis, the probabilities are
\(\pi_1 = 9/16 , \pi_2 = \pi_3 = 3/16 , \pi_4 = 1/16\)
and the expected frequencies are
\(E_1 = 1611(9/16) = 906.2, E_2 = E_3 = 1611(3/16) = 302.1,\text{ and }E_4 = 1611(1/16) = 100.7\).
We calculate the fit statistics and find that \(X^2 = 1.47\) and \(G^2 = 1.48\), which are nearly identical. The \(p\)-values based on the \(\chi^2\) distribution with 3 degrees of freedom are approximately equal to 0.69. Therefore, we fail to reject the null hypothesis; the data are consistent with the genetic theory.
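The arithmetic can be verified in a few lines; here is a Python sketch (an aside from the lesson's SAS/R code) reproducing both statistics:

```python
from math import log

# Observed phenotype counts: tall cut-leaf, tall potato-leaf,
# dwarf cut-leaf, dwarf potato-leaf
observed = [926, 288, 293, 104]
n = sum(observed)                          # 1611 offspring
pi0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]     # 9 : 3 : 3 : 1 genetic theory
expected = [n * p for p in pi0]            # 906.2, 302.1, 302.1, 100.7

X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
G2 = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
print(round(X2, 3), round(G2, 3))  # 1.469 1.478
```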
Here is the above analysis done in SAS.
/*
 Example: cutleaf
*/
data leaf;
input type $ count;
datalines;
tallc 926
tallp 288
dwarf 293
dwarfp 104
;
proc freq order=data;
weight count;
tables type /all testp=(56.25, 18.75, 18.75, 6.25);
run;
/******** computation of residuals, deviance residuals, X2 and G2 ****/
data pi;
input freq;
pi=freq/16;
cards;
9
3
3
1
;
run;
data cal;
merge leaf pi;
ecount=1611*pi;
res=(count-ecount)/sqrt(ecount);
devres=sqrt(abs(2*count*log(count/ecount)))*sign(count-ecount);
run;
proc print;run;
proc sql;
select sum((count-ecount)**2/ecount) as X2,
1-probchi(calculated X2, 3) as pval1,
2*sum(count*log(count/ecount)) as G2,
1-probchi(calculated G2,3) as pval2
from cal;
quit;
/*************plot observed values and expected values**********/
goption reset=all ;
symbol1 v=circle c=blue I=none;
symbol2 v=circle c=red I=none;
Legend1 label=(' ') value=('observed count' 'expected count')
across=1 down=2
position = (top right inside)
mode=protect;
title "Tomato Phenotypes";
proc gplot data=cal;
plot count*type ecount*type /overlay legend=legend1 ;
run;
quit;
title;
Output
The FREQ Procedure
type    Frequency  Percent  Test Percent  Cumulative Frequency  Cumulative Percent
tallc   926        57.48    56.25         926                   57.48
tallp   288        17.88    18.75         1214                  75.36
dwarf   293        18.19    18.75         1507                  93.54
dwarfp  104        6.46     6.25          1611                  100.00
Chi-Square Test for Specified Proportions
Chi-Square    1.4687
DF            3
Pr > ChiSq    0.6895
Sample Size = 1611
Obs  type    count  freq  pi      ecount   res       devres
1    tallc   926    9     0.5625  906.188   0.65816   6.32891
2    tallp   288    3     0.1875  302.063  -0.80912  -5.24022
3    dwarf   293    3     0.1875  302.063  -0.52143  -4.22497
4    dwarfp  104    1     0.0625  100.688   0.33012   2.59476
X2        pval1     G2        pval2
1.468722  0.689508  1.477587  0.687453
Here is how to do the computations in R using the following code:
#### Lesson 2 CutLeaf example
#### (I) Basic GOF line by line calculation
#### (II) Doing GOF with chisq.test()
#### (III) Nice R code that corresponds to SAS code and output
#########################################################
##(I) Basic GOF line by line computation to demonstrate formulas
ob=c(926,288,293,104) ## data
ex=1611*c(9,3,3,1)/16 ## expected counts under the assumed model
X2=sum((ob-ex)^2/ex) ## X2 statistic
X2
#### Output so you check against your output
#[1] 1.468722
1-pchisq(X2,3) ## p-value
#### Output
#[1] 0.6895079
G2=2*sum(ob*log(ob/ex)) ## deviance statistic
G2
#### Output
#[1] 1.477587
1-pchisq(G2,3) ## p-value
#### Output
#[1] 0.6874529
########################################################
#### (II) Using the builtin chisq.test function in R
tomato=chisq.test(c(926, 288,293,104), p=c(9/16, 3/16, 3/16, 1/16))
tomato
tomato$statistic
tomato$p.value
tomato$residuals
#### To get G2
G2=2*sum(tomato$observed*log(tomato$observed/tomato$expected))
G2
1-pchisq(G2,3)
##deviance residuals
devres=sign(tomato$observed-tomato$expected)*sqrt(abs(2*tomato$observed*log(tomato$observed/tomato$expected)))
devres
##### Creating a nice output
### Cell id, Observed count, Expected count, Pearson residuals, Deviance residual
out <- round(cbind(1:4, tomato$observed, tomato$expected, tomato$residuals, devres),3)
out <- as.data.frame(out)
names(out) <- c("cell_j", "O_j", "E_j", "res_j", "dev_j")
out
#### printing your table of results into a text file with tab separation
write.table(out, "tomato_Results", row.names=FALSE, col.names=TRUE, sep="\t")
#### Plotting expected and observed values
plot(c(1:4), tomato$observed, xlab="cell index", ylab="counts", xlim=c(0,5))
points(tomato$expected, pch=3, col="red")
legend(3,700, c("observed", "expected"), col=c(1,"red"), pch=c(1,3))
########################################################
#### (III) Nice R code that corresponds to SAS code and output
type=c(rep("tallc",926),rep("tallp",288),rep("dwarf",293),rep("dwarfp",104))
##Please note the table R provides has different order of rows
##from that provided by SAS
table(type)
## Freq Procedure
Percentage=100*as.vector(table(type))/sum(table(type))
CumulativeFrequency=cumsum(c(926,288,293,104))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(type),Percentage,CumulativeFrequency,CumulativePercentage)
Freq
## ChiSquare Test for Specified Proportions
p=c(18.75,6.25,56.25,18.75)
chisq.test(table(type),p=p/sum(p))
########################################################
This has stepbystep calculations and also uses chisq.test() to produce output with Pearson and deviance residuals.
Stop and Think!
Do you recall what the residuals are from linear regression? How would you define them in this context? What do they tell you about the tomato example?