2.6  GoodnessofFit Tests: Unspecified Parameters
2.6  GoodnessofFit Tests: Unspecified ParametersFor many statistical models, we do not know the vector of probabilities \(\pi\) a priori, but can only specify it up to some unknown parameters. More specifically, the cell proportions may be known functions of one or more other unknown parameters. For example, suppose that a gene is either dominant (A) or recessive (a), and the overall proportion of dominant genes in the population is \(p\). If we assume mating is random (i.e. members of the population choose their mates in a manner that is completely unrelated to this gene), then the three possible genotypes—AA, Aa, and aa—should occur in the socalled HardyWeinberg proportions:
genotype  proportion  no. of dominant genes 

AA  \(\pi_1 = p^2\)  2 
Aa  \(\pi_2 = 2p(1 − p)\)  1 
aa  \(\pi_3 = (1 − p)^2\)  0 
This is equivalent to saying that the number of dominant genes that an individual has (0, 1, or 2) is distributed as \(Bin(2,p)\), where the parameter \(p\) is not specified. We have to first estimate it in order to obtain the expected counts under this model.
In such an example, the null hypothesis specifies some constraint on the parameters but stops short of providing their values specifically. In more general notation, the model specifies that:
\(\pi_1 = g_1(\theta)\),
\(\pi_2 = g_2(\theta)\),
\(\vdots\)
\(\pi_k = g_k(\theta)\),
where \(g_1, \ldots, g_k\) are known functions, but the parameter \(\theta\) is unknown (e.g., \(p\) in the example above). Let \(S_0\) denote the set of all \(\pi\) that satisfy these constraints for some parameter \(\theta\). We want to test
\(H_0 \colon \pi \in S_0\) versus \(H_A \colon \pi \in S\)
where \(S\) denotes the probability simplex (the space) of all possible values of \(\pi\). (Notice that \(S\) is a \((k1)\)dimensional space, but the dimension of \(S_0\) is the number of free parameters in \(\theta\).) The method for conducting this test is as follows:
 Estimate \(\theta\) by an efficient method (e.g., maximum likelihood), and denote the estimate by \(\hat{\theta}\).
 Calculate (estimated) expected counts under the null hypothesis by \(n\hat{\pi}_{0j}\), where \(\hat{\pi}_{0j} = g_j(\hat{\theta})\), for \(j=0,\ldots,k\). We may still denote these as \(E_j\), even though they involve an estimated parameter.
 Calculate the goodnessoffit statistics \(X^2(x,\hat{\pi}_0)\) and \(G^2(x,\hat{\pi}_0)\) using the usual formulas from the \(O_j\) and the \(E_j\).
That is, once the \(E_j\) are obtained from the estimated parameter(s), we use the usual formulas:
\(X^2=\sum\limits_j \dfrac{(O_jE_j)^2}{E_j}\) and \(G^2=2\sum\limits_j O_j \log\dfrac{O_j}{E_j}\)
If \(X^2\) and \(G^2\) are calculated as described above, then under the null hypothesis, both are approximately \(\chi^{2}_{\nu}\), where \(\nu\) equals the number of unknown parameters under the alternative hypothesis minus the number of unknown parameters under the null hypothesis, \(\nu = (k − 1) − d\) where \(d = dim(\theta)\), i.e., the number of parameters that required estimation.
Example: Number of Children (the Poisson model)
Suppose that we observe the following numbers of children in \(n=100\) families:
no. of children:  0  1  2  3  4+ 

count:  19  26  29  13  13 
Are these data consistent with a Poisson distribution? Recall that if a random variable \(Y\) has a Poisson distribution with mean \(\lambda\), then
\(P(Y=y)=\dfrac{\lambda^y e^{\lambda}}{y!}\)
for \(y=0,1,\ldots\). Therefore, under the Poisson model, the proportions given some unknown \(\lambda\), are provided in the table below. For example, \(\pi_1=P(Y=0)=\dfrac{\lambda^0 e^{\lambda}}{0!}=e^{\lambda}\).
no. of children  proportion 

0  \(\pi_1 = e^{−λ}\) 
1  \(\pi_2 = \lambda e^{−λ}\) 
2  \(\pi_3 = \dfrac{\lambda^2e^{−\lambda}}{2}\) 
3  \(\pi_4 = \dfrac{\lambda^3e^{−\lambda}}{6}\) 
4+  \(\pi_5 = 1 − \sum^{4}_{j=1} \pi_j\) 
Are the data below consistent with a Poisson model?
no. of children  0  1  2  3  4+ 

count  19  26  29  13  13 
Let's test the null hypothesis that these data are Poisson. First, we need to estimate \(\lambda\), the mean of the Poisson distribution, and thus here \(d=1\). Recall that if we have an iid sample \(y_1, \ldots , y_n\) from a Poisson distribution, then the ML estimate of \(\lambda\) is just the sample mean, \(\hat{\lambda}=\overline{Y}\). Based on the table above, we know that the original data \(y_1,\ldots , y_n\) contained 19 values of 0, 26 values of 1, and so on; however, we don't know the exact values of the original data that fell into the category 4+. To make matters easy, suppose that of the 13 values that were classified as 4+, ten were equal to 4 and three were equal to 5. Then the ML estimate of \(\lambda\) is (the sample mean)
\(\hat{\lambda}=\dfrac{19(0)+26(1)+29(2)+13(3)+10(4)+3(5)}{100}=1.78\)
With this value of \(\lambda\), the (estimated) expected counts for the first four cells (0, 1, 2, and 3 children, respectively) are
 \(E_1 = 100e−1.78 = 16.86\)
 \(E_2 = 100(1.78)^{e−1.78} = 30.02\)
 \(E_3 = 100(1.78)^{2e−1.78/2} = 26.72\)
 \(E_4 = 100(1.78)^{3e−1.78/6} = 15.85\)
The expected count for the 4+ cell is most easily found by noting that \(\sum_j E_j = n\), and thus
\(E_5 = 100 − (16.86 + 30.02 + 26.72 + 15.85) = 10.55\)
This leads to:
\(X^2 = 2.08\) and \(G^2 = 2.09\).
Since the multinomial model here has \(k=5\) categories (with the usual sumto\(n\) constraint), and the Poisson model required one parameter \(\lambda\) to be estimated, the degrees of freedom for this test are \(\nu = 511 = 3\), and the \(p\)values are
\(P(\chi^2_3\geq2.08)=0.56\)
\(P(\chi^2_3\geq2.09)=0.55\)
The Poisson distribution seems to be a reasonable model for these data; at least, there is not significant evidence to say otherwise. Below is an example of these computations using R and SAS.
Here is this goodnessoffit test in SAS, using precalculated proportions (\(\hat{\pi}_{0}\)). The TESTP option specifies expected proportions for a oneway table chisquare test. But notice in the output below, that the degrees of freedom, and thus the pvalue, are not correct:
/*Test of proportions for number of childern example*/
/*To test the poisson model for specified probabilities*/
options ls=90 nocenter nodate;
DATA children;
INPUT children $ count;
DATALINES;
0 19
1 26
2 29
3 13
4+ 13
;
RUN;
PROC FREQ DATA=children ORDER=data;
WEIGHT count;
tables children/chisq testp=(16.86, 30.02, 26.72, 15.86, 10.55);
RUN;
Output
ChiSquare Test for Specified Proportions 


ChiSquare  2.0892 
DF  4 
Pr > ChiSq  0.7194 
You can use this \(X^2\) statistic, but need to calculate the new pvalue based on the correct degrees of freedom, in order to obtain correct inference.
Here is how to fit the Poisson Model in R using the following code :
########################################################
#### Number of Children Example
#### Basic Poisson calculations broken down line by line
#### Nicer R code that corresponds to SAS code and its ouput
#########################################################
#### input data
ob<c(19,26,29,13,13)
ob
# [1] 19 26 29 13 13
#### find estimated expected probabilities
lambdahat<c(19*0+26*1+29*2+13*3+10*4+3*5)/100
lambdahat
# [1] 1.78
kids<c(0,1,2,3)
pihat<dpois(kids,lambdahat)
pihat
# [1] 0.1686381 0.3001759 0.2671566 0.1585129
#### attach the probability for the 4+ cell
pihat<c(pihat,1sum(pihat))
ex<100*pihat
X2<sum((obex)^2/ex)
X2
# [1] 2.084625
G2<2*sum(ob*log(ob/ex))
G2
# [1] 2.088668
#### find the pvalue for X^2
1pchisq(X2,3)
# [1] 0.5550296
#### find the pvalue for G^2
1pchisq(G2,3)
# [1] 0.5542087
#############################################################
#### Nicer R code that corresponds to SAS code and its ouput
children=c(rep("0",19),rep("1",26),rep("2",29),rep(3,13),rep("4+",13))
#### Freq Procedure
Percentage=100*as.vector(table(children))/sum(table(children))
CumulativeFrequency=cumsum(c(19,26,29,13,13))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(children),Percentage,CumulativeFrequency,CumulativePercentage)
row.names(Freq)=NULL
Freq
#### ChiSquare Test for Specified Proportions
p=c(16.86,30.02,26.72,15.86,10.55)
chisq.test(table(children),p=p/sum(p))
The function dpois() calculates Poisson probabilities. You can also get the \(X^2\) in R by using function chisq.test(ob, p=pihat) in the above code, but notice in the output below, that the degrees of freedom, and thus the pvalue, is not correct:
Chisquared test for given probabilities
data: ob
Xsquared = 2.0846, df = 4, pvalue = 0.7202
You can use this \(X^2\) statistic, but need to calculate the new pvalue based on the correct degrees of freedom, in order to obtain correct inference.