2.7  Goodness-of-Fit Tests: Cell Probabilities as Functions of Unknown Parameters
For many statistical models, we do not know the vector of probabilities π a priori, but can only specify it up to some unknown parameters. More specifically, the cell proportions may be known functions of one or more other unknown parameters.
Hardy-Weinberg problem. Suppose that a gene is either dominant (A) or recessive (a), and the overall proportion of dominant genes in the population is p. If we assume mating is random (i.e. members of the population choose their mates in a manner that is completely unrelated to this gene), then the three possible genotypes—AA, Aa, and aa—should occur in the so-called Hardy-Weinberg proportions:
genotype    proportion              no. of dominant genes
AA          π_{1} = p^{2}           2
Aa          π_{2} = 2p(1 − p)       1
aa          π_{3} = (1 − p)^{2}     0
Note that this is equivalent to saying that the number of dominant genes that an individual has (0, 1, or 2) is distributed as Bin(2, p), where the parameter p is not specified. We must first estimate p before we can estimate (i.e., say something about) the unknown cell proportions in the vector π.
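As a quick numerical illustration (all counts here are hypothetical, not from the text), the Hardy-Weinberg proportions and the standard ML estimate of p, \(\hat{p}=(2n_{AA}+n_{Aa})/(2n)\) (the observed fraction of dominant genes among the 2n genes in a sample of n individuals), can be computed as follows:

```python
def hw_proportions(p):
    """Hardy-Weinberg genotype proportions (AA, Aa, aa) for dominant-gene frequency p."""
    return (p**2, 2 * p * (1 - p), (1 - p)**2)

def mle_p(n_AA, n_Aa, n_aa):
    """ML estimate of p: observed fraction of dominant genes among all 2n genes."""
    n = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * n)

pi = hw_proportions(0.4)            # hypothetical p = 0.4 gives (0.16, 0.48, 0.36)
assert abs(sum(pi) - 1.0) < 1e-12   # the three proportions sum to 1

# Hypothetical genotype counts from a sample of n = 100 individuals
p_hat = mle_p(16, 48, 36)           # (2*16 + 48) / 200 = 0.4
```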
Number of Children (The Poisson Model). Suppose that we observe the following numbers of children in n = 100 families:
no. of children:    0    1    2    3    4+
count:             19   26   29   13   13
Are these data consistent with a Poisson distribution? Recall that if a random variable Y has a Poisson distribution with mean λ, then
\(P(Y=y)=\dfrac{\lambda^y e^{-\lambda}}{y!}\)
for y = 0, 1, 2, . . .. Therefore, under the Poisson model, the cell proportions, given some unknown λ, are provided in the table below. For example, $\pi_1=P(Y=0)=\dfrac{\lambda^0 e^{-\lambda}}{0!}=e^{-\lambda}$.
no. of children    proportion
0                  π_{1} = e^{−λ}
1                  π_{2} = λe^{−λ}
2                  π_{3} = λ^{2}e^{−λ}/2
3                  π_{4} = λ^{3}e^{−λ}/6
4+                 π_{5} = 1 − Σ^{4}_{j=1} π_{j}
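Given any candidate λ, these five cell probabilities are easy to compute. Here is a short Python sketch (λ = 1.78 is used purely for illustration; the function name is ours):

```python
import math

def poisson_cell_probs(lam, cutpoint=4):
    """Cell probabilities for 0, 1, ..., cutpoint-1 children plus a 'cutpoint+' tail cell."""
    probs = [lam**y * math.exp(-lam) / math.factorial(y) for y in range(cutpoint)]
    probs.append(1.0 - sum(probs))   # tail cell: P(Y >= cutpoint)
    return probs

pi = poisson_cell_probs(1.78)
assert abs(sum(pi) - 1.0) < 1e-12   # a valid probability vector
```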
In both of these examples, the null hypothesis is that the multinomial probabilities π_{j} depend on one or more unknown parameters in a known way. In the children's example, for instance, we might want to know the proportion of families in the sampled population that have 2 children. In more general notation, the model specifies that:
π_{1} = g_{1}(θ),
π_{2} = g_{2}(θ),
...
π_{k} = g_{k}(θ),
where g_{1}, g_{2}, . . . , g_{k} are known functions but the parameter θ is unknown (e.g., $\lambda$ in the children's example or $p$ in the genetics example). Let S_{0} denote the set of all π that satisfy these constraints for some parameter θ. We want to test
H_{0} : π ∈ S_{0} versus H_{1} : π ∈ S,
where S denotes the probability simplex (the space) of all possible values of π. (Notice that S is a (k − 1)-dimensional space, but the dimension of S_{0} is the number of free parameters in θ.)
The method for conducting this test is as follows.
1. Estimate θ by an efficient method (e.g., maximum likelihood). Call the estimate \(\hat{\theta}\).
2. Calculate the estimated cell probabilities \(\hat{\pi}=(\hat{\pi}_1,\hat{\pi}_2,\ldots,\hat{\pi}_k)\), where
\(\hat{\pi}_1=g_1(\hat{\theta})\), \(\hat{\pi}_2=g_2(\hat{\theta})\), \(\ldots\), \(\hat{\pi}_k=g_k(\hat{\theta})\).
3. Calculate the goodness-of-fit statistics \(X^2(x,\hat{\pi})\) and \(G^2(x,\hat{\pi})\). That is, calculate the expected cell counts \(E_1=n\hat{\pi}_1\), \(E_2=n\hat{\pi}_2\), . . ., \(E_k=n\hat{\pi}_k\), and find
\(X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}\) and \(G^2=2\sum\limits_j O_j \log\dfrac{O_j}{E_j}\)
as usual.
If \(X^2(x,\hat{\pi})\) and \(G^2(x,\hat{\pi})\) are calculated as described above, then under the null hypothesis, as n → ∞, the distributions of both X^{2} and G^{2} approach χ^{2}_{ν}, where ν equals the number of free parameters under the alternative hypothesis minus the number of free parameters under the null hypothesis: ν = (k − 1) − d, where d = dim(θ) is the number of parameters in θ.
The difference between this result and the previous one is that the expected cell counts E_{1}, E_{2}, . . . , E_{k} used to calculate X^{2} and G^{2} now contain unknown parameters. Because we need to estimate d parameters to find E_{1}, E_{2}, . . . , E_{k}, the large-sample distribution of X^{2} and G^{2} has changed; it is still a chi-squared distribution, but the degrees of freedom have dropped by d, the number of unknown parameters we first need to estimate.
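The three steps above can be sketched as a generic Python helper (a sketch under the stated setup; the function name and interface are ours, not part of any package):

```python
import math

def gof_statistics(observed, pi_hat, d):
    """Goodness-of-fit statistics for observed cell counts against estimated
    cell probabilities pi_hat, which depend on d estimated parameters.
    Returns (X2, G2, df) with df = (k - 1) - d."""
    n = sum(observed)
    expected = [n * p for p in pi_hat]
    X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    G2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    df = (len(observed) - 1) - d
    return X2, G2, df
```

For the children's data below, with the cell probabilities computed from the estimated λ, this returns X² ≈ 2.08, G² ≈ 2.09, and df = 3.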
Example: Number of Children, continued
Are the data below consistent with a Poisson model?
no. of children:    0    1    2    3    4+
count:             19   26   29   13   13
Let's test the null hypothesis that these data are Poisson. First, we need to estimate λ, the mean of the Poisson distribution, so here $d=1$. Recall that if we have an iid sample y_{1}, y_{2}, . . . , y_{n} from a Poisson distribution, then the ML estimate of λ is just the sample mean, \(\hat{\lambda}=n^{-1}\sum_{i=1}^n y_i\). Based on the table above, we know that the original data y_{1}, y_{2}, . . . , y_{n} contained 19 values of 0, 26 values of 1, and so on; however, we don't know the exact values of the original data that fell into the category 4+, and finding the MLE of λ in this situation requires involved numerical computation. To keep things simple, suppose for now that of the 13 values classified as 4+, ten were equal to 4 and three were equal to 5. The ML estimate of λ is then
\(\hat{\lambda}=\dfrac{19(0)+26(1)+29(2)+13(3)+10(4)+3(5)}{100}=1.78\)
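This grouped-data average, with the assumed split of the 4+ cell into ten 4s and three 5s, is easy to verify:

```python
# Grouped data as (value, count) pairs; splitting the 4+ cell into ten 4s and
# three 5s is the assumption made in the text.
grouped = [(0, 19), (1, 26), (2, 29), (3, 13), (4, 10), (5, 3)]

n = sum(count for _, count in grouped)                  # n = 100 families
lam_hat = sum(y * count for y, count in grouped) / n    # sample mean = ML estimate

assert abs(lam_hat - 1.78) < 1e-12
```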
Under this estimate of λ, the expected counts for the first four cells (0, 1, 2, and 3 children, respectively) are
E_{1} = 100e^{−1.78} = 16.86,
E_{2} = 100(1.78)e^{−1.78} = 30.02,
E_{3} = 100(1.78)^{2}e^{−1.78}/2 = 26.72,
E_{4} = 100(1.78)^{3}e^{−1.78}/6 = 15.85.
The expected count for the 4+ cell is most easily found by noting that Σ_{j} E_{j} = n, and thus
E_{5} = 100 − (16.86 + 30.02 + 26.72 + 15.85) = 10.55.
This leads to:
X^{2} = 2.08 and G^{2} = 2.09.
Since the general multinomial model here has k − 1 = 4 free parameters (k = 5 is the number of cells, i.e. of $\pi_j$'s, less one for the sum-to-one constraint), and the Poisson model has just the one parameter $\lambda$, the degrees of freedom for this test are ν = 4 − 1 = 3, and the p-values are
\(P(\chi^2_3\geq2.08)=0.56\)
\(P(\chi^2_3\geq2.09)=0.55\)
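These tail probabilities can be checked without statistical software: for three degrees of freedom the chi-square survival function has a closed form (the helper name below is ours):

```python
import math

def chi2_sf_3df(x):
    """P(chi^2_3 >= x): closed-form survival function for 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_X2 = chi2_sf_3df(2.08)   # approx 0.56
p_G2 = chi2_sf_3df(2.09)   # approx 0.55
```

The same numbers come from 1 - pchisq(2.08, 3) in R, or scipy.stats.chi2.sf(2.08, 3) in Python.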
The Poisson model seems to fit well; there is no evidence that these data are not Poisson. Below is an example of how to do these computations using R and SAS.
Here is how to fit the Poisson model in R using the following code:
The function dpois() calculates Poisson probabilities. You can also get X^{2} in R by using the function chisq.test(ob, p=pihat) in the above code, but notice in the output below that the degrees of freedom, and thus the p-value, are not correct:
        Chi-squared test for given probabilities

data:  ob
X-squared = 2.0846, df = 4, p-value = 0.7202
You can use this X^{2} statistic, but need to calculate the new p-value based on the correct degrees of freedom, in order to obtain correct inference.
Here is how we can do this goodness-of-fit test in SAS, by using precalculated proportions (pihats). The TESTP option specifies expected proportions for a one-way table chi-square test. But notice in the output below that the degrees of freedom, and thus the p-value, are not correct:
Chi-Square Test for Specified Proportions

Chi-Square    2.0892
DF                 4
Pr > ChiSq    0.7194
You can use this X^{2} statistic, but need to calculate the new p-value based on the correct degrees of freedom, in order to obtain correct inference.
children.sas (program, text file)
children.lst (output, text file)