2.6 - Goodness-of-Fit Tests: Unspecified Parameters

For many statistical models, we do not know the vector of probabilities \(\pi\) a priori, but can only specify it up to some unknown parameters. More specifically, the cell proportions may be known functions of one or more other unknown parameters. For example, suppose that a gene is either dominant (A) or recessive (a), and the overall proportion of dominant genes in the population is \(p\). If we assume mating is random (i.e. members of the population choose their mates in a manner that is completely unrelated to this gene), then the three possible genotypes—AA, Aa, and aa—should occur in the so-called Hardy-Weinberg proportions:

genotype | proportion | no. of dominant genes
AA | \(\pi_1 = p^2\) | 2
Aa | \(\pi_2 = 2p(1 - p)\) | 1
aa | \(\pi_3 = (1 - p)^2\) | 0

This is equivalent to saying that the number of dominant genes that an individual has (0, 1, or 2) is distributed as \(Bin(2,p)\), where the parameter \(p\) is not specified. We have to first estimate it in order to obtain the expected counts under this model.
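
To see this equivalence concretely, here is a quick R check, using an arbitrary value \(p = 0.3\) chosen purely for illustration, that the Hardy-Weinberg proportions match the \(Bin(2,p)\) probabilities for the number of dominant genes:

#### Hardy-Weinberg proportions vs. Bin(2, p) probabilities (illustrative p = 0.3)
p <- 0.3
c(AA = p^2, Aa = 2*p*(1 - p), aa = (1 - p)^2)

#   AA   Aa   aa 
# 0.09 0.42 0.49 

dbinom(c(2, 1, 0), size = 2, prob = p)   # P(2), P(1), P(0) dominant genes

# [1] 0.09 0.42 0.49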

In such an example, the null hypothesis specifies some constraint on the parameters but stops short of providing their values specifically. In more general notation, the model specifies that:

\(\pi_1 = g_1(\theta)\),
\(\pi_2 = g_2(\theta)\),
\(\vdots\)
\(\pi_k = g_k(\theta)\),

where \(g_1, \ldots, g_k\) are known functions, but the parameter \(\theta\) is unknown (e.g., \(p\) in the example above). Let \(S_0\) denote the set of all \(\pi\) that satisfy these constraints for some parameter \(\theta\). We want to test

\(H_0 \colon \pi \in S_0\) versus \(H_A \colon \pi \in S\)

where \(S\) denotes the probability simplex (the space) of all possible values of \(\pi\). (Notice that \(S\) is a \((k-1)\)-dimensional space, but the dimension of \(S_0\) is the number of free parameters in \(\theta\).) The method for conducting this test is as follows:

  1. Estimate \(\theta\) by an efficient method (e.g., maximum likelihood), and denote the estimate by \(\hat{\theta}\).
  2. Calculate the (estimated) expected counts under the null hypothesis as \(E_j = n\hat{\pi}_{0j}\), where \(\hat{\pi}_{0j} = g_j(\hat{\theta})\), for \(j=1,\ldots,k\). We may still denote these as \(E_j\), even though they involve an estimated parameter.
  3. Calculate the goodness-of-fit statistics \(X^2(x,\hat{\pi}_0)\) and \(G^2(x,\hat{\pi}_0)\) using the usual formulas from the \(O_j\) and the \(E_j\).

That is, once the \(E_j\) are obtained from the estimated parameter(s), we use the usual formulas:

\(X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}\) and \(G^2=2\sum\limits_j O_j \log\dfrac{O_j}{E_j}\)

If \(X^2\) and \(G^2\) are calculated as described above, then under the null hypothesis both are approximately distributed as \(\chi^{2}_{\nu}\), where \(\nu\) equals the number of free parameters under the alternative hypothesis minus the number of free parameters under the null hypothesis. That is, \(\nu = (k - 1) - d\), where \(d = \dim(\theta)\) is the number of parameters that had to be estimated.
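
To make the three steps concrete before turning to the main example, here is a minimal R sketch of this procedure for the Hardy-Weinberg model, using hypothetical genotype counts invented purely for illustration. (Under this model, the ML estimate of \(p\) is the sample proportion of dominant alleles among the \(2n\) genes.)

#### Sketch: the three-step procedure for the Hardy-Weinberg model
#### (hypothetical counts for AA, Aa, aa, made up for illustration)

ob <- c(45, 45, 10)                      # observed counts for AA, Aa, aa
n  <- sum(ob)

#### step 1: ML estimate of p = sample proportion of dominant (A) alleles
phat <- (2*ob[1] + ob[2])/(2*n)
phat

# [1] 0.675

#### step 2: estimated expected counts E_j = n*g_j(phat)
pi0hat <- c(phat^2, 2*phat*(1 - phat), (1 - phat)^2)
ex <- n*pi0hat
ex

# [1] 45.5625 43.8750 10.5625

#### step 3: X^2 and G^2 with df = (k - 1) - d = (3 - 1) - 1 = 1
X2 <- sum((ob - ex)^2/ex)
G2 <- 2*sum(ob*log(ob/ex))
c(X2, G2)                                # both are roughly 0.066

1 - pchisq(c(X2, G2), df = 1)            # both p-values are roughly 0.80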

Example: Number of Children (the Poisson model)

Suppose that we observe the following numbers of children in \(n=100\) families:

no. of children | 0 | 1 | 2 | 3 | 4+
count | 19 | 26 | 29 | 13 | 13

Are these data consistent with a Poisson distribution? Recall that if a random variable \(Y\) has a Poisson distribution with mean \(\lambda\), then

\(P(Y=y)=\dfrac{\lambda^y e^{-\lambda}}{y!}\)

for \(y=0,1,\ldots\). Therefore, under the Poisson model with some unknown mean \(\lambda\), the cell proportions are given in the table below. For example, \(\pi_1=P(Y=0)=\dfrac{\lambda^0 e^{-\lambda}}{0!}=e^{-\lambda}\).

no. of children | proportion
0 | \(\pi_1 = e^{-\lambda}\)
1 | \(\pi_2 = \lambda e^{-\lambda}\)
2 | \(\pi_3 = \dfrac{\lambda^2 e^{-\lambda}}{2}\)
3 | \(\pi_4 = \dfrac{\lambda^3 e^{-\lambda}}{6}\)
4+ | \(\pi_5 = 1 - \sum^{4}_{j=1} \pi_j\)


Let's test the null hypothesis that these data are Poisson. First, we need to estimate \(\lambda\), the mean of the Poisson distribution, and thus here \(d=1\). Recall that if we have an iid sample \(y_1, \ldots , y_n\) from a Poisson distribution, then the ML estimate of \(\lambda\) is just the sample mean, \(\hat{\lambda}=\overline{Y}\). Based on the table above, we know that the original data \(y_1,\ldots , y_n\) contained 19 values of 0, 26 values of 1, and so on; however, we don't know the exact values of the original data that fell into the category 4+. To make matters easy, suppose that of the 13 values that were classified as 4+, ten were equal to 4 and three were equal to 5. Then the ML estimate of \(\lambda\) is (the sample mean)

\(\hat{\lambda}=\dfrac{19(0)+26(1)+29(2)+13(3)+10(4)+3(5)}{100}=1.78\)

With this value of \(\lambda\), the (estimated) expected counts for the first four cells (0, 1, 2, and 3 children, respectively) are

  • \(E_1 = 100e^{-1.78} = 16.86\)
  • \(E_2 = 100(1.78)e^{-1.78} = 30.02\)
  • \(E_3 = 100(1.78)^2 e^{-1.78}/2 = 26.72\)
  • \(E_4 = 100(1.78)^3 e^{-1.78}/6 = 15.85\)

The expected count for the 4+ cell is most easily found by noting that \(\sum_j E_j = n\), and thus

\(E_5 = 100 − (16.86 + 30.02 + 26.72 + 15.85) = 10.55\)

This leads to:

\(X^2 = 2.08\) and \(G^2 = 2.09\).

Since the multinomial model here has \(k=5\) categories (with the usual sum-to-\(n\) constraint), and the Poisson model required one parameter \(\lambda\) to be estimated, the degrees of freedom for this test are \(\nu = 5-1-1 = 3\), and the \(p\)-values are

\(P(\chi^2_3\geq2.08)=0.56\)

\(P(\chi^2_3\geq2.09)=0.55\)

The Poisson distribution seems to be a reasonable model for these data; at least, there is no significant evidence to say otherwise. Below are examples of these computations in SAS and R.

Here is this goodness-of-fit test in SAS, using pre-calculated proportions (\(\hat{\pi}_{0}\)). The TESTP option specifies the expected proportions (here entered as percentages) for a one-way table chi-square test. But notice in the output below that the degrees of freedom, and thus the p-value, are not correct:

/*Test of proportions for number of children example*/
/*To test the Poisson model for specified probabilities*/

options ls=90 nocenter nodate;

DATA children;
        INPUT children $ count;
        DATALINES;
        0 19
        1 26
        2 29
        3 13
        4+ 13
        ;
RUN;

PROC FREQ DATA=children ORDER=data;
        WEIGHT count;
        tables children/chisq testp=(16.86, 30.02, 26.72, 15.86, 10.55);
RUN;

Output

 
Chi-Square Test for Specified Proportions

Chi-Square    2.0892
DF            4
Pr > ChiSq    0.7194

You can use this \(X^2\) statistic, but you need to recompute the p-value with the correct degrees of freedom (here, 3) in order to obtain correct inference.
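
One simple way to make that correction by hand is a quick calculation, shown here in R, using the \(X^2\) value of 2.0892 from the SAS output above:

#### correct the p-value for the SAS chi-square statistic (df = 3, not 4)
1 - pchisq(2.0892, df = 3)

# roughly 0.554, in line with the exact calculation above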

Here is how to fit the Poisson model and carry out the goodness-of-fit test in R:

########################################################
#### Number of Children Example
#### Basic Poisson calculations broken down line by line 
#### Nicer R code that corresponds to SAS code and its output
#########################################################

#### input data 
ob<-c(19,26,29,13,13) 
ob

# [1] 19 26 29 13 13


#### find estimated expected probabilities 

lambdahat<-c(19*0+26*1+29*2+13*3+10*4+3*5)/100 
lambdahat

# [1] 1.78 


kids<-c(0,1,2,3) 

pihat<-dpois(kids,lambdahat) 
pihat 

# [1] 0.1686381 0.3001759 0.2671566 0.1585129 


#### attach the probability for the 4+ cell 

pihat<-c(pihat,1-sum(pihat)) 
ex<-100*pihat 
X2<-sum((ob-ex)^2/ex) 
X2 

# [1] 2.084625 

G2<-2*sum(ob*log(ob/ex)) 
G2 

# [1] 2.088668 


#### find the p-value for X^2

1-pchisq(X2,3)  

# [1] 0.5550296 



#### find the p-value for G^2 

1-pchisq(G2,3) 

# [1] 0.5542087 

#############################################################
#### Nicer R code that corresponds to SAS code and its output

children=c(rep("0",19),rep("1",26),rep("2",29),rep("3",13),rep("4+",13))

#### Freq Procedure

Percentage=100*as.vector(table(children))/sum(table(children))
CumulativeFrequency=cumsum(c(19,26,29,13,13))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(children),Percentage,CumulativeFrequency,CumulativePercentage)
row.names(Freq)=NULL
Freq

#### Chi-Square Test for Specified Proportions

p=c(16.86,30.02,26.72,15.86,10.55)
chisq.test(table(children),p=p/sum(p))

The function dpois() calculates Poisson probabilities. You can also get the \(X^2\) statistic in R with the function chisq.test(ob, p=pihat) in the code above, but notice in the output below that the degrees of freedom, and thus the p-value, are not correct:

Chi-squared test for given probabilities

data: ob
X-squared = 2.0846, df = 4, p-value = 0.7202

As with the SAS output, you can use this \(X^2\) statistic, but you need to recompute the p-value with the correct degrees of freedom (here, 3) in order to obtain correct inference.
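
A small addition to the code above shows one way to do this: keep the statistic returned by chisq.test() but recompute the p-value with 3 degrees of freedom. (The objects ob and pihat are re-created here so the snippet runs on its own.)

#### recompute the p-value from chisq.test() with the correct df = 3
ob    <- c(19, 26, 29, 13, 13)
pihat <- dpois(0:3, 1.78)
pihat <- c(pihat, 1 - sum(pihat))
out   <- chisq.test(ob, p = pihat)
out$statistic

# X-squared 
#  2.084625 

1 - pchisq(as.numeric(out$statistic), df = 3)

# [1] 0.5550296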