2: Binomial and Multinomial Inference


Overview

Statistical inference is the process of using sample data to make meaningful statements about population parameters. Maximum likelihood estimation is a good starting point for this, but because random samples vary, our estimators themselves are subject to variation that we have to take into account before drawing conclusions. A typical example is to form a confidence interval for a parameter by starting with the point estimate and adding and subtracting a margin of error that depends on the sample variability and the desired level of confidence. Hypothesis testing must likewise take sample variation into account before establishing significant results. This lesson considers both of these approaches for binomial and multinomial distributions.



2.1 - Normal and Chi-Square Approximations


Central Limit Theorem

Recall the bell-shaped (standard) normal distribution with mean 0 and variance 1.

Figure 2.1: Standard Normal Distribution

Although continuous in nature, it can still be a useful approximation for many discrete random variables formed from sums and means.

Central Limit Theorem

The Central Limit Theorem (CLT) states that if \(X_1,\ldots,X_n\) are a random sample from a distribution with mean \(E(X_i)=\mu\) and variance \(V(X_i)=\sigma^2\), then the distribution of

\(\dfrac{\overline{X}-\mu}{\sigma/\sqrt{n}} \)

converges to standard normal as the sample size \(n\) gets larger. In other words, if \(n\) is reasonably large, then \(\overline{X}\) can be approximated by a normal random variable, regardless of what distribution the individual \(X_i\) values came from. Let's see how this works for binomial data.

Suppose \(X_1,\ldots,X_n\) are a random sample from a Bernoulli distribution with \(Pr(X_i=1)=1-Pr(X_i=0)=\pi\) so that \(E(X_i)=\pi\) and \(V(X_i)=\pi(1-\pi)\). By the CLT,

\(\dfrac{\overline{X}-\pi}{\sqrt{\pi(1-\pi)/n}} \)

has an approximate standard normal distribution if \(n\) is large. "Large" in this context usually means the counts of 1s and 0s (successes and failures) should be at least 5, although some authors suggest at least 10 to be more conservative. Note that since the \(X_i\) take on the values 1 and 0, the sample mean \(\overline{X}\) is just the sample proportion of 1s (successes).
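
As a quick numerical check, here is a small sketch in R (using the \(n=20\), \(\pi=0.4\) values from the smartphone example below) comparing an exact binomial probability with its normal approximation:

n <- 20
p <- 0.4
pbinom(10, n, p)                               # exact P(Y <= 10) for Y ~ Bin(20, 0.4)
pnorm(10, mean = n*p, sd = sqrt(n*p*(1-p)))    # CLT approximation to the same probability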

Normal to Chi-Square

The chi-square distribution with \(\nu\) degrees of freedom can be defined as the sum of the squares of \(\nu\) independent standard normal random variables. In particular, if \(Z\) is standard normal, then \(Z^2\) is chi-square with one degree of freedom. For the approximation above, we have (with \(Y=\sum_i X_i\)) that 

\( \left(\dfrac{\overline{X}-\pi}{\sqrt{\pi(1-\pi)/n}}\right)^2 = \left(\dfrac{Y-n\pi}{\sqrt{n\pi(1-\pi)}}\right)^2 \)

is approximately chi-square with one degree of freedom, provided \(n\) is large. The advantage of working with a chi-square distribution is that it allows us to generalize readily to multinomial data when more than two outcomes are possible.
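
As a small sketch of this connection in R, the squared two-sided normal cutoff equals the corresponding chi-square cutoff with one degree of freedom:

qnorm(0.975)^2        # squared two-sided normal cutoff at alpha = 0.05
qchisq(0.95, df = 1)  # chi-square(1) cutoff: the same value, about 3.84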

Example: Smartphone Data

Recall our earlier binomial application involving 20 smartphone users. If we assume that the population proportion of Android users is \(\pi=.4\), then we can plot the exact binomial distribution corresponding to this situation---it is very close to the normal bell curve!

barplot(dbinom(0:20, 20, .4), names.arg = 0:20, main = "Binomial(20, .4)")


Figure 2.2: Binomial Distribution of Smartphone Data


2.2 - Tests and CIs for a Binomial Parameter


For the discussion here, we assume that \(X_1,\ldots,X_n\) are a random sample from the Bernoulli (or binomial with \(n=1\)) distribution with success probability \(\pi\) so that, equivalently, \(Y=\sum_i X_i\) is binomial with \(n\) trials and success probability \(\pi\). In either case, the MLE of \(\pi\) is \(\hat{\pi}=\overline{X}=Y/n\), with \(E(\hat{\pi})=\pi\) and \(V(\hat{\pi})=\pi(1-\pi)/n\), and the quantity

\(\dfrac{\hat{\pi}-\pi}{\sqrt{\pi(1-\pi)/n}} \)

is approximately standard normal for large \(n\). The following approaches make use of this. 

Wald Test and CI

If we wish to test the null hypothesis \(H_0\colon \pi=\pi_0\) versus \(H_a\colon \pi\ne\pi_0\) for some specified value \(\pi_0\), the Wald test statistic uses the MLE for \(\pi\) in the variance expression:

Wald Test Statistic

\(Z_w=\displaystyle \dfrac{\hat{\pi}-\pi_0}{\sqrt{\hat{\pi}(1-\hat{\pi})/n}} \)

For large \(n\), \(Z_w\) is approximately standard normal (the MLE is very close to \(\pi\) when \(n\) is large), and we can use standard normal quantiles for thresholds to determine statistical significance. That is, at significance level \(\alpha\), we would reject \(H_0\) if \(|Z_w|\ge z_{\alpha/2}\), the upper \(\alpha/2\) quantile. The reason for the absolute value is because we want to reject \(H_0\) if the MLE significantly differs from \(\pi_0\) in either direction.

Note that \(\pi_0\) will not be rejected if

\(\displaystyle -z_{\alpha/2}<\dfrac{\hat{\pi}-\pi_0}{\sqrt{\hat{\pi}(1-\hat{\pi})/n}} < z_{\alpha/2} \)

Rearranging slightly, we can write the equivalent result

\(\displaystyle \hat{\pi} -z_{\alpha/2}\sqrt{\dfrac{\hat{\pi}(1-\hat{\pi})}{n}} < \pi_0 < \hat{\pi} +z_{\alpha/2}\sqrt{\dfrac{\hat{\pi}(1-\hat{\pi})}{n}} \)

The limits are the familiar \((1-\alpha)100\%\) confidence interval for \(\pi\), which we refer to as the Wald interval:

\(\displaystyle \hat{\pi} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{\pi}(1-\hat{\pi})}{n}} \  \text{(Wald Interval)}\)

The most common choice for \(\alpha\) is 0.05, which gives 95% confidence and multiplier \(z_{.025}=1.96\). We should also mention here that one-sided confidence intervals can be constructed from one-sided tests, although they're less common.

Example: Smartphone users

To demonstrate these results in a simple example, suppose in our sample of 20 smartphone users that we observe six who use Android. Then, the MLE for \(\pi\) is \(6/20=0.30\), and we can be 95% confident that the true value of \(\pi\) is within

\(\displaystyle 0.30 \pm 1.96\sqrt{\dfrac{0.30(1-0.30)}{20}} = (0.0992, 0.5008)\)
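
A minimal check of this calculation in R (a fuller function covering the score and likelihood ratio intervals appears later in this lesson):

pi.hat <- 6/20
pi.hat + c(-1, 1) * qnorm(0.975) * sqrt(pi.hat * (1 - pi.hat) / 20)   # 0.0992 0.5008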

Score Test and CI

Another popular way to test \(H_0\colon \pi=\pi_0\) is with the Score test statistic:

Score Test Statistic

\(Z_s=\displaystyle \dfrac{\hat{\pi}-\pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \)

Considering that a hypothesis test proceeds by assuming the null hypothesis is true until significant evidence shows otherwise, it makes sense to use \(\pi_0\) in place of \(\pi\) in both the mean and variance of \(\hat{\pi}\). The score test thus rejects \(\pi_0\) when \(|Z_s|\ge z_{\alpha/2}\), or equivalently, will not reject \(\pi_0\) when

\(\displaystyle -z_{\alpha/2}<\dfrac{\hat{\pi}-\pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} < z_{\alpha/2} \)

Carrying out the score test is straightforward enough when a particular value \(\pi_0\) is to be tested. Constructing a confidence interval as the set of all \(\pi_0\) that would not be rejected requires a bit more work. Specifically, the score confidence interval limits are roots of the equation \(|Z_s|- z_{\alpha/2}=0\), which is quadratic with respect to \(\pi_0\) and can be solved with the quadratic formula. The full expressions are in the R file below, but the center of this interval is particularly noteworthy (we'll let \(z\) stand for \(z_{\alpha/2}\) for convenience):

\( \displaystyle \dfrac{\hat{\pi}+z^2/(2n)}{1+z^2/n} \)

Note that this center is close to the MLE \(\hat{\pi}\), but it is pushed slightly toward \(1/2\), depending on the confidence and sample size. This helps the interval's coverage probability when the sample size is small, particularly when \(\pi\) is close to 0 or 1. Recall that we're using a normal approximation for \(\hat{\pi}\) with mean \(\pi\) and variance \(\pi(1-\pi)/n\). The images below illustrate how this approximation depends on values of \(\pi\) and \(n\). 

Figure 2.3: Normal approximations to \(\hat{\pi}\) for different values of \(\pi\) and \(n\)

Problems arise when too much of the normal curve falls outside the (0,1) boundaries allowed for \(\pi\). In the first case, the sample size is small, but \(\pi=0.3\) is far enough away from the boundary that the normal approximation is still useful, whereas in the second case, the normal approximation is quite poor. The larger sample size in the third case decreases the variance enough to offset the small \(\pi\) value.

In practice, the results of a poor normal approximation tend to be intervals that include values outside the range of (0,1), which we know cannot apply to \(\pi\). The score interval performs better than the Wald in these situations because it shifts the center of the interval closer to 0.5. Compare the score interval limits (below) to those of the Wald when applied to the smartphone data.

\( (0.1455, 0.5190) \)

Code

ci = function(y,n,conf)
{   pi.hat = y/n                      # MLE (sample proportion)
    z = qnorm(1-(1-conf)/2)           # normal quantile z_{alpha/2}
    wald = pi.hat+c(-1,1)*z*sqrt(pi.hat*(1-pi.hat)/n)                                # Wald limits
    score = (pi.hat+z^2/2/n+c(-1,1)*z*sqrt(pi.hat*(1-pi.hat)/n+z^2/4/n^2))/(1+z^2/n) # score limits
    cbind(wald, score) }
ci(6,20,.95)
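
To see the coverage claim in action, here is a small simulation sketch (the values of \(n\) and \(\pi\) are illustrative) that reuses the ci() function above to compare the actual coverage of the nominal 95% Wald and score intervals when \(\pi\) is near the boundary:

set.seed(3)
n <- 20
pi.true <- 0.10
y <- rbinom(10000, n, pi.true)                 # simulated binomial counts
covered <- sapply(y, function(yy) {
  lims <- ci(yy, n, .95)                       # rows = lower/upper limits, columns = wald/score
  lims[1,] <= pi.true & pi.true <= lims[2,] })
rowMeans(covered)                              # score coverage is typically closer to 0.95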

Likelihood Ratio Test and CI

Our third approach to binomial inference follows the same idea of inverting a test statistic to construct a confidence interval but utilizes the likelihood ratio test (LRT) for the binomial parameter. Recall the likelihood function for \(Y\sim Bin(n,\pi)\):

\(\displaystyle \ell(\pi)= {n\choose y}\pi^y(1 - \pi)^{(n-y)}\)

The LRT statistic for \(H_0:\pi=\pi_0\) versus \(H_a:\pi\ne\pi_0\) is

\(\displaystyle G^2=2\log\dfrac{\ell(\hat{\pi})}{\ell(\pi_0)} = 2\left(y\log\dfrac{\hat{\pi}}{\pi_0}+(n-y)\log\dfrac{1-\hat{\pi}}{1-\pi_0}\right) \)

For large \(n\), \(G^2\) is approximately chi-square with one degree of freedom, and \(\pi_0\) will be rejected if \(G^2\ge \chi^2_{1,\alpha}\). Like the Wald and score test statistics, the LRT statistic is essentially a measure of disagreement between the sample estimate and the hypothesized value for \(\pi\). Larger values indicate more disagreement and more evidence to reject \(H_0\). And we can likewise construct a confidence interval as the set of all values of \(\pi_0\) that would not be rejected. Unfortunately, we must resort to numerical approximation for these limits. Here are the results for the smartphone data.

\( (0.1319, 0.5165) \)

Like the score interval, the limits for the LRT interval are centered at a value closer to 0.5, compared with the Wald limits, which are centered at the MLE.

Code
library('rootSolve')
ci = function(y,n,conf)
{   pi.hat = y/n                      # MLE (sample proportion)
    z = qnorm(1-(1-conf)/2)           # normal quantile z_{alpha/2}
    wald = pi.hat+c(-1,1)*z*sqrt(pi.hat*(1-pi.hat)/n)
    score = (pi.hat+z^2/2/n+c(-1,1)*z*sqrt(pi.hat*(1-pi.hat)/n+z^2/4/n^2))/(1+z^2/n)
    loglik = function(p) 2*(y*log(pi.hat/p)+(n-y)*log((1-pi.hat)/(1-p)))-z^2  # G^2 minus the chi-square cutoff
    lrt = uniroot.all(loglik,c(0.01,0.99))                                    # roots give the LRT interval limits
    cbind(wald,score,lrt) }
ci(6,20,.95)	

2.3 - The Multinomial Distribution


Following up on our brief introduction to this extremely useful distribution, we go into more detail here in preparation for the goodness-of-fit test coming up. Recall that the multinomial distribution generalizes the binomial to accommodate more than two categories. For example, what if the respondents in a survey had three choices:

  1. I feel optimistic.
  2. I don't feel optimistic.
  3. I'm not sure.

If we separately count the number of respondents answering each of these and collect them in a vector, we can use the multinomial distribution to model the behavior of this vector.

Properties of the Multinomial Distribution

The multinomial distribution arises from an experiment with the following properties:

  • a fixed number \(n\) of trials
  • each trial is independent of the others
  • each trial has \(k\) mutually exclusive and exhaustive possible outcomes, denoted by \(E_1, \dots, E_k\)
  • on each trial, \(E_j\) occurs with probability \(\pi_j , j = 1, \dots , k\).

If we let \(X_j\) count the number of trials for which outcome \(E_j\) occurs, then the random vector \(X = \left(X_1, \dots, X_k\right)\) is said to have a multinomial distribution with index \(n\) and parameter vector \(\pi = \left(\pi_1, \dots, \pi_k\right)\), which we denote as

\(X \sim Mult\left(n, \pi\right)\)

In most problems, \(n\) is known (e.g., it will represent the sample size). Note that we must have \(\pi_1 + \cdots + \pi_k = 1\) and \(X_1+\cdots+X_k=n\).

Marginal Counts

The individual or marginal components of a multinomial random vector have binomial distributions. That is, if we focus on the \(j\)th category as "success" and all other categories collectively as "failure", then \(X_j \sim Bin\left(n, \pi_j\right)\), for \(j=1,\ldots,k\).
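
A small sketch of this fact in R, with illustrative cell probabilities: summing the multinomial probabilities over the other cells reproduces the binomial marginal.

probs <- c(0.2, 0.3, 0.5)   # illustrative cell probabilities
n <- 10
# P(X1 = 4) by summing the multinomial pmf over the other two cells...
sum(sapply(0:6, function(x2) dmultinom(c(4, x2, 6 - x2), size = n, prob = probs)))
# ...equals the Bin(n, pi_1) probability
dbinom(4, n, probs[1])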


2.3.1 - Distribution function


The function that gives the probability of each possible value of a random variable is known as the distribution function (here, the probability mass function). As we saw with maximum likelihood estimation, this can also be viewed as the likelihood function when regarded as a function of the parameters \(\pi_1,\ldots,\pi_k\).

The distribution function

The probability that \(X = \left(X_1, \dots, X_k \right)\) takes a particular value \(x = \left(x_1, \dots , x_k \right)\) is

\(f(x)=\dfrac{n!}{x_1!x_2!\cdots x_k!}\pi_1^{x_1} \pi_2^{x_2} \cdots \pi_k^{x_k}\)

The possible values of \(X\) are the set of \(x\)-vectors such that each \(x_j \in \{0, 1, \ldots, n\}\) and \(x_1 + \cdots + x_k = n\).

Example: Jury Selection

Suppose that the racial/ethnic distribution in a large city is given by the table that follows. Consider these three proportions as the parameter vector of a multinomial distribution.

Black Hispanic Other
20% 15% 65%

Suppose that a jury of twelve members is chosen from this city in such a way that each resident has an equal probability of being selected independently of every other resident. There are a number of questions that we can ask of this type of distribution.

Let's find the probability that the jury contains:

  • three Black, two Hispanic, and seven Other members;
  • four Black and eight Other members;
  • at most one Black member.

To solve this problem, let \(X = \left(X_1, X_2, X_3\right)\) where \(X_1 =\) number of Black members, \(X_2 =\) number of Hispanic members, and \(X_3 =\) number of Other members. Then \(X\) has a multinomial distribution with parameters \(n = 12\) and \(\pi = \left(.20, .15, .65\right)\). The answer to the first part is

\begin{align} P(X_1=3,X_2=2,X_3=7) &= \dfrac{n!}{x_1!x_2!x_3!} \pi_1^{x_1}\pi_2^{x_2}\pi_3^{x_3}\\ &= \dfrac{12!}{3!2!7!}(0.20)^3(0.15)^2(0.65)^7\\ &= 0.0699\\ \end{align}

The answer to the second part is

\begin{align} P(X_1=4,X_2=0,X_3=8) &= \dfrac{12!}{4!0!8!}(0.20)^4(0.15)^0(0.65)^8\\ &= 0.0252\\ \end{align}

For the last part, note that "at most one Black member" means \(X_1 = 0\) or \(X_1 = 1\). \(X_1\) is a binomial random variable with \(n = 12\) and \(\pi_1 = .2\). Using the binomial probability distribution,

\(P(X_1=0) = \dfrac{12!}{0!12!}(0.20)^0(0.8)^{12}= 0.0687\)

and

\(P(X_1=1) = \dfrac{12!}{1!11!}(0.20)^1(0.8)^{11}= 0.2061\)

Therefore, the answer is:

\(P\left(X_1 = 0\right) + P\left(X_1 = 1\right) = 0.0687 + 0.2061 = 0.2748\)
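
These calculations can be checked in R with dmultinom() and pbinom(); the values below match the worked answers above.

dmultinom(c(3, 2, 7), size = 12, prob = c(0.20, 0.15, 0.65))   # 0.0699
dmultinom(c(4, 0, 8), size = 12, prob = c(0.20, 0.15, 0.65))   # 0.0252
pbinom(1, size = 12, prob = 0.20)                              # P(X1 <= 1) = 0.2748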


2.3.2 - Moments


Many of the elementary properties of the multinomial can be derived by decomposing \(X\) as the sum of iid random vectors,

\(X=Y_1+\cdots+Y_n\)

where each \(Y_i \sim Mult\left(1, \pi\right)\). In this decomposition, \(Y_i\) represents the outcome of the \(i\)th trial; it's a vector with a 1 in position \(j\) if \(E_j\) occurred on that trial and 0s in all other positions. The elements of \(Y_i\) are correlated Bernoulli random variables. For example, with \(k=2\) possible outcomes on each trial, \(Y_i=(\# E_1,\# E_2)\) on the \(i\)th trial, and the possible values of \(Y_i\) are

(1, 0) with probability \(\pi_1\),

(0, 1) with probability \(\pi_2 = 1− \pi_1\).

Because the individual elements of \(Y_i\) are Bernoulli, the mean of \(Y_i\) is \(\pi = \left(\pi_1, \pi_2\right)\), and its covariance matrix is

\(\begin{bmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) \end{bmatrix}\)

Establishing the covariance term (off-diagonal element) requires a bit more work, but note that intuitively it should be negative because exactly one of either \(E_1\) or \(E_2\) must occur.

More generally, with \(k\) possible outcomes, the mean of \(Y_i\) is \(\pi = \left(\pi_1, \dots , \pi_k\right)\), and the covariance matrix is

\(\begin{bmatrix} \pi_1(1-\pi_1) & -\pi_1\pi_2 & \cdots & -\pi_1\pi_k \\ -\pi_1\pi_2 & \pi_2(1-\pi_2) & \cdots & -\pi_2\pi_k \\ \vdots & \vdots & \ddots & \vdots \\ -\pi_1\pi_k & -\pi_2\pi_k & \cdots & \pi_k(1-\pi_k) \end{bmatrix}\)

And finally returning to \(X=Y_1+\cdots+Y_n\) in full generality, we have that

\(E(X)=n\pi=(n\pi_1,\ldots,n\pi_k)\)

with covariance matrix

\(\begin{bmatrix} n\pi_1(1-\pi_1) & -n\pi_1\pi_2 & \cdots & -n\pi_1\pi_k \\ -n\pi_1\pi_2 & n\pi_2(1-\pi_2) & \cdots & -n\pi_2\pi_k \\ \vdots & \vdots & \ddots & \vdots \\ -n\pi_1\pi_k & -n\pi_2\pi_k & \cdots & n\pi_k(1-\pi_k) \end{bmatrix}\)

Because the elements of \(X\) are constrained to sum to \(n\), this covariance matrix is singular. If all the \(\pi_j\)s are positive, then the covariance matrix has rank \(k-1\). Intuitively, this makes sense since the last element \(X_k\) can be replaced by \(n − X_1− \dots − X_{k−1}\); there are really only \(k-1\) "free" elements in \(X\). If some elements of \(\pi\) are zero, the rank drops by one for every zero element.
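
A simulation sketch in R (with an illustrative \(n\) and \(\pi\)) comparing the empirical mean vector and covariance matrix of multinomial draws with the expressions above:

set.seed(1)
n <- 50
probs <- c(0.2, 0.3, 0.5)                          # illustrative pi
X <- t(rmultinom(10000, size = n, prob = probs))   # 10000 draws, one per row
colMeans(X)                                        # compare with n * pi
cov(X)                                             # empirical covariance matrix
n * (diag(probs) - probs %*% t(probs))             # theoretical covariance matrix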


2.3.3 - Parameter space


If we don't impose any restrictions on the parameter

\(\pi=(\pi_1,\pi_2,\ldots,\pi_k)\)

other than the logically necessary constraints

\(\pi_j \in [0,1],j=1,\ldots,k\) (1)

and

\(\pi_1+\pi_2+\ldots+\pi_k=1\) (2)

then the parameter space is the set of all \(\pi\)-vectors that satisfy (1) and (2). This set is called a simplex. In the special case of k = 3, we can visualize \(\pi = \left(\pi_1, \pi_2, \pi_3\right)\) as a point in three-dimensional space. The simplex S is the triangular portion of a plane with vertices at (1, 0, 0), (0, 1, 0) and (0, 0, 1):

More generally, the simplex is a portion of a (k − 1)-dimensional hyperplane in k-dimensional space. Alternatively, we can replace

\(\pi_k\text{ by }1-\pi_1-\pi_2-\cdots-\pi_{k-1}\)

because it's not really a free parameter and view the simplex in (k − 1)-dimensional space. For example, with k = 3, we can replace \(\pi_3\) by \(1 − \pi_1− \pi_2\) and view the parameter space as a triangle:


2.3.4 - Maximum Likelihood Estimation


If \(X \sim Mult\left(n, \pi\right)\) and we observe \(X = x\), then the loglikelihood function for \(\pi\) is

\(L(\pi)=\log\dfrac{n!}{x_1!\cdots x_k!}+x_1 \log\pi_1+\cdots+x_k \log\pi_k\) 

We usually ignore the leading factorial coefficient because it doesn't involve \(\pi\) and will not influence the point where \(L\) is maximized. Using multivariate calculus with the constraint that

\(\pi_1+\ldots+\pi_k=1\)

the maximum is achieved at the vector of sample proportions:

\begin{align} \hat{\pi} = \dfrac{1}{n}x= (x_1/n,x_2/n,\ldots,x_k/n)\\ \end{align}


2.3.5 - Fusing and Partitioning Cells


We can collapse a multinomial vector by fusing cells (i.e. by adding some of the cell counts \(X_j\) together). If

\(X=(X_1,\ldots,X_k)\sim Mult(n,\pi)\)

where \(\pi = \left(\pi_1, \dots , \pi_k\right)\), then

\(X^\ast=(X_1+X_2,X_3,X_4,\ldots,X_k)\)

is also multinomial with the same index \(n\) and modified parameter \(\pi^\ast = \left(\pi_1 + \pi_2, \pi_3, \dots , \pi_k\right)\). In the multinomial experiment, we are simply fusing the events \(E_1\) and \(E_2\) into the single event "\(E_1\) or \(E_2\)". Because these events are mutually exclusive,

\(P(E_1\text{ or }E_2)=P(E_1)+P(E_2)=\pi_1+\pi_2\)

We can also partition the multinomial by conditioning on (treating as fixed) the totals of subsets of cells. For example, consider the conditional distribution of \(X\) given that...

\(X_1+X_2=z\)

\(X_3+X_4+\cdots+X_k=n-z\)

The subvectors \(\left(X_1, X_2\right)\) and \(\left(X_3, X_4, \dots, X_k \right)\) are conditionally independent and multinomial,

\((X_1,X_2)\sim Mult\left[z,\left(\dfrac{\pi_1}{\pi_1+\pi_2},\dfrac{\pi_2}{\pi_1+\pi_2}\right)\right]\)

\((X_3,\ldots,X_k)\sim Mult\left[n-z,\left(\dfrac{\pi_3}{\pi_3+\cdots+\pi_k},\cdots,\dfrac{\pi_k}{\pi_3+\cdots+\pi_k}\right)\right]\)

The joint distribution of two or more independent multinomials is called the "product-multinomial." If we condition on the sums of non-overlapping groups of cells of a multinomial vector, its distribution splits into the product-multinomial. The parameter for each part of the product-multinomial is a portion of the original \(\pi\) vector, normalized to sum to one.
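
A sketch of the fusing result in R, with illustrative probabilities: the pmf of the fused vector agrees with the original multinomial pmf summed over all ways of splitting the fused count.

probs <- c(0.1, 0.2, 0.3, 0.4)   # illustrative pi
n <- 10
# P(X* = (3, 3, 4)) computed directly from the fused parameter...
dmultinom(c(3, 3, 4), size = n, prob = c(probs[1] + probs[2], probs[3], probs[4]))
# ...equals the original pmf summed over all (x1, x2) with x1 + x2 = 3
sum(sapply(0:3, function(x1) dmultinom(c(x1, 3 - x1, 3, 4), size = n, prob = probs)))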


2.3.6 - Relationship between the Multinomial and the Poisson


Suppose that \(X_{1}, \dots, X_{k}\) are independent Poisson random variables,

\(\begin{aligned}&X_{1} \sim P\left(\lambda_{1}\right)\\&X_{2} \sim P\left(\lambda_{2}\right)\\&...\\&X_{k} \sim P\left(\lambda_{k}\right)\end{aligned}\)

where the \(\lambda_{j}\)'s are not necessarily equal. Then the conditional distribution of the vector

\(X=(X_1,\ldots,X_k)\)

given the total \(n=X_1+\ldots+X_k\) is \(Mult\left(n, \pi\right)\), where \(\pi=(\pi_1,\ldots,\pi_k)\), and

\(\pi_j=\dfrac{\lambda_j}{\lambda_1+\cdots+\lambda_k}\)

That is, \(\pi\) is simply the vector of \(\lambda_{j}\)s normalized to sum to one.

This fact is important because it implies that the unconditional distribution of \(\left(X_{1}, \dots, X_{k}\right)\) can be factored into the product of two distributions: a Poisson distribution for the overall total,

\(n\sim P(\lambda_1+\lambda_2+\cdots+\lambda_k)\),

and a multinomial distribution for \(X = \left(X_{1}, \dots, X_{k}\right)\) given \(n\),

\(X\sim Mult(n,\pi)\)

The likelihood factors into two independent functions, one for \(\sum\limits_{j=1}^k \lambda_j\) and the other for \(\pi\). The total \(n\) carries no information about \(\pi\) and vice-versa. Therefore, likelihood-based inferences about \(\pi\) are the same whether we regard \(X_{1}, \dots, X_{k}\) as sampled from \(k\) independent Poissons or from a single multinomial, and any estimates, tests, etc. for \(\pi\) or functions of \(\pi\) will be the same, whether we regard \(n\) as random or fixed.
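
Here is a simulation sketch of this relationship in R (the \(\lambda_j\) values are illustrative): conditioning independent Poisson counts on their total reproduces multinomial behavior.

set.seed(2)
lambda <- c(2, 3, 5)                                             # illustrative Poisson means
X <- matrix(rpois(3 * 100000, lambda), ncol = 3, byrow = TRUE)   # independent Poisson counts
keep <- rowSums(X) == 10                                         # condition on the total n = 10
colMeans(X[keep, ])                                              # approx. n * lambda / sum(lambda)
10 * lambda / sum(lambda)                                        # the multinomial mean vector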

Example: Vehicle Color

Suppose, while waiting at a busy intersection for one hour, we record the color of each vehicle as it drives by. Let

 

\(X_{1} =\) number of white vehicles

\(X_{2} =\) number of black vehicles

\(X_{3} =\) number of silver vehicles

\(X_{4} =\) number of red vehicles

\(X_{5} =\) number of blue vehicles

\(X_{6} =\) number of green vehicles

\(X_{7} =\) number of any other color

In this experiment, the total number of vehicles observed, \(n=X_1+\cdots+X_7\), is random. (It would have been fixed if, for example, we had decided to classify the first \(n=500\) vehicles we see. But because we decided to wait for one hour, \(n\) is random.)

In this case, it's reasonable to regard the \(X_{j}\)s as independent Poisson random variables with means \(\lambda_{1}, \ldots, \lambda_{7}\). But if our interest lies not in the \(\lambda_{j}\)s but in the proportions of various colors in the vehicle population, inferences about these proportions will be the same whether we regard the total \(n\) as random or fixed. That is, we can proceed as if

\(X=(X_1,\ldots,X_7)\sim Mult(n,\pi)\)

where \(\pi = \left(\pi_{1}, \dots, \pi_{7}\right)\), even though \(n\) is actually random.


2.3.7 - Chi-Square Approximation


Recall that for large \(n\), if \(X\sim Bin(n,\pi)\), then the squared standardized statistic

\( \left(\dfrac{X-n\pi}{\sqrt{n\pi(1-\pi)}}\right)^2 \)

has an approximate chi-square distribution with \(\nu=1\) degree of freedom.

With a little algebraic manipulation, we can expand this into parts due to successes and failures:

\( \left(\dfrac{X-n\pi}{\sqrt{n\pi}}\right)^2 + \left(\dfrac{(n-X)-n(1-\pi)}{\sqrt{n(1-\pi)}}\right)^2\)

The benefit of writing it this way is to see how it can be generalized to the multinomial setting. That is, if \(X=(X_1,\ldots,X_k)\sim Mult(n,\pi)\), then

\(Q=\left(\dfrac{X_1-n\pi_1}{\sqrt{n\pi_1}}\right)^2 +\cdots+ \left(\dfrac{X_k-n\pi_k}{\sqrt{n\pi_k}}\right)^2\)

And \(Q\) has an approximate chi-square distribution with \(\nu=k-1\) degrees of freedom, provided the sample size is large. The usual condition to check for the sample size requirement is that all expected counts \(n\pi_j\) are at least 5, although this is not a strict rule.
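
A minimal sketch in R computing \(Q\) for a hypothetical multinomial sample and comparing it to the chi-square critical value:

x <- c(12, 18, 20)                    # hypothetical observed counts
probs <- c(0.3, 0.3, 0.4)             # hypothesized cell probabilities
n <- sum(x)
Q <- sum((x - n*probs)^2 / (n*probs))
Q
qchisq(0.95, df = length(x) - 1)      # compare Q with this cutoff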


2.4 - Goodness-of-Fit Test


A goodness-of-fit test, in general, measures how well the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking the model fit. As in linear regression, the goodness-of-fit test essentially compares the observed values to the expected (fitted or predicted) values.

A goodness-of-fit statistic tests the following hypothesis:

\(H_0\colon\) the model \(M_0\) fits

vs.

\(H_A\colon\) the model \(M_0\) does not fit (or, some other model \(M_A\) fits)

Most often the observed data represent the fit of the saturated model, the most complex model possible with the given data. Thus, most often the alternative hypothesis \(\left(H_A\right)\) will represent the saturated model \(M_A\) which fits perfectly because each observation has a separate parameter. Later in the course, we will see that \(M_A\) could be a model other than the saturated one. Let us now consider the simplest example of the goodness-of-fit test with categorical data.

In the setting for one-way tables, we measure how well an observed variable X corresponds to a \(Mult\left(n, \pi\right)\) model for some vector of cell probabilities, \(\pi\). We will consider two cases:

  1. when vector \(\pi\) is known, and
  2. when vector \(\pi\) is unknown.

In other words, we assume that under the null hypothesis data come from a \(Mult\left(n, \pi\right)\) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodness-of-fit test is applied to corroborate our assumption.

Consider our dice example from Lesson 1. We want to test the hypothesis that the six faces are equally probable by comparing the observed frequencies to those expected under the assumed model: \(X \sim Mult(n = 30, \pi_0)\), where \(\pi_0=(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)\). We can think of this as simultaneously testing whether the probability in each cell equals its specified value:

\(H_0:\pi =\pi_0\)

where the alternative hypothesis is that at least one of these cell probabilities differs from its null value. There are two statistics available for this test.

Test Statistics

Pearson Goodness-of-fit Test Statistic

The Pearson goodness-of-fit statistic is

\(X^2=\sum\limits_{j=1}^k \dfrac{(X_j-n\pi_{0j})^2}{n\pi_{0j}}\)

An easy way to remember it is

\(X^2=\sum\limits_{j=1}^k \dfrac{(O_j-E_j)^2}{E_j}\)

where \(O_j = X_j\) is the observed count in cell \(j\), and \(E_j=E(X_j)=n\pi_{0j}\) is the expected count in cell \(j\) under the assumption that the null hypothesis is true.

Deviance Test Statistic

The deviance statistic is

\(G^2=2\sum\limits_{j=1}^k X_j \log\left(\dfrac{X_j}{n\pi_{0j}}\right) =2\sum\limits_j O_j \log\left(\dfrac{O_j}{E_j}\right)\)

In some texts, \(G^2\) is also called the likelihood-ratio test (LRT) statistic, for comparing the loglikelihoods \(L_0\) and \(L_1\) of two models under \(H_0\) (reduced model) and \(H_A\) (full model), respectively:

Likelihood-ratio Test Statistic

\(G^2 = -2\log\left(\dfrac{\ell_0}{\ell_1}\right) = -2\left(L_0 - L_1\right)\)

Note that \(X^2\) and \(G^2\) are both functions of the observed data \(X\) and a vector of probabilities \(\pi_0\). For this reason, we will sometimes write them as \(X^2\left(x, \pi_0\right)\) and \(G^2\left(x, \pi_0\right)\), respectively; when there is no ambiguity, however, we will simply use \(X^2\) and \(G^2\). We will be dealing with these statistics throughout the course in the analysis of 2-way and \(k\)-way tables and when assessing the fit of log-linear and logistic regression models.

Testing the Goodness-of-Fit

\(X^2\) and \(G^2\) both measure how closely the model, in this case \(Mult\left(n,\pi_0\right)\) "fits" the observed data. And both have an approximate chi-square distribution with \(k-1\) degrees of freedom when \(H_0\) is true. This allows us to use the chi-square distribution to find critical values and \(p\)-values for establishing statistical significance.

  • If the sample proportions \(\hat{\pi}_j\) (i.e., saturated model) are exactly equal to the model's \(\pi_{0j}\) for cells \(j = 1, 2, \dots, k,\) then \(O_j = E_j\) for all \(j\), and both \(X^2\) and \(G^2\) will be zero. That is, the model fits perfectly.
  • If the sample proportions \(\hat{\pi}_j\) deviate from the \(\pi_{0j}\)s, then \(X^2\) and \(G^2\) are both positive. Large values of \(X^2\) and \(G^2\) mean that the data do not agree well with the assumed/proposed model \(M_0\).

Example: Dice Rolls

Suppose that we roll a die 30 times and observe the following table showing the number of times each face ends up on top.

 

Face Count
1 3
2 7
3 5
4 10
5 2
6 3
Total 30

We want to test the null hypothesis that the die is fair. Under this hypothesis, \(X \sim Mult\left(n = 30, \pi_0\right)\) where \(\pi_{0j} = 1/6\), for \(j=1,\ldots,6\). This is our assumed model, and under this \(H_0\), the expected counts are \(E_j = 30/6= 5\) for each cell. We now have what we need to calculate the goodness-of-fit statistics:

\begin{eqnarray*} X^2 &= & \dfrac{(3-5)^2}{5}+\dfrac{(7-5)^2}{5}+\dfrac{(5-5)^2}{5}\\ & & +\dfrac{(10-5)^2}{5}+\dfrac{(2-5)^2}{5}+\dfrac{(3-5)^2}{5}\\ &=& 9.2 \end{eqnarray*}

\begin{eqnarray*} G^2 &=& 2\left(3\text{log}\dfrac{3}{5}+7\text{log}\dfrac{7}{5}+5\text{log}\dfrac{5}{5}\right.\\ & & \left.+ 10\text{log}\dfrac{10}{5}+2\text{log}\dfrac{2}{5}+3\text{log}\dfrac{3}{5}\right)\\ &=& 8.8 \end{eqnarray*}

Note that even though both have the same approximate chi-square distribution, the realized numerical values of \(X^2\) and \(G^2\) can be different. The \(p\)-values are \(P\left(\chi^{2}_{5} \ge 9.2\right) = .10\) and \(P\left(\chi^{2}_{5} \ge 8.8\right) = .12\). Given these \(p\)-values, with the significance level of \(\alpha=0.05\), we fail to reject the null hypothesis. But rather than concluding that \(H_0\) is true, we simply don't have enough evidence to conclude it's false. That is, the fair-die model doesn't fit the data exactly, but the fit isn't bad enough to conclude that the die is unfair, given our significance threshold of 0.05.

Next, we show how to do this in SAS and R.

The following SAS code will perform the goodness-of-fit test for the example above.

/*----------------------------
| Example: die
-----------------------------*/
data die;
input face $ count;
datalines;
1 3
2 7
3 5
4 10
5 2
6 3
;
run;
proc freq;
weight count;
tables face/ chisq;
/*tables face /nocum all chisq testp=(16.67 16.67 16.67 16.67 16.67 16.67);*/
run;

/******** computation of Pearson residuals, deviance residuals, X2 and G2 ****/

data cal;
set die;
pi=1/6;
ecount=30*pi;
res=(count-ecount)/sqrt(ecount);
devres=sqrt(abs(2*count*log(count/ecount)))*sign(count-ecount);
run;

proc print;run;

proc sql;
select sum((count-ecount)**2/ecount) as X2,
       1-probchi(calculated X2,5) as pval1,
       2*sum(count*log(count/ecount)) as G2,
       1-probchi(calculated G2,5) as pval2
from cal;
quit;

 

Output

The FREQ Procedure

 
face Frequency Percent Cumulative Frequency Cumulative Percent
1 3 10.00 3 10.00
2 7 23.33 10 33.33
3 5 16.67 15 50.00
4 10 33.33 25 83.33
5 2 6.67 27 90.00
6 3 10.00 30 100.00
 
Chi-Square Test
for Equal Proportions
Chi-Square 9.2000
DF 5
Pr > ChiSq 0.1013
[Bar chart: deviations of face (relative deviation by face), Pr > ChiSq = 0.1013]

Sample Size = 30


 
Obs face count pi ecount res devres
1 1 3 0.16667 5 -0.89443 -1.75070
2 2 7 0.16667 5 0.89443 2.17039
3 3 5 0.16667 5 0.00000 0.00000
4 4 10 0.16667 5 2.23607 3.72330
5 5 2 0.16667 5 -1.34164 -1.91446
6 6 3 0.16667 5 -0.89443 -1.75070

 
X2 pval1 G2 pval2
9.2 0.101348 8.778485 0.118233

Notice that PROC FREQ computes only the Pearson chi-square statistic; the deviance statistic and the residuals are obtained separately in the DATA step and PROC SQL above.

 Stop and Think!

Can you identify the relevant statistics and the \(p\)-value in the output? What does the column labeled "Percent" represent?

The following R code, dice_rolls.R will perform the same analysis as in SAS. The output will be saved into two files, dice_rolls.out and dice_rolls_Results.

###### Dice Rolls from Lesson 2: one-way tables & GOF 
##### Line by line calculations in R
##### Nice R code that corresponds to SAS code and output
##########################################################

### if you want all output into a file use: sink("dice_roll.out")
sink("dice_roll.out")

### run a goodness of fit test
dice<- chisq.test(c(3,7,5,10,2,3))
dice

########OUTPUT gives Pearson chi-squared 
#	Chi-squared test for given probabilities
#
#  data:  c(3, 7, 5, 10, 2, 3) 
#  X-squared = 9.2, df = 5, p-value = 0.1013
########

### to get observed values
dice$observed


### to get expected values
dice$expected

### to get Pearson residuals
dice$residuals


#####Make the output print into a nice table ######
#### creating a table and giving labels to the columns 
out<-round(cbind(1:6, dice$observed, dice$expected, dice$residuals),3)
out<-as.data.frame(out)
names(out)<-c("cell_j", "O_j", "E_j", "res_j")

### printing your table of results into a text file with tab separation
write.table(out, "dice_rolls_Results", row.names=FALSE, col.names=TRUE, sep="\t")


#########TO GET Deviance statistic and its p-value
G2=2*sum(dice$observed*log(dice$observed/dice$expected))
G2
1-pchisq(G2,5)


##deviance residuals
devres=sign(dice$observed-dice$expected)*sqrt(abs(2*dice$observed*log(dice$observed/dice$expected)))
devres

##to show that the G2 is a sum of deviance residuals
G2=sum(sign(dice$observed-dice$expected)*(sqrt(abs(2*dice$observed*log(dice$observed/dice$expected))))^2)
G2

########## If you want to specify explicitly the vector of probabilities 
dice1<-chisq.test(c(3,7,5,10,2,3), p=c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6))
dice1

############################################################
#### Nice R code that corresponds to SAS code and its output
## vector "face" records the face of the dice you get every time you roll it.
face=c(rep(1,3),rep(2,7),rep(3,5),rep(4,10),rep(5,2),rep(6,3))
## Freq Procedure
Percentage=100*as.vector(table(face))/sum(table(face))
CumulativeFrequency=cumsum(c(3,7,5,10,2,3))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(face),Percentage,CumulativeFrequency,CumulativePercentage)
row.names(Freq)=NULL
Freq
## Chi-Square Test for Equal Proportions
chisq.test(table(face))


### if you used sink("dice_roll.out"), and now want to see the output inside the console use: sink()
sink()

Output

[1]  3  7  5 10  2  3
[1] 5 5 5 5 5 5
[1] -0.8944272  0.8944272  0.0000000  2.2360680 -1.3416408 -0.8944272
[1] 8.778485
[1] 0.1182326

	Chi-squared test for given probabilities

data:  c(3, 7, 5, 10, 2, 3) 
X-squared = 9.2, df = 5, p-value = 0.1013

  face Freq Percentage CumulativeFrequency CumulativePercentage
1    1    3  10.000000                   3             10.00000
2    2    7  23.333333                  10             33.33333
3    3    5  16.666667                  15             50.00000
4    4   10  33.333333                  25             83.33333
5    5    2   6.666667                  27             90.00000
6    6    3  10.000000                  30            100.00000
Important! The Pearson chi-squared statistic is \(X^2 = 9.2\) (p-value = 0.101), the deviance statistic (from the R output) is \(G^2=8.78\) (p-value = 0.118), and both follow the chi-squared sampling distribution with df = 5. It is not uncommon to observe a discrepancy between the values of these two statistics, especially for smaller sample sizes. Notice, however, that in this case they lead to the same conclusion --- there is not enough evidence to conclude that the die is unfair.

 Stop and Think!

Can you identify the relevant statistics and the \(p\)-value in the output? What does the column labeled "Percentage" in dice_rolls.out represent?

Example: Tomato Phenotypes

Tall cut-leaf tomatoes were crossed with dwarf potato-leaf tomatoes, and n = 1611 offspring were classified by their phenotypes.

Phenotypes Count
tall cut-leaf 926
tall potato-leaf 288
dwarf cut-leaf 293
dwarf potato-leaf 104

Genetic theory says that the four phenotypes should occur with relative frequencies 9 : 3 : 3 : 1 and thus are not all equally likely to be observed. The dwarf potato-leaf is less likely to be observed than the others. Do the observed data support this theory?

Under the null hypothesis, the probabilities are

\(\pi_1 = 9/16 , \pi_2 = \pi_3 = 3/16 , \pi_4 = 1/16\)

and the expected frequencies are

\(E_1 = 1611(9/16) = 906.2, E_2 = E_3 = 1611(3/16) = 302.1,\text{ and }E_4 = 1611(1/16) = 100.7\).

We calculate the fit statistics and find that \(X^2 = 1.47\) and \(G^2 = 1.48\), which are nearly identical. The \(p\)-values based on the \(\chi^2\) distribution with 3 degrees of freedom are approximately equal to 0.69. Therefore, we fail to reject the null hypothesis; the data are consistent with the genetic theory.

Here is the above analysis done in SAS.

 /*--------------------------------
| Example: cut-leaf
---------------------------------*/
data leaf;
input type $ count;
datalines;
tallc 926
tallp 288
dwarf 293
dwarfp 104
;
proc freq order=data;
weight count;
tables type /all testp=(56.25, 18.75, 18.75, 6.25);
run;

/******** computation of Pearson residuals, deviance residuals, X2 and G2 ****/

data pi;
input freq;
pi=freq/16;
cards;
9
3
3
1
;
run;

data cal;
merge leaf pi;
ecount=1611*pi;
res=(count-ecount)/sqrt(ecount);
devres=sqrt(abs(2*count*log(count/ecount)))*sign(count-ecount);
run;

proc print;run;

proc sql;
select sum((count-ecount)**2/ecount) as X2,
       1-probchi(calculated X2, 3) as pval1,
       2*sum(count*log(count/ecount)) as G2,
       1-probchi(calculated G2,3) as pval2
from cal;
quit;


/*************plot observed values and expected values**********/

goption reset=all ;
symbol1 v=circle c=blue I=none;
symbol2 v=circle c=red I=none;
Legend1 label=(' ') value=('observed count' 'expected count')
across=1 down=2
position = (top right inside)
mode=protect;

title "Tomato Phenotypes";

proc gplot data=cal;
plot count*type  ecount*type /overlay legend=legend1  ;
run;

quit;

title;

Output

The SAS System

The FREQ Procedure

 
type Frequency Percent Test Percent Cumulative Frequency Cumulative Percent
tallc 926 57.48 56.25 926 57.48
tallp 288 17.88 18.75 1214 75.36
dwarf 293 18.19 18.75 1507 93.54
dwarfp 104 6.46 6.25 1611 100.00
 
Chi-Square Test
for Specified Proportions
Chi-Square 1.4687
DF 3
Pr > ChiSq 0.6895
[Bar chart: deviations of type (relative deviation by type), Pr > ChiSq = 0.6895]

Sample Size = 1611


The SAS System

 
Obs type count freq pi ecount res devres
1 tallc 926 9 0.5625 906.188 0.65816 6.32891
2 tallp 288 3 0.1875 302.063 -0.80912 -5.24022
3 dwarf 293 3 0.1875 302.063 -0.52143 -4.22497
4 dwarfp 104 1 0.0625 100.688 0.33012 2.59476

The SAS System

 
X2 pval1 G2 pval2
1.468722 0.689508 1.477587 0.687453

[Tomato Phenotypes: plot of observed and expected counts by type]

Here is how to do the computations in R using the following code :

#### Lesson 2 Cut-Leaf example
#### (I) Basic GOF line by line calculation 
#### (II) Doing GOF with chisq.test() 
#### (III) Nice R code that corresponds to SAS code and output
#########################################################

##(I) Basic GOF line by line computation to demonstrate formulas
ob=c(926,288,293,104) ## data
ex=1611*c(9,3,3,1)/16  ## expected counts under the assumed model

X2=sum((ob-ex)^2/ex) ## X2 statistic
X2
####  Output so you check against your output
#[1] 1.468722

1-pchisq(X2,3) ## p-value
####  Output
#[1] 0.6895079

G2=2*sum(ob*log(ob/ex)) ## deviance statistic
G2
####  Output
#[1] 1.477587

1-pchisq(G2,3) ## pvalue

####  Output
#[1] 0.6874529

########################################################
#### (II) Using the built-in chisq.test function in R
tomato=chisq.test(c(926, 288,293,104), p=c(9/16, 3/16, 3/16, 1/16))
 tomato
 tomato$statistic
 tomato$p.value 
 tomato$residuals
 
#### To get G2
 
G2=2*sum(tomato$observed*log(tomato$observed/tomato$expected))
G2
1-pchisq(G2,3)

##deviance residuals
devres=sign(tomato$observed-tomato$expected)*sqrt(abs(2*tomato$observed*log(tomato$observed/tomato$expected)))
devres

##### Creating a nice output 
### Cell id, Observed count, Expected count, Pearson residuals, Deviance residual
out<-round(cbind(1:4, tomato$observed, tomato$expected, tomato$residuals, devres),3)
out<-as.data.frame(out)
names(out)<-c("cell_j", "O_j", "E_j", "res_j", "dev_j")
out

#### printing your table of results into a text file with tab separation
write.table(out, "tomato_Results", row.names=FALSE, col.names=TRUE, sep="\t")

#### Plotting expected and observed values
plot(c(1:4), tomato$observed, xlab="cell index", ylab="counts", xlim=c(0,5), ylim=c(0,1000))
points(tomato$expected, pch=3, col="red")
legend(3,700, c("observed", "expected"), col=c(1,"red"), pch=c(1,3))


########################################################
#### (III) Nice R code that corresponds to SAS code and output
type=c(rep("tallc",926),rep("tallp",288),rep("dwarf",293),rep("dwarfp",104))
##Please note the table R provides has different order of rows 
##from that provided by SAS 
table(type)
## Freq Procedure
Percentage=100*as.vector(table(type))/sum(table(type))
CumulativeFrequency=cumsum(c(926,288,293,104))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(type),Percentage,CumulativeFrequency,CumulativePercentage)
Freq
## Chi-Square Test for Specified Proportions
p=c(18.75,6.25,56.25,18.75)
chisq.test(table(type),p=p/sum(p))
########################################################


This has step-by-step calculations and also uses chisq.test() to produce output with Pearson and deviance residuals.

 Stop and Think!

Do you recall what the residuals are from linear regression? How would you define them in this context? What do they tell you about the tomato example?


2.5 - Residuals


Where does the discrepancy between the observed data and the proposed model come from? The goodness-of-fit statistics provide a summary of the overall difference, but if we want to know more specifically which cells are responsible, cell-specific residuals can be inspected for relevant clues. There are two types of residuals we will consider: Pearson and deviance residuals.

Pearson Residuals

Pearson Goodness-of-fit Test Statistic

The Pearson goodness-of-fit statistic can be written as \(X^2=\sum\limits_{j=1}^k r^2_j\) , where

\(r_j=\dfrac{X_j-n\pi_{0j}}{\sqrt{n\pi_{0j}}}\)

is called the Pearson residual for cell \(j\), and it compares the observed with the expected counts. The sign (positive or negative) indicates whether the observed frequency in cell \(j\) is higher or lower than the value implied under the null model, and the magnitude indicates the degree of departure. When data do not fit the null model, examination of the Pearson residuals often helps to diagnose where the model has failed.

How large should a "typical" value of \(r_j\) be? Recall that the expectation of a \(\chi^2\) random variable is its degrees of freedom. Thus, if a model is correct, \(E(X^2) \approx E(G^2) \approx k − 1\), and the typical size of a single \(r_{j}^{2}\) is \(\dfrac{k − 1}{k}\). Thus, if the absolute value, \(|r_j|\), is much larger than \(\sqrt{(k-1)/k}\)—say, 2.0 or more—then the model does not appear to fit well for cell \(j\).

Deviance Residuals

Although not as intuitive as the \(X^2\) statistic, the deviance statistic \(G^2=\sum\limits_{j=1}^k d^2_j\) can likewise be regarded as a sum of squared deviance residuals,

\(d_j=\sqrt{\left|2X_j\log\dfrac{X_j}{n\pi_{0j}}\right|}\times \text{sign}(X_j-n\pi_{0j})\)

The sign function can take three values:

  • -1 if \((X_j - n\pi_{0j} ) < 0\),
  • 0 if \((X_j- n\pi_{0j} ) = 0\), or
  • 1 if \((X_j- n\pi_{0j}) > 0\).

When the expected counts \(n\pi_{0j}\) are all fairly large (much greater than 5) the deviance and Pearson residuals resemble each other quite closely.

Example: Die Rolls continued

Below is a table of observed counts, expected counts, and residuals for the fair-die example; for calculations see dice_rolls.R. Unfortunately, the CELLCHI2 option in SAS that gives these residuals does NOT work for one-way tables; we will use it for higher-dimensional tables.

cell j \(O_j\) \(E_j\) \(r_j\) \(d_j\)
1 3 5 -0.89 -1.75
2 7 5 +0.89 2.17
3 5 5 +0.00 0.00
4 10 5 +2.24 3.72
5 2 5 -1.34 -1.91
6 3 5 -0.89 -1.75

The only cell that seems to deviate substantially from the fair-die model is for \(j=4\). If the die is not fair, then it may be "loaded" in favor of the outcome 4. But recall that the \(p\)-value was about .10, so the evidence against fairness is not overwhelming.

Effects of Zero Cell Counts

If an \(X_j\) is zero and all \(\pi_{0j}\)s are positive, then the Pearson \(X^2\) can be calculated without any problems, but there is a problem in computing the deviance, \(G^2\); if \(X_j = 0\) then the deviance residual is undefined, and if we use the standard formula,

\(G^2=2\sum\limits_{j=1}^k X_j\log\dfrac{X_j}{n\pi_{0j}}\)

an error will result. But if we write the deviance as

\(G^2=2\log\dfrac{L(\hat{\pi};X)}{L(\pi_0;X)}=2\log\prod\limits_{j=1}^k \left(\dfrac{X_j/n}{\pi_{0j}}\right)^{X_j}\)

then a cell with \(X_j=0\) contributes a factor of 1 to the product and may be ignored. Thus, we may calculate the deviance statistic as

\(G^2=2\sum\limits_{j:X_j>0} X_j\log\dfrac{X_j}{n\pi_{0j}}\)

Alternatively, we can set the deviance residuals to zero for cells with \(X_j=0\) and take \(G^2= \sum_j d_j^2\) as before. But if we do that, \(d_j = 0\) should not be interpreted as "the model fits well in cell \(j\)". The fit could be quite poor, especially if \(E_j\) is large.
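
A short sketch of this workaround in R, using hypothetical counts with one empty cell: sum the deviance contributions over the positive cells only.

obs <- c(0, 7, 5, 10, 2, 6)           # hypothetical counts with one empty cell
ex  <- rep(5, 6)                      # expected counts under the null
pos <- obs > 0                        # drop the zero cells from the sum
G2  <- 2 * sum(obs[pos] * log(obs[pos] / ex[pos]))
G2
1 - pchisq(G2, df = length(obs) - 1)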

If any element of vector \(\pi_0\) is zero, then \(X^2\) and \(G^2\) both break down. Simple ways to avoid such problems include putting a very small mass in all the cells, including the zero-cell(s) (e.g., \(0.05\)) or removing or combining cells and re-doing the tests.

Structural zeros and sampling zeros

Note that an observed zero in a cell does not necessarily mean that there cannot be any observation in that cell. A sampling zero is an observed zero count in a cell for which a positive value is possible. An example of this would have occurred in our die experiment if by chance the number 1 had not shown up in any of the 30 rolls. Alternatively, consider an example of classifying male and female patients by the type of cancer they have. Since only females can have ovarian cancer and only males can have prostate cancer, the corresponding cells in the table below are said to contain structural zeros, with underlying cell probabilities \(\pi_j=0\). To handle structural zeros, an adjustment to the degrees of freedom is required.

Patient Ovarian Cancer Prostate cancer
Male 0 7
Female 10 0
Total 10 7

 


2.6 - Goodness-of-Fit Tests: Unspecified Parameters


For many statistical models, we do not know the vector of probabilities \(\pi\) a priori, but can only specify it up to some unknown parameters. More specifically, the cell proportions may be known functions of one or more other unknown parameters. For example, suppose that a gene is either dominant (A) or recessive (a), and the overall proportion of dominant genes in the population is \(p\). If we assume mating is random (i.e. members of the population choose their mates in a manner that is completely unrelated to this gene), then the three possible genotypes—AA, Aa, and aa—should occur in the so-called Hardy-Weinberg proportions:

genotype proportion no. of dominant genes
AA \(\pi_1 = p^2\) 2
Aa \(\pi_2 = 2p(1 − p)\) 1
aa \(\pi_3 = (1 − p)^2\) 0

This is equivalent to saying that the number of dominant genes that an individual has (0, 1, or 2) is distributed as \(Bin(2,p)\), where the parameter \(p\) is not specified. We have to first estimate it in order to obtain the expected counts under this model.

In such an example, the null hypothesis specifies some constraint on the parameters but stops short of providing their values specifically. In more general notation, the model specifies that:

\(\pi_1 = g_1(\theta)\),
\(\pi_2 = g_2(\theta)\),
\(\vdots\)
\(\pi_k = g_k(\theta)\),

where \(g_1, \ldots, g_k\) are known functions, but the parameter \(\theta\) is unknown (e.g., \(p\) in the example above). Let \(S_0\) denote the set of all \(\pi\) that satisfy these constraints for some parameter \(\theta\). We want to test

\(H_0 \colon \pi \in S_0\) versus \(H_A \colon \pi \in S\)

where \(S\) denotes the probability simplex (the space) of all possible values of \(\pi\). (Notice that \(S\) is a \((k-1)\)-dimensional space, but the dimension of \(S_0\) is the number of free parameters in \(\theta\).) The method for conducting this test is as follows:

  1. Estimate \(\theta\) by an efficient method (e.g., maximum likelihood), and denote the estimate by \(\hat{\theta}\).
  2. Calculate (estimated) expected counts under the null hypothesis by \(n\hat{\pi}_{0j}\), where \(\hat{\pi}_{0j} = g_j(\hat{\theta})\), for \(j=1,\ldots,k\). We may still denote these as \(E_j\), even though they involve an estimated parameter.
  3. Calculate the goodness-of-fit statistics \(X^2(x,\hat{\pi}_0)\) and \(G^2(x,\hat{\pi}_0)\) using the usual formulas from the \(O_j\) and the \(E_j\).

That is, once the \(E_j\) are obtained from the estimated parameter(s), we use the usual formulas:

\(X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}\) and \(G^2=2\sum\limits_j O_j \log\dfrac{O_j}{E_j}\)

If \(X^2\) and \(G^2\) are calculated as described above, then under the null hypothesis, both are approximately  \(\chi^{2}_{\nu}\), where \(\nu\) equals the number of unknown parameters under the alternative hypothesis minus the number of unknown parameters under the null hypothesis, \(\nu = (k − 1) − d\) where \(d = dim(\theta)\), i.e., the number of parameters that required estimation.
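
As a small illustration of these steps, here is a sketch in R for the Hardy-Weinberg model above, using hypothetical genotype counts; the ML estimate of \(p\) is the sample allele frequency.

counts <- c(AA = 30, Aa = 50, aa = 20)                       # hypothetical genotype counts
n <- sum(counts)
p.hat <- unname((2*counts["AA"] + counts["Aa"]) / (2*n))     # MLE of p (allele frequency)
ex <- n * c(p.hat^2, 2*p.hat*(1 - p.hat), (1 - p.hat)^2)     # estimated expected counts
X2 <- sum((counts - ex)^2 / ex)
X2
1 - pchisq(X2, df = 3 - 1 - 1)                               # nu = (k - 1) - d with d = 1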

Example: Number of Children (the Poisson model)

Suppose that we observe the following numbers of children in \(n=100\) families:

no. of children: 0 1 2 3 4+
count: 19 26 29 13 13

Are these data consistent with a Poisson distribution? Recall that if a random variable \(Y\) has a Poisson distribution with mean \(\lambda\), then

\(P(Y=y)=\dfrac{\lambda^y e^{-\lambda}}{y!}\)

for \(y=0,1,\ldots\). Therefore, under the Poisson model, the cell proportions, given some unknown \(\lambda\), are as provided in the table below. For example, \(\pi_1=P(Y=0)=\dfrac{\lambda^0 e^{-\lambda}}{0!}=e^{-\lambda}\).

no. of children proportion
0 \(\pi_1 = e^{-\lambda}\)
1 \(\pi_2 = \lambda e^{-\lambda}\)
2 \(\pi_3 = \dfrac{\lambda^2e^{-\lambda}}{2}\)
3 \(\pi_4 = \dfrac{\lambda^3e^{-\lambda}}{6}\)
4+ \(\pi_5 = 1 - \sum^{4}_{j=1} \pi_j\)


Let's test the null hypothesis that these data are Poisson. First, we need to estimate \(\lambda\), the mean of the Poisson distribution, and thus here \(d=1\). Recall that if we have an iid sample \(y_1, \ldots , y_n\) from a Poisson distribution, then the ML estimate of \(\lambda\) is just the sample mean, \(\hat{\lambda}=\overline{Y}\). Based on the table above, we know that the original data \(y_1,\ldots , y_n\) contained 19 values of 0, 26 values of 1, and so on; however, we don't know the exact values of the original data that fell into the category 4+. To make matters easy, suppose that of the 13 values that were classified as 4+, ten were equal to 4 and three were equal to 5. Then the ML estimate of \(\lambda\) is (the sample mean)

\(\hat{\lambda}=\dfrac{19(0)+26(1)+29(2)+13(3)+10(4)+3(5)}{100}=1.78\)

With this value of \(\lambda\), the (estimated) expected counts for the first four cells (0, 1, 2, and 3 children, respectively) are

  • \(E_1 = 100e^{-1.78} = 16.86\)
  • \(E_2 = 100(1.78)e^{-1.78} = 30.02\)
  • \(E_3 = 100(1.78)^2e^{-1.78}/2 = 26.72\)
  • \(E_4 = 100(1.78)^3e^{-1.78}/6 = 15.85\)

The expected count for the 4+ cell is most easily found by noting that \(\sum_j E_j = n\), and thus

\(E_5 = 100 − (16.86 + 30.02 + 26.72 + 15.85) = 10.55\)

This leads to:

\(X^2 = 2.08\) and \(G^2 = 2.09\).

Since the multinomial model here has \(k=5\) categories (with the usual sum-to-\(n\) constraint), and the Poisson model required one parameter \(\lambda\) to be estimated, the degrees of freedom for this test are \(\nu = 5-1-1 = 3\), and the \(p\)-values are

\(P(\chi^2_3\geq2.08)=0.56\)

\(P(\chi^2_3\geq2.09)=0.55\)

The Poisson distribution seems to be a reasonable model for these data; at least, there is not significant evidence to say otherwise. Below is an example of these computations using R and SAS.

Here is this goodness-of-fit test in SAS, using pre-calculated proportions (\(\hat{\pi}_{0}\)). The TESTP option specifies expected proportions for a one-way table chi-square test. But notice in the output below that the degrees of freedom, and thus the p-value, are not correct:

/*Test of proportions for number of childern example*/
/*To test the poisson model for specified probabilities*/

options ls=90 nocenter nodate;

DATA children;
        INPUT children $ count;
        DATALINES;
        0 19
        1 26
        2 29
        3 13
        4+ 13
        ;
RUN;

PROC FREQ DATA=children ORDER=data;
        WEIGHT count;
        tables children/chisq testp=(16.86, 30.02, 26.72, 15.86, 10.55);
RUN;

Output

 
Chi-Square Test
for Specified Proportions
Chi-Square 2.0892
DF 4
Pr > ChiSq 0.7194

You can use this \(X^2\) statistic, but need to calculate the new p-value based on the correct degrees of freedom, in order to obtain correct inference.

Here is how to fit the Poisson Model in R using the following code :

########################################################
#### Number of Children Example
#### Basic Poisson calculations broken down line by line 
#### Nicer R code that corresponds to SAS code and its output
#########################################################

#### input data 
ob<-c(19,26,29,13,13) 
ob

# [1] 19 26 29 13 13


#### find estimated expected probabilities 

lambdahat<-c(19*0+26*1+29*2+13*3+10*4+3*5)/100 
lambdahat

# [1] 1.78 


kids<-c(0,1,2,3) 

pihat<-dpois(kids,lambdahat) 
pihat 

# [1] 0.1686381 0.3001759 0.2671566 0.1585129 


#### attach the probability for the 4+ cell 

pihat<-c(pihat,1-sum(pihat)) 
ex<-100*pihat 
X2<-sum((ob-ex)^2/ex) 
X2 

# [1] 2.084625 

G2<-2*sum(ob*log(ob/ex)) 
G2 

# [1] 2.088668 


#### find the p-value for X^2

1-pchisq(X2,3)  

# [1] 0.5550296 



#### find the p-value for G^2 

1-pchisq(G2,3) 

# [1] 0.5542087 

#############################################################
#### Nicer R code that corresponds to SAS code and its output

children=c(rep("0",19),rep("1",26),rep("2",29),rep(3,13),rep("4+",13))

#### Freq Procedure

Percentage=100*as.vector(table(children))/sum(table(children))
CumulativeFrequency=cumsum(c(19,26,29,13,13))
CumulativePercentage=cumsum(Percentage)
Freq=data.frame(table(children),Percentage,CumulativeFrequency,CumulativePercentage)
row.names(Freq)=NULL
Freq

#### Chi-Square Test for Specified Proportions

p=c(16.86,30.02,26.72,15.86,10.55)
chisq.test(table(children),p=p/sum(p))

The function dpois() calculates Poisson probabilities. You can also get the \(X^2\) in R by using the function chisq.test(ob, p=pihat) in the above code, but notice in the output below that the degrees of freedom, and thus the p-value, are not correct:

Chi-squared test for given probabilities

data: ob
X-squared = 2.0846, df = 4, p-value = 0.7202

You can use this \(X^2\) statistic, but need to calculate the new p-value based on the correct degrees of freedom, in order to obtain correct inference.


2.7 - Lesson 2 Summary


In this lesson, we made a distinction among three types of large-sample hypothesis tests and related confidence intervals used in the analysis of categorical data: Wald, Likelihood-Ratio, and Score. The Wald test is the most widely used of the three. For example, we will see when we fit a logistic regression model that the z-statistics and confidence intervals reported for the regression parameter estimates are Wald statistics and intervals.

Lesson 2 also introduced the goodness-of-fit test and the corresponding residuals. The distinction was made between goodness-of-fit tests with known versus unknown population parameters. This idea of comparing the assumed model (\(H_0\)) to the observed data, and assessing how closely they agree, will be used throughout the course, as will these test statistics. For example, in the next lesson, we will see that the chi-square test of independence is just a goodness-of-fit test in which the assumed model (\(H_0\)) of independence between two random variables is compared to the saturated model, that is, to the observed data.

