2.2 - Tests and CIs for a Binomial Parameter

For the discussion here, we assume that \(X_1,\ldots,X_n\) are a random sample from the Bernoulli (or binomial with \(n=1\)) distribution with success probability \(\pi\) so that, equivalently, \(Y=\sum_i X_i\) is binomial with \(n\) trials and success probability \(\pi\). In either case, the MLE of \(\pi\) is \(\hat{\pi}=\overline{X}=Y/n\), with \(E(\hat{\pi})=\pi\) and \(V(\hat{\pi})=\pi(1-\pi)/n\), and the quantity

\(\dfrac{\overline{X}-\mu}{\sigma/\sqrt{n}} \)

is approximately standard normal for large \(n\), where here \(\mu=E(X_i)=\pi\) and \(\sigma^2=V(X_i)=\pi(1-\pi)\). The following approaches make use of this.

Wald Test and CI

If we wish to test the null hypothesis \(H_0\colon \pi=\pi_0\) versus \(H_a\colon \pi\ne\pi_0\) for some specified value \(\pi_0\), the Wald test statistic uses the MLE for \(\pi\) in the variance expression:

Wald Test Statistic

\(Z_w=\displaystyle \dfrac{\hat{\pi}-\pi_0}{\sqrt{\hat{\pi}(1-\hat{\pi})/n}} \)

For large \(n\), \(Z_w\) is approximately standard normal (the MLE is very close to \(\pi\) when \(n\) is large), so we can use standard normal quantiles as thresholds for statistical significance. That is, at significance level \(\alpha\), we reject \(H_0\) if \(|Z_w|\ge z_{\alpha/2}\), where \(z_{\alpha/2}\) is the upper \(\alpha/2\) quantile of the standard normal distribution. We take the absolute value because we want to reject \(H_0\) if the MLE differs significantly from \(\pi_0\) in either direction.

Note that \(\pi_0\) will not be rejected if

\(\displaystyle -z_{\alpha/2}<\dfrac{\hat{\pi}-\pi_0}{\sqrt{\hat{\pi}(1-\hat{\pi})/n}} < z_{\alpha/2} \)

Rearranging slightly, we can write the equivalent result

\(\displaystyle \hat{\pi} -z_{\alpha/2}\sqrt{\dfrac{\hat{\pi}(1-\hat{\pi})}{n}} < \pi_0 < \hat{\pi} +z_{\alpha/2}\sqrt{\dfrac{\hat{\pi}(1-\hat{\pi})}{n}} \)

The limits are the familiar \((1-\alpha)100\%\) confidence interval for \(\pi\), which we refer to as the Wald interval:

\(\displaystyle \hat{\pi} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{\pi}(1-\hat{\pi})}{n}} \  \text{(Wald Interval)}\)

The most common choice for \(\alpha\) is 0.05, which gives 95% confidence and multiplier \(z_{.025}=1.96\). We should also mention here that one-sided confidence intervals can be constructed from one-sided tests, although they're less common.

Example: Smartphone users

To demonstrate these results in a simple example, suppose in our sample of 20 smartphone users that we observe six who use Android. Then, the MLE for \(\pi\) is \(6/20=0.30\), and we can be 95% confident that the true value of \(\pi\) is within

\(\displaystyle 0.30 \pm 1.96\sqrt{\dfrac{0.30(1-0.30)}{20}} = (0.0992, 0.5008) \)
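The link between the Wald test and the Wald interval can be checked numerically on these data. The sketch below is only an illustration: the null value `pi0 = 0.5` is a hypothetical choice that does not come from the example, and the variable names are ours.

```r
# Smartphone data from the example: 6 Android users out of 20
y <- 6
n <- 20
pi.hat <- y / n
pi0 <- 0.5       # hypothetical null value, chosen only for illustration
alpha <- 0.05

se.wald <- sqrt(pi.hat * (1 - pi.hat) / n)   # standard error evaluated at the MLE
z.w <- (pi.hat - pi0) / se.wald              # Wald test statistic
p.value <- 2 * pnorm(-abs(z.w))              # two-sided p-value
z.crit <- qnorm(1 - alpha / 2)               # upper alpha/2 quantile
reject <- abs(z.w) >= z.crit

# Wald interval: the set of pi0 values the Wald test would not reject
wald.ci <- pi.hat + c(-1, 1) * z.crit * se.wald
inside <- pi0 > wald.ci[1] && pi0 < wald.ci[2]
```

With this choice of `pi0`, `reject` and `inside` should always disagree: a null value is rejected exactly when it falls outside the interval.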

Score Test and CI

Another popular way to test \(H_0\colon \pi=\pi_0\) is with the score test statistic:

Score Test Statistic

\(Z_s=\displaystyle \dfrac{\hat{\pi}-\pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \)

Considering that a hypothesis test proceeds by assuming the null hypothesis is true until significant evidence shows otherwise, it makes sense to use \(\pi_0\) in place of \(\pi\) in both the mean and variance of \(\hat{\pi}\). The score test thus rejects \(\pi_0\) when \(|Z_s|\ge z_{\alpha/2}\), or equivalently, will not reject \(\pi_0\) when

\(\displaystyle -z_{\alpha/2}<\dfrac{\hat{\pi}-\pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} < z_{\alpha/2} \)
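As a minimal sketch of the score test on the smartphone data, again assuming a hypothetical null value `pi0 = 0.5` that is not part of the example, the statistic can be computed directly and compared with base R's `prop.test`, whose chi-square statistic without continuity correction is the square of \(Z_s\).

```r
# Smartphone data from the example: 6 Android users out of 20
y <- 6
n <- 20
pi.hat <- y / n
pi0 <- 0.5       # hypothetical null value, chosen only for illustration

se.null <- sqrt(pi0 * (1 - pi0) / n)   # standard error evaluated under H0
z.s <- (pi.hat - pi0) / se.null        # score test statistic
p.value <- 2 * pnorm(-abs(z.s))        # two-sided p-value

# Cross-check: prop.test without continuity correction reports Z_s^2
pt <- prop.test(y, n, p = pi0, correct = FALSE)
c(z.s.squared = z.s^2, prop.test.statistic = unname(pt$statistic))
```

Because the denominator uses \(\pi_0\) rather than \(\hat{\pi}\), the score statistic here is smaller in magnitude than the Wald statistic for the same null value.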

Carrying out the score test is straightforward enough when a particular value \(\pi_0\) is to be tested. Constructing a confidence interval as the set of all \(\pi_0\) that would not be rejected requires a bit more work. Specifically, the score confidence interval limits are the roots of the equation \(|Z_s|- z_{\alpha/2}=0\), or equivalently \(Z_s^2=z_{\alpha/2}^2\), which is quadratic in \(\pi_0\) and can be solved with the quadratic formula. The full expressions are in the R code below, but the center of this interval is particularly noteworthy (we'll let \(z\) stand for \(z_{\alpha/2}\) for convenience):

\( \displaystyle \dfrac{\hat{\pi}+z^2/(2n)}{1+z^2/n} \)
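For readers who want the intermediate algebra, here is a sketch of how the limits and the center arise from squaring the rejection boundary. It uses only the symbols already defined, with \(z=z_{\alpha/2}\). The last line rewrites the center as a weighted average of \(\hat{\pi}\) and \(1/2\).

```latex
\begin{aligned}
Z_s^2 = z^2
  &\iff n(\hat{\pi}-\pi_0)^2 = z^2\,\pi_0(1-\pi_0) \\
  &\iff \left(1+\frac{z^2}{n}\right)\pi_0^2
        - \left(2\hat{\pi}+\frac{z^2}{n}\right)\pi_0 + \hat{\pi}^2 = 0 \\
  &\iff \pi_0 = \frac{\hat{\pi}+\dfrac{z^2}{2n}
        \pm z\sqrt{\dfrac{\hat{\pi}(1-\hat{\pi})}{n}+\dfrac{z^2}{4n^2}}}
        {1+\dfrac{z^2}{n}} \\[6pt]
\text{center}
  &= \frac{\hat{\pi}+\dfrac{z^2}{2n}}{1+\dfrac{z^2}{n}}
   = \frac{n}{n+z^2}\,\hat{\pi} + \frac{z^2}{n+z^2}\cdot\frac{1}{2}
\end{aligned}
```

The weight on \(1/2\) is \(z^2/(n+z^2)\), so it shrinks as \(n\) grows and increases with the confidence level.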

Note that this center is close to the MLE \(\hat{\pi}\), but it is pushed slightly toward \(1/2\) by an amount that depends on the confidence level and sample size. This helps the interval's coverage probability when the sample size is small, particularly when \(\pi\) is close to 0 or 1. Recall that we're using a normal approximation for \(\hat{\pi}\) with mean \(\pi\) and variance \(\pi(1-\pi)/n\). The images below illustrate how this approximation depends on values of \(\pi\) and \(n\).

Figure 1.3

Problems arise when too much of the normal curve falls outside the (0,1) boundaries allowed for \(\pi\). In the first case, the sample size is small, but \(\pi=0.3\) is far enough away from the boundary that the normal approximation is still useful, whereas in the second case, the normal approximation is quite poor. The larger sample size in the third case decreases the variance enough to offset the small \(\pi\) value.
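One rough way to quantify "too much of the normal curve outside (0,1)" is to compute that probability mass directly. The sketch below does this for three hypothetical \((\pi, n)\) pairs chosen by us to mimic the three situations described. Apart from \(\pi=0.3\), they are not necessarily the values used in the figure, and the function name `mass.outside` is ours.

```r
# Probability that a N(p, p(1-p)/n) variable falls outside the interval (0, 1)
mass.outside <- function(p, n) {
  s <- sqrt(p * (1 - p) / n)                        # sd of the approximation
  pnorm(0, mean = p, sd = s) +                      # mass below 0
    pnorm(1, mean = p, sd = s, lower.tail = FALSE)  # mass above 1
}

# Hypothetical cases (assumed for illustration only)
small.n.mid.p <- mass.outside(0.30, 20)    # small n, pi away from the boundary
small.n.low.p <- mass.outside(0.05, 20)    # small n, pi near 0
large.n.low.p <- mass.outside(0.05, 200)   # larger n, pi near 0
c(small.n.mid.p, small.n.low.p, large.n.low.p)
```

Under these assumed values, only the middle case puts a substantial share of the curve below 0.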

In practice, the results of a poor normal approximation tend to be intervals that include values outside the range of (0,1), which we know cannot apply to \(\pi\). The score interval performs better than the Wald in these situations because it shifts the center of the interval closer to 0.5. Compare the score interval limits (below) to those of the Wald when applied to the smartphone data.

\( (0.1455, 0.5190) \)


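# Wald and score confidence intervals for a binomial proportion
# y = number of successes, n = number of trials, conf = confidence level (e.g., 0.95)
ci = function(y,n,conf)
{   pi.hat = y/n                    # MLE of pi
    z = qnorm(1-(1-conf)/2)         # upper alpha/2 standard normal quantile
    # Wald limits: centered at the MLE
    wald = pi.hat+c(-1,1)*z*sqrt(pi.hat*(1-pi.hat)/n)
    # score limits: the two roots of the quadratic in pi_0
    score = (pi.hat+z^2/2/n+c(-1,1)*z*sqrt(pi.hat*(1-pi.hat)/n+z^2/4/n^2))/(1+z^2/n)
    cbind(wald, score) }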
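The claim that the score interval has better coverage when \(\pi\) is near the boundary can be checked exactly for a given \(n\) and \(\pi\), since \(Y\) takes only \(n+1\) values. The following is a minimal sketch: the function name `coverage` and the example values \(n=20\), \(\pi=0.05\) are our assumptions, not part of the original example.

```r
# Exact coverage probability of the Wald and score intervals:
# add up the binomial probabilities of every y whose interval contains p
coverage <- function(n, p, conf = 0.95) {
  y <- 0:n
  pi.hat <- y / n
  z <- qnorm(1 - (1 - conf) / 2)

  # Wald interval for each possible y
  wald.half <- z * sqrt(pi.hat * (1 - pi.hat) / n)
  wald.covers <- (pi.hat - wald.half <= p) & (p <= pi.hat + wald.half)

  # Score interval for each possible y
  center <- (pi.hat + z^2 / (2 * n)) / (1 + z^2 / n)
  score.half <- z * sqrt(pi.hat * (1 - pi.hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  score.covers <- (center - score.half <= p) & (p <= center + score.half)

  probs <- dbinom(y, n, p)
  c(wald = sum(probs[wald.covers]), score = sum(probs[score.covers]))
}

# Hypothetical setting: small sample, pi close to 0
cov <- coverage(20, 0.05)
cov
```

In this setting the Wald interval collapses to a single point when \(y=0\), so it misses \(\pi\) for the most likely outcome, and its coverage falls well below the nominal 95%.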

Likelihood Ratio Test and CI

Our third approach to binomial inference follows the same idea of inverting a test statistic to construct a confidence interval but utilizes the likelihood ratio test (LRT) for the binomial parameter. Recall the likelihood function for \(Y\sim Bin(n,\pi)\):

\(\displaystyle \ell(\pi)= {n\choose y}\pi^y(1 - \pi)^{(n-y)}\)

The LRT statistic for \(H_0:\pi=\pi_0\) versus \(H_a:\pi\ne\pi_0\) is

\(\displaystyle G^2=2\log\dfrac{\ell(\hat{\pi})}{\ell(\pi_0)} = 2\left(y\log\dfrac{\hat{\pi}}{\pi_0}+(n-y)\log\dfrac{1-\hat{\pi}}{1-\pi_0}\right) \)

For large \(n\), \(G^2\) is approximately chi-square with one degree of freedom, and \(\pi_0\) will be rejected if \(G^2\ge \chi^2_{1,\alpha}\). Like the Wald and score test statistics, the LRT statistic is essentially a measure of disagreement between the sample estimate and the hypothesized value for \(\pi\). Larger values indicate more disagreement and more evidence to reject \(H_0\). And we can likewise construct a confidence interval as the set of all values of \(\pi_0\) that would not be rejected. Unfortunately, we must resort to numerical approximation for these limits. Here are the results for the smartphone data.

\( (0.1319, 0.5165) \)

Like the score interval, the LRT interval is centered at a value closer to 0.5 than the Wald interval, which is centered at the MLE.

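library(rootSolve)   # provides uniroot.all

# Wald, score, and LRT confidence intervals for a binomial proportion
ci = function(y,n,conf)
{   pi.hat = y/n                    # MLE of pi
    z = qnorm(1-(1-conf)/2)         # upper alpha/2 standard normal quantile
    wald = pi.hat+c(-1,1)*z*sqrt(pi.hat*(1-pi.hat)/n)
    score = (pi.hat+z^2/2/n+c(-1,1)*z*sqrt(pi.hat*(1-pi.hat)/n+z^2/4/n^2))/(1+z^2/n)
    # G^2 minus its critical value; z^2 equals the upper-alpha chi-square(1) quantile
    loglik = function(p) 2*(y*log(pi.hat/p)+(n-y)*log((1-pi.hat)/(1-p)))-z^2
    # the two roots in (0.01, 0.99) are the LRT limits (requires 0 < y < n)
    lrt = uniroot.all(loglik,c(0.01,0.99))
    cbind(wald,score,lrt) }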
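As an alternative sketch that needs no extra package, the LRT statistic and its interval can be computed with base R's `uniroot`, searching separately on each side of the MLE. The function names `g2` and `lrt.ci`, the bracket offset `eps`, and the null value 0.5 used in the usage line are our assumptions. The code assumes \(0<y<n\) so that both logarithms are finite.

```r
# LRT statistic G^2 for H0: pi = pi0 (assumes 0 < y < n)
g2 <- function(pi0, y, n) {
  pi.hat <- y / n
  2 * (y * log(pi.hat / pi0) + (n - y) * log((1 - pi.hat) / (1 - pi0)))
}

# LRT interval: values of pi0 with G^2 below the chi-square(1) critical value
lrt.ci <- function(y, n, conf = 0.95) {
  pi.hat <- y / n
  crit <- qchisq(conf, df = 1)
  f <- function(p) g2(p, y, n) - crit   # negative at pi.hat, positive near 0 and 1
  eps <- 1e-8                           # keeps the brackets strictly inside (0, 1)
  lower <- uniroot(f, c(eps, pi.hat), tol = 1e-10)$root
  upper <- uniroot(f, c(pi.hat, 1 - eps), tol = 1e-10)$root
  c(lower = lower, upper = upper)
}

# Smartphone data; 0.5 is a hypothetical null value for illustration
g2.stat <- g2(0.5, 6, 20)
p.value <- pchisq(g2.stat, df = 1, lower.tail = FALSE)
limits <- lrt.ci(6, 20)
```

The limits should agree with those reported above to about three decimal places. Small differences in the last digit can arise from the tolerance of the root finder.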
