The estimation approach to determining sample size addresses the question: "How accurate do you want your estimate to be?" In this case, we are estimating the difference in means. This approach requires us to specify how large a difference we are interested in detecting, say *B* for the Bound on the margin of error, and then to specify how certain we want to be that we can detect a difference that large. Recall that when we assume equal sample sizes of *n*, a confidence interval for \(\mu_1-\mu_2\) is given by:

\(\left\{\bar{Y}_1-\bar{Y}_2 \pm t(1-\alpha/2;df)\cdot s\cdot \sqrt{\frac{2}{n}}\right\}\)

where *n* is the sample size for each group, *df* = *n* + *n* - 2 = 2(*n* - 1), and *s* is the pooled standard deviation. Therefore, we first specify *B* and then solve this equation:

\(B=t(1-\alpha/2;df)\cdot s\cdot \sqrt{\frac{2}{n}}\)

for *n*. Therefore,

\(n=\left[t(1-\alpha/2;df)\cdot s\cdot \frac{\sqrt{2}}{B}\right]^2=\left[\dfrac{t^2(1-\alpha/2;df)\cdot s^2\cdot 2}{B^2}\right]\)

Since in practice we don't know what *s* will be prior to collecting the data, we will need a guesstimate of \(\sigma\) to substitute into this equation. To do this by hand we use *z* rather than *t*, since we don't know the *df* without knowing the sample size *n*. The computer will iteratively update the *df* as it computes the sample size, giving a slightly larger sample size when *n* is small.
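The iteration described above can be sketched in a few lines of Python. This is an illustrative sketch, not from the source: the function name and the use of `scipy.stats` for the *t* quantile are my choices. It starts from the *z*-based answer and keeps updating *df* = 2(*n* - 1) until *n* stabilizes.

```python
# Sketch of the iterative sample-size computation for comparing two means.
# Assumes scipy is available; the function name is hypothetical.
import math
from scipy import stats

def sample_size_two_means(s, B, alpha=0.05, max_iter=100):
    """Per-group n so that the margin of error for mu1 - mu2 is at most B."""
    z = stats.norm.ppf(1 - alpha / 2)
    n = math.ceil(2 * (z * s / B) ** 2)      # z-based starting value
    for _ in range(max_iter):
        df = 2 * (n - 1)
        t = stats.t.ppf(1 - alpha / 2, df)   # update df, then recompute n
        n_new = math.ceil(2 * (t * s / B) ** 2)
        if n_new == n:                        # converged
            break
        n = n_new
    return n

# Example: guesstimate sigma = 10, desired bound B = 5, 95% confidence.
print(sample_size_two_means(s=10, B=5, alpha=0.05))
```

Note that the *z*-based formula alone gives *n* = 31 here, while iterating with the *t* quantile nudges it up slightly, which is the "slightly larger sample size when *n* is small" behavior mentioned above.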

So we need an estimate of \(\sigma^2\), a desired bound *B* on the margin of error, and a confidence level \(1-\alpha\). With these, we can determine the sample size in this comparative type of experiment. We may or may not have direct control over \(\sigma^2\), but by using different experimental designs we do have some control over it, as we will address later in this course. In most cases, an estimate of \(\sigma^2\) is needed in order to determine the sample size.

One special extension of this method is the binomial situation. When we are estimating proportions rather than some quantitative mean level, we know that the variance *p*(1-*p*) is largest when *p* (the true proportion) equals 0.5, and then we have a simpler approximate sample size formula, namely \(n = 2/B^2\) for \(\alpha = 0.05\).
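A quick sketch, using only the standard library, shows where the \(2/B^2\) rule of thumb comes from: substituting the worst-case variance \(p(1-p) = 0.25\) into \(n = 2z^2 p(1-p)/B^2\) gives \(2(1.96)^2(0.25) \approx 1.92 \approx 2\). The function name here is my own, for illustration.

```python
# Worst-case sample size per group for comparing two proportions.
import math
from statistics import NormalDist

def sample_size_two_proportions(B, alpha=0.05, p=0.5):
    """Per-group n to estimate p1 - p2 within margin B; p = 0.5 is worst case."""
    z = NormalDist().inv_cdf(1 - alpha / 2)          # z quantile, e.g. 1.96
    return math.ceil(2 * z**2 * p * (1 - p) / B**2)  # exact z-based formula

# With B = 0.1 and alpha = 0.05, the exact formula gives 193 per group,
# close to the rule-of-thumb 2 / B^2 = 200.
print(sample_size_two_proportions(B=0.1))
```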

## Another Two-Sample Example – Paired Samples

In the paired sample situation, we have a group of subjects where each subject has two measurements taken. For example, blood pressure was measured before and after a treatment was administered for five subjects. These are not independent samples: for each subject, two measurements are taken, which are typically correlated – hence we call this paired data. If we perform a two-sample independent *t*-test, ignoring the pairing, we lose the benefit of the pairing, and the variability among subjects becomes part of the error. By using a paired *t*-test, the analysis is based on the differences (after – before), and thus any variation among subjects is eliminated.

In our Minitab output, we show the example with Blood Pressure on five subjects.

By viewing the output, we see that the different patients' blood pressures seem to vary a lot (standard deviation about 12) but the treatment seems to make a small but consistent difference with each subject. Clearly, we have a nuisance factor involved - the subject - which is causing much of this variation. This is a stereotypical situation where, because the observations are paired and correlated, we should do a paired *t*-test.

These results show that by using a paired design and taking into account the pairing of the data we have reduced the variance. Hence our test gives a more powerful conclusion regarding the significance of the difference in means.
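The contrast between the two analyses can be sketched with made-up numbers (these are not the Minitab data from the example, just illustrative values with large subject-to-subject variation and a small, consistent treatment effect):

```python
# Hypothetical blood-pressure data for five subjects, before and after treatment.
from scipy import stats

before = [140, 155, 122, 168, 131]
after  = [135, 151, 118, 162, 127]   # each subject drops by about 4-6 units

# Ignoring the pairing: subject-to-subject variability inflates the error term,
# so the small treatment effect is swamped.
t_ind, p_ind = stats.ttest_ind(before, after)

# Paired analysis: based on the within-subject differences, so the
# subject-to-subject variation is eliminated.
t_pair, p_pair = stats.ttest_rel(before, after)

print(f"independent-samples p-value: {p_ind:.3f}")
print(f"paired-samples p-value:      {p_pair:.4f}")
```

With data like these, the independent-samples test is far from significant while the paired test detects the difference easily, which is exactly the gain in power described above.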

The paired *t*-test is our first example of a blocking design. In this context, the subject is used as a *block*, and the results from the paired *t*-test are identical to what we will find when we analyze this as a Randomized Complete Block Design in Lesson 4.