Random error (variability, imprecision) can be overcome by increasing the sample size. This is illustrated in this section via *hypothesis testing* and *confidence intervals*, two accepted forms of statistical inference.

## Review of Hypothesis Testing

In hypothesis testing, a null hypothesis and an alternative hypothesis are formed. Typically, the null hypothesis reflects the lack of an effect and the alternative hypothesis reflects the presence of an effect (supporting the research hypothesis). The investigator needs to have sufficient evidence, based on data collected in a study, to reject the null hypothesis in favor of the alternative hypothesis.

Suppose an investigator is conducting a two-armed clinical trial in which subjects are randomized to group A or group B, and the outcome of interest is the change in serum cholesterol after 8 weeks. Because the outcome is measured on a continuous scale, the hypotheses are stated as:

\(H_0\colon \mu_A = \mu_B \) versus \(H_1\colon \mu_A \ne \mu_B\)

where \(\mu_{A}\) and \(\mu_{B}\) represent the population means for groups A and B, respectively.

The alternative hypothesis of \(H_1\colon \mu_{A} \ne \mu_{B}\) is labeled a “two-sided alternative” because it does not indicate whether A is better than B or vice versa. Rather, it just indicates that A and B are different. A “one-sided alternative” of \(H_1\colon \mu_{A}< \mu_{B}\) (or \(H_1\colon \mu_{A} > \mu_{B}\)) is possible, but it is more conservative to use the two-sided alternative.

The investigator conducts a study to test this hypothesis with 40 subjects in each of group A and group B \(\left(n_{A} = 40 \text{ and } n_{B} = 40\right)\). The investigator estimates the population means via the sample means (labeled \(\bar{x}_A\) and \(\bar{x}_B\), respectively). Suppose the observed average changes are \(\bar{x}_A = 7.3\) and \(\bar{x}_B = 4.8 \text{ mg/dl}\). Do these data provide enough evidence to reject the null hypothesis that the mean changes in the two populations are equal? (The question cannot be answered yet; we do not know whether this is a statistically significant difference!)

If the data approximately follow a normal distribution or are from large enough samples, then a two-sample *t* test is appropriate for comparing groups A and B where:

\(t = (\bar{x}_A - \bar{x}_B) / (\text{standard error of } \bar{x}_A - \bar{x}_B)\).

We can think of the two-sample *t* test as a signal-to-noise ratio and ask whether the signal is large enough relative to the noise. In the example, \(\bar{x}_A = 7.3\) and \(\bar{x}_B = 4.8 \text{ mg/dl}\). If the standard error of \(\bar{x}_A - \bar{x}_B\) is 1.2 mg/dl, then:

\( t_{obs} = (7.3 - 4.8) / 1.2 = 2.1\)

But what does this value mean?

Each *t* value has associated probabilities. In this case, we want to know the probability of observing a *t* value as extreme or more extreme than the *t* value actually observed, if the null hypothesis is true. This is the *p*-value. At the completion of the study, a statistical test is performed and its corresponding *p*-value calculated. If the *p*-value \(< \alpha\), then \(H_0\) is rejected in favor of \(H_1\).
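As an illustrative sketch (not part of the original example's analysis), the calculation can be reproduced in Python using the standard normal approximation to the *t* distribution, which is reasonable here because each group has 40 subjects. The function name is our own:

```python
import math

def two_sided_p_normal(t_obs: float) -> float:
    """Two-sided p-value for an observed t statistic, using the
    standard normal approximation: P(|Z| > t_obs) = erfc(|t_obs|/sqrt(2))."""
    return math.erfc(abs(t_obs) / math.sqrt(2))

# Signal-to-noise: difference in sample means over its standard error.
t_obs = (7.3 - 4.8) / 1.2        # about 2.08, reported as 2.1 in the text
p = two_sided_p_normal(t_obs)    # about 0.04
```

With the exact *t* distribution on 78 degrees of freedom the *p*-value is very close to this normal-approximation value.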

Two types of errors can be made in testing hypotheses: rejecting the null hypothesis when it is true or failing to reject the null hypothesis when it is false. The probability of making a Type I error, represented by \(\alpha\) (the significance level), is determined by the investigator prior to the onset of the study. Typically, \(\alpha\) is set at a low value, say 0.01 or 0.05.

The following table summarizes the possible decisions and their consequences.

| Decision | Reality: \(H_0\) is true | Reality: \(H_0\) is false |
|---|---|---|
| Reject \(H_0\) (conclude \(H_a\)) | Type I error | Correct decision |
| Fail to reject \(H_0\) | Correct decision | Type II error |
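The Type I error rate has a concrete frequency interpretation: if \(H_0\) is true and we test at \(\alpha = 0.05\), roughly 5% of repeated studies will reject anyway. A small simulation sketch (illustrative only; the group size of 40 and \(\sigma = 4\) mirror the serum cholesterol example) demonstrates this:

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the simulation is reproducible

def two_sample_t(xs, ys):
    """Pooled-variance two-sample t statistic (equal group sizes)."""
    n = len(xs)
    sp2 = (statistics.variance(xs) + statistics.variance(ys)) / 2
    se = math.sqrt(sp2 * (2 / n))
    return (statistics.mean(xs) - statistics.mean(ys)) / se

# Simulate many trials in which H0 is TRUE (both groups have mean 0, sd 4).
# The fraction of trials that (wrongly) reject should be close to alpha.
trials, n, rejections = 2000, 40, 0
for _ in range(trials):
    a = [random.gauss(0, 4) for _ in range(n)]
    b = [random.gauss(0, 4) for _ in range(n)]
    if abs(two_sample_t(a, b)) > 1.96:  # ~ two-sided 0.05 cutoff
        rejections += 1

type_i_rate = rejections / trials       # close to 0.05
```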

In our example, the *p*-value \(= \Pr\left(|t| > 2.1\right) = 0.04\).

Thus, the null hypothesis of equal mean change in the two populations is rejected at the 0.05 significance level. The treatments differed in the mean change in serum cholesterol at 8 weeks.

Note that \(\beta\) (the probability of not rejecting \(H_0\) when it is false) did not play a role in the test of hypothesis.

The importance of \(\beta\) came into play during the design phase when the investigator attempted to determine the appropriate sample size for the study. To do so, the investigator had to decide on the *effect size* of interest, i.e., a clinically meaningful difference between groups A and B in the average change in cholesterol at 8 weeks. The statistician cannot determine this but can help the researcher decide whether he has the resources to have a reasonable chance of observing the desired effect or should rethink his proposed study design.

The effect size is expressed as: \(\delta = \mu_{A} - \mu_{B}\).

The sample size should be determined such that the test of hypothesis, conducted at significance level \(\alpha\), has good statistical power \(1 - \beta\) (e.g., \(\beta = 0.1\) or \(0.2\), corresponding to 90% or 80% power) for detecting this effect size.

A *sample size formula* that can be used for a two-sided, two-sample test with \(\alpha = 0.05\) and \(\beta = 0.1\) (90% statistical power) is:

\(n_A = n_B = 21\sigma^{2}/\delta^{2}\)

where \(\sigma\) is the population standard deviation (more detailed information will be discussed in a later lesson).

Note that the sample size increases as σ increases (noise increases).

Note that the sample size increases as \(\delta\) decreases (effect size decreases).

In the serum cholesterol example, the investigator had selected a meaningful difference, \(\delta = 3.0 \text{ mg/dl}\) and located a similar study in the literature that reported \(\sigma = 4.0 \text{ mg/dl}\). Then:

\(n_A = n_B = 21\sigma^{2}/\delta^{2} = (21 \times 16) / 9 \approx 37.3\), which rounds up to 38 subjects per group.
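A short helper makes the arithmetic explicit (this is our own sketch of the shortcut formula above; sample sizes are rounded up because they must be whole numbers):

```python
import math

def sample_size_per_group(sigma: float, delta: float) -> int:
    """n per group for a two-sided, two-sample test with alpha = 0.05
    and 90% power, using the shortcut formula n = 21 * sigma^2 / delta^2."""
    return math.ceil(21 * sigma**2 / delta**2)

# Serum cholesterol example: sigma = 4.0 mg/dl, delta = 3.0 mg/dl.
n = sample_size_per_group(sigma=4.0, delta=3.0)  # 21*16/9 = 37.3 -> 38
```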

Thus, the investigator randomized 40 subjects to each of group A and group B to ensure 90% power for detecting an effect size with clinical relevance.

Many studies suffer from low statistical power (large Type II error) because the investigators do not perform sample size calculations.

If a study has very large sample sizes, then it may yield a statistically significant result without any clinical meaning. Suppose in the serum cholesterol example that \(\bar{x}_A = 7.3\) and \(\bar{x}_B = 7.1 \text{ mg/dl}\), with \(n_{A} = n_{B} = 5{,}000\). The two-sample *t* test may yield a *p*-value = 0.001, but \(\bar{x}_A - \bar{x}_B = 7.3 - 7.1 = 0.2 \text{ mg/dl}\) is not clinically interesting.

## Confidence Intervals

A confidence interval provides a plausible range of values for a population measure. Instead of just reporting \(\bar{x}_A - \bar{x}_B\) as the sample estimate of \(\mu_{A} - \mu_{B}\), a range of values can be reported using a confidence interval.

The confidence interval is constructed in a manner such that it provides a high percentage of “confidence” (95% is commonly used) that the true value of \(\mu_{A} - \mu_{B}\) lies within it.

If the data approximately follow a bell-shaped normal distribution, then a 95% confidence interval for \(\mu_{A} - \mu_{B}\) is

\((\bar{x}_A - \bar{x}_B) \pm \left \{1.96 \times (\text{standard error of } \bar{x}_A - \bar{x}_B)\right \}\)

In the serum cholesterol example, \((\bar{x}_A - \bar{x}_B) = 7.3 - 4.8 = 2.5 \text{ mg/dl}\) and the standard error is \(1.2 \text{ mg/dl}\). Thus, the approximate 95% confidence interval is:

\(2.5 \pm (1.96 \times 1.2) = \left [ 0.1, 4.9 \right ] \)
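The interval arithmetic can be sketched as follows (an illustrative helper of our own, using the example's difference of 2.5 mg/dl and standard error of 1.2 mg/dl):

```python
def confidence_interval(diff: float, se: float, z: float = 1.96):
    """Approximate 95% CI for a difference in means: diff +/- z * se."""
    return (diff - z * se, diff + z * se)

lo, hi = confidence_interval(2.5, 1.2)  # (0.148, 4.852), about [0.1, 4.9]
```

Because the lower endpoint is above zero, the interval excludes "no difference," matching the hypothesis-test result.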

Note that the 95% confidence interval *does not* contain 0, which is consistent with the results of the 0.05-level hypothesis test (*p*-value = 0.04). 'No difference' is not a plausible value for the difference between the treatments.

Notice also that the length of the confidence interval depends on the standard error. The standard error decreases as the sample size increases, so the confidence interval gets narrower as the sample size increases (hence, greater precision).
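To see the narrowing concretely, note that with equal group sizes the standard error of \(\bar{x}_A - \bar{x}_B\) is \(\sigma\sqrt{2/n}\), so the interval width shrinks like \(1/\sqrt{n}\). A quick sketch (assuming a common \(\sigma = 4\) mg/dl, as in the example):

```python
import math

def ci_width(sigma: float, n: int, z: float = 1.96) -> float:
    """Width of the approximate 95% CI for mu_A - mu_B with n subjects
    per group; the standard error of xbar_A - xbar_B is sigma*sqrt(2/n)."""
    return 2 * z * sigma * math.sqrt(2 / n)

w40 = ci_width(sigma=4.0, n=40)
w160 = ci_width(sigma=4.0, n=160)  # quadrupling n halves the width
```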

A confidence interval is actually more informative than a hypothesis test. Not only does it indicate whether \(H_0\) can be rejected, but it also provides a plausible range of values for the population measure. Many of the major medical journals request the inclusion of confidence intervals within submitted reports and published articles.