One reason for performing sample size calculations in the planning phase of a study is to assure confidence in the study results and conclusions. We certainly wish to propose a study that has a chance to be scientifically meaningful.

Are there other implications, beyond a lack of confidence in the results, to an inadequately-powered study? Suppose you are reviewing grants for a funding agency. If insufficient numbers of subjects are to be enrolled for the study to have a reasonable chance of finding a statistically significant difference, should the investigator receive funds from the granting agency? Of course not. The FDA, NIH, NCI and most other funding agencies are concerned about sample size and power in the studies they support and do not consider funding studies that would waste limited resources.

Money is not the only limited resource. What about potential study subjects? Is it ethical to enroll subjects in a study with a small probability of producing clinically meaningful results, precluding their participation in a more adequately-powered study? What about the horizon of patients not yet treated? Are there ethical implications to conducting a study in which treatment and care actually help prolong life, yet due to inadequate power, the results are unable to alter clinical practice?

Enrolling too many subjects is also problematic. If more subjects are recruited than needed, the study is prolonged. Wouldn't it be preferable to quickly disseminate the results if the treatment is worthwhile instead of continuing a study beyond the point where a significant effect is clear? Or, if the treatment proves detrimental to some, how many subjects will it take for the investigator to conclude there is a clear safety issue?

Recognizing that careful consideration of statistical power and sample size is critical to assuring scientifically meaningful results, protection of human subjects and good stewardship of fiscal, tissue, physical and staff resources, let's review how power and sample size are determined.

#### One-Sided Hypothesis Testing

- Null hypothesis – \(H_0\colon \text{disease frequency}_1=\text{disease frequency}_2\)
- Alternative hypothesis – \(H_1\colon\text{disease frequency}_1 \gt \text{disease frequency}_2\)

Power is calculated with regard to a particular set of hypotheses. Often epidemiologic hypotheses compare an observed proportion or rate to a hypothesized value. The above hypotheses are *one-sided*, i.e., they test whether the disease frequency is significantly greater in group 1 than in group 2. An example of *two-sided* hypotheses would be a null hypothesis of equal proportions tested against the alternative that the proportions are unequal.
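The practical difference between one-sided and two-sided hypotheses shows up in the critical value of the test. As a minimal sketch, assuming a test statistic that is approximately standard normal (an assumption made here for illustration; the choice of \(\alpha = 0.05\) is also illustrative), the two critical values can be compared using Python's standard library:

```python
from statistics import NormalDist

alpha = 0.05               # illustrative significance level
std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# One-sided test: all of alpha is placed in a single tail.
z_one_sided = std_normal.inv_cdf(1 - alpha)

# Two-sided test: alpha is split evenly between the two tails.
z_two_sided = std_normal.inv_cdf(1 - alpha / 2)

print(f"one-sided critical value: {z_one_sided:.3f}")
print(f"two-sided critical value: {z_two_sided:.3f}")
```

For \(\alpha = 0.05\) these are roughly 1.645 and 1.960. The one-sided test has the lower threshold, so it has more power against alternatives in the hypothesized direction.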

#### Possible Outcomes for Tests of Hypotheses

When testing hypotheses, there are two types of error as shown in the table below:

| | Accept \(H_0\) | Reject \(H_0\) |
|---|---|---|
| \(H_0\) True | Correct Decision | Type I Error (alpha; \(\alpha\)) |
| \(H_0\) False | Type II Error (beta; \(\beta\)) | Correct Decision |

Using the analogy of a trial, we want to make correct decisions: declare the guilty, 'guilty' and the innocent, 'innocent'. We do not wish to declare the innocent 'guilty' or the guilty 'innocent'.

## Statistical Power

Power is the probability that the null hypothesis is rejected, if a specific alternative hypothesis is true. \(\beta\) represents Type II error, the probability of not rejecting the null hypothesis when the given alternative is true.

\(1-\beta\) = power

The power of a study should be at least 80%, and studies are often designed to have 90-95% power to detect a particular clinical effect.
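To make the definition concrete, here is a minimal sketch of a power calculation for the one-sided comparison of two disease frequencies introduced earlier. It assumes equal group sizes and a normal-approximation z-test, neither of which is prescribed by the text. The function name `power_two_proportions` and the input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a one-sided z-test of H0: p1 = p2 vs H1: p1 > p2.

    Assumes equal group sizes and a normal approximation (illustrative only).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    # Standard error of the difference under H0 (pooled proportion).
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    # Standard error of the difference under the specific alternative.
    se_alt = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    # Probability that the observed difference exceeds the rejection threshold.
    return 1 - z.cdf((z_alpha * se_null - (p1 - p2)) / se_alt)

# Hypothetical inputs: 30% vs 20% disease frequency, 200 subjects per group.
power = power_two_proportions(0.30, 0.20, n_per_group=200)
beta = 1 - power
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

With these hypothetical inputs the power falls a little short of 80%, so a study planned this way would need more subjects per group to meet the usual minimum.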

What factors affect power?

#### \(\alpha\),\(\beta\), effect size, variability, (baseline incidence), n

**\(\alpha\)** is the level of significance, the probability of a Type I error. This is usually 5% or 1%, meaning the investigator is willing to accept this level of risk of declaring the null hypothesis false when it is actually true.

The **effect size** is the deviation from the null that the investigator wishes to be able to detect. The effect size should be clinically meaningful. It may be based on the results of prior or pilot studies. For example, a study might be powered to be able to detect a relative risk of 2 or greater.

Sometimes a standardized effect size is given, i.e., the effect size divided by the standard deviation. This is a unitless value. If power is calculated in this manner, the standardized effect size is usually between 0.1 and 0.5, with 0.5 meaning \(H_1\) is 0.5 standard deviations away from \(H_0\).
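A standardized effect size is convenient because it lets sample size be computed without separately specifying the raw difference and the standard deviation. The sketch below assumes a two-sided z-test comparing the means of two equal-size groups. That design is chosen here for illustration and is not specified in the text, and `n_per_group` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80, two_sided=True):
    """Subjects per group needed to detect a standardized effect size d.

    Normal approximation for comparing two means with equal group sizes.
    """
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Smaller standardized effects require far more subjects.
for d in (0.1, 0.2, 0.5):
    print(f"d = {d}: {n_per_group(d)} per group")
```

Because the required sample size scales with \(1/d^2\), moving from a standardized effect of 0.5 to 0.1 multiplies the required sample size by about 25.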

**Variability** may be expressed in terms of a standard deviation, or an appropriate measure of variability for the statistic. If the hypotheses are concerned with a population proportion, the value of the proportion and the sample size are used to calculate the variability. The investigator will need an estimate of the variability in order to calculate power. Reasonable estimates may be obtained from historical data, pilot study data or a literature search.
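For a proportion, the variability follows directly from the proportion and the sample size. The standard error of a sample proportion is \(\sqrt{p(1-p)/n}\). A short sketch with hypothetical values:

```python
from math import sqrt

def se_proportion(p, n):
    """Standard error of a sample proportion p estimated from n subjects."""
    return sqrt(p * (1 - p) / n)

# Hypothetical proportions, each estimated from 100 subjects.
for p in (0.05, 0.20, 0.50):
    print(f"p = {p:.2f}: SE = {se_proportion(p, 100):.4f}")
```

The standard error is largest at \(p = 0.5\) and shrinks as \(n\) grows, which is why no separate variance estimate is needed for hypotheses about proportions.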

A study may have multiple sources of variation, each accounted for in the analysis. For example, a repeated measures design will need to account for both within-subject and between-subject variability.

The **baseline** incidence rate is related to the effect size. If it is hypothesized that a rate has increased or decreased, the baseline rate and the effect size must both be known to calculate the power for detecting such a change.
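The following sketch illustrates why the baseline rate matters. The same relative risk implies very different absolute differences, and therefore different sample sizes, depending on the baseline incidence. It assumes a two-sided normal-approximation test of two proportions with equal group sizes and no continuity correction. The function name and the input values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group_for_rr(baseline, rr, alpha=0.05, power=0.80):
    """Subjects per group to detect relative risk rr given a baseline incidence.

    Two-sided z-test of two proportions, equal group sizes (illustrative).
    """
    p0 = baseline
    p1 = rr * baseline  # incidence in the comparison group implied by rr
    if not 0 < p1 < 1:
        raise ValueError("baseline * rr must lie strictly between 0 and 1")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p0 + p1) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return ceil(numerator / (p1 - p0) ** 2)

# Same relative risk of 2, two hypothetical baseline incidences.
for baseline in (0.01, 0.10):
    print(f"baseline = {baseline:.2f}: {n_per_group_for_rr(baseline, 2)} per group")
```

A rare outcome requires many more subjects to detect the same relative risk, because the absolute difference between the groups is much smaller.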

With knowledge of the above factors, power of a statistical test can be calculated for a given sample size. Alternatively, the required sample size for a given power can be calculated.
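As an illustration of how the two calculations mirror each other, consider the simplest case: a one-sided z-test comparing the means of two groups of equal size \(n\), with a common known standard deviation \(\sigma\) and a true difference \(\Delta\). These assumptions are made here for illustration, and other designs lead to different formulas. Writing \(\Phi\) for the standard normal distribution function and \(z_q\) for its \(q\)th quantile, the power for a given \(n\) is

\[
1-\beta = \Phi\left(\frac{\Delta}{\sigma\sqrt{2/n}} - z_{1-\alpha}\right),
\]

and solving for \(n\) gives the sample size per group needed for a given power:

\[
n = \frac{2\sigma^2\left(z_{1-\alpha} + z_{1-\beta}\right)^2}{\Delta^2}.
\]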

Power is directly related to effect size, sample size and significance level. *An increase in the effect size, the sample size or the significance level will produce increased statistical power*, all other factors being equal. Power is inversely related to variability. *Decreasing variability* will increase the power of a study.
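These relationships can be checked numerically. The sketch below varies one factor at a time. It assumes a one-sided z-test comparing two group means with a known common standard deviation, and all numeric values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_means(delta, sd, n_per_group, alpha=0.05):
    """Power of a one-sided z-test comparing two means (equal group sizes)."""
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    return z.cdf(delta / se - z.inv_cdf(1 - alpha))

# Hypothetical baseline scenario; each variant changes a single factor.
base = dict(delta=5.0, sd=10.0, n_per_group=50, alpha=0.05)
scenarios = {
    "baseline": base,
    "larger effect size": {**base, "delta": 7.5},
    "larger sample size": {**base, "n_per_group": 100},
    "larger alpha": {**base, "alpha": 0.10},
    "smaller variability": {**base, "sd": 7.5},
}

for label, kwargs in scenarios.items():
    print(f"{label:20s} power = {power_two_means(**kwargs):.3f}")
```

Every variant yields higher power than the baseline scenario, in line with the relationships described above.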

If the power of a study is relatively high and a statistically significant effect is not observed, this suggests that the true effect, if any, is smaller than the effect size the study was designed to detect.

## Sample Size in Epidemiologic Studies

Epidemiologic studies can be population-based or non-population-based, such as case-control studies.

- Population-based studies (cohort or cross-sectional studies)
  - Differences in proportions (e.g., attributable risk)
  - Ratios (e.g., relative risks, relative rates, prevalence ratios)
- Case-control studies (e.g., calculating an odds ratio)
  - Unmatched study designs
  - Multiple controls/case
  - Matched study designs
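As an example for the unmatched case-control design with multiple controls per case, the sketch below uses one commonly used normal-approximation formula for the number of cases required. It assumes a two-sided test, takes the exposure prevalence among controls and the odds ratio to be detected as inputs, and applies no continuity correction. The function name and the input values are hypothetical, and dedicated software may give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist

def cases_needed(odds_ratio, p0, controls_per_case=1, alpha=0.05, power=0.80):
    """Approximate number of cases for an unmatched case-control study.

    p0 is the exposure prevalence among controls; the required number of
    controls is controls_per_case times the returned number of cases.
    """
    r = controls_per_case
    # Exposure prevalence among cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    # Weighted average exposure prevalence across cases and controls.
    p_bar = (p1 + r * p0) / (r + 1)
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n = (r + 1) / r * p_bar * (1 - p_bar) * z_sum ** 2 / (p1 - p0) ** 2
    return ceil(n)

# Hypothetical inputs: odds ratio of 2, 20% of controls exposed.
for r in (1, 2, 4):
    n_cases = cases_needed(2.0, 0.20, controls_per_case=r)
    print(f"{r} control(s) per case: {n_cases} cases, {r * n_cases} controls")
```

Adding controls per case reduces the number of cases required, which is useful when cases are scarce, although the gain diminishes as the ratio grows.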