One reason for performing sample size calculations in the planning phase of a study is to assure confidence in the study results and conclusions. We certainly wish to propose a study that has a chance to be scientifically meaningful.
Are there other implications, beyond a lack of confidence in the results, to an inadequately powered study? Suppose you are reviewing grants for a funding agency. If too few subjects will be enrolled for the study to have a reasonable chance of finding a statistically significant difference, should the investigator receive funds from the granting agency? Of course not. The FDA, NIH, NCI, and most other funding agencies are concerned about sample size and power in the studies they support and do not consider funding studies that would waste limited resources.
Money is not the only limited resource. What about potential study subjects? Is it ethical to enroll subjects in a study with a small probability of producing clinically meaningful results, precluding their participation in a more adequately powered study? What about the horizon of patients not yet treated? Are there ethical implications to conducting a study in which the treatment and care actually help prolong life, yet, due to inadequate power, the results are unable to alter clinical practice?
Enrolling too many subjects is also problematic. If more subjects are recruited than needed, the study is prolonged. Wouldn't it be preferable to quickly disseminate the results if the treatment is worthwhile, rather than continuing a study beyond the point where a significant effect is clear? Or, if the treatment proves detrimental to some, how many subjects will it take for the investigator to conclude there is a clear safety issue?
Recognizing that careful consideration of statistical power and the sample size is critical to assuring scientifically meaningful results, protection of human subjects, and good stewardship of fiscal, tissue, physical, and staff resources, let's review how power and sample size are determined.
Type I and II Error
When we are planning the sample size to use for a new study, we want to balance the use of resources while minimizing the chance of making an incorrect conclusion. Suppose our study is comparing an outcome between groups.
When we simply do hypothesis testing, without a priori sample size calculations, we use alpha (\(\boldsymbol{\alpha}\)) = 0.05 as our typical cutoff for significance. In this situation, an \(\boldsymbol{\alpha}\) of 0.05 means that we are comfortable with a 5% chance of incorrectly rejecting the null hypothesis when it is in fact true. In other words, a 5% chance of concluding there is a significant difference between groups when there is not actually a difference. It makes sense that we want to minimize the chance of making this error. This type of error is referred to as a Type I error (\(\boldsymbol{\alpha}\)).
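To see what \(\boldsymbol{\alpha}\) = 0.05 means in practice, here is a minimal simulation sketch (the group size, seed, and choice of a two-sample t-test are illustrative assumptions, not part of the lesson): when both groups are drawn from the same distribution, a test at alpha = 0.05 should incorrectly reject the null about 5% of the time.

```python
# Simulate repeated studies where the null hypothesis is TRUE:
# both groups come from the same N(0, 1) distribution.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha = 0.05
n_per_group = 30          # arbitrary group size for illustration
n_sims = 10_000

rejections = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    _, p = ttest_ind(a, b)
    rejections += p < alpha

print(f"Type I error rate: {rejections / n_sims:.3f}")   # approximately 0.05
```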
Power comes in when we ALSO want to ensure that our sample size is large enough to detect a difference when one truly exists. We want our study to have high power (usually at least 80%) to correctly reject the null hypothesis when it is false. In other words, we want a high chance that if there truly is a difference between groups, our statistical test detects it. Failing to reject the null hypothesis when it is in fact false is a Type II error (\(\boldsymbol{\beta}\)). Power is defined as \(1 - \boldsymbol{\beta}\).
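A companion sketch, under the same illustrative assumptions: when the group means truly differ, the proportion of simulations that reject the null estimates the power of the test at that sample size. The 0.75-standard-deviation difference below is an arbitrary choice that happens to give power of roughly 0.8 with 30 subjects per group.

```python
# Simulate repeated studies where the null hypothesis is FALSE:
# group means differ by 0.75 SD, so rejections now estimate power (1 - beta).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha = 0.05
n_per_group = 30
n_sims = 10_000

rejections = 0
for _ in range(n_sims):
    a = rng.normal(0.00, 1, n_per_group)   # group 1: mean 0
    b = rng.normal(0.75, 1, n_per_group)   # group 2: mean 0.75
    _, p = ttest_ind(a, b)
    rejections += p < alpha

print(f"Estimated power: {rejections / n_sims:.3f}")   # roughly 0.8
```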
Possible outcomes for hypothesis testing

| Decision | Reality: Null is true | Reality: Null is NOT true |
| --- | --- | --- |
| Don't reject null | Correct decision | Type II error (\(\boldsymbol{\beta}\)): miss detecting a difference when there is one |
| Reject null | Type I error (\(\boldsymbol{\alpha}\)): conclude a difference when there isn't one | Correct decision |
Using the analogy of a trial, we want to make correct decisions: declare the guilty 'guilty' and the innocent 'innocent'. We do not wish to declare the innocent 'guilty' (analogous to a Type I error, \(\boldsymbol{\alpha}\)) or the guilty 'innocent' (analogous to a Type II error, \(\boldsymbol{\beta}\)).
Factors that affect the sample size needed for a study
- Alpha - the level of significance, the probability of a Type I error (\(\boldsymbol{\alpha}\)). This is usually 5% or 1%, meaning the investigator is willing to accept this level of risk of declaring the null hypothesis false when it is actually true (i.e., concluding there is a difference when there is not one).
- Beta - the probability of making a Type II error; power = \(1 - \boldsymbol{\beta}\). Beta is usually 20% or lower, resulting in a power of 80% or greater, meaning the investigator is willing to accept this level of risk of failing to reject the null hypothesis when in fact it should be rejected (i.e., missing a difference when one exists).
- Effect size - the deviation from the null (or difference between groups) that the investigator wishes to be able to detect. The effect size should be clinically meaningful and may be based on the results of prior or pilot studies. For example, a study might be powered to detect a relative risk of 2 or greater.
- Variability - may be expressed as a standard deviation, or another measure of variability appropriate for the statistic. The investigator needs an estimate of the variability in order to calculate the sample size; reasonable estimates may be obtained from historical data, pilot study data, or a literature search. (A formula combining all four factors is sketched after this list.)
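These four factors come together in the standard closed-form formula for comparing two means. A minimal sketch, using the normal approximation (the function name and the example numbers are illustrative assumptions, not from the lesson):

```python
# Normal-approximation sample size per group for a two-sided, two-sample
# comparison of means: n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
import math
from scipy.stats import norm

def sample_size_per_group(alpha: float, power: float, effect: float, sd: float) -> int:
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value tied to the Type I error rate
    z_beta = norm.ppf(power)            # critical value tied to power (1 - beta)
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / effect ** 2
    return math.ceil(n)                 # always round up to the next whole subject

# Example: alpha = 0.05, power = 0.80, detect a 5-unit difference, SD of 10
print(sample_size_per_group(alpha=0.05, power=0.80, effect=5, sd=10))   # 63 per group
```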
| Change in factor | Change in sample size |
| --- | --- |
| Alpha \(\downarrow\) | Sample size \(\uparrow\) |
| Beta \(\downarrow\) (power \(\uparrow\)) | Sample size \(\uparrow\) |
| Effect size \(\downarrow\) | Sample size \(\uparrow\) |
| Variability \(\downarrow\) | Sample size \(\downarrow\) |
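Each direction in the table can be checked numerically. A brief sketch using statsmodels (the choice of software and the baseline values are assumptions for illustration; note that statsmodels folds variability into the standardized effect size, so a smaller standard deviation shows up as a larger effect size and hence a smaller n):

```python
# Check the table's directions with a power solver for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()

# Baseline: standardized effect size 0.5, alpha 0.05, power 0.80
print(solver.solve_power(effect_size=0.5, alpha=0.05, power=0.80))   # ~ 64 per group
print(solver.solve_power(effect_size=0.5, alpha=0.01, power=0.80))   # alpha down  -> n up (~ 96)
print(solver.solve_power(effect_size=0.5, alpha=0.05, power=0.90))   # power up    -> n up (~ 86)
print(solver.solve_power(effect_size=0.25, alpha=0.05, power=0.80))  # effect down -> n up (~ 252)
```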
Sample Size Considerations
- There are nice closed-form formulas for many of the standard comparisons we are interested in. For other scenarios, formulas do not exist, and simulations must be used.
- When calculating necessary sample sizes, there are various options (formulas, tables, online calculators, proprietary software). While you are still getting comfortable with these calculations, it is worthwhile to use more than one method as a check, until you find the approach that works best for you. Different tools may make slightly different assumptions or use slightly different formulas, but the results should be similar.
- It is often good to make a table to see how sample size estimates change under different assumptions (see the sketch after this list):
- One-sided versus two-sided alpha (depends on the hypothesis)
- Alpha (the level of Type I error you’re comfortable with)
- Power (set by the level of Type II error you’re comfortable with)
- Preliminary estimates (depends on whether you have good preliminary data or are just hypothesizing the null values)
- Estimated differences (see how the sample size changes based on the difference you’re trying to detect)
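One way to build such a table, under illustrative assumptions (statsmodels as the solver, standardized effect sizes, and a fixed two-sided alpha of 0.05):

```python
# Per-group sample sizes for a two-sample t-test across a grid of
# detectable effect sizes and power levels, at two-sided alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
print(f"{'effect size':>12} {'80% power':>10} {'90% power':>10}")
for d in (0.2, 0.5, 0.8):   # small, medium, large standardized effects
    ns = [solver.solve_power(effect_size=d, alpha=0.05, power=p,
                             alternative='two-sided')
          for p in (0.80, 0.90)]
    print(f"{d:>12} {ns[0]:>10.0f} {ns[1]:>10.0f}")
```

Scanning the rows makes the trade-off concrete: because the required n is proportional to \(1/d^2\), halving the detectable effect size roughly quadruples the required sample size.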