Unit 3: Statistical Analysis
Introduction
While there are some very specific types of measures and analyses that are appropriate for certain epidemiologic study designs, there are also plenty of situations where standard statistical methods are appropriate for analyzing data from epidemiologic studies. The specific measures we have already covered include prevalence and incidence (including cumulative incidence as well as incidence rates), and specific analysis methods include using odds ratios for case-control studies.
Standard statistical methods can be used across many types of applied research, and an overview is provided here in Unit 3. Methods are broken down by the type of outcome being measured (continuous, categorical, time-to-event), and within each outcome type by techniques for descriptive, bivariable associations, and modeling. Depending on the outcome, and planned primary comparison, it is important to also know how to estimate the necessary sample size. Thus, Unit 3 also covers power/sample size and outlines scenarios where closed-form formulas exist.
Lesson 9 - Statistical Analysis Methods
Lesson 9 Objectives
- Use plots, tables, and summary statistics to describe variables and relationships between variables
- Identify which modeling strategy to use based on the type of data (continuous, categorical, time-to-event)
- Interpret results of statistical analyses
- Differentiate between odds ratios, risk ratios, and hazard ratios
Epidemiologic data can be analyzed using a variety of statistical methods. Here we outline the fundamentals according to the type of outcome measure. We can generally think of outcome data as one of three types: 1) continuous, 2) categorical, and 3) time-to-event. While it is true that time-to-event is continuous, we do not always observe the true time for each person, so special consideration needs to be given in that scenario. Once the type of outcome data is known, there are standard techniques one can use to provide descriptive statistics, look at bivariable associations, and use modeling to describe the association between multiple covariates and the outcome.
Example: a subset of data from the Framingham Heart Study
Our motivating example will be based on the SAS-provided dataset “Heart” which includes a small subset of data from participants in the Framingham Heart Study. This dataset contains over 5000 patients from the cohort study and provides data on their baseline age, sex, weight, smoking status, cholesterol, blood pressure, and coronary heart disease (CHD) development. Patients were contacted every 2 years for over 30 years.
9.1 - Continuous outcome
From our example, we may be interested in the relationship of age with cholesterol, and want to consider a possible confounder (or effect modifier) of sex.
- The outcome is cholesterol and is a continuous value.
- The predictors/covariates to be considered are age and sex. Age can be either continuous, or put into categories, and sex is a categorical variable.
Descriptive
For the continuous outcome of cholesterol, first, we can look at the distribution of the data via a histogram and by calculating descriptive statistics:
Analysis Variable: Cholesterol
N | Mean | Std Dev | Lower 95% CL for Mean | Upper 95% CL for Mean | Minimum | 25th Pctl | Median | 75th Pctl | Maximum |
---|---|---|---|---|---|---|---|---|---|
5057 | 227.42 | 44.94 | 226.18 | 228.66 | 96.00 | 196.00 | 223.00 | 255.00 | 568.00 |
Here, we see that cholesterol appears normally distributed, with a mean of 227.4 and confidence interval around the mean of (226.18, 228.66). This CI is very narrow due to the large sample size.
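The confidence interval around the mean can be checked from the summary statistics alone. The sketch below is a minimal illustration that assumes a normal-approximation interval (mean plus or minus z times the standard error); the software output may use a t quantile instead, which is nearly identical at this sample size.

```python
import math
from statistics import NormalDist

# Summary statistics taken from the descriptive table for cholesterol
n, mean, sd = 5057, 227.42, 44.94

se = sd / math.sqrt(n)             # standard error of the mean
z = NormalDist().inv_cdf(0.975)    # about 1.96 for a 95% interval (normal approximation)
lower, upper = mean - z * se, mean + z * se

print(f"SE = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the standard error shrinks with the square root of the sample size, a sample of over 5000 gives a very narrow interval.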
Since we have a continuous outcome, we will likely plan to use linear regression. We can do a test for normality, but with such a large sample size, even if there appears to be a deviation from normality, it is still reasonable to use linear regression. With smaller datasets, or highly skewed data, a transformation may be necessary. The Kolmogorov-Smirnov test for normality for cholesterol does result in a significant p-value (p<0.01), but since we have such a large sample size, we will still proceed with linear regression.
Bivariable Associations
We hypothesize that age is related to cholesterol, with cholesterol increasing with increasing age. Since age is continuous, we can use it as a continuous predictor, and we may want to categorize it to help with visualization or interpretability.
Treating age as continuous would lead us to look at a scatter plot between the two continuous variables, as well as estimate a correlation coefficient as a measure of association.
We see that cholesterol does appear to increase as age increases, and the best fit line has a positive slope. The correlation coefficient between the two variables is 0.27. A correlation coefficient ranges from -1 to 1. The closer to 1 (or -1) the correlation coefficient is, the stronger the correlation. A correlation coefficient of 1 (or -1) would indicate perfect correlation, as demonstrated by all points falling along a single line. Values close to 0 indicate no linear relationship, and the graph would just appear to be a random cloud of points. The sign of the correlation coefficient indicates whether the correlation is positive or negative. Positive correlation means that as one variable increases, so does the other, and negative means that as one variable increases, the other decreases.
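To make the range of the correlation coefficient concrete, here is a small sketch that computes the Pearson correlation by hand. The data values are hypothetical and chosen only for illustration; they are not taken from the Framingham dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only
age = [30, 35, 40, 45, 50, 55]
perfect = [2 * a + 100 for a in age]     # all points on a line with positive slope
inverse = [400 - 3 * a for a in age]     # all points on a line with negative slope
noisy = [205, 190, 240, 215, 260, 230]   # upward trend with scatter

r_perfect = pearson_r(age, perfect)
r_inverse = pearson_r(age, inverse)
r_noisy = pearson_r(age, noisy)
print(r_perfect, r_inverse, r_noisy)
```

The first two series fall exactly on a line, so their coefficients sit at the two ends of the range, while the scattered series lands strictly between 0 and 1.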
We could also group age into categories and look at the relationship. Here, we would calculate means per group, and could visualize the relationship with boxplots.
agegrp | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
<40 | 1877 | 36.03% | 1877 | 36.03% |
[40-50] | 1741 | 33.42% | 3618 | 69.46% |
>=50 | 1591 | 30.54% | 5209 | 100.00% |
Analysis Variable: Cholesterol
agegrp | N | Mean | Std Dev | Lower 95% CL for Mean | Upper 95% CL for Mean | Minimum | 25th Pctl | Median | 75th Pctl | Maximum |
---|---|---|---|---|---|---|---|---|---|---|
<40 | 1819 | 213.18 | 41.97 | 211.25 | 215.11 | 115.00 | 183.00 | 209.00 | 235.00 | 534.00 |
[40-50] | 1690 | 229.63 | 42.92 | 227.59 | 231.68 | 117.00 | 200.00 | 226.00 | 253.00 | 568.00 |
>=50 | 1548 | 241.73 | 45.49 | 239.46 | 244.00 | 96.00 | 210.00 | 238.00 | 270.00 | 425.00 |
We see that about a third of patients are in each age group (<40, 40 - 50, and 50 and older), and that for each increasing age group, the mean cholesterol is higher. For the boxplot, the box indicates the 25th, 50th (median), and 75th percentiles as the bottom, middle, and top of the box, respectively. The marker inside the box shows the mean, which is often close to the median for large sample sizes with normally distributed data. The whiskers extend out relative to the interquartile range, and data points that fall out of that limit are shown with dots.
Since we are also interested in sex, we should summarize that variable as well. Females have higher cholesterol on average than males, but only by about 2.5 points:
Analysis Variable: Cholesterol
sex | N | Mean | Std Dev | Lower 95% CL for Mean | Upper 95% CL for Mean | Minimum | 25th Pctl | Median | 75th Pctl | Maximum |
---|---|---|---|---|---|---|---|---|---|---|
Female | 2774 | 228.54 | 46.92 | 226.79 | 230.29 | 117.00 | 196.00 | 224.00 | 257.00 | 493.00 |
Male | 2283 | 226.05 | 42.37 | 224.31 | 227.79 | 96.00 | 198.00 | 223.00 | 250.00 | 568.00 |
Modeling (Multivariable Associations)
In order to look at the relationship of multiple variables with our outcome, we need to move to modeling. With a continuous outcome, we can use linear regression.
First we want to see if the differences in cholesterol by age group are significant. Our model can then be fit with just age group as a covariate and we see:
Analysis of Maximum Likelihood Parameter Estimates
Parameter | Level | DF | Estimate | Standard Error | Wald 95% Lower Confidence Limit | Wald 95% Upper Confidence Limit | Wald Chi-Square | Pr > ChiSq |
---|---|---|---|---|---|---|---|---|
Intercept | | 1 | 213.1781 | 1.0170 | 211.1848 | 215.1715 | 43934.9 | <.0001 |
agegrp | >=50 | 1 | 28.5525 | 1.4999 | 25.6127 | 31.4923 | 362.36 | <.0001 |
agegrp | [40-50] | 1 | 16.4550 | 1.4655 | 13.5827 | 19.3273 | 126.07 | <.0001 |
agegrp | <40 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | | |
The estimate for the difference in cholesterol between the oldest and youngest age group is 28.6 (which we can confirm from our earlier descriptive table), the CI for this estimate is (25.6 - 31.5), and the p-value is <0.0001, all clearly providing evidence that there is a significant difference in cholesterol between the oldest and youngest age groups. A similar conclusion is seen with significantly higher cholesterol in the middle age group compared to the younger - on average about 16.5 points higher.
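With a single categorical covariate, the regression estimates are simply differences between each group mean and the reference group mean. The sketch below illustrates this using the rounded group means from the descriptive table; it is a check on the arithmetic, not a model fit.

```python
# Mean cholesterol by age group, from the descriptive table
group_mean = {"<40": 213.18, "[40-50]": 229.63, ">=50": 241.73}
reference = "<40"  # reference level in the model output (estimate fixed at 0)

intercept = group_mean[reference]  # the model intercept is the reference group mean
coefficients = {
    level: round(m - intercept, 2)
    for level, m in group_mean.items()
    if level != reference
}

print("Intercept:", intercept)
print("Differences vs. reference:", coefficients)
```

The differences agree with the model estimates of 16.4550 and 28.5525 up to rounding of the group means.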
Next, we may want to see if this relationship still holds after controlling for sex. The model including both covariates in the model shows this:
Analysis of Maximum Likelihood Parameter Estimates
Parameter | Level | DF | Estimate | Standard Error | Wald 95% Lower Confidence Limit | Wald 95% Upper Confidence Limit | Wald Chi-Square | Pr > ChiSq |
---|---|---|---|---|---|---|---|---|
Intercept | | 1 | 211.7290 | 1.2206 | 209.3367 | 214.1213 | 30090.1 | <.0001 |
agegrp | >=50 | 1 | 28.5721 | 1.4993 | 25.6336 | 31.5107 | 363.18 | <.0001 |
agegrp | [40-50] | 1 | 16.4595 | 1.4648 | 13.5885 | 19.3305 | 126.26 | <.0001 |
agegrp | <40 | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | | |
Sex | Female | 1 | 2.6280 | 1.2252 | 0.2266 | 5.0293 | 4.60 | 0.0320 |
Sex | Male | 0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | | |
The estimates for differences by age group are essentially unchanged: about 28.6 points higher for the oldest vs. youngest age group, and about 16.5 points higher for the middle vs. youngest group, even after controlling for sex. Thus, it does not appear that sex is a confounder. This model is also consistent with the simple descriptives of cholesterol by sex, which showed that on average females have slightly higher cholesterol (about 2.5 points).
Finally, we may want to investigate whether sex is an effect modifier, and thus we also include the interaction term agegrp*sex. The p-value for the interaction is significant, and the model gives these estimated means per group:
agegrp | Female | Male |
---|---|---|
<40 | 206.1 | 221.9 |
[40-50] | 230.2 | 228.9 |
>=50 | 253.4 | 227.8 |
We can see that as age group increases, so does cholesterol, but much more dramatically in females. Thus sex is an effect modifier of the relationship between age group and cholesterol. Males have an average cholesterol around 220-230, and this does not seem to change much with age. Females, on the other hand, have a greater change in cholesterol with increasing age. We can see this better by graphing the means by group: the line connecting the means for males is mainly flat, but the line connecting the means for females has a clear positive slope.
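The effect modification can be read directly off the table of estimated means. The following sketch simply computes the contrasts within the table (sex difference within each age group, and age difference within each sex); it uses only the rounded means shown above.

```python
# Model-estimated mean cholesterol by age group and sex, from the table of means
means = {
    "<40":     {"female": 206.1, "male": 221.9},
    "[40-50]": {"female": 230.2, "male": 228.9},
    ">=50":    {"female": 253.4, "male": 227.8},
}

# Female minus male difference within each age group
sex_gap = {grp: round(m["female"] - m["male"], 1) for grp, m in means.items()}

# Oldest minus youngest difference within each sex
age_change = {
    sex: round(means[">=50"][sex] - means["<40"][sex], 1)
    for sex in ("female", "male")
}

print("Female - male by age group:", sex_gap)
print("Oldest - youngest by sex:", age_change)
```

If there were no interaction, the female minus male difference would be roughly constant across age groups; here it changes sign and grows, and the age-related change is far larger for females than for males.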
9.2 - Categorical outcome
From our example, we may be interested in the relationship of BMI with high blood pressure, and want to consider a possible confounder (or effect modifier) of sex.
The outcome is high blood pressure and is a dichotomous value (either present or not).
The predictors/covariates to be considered are BMI and sex. BMI can be either continuous or put into categories, and sex is a categorical variable.
Descriptive
First, we want to report the percentage of patients who have high blood pressure with a 95% confidence interval. We find that 43.5% of patients report high blood pressure. The exact CI for this estimate is (42.2% - 44.8%), again very narrow due to the large sample size.
high_BP | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|
low/normal | 2942 | 56.48% | 2942 | 56.48% |
high | 2267 | 43.52% | 5209 | 100.00% |
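As a rough check on the reported interval, the sketch below computes a normal-approximation (Wald) confidence interval for the proportion from the frequency table. This is an assumption made for simplicity: the interval quoted in the text is an exact interval, so the two should be close but need not match in the last digit.

```python
import math
from statistics import NormalDist

# Counts from the high_BP frequency table
high, total = 2267, 5209

p = high / total
se = math.sqrt(p * (1 - p) / total)   # standard error of a proportion
z = NormalDist().inv_cdf(0.975)       # about 1.96 for a 95% interval
lower, upper = p - z * se, p + z * se

print(f"p = {p:.4f}, approximate 95% CI = ({lower:.4f}, {upper:.4f})")
```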
Bivariable Associations
Next we want to look at the relationship with BMI, and can consider BMI as both a continuous and a categorical variable.
Table of BMIgrp by high_BP (counts with row percentages)
BMIgrp | high_BP: low/normal | high_BP: high | Total |
---|---|---|---|
[18.5-25] normal | 1727 (70.32%) | 729 (29.68%) | 2456 |
[25-30] overwght | 953 (48.38%) | 1017 (51.62%) | 1970 |
>=30 obese | 188 (27.09%) | 506 (72.91%) | 694 |
Total | 2868 | 2252 | 5120 |
Frequency Missing = 89
We see that as BMI level increases, so does the rate of high BP (30%, 52%, and 73% for increasing levels of BMI). We can use a chi-squared test here to test the association between the two variables. It is highly significant, and not surprisingly so, due to the large sample size.
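The chi-squared statistic for this table can be computed by hand from the observed counts. The sketch below is a minimal illustration using the standard Pearson formula (sum of (observed - expected)^2 / expected); it takes advantage of the fact that with exactly 2 degrees of freedom the upper-tail probability has the simple form exp(-x/2).

```python
import math

# Observed counts from the BMI group by high BP table: [low/normal, high]
observed = {
    "normal":     [1727, 729],
    "overweight": [953, 1017],
    "obese":      [188, 506],
}

row_totals = {k: sum(v) for k, v in observed.items()}
col_totals = [sum(v[j] for v in observed.values()) for j in range(2)]
n = sum(row_totals.values())

chi2 = 0.0
for group, counts in observed.items():
    for j, obs in enumerate(counts):
        expected = row_totals[group] * col_totals[j] / n  # count expected under independence
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)  # (3 - 1) * (2 - 1) = 2
p_value = math.exp(-chi2 / 2)  # chi-squared upper tail, valid only for df = 2

print(f"chi-squared = {chi2:.1f}, df = {df}, p = {p_value:.3g}")
```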
Considerations specifically related to Non-matched Case-Control Studies:
- Chi-squared tests can be used for the bivariable association of exposure and outcome. If any expected cell counts are less than 5, Fisher’s Exact test should be used instead.
- If we want to evaluate potential effect modifiers using these types of bivariable association tables, we can use the Mantel-Haenszel statistic, which essentially breaks the exposure * outcome table up by potential effect modifier to evaluate if there are different effects for different strata.
We can also look at a boxplot or histogram for the continuous version of BMI and see that on average, patients with high BP tend to have higher BMI compared to those without high BP.
Analysis Variable: bmi
high_BP | N | Mean | Std Dev | Lower 95% CL for Mean | Upper 95% CL for Mean | Minimum | 25th Pctl | Median | 75th Pctl | Maximum |
---|---|---|---|---|---|---|---|---|---|---|
low/normal | 2936 | 24.37 | 3.61 | 24.24 | 24.50 | 14.12 | 21.87 | 24.01 | 26.49 | 51.96 |
high | 2263 | 27.15 | 4.48 | 26.97 | 27.34 | 15.77 | 24.05 | 26.73 | 29.52 | 56.68 |
We can use a two group t-test to compare the means by group, but it is often more streamlined to consider the modeling technique you plan to use, and use that for both bivariable and multivariable associations. The model with just a single covariate in the model will provide an unadjusted result, and the model with multiple covariates will provide an adjusted result.
Modeling (Multivariable Associations)
For a dichotomous outcome we may want to estimate odds ratios or risk ratios, and thus will use logistic or log-binomial regression, respectively.
Logistic regression to estimate Odds Ratio:
Using the table which shows the raw counts in each BMI group who have high BP, we can calculate the OR of high BP for the overweight vs normal BMI group as (1017*1727)/(729*953) = 2.53, and similarly for the obese vs normal groups as (506*1727)/(729*188) = 6.38.
From the logistic regression model, we get these same estimates, along with 95% CIs:
Label | Estimate | Standard Error | Lower Confidence Limit | Upper Confidence Limit |
---|---|---|---|---|
(OR overwght vs. normal) | 2.5281 | 0.1596 | 2.2339 | 2.8610 |
(OR obese vs. normal) | 6.3761 | 0.6131 | 5.2809 | 7.6985 |
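For a single categorical exposure, the unadjusted odds ratios and their confidence limits can be reproduced from the 2x2 counts. The sketch below assumes the usual Wald interval on the log scale, with standard error sqrt(1/a + 1/b + 1/c + 1/d); the function name `odds_ratio` is just an illustrative helper.

```python
import math
from statistics import NormalDist

def odds_ratio(exp_cases, exp_noncases, ref_cases, ref_noncases, level=0.95):
    """Odds ratio with a Wald confidence interval computed on the log scale."""
    or_hat = (exp_cases * ref_noncases) / (ref_cases * exp_noncases)
    se_log = math.sqrt(1 / exp_cases + 1 / exp_noncases + 1 / ref_cases + 1 / ref_noncases)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return or_hat, or_hat * math.exp(-z * se_log), or_hat * math.exp(z * se_log)

# (high BP, low/normal BP) counts from the BMI group by high BP table
normal, overweight, obese = (729, 1727), (1017, 953), (506, 188)

or_over = odds_ratio(*overweight, *normal)
or_obese = odds_ratio(*obese, *normal)

print("overweight vs normal: OR = %.4f (%.4f, %.4f)" % or_over)
print("obese vs normal:      OR = %.4f (%.4f, %.4f)" % or_obese)
```

These values agree with the logistic regression output because a model with one categorical covariate is saturated for this table.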
Considerations specifically related to Case-Control Studies:
Remember that for non-matched case-control studies, OR must be calculated since the distribution of exposure is not necessarily representative of the population. The sampling fractions cancel out in the OR calculation, but not in the RR.
These logistic regression models can be considered unconditional, which is appropriate for non-matched case-control studies, but not for matched case-control studies. For matched case-control studies, conditional logistic regression should be used, and the OR is estimated from the discordant pairs (concordant pairs do not contribute information).
Log-binomial regression to estimate Risk Ratio:
Using the table which shows the raw counts in each BMI group who have high BP, we can calculate the RR of high BP for the overweight vs normal BMI group as (1017/1970)/(729/2456) = 1.74, and similarly for the obese vs normal groups as (506/694)/(729/2456) = 2.46.
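The crude risk ratios can be computed directly from the counts in the BMI group by high BP table. This sketch only divides the risk in each exposed group by the risk in the normal BMI group; the helper names are illustrative.

```python
# (high BP count, group total) from the BMI group by high BP table
normal = (729, 2456)
overweight = (1017, 1970)
obese = (506, 694)

def risk(cases, total):
    """Proportion of the group with the outcome."""
    return cases / total

def risk_ratio(exposed, reference):
    """Risk in the exposed group divided by risk in the reference group."""
    return risk(*exposed) / risk(*reference)

rr_over = risk_ratio(overweight, normal)
rr_obese = risk_ratio(obese, normal)

print(f"overweight vs normal: RR = {rr_over:.2f}")
print(f"obese vs normal:      RR = {rr_obese:.2f}")
```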
From the log-binomial regression model, we get estimates of the unadjusted RR along with 95% CIs (the unadjusted RR can also be calculated directly from the table in the previous section). Note that if the log-binomial model does not converge, modified Poisson regression can be used.
Label | Estimate | Standard Error | Lower Confidence Limit | Upper Confidence Limit |
---|---|---|---|---|
(RR overwght vs. normal) | 1.4536 | 0.0267 | 1.3794 | 1.5317 |
(RR obese vs. normal) | 2.5958 | 0.0636 | 2.2914 | 2.9406 |
As stated earlier, we often want the RR, so we’ll proceed with those estimates. Notice that the RRs are less extreme than the ORs, which is often the case. If readers do not know the distinction between OR and RR and interpret the OR as if it were an RR, they will overestimate the difference in risk between groups.
If we want adjusted RRs, we can simply add the other covariates to the model. In this case we want to see if sex is a possible confounder or effect modifier. Adding sex to the model does not meaningfully change the RRs for BMI (the estimates are essentially the same), thus sex is not a confounder.
- RR overwght v normal = 1.44
- RR for obese v normal = 2.59
The model also shows that sex is a significant predictor of high blood pressure. (In the unadjusted setting, the rate of high BP is 46% in males compared to 41% in females, which is not a huge clinical difference.) The adjusted RR for females vs. males is 1.06 (95% CI: 1.01 - 1.11), suggesting a small increased risk of high BP for females compared to males.
To evaluate sex as a potential effect modifier, we can include an interaction term in the model. Doing so shows no statistical evidence of an interaction, thus we can assume the relationship between BMI and high blood pressure is similar for both males and females. If the interaction had been significant, the next step would be to provide stratified analyses, where we estimate the RRs for BMI with high blood pressure separately for females and males.
9.3 - Time-to-event outcome
Descriptive
Examples of time-to-event data are:
- Time to death
- Time to development of a disease
- Time to first hospitalization
- And many others
One may think that time-to-event data is simply continuous, but since we do not observe the true time for each person in the dataset, this is not the case. The people who do not experience the event still contribute valuable information, and we refer to these patients as “censored”. We use the time they contribute until they are censored, which is the time they stop being followed because the study has ended, they are lost to follow-up, or they withdraw from the study.
For our example, we are interested in the time to development of coronary heart disease (CHD). No patients had CHD upon study entry, and patients were surveyed every 2 years to see if they had developed CHD. Each patient’s “time-to-CHD” will fall into one of these categories:
- They develop CHD within the 30-year study period
  Time = years until they develop CHD
  Status = event
- They do not develop CHD within the 30-year study period, and they stay in the study until the end
  Time = 30 years
  Status = censored
- They do not develop CHD within the 30-year study period, and they leave the study before the 30-year study period is finished (due to death, moving, lost contact, voluntarily withdraw, etc.)
  Time = time on study
  Status = censored
The best way to describe time-to-event data is by the Kaplan-Meier method. This uses information from all patients, and differentiates between patients who did and did not experience the event. A Kaplan-Meier (KM) plot is how we visualize time-to-event data and starts with all patients being event-free at time 0. The KM method uses the number of patients still at risk over time, and patients leave the risk set once they experience the event or are censored. A Kaplan-Meier plot and a cumulative incidence plot are complements of each other (cumulative incidence = 1 - KM estimate), so you can choose which best fits your data. Often for “Overall Survival” we use KM plots, which start at 100%, and decrease over time as patients die. This can really be considered as plotting the estimated percentage of patients still alive. For our example, it makes more sense to look at a cumulative incidence plot, which starts at 0% and shows how the incidence of CHD increases over time. (A KM plot would plot the percent of people who are CHD-free, and this would decrease over time.)
This plot shows that over time CHD is increasing, and we can get estimates of rates of CHD at different time points using the KM estimate.
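The Kaplan-Meier calculation itself is simple enough to sketch by hand. The example below is a minimal illustration on a tiny hypothetical dataset (not the Framingham data): at each distinct event time, the running survival estimate is multiplied by one minus the fraction of at-risk patients who had the event, and censored patients simply leave the risk set without changing the estimate.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times  : follow-up time for each person
    events : 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)  # still followed just before t
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)  # events at t
        survival *= 1 - d / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical follow-up data in years (illustration only)
times = [2, 4, 4, 6, 8]
events = [1, 1, 0, 1, 0]

km = kaplan_meier(times, events)
cumulative_incidence = [(t, 1 - s) for t, s in km]  # complement of the KM estimate

print(km)
print(cumulative_incidence)
```

Note that the person censored at year 4 still counts in the risk set at year 4, but no longer counts at year 6.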
Bivariable
When comparing time-to-event data between groups, we can use the KM method again, as well as perform a log-rank test. For our example, suppose we want to compare time to CHD by BP status.
This plot shows that those with high BP at study entry (blue line) have higher rates of CHD than those with low or normal BP (red line). The KM estimates of CHD at 10 years are 12.7% for the high BP group and 4.7% for the low/normal group. At 20 years, these estimates are 26.1% and 12.0%. The log-rank test is essentially a comparison of lines, not specifically comparing estimates at any single point, and is highly significant here (p<0.0001).
Modeling (Multivariable Associations)
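We can use Cox Proportional Hazards modeling to estimate the hazard ratio. This model uses the hazard function, which describes the instantaneous risk of the event: loosely, the probability that a person who has survived event-free to time t will experience the event in the next instant. The hazard ratio compares this hazard between groups.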
Just from eyeballing the previous plot, it appears that the risk of CHD is about twice as high for those with high BP compared to those with low/normal. Actually fitting a Cox model with high BP as a single covariate shows that the estimated hazard ratio is 1.87 (95% CI: 1.69 - 2.08), which fits with what we see in the plot.
The Cox models can also include multiple covariates to test for confounding and interaction terms to evaluate effect modification, similar to those in previous sections. With additional terms in the model, we can estimate adjusted hazard ratios.
9.4 - Summary
When planning analyses for a study it is important to be clear about what type of data you’ll have. Once you know if the outcome measure is continuous, categorical, or time-to-event, you can choose the appropriate methods. Understanding your data is very important, so do not skip the step of looking at descriptive statistics first, including looking at distributions and graphs whenever possible. Next, you can start to look at associations between variables (bivariable) to get a sense of how variables relate to one another. This step can and should also use graphs and tables to visualize data whenever helpful. Once these relationships are understood, modeling techniques can be used. Models allow for both unadjusted and adjusted estimates to be calculated and can include more than one covariate. Modeling can be used to evaluate potential confounding along with effect modification.
Lesson 10 - Power and Sample Size Considerations
Lesson 10 Objectives
- Describe the rationale for sample size calculations
- Describe the relationships between sample size, power, variability, effect size, and significance level
- Distinguish between type I & II error
- Calculate the sample size needed for the following scenarios given the necessary preliminary data:
  - Single proportion
  - Comparison of two proportions
  - Unmatched case-control study
  - Matched case-control study
  - Single mean
  - Comparison of two means
10.1 - Rationale and Type I & II Error
One reason for performing sample size calculations in the planning phase of a study is to assure confidence in the study results and conclusions. We certainly wish to propose a study that has a chance to be scientifically meaningful.
Are there other implications, beyond a lack of confidence in the results, to an inadequately-powered study? Suppose you are reviewing grants for a funding agency. If insufficient numbers of subjects are to be enrolled for the study to have a reasonable chance of finding a statistically significant difference, should the investigator receive funds from the granting agency? Of course not. The FDA, NIH, NCI, and most other funding agencies are concerned about sample size and power in the studies they support and do not consider funding studies that would waste limited resources.
Money is not the only limited resource. What about potential study subjects? Is it ethical to enroll subjects in a study with a small probability of producing clinically meaningful results, precluding their participation in a more adequately-powered study? What about the horizon of patients not yet treated? Are there ethical implications to conducting a study in which treatment and care actually help prolong life, yet due to inadequate power, the results are unable to alter clinical practice?
Too many subjects are also problematic. If more subjects are recruited than needed, the study is prolonged. Wouldn't it be preferable to quickly disseminate the results if the treatment is worthwhile instead of continuing a study beyond the point where a significant effect is clear? Or, if the treatment proves detrimental to some, how many subjects will it take for the investigator to conclude there is a clear safety issue?
Recognizing that careful consideration of statistical power and the sample size is critical to assuring scientifically meaningful results, protection of human subjects, and good stewardship of fiscal, tissue, physical, and staff resources, let's review how power and sample size are determined.
Type I and II Error
When we are planning the sample size to use for a new study, we want to balance the use of resources while minimizing the chance of making an incorrect conclusion. Suppose our study is comparing an outcome between groups.
When we simply do hypothesis testing, without a priori sample size calculations, we use alpha (\(\boldsymbol{\alpha}\)) = 0.05 as our typical cutoff for significance. In this situation, the \(\boldsymbol{\alpha}\) of 0.05 means that we are comfortable with a 5% chance that we incorrectly rejected the null hypothesis when it was in fact true. In other words, a 5% chance that we concluded a significant difference between groups when there was not actually a difference. It makes sense that we really want to minimize the chance of making this error. This type of error is referred to as a Type I error (\(\boldsymbol{\alpha}\)).
Power comes in when we ALSO want to ensure that our sample size is large enough to detect a difference when one truly exists. We want our study to have large power (usually at least 80%) to correctly reject the null hypothesis when it is false. In other words, we want a high chance that if there truly is a difference between groups, we detect it with our statistical test. The type of error that comes in when we fail to reject the null hypothesis when it is in fact false is type II error (\(\boldsymbol{\beta}\)). Power is defined as \(1 - \boldsymbol{\beta}\).
Possible outcomes for hypothesis testing
Decision | Reality: Null is true | Reality: Null is NOT true |
---|---|---|
Don't reject null | Correct decision | Type II error (\(\boldsymbol{\beta}\)) - Miss detecting a difference when there is one |
Reject null | Type I error (\(\boldsymbol{\alpha}\)) - Conclude difference when there isn't one | Correct decision |
Using the analogy of a trial, we want to make correct decisions: declare the guilty, 'guilty' and the innocent, 'innocent'. We do not wish to declare the innocent 'guilty' (\(\sim \boldsymbol{\alpha}\)) or the guilty 'innocent' (\(\sim \boldsymbol{\beta}\)).
Factors that affect the sample size needed for a study
- Alpha - is the level of significance, the probability of a Type I error (\(\boldsymbol{\alpha}\)). This is usually 5% or 1%, meaning the investigator is willing to accept this level of risk of declaring the null hypothesis false when it is actually true (i.e. concluding a difference when there is not one).
- Beta - is the probability of making a Type II error, and Power = (\(1 - \boldsymbol{\beta}\)). Beta is usually 20% or lower, resulting in a power of 80% or greater, meaning the investigator is willing to accept this level of risk of not rejecting the null hypothesis when in fact it should be rejected (i.e. missing a difference when one exists).
- Effect size - is the deviation from the null (or difference between groups) that the investigator wishes to be able to detect. The effect size should be clinically meaningful. It may be based on the results of prior or pilot studies. For example, a study might be powered to be able to detect a relative risk of 2 or greater.
- Variability - may be expressed in terms of a standard deviation, or an appropriate measure of variability for the statistic. The investigator will need an estimate of the variability in order to calculate the sample size. Reasonable estimates may be obtained from historical data, pilot study data, or a literature search.
Change in factor | Change in sample size |
---|---|
Alpha \(\downarrow\) | Sample size \(\uparrow\) |
Beta \(\downarrow\) (power \(\uparrow\)) | Sample size \(\uparrow\) |
Effect size \(\downarrow\) | Sample size \(\uparrow\) |
Variability \(\downarrow\) | Sample size \(\downarrow\) |
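The directions in the table above can be illustrated numerically. The sketch below assumes a standard closed-form formula for comparing two means with equal group sizes and a two-sided test, n per group = 2(z_{1-α/2} + z_{1-β})²σ²/δ²; this formula and the input values are assumptions chosen for illustration and are not taken from this section.

```python
import math
from statistics import NormalDist

def n_per_group(alpha, power, effect, sd):
    """Per-group sample size for a two-sided comparison of two means (equal groups)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 * sd ** 2 / effect ** 2)

# Hypothetical baseline scenario: detect a 10-point difference when the SD is 40
base = n_per_group(alpha=0.05, power=0.80, effect=10, sd=40)

scenarios = {
    "baseline":            base,
    "alpha 0.05 -> 0.01":  n_per_group(0.01, 0.80, 10, 40),
    "power 0.80 -> 0.90":  n_per_group(0.05, 0.90, 10, 40),
    "effect 10 -> 5":      n_per_group(0.05, 0.80, 5, 40),
    "SD 40 -> 20":         n_per_group(0.05, 0.80, 10, 20),
}
for label, n in scenarios.items():
    print(f"{label:20s} n per group = {n}")
```

Only the last scenario (lower variability) reduces the required sample size; the other three changes all increase it, as the table indicates.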
Sample Size Considerations
- There are nice closed-form formulas for many of the standard comparisons we are interested in. For other scenarios, formulas do not exist, and simulations must be used.
- When calculating necessary sample sizes, there are various options (formulas, tables, online calculators, proprietary software). It is worthwhile to use more than one to check yourself while you are still getting comfortable doing these calculations and until you find a method that works best for you. They will likely make slightly different assumptions or use slightly different formulas, but the results should be similar.
- It is often good to make a table to see how sample size estimates will change based on different assumptions:
  - One-sided versus two-sided alpha (depends on hypothesis)
  - Alpha (level of type I error you’re comfortable with)
  - Power (level of type II error you’re comfortable with)
  - Preliminary estimates (depends if you have good preliminary data, or you’re just hypothesizing the null values)
  - Estimated differences (see how sample size changes based on difference you’re trying to detect)
10.2 - Test a Single Proportion
Example
The baseline prevalence of smoking in a particular community is 30%. A clean indoor air policy goes into effect. What is the sample size required to detect a decrease in smoking prevalence of at least 2 percentage points, with a one-sided alpha of 0.05 and a power of 90%?
Formula
We are interested in testing the following hypothesis:
\(\begin{array}{l}
\mathrm{H}_{0}\colon \pi=\pi_{0} \\
\mathrm{H}_{1}\colon \pi=\pi_{1}=\pi_{0}+d
\end{array}\)
Where \(\pi\) is the true proportion, \(\pi_0\) is some specified value for the proportion we wish to test (30% in our example), and \(\pi_1\) is the alternative value, which differs from \(\pi_0\) by an amount d (d = 2% in our example).
The formula needed to calculate the sample size is:
\(\displaystyle{n=\frac{1}{d^{2}}\left[z_{\alpha} \sqrt{\pi_{0}\left(1-\pi_{0}\right)}+z_{\beta} \sqrt{\pi_{1}\left(1-\pi_{1}\right)}\right]^{2}}\)
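Where
- \(\pi_0\) = null hypothesized proportion
- \(\pi_1 = \pi_0 + d\) = proportion under the alternative hypothesis
- d = estimated change in proportion
- \(z_{\alpha}\) and \(z_{\beta}\) = standard normal values corresponding to the significance level and to the power (\(1 - \beta\)), respectively
Note that we can replace \(z_{\alpha}\) by \(z_{\alpha / 2}\) for a two-sided test.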
The z terms can be found from a standard normal distribution table, and common values are shown below:
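Quantity | Level | z value |
---|---|---|
Significance level, one-sided | 5% | 1.6449 |
Significance level, one-sided | 1% | 2.3263 |
Significance level, one-sided | 0.1% | 3.0902 |
Significance level, two-sided | 5% | 1.9600 |
Significance level, two-sided | 1% | 2.5758 |
Significance level, two-sided | 0.1% | 3.2905 |
Power | 90% | 1.2816 |
Power | 95% | 1.6449 |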
(Chapter 8.5, p 305, Woodward book)
From the formula, we can calculate that n= 4,417.
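The calculation can be sketched in a few lines. The function below is an illustrative implementation of the formula above (the name `n_single_proportion` and the choice to round up to the next whole subject are assumptions); the z values come from the standard normal distribution rather than from a table.

```python
import math
from statistics import NormalDist

def n_single_proportion(pi0, d, alpha=0.05, power=0.90, two_sided=False):
    """Sample size for testing a single proportion pi0 against pi1 = pi0 + d."""
    pi1 = pi0 + d
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    n = (z_alpha * math.sqrt(pi0 * (1 - pi0))
         + z_beta * math.sqrt(pi1 * (1 - pi1))) ** 2 / d ** 2
    return math.ceil(n)  # round up to a whole number of subjects

# Smoking example: 30% baseline, decrease of 2 percentage points, one-sided 5%, 90% power
n_smoking = n_single_proportion(pi0=0.30, d=-0.02)

# Equivalent framing in terms of non-smoking prevalence: 70% increasing to 72%
n_nonsmoking = n_single_proportion(pi0=0.70, d=0.02)

print(n_smoking, n_nonsmoking)
```

Because d enters the formula only as d squared and π(1 - π) is symmetric around 0.5, a decrease from 30% to 28% and an increase from 70% to 72% give the same sample size.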
The table below can also be used to estimate the necessary sample size. Note that the formula and the Woodward table define d as a positive change. Since we are testing a decrease (30% down to 28%), we need to assume \(\pi_{0}\) is 70%, and that it will go up to 72%. We can think of it as testing the “non-smoking” prevalence.
Note that Woodward offers additional tables in his textbook which can be used for different power, and for a two-sided versus a one-sided test.
Sample Size Statement: A total sample size of n=4,417 is needed to detect a change in smoking prevalence of 2 percentage points (30% down to 28%) using a one-group chi-squared test with one-sided alpha of 0.05 and 90% power.
Table B.8. Sample size requirements for testing the value of a single proportion.
These tables give requirements for a one-sided test directly. For two-sided tests, use the table corresponding to half the required significance level. Note that \(\pi_{0}\) is the hypothesized proportion (under \(H_{0}\)) and \(d\) is the difference to be tested.
(a) 5% significance, 90% power. Columns are values of \(\pi_{0}\); rows are values of \(d\).
\(d\) | 0.01 | 0.10 | 0.20 | 0.30 | 0.40 | 0.50 | 0.60 | 0.70 | 0.80 | 0.90 | 0.95 |
---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 1 178 | 8 001 | 13 923 | 18 130 | 20 625 | 21 406 | 20 475 | 17 830 | 13 473 | 7 400 | 3 717 |
0.02 | 366 | 2 070 | 3 534 | 4 567 | 5 172 | 5 349 | 5 097 | 4 417 | 3 308 | 1 769 | 833 |
0.03 | 192 | 950 | 1 593 | 2 045 | 2 305 | 2 376 | 2 255 | 1 944 | 1 443 | 748 | 322 |
0.04 | 123 | 551 | 908 | 1 158 | 1 300 | 1 335 | 1 262 | 1 083 | 795 | 398 | 148 |
0.05 | 88 | 362 | 589 | 746 | 834 | 853 | 804 | 686 | 498 | 239 | |
0.06 | 67 | 258 | 414 | 521 | 580 | 591 | 555 | 471 | 338 | 155 | |
0.07 | 54 | 194 | 308 | 385 | 427 | 434 | 405 | 342 | 242 | 104 | |
0.08 | 44 | 152 | 238 | 296 | 327 | 331 | 308 | 258 | 181 | 71 | |
0.09 | 38 | 123 | 190 | 235 | 259 | 261 | 242 | 201 | 139 | 48 | |
0.10 | 32 | 102 | 156 | 191 | 210 | 211 | 195 | 161 | 109 | ||
0.15 | 18 | 49 | 72 | 87 | 93 | 92 | 83 | 66 | 40 | ||
0.20 | 12 | 30 | 42 | 49 | 52 | 50 | 44 | 33 | |||
0.25 | 9 | 20 | 27 | 31 | 33 | 31 | 26 | 18 | |||
0.30 | 7 | 14 | 19 | 22 | 22 | 20 | 16 | ||||
0.35 | 5 | 11 | 14 | 16 | 16 | 14 | 10 | ||||
0.40 | 4 | 9 | 11 | 12 | 11 | 10 | |||||
0.45 | 4 | 7 | 8 | 9 | 8 | 6 | |||||
0.50 | 3 | 6 | 7 | 7 | 6 |
(Tables from Woodward, M. Epidemiology Study Design and Analysis. Boca Raton: Chapman and Hall, 2013)
Stop and Think!
Looking at the table values, what happens to the necessary sample size as:
- Prevalence (\(\pi_0\)) increases? Does the sample size increase or decrease?
- What happens to the sample size as effect size decreases?
- What is the minimal detectable difference if you had funds for 1,500 subjects?
Answers:
- The largest sample sizes occur with baseline prevalence at 0.5.
- The smaller the effect size, the larger the sample size.
- About a 3.6 percentage point decrease in prevalence.
10.3 - Compare Two Proportions
Example
Suppose the rate of disease in an unexposed population is 10/100 person-years. You hypothesize an exposure has a relative risk of 2.0. How many persons must you enroll assuming half are exposed and half are unexposed to detect this increased risk, with a one-sided alpha of 0.05 and power of 90%?
Formula
We are interested in testing the following hypothesis:
\(\begin{array}{l}
\mathrm{H}_{0}\colon \pi_{1}=\pi_{2} \\
\mathrm{H}_{1}\colon \pi_{1}-\pi_{2}=\delta
\end{array}\)
But it is usually more convenient to consider the ratio (i.e. relative risk = λ), so we can consider this hypothesis:
\(\begin{array}{l}
\mathrm{H}_{0}: \pi_{1}=\pi_{2} \\
\mathrm{H}_{1}: \pi_{1} / \pi_{2}=\lambda
\end{array}\)
The formulas needed to calculate the total sample size are:
\(\displaystyle{n=\frac{r+1}{r(\lambda-1)^{2} \pi^{2}}\left[z_{\alpha} \sqrt{(r+1) p_{c}\left(1-p_{c}\right)}+z_{\beta} \sqrt{\lambda \pi(1-\lambda \pi)+r \pi(1-\pi)}\right]^{2}}\),
and
\(\displaystyle{p_{c}=\frac{\pi(r \lambda+1)}{r+1}}\)
where
\(\pi=\pi_{2}\) is the proportion in the reference group
\(\mathrm{r}=\mathrm{n}_{1} / \mathrm{n}_{2}\) (ratio of sample sizes in each group)
\(p_{c}=\) the common proportion over the two groups
When r = 1 (equal-sized groups), the formula above reduces to:
\(p_{c}=\dfrac{\pi(\lambda+1)}{2}=\dfrac{\pi_{1}+\pi_{2}}{2}\)
From the formula, we can calculate that n=433 total, thus n=217 per group.
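The two formulas above can be evaluated directly. The following is a minimal sketch under stated assumptions: the function name `n_two_proportions` and its parameters are illustrative, and the z values are computed with the standard-library `statistics.NormalDist` rather than read from a table.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_two_proportions(pi, lam, r=1.0, alpha=0.05, power=0.90):
    """Total sample size (both groups) for detecting relative risk `lam`.

    pi is the proportion in the reference group and r = n1 / n2.
    The test is one-sided at level alpha.
    """
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(power)
    p_c = pi * (r * lam + 1) / (r + 1)  # common proportion over both groups
    bracket = (z_alpha * sqrt((r + 1) * p_c * (1 - p_c))
               + z_beta * sqrt(lam * pi * (1 - lam * pi) + r * pi * (1 - pi)))
    return (r + 1) / (r * (lam - 1) ** 2 * pi ** 2) * bracket ** 2


# Example: 10% risk in the unexposed, relative risk 2.0, equal groups.
n_total = n_two_proportions(pi=0.10, lam=2.0)
per_group = ceil(n_total / 2)  # equal allocation (r = 1)
print(ceil(n_total), per_group, 2 * per_group)  # 433 217 434
```

Rounding up within each group is what turns the formula value of 433 into a recruited total of 434.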
The table below can also be used to estimate the necessary sample size. For the column with \(\pi\)=0.10, with \(\lambda\)=2.0, we see that n=448 total, with n=224 per group, which is approximately the same as the value from the formula.
Sample Size statement: A sample size of n=217 per group (total of 434) is needed to detect an increased risk of disease (relative risk=2.0) when the proportions are 10% in the unexposed and 20% in the exposed groups, using a two-group chi-squared test with one-sided alpha of 0.05 and 90% power.
Table B.9. Total sample size requirements (for the two groups combined) for testing the ratio of two proportions (relative risk) with equal numbers in each group.
These tables give requirements for a one-sided test directly. For two-sided tests, use the table corresponding to half the required significance level. Note that \(\pi\) is the proportion for the reference group (the denominator) and \(\lambda\) is the relative risk to be tested.
(a) 5% significance, 90% power. Columns are values of \(\pi\); rows are values of \(\lambda\).
\(\lambda\) | 0.001 | 0.005 | 0.010 | 0.050 | 0.100 | 0.150 | 0.200 | 0.500 | 0.900 |
---|---|---|---|---|---|---|---|---|---|
0.10 | 23 244 | 4 636 | 2 310 | 488 | 216 | 138 | 100 | 30 | 8 |
0.20 | 32 090 | 6 398 | 3 188 | 618 | 298 | 190 | 136 | 40 | 10 |
0.30 | 45 406 | 9 052 | 4 508 | 874 | 418 | 268 | 192 | 56 | 14 |
0.40 | 66 554 | 13 268 | 6 606 | 1 278 | 612 | 390 | 278 | 78 | 18 |
0.50 | 102 678 | 20 466 | 10 190 | 1 968 | 940 | 598 | 426 | 118 | 26 |
0.60 | 171 126 | 34 104 | 16 976 | 3 274 | 1 562 | 990 | 706 | 192 | 38 |
0.70 | 323 228 | 64 410 | 32 058 | 6 176 | 2 940 | 1 862 | 1 322 | 352 | 62 |
0.80 | 770 020 | 153 422 | 76 348 | 14 688 | 6 980 | 4 412 | 3 128 | 814 | 126 |
0.90 | 3 251 102 | 647 690 | 322 264 | 61 924 | 29 380 | 18 534 | 13 110 | 3 336 | 450 |
1.10 | 3 593 120 | 715 666 | 355 984 | 68 240 | 32 272 | 20 282 | 14 288 | 3 496 | 292 |
1.20 | 941 030 | 187 410 | 93 208 | 17 846 | 8 426 | 5 286 | 3 716 | 890 | |
1.30 | 437 234 | 87 068 | 43 298 | 8 280 | 3 904 | 2 444 | 1 714 | 402 | |
1.40 | 256 630 | 51 098 | 25 406 | 4 854 | 2 284 | 1 428 | 1 000 | 228 | |
1.50 | 171 082 | 34 062 | 16 934 | 3 232 | 1 518 | 948 | 662 | 148 | |
1.60 | 123 556 | 24 596 | 12 226 | 2 330 | 1 094 | 680 | 474 | 104 | |
1.80 | 74 842 | 14 896 | 7 402 | 1 408 | 658 | 408 | 284 | 58 | |
2.00 | 51 318 | 10 212 | 5 074 | 962 | 448 | 278 | 192 | ||
3.00 | 17 102 | 3 400 | 1 688 | 316 | 146 | 88 | 60 | ||
4.00 | 9 498 | 1 886 | 934 | 174 | 78 | 46 | 30 | ||
5.00 | 6 419 | 1 272 | 630 | 116 | 52 | 30 | |||
10.00 | 2 318 | 458 | 226 | 40 | |||||
20.00 | 992 | 194 | 94 |
(Tables from Woodward, M. Epidemiology Study Design and Analysis. Boca Raton: Chapman and Hall, 2013)
Stop and Think!
Looking at the table values, what happens to the necessary sample size as:
- Incidence rate (\(\pi\)) increases?
- Relative risk (\(\lambda\)) decreases?
- How would you use this table to determine sample size for 'protective' effects (i.e., nutritional components or medical procedures which prevent a negative outcome), as opposed to an increased risk?
- What is the minimal detectable relative risk if you had funds for 1,000 subjects?
Answers:
- n decreases.
- The largest n occurs where \(\lambda\) is closest to 1.
- Protective effects would be those with \(\lambda \lt 1\).
- With a background rate of 10/100 and 1,000 subjects, a relative risk of about 1.65 could be detected.
10.4 - Unmatched Case Control
Example
An unmatched case-control study evaluating the association between smoking and CHD is planned.
If 30% of the population is estimated to be smokers, what is the number of study subjects (assuming an equal number of cases and controls in an unmatched study design) necessary to detect a hypothesized odds ratio of 2.0? Assume 90% power and a one-sided alpha of 0.05.
Formula
Due to the design of unmatched case-control studies, where unequal sampling rates are used for cases and controls, we cannot estimate relative risks directly and must instead estimate odds ratios. The hypothesis we wish to test is slightly altered:
\(\begin{array}{l}
\mathrm{H}_{0}^{*}\colon \pi_{1}^{*}=\pi_{2}^{*} \\
\mathrm{H}_{1}^{*}\colon \pi_{1}^{*} / \pi_{2}^{*}=\lambda^{*}
\end{array}\)
Where
\(\begin{array}{l}
\pi_{1}^{*}=p(\text { Exposed } \mid \text { Disease })=p(\text { Exposed } \mid \text { Case }) \\
\pi_{2}^{*}=p(\text { Exposed } \mid \text { No disease })=p(\text { Exposed } \mid \text { Control })
\end{array}\)
The formulas are similar to the formula for relative risk, but with additional parameters.
\(\begin{aligned}
n=\frac{(r+1)(1+(\lambda-1) P)^{2}}{r P^{2}(P-1)^{2}(\lambda-1)^{2}}[ & z_{\alpha} \sqrt{(r+1) p_{c}^{*}\left(1-p_{c}^{*}\right)} \\
& \left.+z_{\beta} \sqrt{\frac{\lambda P(1-P)}{[1+(\lambda-1) P]^{2}}+r P(1-P)}\right]^{2}
\end{aligned}\)
and
\(\displaystyle{p_{c}^{*}=\frac{P}{r+1}\left(\frac{r \lambda}{1+(\lambda-1) P}+1\right)}\)
Where
- \(\mathrm{P}=\) exposure prevalence
- \(\lambda=\) estimated relative risk (approximated by the odds ratio)
- r = ratio of cases to controls
From the formula, we can calculate that n=306 total, thus 153 cases and 153 controls.
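As a cross-check, the unmatched case-control formula can be coded directly. This is a sketch under stated assumptions, not the lesson's own software: the function name `n_unmatched_case_control` and the helper variable `k` are illustrative, and z values are computed with the standard-library `statistics.NormalDist`.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_unmatched_case_control(P, lam, r=1.0, alpha=0.05, power=0.90):
    """Total sample size (cases plus controls) for an unmatched design.

    P is the exposure prevalence, lam the odds ratio to detect,
    and r the ratio of cases to controls. One-sided test at level alpha.
    """
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(power)
    k = 1 + (lam - 1) * P  # shorthand for the repeated term 1 + (lambda - 1)P
    p_c = P / (r + 1) * (r * lam / k + 1)
    bracket = (z_alpha * sqrt((r + 1) * p_c * (1 - p_c))
               + z_beta * sqrt(lam * P * (1 - P) / k ** 2 + r * P * (1 - P)))
    front = (r + 1) * k ** 2 / (r * P ** 2 * (P - 1) ** 2 * (lam - 1) ** 2)
    return front * bracket ** 2


# Example: 30% smokers, odds ratio 2.0, equal numbers of cases and controls.
n_cc = n_unmatched_case_control(P=0.30, lam=2.0)
total = ceil(n_cc)
print(total, total // 2)  # 306 153
```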
The table below can also be used to estimate the necessary sample size. For the column with P=0.30, with \(\lambda\)=2.0, we see that n=306 total, with n=153 cases, and n=153 controls.
Sample Size statement: A total sample size of n=306 (153 cases and 153 controls) is needed to detect an OR of 2.0, assuming the prevalence of exposure is 30%, with one-sided alpha of 0.05 and 90% power.
Table B.10. Total sample size requirements (for the two groups combined) for unmatched case–control studies with equal numbers of cases and controls.
These tables give requirements for a one-sided test directly. For two-sided tests, use the table corresponding to half the required significance level. Note that \(P\) is the prevalence of the risk factor in the entire population and \(\lambda\) is the appropriate relative risk to be tested.
(a) 5% significance, 90% power. Columns are values of \(P\); rows are values of \(\lambda\).
\(\lambda\) | 0.010 | 0.050 | 0.100 | 0.200 | 0.300 | 0.400 | 0.500 | 0.700 | 0.900 |
---|---|---|---|---|---|---|---|---|---|
0.10 | 2 318 | 456 | 224 | 108 | 70 | 50 | 40 | 30 | 38 |
0.20 | 3 206 | 638 | 316 | 158 | 104 | 80 | 66 | 56 | 88 |
0.30 | 4 546 | 912 | 458 | 232 | 160 | 124 | 106 | 98 | 176 |
0.40 | 6 676 | 1 348 | 684 | 356 | 248 | 200 | 176 | 172 | 330 |
0.50 | 10 318 | 2 098 | 1 074 | 566 | 404 | 332 | 296 | 306 | 616 |
0.60 | 17 220 | 3 522 | 1 816 | 974 | 706 | 588 | 536 | 576 | 1 206 |
0.70 | 32 570 | 6 698 | 3 476 | 1 890 | 1 390 | 1 174 | 1 088 | 1 206 | 2 612 |
0.80 | 77 686 | 16 052 | 8 382 | 4 614 | 3 438 | 2 944 | 2 764 | 3 146 | 7 012 |
0.90 | 328 374 | 68 156 | 35 786 | 19 922 | 15 020 | 13 006 | 12 354 | 14 400 | 32 892 |
1.10 | 363 666 | 76 090 | 40 352 | 22 918 | 17 630 | 15 574 | 15 096 | 18 316 | 43 550 |
1.20 | 95 332 | 20 020 | 10 664 | 6 112 | 4 744 | 4 228 | 4 134 | 5 102 | 12 340 |
1.30 | 44 334 | 9 342 | 4 998 | 2 888 | 2 260 | 2 032 | 2 002 | 2 510 | 6 166 |
1.40 | 26 044 | 5 506 | 2 958 | 1 722 | 1 358 | 1 230 | 1 222 | 1 554 | 3 870 |
1.50 | 17 376 | 3 684 | 1 986 | 1 166 | 926 | 846 | 846 | 1 090 | 2 748 |
1.60 | 12 558 | 2 672 | 1 446 | 854 | 684 | 628 | 632 | 826 | 2 106 |
1.80 | 7 618 | 1 630 | 888 | 532 | 432 | 400 | 408 | 546 | 1 420 |
2.00 | 5 230 | 1 124 | 616 | 374 | 306 | 288 | 296 | 404 | 1 074 |
3.00 | 1 754 | 386 | 218 | 138 | 120 | 118 | 126 | 184 | 522 |
4.00 | 978 | 220 | 126 | 84 | 74 | 76 | 84 | 130 | 380 |
5.00 | 664 | 150 | 88 | 60 | 56 | 58 | 66 | 104 | 316 |
10.00 | 244 | 60 | 38 | 30 | 30 | 34 | 40 | 70 | 224 |
20.00 | 108 | 30 | 20 | 18 | 20 | 24 | 30 | 56 | 190 |
(Tables from Woodward, M. Epidemiology Study Design and Analysis. Boca Raton: Chapman and Hall, 2013)
Stop and Think!
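Looking at the table values, what happens to the necessary sample size as:
- Prevalence of the risk factor (\(P\)) increases?
- Odds ratio (\(\lambda\)) decreases?
Answers:
- For many values of \(\lambda\), \(P\) = 0.5 has the smallest sample size requirement.
- The largest sample sizes occur with the OR closest to 1; an OR of 1.1 requires a greater n than an OR of 0.9.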
10.5 - Matched Case Control
Example
In contrast to the unmatched case-control study proposed in 10.4, here, assume we want to plan a matched case-control study evaluating the association between smoking and CHD.
A previous study suggested that the chance of a discordant pair is about 50%. What is the number of study subjects necessary to detect a hypothesized odds ratio of 2.0? Assume 90% power and a one-sided alpha of 0.05.
Formula
In matched case-control study designs, useful data come only from the discordant pairs of subjects; the concordant pairs contribute no information. Matching of cases and controls on a confounding factor (e.g., age, sex) may increase the efficiency of a case-control study.
The sample size for matched study designs may be greater or less than the sample size required for similar unmatched designs because only the pairs discordant on exposure are included in the analysis. The proportion of discordant pairs must be estimated to derive sample size and power. The power of matched case/control study design for a given sample size may be larger or smaller than the power of an unmatched design.
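The hypothesis to be tested is essentially that, among the discordant pairs, the proportion in which the case is the exposed member is 50%, compared to the alternative that it is different from 50% (under an odds ratio of \(\lambda\), this proportion is \(\lambda/(\lambda+1)\)).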
The formulas for sample size calculation for matched case-control study are:
\(\displaystyle{d_{p}=\frac{\left[z_{\alpha}(\lambda+1)+2 z_{\beta} \sqrt{\lambda}\right]^{2}}{(\lambda-1)^{2}}}\) and \(\displaystyle{n=2 d_{p} / \pi_{d}}\)
Where
- \(d_{p}=\) number of discordant pairs needed
- n = total number of subjects (two per matched pair, so n/2 matched pairs)
- \(\lambda =\) estimated relative risk (odds ratio)
- \(\pi_{d}=\) probability of a discordant pair
From the formula, we can calculate that \(d_{p}\) = 73.19, and then n=292.7, so rounding up to the next even number, the study needs 294 individuals, that is, 147 pairs.
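The matched-pairs calculation is short enough to script. This is a minimal sketch with assumed names (`matched_case_control` and its return values are illustrative). Because it uses unrounded z values from the standard-library `statistics.NormalDist`, the intermediate figures can differ slightly in the first or second decimal place from a hand calculation with rounded z values, but the rounded-up number of pairs is the same.

```python
from math import ceil, sqrt
from statistics import NormalDist


def matched_case_control(lam, pi_d, alpha=0.05, power=0.90):
    """Return (discordant pairs needed, matched pairs, total subjects).

    lam is the odds ratio to detect and pi_d the probability that a
    matched pair is discordant on exposure. One-sided test at level alpha.
    """
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(power)
    d_p = (z_alpha * (lam + 1) + 2 * z_beta * sqrt(lam)) ** 2 / (lam - 1) ** 2
    pairs = ceil(d_p / pi_d)  # n / 2, rounded up to whole pairs
    return d_p, pairs, 2 * pairs


d_p, pairs, subjects = matched_case_control(lam=2.0, pi_d=0.50)
print(round(d_p, 1), pairs, subjects)  # 73.3 147 294
```

Rounding up the number of pairs, rather than the number of individuals, guarantees an even total.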
In this scenario, conducting a matched case-control study provides a saving of 12 subjects compared with the unmatched version in 10.4 (294 versus 306).
Sample Size statement: A total sample size of n=294 (147 matched case-control pairs) is needed to detect an OR of 2.0, assuming the probability of a discordant pair is 50%, with one-sided alpha of 0.05 and 90% power.
10.6 - Compare a Single Mean
Example
Suppose the male population of an area in a developing country is known to have had a mean serum total cholesterol of 5.5 mmol/l 10 years ago, with an estimated standard deviation of 1.4 mmol/l. In recent years Western food has been imported into the country and is believed to have increased cholesterol levels. The investigators want to see if mean cholesterol levels have increased a clinically meaningful amount (up to about 6 mmol/l, a difference of 0.5 mmol/l) with a one-sided alpha of 0.05 and a power of 90%.
Formula
We are interested in testing the following hypothesis:
\(\mathrm{H}_{0}\colon \mu=\mu_{0}\)
\(\mathrm{H}_{1}\colon \mu=\mu_{1}\)
The formula needed to calculate the sample size is:
\(\displaystyle{n=\frac{\left(z_{\alpha}+z_{\beta}\right)^{2} \sigma^{2}}{\left(\mu_{1}-\mu_{0}\right)^{2}}}\)
Where
- \(\mu_{0}\)= null hypothesized value
- \(\mu_{1}\)= alternative hypothesized value
- \(\sigma\) = standard deviation
From the formula, we can calculate that n=67.1, so rounding up to the next whole number gives n=68.
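A short script reproduces this value. It is a sketch under stated assumptions: the function name `n_single_mean` and its arguments are illustrative, and the z values come from the standard-library `statistics.NormalDist`.

```python
from math import ceil
from statistics import NormalDist


def n_single_mean(mu0, mu1, sigma, alpha=0.05, power=0.90):
    """Sample size for testing a single mean with a one-sided test."""
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(power)
    return (z_alpha + z_beta) ** 2 * sigma ** 2 / (mu1 - mu0) ** 2


# Cholesterol example: 5.5 rising to 6.0 mmol/l, SD 1.4 mmol/l.
n_chol = n_single_mean(mu0=5.5, mu1=6.0, sigma=1.4)
print(round(n_chol, 1), ceil(n_chol))  # 67.1 68
```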
To use the table below, we can calculate S= (6.0 – 5.5)/1.4 = 0.3571. This exact value does not appear in Table B.7. In these situations, we can get a rough idea of sample size by taking the nearest figure for S. In the example, the nearest tabulated figure is 0.35, which has n = 70 (for one-sided 5% significance and 90% power). This is only slightly above the true value of 67 for S = 0.3571. However, this process can lead to considerable error when S is small, so it is preferable to use the formula.
Sample Size Statement: A total sample size of n=68 is needed to detect a 0.5 mmol/l increase in mean cholesterol compared to an historical value of 5.5 mmol/l using a one-group t-test with one-sided alpha of 0.05 and 90% power, and assuming a standard deviation of 1.4 mmol/l.
Table B.7. Sample size requirements for testing the value of a single mean or the difference between two means.
The table gives requirements for testing a single mean with a one-sided test directly. For two-sided tests, use the column corresponding to half the required significance level. For tests of the difference between two means, the total sample size (for the two groups combined) is obtained by multiplying the requirement given below by 4 if the two sample sizes are equal or by \((r+1)^{2} / r\) if the ratio of the first to the second is \(r : 1\) (assuming equal variances). Note that \(S\) = difference/standard deviation. Column headings give the significance level followed by the power.
\(S\) | 5%, 90% | 5%, 95% | 2.5%, 90% | 2.5%, 95% | 1%, 90% | 1%, 95% | 0.5%, 90% | 0.5%, 95% | 0.1%, 90% | 0.1%, 95% | 0.05%, 90% | 0.05%, 95% |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 85 639 | 108 222 | 105 075 | 129 948 | 130 170 | 157 705 | 148 794 | 178 142 | 191 125 | 224 211 | 209 040 | 243 580 |
0.02 | 21 410 | 27 056 | 26 269 | 32 487 | 32 543 | 39 427 | 37 199 | 44 536 | 47 782 | 56 053 | 52 260 | 60 895 |
0.03 | 9 516 | 12 025 | 11 675 | 14 439 | 14 464 | 17 523 | 16 533 | 19 794 | 21 237 | 24 913 | 23 227 | 27 065 |
0.04 | 5 353 | 6 764 | 6 568 | 8 122 | 8 136 | 9 857 | 9 300 | 11 134 | 11 946 | 14 014 | 13 065 | 15 224 |
0.05 | 3 426 | 4 329 | 4 203 | 5 198 | 5 207 | 6 309 | 5 952 | 7 126 | 7 645 | 8 969 | 8 362 | 9 744 |
0.06 | 2 379 | 3 007 | 2 919 | 3 610 | 3 616 | 4 381 | 4 134 | 4 949 | 5 310 | 6 229 | 5 807 | 6 767 |
0.07 | 1 748 | 2 209 | 2 145 | 2 652 | 2 657 | 3 219 | 3 037 | 3 636 | 3 901 | 4 576 | 4 267 | 4 972 |
0.08 | 1 339 | 1 691 | 1 642 | 2 031 | 2 034 | 2 465 | 2 325 | 2 784 | 2 987 | 3 504 | 3 267 | 3 806 |
0.09 | 1 058 | 1 334 | 1 298 | 1 605 | 1 608 | 1 947 | 1 837 | 2 200 | 2 360 | 2 769 | 2 581 | 3 008 |
0.10 | 857 | 1 083 | 1 051 | 1 300 | 1 302 | 1 578 | 1 488 | 1 782 | 1 912 | 2 243 | 2 091 | 2 436 |
0.15 | 381 | 481 | 467 | 578 | 579 | 701 | 662 | 792 | 850 | 997 | 930 | 1 083 |
0.20 | 215 | 271 | 263 | 325 | 326 | 395 | 372 | 446 | 478 | 561 | 523 | 609 |
0.25 | 138 | 174 | 169 | 208 | 209 | 253 | 239 | 286 | 306 | 359 | 335 | 390 |
0.30 | 96 | 121 | 117 | 145 | 145 | 176 | 166 | 198 | 213 | 250 | 233 | 271 |
0.35 | 70 | 89 | 86 | 107 | 107 | 129 | 122 | 146 | 157 | 184 | 171 | 199 |
0.40 | 54 | 68 | 66 | 82 | 82 | 99 | 93 | 112 | 120 | 141 | 131 | 153 |
0.45 | 43 | 54 | 52 | 65 | 65 | 78 | 74 | 88 | 95 | 111 | 104 | 121 |
0.50 | 35 | 44 | 43 | 52 | 53 | 64 | 60 | 72 | 77 | 90 | 84 | 98 |
0.55 | 29 | 36 | 35 | 43 | 44 | 53 | 50 | 59 | 64 | 75 | 70 | 81 |
(from Woodward, M. Epidemiology Study Design and Analysis. Boca Raton: Chapman and Hall, 2013, p.770)
10.7 - Compare Two Means
Example
Suppose investigators plan an intervention study to help individuals lower their cholesterol, and randomize patients to participate in their new intervention or a control group. They hypothesize at the end of their 6-month intervention the intervention group will have cholesterol levels down to about 5.3, while the control group's cholesterol levels will still be about 6. They assume the standard deviation will still be about 1.4. What sample size is needed to detect this difference with a one-sided alpha of 0.05 and a power of 90%?
Formula
We are interested in testing the following hypothesis:
\(\begin{array}{l}
\mathrm{H}_{0}\colon \mu_{1}=\mu_{2} \\
\mathrm{H}_{1}\colon \mu_{1}-\mu_{2}=\delta
\end{array}\)
The formula needed to calculate the sample size is:
\(\displaystyle{n=\frac{(r+1)^{2}\left(z_{\alpha}+z_{\beta}\right)^{2} \sigma^{2}}{\delta^{2} r}}\)
Where
- \(\mu_{1}\)= hypothesized mean in group 1
- \(\mu_{2}\)= hypothesized mean in group 2
- \(\delta\)= difference in means (null hypothesis \(\delta = 0\), alternative hypothesis \(\delta \ne 0\))
- \(\sigma\) = standard deviation
- \(r = \dfrac{n_1}{n_2}\)
From the formula, we can calculate that n≈137.0; rounding up to the next even number gives n=138, with 69 per group.
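The two-group calculation can be scripted in the same way. This is a minimal sketch; the function name `n_two_means` and its arguments are assumptions made for illustration, and z values are taken from the standard-library `statistics.NormalDist`.

```python
from math import ceil
from statistics import NormalDist


def n_two_means(delta, sigma, r=1.0, alpha=0.05, power=0.90):
    """Total sample size (both groups) for comparing two means.

    delta is the difference in means, sigma the common standard deviation,
    and r = n1 / n2. One-sided test at level alpha.
    """
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(power)
    return (r + 1) ** 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / (delta ** 2 * r)


# Intervention example: 6.0 versus 5.3 mmol/l, SD 1.4 mmol/l, equal groups.
n_means = n_two_means(delta=6.0 - 5.3, sigma=1.4)
per_group = ceil(n_means / 2)  # equal allocation (r = 1)
print(round(n_means, 1), per_group, 2 * per_group)  # 137.0 69 138
```

The table-based answer is slightly larger than this because the tabulated single-mean requirement is already rounded up before it is multiplied by 4.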
To use the table below, we can calculate S= (6.0-5.3)/1.4 = 0.5. For a one-sided alpha of 0.05, we need to use the column for alpha =5%, and 90% power. Reading down to S=0.5, we see n=35. Back to the table header directions, we see that for a test of the difference between two means, we need to multiply the value by 4. Thus, 35*4 = 140 is the total sample size needed.
Sample Size Statement: A total sample size of n=138 (69 per group) is needed to detect a 0.7 mmol/l difference in mean cholesterol using a two-group t-test with one-sided alpha of 0.05 and 90% power, and assuming a common standard deviation of 1.4 mmol/l.
Table B.7. Sample size requirements for testing the value of a single mean or the difference between two means.
The table gives requirements for testing a single mean with a one-sided test directly. For two-sided tests, use the column corresponding to half the required significance level. For tests of the difference between two means, the total sample size (for the two groups combined) is obtained by multiplying the requirement given below by 4 if the two sample sizes are equal or by \((r+1)^{2} / r\) if the ratio of the first to the second is \(r : 1\) (assuming equal variances). Note that \(S\) = difference/standard deviation. Column headings give the significance level followed by the power.
\(S\) | 5%, 90% | 5%, 95% | 2.5%, 90% | 2.5%, 95% | 1%, 90% | 1%, 95% | 0.5%, 90% | 0.5%, 95% | 0.1%, 90% | 0.1%, 95% | 0.05%, 90% | 0.05%, 95% |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 85 639 | 108 222 | 105 075 | 129 948 | 130 170 | 157 705 | 148 794 | 178 142 | 191 125 | 224 211 | 209 040 | 243 580 |
0.02 | 21 410 | 27 056 | 26 269 | 32 487 | 32 543 | 39 427 | 37 199 | 44 536 | 47 782 | 56 053 | 52 260 | 60 895 |
0.03 | 9 516 | 12 025 | 11 675 | 14 439 | 14 464 | 17 523 | 16 533 | 19 794 | 21 237 | 24 913 | 23 227 | 27 065 |
0.04 | 5 353 | 6 764 | 6 568 | 8 122 | 8 136 | 9 857 | 9 300 | 11 134 | 11 946 | 14 014 | 13 065 | 15 224 |
0.05 | 3 426 | 4 329 | 4 203 | 5 198 | 5 207 | 6 309 | 5 952 | 7 126 | 7 645 | 8 969 | 8 362 | 9 744 |
0.06 | 2 379 | 3 007 | 2 919 | 3 610 | 3 616 | 4 381 | 4 134 | 4 949 | 5 310 | 6 229 | 5 807 | 6 767 |
0.07 | 1 748 | 2 209 | 2 145 | 2 652 | 2 657 | 3 219 | 3 037 | 3 636 | 3 901 | 4 576 | 4 267 | 4 972 |
0.08 | 1 339 | 1 691 | 1 642 | 2 031 | 2 034 | 2 465 | 2 325 | 2 784 | 2 987 | 3 504 | 3 267 | 3 806 |
0.09 | 1 058 | 1 334 | 1 298 | 1 605 | 1 608 | 1 947 | 1 837 | 2 200 | 2 360 | 2 769 | 2 581 | 3 008 |
0.10 | 857 | 1 083 | 1 051 | 1 300 | 1 302 | 1 578 | 1 488 | 1 782 | 1 912 | 2 243 | 2 091 | 2 436 |
0.15 | 381 | 481 | 467 | 578 | 579 | 701 | 662 | 792 | 850 | 997 | 930 | 1 083 |
0.20 | 215 | 271 | 263 | 325 | 326 | 395 | 372 | 446 | 478 | 561 | 523 | 609 |
0.25 | 138 | 174 | 169 | 208 | 209 | 253 | 239 | 286 | 306 | 359 | 335 | 390 |
0.30 | 96 | 121 | 117 | 145 | 145 | 176 | 166 | 198 | 213 | 250 | 233 | 271 |
0.35 | 70 | 89 | 86 | 107 | 107 | 129 | 122 | 146 | 157 | 184 | 171 | 199 |
0.40 | 54 | 68 | 66 | 82 | 82 | 99 | 93 | 112 | 120 | 141 | 131 | 153 |
0.45 | 43 | 54 | 52 | 65 | 65 | 78 | 74 | 88 | 95 | 111 | 104 | 121 |
0.50 | 35 | 44 | 43 | 52 | 53 | 64 | 60 | 72 | 77 | 90 | 84 | 98 |
0.55 | 29 | 36 | 35 | 43 | 44 | 53 | 50 | 59 | 64 | 75 | 70 | 81 |
(from Woodward, M. Epidemiology Study Design and Analysis. Boca Raton: Chapman and Hall, 2013, p.770)
10.8 - Additional Sample Size Topics
Ratio of Cases to Controls
Another consideration for sample size is whether the same number of cases and controls should be used.
Power increases, but at a decreasing rate, as the ratio of controls to cases increases. Little additional power is gained at ratios higher than four controls per case, so there is little benefit to enrolling controls beyond that ratio.
Under what circumstances would it be recommended to enroll a large number of controls compared to cases?
Perhaps the small gain in power is worthwhile if the cost of a Type II error is large and the expense of obtaining controls is minimal, such as selecting controls with covariate information from a computerized database. If you must physically locate and recruit the controls, set up clinic appointments, run diagnostic tests, and enter data, the effort of pursuing a large number of controls quickly offsets any gain. In that situation, a one-to-one or two-to-one ratio is more practical. The bottom line is there is little additional power beyond a four-to-one ratio.
Cohort vs. Case-Control Sample Sizes
Sample sizes for cohort studies depend upon the rate of the outcome, not the prevalence of exposure. Sample size for case-control studies is dependent upon prevalence of exposure, not the rate of outcome. Because the rate of outcome is usually smaller than the prevalence of the exposure, cohort studies typically require larger sample sizes to have the same power as a case-control study.
The example below is from a study of smoking and coronary heart disease where the background incidence rate was 0.09 events per person-year among the non-exposed group and the prevalence of the risk factor was 0.3.
The sample size requirements to detect a given relative risk with 90% power using two-sided 5% significance tests for cohort and case-control studies are listed below:
Relative Risk | Cohort study | Case-Control study |
---|---|---|
1.1 | 44,398 | 21,632 |
1.2 | 11,568 | 5,820 |
1.3 | 5,346 | 2,774 |
1.4 | 3,122 | 1,668 |
1.5 | 2,070 | 1,138 |
2 | 602 | 376 |
3 | 188 | 146 |
In such a situation, with a relative risk of 1.1, more than twice as many subjects are required for a cohort study as for a case-control study. In every row of the table, the case-control design requires a smaller sample than the cohort design to detect the same level of increased risk. This is generally true. The comparison also depends on the rate of the outcome, but in general, case-control studies require smaller samples.
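The comparison can be explored by applying the formulas from 10.3 (cohort) and 10.4 (case-control) side by side. The sketch below rests on several assumptions that are not stated in the lesson: equal group sizes (r = 1), a two-sided test implemented by using \(z_{\alpha/2}\), and totals rounded up to the next even number. All function names are illustrative. Under these assumptions the output should agree with the table up to rounding.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf


def cohort_n(pi, lam, z_a, z_b, r=1.0):
    """Total n for comparing two proportions (formula in 10.3)."""
    p_c = pi * (r * lam + 1) / (r + 1)
    bracket = (z_a * sqrt((r + 1) * p_c * (1 - p_c))
               + z_b * sqrt(lam * pi * (1 - lam * pi) + r * pi * (1 - pi)))
    return (r + 1) / (r * (lam - 1) ** 2 * pi ** 2) * bracket ** 2


def case_control_n(P, lam, z_a, z_b, r=1.0):
    """Total n for an unmatched case-control study (formula in 10.4)."""
    k = 1 + (lam - 1) * P
    p_c = P / (r + 1) * (r * lam / k + 1)
    bracket = (z_a * sqrt((r + 1) * p_c * (1 - p_c))
               + z_b * sqrt(lam * P * (1 - P) / k ** 2 + r * P * (1 - P)))
    front = (r + 1) * k ** 2 / (r * P ** 2 * (P - 1) ** 2 * (lam - 1) ** 2)
    return front * bracket ** 2


def round_up_even(n):
    """Round up to the next even integer so both groups are equal."""
    return 2 * ceil(n / 2)


z_a = Z(1 - 0.05 / 2)  # two-sided 5% significance
z_b = Z(0.90)          # 90% power

for lam in (1.1, 1.2, 1.3, 1.4, 1.5, 2, 3):
    cohort = round_up_even(cohort_n(0.09, lam, z_a, z_b))
    case_control = round_up_even(case_control_n(0.30, lam, z_a, z_b))
    print(lam, cohort, case_control)
```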
Furthermore, in designing a cohort study, loss to follow-up is important to consider. Any sample size calculation should be inflated to account for the expected drop-outs, estimated from your own experience or from the literature. For example, if the drop-out rate is expected to be 5%, multiply n by 1/(1-0.05) and recruit the increased number of subjects.
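The inflation step is simple arithmetic, sketched here with an assumed helper name (`inflate_for_dropout`) and, purely for illustration, the cohort requirement of 602 from the table above.

```python
from math import ceil


def inflate_for_dropout(n, dropout_rate):
    """Number to recruit so that about n subjects remain after drop-out."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return ceil(n / (1 - dropout_rate))


# 602 required subjects with an expected 5% drop-out rate.
print(inflate_for_dropout(602, 0.05))  # 634
```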
10.9 - Summary
Calculating the necessary sample size is important for the planning of a study, and sample size justification is often required when requesting funding. We want enough participants to detect an effect if there is one, but not so many that resources (participant time and effort, cost) are wasted, all while minimizing potential errors.
The two types of error occur when you 1) reject the null hypothesis when it is in fact true (type I error) and 2) fail to reject the null hypothesis when it is in fact false (type II error). In practice, you’ll never know if you made either of these errors, but using an appropriate sample size (based on prior knowledge) is a good way to minimize the chance of making them. The primary objectives of most studies fall into the categories discussed in this section, and once you have the estimates needed for the sample size calculations, these formulas can help you decide on the sample size for the study. Often when we are comparing two groups, equal sample sizes in each group are the best choice, but there are situations when unequal sample sizes are appropriate.
Sample size calculations are only as good as the preliminary data put into the formulas, so it is important to use the best information available, and if needed, to run pilot studies first to get good preliminary data.