6.5 - Case-Control Study Design

Case-control study designs are used to estimate the relative risk for a disease from a specific risk factor. The estimate is the odds ratio, which is a good estimate of the relative risk, especially when the disease is rare. Case-control studies are useful when epidemiologists investigate an outbreak of a disease because the study design can be powerful enough to identify the cause of the outbreak even when the sample size is small. Attributable risks may also be calculated.

The approach for a case-control study is straightforward. Case-control studies begin by enrolling persons based upon their current disease status. Previous exposure status is subsequently determined for each case and control. However, because these studies collect data after disease has already occurred, they are considered retrospective, which is a limitation. While a case-control study design offers less support for a causation hypothesis than the longer and more expensive cohort design, it does provide stronger evidence than a cross-sectional study.

Below is a 2 × 2 table for case-control data:


              Cases (Number)                    Controls (Number)                    Total Exposure (Number)
Exposed       A                                 B                                    \(\text{Total}_{\text{Exposed}}\)
Not Exposed   C                                 D                                    \(\text{Total}_{\text{Not Exposed}}\)
              \(\text{Total}_{\text{Cases}}\)   \(\text{Total}_{\text{Controls}}\)   Total

With case-control studies, we essentially work down the columns of the 2 × 2 table. Cases are identified first, then controls. The investigator then determines whether cases and controls were exposed or not exposed to the risk factor. We calculate the odds of exposure among cases (A/C) and the odds of exposure among controls (B/D). The odds ratio is then (A/C)/(B/D), which simplifies, after cross-multiplication, to (A*D)/(B*C).
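The calculation just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lesson: the function name `odds_ratio` and the cell counts are hypothetical, chosen only to show that the ratio of the two odds equals the cross-product (A*D)/(B*C).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 case-control table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    odds_cases = a / c       # odds of exposure among cases (A/C)
    odds_controls = b / d    # odds of exposure among controls (B/D)
    return odds_cases / odds_controls


# Hypothetical counts, for illustration only.
a, b, c, d = 40, 20, 60, 80

print(round(odds_ratio(a, b, c, d), 3))   # ratio of the two odds
print(round((a * d) / (b * c), 3))        # cross-product form, same value
```

The sketch assumes no cell is zero. With a zero in B, C, or D the division fails, and a different approach is needed.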

Think about it!

Come up with an answer to this question before reading the answer below.

Why can't we determine the incidence rate from a case-control study?

We have selected cases and controls from a population, often an unknown population. For example, we might enroll patients in a hospital, but we don't really know the size of the general population that would have come to the hospital. Also, we have not followed persons at risk to monitor the development of disease. Furthermore, the investigator selects the number of cases relative to the number of controls.
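The last point can be checked numerically. In the sketch below (hypothetical counts and function names, not from the lesson), the same 100 cases are paired first with 100 controls and then with 400 controls that have the same exposure distribution. The proportion of exposed subjects who are cases changes with the investigator's choice, so it cannot be read as a risk, while the odds ratio stays the same.

```python
def odds_ratio(a, b, c, d):
    # Cross-product odds ratio: (A*D)/(B*C)
    return (a * d) / (b * c)


def apparent_risk_exposed(a, b):
    # Proportion of exposed subjects who are cases.
    # This depends on how many controls were enrolled, so it is NOT a risk.
    return a / (a + b)


# Hypothetical tables: (A, B, C, D)
one_to_one = (40, 20, 60, 80)     # 100 cases, 100 controls
four_to_one = (40, 80, 60, 320)   # same cases, 400 controls, same exposure split

for label, (a, b, c, d) in [("1 control per case", one_to_one),
                            ("4 controls per case", four_to_one)]:
    print(label,
          "| odds ratio:", round(odds_ratio(a, b, c, d), 3),
          "| cases among exposed:", round(apparent_risk_exposed(a, b), 3))
```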

A most critical and often controversial component of a case-control study is the selection of the controls. Controls must be comparable to cases in every way except that they do not have the disease. Preferably controls are drawn from the same population as the cases. Some studies, though, draw the controls from a different data source. For example, cases may be detected from a disease registry but the controls are selected randomly from another data source. Controls should be selected without regard to their exposure status (e.g., exposed/non-exposed), but may be sampled proportional to their time at risk (which is called density sampling).

There are two basic types of case-control studies, distinguished by the method used to select controls. The first is a non-matched case-control study in which we enroll controls without regard to the number or characteristics of the cases. In this study design, the number of controls does not necessarily equal the number of cases. For example, we may enroll 105 cases and 178 controls. Analytic methods for non-matched case-control studies include:

  • Chi-square 2 × 2 analysis;
  • Mantel-Haenszel statistic (This test combines stratum-specific 2 × 2 tables to adjust for a stratification variable, such as a confounder. It assumes a common effect across strata, so effect modification should be assessed separately.)
  • Fisher’s Exact test (This test is used if an expected cell size is <5)
  • Unconditional logistic regression (The method is used to simultaneously adjust for multiple confounders; a multivariable analysis).
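Two of the methods listed above are simple enough to sketch by hand. The code below is an illustration under stated assumptions, not the course's own implementation: it computes the Pearson chi-square statistic for a single 2 × 2 table (without continuity correction) and the Mantel-Haenszel pooled odds ratio, which assumes a common odds ratio across strata. The function names and all counts are hypothetical. In practice a statistical package would be used, and it would also report p-values and confidence intervals.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical crude table and a hypothetical split into two strata.
crude = (40, 20, 60, 80)
strata = [(10, 5, 20, 30), (30, 15, 40, 50)]

print(round(chi_square_2x2(*crude), 2))
print(round(mantel_haenszel_or(strata), 2))
```

With a single stratum, the Mantel-Haenszel estimate reduces to the ordinary cross-product odds ratio.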

The other basic type is a matched case-control study. In a matched study, we enroll controls based upon some characteristic(s) of the case. For example, we might match the sex of the control to the sex of the case. The idea in matching is to match upon a potential confounding variable in order to remove the confounding effect. (We will look at how matching occurs in the example below.)

There are two basic types of matched designs: one-to-n matching (i.e., one case to one control, or one case to a specific number of controls) and frequency-matching, where matching is based upon the distributions of the characteristics among the cases. For example, 40% of the cases are women so we choose the controls such that 40% of the controls are women.
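Frequency matching can be sketched as stratified sampling from a pool of potential controls. The example below is hypothetical throughout (the data layout, the function name `frequency_match`, and the pool are assumptions): with 40% women among the cases, it draws controls so that 40% of them are women.

```python
import random
from collections import Counter


def frequency_match(cases, pool, n_controls, key, seed=0):
    """Sample controls so the distribution of key() mirrors the cases.

    Raises ValueError if a stratum of the pool is too small for its quota.
    Rounding may make the total differ slightly from n_controls.
    """
    rng = random.Random(seed)
    case_counts = Counter(key(person) for person in cases)
    selected = []
    for stratum, count in case_counts.items():
        quota = round(n_controls * count / len(cases))
        candidates = [p for p in pool if key(p) == stratum]
        selected.extend(rng.sample(candidates, quota))
    return selected


# Hypothetical data: 10 cases, 4 of them women (40%).
cases = [{"sex": "F"}] * 4 + [{"sex": "M"}] * 6
pool = ([{"id": i, "sex": "F"} for i in range(50)]
        + [{"id": i, "sex": "M"} for i in range(50, 100)])

controls = frequency_match(cases, pool, n_controls=20, key=lambda p: p["sex"])
print(Counter(p["sex"] for p in controls))
```

Note that the sampling looks only at the matching variable, never at exposure status.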

In an analysis of a matched study design, only discordant pairs are used. A discordant pair occurs when the exposure status of the case differs from the exposure status of its matched control. Analytic methods for matched case-control studies include conditional logistic regression, conditioned upon the matching.

To review, in a simple non-matched case-control study, you find a case and determine whether that person was exposed. You then find a control and determine that person's exposure status. The data can be summarized in a 2 × 2 table as below:



              Cases   Controls
Exposed       A       B
Not Exposed   C       D

In contrast, the matched case-control study has linked a case to a control based on matching of one or more variables. The summary table will differ for a matched case-control study.

Let's look at an example. Suppose we plan to match cases to controls by gender and age (+/- 5 years). We first identify the following case:

Case: Male, 45 years of age (Patient 1); Exposure status: Exposed

If this were a non-matched study, the case would be counted in cell A in the preceding table because he is exposed. However, in the age- and gender-matched case-control study we must also find a male control within five years of age. Searching in the appropriate control population, we locate the following control:

Control: Male 48 years of age (Person 47); Exposure status: Exposed
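The search for this control can be sketched as a simple filter over the control population. This is a minimal illustration, assuming a hypothetical record layout and function name. Only Patient 1 and Person 47 come from the example; the other persons in the pool are invented. Exposure status is deliberately not consulted during the search.

```python
def find_matched_control(case, pool, used_ids, age_window=5):
    """Return the first unused control of the same sex within +/- age_window years.

    Exposure status is ignored on purpose: controls are selected
    without regard to exposure.
    """
    for person in pool:
        if (person["id"] not in used_ids
                and person["sex"] == case["sex"]
                and abs(person["age"] - case["age"]) <= age_window):
            return person
    return None  # no eligible control found


case = {"id": "Patient 1", "sex": "M", "age": 45}

# Hypothetical control population; only Person 47 appears in the example.
pool = [
    {"id": "Person 45", "sex": "M", "age": 60},   # same sex, age too far
    {"id": "Person 46", "sex": "F", "age": 44},   # age fits, different sex
    {"id": "Person 47", "sex": "M", "age": 48},   # eligible match
]

match = find_matched_control(case, pool, used_ids=set())
print(match["id"])
```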

If Person 47 were counted in an unmatched study, he would belong in cell B of the preceding table. In a matched case-control study, however, we are interested in results for the matched pair. The data from Patient 1 and Person 47 are linked for the duration of the study. The appropriate table for the matched study is depicted below. Where do Patient 1 and Person 47 belong?

                                  Cases
                                  Exposed (Number)                          Not Exposed (Number)                          Total (Number)
Controls   Exposed (Number)       A (Concordant Pair)                       B (Discordant Pair)                           \(\text{Total}_{\text{Exposed Controls}}\)
           Not Exposed (Number)   C (Discordant Pair)                       D (Concordant Pair)                           \(\text{Total}_{\text{Not Exposed Controls}}\)
                                  \(\text{Total}_{\text{Exposed Cases}}\)   \(\text{Total}_{\text{Not Exposed Cases}}\)   Total

Patient 1 is a case and he is exposed, so he fits into either cell A or cell C. Based upon his control's status, we determine which cell is the correct placement for this pair. Patient 1's control is exposed, therefore Patient 1 and Person 47 fit into cell A as a pair. This is a concordant pair because both are exposed. Concordancy is based upon exposure status. In a matched case-control study, the cell counts represent pairs, not individuals. In the statistical analysis, only the discordant pairs are important. Cells B and C contribute to the odds ratio in a matched design; cells A and D do not. With the table laid out as above, the matched odds ratio is estimated by C/B, so if the risk for disease is increased due to exposure, C will be greater than B.
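The placement of pairs into cells, and the use of only the discordant cells, can be sketched as follows. The pair counts are hypothetical. The odds ratio C/B follows the table layout above (C: case exposed, control not exposed; B: control exposed, case not exposed). McNemar's chi-square statistic is included as a commonly used test for one-to-one matched pairs; it is not named in this lesson.

```python
from collections import Counter


def classify_pair(case_exposed, control_exposed):
    """Return the cell (A, B, C, or D) of the matched-pairs table."""
    if case_exposed and control_exposed:
        return "A"   # concordant: both exposed
    if control_exposed:
        return "B"   # discordant: control exposed, case not exposed
    if case_exposed:
        return "C"   # discordant: case exposed, control not exposed
    return "D"       # concordant: neither exposed


def matched_summary(pairs):
    """Cell counts, matched odds ratio C/B, and McNemar chi-square (1 df)."""
    cells = Counter(classify_pair(case, control) for case, control in pairs)
    b, c = cells["B"], cells["C"]
    matched_or = c / b                      # concordant cells A and D are ignored
    mcnemar = (b - c) ** 2 / (b + c)        # no continuity correction
    return cells, matched_or, mcnemar


# Hypothetical pairs: (case_exposed, control_exposed)
pairs = ([(True, True)] * 10 + [(False, True)] * 5
         + [(True, False)] * 15 + [(False, False)] * 20)

cells, matched_or, mcnemar = matched_summary(pairs)
print(dict(cells), matched_or, mcnemar)
```

Patient 1 and Person 47 would be counted by `classify_pair(True, True)`, which returns "A".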

Think about it!

Come up with an answer to this question before reading the answer below.

Can you think of more than one reason why a matched case-control study could take longer to complete than an unmatched study?

First you must identify matched controls, sometimes more than one per case. Second, since only the discordant pairs contribute to the statistical analysis, achieving a desired statistical power depends on obtaining a particular number of discordant pairs.

Think about it!

Come up with an answer to this question before reading the answer below.

Why bother with matching if it means a longer case-control study?

We match to eliminate the possibility of the relationship being confounded by the matching variable because both the case and the control are similar for that variable. In the above example, we control for confounding from age or sex because we matched on age and sex. We don't want to match on too many variables because it will cause an extreme delay in the completion of the study.

When performing statistical analysis, the matched variables are not included in the statistical model.

(In a cohort study, confounding is dealt with by including the terms in the model to adjust for their effects. In a matched case-control study, the adjustment for this confounding has been made through the matching.)

We will learn more about designing a cohort study later in this course. Below is a table comparing the advantages and disadvantages of the cohort design with those of the case-control design.

Quick Comparison of Cohort and Case-control Studies

Cohort Study

  • Can calculate incidence rate, risk, and relative risk
  • Potentially greater strength for causal investigations
  • Expensive
  • Long-term study
  • Large sample size required
  • Efficient design for rare exposure
  • Good for multiple outcomes
  • Less potential for recall bias
  • More potential for loss to follow-up
  • Possibly generalizable
  • Allows examination of natural course of disease, survival

Case-Control Study

  • Only estimates relative risk
  • Potentially weaker causal investigation
  • Inexpensive
  • Short-term study
  • Can be powerful with small sample of cases
  • Efficient design for rare disease
  • Good for multiple exposures
  • More potential for recall bias
  • Less potential for loss to follow-up
  • Probably not generalizable
  • Does not allow examination of natural course of disease, survival

Example 6-3

Serum Carotenoids and Risk of Cervical Intraepithelial Neoplasia in Southwestern American Indian Women

Schiff, M., et al. (2001). Serum Carotenoids and Risk of Cervical Intraepithelial Neoplasia in Southwestern American Indian Women. Cancer Epidemiology, Biomarkers & Prevention, Vol. 10, 1219–1222.

Notice the high study participation rate.

How were cases and controls determined?

Is this a matched study?

Take a look at Table 1: how many cases and how many controls?

Are the demographic characteristics similar for cases and controls?

Are any different?

What statistical method is used to analyze the data?

How do the results support the conclusions (Table 2 and conclusions)?

If you have questions about this study design or the results, ask in the Week 6 General Discussion.