In epidemiology, three types of studies are commonly used. They differ in whether the disease status is fixed first and the possible causes (exposure to a risk factor) are then assessed, or whether exposed and unexposed individuals are followed until the disease develops. We introduce two of them here.

Suppose the variable \(Z\) represents a condition (disease) that is relatively rare in a population (e.g., lymphoma), and we want to assess whether another characteristic or behavior \(Y\) (e.g., smoking) could be a risk factor for \(Z\).

The obvious way to study this is to follow a group of smokers \((Y = 1)\) and a group of nonsmokers \((Y = 2)\) over time and see which ones eventually develop lymphoma \((Z = 1)\) and which do not \((Z = 2)\). This is called a **prospective study**. The exposed and unexposed groups are determined at the start of the study, and both groups are disease-free at that point. While this design is the natural one for assessing whether exposure is related to disease, it can be very

- time-consuming (we may have to wait a long time for the condition to develop)
- inefficient (we may need very large samples to obtain enough subjects with \(Z = 1\))

An alternative is the **retrospective study**, in which we first locate a group of subjects with lymphoma \((Z = 1)\) and identify which are smokers and which are not. Here a diseased group is determined first and is retrospectively assessed for exposure status. Then we locate another group of subjects who are in some sense "comparable" but who do not have lymphoma \((Z = 2)\) and identify which are smokers and which are not. In the retrospective study, we have "sampled on the outcome," choosing individuals on the basis of \(Z\) and then observing \(Y\).

The **interchangeability** of \(Y\) and \(Z\) means that the usual roles of "response" and "explanatory" variables can be reversed, which can be extremely useful for research.

*Because the odds ratio is invariant to exchanging \(Y\) and \(Z\), the odds ratio from a retrospective study should be about the same as the odds ratio from a prospective study in which we sampled individuals according to their \(Y\) values and collected information on \(Z\). A retrospective study provides no information about the overall incidence of \(Z\) in the population because the proportions of cases with \(Z = 1\) and \(Z = 2\) were decided by the investigator. However, it does provide a consistent estimate of the odds ratio indicating the effect of \(Y\) on \(Z\).*
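The invariance can be checked directly from the definition. As a short sketch, assume the notation \(\pi_{ij} = P(Y = i, Z = j)\) for the joint probabilities (this symbol is introduced here for illustration only). Conditioning on \(Y\), as in a prospective study, gives

\(\displaystyle \theta = \dfrac{P(Z=1 \mid Y=1)\,/\,P(Z=2 \mid Y=1)}{P(Z=1 \mid Y=2)\,/\,P(Z=2 \mid Y=2)} = \dfrac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} \)

while conditioning on \(Z\), as in a retrospective study, gives

\(\displaystyle \dfrac{P(Y=1 \mid Z=1)\,/\,P(Y=2 \mid Z=1)}{P(Y=1 \mid Z=2)\,/\,P(Y=2 \mid Z=2)} = \dfrac{\pi_{11}/\pi_{21}}{\pi_{12}/\pi_{22}} = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} \)

Both expressions reduce to the same cross-product ratio, so the odds ratio does not depend on which variable is conditioned on.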

## Example: Lung Cancer

The table below is adapted from Doll and Hill (1950), where 709 lung cancer sufferers were matched with 709 individuals without lung cancer to serve as a control. This is an example of a retrospective, **case-control study**. A study is called case-control when "cases" or diseased subjects and "controls" or comparable non-diseased subjects are sampled from respective populations and then assessed on their risk-factor exposure status.

| | Cancer Yes | Cancer No | Totals |
|---|---|---|---|
| Smoking Yes | 688 | 650 | 1338 |
| Smoking No | 21 | 59 | 80 |
| Totals | 709 | 709 | 1418 |

Lung cancer is the natural response variable of interest, and we would like to condition on smoking to estimate the conditional probabilities of cancer, given smoking status. But since the column totals (the lung cancer frequencies) are fixed by design, each column (not each row) is a separate binomial sample, and the sample conditional probabilities based on row totals do not reflect the corresponding population proportions.

In other words, with retrospective studies, the sample sizes are fixed in a way that runs counter to how we would like to view the variables as explanatory and response. This also affects the interpretation of the relative risk, because it is based on the same conditional probabilities. Fortunately, the odds ratio is numerically invariant, regardless of which totals (row or column) are fixed, which makes it an appropriate measure of association for both retrospective and prospective studies.
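To illustrate why the row-based relative risk is not meaningful here, the following Python sketch compares the observed table with a hypothetical one in which the investigator had sampled twice as many controls with the same smoking split. The doubled-control counts and the function names `relative_risk` and `odds_ratio` are assumptions made for this illustration, not part of the original study.

```python
def relative_risk(a, b, c, d):
    """Row-based 'relative risk' for a 2x2 table [[a, b], [c, d]],
    with rows = exposure (yes, no) and columns = disease (yes, no)."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for the same 2x2 layout."""
    return (a * d) / (b * c)


# Observed counts: smokers (688 cases, 650 controls), nonsmokers (21, 59)
observed = (688, 650, 21, 59)

# Hypothetical design: same cases, twice as many controls (assumed counts)
doubled_controls = (688, 2 * 650, 21, 2 * 59)

for label, table in [("observed", observed),
                     ("doubled controls", doubled_controls)]:
    rr = relative_risk(*table)
    theta = odds_ratio(*table)
    print(f"{label}: relative risk = {rr:.3f}, odds ratio = {theta:.3f}")
```

The row-based relative risk changes with the arbitrary case-to-control ratio, while the odds ratio is the same in both tables.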

To see this invariance for the data above, we can calculate the sample odds ratio as

\(\displaystyle \hat{\theta}=\dfrac{688/650}{21/59}= \dfrac{688/21}{650/59}=2.97 \)

Thus, for this sample, the odds of lung cancer among smokers are 2.97 times the odds of lung cancer among non-smokers. Equivalently, the odds of smoking among those with lung cancer are 2.97 times the odds of smoking among those without lung cancer.
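The two readings of the odds ratio can be reproduced with a few lines of Python. This is a minimal sketch using only the counts in the table; the variable names are chosen here for illustration.

```python
# Counts from the table: rows = smoking (yes, no), columns = cancer (yes, no)
n = [[688, 650],
     [21, 59]]

# Odds of cancer within each smoking group, then their ratio
odds_cancer_smokers = n[0][0] / n[0][1]
odds_cancer_nonsmokers = n[1][0] / n[1][1]
or_given_smoking = odds_cancer_smokers / odds_cancer_nonsmokers

# Odds of smoking within each cancer group, then their ratio
odds_smoking_cases = n[0][0] / n[1][0]
odds_smoking_controls = n[0][1] / n[1][1]
or_given_cancer = odds_smoking_cases / odds_smoking_controls

# Both equal the cross-product ratio (688 * 59) / (650 * 21)
or_cross_product = (n[0][0] * n[1][1]) / (n[0][1] * n[1][0])

print(round(or_given_smoking, 2),
      round(or_given_cancer, 2),
      round(or_cross_product, 2))  # 2.97 2.97 2.97
```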

Source: R. Doll and A. B. Hill, *Br. Med. J.*, 739--748, Sept. 30, 1950.