9.5 - Frequentist Methods: O'Brien-Fleming, Pocock, Haybittle-Peto

From a frequentist point of view, repeated hypothesis testing of accumulating data increases the type I error rate of a clinical trial. Therefore, the frequentist approach to interim monitoring of clinical trials focuses on controlling the type I error rate.
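The inflation described above can be illustrated with a small Monte Carlo sketch. It assumes equally sized groups, a standard normal test statistic under the null hypothesis, and a fixed critical value of 1.96 at every look. The function name, the number of simulated trials, and the seed are illustrative choices, not part of any particular method.

```python
import random
from math import sqrt


def overall_type1_error(n_looks, critical_value=1.96, n_trials=20000, seed=0):
    """Estimate P(reject at any look) under H0 with equally spaced looks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Under H0 each new group adds an independent N(0, 1) increment.
            running_sum += rng.gauss(0.0, 1.0)
            # Standardized statistic based on all data accumulated so far.
            z = running_sum / sqrt(look)
            if abs(z) >= critical_value:
                rejections += 1
                break  # the trial stops at the first rejection
    return rejections / n_trials


if __name__ == "__main__":
    for looks in (1, 2, 5):
        print(looks, "looks:", round(overall_type1_error(looks), 3))
```

With a single look the estimate should sit near 0.05; with more looks at the same critical value it rises well above that, which is the problem the boundary schemes in this section address.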

In most clinical trials, it is not necessary to perform a statistical analysis after each patient is accrued. In fact, for most multi-center clinical trials, interim statistical analyses are conducted only once or twice per year. This frequency of interim analyses usually detects treatment effects nearly as early as continuous monitoring would. A group sequential analysis is one in which only a few scheduled analyses are conducted. Again, let's focus more on the concepts than the statistical details.

Suppose that the group sequential approach consists of R analyses, and let \(Z_1, \dots , Z_R\) denote the test statistics at the R times of hypothesis testing. Data accumulate over time: at each analysis we add to the dataset and analyze everything collected so far. Also, let \(B_1, \dots , B_R\) denote the corresponding boundary points (critical values). At the \(r^{th}\) analysis, the clinical trial is terminated with rejection of the null hypothesis if:

\( |Z_r| \ge B_r, \quad r = 1, 2, \dots , R\)

The boundary points are chosen such that the overall significance level does not exceed the desired \(\alpha\). Three main schemes have been proposed for selecting the boundary points. These are illustrated in the following table for an overall significance level of \(\alpha = 0.05\) and for R = 2, 3, 4, 5. For each scheme, B is the boundary point and \(\alpha\) is the corresponding nominal two-sided significance level at that analysis. The table is constructed under the assumption that n additional patients are accrued before each of the R statistical analyses, so that the total sample size is \(N = nR\).

R  Analysis  O'Brien-Fleming    Haybittle-Peto*    Pocock
             B      \(\alpha\)  B      \(\alpha\)  B      \(\alpha\)
2  1         2.782  0.0054      3.0    0.002       2.178  0.0294
   2         1.967  0.0492      1.960  0.0500      2.178  0.0294
3  1         3.438  0.0006      3.291  0.0010      2.289  0.0221
   2         2.431  0.0151      3.291  0.0010      2.289  0.0221
   3         1.985  0.0471      1.960  0.0500      2.289  0.0221
4  1         4.084  0.00005     3.291  0.00100     2.361  0.0182
   2         2.888  0.0039      3.291  0.00100     2.361  0.0182
   3         2.358  0.0184      3.291  0.00100     2.361  0.0182
   4         2.042  0.0412      1.960  0.0500      2.361  0.0182
5  1         4.555  0.000005    3.291  0.00100     2.413  0.0158
   2         3.221  0.0013      3.291  0.00100     2.413  0.0158
   3         2.630  0.0085      3.291  0.00100     2.413  0.0158
   4         2.277  0.0228      3.291  0.00100     2.413  0.0158
   5         2.037  0.0417      1.960  0.0500      2.413  0.0158
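Each \(\alpha\) entry in the table is the two-sided standard normal tail area beyond the boundary B in the same row. A minimal sketch of that conversion is shown here; the function name is an illustrative choice, and the boundaries passed to it are copied from the table.

```python
from math import erfc, sqrt


def nominal_two_sided_alpha(boundary):
    """Two-sided nominal significance level for a standard normal boundary."""
    # P(|Z| >= b) = 2 * (1 - Phi(b)) = erfc(b / sqrt(2))
    return erfc(boundary / sqrt(2.0))


if __name__ == "__main__":
    # Boundaries taken from the table (R = 2 rows, plus 3.291 and 2.413).
    for b in (2.782, 1.967, 2.178, 3.291, 2.413):
        print(b, round(nominal_two_sided_alpha(b), 4))
```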

For example, if we plan one interim analysis and a final analysis, we select the rows in this table with R = 2. These first two rows of the table give the critical values for the interim analysis and for the final analysis. With the O'Brien-Fleming approach, the interim analysis is conducted with bound 2.782 and the final analysis with bound 1.967. Had the choice been the Haybittle-Peto approach instead, the interim analysis would be conducted with bound 3.0 and the final analysis with bound 1.96.
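The stopping rule \(|Z_r| \ge B_r\) with these R = 2 boundaries can be sketched as follows. The helper name `first_rejection` and the Z values are hypothetical and chosen only for illustration; the boundaries are the ones quoted above.

```python
def first_rejection(z_stats, boundaries):
    """Return the 1-based analysis at which |Z_r| >= B_r first holds, else None."""
    if len(z_stats) > len(boundaries):
        raise ValueError("more test statistics than planned analyses")
    for r, (z, b) in enumerate(zip(z_stats, boundaries), start=1):
        if abs(z) >= b:
            return r  # stop the trial and reject the null hypothesis
    return None  # no boundary crossed at any analysis performed


# Boundaries for R = 2 quoted in the text.
OBRIEN_FLEMING_R2 = [2.782, 1.967]
HAYBITTLE_PETO_R2 = [3.0, 1.960]

if __name__ == "__main__":
    hypothetical_z = [2.90, 2.05]  # made-up interim and final statistics
    print("O'Brien-Fleming stops at:", first_rejection(hypothetical_z, OBRIEN_FLEMING_R2))
    print("Haybittle-Peto stops at:", first_rejection(hypothetical_z, HAYBITTLE_PETO_R2))
```

With these made-up values the interim statistic crosses the O'Brien-Fleming bound but not the Haybittle-Peto bound, so the two schemes can stop the same trial at different analyses.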

As another example, a design with three interim analyses and a final analysis has R = 4. The corresponding four rows in the middle of the table give the critical values for each interim analysis and for the final analysis. Notice that the different approaches 'spend', or distribute, the overall significance level differently across the interim and final analyses.
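One way to see how each scheme distributes the overall significance level is to simulate trials under the null hypothesis and record the cumulative proportion that have crossed a boundary by each analysis. The sketch below does this for the R = 4 boundaries in the table, assuming equally sized groups and a standard normal test statistic. The function name, number of simulated trials, and seed are illustrative, and the results are Monte Carlo estimates rather than exact values.

```python
import random
from math import sqrt

# R = 4 boundaries copied from the table.
BOUNDARIES_R4 = {
    "O'Brien-Fleming": [4.084, 2.888, 2.358, 2.042],
    "Haybittle-Peto": [3.291, 3.291, 3.291, 1.960],
    "Pocock": [2.361, 2.361, 2.361, 2.361],
}


def cumulative_alpha_spent(boundaries, n_trials=40000, seed=1):
    """Estimate the cumulative type I error by each analysis under H0."""
    rng = random.Random(seed)
    first_cross = [0] * len(boundaries)
    for _ in range(n_trials):
        running_sum = 0.0
        for r, b in enumerate(boundaries, start=1):
            running_sum += rng.gauss(0.0, 1.0)  # new group of data under H0
            if abs(running_sum / sqrt(r)) >= b:
                first_cross[r - 1] += 1  # count only the first crossing
                break
    spent, total = [], 0
    for count in first_cross:
        total += count
        spent.append(total / n_trials)
    return spent


if __name__ == "__main__":
    for name, bounds in BOUNDARIES_R4.items():
        print(name, [round(a, 4) for a in cumulative_alpha_spent(bounds)])
```

The last entry of each list estimates the overall type I error of the scheme, and the earlier entries show how much of it has been used by each interim analysis.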

The Pocock approach uses the same significance level at each of the R analyses. Of the three procedures described in the table, it provides the best chance of early trial termination. Many investigators dislike the Pocock approach, however, because of its properties at the final stage of analysis. For example, suppose R = 3 analyses are planned and that statistical significance is not attained at either interim analysis. Suppose that the p-value at the final analysis is 0.0350. Because this exceeds 0.0221, the nominal level in the table for the Pocock approach with R = 3, the result is not statistically significant. If interim analyses had not been scheduled, however, this p-value would have been considered statistically significant \(\left(p = 0.0350 < 0.0500 \right)\).

The Haybittle-Peto (based on intuitive reasoning) and O'Brien-Fleming (based on statistical reasoning) approaches were designed to avoid this problem. On the other hand, these two approaches render it very difficult to attain statistical significance at an early stage.

Example

An example of the Pocock approach is provided in Pocock's book (Pocock. 1983. Clinical Trials: A Practical Approach, New York, John Wiley & Sons). A trial was conducted in patients with non-Hodgkin's lymphoma, in which two drug combinations were compared, namely cytoxan-prednisone (CP) and cytoxan-vincristine-prednisone (CVP). The primary endpoint was presence/absence of tumor shrinkage, a surrogate variable.

Patient accrual lasted over two years and 126 patients participated. Statistical analyses were scheduled after approximately every 25 patients. Chi-square tests (without the continuity correction) were performed at each of the five scheduled analyses. The Pocock approach to group sequential testing requires a significance level of 0.0158 at each analysis. Here is a table with the results of these analyses.

             Tumor shrinkage (patients with shrinkage / patients treated)
             CP       CVP      p-value
Analysis #1  3/14     5/11     p > 0.10
Analysis #2  11/27    13/24    p > 0.10
Analysis #3  18/40    17/36    p > 0.10
Analysis #4  18/54    24/48    0.05 < p < 0.10
Analysis #5  23/67    31/59    0.0158 < p < 0.10

Thus, at the final analysis the CVP combination appeared to be clinically better than the CP combination (53% success versus 34% success), yet the difference was not statistically significant with Pocock’s approach, which concerned the researchers. Further analyses with secondary endpoints convinced the researchers that the CVP combination is superior to the CP combination.
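The p-value ranges reported for the five analyses can be checked with a short sketch of the uncorrected chi-square test for two proportions. It assumes that the first fraction at each analysis is the CP arm and the second is the CVP arm, consistent with the 34% and 53% success rates quoted above. The function and variable names are illustrative.

```python
from math import erfc, sqrt


def chi_square_2x2_p_value(success_a, n_a, success_b, n_b):
    """Pearson chi-square p-value (1 df, no continuity correction)."""
    pooled = (success_a + success_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    z = (success_b / n_b - success_a / n_a) / sqrt(variance)
    # For a 2x2 table the uncorrected Pearson chi-square equals z**2,
    # so its 1-df p-value is the two-sided normal tail area.
    return erfc(abs(z) / sqrt(2.0))


# (CP successes, CP patients, CVP successes, CVP patients) at each analysis.
ANALYSES = [
    (3, 14, 5, 11),
    (11, 27, 13, 24),
    (18, 40, 17, 36),
    (18, 54, 24, 48),
    (23, 67, 31, 59),
]
POCOCK_NOMINAL_ALPHA = 0.0158  # nominal level for R = 5 from the table

if __name__ == "__main__":
    for number, counts in enumerate(ANALYSES, start=1):
        p = chi_square_2x2_p_value(*counts)
        stop = p < POCOCK_NOMINAL_ALPHA
        print(f"Analysis #{number}: p = {p:.4f}, stop early: {stop}")
```

Under these assumptions no analysis reaches the Pocock nominal level, which matches the outcome described in the text.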

How would you decide which of these group sequential methods to use? A major concern is the significance level at the final analysis. Because the O'Brien-Fleming approach preserves close to the desired alpha for the final analysis while still allowing a strong early result to terminate a trial, it has been a popular choice. The REMATCH clinical trial is a good example. Regardless of your choice, it is important to make the operating characteristics of any approach selected for interim analyses clear to the study investigators.