Lesson 9: Treatment Effects Monitoring; Safety Monitoring

Overview

During a clinical trial that runs over a lengthy period of time, it can be desirable to monitor treatment effects as well as to track safety issues. "Interim analysis" or "early stopping" procedures are used to interpret the accumulating information during a clinical trial. There may be a variety of practical reasons for terminating a clinical trial at an early stage. Some of these overlap:

Treatments are found to be convincingly different,
Treatments are found to be convincingly not different,
Side effects or toxicity are too severe to continue treatment, relative to the potential benefits,
The data are of poor quality,
Accrual is too slow to complete the study in a timely fashion,
Definitive information is available from outside the study, making the trial unnecessary or unethical (this is related to the next item),
The scientific questions are no longer important because of other developments,
Adherence to the treatment is unacceptably poor, preventing an answer to the basic question,
Resources to perform the study are lost or no longer available, and/or
The study integrity has been undermined by fraud or misconduct.

(Piantadosi, 2005)

This lesson will examine different methods or guidelines that can be used to help decide whether or not to terminate a clinical trial in progress.

Objectives

Upon completion of this lesson, you should be able to:

Differentiate between valid and invalid reasons for interim analyses and early termination of a trial.
Identify characteristics of a sound plan for interim analysis.
Understand the theoretical framework for a likelihood-based interim analysis.
Compare and contrast the Bayesian approach to analysis with the frequentist approach.
Recognize the general effects of the choice of the prior on the posterior probability distribution from a Bayesian analysis.
Compare α spending functions for 3 group sequential methods for interim analysis.
Comment on the use of a group sequential method in a published statistical analysis.
Recognize a futility assessment and define conditional power.
List topics that should be covered in an interim report to an IRB.
List the advantages and disadvantages of a DSMB and describe who might compose the DSMB.
List the issues of concern to a DSMB in a typical clinical study.

References:

DeMets DL, Lan KK. 1994. Interim analysis: The alpha spending function approach. Statistics in Medicine 13: 1341-1352.

Ellenberg SS, Fleming TR, DeMets DL. 2002. Data Monitoring Committees in Clinical Trials. New York, NY: Wiley.

Piantadosi S. 2005. Treatment Effects Monitoring. In: Piantadosi S. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

Pocock SJ. 1983. Clinical Trials: A Practical Approach. Chichester: John Wiley and Sons.

9.1 - Overview

Data-dependent stopping: general term to describe any statistical or administrative reason for stopping a trial

Consideration of the reasons given earlier may lead you to stop the trial at an early stage, or at least change the protocol.

The review, interpretation, and decision-making aspects of clinical trials based on interim data are necessary but prone to error. If the investigators learn of interim results, their objectivity during the remainder of the trial could be affected. If statistical tests are performed repeatedly on the accumulating data, the Type I error rate is inflated.

There is a natural conflict. On one hand, terminating the trial as early as possible will save costs and labor, expose as few patients as possible to inferior treatments, and allow disseminating information about the treatments quickly. On the other hand, there are pressures to continue the trial for as long as possible in order to increase precision, reduce errors of inference, obtain sufficient statistical power to account for prognostic factors and examine subgroups of interest, and gather information on secondary endpoints.

All of the available statistical methods for interim analyses have some similar characteristics. They

require the investigators to state the objectives clearly and in advance of the study,
assume that the basic study design is sound,
require some structure to the problem beyond the data being observed, and
impose some penalty for early termination.

No statistical method is a substitute for judgment. The statistical criteria provide guidelines, not rules, for terminating the trial, because the decision to stop a trial is not based solely on statistical information collected on one endpoint.

9.2 - Likelihood Methods

It may be possible to assess treatment effects after each patient is accrued, treated, and evaluated. Such an approach is impractical in most circumstances, especially for trials that require lengthy follow-up to determine outcomes.

The first classical likelihood method proposed for this situation is called the sequential probability ratio test (SPRT) and it is based on the likelihood function. (This method is very rarely implemented because it is impractical in the clinical setting, but is important for historical reasons.) Let's review this method in general terms here.

A likelihood function is constructed from a probability model for a sequence of random variables which correspond to the outcome measurements on the experimental units. In the likelihood function, however, the observed data points replace the random variables. Suppose we have a binary response (success/failure) from each patient which is determined immediately after a treatment is administered. (Again, not very practical.) However, for the situation discussed, we are examining one treatment which is administered to every patient. If there are N patients with K successes, and p represents the probability of success within each patient, then the likelihood function is based on the binomial probability function:

\(L(p, K)=p^K(1-p)^{N-K}\)

This is a very simple likelihood function for a very simple example.

If the investigator is trying to decide whether \(p_0\) or \(p_1\) is the more appropriate value of p, then the likelihood ratio can be constructed to assess the evidence:

\(R=\dfrac{L(p_0, K)}{L(p_1, K)}=\left(\dfrac{p_0}{p_1} \right)^K \left(\dfrac{1-p_0}{1-p_1} \right)^{N-K} \)

This is a ratio of two different likelihood functions. If R is large, then the evidence is going to favor \(p_0\). If R is small, then the evidence is going to favor \(p_1\). Therefore, when analyzing interim data, we can calculate the likelihood ratio and stop the trial only if we have the amount of evidence that is expected for the target sample size.

Suppose that N is the target sample size and that after n patients there are k successes. After each treatment we will stop and analyze the data to determine whether to continue the trial or not. Under this scenario, we stop the trial if:

\(R=\dfrac{L(p_0, k)}{L(p_1, k)}=\left(\dfrac{p_0}{p_1} \right)^k \left(\dfrac{1-p_0}{1-p_1} \right)^{n-k} \le R_L \text{ or }\ge R_U \)

where \(R_L\) and \(R_U\) are prespecified constants. Let's not worry about the details of the statistical calculation here. The values of \(R_L\) and \(R_U\) that correspond to testing \(H_0\colon p = p_0\) versus \(H_1 \colon p = p_1\) are \(R_L = \dfrac{\alpha}{(1 - \beta)}\) and \(R_U = \dfrac{(1 - \alpha)}{\beta}\).

A sample schematic of the SPRT in practice is shown below. Here you would calculate R after the treatment of each patient. As you accumulate patients you can see that R is moving around as the trial proceeds. Before we had accrued all of the patients that we wanted we hit the upper boundary and would not recruit the remaining patients.

Here is another example...

The SPRT might be useful in a phase II SE trial in which a treatment is to be monitored closely to determine if it reaches a certain level of success or failure. For example, suppose the investigator considers the treatment successful if \(p = 0.4\) (40% or greater), but considers it a failure if \(p = 0.2\) (20% or less). Thus, the hypothesis testing problem is \(H_0 \colon p = 0.2\) vs. \(H_1 \colon p = 0.4\). Suppose we take \(\alpha = 0.05\) and \(\beta = 0.05\). Then the bounds would be calculated as \(R_L = \dfrac{1}{19}\) and \(R_U = 19\). We would reject \(H_0\) in favor of \(H_1\), and claim success, as soon as R gets small enough, \(R = (0.5)^k (1.33)^{n-k} \leq \dfrac{1}{19}\), where \(0.5 = 0.2/0.4\) and \(1.33 = 0.8/0.6\). On the other hand, we would stop the trial, accept \(H_0\), and claim failure as soon as \(R \geq 19\).
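The stopping rule in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a validated monitoring tool: the function name `sprt` and the outcome sequence are hypothetical, and the bounds 1/19 and 19 are simply the values quoted in the example.

```python
def sprt(outcomes, p0, p1, r_lower, r_upper):
    """Sequential probability ratio test for binary outcomes (1 = success).

    R = L(p0) / L(p1) is updated after every patient. Returns the decision
    and the number of patients observed when it was reached.
    """
    r = 1.0
    n = 0
    for n, success in enumerate(outcomes, start=1):
        # Each success multiplies R by p0/p1, each failure by (1-p0)/(1-p1).
        r *= (p0 / p1) if success else ((1 - p0) / (1 - p1))
        if r <= r_lower:
            return "reject H0", n    # evidence favors p1
        if r >= r_upper:
            return "accept H0", n    # evidence favors p0
    return "inconclusive", n         # no boundary reached with these data


# Hypothetical outcome sequence (1 = response, 0 = no response).
outcomes = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
decision, n_used = sprt(outcomes, p0=0.2, p1=0.4, r_lower=1 / 19, r_upper=19)
print(decision, n_used)
```

Passing the bounds in explicitly keeps the sketch independent of how they were derived. A sequence that ends without crossing either bound is reported as inconclusive.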

The statistical formulation for the SPRT is relatively straightforward, but it is more commonly used in a quality control setting than in clinical trials. The obvious criticism is that each patient's outcome must be observed quickly, before the next patient is recruited. The SPRT also has a positive probability of not reaching either boundary, \(R_L\) or \(R_U\), by the time the target sample size, N, is reached. If this happens, the trial is inconclusive.

9.3 - Bayesian Methods

First, let's review the Bayesian approach in general and then apply it to our current topic of likelihood methods.

The Bayesian approach to statistical design and inference is very different from the classical approach (the frequentist approach).

Before a trial begins, a Bayesian statistician summarizes the current knowledge or belief about the treatment effect, say we call it \(\theta\), in the form of a probability distribution. This is known as the prior distribution for \(\theta\). These assumptions are made prior to conducting the study and collecting any data.

Next, the data from the trial are observed, say we call it X, and the likelihood function of X given \(\theta\) is constructed. Finally, the posterior distribution for \(\theta\) given X is constructed. In essence, the prior distribution for \(\theta\) is revised into the posterior distribution based on the data X. The data collection in the study informs or revises the earlier assumptions.

The following schematic describes this Bayesian approach:

The development of the posterior distribution may be very difficult mathematically and it may be necessary to approximate it through computer algorithms.

The Bayesian statistician performs all inference for the treatment effect by formulating probability statements based on the posterior distribution. This is a very different approach and is not always accepted by the more traditional frequentist oriented statisticians.

In the Bayesian approach, \(\theta\) is regarded as a random variable, about which probability statements can be made. This is the appealing aspect of the Bayesian approach. In contrast, the frequentist approach regards \(\theta\) as a fixed but unknown quantity (called a parameter) that can be estimated from the data.

As an example of the contrasting philosophies, consider the frequentist description and the Bayesian description of a 95% confidence interval for \(\theta\).

Frequentist: "If a very large number of samples, each with the same sample size as the original sample, were taken from the same population as the original sample, and a 95% confidence interval constructed for each sample, then 95% of those confidence intervals would contain the true value of \(\theta\)." This is an extremely awkward and dissatisfying definition but technically represents the frequentist's approach.

Bayesian: "The 95% interval (usually called a credible interval) is a region that contains \(\theta\) with 95% posterior probability." This is much simpler and more straightforward. (As a matter of fact, most people when they first take a statistics course believe that this is the definition of a confidence interval.)

In a Bayesian analysis, if \(\theta\) is a parameter of interest, the analysis results in a probability distribution for \(\theta\). Using the probability distribution, many statements can be made. For example, if \(\theta\) represents a probability of success for a treatment, a statement can be made about the probability that \(\theta > 0.90\) (or any other value).
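To make this kind of probability statement concrete, here is a minimal Python sketch that approximates the posterior on a grid for a binomial success probability. Everything specific in it is an assumption made for illustration: the data (19 successes in 20 patients), the flat prior, the informative prior kernel, and the function name `posterior_prob_exceeds` do not come from the lesson.

```python
def posterior_prob_exceeds(successes, n, threshold, prior=lambda t: 1.0, grid=20000):
    """P(theta > threshold | data) for a binomial likelihood, by grid approximation."""
    width = 1.0 / grid
    total = 0.0
    above = 0.0
    for i in range(grid):
        theta = (i + 0.5) * width      # midpoint of the i-th grid cell
        # Unnormalized posterior = prior x binomial likelihood kernel.
        weight = prior(theta) * theta**successes * (1 - theta) ** (n - successes)
        total += weight
        if theta > threshold:
            above += weight
    return above / total


# Hypothetical data: 19 successes among 20 patients.
flat = posterior_prob_exceeds(19, 20, 0.90)
# Hypothetical informative prior centered at 0.5 (a Beta(6, 6) kernel).
centered = posterior_prob_exceeds(19, 20, 0.90, prior=lambda t: t**5 * (1 - t) ** 5)
print(f"flat prior: {flat:.3f}, prior centered at 0.5: {centered:.3f}")
```

The same data give different posterior probabilities under the two priors, which is the sensitivity to the prior that the following sections discuss.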

9.4 - Bayesian approach in Clinical Trials

With respect to clinical trials, a Bayesian approach can cause some difficulties for investigators because they are not accustomed to representing their prior beliefs about a treatment effect in the form of a probability distribution. In addition, there may be very little prior knowledge about a new experimental therapy, so investigators may be reluctant to or not be able to quantify their prior beliefs. In the business world, the Bayesian approach is used quite often because of the availability of prior information. In the medical field, more often than not, this is not the case.

The choice of a prior distribution can be very controversial. Different investigators may select different priors for the same situation, which could lead to different conclusions about the trial. This is especially true when the data, X, are based on a small sample size, because a small sample modifies the prior distribution only slightly. The posterior distribution then stays very close to the prior, so the results rest almost entirely on the prior assumptions.

When there is little prior information to base your assumptions of the distribution on, Bayesians employ a reference (or vague or non-informative) prior. These are intended to represent a minimal amount of prior information. Although vague priors may yield results similar to those of a frequentist approach, the priors may be unrealistic because they attempt to assign equal weight to all values of θ. Below you can see a very flat distribution, very spread out over a wide range of values.

Similarly, skeptical prior distributions are those that quantify the belief that large treatment effects are unlikely. Enthusiastic prior distributions are those that quantify the belief that large treatment effects are likely. Let's not worry about the calculations, but focus instead on the concepts here...

An example of a Bayesian approach for interim monitoring is as follows. Suppose an investigator plans a trial to detect a hazard ratio of 2 \(\left(\Lambda = 2\right)\) with 90% statistical power \(\left(\beta = 0.10\right)\) using a sample size of at least 90 events. The investigator plans one interim analysis, approximately halfway through the trial, and a final analysis. (This is the more standard approach, as opposed to the SPRT where R was calculated after each treatment.)

The estimated logarithm of the hazard ratio is approximately normally distributed with variance \(\left(\dfrac{1}{d_1}\right) + \left(\dfrac{1}{d_2}\right)\), where \(d_1\) and \(d_2\) are the numbers of events in the two treatment groups. The null hypothesis is that the treatment groups are the same, i.e., \(H_0\colon \Lambda = 1\). Note that the \(log_e\) hazard ratio is 0 under the null hypothesis and the \(log_e\) hazard ratio is 0.693 when \(\Lambda = 2\), the proposed effect size.

Suppose the investigator has access to some pilot data or the published report of another investigator, in which there appeared to be a very small treatment effect with 16 events occurring within each of the two treatment groups. The investigator decides that this preliminary study will form the basis of a skeptical prior distribution for the \(log_e\) hazard ratio with a mean of 0 and a standard deviation of \(0.35 = \left(\dfrac{1}{16} + \dfrac{1}{16}\right)^{\frac{1}{2}}\). This is called a skeptical prior because it expresses skepticism that the treatment is beneficial.

Next, suppose that at the time of the interim analysis (when 45 events have occurred), there are 31 events in one group and 14 events in the other group, such that the estimated hazard ratio is 2.25 (calculations not shown). These values are incorporated into the likelihood function, which modifies the prior distribution to yield the posterior distribution for the \(log_e\) hazard ratio that has a mean = 0.474 and standard deviation = 0.228 (calculations not shown). Therefore, we can calculate the probability that \(\Lambda \ge 2\). From the posterior distribution we construct the following probability statement:

\(Pr[\Lambda \ge 2]=1-\Phi \left(\dfrac{log_e(2)-0.474}{0.228} \right)=1-\Phi(0.961)=0.168\)

where \(\Phi\) represents the cumulative distribution function for the standard normal and \(\Lambda\) is the true hazard ratio.

Conclusion: Based on the results from the interim analysis with a skeptical prior, there is not strong evidence that the treatment is effective because the posterior probability of the hazard ratio exceeding 2 is relatively small. Therefore, there is not enough evidence here to suggest that the study be stopped. How large must this probability be to justify stopping? A threshold should be specified in your protocol before these values are determined.

In contrast, suppose that before the onset of the trial the investigator is very excited about the potential benefit of the treatment. Therefore, the investigator wants to use an enthusiastic prior for the \(log_e\) hazard ratio, i.e., a normal distribution with mean \(= log_e(2) = 0.693\) and standard deviation = 0.35 (same as the skeptical prior).

Suppose the interim data results are the same as those described above. This time, the posterior distribution for the \(log_e\) hazard ratio is normal with mean = 0.762 and standard deviation = 0.228. Then the probability for the posterior distribution is:

\(Pr[\Lambda \ge 2]=1-\Phi \left(\dfrac{log_e(2)-0.762}{0.228} \right)=1-\Phi(-0.302)=0.619\)

This is a drastic change in the probability based on the assumptions that were made ahead of time. In this case, the investigator still may not consider this to be strong evidence that the trial should terminate because the posterior probability of the hazard ratio exceeding 2 does not exceed 0.90.
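The lesson does not show the posterior calculations, so the following Python sketch is one way they might be reproduced with the standard normal-normal conjugate update. It rests on an assumption that is not stated in the text: the variance of the interim \(log_e\) hazard ratio is taken as 4/d with d = 45 total events (an equal-allocation approximation), chosen here because it recovers the quoted posterior means and standard deviation. The function name `normal_posterior` is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def normal_posterior(prior_mean, prior_var, estimate, estimate_var):
    """Conjugate normal update: precision-weighted average of prior and data."""
    prior_prec = 1.0 / prior_var
    data_prec = 1.0 / estimate_var
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, sqrt(post_var)


prior_var = 1 / 16 + 1 / 16      # from the 16 + 16 events of the earlier study
estimate = log(2.25)             # interim log hazard ratio quoted in the text
estimate_var = 4 / 45            # ASSUMPTION: 4/d with d = 45 events in total

results = {}
for label, prior_mean in [("skeptical", 0.0), ("enthusiastic", log(2))]:
    mean, sd = normal_posterior(prior_mean, prior_var, estimate, estimate_var)
    # Posterior probability that the true hazard ratio is at least 2.
    prob = 1 - NormalDist(mean, sd).cdf(log(2))
    results[label] = (mean, sd, prob)
    print(f"{label}: mean={mean:.3f} sd={sd:.3f} Pr[HR >= 2]={prob:.3f}")
```

Because both priors share the same variance, the posterior standard deviation is identical in the two cases. Only the posterior mean, and therefore the tail probability, moves with the prior.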

Nevertheless, the example demonstrates the controversy that can arise with a Bayesian analysis when the amount of experimental data is small, i.e., the selection of the prior distribution drives the decision-making process. For this reason, many investigators prefer to use non-informative priors. Using the Bayesian methods, you can make probability statements about your expected results.

9.5 - Frequentist Methods: O'Brien-Fleming, Pocock, Haybittle-Peto

From a frequentist point of view, repeated hypothesis testing of accumulating data increases the type I error rate of a clinical trial. Therefore, the frequentist approach to interim monitoring of clinical trials focuses on controlling the type I error rate.
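The inflation is easy to see in a small simulation. The sketch below is a hedged illustration only: it assumes equally sized groups of data, a two-sided test at the unadjusted critical value 1.96 at every look, and normally distributed test statistics. The function name, seed, and number of simulated trials are arbitrary choices.

```python
import random
from math import sqrt


def repeated_testing_error(looks, critical=1.96, trials=20000, seed=1):
    """Estimate P(any |Z_r| >= critical) under H0 with equally spaced looks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        cumulative = 0.0
        for r in range(1, looks + 1):
            # Each new group of data adds an independent N(0, 1) increment
            # to the unstandardized cumulative sum.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / sqrt(r)    # standardized statistic at look r
            if abs(z) >= critical:
                rejections += 1
                break                   # the trial stops at the first rejection
    return rejections / trials


for looks in (1, 2, 5, 10):
    print(looks, repeated_testing_error(looks))
```

With a single look the estimate should sit near the nominal 0.05, and it grows as more unadjusted looks are added.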

In most clinical trials, it is not necessary to perform a statistical analysis after each patient is accrued. In fact, for most multi-center clinical trials, interim statistical analyses are conducted only once or twice per year. Usually this frequency of interim analyses detects treatment effects nearly as early as continuous monitoring. The group sequential analysis is defined as the situation in which only a few scheduled analyses are conducted. Again, let's focus more on the concepts than the statistical details.

Suppose that the group sequential approach consists of R analyses, and we let \(Z_1, \dots , Z_R\) denote the test statistic at the R times of hypothesis testing. So, we are accumulating data over time. We are adding to the dataset and analyzing the current set that has been collected. Also, we let \(B_1, \dots , B_R\) denote the corresponding boundary points (critical values). At the \(r^{th}\) interim analysis, the clinical trial is terminated with rejection of the null hypothesis if:

\( |Z_r| \ge B_r, \quad r = 1, 2, \dots , R\)

The boundary points are chosen such that the overall significance level does not exceed the desired \(\alpha\). There are primarily three schemes for selecting the boundary points which have been proposed. These are illustrated in the following table for an overall significance level of \(\alpha = 0.05\) and for R = 2,3,4,5. The table is constructed under the assumption that n patients are accrued at each of the R statistical analyses so that the total sample size is \(N = nR\).

R	Analysis number	O'Brien-Fleming B	O'Brien-Fleming \(\alpha\)	Haybittle-Peto B	Haybittle-Peto \(\alpha\)	Pocock B	Pocock \(\alpha\)
2	1	2.782	0.0054	3.0	0.002	2.178	0.0294
2	2	1.967	0.0492	1.960	0.0500	2.178	0.0294
3	1	3.438	0.0006	3.291	0.0010	2.289	0.0221
3	2	2.431	0.0151	3.291	0.0010	2.289	0.0221
3	3	1.985	0.0471	1.960	0.0500	2.289	0.0221
4	1	4.084	0.00005	3.291	0.0010	2.361	0.0182
4	2	2.888	0.0039	3.291	0.0010	2.361	0.0182
4	3	2.358	0.0184	3.291	0.0010	2.361	0.0182
4	4	2.042	0.0412	1.960	0.0500	2.361	0.0182
5	1	4.555	0.000005	3.291	0.0010	2.413	0.0158
5	2	3.221	0.0013	3.291	0.0010	2.413	0.0158
5	3	2.630	0.0085	3.291	0.0010	2.413	0.0158
5	4	2.277	0.0228	3.291	0.0010	2.413	0.0158
5	5	2.037	0.0417	1.960	0.0500	2.413	0.0158

For example, if we plan one interim analysis and a final analysis, we will select the rows in this table with R = 2. Using these first two rows of the table, we find the critical values for the interim analysis and for the final analysis. With the O'Brien-Fleming approach, the interim analysis is conducted with bound 2.782 and the final analysis with bound 1.967. On the other hand, had the choice been the Haybittle-Peto approach, the interim analysis would be conducted with bound 3.0 and the final analysis with bound 1.96.

In another situation with three interim analyses and a final analysis, R=4. View the corresponding four rows in the middle of the table to determine critical values for each interim and the final analysis. Notice different approaches 'spend' or distribute the overall significance differently across the interim and final analyses.

The Pocock approach uses the same significance level at each of the R analyses. Of the three procedures described in the table, it provides the best chance of early trial termination. Many investigators dislike the Pocock approach, however, because of its properties at the final stage of analysis. For example, suppose R = 3 analyses are planned and that statistical significance is not attained at any of the analyses. Suppose that the p-value at the final analysis is 0.0350 (this is > 0.0221 found in the table for the Pocock approach). If interim analyses had not been scheduled, however, this p-value would be considered to provide a statistically significant result \(\left(p = 0.0350 < 0.0500 \right)\).

The Haybittle-Peto (based on intuitive reasoning) and O'Brien-Fleming (based on statistical reasoning) approaches were designed to avoid this problem. On the other hand, these two approaches render it very difficult to attain statistical significance at an early stage.
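The entries in the table above can be checked with a short Python sketch. It assumes two-sided tests, so the nominal level for a boundary B is \(2(1 - \Phi(B))\), and it uses the O'Brien-Fleming shape \(B_r = B_R\sqrt{R/r}\) for equally spaced analyses. The final-analysis boundaries are copied from the table rather than derived, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def nominal_level(boundary):
    """Two-sided nominal significance level for a critical value B."""
    return 2 * (1 - NormalDist().cdf(boundary))


def obrien_fleming_boundaries(final_boundary, looks):
    """O'Brien-Fleming shape: B_r = B_R * sqrt(R / r) for r = 1..R."""
    return [final_boundary * sqrt(looks / r) for r in range(1, looks + 1)]


# Final-analysis boundaries taken from the table above.
for looks, final_b in [(2, 1.967), (3, 1.985), (4, 2.042), (5, 2.037)]:
    bounds = obrien_fleming_boundaries(final_b, looks)
    print(looks, [(round(b, 3), round(nominal_level(b), 4)) for b in bounds])

# Pocock uses one boundary throughout, e.g. 2.178 for R = 2.
print(round(nominal_level(2.178), 4))
```

The early O'Brien-Fleming boundaries are large because of the \(\sqrt{R/r}\) factor, which is why that approach spends very little of the overall significance level at the first analyses.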

Example

An example of the Pocock approach is provided in Pocock's book (Pocock. 1983. Clinical Trials: A Practical Approach, New York, John Wiley & Sons). A trial was conducted in patients with non-Hodgkin's lymphoma, in which two drug combinations were compared, namely cytoxan-prednisone (CP) and cytoxan-vincristine-prednisone (CVP). The primary endpoint was presence/absence of tumor shrinkage, a surrogate variable.

Patient accrual lasted over two years and 126 patients participated. Statistical analyses were scheduled after approximately every 25 patients. Chi-square tests (without the continuity correction) were performed at each of the five scheduled analyses. The Pocock approach to group sequential testing requires a significance level of 0.0158 at each analysis. Here is a table with the results of these analyses.

Analysis	Tumor shrinkage, CP	Tumor shrinkage, CVP	p-value
Analysis #1	3/14	5/11	p > 0.10
Analysis #2	11/27	13/24	p > 0.10
Analysis #3	18/40	17/36	p > 0.10
Analysis #4	18/54	24/48	0.05 < p < 0.10
Analysis #5	23/67	31/59	0.0158 < p < 0.10

Thus, the researchers were concerned that the CVP combination appeared to be clinically better than the CP combination (53% success versus 34% success), yet it did not lead to a statistically significant result with Pocock’s approach. Further analyses with secondary endpoints convinced the researchers that the CVP combination is superior to the CP combination.
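The p-value ranges reported for this example can be checked from the counts in the table. The Python sketch below applies the Pearson chi-square test without continuity correction to each analysis and compares the p-value with the Pocock level of 0.0158. The function name is hypothetical, and the identity used for the one-degree-of-freedom tail probability is noted in the code.

```python
from math import erfc, sqrt


def chi_square_2x2(success_a, n_a, success_b, n_b):
    """Pearson chi-square test (no continuity correction) for two proportions."""
    a, b = success_a, n_a - success_a
    c, d = success_b, n_b - success_b
    n = n_a + n_b
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))


# (CP responders, CP patients, CVP responders, CVP patients) at each analysis.
analyses = [(3, 14, 5, 11), (11, 27, 13, 24), (18, 40, 17, 36),
            (18, 54, 24, 48), (23, 67, 31, 59)]
p_values = []
for number, counts in enumerate(analyses, start=1):
    stat, p = chi_square_2x2(*counts)
    p_values.append(p)
    verdict = "significant" if p < 0.0158 else "not significant"
    print(f"Analysis #{number}: chi2={stat:.2f} p={p:.3f} -> {verdict}")
```

None of the five p-values falls below 0.0158, so the Pocock rule never triggers, even though the final p-value would pass a single unadjusted test at 0.05.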

How would you decide which of these group sequential methods to use? A major concern is the significance level at the final analysis. Because the O'Brien-Fleming approach preserves a level close to the desired alpha for the final analysis while still allowing a strong early result to terminate a trial, it has been a popular choice. The REMATCH clinical trial is a good example. Regardless of your choice, it is important to make the operating characteristics of the selected approach clear to the study investigators.

9.6 - Alpha Spending Function approach

A few drawbacks to the group sequential approach to interim statistical testing include the strict requirements that:

The number of scheduled analyses, R, must be determined prior to the onset of the trial
There is equal spacing between scheduled analyses with respect to patient accrual.

The alpha spending function approach was developed to overcome these drawbacks (DeMets and Lan, 1994).

Let τ denote the information fraction available during the course of a clinical trial. For example, in a clinical trial with a target sample size, N, in which treatment group means will be compared, the information fraction at an interim analysis is \(\tau = \dfrac{n}{N}\), where n is the sample size at the time of the interim analysis. If your target sample size is 500 and you have taken measurements on 400 patients, then \(\tau = 0.8\).

If the clinical trial involves a time-to-event endpoint, then the information fraction is \(\tau = \dfrac{d}{D}\), where D is the target number of events for the entire trial and d is the events that have occurred at the time of the interim analysis.

The alpha spending function, \(\alpha(\tau)\), is an increasing function. At the beginning of the trial, \(\tau = 0\) and \(\alpha(\tau) = 0\); at the end of the trial, \(\tau = 1\) and \(\alpha(\tau) = \alpha\), the desired overall significance level. In other words, every time an analysis is performed, part of the overall alpha is "spent". For the \(r^{th}\) interim analysis, where the information fraction is \(\tau_r\), \(0 \le \tau_r \le 1\), \(\alpha(\tau_r)\) determines the probability of any of the first r analyses leading to rejection of the null hypothesis when the null hypothesis is true. Obtaining the critical values consecutively requires numerically integrating the distribution function. A program is available in this module, along with the DeMets-Lan paper.

As a simple example, suppose investigators are planning a trial in which patients are examined every two weeks over a 12-week period. The investigators would like to incorporate an interim analysis when one-half of the subjects have completed at least one-half of the trial. This corresponds to \(\tau = 0.25\).

A simple spending function that is a compromise between the Pocock and O'Brien-Fleming functions is \(\alpha(\tau) = \tau\alpha\), \(0 \le \tau \le 1\). With \(\tau = 0.25\) at the interim analysis and an overall \(\alpha = 0.05\), this leads to a significance level of 0.012 at the interim analysis and a significance level of 0.04 at the final analysis (calculations not shown). Many variations of spending functions have been devised.
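For readers who want to see roughly how such significance levels arise, here is a hedged Python sketch for two analyses under the linear spending function. It is not the program referred to in this lesson. It assumes a two-sided test, an overall \(\alpha = 0.05\), an interim analysis at \(\tau = 0.25\), and test statistics whose correlation between analyses is \(\sqrt{\tau_1/\tau_2}\). The function names, the Simpson integration, and the bisection search are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def linear_spending(tau, alpha=0.05):
    """Cumulative type I error spent by information fraction tau."""
    return alpha * tau


def second_look_crossing(c1, c2, tau1, steps=400):
    """P(|Z1| < c1 and |Z2| >= c2) under H0, by Simpson's rule over z1."""
    rho = sqrt(tau1)                  # corr(Z1, Z2) when the final tau is 1
    s = sqrt(1.0 - rho * rho)

    def integrand(z):
        upper = 1.0 - STD.cdf((c2 - rho * z) / s)
        lower = STD.cdf((-c2 - rho * z) / s)
        return STD.pdf(z) * (upper + lower)

    h = 2.0 * c1 / steps              # steps must be even for Simpson's rule
    total = integrand(-c1) + integrand(c1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(-c1 + i * h)
    return total * h / 3.0


def two_look_boundaries(tau1, alpha=0.05):
    """Critical values for one interim analysis at tau1 and a final analysis."""
    spent = linear_spending(tau1, alpha)
    c1 = STD.inv_cdf(1 - spent / 2)
    target = alpha - spent            # error left to spend at the final look
    lo, hi = 0.0, 8.0
    for _ in range(60):               # bisection: crossing prob falls as c2 rises
        mid = (lo + hi) / 2
        if second_look_crossing(c1, mid, tau1) > target:
            lo = mid
        else:
            hi = mid
    return c1, (lo + hi) / 2


c1, c2 = two_look_boundaries(0.25)
interim_level = 2 * (1 - STD.cdf(c1))
final_level = 2 * (1 - STD.cdf(c2))
print(f"interim: B={c1:.3f} level={interim_level:.4f}")
print(f"final:   B={c2:.3f} level={final_level:.4f}")
```

The interim level equals the error spent so far. The final nominal level is larger than the remaining error because some paths that would cross at the final analysis have already crossed at the interim analysis.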

Regardless of whether a sequential, group sequential or alpha spending function approach is invoked, the estimates of a treatment effect will be biased when a trial is terminated at an early stage. The earlier the decision, the larger the bias. Intuitively, if the target sample size is 200 and the trial terminates after 25 patients because of a significant difference between treatment groups, you recognize the potential for a lot of bias in this situation. Are 25 patients a representative sample from the population?

9.7 - Futility Assessment with Conditional Power; Adaptive Designs

As an alternative to the above methods, we might want to terminate a trial when the results of the interim analysis are unlikely to change after accruing more patients (futility assessment/curtailed sampling). It just doesn't look like there could ever be a significant difference!

Unconditional power, as we have used in earlier sample size calculations, is the probability of achieving a significant result at a pre-specified alpha under a pre-specified alternative treatment effect, as calculated at the beginning of a trial. Conditional power is an approach that quantifies the probability of rejecting the null hypothesis of no effect once some data are available. If this quantity is very small, a conclusion can be reached that it would be futile to continue the investigation.

As a simple example, consider the situation in which we want to determine if a coin is fair, so the hypothesis testing problem is:

\(H_0: p = Pr[\text{Heads}] = 0.5 \text{ versus } H_1: p = Pr[\text{Heads}] > 0.5\).

The fixed sample size plan is to toss the coin 500 times and count the number of heads, X. But do we actually need to flip the coin 500 times? Under the fixed sample size plan, we would reject \(H_0\) at the 0.025 significance level if:

\(Z=\dfrac{X-250}{\sqrt{(500)(0.5)(0.5)}} \ge 1.96 \)

This is equivalent to rejecting \(H_0\) if X ≥ 272. Suppose that after 400 tosses of the coin there are 272 heads. It is futile to proceed further because even if the remaining 100 tosses all yielded tails, the null hypothesis still would be rejected at the 0.025 significance level. The calculation of the conditional power in this example is trivial (it equals 1) because no matter what is assumed about the true value of p, the null hypothesis would be rejected if the trial were taken to completion.

You can also look at this in the other direction. Suppose that after 400 tosses of the coin there are 200 heads. The null hypothesis will be rejected if there are at least 72 heads during the remaining 100 tosses.

Even if p = 0.6 (arbitrary assignment), the conditional power is:

\(Pr[X \ge 72 | n=100, p=0.6]\)

\(= Pr\left[\dfrac{X-60}{\sqrt{(100)(0.6)(0.4)}} \ge \dfrac{72-60}{\sqrt{(100)(0.6)(0.4)}} \right]\)

\(= Pr[Z \ge 2.45] = 0.007\)

The probability based on a standard normal table is calculated to be 0.007, a very small probability. Thus, it is futile to continue because there is such a small chance of rejecting \(H_0\).
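The coin example can be written as a short Python sketch. It computes the critical count for the fixed-sample test and then the conditional power given interim data, using both an exact binomial tail and the normal approximation without continuity correction that the text uses. The function names are hypothetical, and p = 0.6 is the same arbitrary assumed value as in the text.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist


def critical_count(n_total, p_null=0.5, z_crit=1.96):
    """Smallest number of heads that rejects H0 in the fixed-sample test."""
    return ceil(n_total * p_null + z_crit * sqrt(n_total * p_null * (1 - p_null)))


def conditional_power(heads_so_far, tosses_so_far, n_total, p_assumed):
    """P(final count reaches the critical count | interim data, p = p_assumed).

    Returns (exact binomial value, normal approximation).
    """
    remaining = n_total - tosses_so_far
    needed = critical_count(n_total) - heads_so_far
    if needed <= 0:
        return 1.0, 1.0               # rejection is already guaranteed
    if needed > remaining:
        return 0.0, 0.0               # rejection is no longer possible
    # Exact binomial tail probability over the remaining tosses.
    exact = sum(comb(remaining, x) * p_assumed**x * (1 - p_assumed) ** (remaining - x)
                for x in range(needed, remaining + 1))
    # Normal approximation without continuity correction, as in the text.
    mean = remaining * p_assumed
    sd = sqrt(remaining * p_assumed * (1 - p_assumed))
    approx = 1 - NormalDist().cdf((needed - mean) / sd)
    return exact, approx


print(critical_count(500))
print(conditional_power(272, 400, 500, p_assumed=0.5))
print(conditional_power(200, 400, 500, p_assumed=0.6))
```

The exact and approximate values differ slightly because the normal approximation ignores the discreteness of the binomial count, but both lead to the same futility conclusion here.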

Similarly, two clinical trial scenarios can be envisioned:

A trend in favor of rejecting \(H_0\) is observed at \(t < T\), with intervention \(>\) control. Compute the conditional probability of rejecting \(H_0\) at T given the current data. If this probability is sufficiently large, one might argue that the trend is not going to disappear.
A negative trend consistent with \(H_0\) is observed at \(t < T\). Compute the conditional probability of rejecting \(H_0\) at the end of the trial, T, given that some alternative \(H_1\) is true. How large does the true effect need to be before the negative trend is reversed? If a trend reversal is highly unlikely, termination might be considered.

Adaptive Designs

As we have seen, emerging trends may cause investigators to consider making changes in a study, such as increasing a sample size or terminating the study. An adaptive design, which pre-specifies how the study design may change based on observed results, can be useful. The group sequential strategies that we have already discussed are examples of a classical approach to one kind of adaptation, namely early termination. In confirmatory trials, any adaptive design must maintain the statistical validity of the conclusions; control of Type I error is critical. On the other hand, adaptive designs for studies aimed at finding safe and effective doses emphasize strategies for assigning more participants to treatments with favorable responses and do not consider control of the type I error rate as important as identifying the most effective doses to enter confirmatory trials.

View the table from Bhatt DL, Mehta C. Adaptive designs for clinical trials. New England Journal of Medicine. 2016;375(1):65–74. doi: 10.1056/NEJMra1510061. pmid:27406349, to see the strengths and weaknesses of different adaptive designs. In the paper, the authors examine 4 case studies of different adaptive designs used in confirmatory trials.

9.8 - Monitoring and Interim Reporting for Trials

Single-Center Trials

Here are some practical issues as they relate to single-center trials. Typically, an investigator for a single-center trial needs to submit an annual report to his/her IRB. The report should address whether the study is safe and whether it is appropriate to continue.

The report should include the following topics:

Compliance with governmental and institutional oversight,
Review of eligibility (low frequency of ineligible patients entering the trial),
Treatment review (most patients are adhering to the treatment regimen),
Summary of response,
Summary of survival,
Adverse events,
Safety monitoring rules (possibly statistical criteria for evaluating safety endpoints), and
Audit and other quality assurance reviews.

Multi-Center Trials

A multi-center trial is one in which there are one or more clinical investigators at each of a number of locations (centers). Obviously, multi-center trials are of great importance when the disease is not common and a single investigator is capable of recruiting only a handful of patients.

Advantages of a multi-center trial (Pocock, 1983) include the following:

Larger sample size and quicker patient accrual,
Broader interpretation of results, because participants are drawn from various geographic regions (this adds to external validity), and
Increased scientific merit of the trial because of collaborations among experienced clinical scientists involved in the design and implementation of the study.

Of course, there is a downside. Disadvantages of a multi-center trial include the following:

Planning is more complex,
The study is going to be more expensive,
More effort needs to go into ensuring compliance with the clinical protocol across all centers,
Quality control must be implemented for taking measurements and recording data,
You need a data coordinating center (DCC) for storing and monitoring data and for organizing investigators,
A need develops to keep all investigators involved and motivated,
Avoidance of passive investigators,
Compromise between quantity and quality of centers, and
A need for strong leadership.

The NIH requires a Data and Safety Monitoring Board (DSMB) to monitor the progress of a multi-center clinical trial that it sponsors. Although the FDA does not require a pharmaceutical/biotech company to construct a DSMB for its multi-center clinical trials, many companies are starting to use DSMBs on a regular basis.

There are several advantages that a DSMB provides, such as yielding a mechanism for protecting the interests and safety of the trial participants, while maintaining scientific integrity. The manner in which it is constructed should ensure that the DSMB is financially and scientifically independent of the study investigators so that decisions about early stopping or study continuation are made objectively. Depending on the circumstances, a DSMB may be composed of anywhere from three to ten experts in medicine, statistics, epidemiology, data management, clinical chemistry, and ethics. None of the study investigators should be a part of the DSMB. In addition, the DSMB should not be masked to treatment assignment when it is evaluating a clinical trial. Although investigators and statisticians may submit information and materials to the DSMB for their study, most of the deliberations made by the DSMB are kept confidential. The DSMB reports directly to the sponsor of the multi-center trial (the NIH or the company) and does not report to the investigators.

A DSMB typically examines the following issues when assessing the worth of a multi-center clinical trial:

Are the treatment groups comparable at baseline?
Are the accrual rates meeting initial projections and is the trial on its scheduled timeline?
Are the data of sufficient quality?
Are the treatment groups different with respect to safety and toxicity data?
Are the treatment groups different with respect to efficacy data?
Should the trial continue?
Should the protocol be modified?
Are other descriptive statistics, graphs, or analyses needed for the DSMB to make its decisions?

The major disadvantage of a DSMB holding the decision-making authority in a multi-center clinical trial, instead of the investigators, is that expertise may be sacrificed in order to maintain impartiality. Investigators gain valuable knowledge during the course of the trial and it is not possible to provide the DSMB with the totality of this knowledge. Nevertheless, the advantages of a DSMB seem to outweigh this disadvantage during the conduct of a multi-center trial.

A comprehensive book on DSMBs is available: Ellenberg SS, Fleming TR, DeMets DL. Data Monitoring Committees in Clinical Trials. New York, NY: Wiley; 2002.

9.9 - Summary

In this lesson, among other things, we learned how to:

Differentiate between valid and invalid reasons for interim analyses and early termination of a trial.
Identify the characteristics of a sound plan for interim analysis.
Understand the theoretical framework for a likelihood-based interim analysis.
Compare and contrast the Bayesian approach to analysis with the frequentist approach.
Recognize the general effects of the choice of the prior on the posterior probability distribution from a Bayesian analysis.
Compare α spending functions for 3 group sequential methods for interim analysis.
Comment on the use of a group sequential method in a published statistical analysis.
Recognize a futility assessment and define conditional power.
List topics that should be covered in an interim report to an IRB.
List the advantages and disadvantages of a DSMB and describe who might compose the DSMB.
List the issues of concern to a DSMB in a typical clinical study.
