Lesson 11: Estimating Clinical Effects
Overview
The design of a clinical trial imposes structure on the resulting data. For example, in pharmacologic treatment mechanism (Phase I) studies, blood samples are used to display concentration × time curves, which relate to simple physiologic models of drug distribution and/or metabolism. As another example, in SE (Phase II) trials of cytotoxic drugs, investigators are interested in tumor response and toxicity of the drug or regimen. The usual study design permits estimating the unconditional probability of response or toxicity in patients who met the eligibility criteria.
For every trial, investigators must distinguish between those analyses, tests of hypotheses, or other summaries of the data that are specified a priori and justified by the design and those which are exploratory. Remember, the results from statistical analyses of endpoints that are specified a priori in the protocol carry more validity. Although exploratory analyses are important and might uncover biological relationships previously unsuspected, they may not be statistically reliable because it is not possible to account for the random nature of exploratory analyses. Exploratory analyses are not confirmatory by themselves but generate hypotheses for future research.
Objectives
- State the objectives of a pharmacokinetic model.
- Use a SAS program to calculate a confidence interval for an odds ratio.
- Use a SAS program to perform a Mantel-Haenszel analysis to estimate an odds ratio adjusted for strata effects.
- Recognize when odds ratios or relative risks differ significantly between groups.
- Modify a SAS program to perform JT and Cochran-Armitage tests for trend.
- Interpret a Kaplan-Meier survival curve.
- Interpret SAS output comparing survival curves.
- Describe the process of bootstrapping to estimate variability of an estimator.
Reference:
Piantadosi, S. (2005). Counting Subjects and Events; Estimating Clinical Effects. In: Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley & Sons, Inc.
Friedman, L. M., Furberg, C. D., DeMets, D. L., Reboussin, D. M., and Granger, C. B. (2015). Survival Analysis. In: Fundamentals of Clinical Trials. 5th ed. New York: Springer.
11.1 - Dose-Finding (Phase I) Studies
One of the principal objectives of dose-finding (DF) studies is to assess the distribution and elimination of drug in the human system. What level of the drug is appropriate?
Pharmacokinetic (PK) models, also known as compartmental models, provide useful analytical approaches for DF studies. The objective of a PK model is to account for the absorption, distribution, metabolism, and excretion of a drug in the human system. In a two-compartment PK model, for example, the objective is to estimate the rates into, between, and out of the two compartments.
Estimates are made for each of these processes, i.e., absorption rate, distribution rate, etc., for the drug in question.
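For illustration, the flow in a two-compartment model can be sketched numerically. The dose and rate constants (`k10`, `k12`, `k21`) below are hypothetical, and Python is used here rather than SAS; the sketch simply integrates the two mass-balance equations with a small Euler step.

```python
# Minimal sketch of a two-compartment PK model after an IV bolus,
# integrated with a simple Euler scheme.  All parameter values are
# hypothetical and chosen only to illustrate the model structure.

def two_compartment(dose, k10, k12, k21, dt=0.001, t_end=24.0):
    """Return (times, central, peripheral, eliminated) amount curves.

    k10: elimination rate from the central compartment
    k12: transfer rate, central -> peripheral
    k21: transfer rate, peripheral -> central
    """
    c, p = dose, 0.0                      # drug amounts in each compartment
    times, central, peripheral = [0.0], [c], [p]
    t, eliminated = 0.0, 0.0
    while t < t_end:
        dc = (-k10 * c - k12 * c + k21 * p) * dt
        dp = (k12 * c - k21 * p) * dt
        eliminated += k10 * c * dt        # track drug leaving the system
        c, p, t = c + dc, p + dp, t + dt
        times.append(t); central.append(c); peripheral.append(p)
    return times, central, peripheral, eliminated

times, central, peripheral, elim = two_compartment(
    dose=100.0, k10=0.3, k12=0.5, k21=0.2)
```

Fitting the rate constants to observed concentration × time data (rather than fixing them, as here) is what a PK analysis actually does.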
11.2 - Safety and Efficacy (Phase II) Studies: The Odds Ratio
The main objectives of most safety and efficacy (SE) studies with a new treatment are to estimate the frequency of adverse reactions and estimate the probability of treatment success. These types of endpoints often are expressed in binary form, presence/absence of adverse reaction, success/failure of treatment, etc., although this is not always the case.
Adverse reactions are often classified on an ordinal scale, such as absent/mild/moderate/severe. The primary efficacy endpoint also may be measured on an ordinal scale, such as failure/partial success/success, or it may be a time-to-event variable or measured on a continuous scale, such as a measurement of blood pressure. There are many ways to assess efficacy.
Estimates of risk can be useful in SE studies. Suppose that an SE study consists of a placebo group and a treatment group, and that probability of an adverse reaction is an important investigation. Let \(p_1\) and \(p_2\) denote the respective probabilities of an adverse reaction for the treatment and placebo groups. Three common parameters of risk are as follows:
\(\text{Risk Difference }= \Delta = p_1-p_2\)
\(\text{Relative Risk }= \dfrac{p_1}{p_2}\)
\(\text{Odds Ratio }= \theta = \dfrac{p_1/ (1-p_1)}{p_2/(1-p_2)}\)
For the relative risk, a value significantly different from 1 indicates a difference between the two groups in the risk for the event. The odds ratio indicates the relative odds of the event occurring between the two groups. Because both are ratios, the relative risk and the odds ratio are assessed in terms of their distance from 1.0.
When would the odds ratio and the relative risk be about the same? When \(p_1\) and \(p_2\) are relatively small, for instance, when you are dealing with a very rare event.
The odds ratio is useful and convenient for assessing risk when the response outcome is binary, but it does have some limitations.
\(p_1\) | \(p_2\) | Risk Diff | Rel Risk | Odds Ratio |
0.25 | 0.05 | 0.20 | 5.00 | 6.33 |
0.30 | 0.10 | 0.20 | 3.00 | 3.86 |
0.45 | 0.25 | 0.20 | 1.80 | 2.45 |
0.70 | 0.50 | 0.20 | 1.40 | 2.33 |
Notice in the table above that while the absolute risk difference is constant, the relative risk varies greatly, as does the odds ratio. Thus, the magnitudes of the odds ratio and relative risk are strongly influenced by the initial probability of the condition.
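The table values can be reproduced directly from the three definitions above; here is a minimal sketch in Python (rather than SAS):

```python
# Risk difference, relative risk, and odds ratio for a pair of event
# probabilities, reproducing the rows of the table above.

def risk_measures(p1, p2):
    risk_diff = p1 - p2
    rel_risk = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return risk_diff, rel_risk, odds_ratio

for p1, p2 in [(0.25, 0.05), (0.30, 0.10), (0.45, 0.25), (0.70, 0.50)]:
    rd, rr, orr = risk_measures(p1, p2)
    print(f"p1={p1:.2f}  p2={p2:.2f}  RD={rd:.2f}  RR={rr:.2f}  OR={orr:.2f}")
```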
When the outcome in a CTE trial is a binary response and the objective is to compare the two groups with respect to the proportion of success, the results can be expressed in a 2 × 2 table as
Group #1 | Group #2 | |
Success | \(r_1\) | \(r_2\) |
Failure | \(n_1 - r_1\) | \(n_2 - r_2\) |
The estimated relative risk is \(\left(\dfrac{r_1}{n_1}\right) \bigg/ \left(\dfrac{r_2}{n_2}\right)\) and the estimated odds ratio is:
\(\hat{\theta}=\frac{r_1/(n_1-r_1)}{r_2/(n_2-r_2)}=\frac{r_1*(n_2-r_2)}{r_2*(n_1-r_1)}\)
There are a variety of methods for performing the statistical test of the null hypothesis \(H_0 \colon \theta = 1\) (or \(H_0 \colon \Delta = 0\)), such as a z-test using a normal approximation, a \(\chi^2\) test (basically, a square of the z-test), a \(\chi^2\) test with continuity correction, and Fisher's exact test.
The normal and \(\chi^2\) approximations for testing \(H_0 \colon \theta = 1\) are relatively accurate if these conditions hold:
\(\frac{n_1(r_1+r_2)}{n_1+n_2} \ge 5, \frac{n_2(r_1+r_2)}{n_1+n_2} \ge 5, \frac{n_1(n_1+n_2-r_1-r_2)}{n_1+n_2} \ge 5, \frac{n_2(n_1+n_2-r_1-r_2)}{n_1+n_2} \ge 5\)
This expression is basically what we would have calculated for the expected values in the 2 × 2 table. The first part of the expression is the probability of success times the probability of being in group 1 times the number of subjects.
Otherwise, Fisher's exact test is recommended.
If the above conditions are met, then the \(\log_e\)-transformed estimated odds ratio has an approximate normal distribution:
\( log_e(\hat{\theta}) \sim N \left( \mu=log_e(\theta), \sigma^2=\frac{1}{r_1}+\frac{1}{r_2}+\frac{1}{n_1-r_1}+\frac{1}{n_2-r_2} \right)\)
Therefore, an approximate \(100(1 - \alpha)\%\) confidence interval for the log odds ratio is possible and would look like:
\(log_e(\hat{\theta}) \pm z_{1-\alpha/2}\sqrt{\frac{1}{r_1}+\frac{1}{r_2}+\frac{1}{n_1-r_1}+\frac{1}{n_2-r_2}}\)
An approximate \(100(1 - \alpha)\%\) confidence interval for the odds ratio is constructed by exponentiating the endpoints of the \(100(1 - \alpha)\%\) confidence interval for the log odds ratio. Statistical software can do this for you.
SAS® Example
Using PROC FREQ in SAS for performing statistical inference with the odds ratio in a two-way frequency table
An investigator conducted a small safety and efficacy study comparing the treatment to the placebo with respect to adverse reactions. The data are as follows:
Treatment | Placebo | |
adverse reaction | 12 | 4 |
no adverse reaction | 32 | 40 |
***********************************************************************
* This is a program that illustrates the use of PROC FREQ in SAS for *
* performing statistical inference with the odds ratio in a two-way *
* frequency table. *
***********************************************************************;
proc format;
value groupfmt 0='placebo' 1='treatment';
value noyesfmt 0='no' 1='yes';
run;
data adverse_reactions;
input group advreact count;
format group groupfmt. advreact noyesfmt.;
cards;
0 0 40
0 1 4
1 0 32
1 1 12
;
run;
proc freq data=adverse_reactions;
tables group*advreact/chisq measures;
exact chisq measures;
weight count;
title "Statistical Inference Using the Odds Ratio";
run;
The estimated odds ratio is calculated as:
\(\hat{\theta}=\dfrac{(12)(40)}{(32)(4)}=3.75\)
and the approximate 95% confidence interval for the \(log_e\) odds ratio is
\(1.32 \pm (1.96 \times 0.62) = (0.10, 2.54)\)
so the 95% confidence interval for \(\theta\) is (1.10, 12.68).
Because the approximate 95% confidence interval for \(\theta\) does not contain 1.0, the null hypothesis of \(H_0 \colon \theta = 1\) is rejected at the 0.05 significance level.
Even though this data table satisfies the criteria for \(\text{log}_e\) estimated odds ratio to follow an approximate normal distribution, there still is a discrepancy between the approximate results and the exact results.
From PROC FREQ of SAS, the exact 95% confidence interval for \(\theta\) is (1.00, 17.25). Because this interval does contain 1.0, \(H_0 \colon \theta = 1\) is not rejected at the 0.05 significance level based on Fisher's exact test.
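The approximate analysis above can also be reproduced by hand; below is a minimal sketch in Python (rather than SAS). Small differences from the printed interval (1.10, 12.68) arise only from rounding the log-scale endpoints before exponentiating.

```python
import math

# Odds ratio and approximate 95% confidence interval for the
# adverse-reaction table above (treatment: 12/44, placebo: 4/44).

def odds_ratio_ci(r1, n1, r2, n2, z=1.96):
    """r = events, n = group size; returns (or_hat, (lower, upper))."""
    or_hat = (r1 * (n2 - r2)) / (r2 * (n1 - r1))
    log_or = math.log(or_hat)
    # standard error of the log odds ratio
    se = math.sqrt(1/r1 + 1/r2 + 1/(n1 - r1) + 1/(n2 - r2))
    lo, hi = log_or - z * se, log_or + z * se
    return or_hat, (math.exp(lo), math.exp(hi))

or_hat, (lo, hi) = odds_ratio_ci(r1=12, n1=44, r2=4, n2=44)
```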
11.3 - Safety and Efficacy (Phase II) Studies: The Mantel-Haenszel Test for the Odds Ratio
Sometimes a safety and efficacy study is stratified according to some factor, such as clinical center, disease severity, gender, etc. In such a situation, it still may be desirable to estimate the odds ratio while accounting for strata effects. The Mantel-Haenszel test for the odds ratio assumes that the odds ratio is equal across all strata, although the rates, \(p_1\) and \(p_2\), may differ across strata. This procedure calculates the odds ratio within each stratum and then combines the strata estimates into one estimate of the common odds ratio. For example,
Stratum | \(p_1\) | \(p_2\) | \(\theta\) |
1 | 0.50 | 0.25 | 3.00 |
2 | 0.40 | 0.18 | 3.00 |
3 | 0.30 | 0.12 | 3.00 |
4 | 0.20 | 0.08 | 3.00 |
SAS® Example
Using PROC FREQ for conducting a Mantel-Haenszel test
A company performed a multi-center safety and efficacy study at six sites, with a binary outcome (success/failure), for comparing placebo and treatment.
***********************************************************************
* This is a program that illustrates the use of PROC FREQ for *
* conducting a Mantel-Haenszel test within the setting of a multi- *
* center clinical trial. *
***********************************************************************;
proc format;
value centfmt 1='Phoenix'
2='Denver'
3='Miami'
4='New York'
5='Atlanta'
6='Chicago';
value groupfmt 0='Placebo'
1='Treatment';
value respfmt 0='Failure'
1='Success';
run;
data one;
input center group response count;
format center centfmt. group groupfmt. response respfmt.;
cards;
1 0 0 24
1 0 1 8
1 1 0 20
1 1 1 12
2 0 0 44
2 0 1 28
2 1 0 40
2 1 1 32
3 0 0 25
3 0 1 5
3 1 0 20
3 1 1 10
4 0 0 14
4 0 1 16
4 1 0 12
4 1 1 18
5 0 0 32
5 0 1 20
5 1 0 24
5 1 1 28
6 0 0 45
6 0 1 5
6 1 0 32
6 1 1 18
;
run;
proc freq data=one;
tables center*group*response/cmh;
exact comor;
weight count;
title "Example of the Mantel-Haenszel Test Applied to a Multi-Center Trial";
run;
SAS PROC FREQ yields an estimated common odds ratio of 1.84 with an approximate 95% confidence interval of (1.28, 2.66).
The exact 95% confidence interval is (1.26, 2.69). The exact and asymptotic confidence intervals are nearly identical due to the large sample size across the six clinical centers.
\(H_0 \colon \theta = 1\) is rejected at the 0.05 significance level (\(p = 0.0013\)), which is consistent with the 95% confidence interval not containing 1.0. (Later in this chapter we discuss the construction of the Mantel-Haenszel test statistic.)
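The Mantel-Haenszel point estimate itself is easy to verify by hand. The sketch below (Python rather than SAS) applies the standard Mantel-Haenszel formula, with numerator \(\sum_k a_k d_k / n_k\) and denominator \(\sum_k b_k c_k / n_k\) over the stratum 2 × 2 tables, to the six center tables from the example:

```python
# Mantel-Haenszel common odds ratio computed from the six center-specific
# 2x2 tables in the SAS example above.  Each tuple is
# (treatment successes, treatment failures, placebo successes, placebo failures).

tables = [
    (12, 20,  8, 24),   # Phoenix
    (32, 40, 28, 44),   # Denver
    (10, 20,  5, 25),   # Miami
    (18, 12, 16, 14),   # New York
    (28, 24, 20, 32),   # Atlanta
    (18, 32,  5, 45),   # Chicago
]

def mantel_haenszel_or(tables):
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

or_mh = mantel_haenszel_or(tables)   # about 1.84, matching PROC FREQ
```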
11.4 - Safety and Efficacy (Phase II) Studies: Trend Analysis
In some safety and efficacy studies, it is of interest to determine if an increase in the dose yields an increase (or decrease) in the response. The statistical analysis for such a situation is called a dose-response or trend analysis. The question is whether there is a trend, not just a difference among groups. Typically, patients in a dose-response study are randomized to K + 1 treatment groups (a placebo dose and K increasing doses of the drug). The response variables of interest may be binary, ordinal, or continuous (in some circumstances, the response variable may be a time-to-event variable). Trend tests can be sensitive enough to reveal a mild trend in situations where pair-wise comparisons would fail to detect significant differences.
For the sake of illustration, suppose that the response is continuous and that we want to determine if there is a trend in the K + 1 population means.
A one-sided hypothesis testing framework for investigating an increasing trend is
\(H_0 \colon \mu_0 = \mu_1 = \dots = \mu_K\) versus
\(H_1 \colon \mu_0 \le \mu_1 \le \dots \le \mu_K\), with at least one strict inequality.
A one-sided hypothesis testing framework for investigating a decreasing trend is
\(H_0 \colon \mu_0 = \mu_1 = \dots = \mu_K\) versus
\(H_1 \colon \mu_0 \ge \mu_1 \ge \dots \ge \mu_K\), with at least one strict inequality.
A two-sided hypothesis testing framework for investigating a trend is
\(H_0 \colon \mu_0 = \mu_1 = \dots = \mu_K\) versus
\(H_1 \colon \mu_0 \le \mu_1 \le \dots \le \mu_K \text{ or } \mu_0 \ge \mu_1 \ge \dots \ge \mu_K\), with at least one strict inequality.
More than likely, one of the one-sided tests would be used, because the direction of the dose effect usually is anticipated.
For a continuous response, an appropriate test is the Jonckheere-Terpstra (JT) trend test, developed in the 1950s. The JT trend test is based on a sum of Mann-Whitney-Wilcoxon tests:
\(JT=\sum_{k=0}^{K-1}\sum_{k'=k+1}^{K}MWW_{kk'}\)
where \(MWW_{kk'}\) is the Mann-Whitney-Wilcoxon test for comparing group \(k\) to group \(k'\), \(0 \le k < k' \le K\). Essentially, each pair of groups is compared and the results are summed; this is how the test detects a trend.
If \(Y_{ki}, i = 1, \dots , n_k\), denote the observations from group \(k\), and \(Y_{k'i'}, i' = 1, \dots , n_{k'}\), denote the observations from group \(k'\), then
\(MWW_{kk'}=\sum_{i=1}^{n_k}\sum_{i'=1}^{n_{k'}}sign(Y_{k'i'}-Y_{ki})\)
Note that each MWW statistic should be constructed in a consistent manner. For example, when comparing an observation from a lower dose group with an observation from a higher dose group, take the difference of the latter minus the former.
As an example of how the JT statistic is constructed, suppose there are four dose groups in a study (placebo, low dose, mid-dose, and high dose). Then the JT trend test is the sum of six Mann-Whitney-Wilcoxon test statistics:
{placebo vs. low dose} +
{placebo vs. mid dose} +
{placebo vs. high dose} +
{low dose vs. mid dose} +
{low dose vs. high dose} +
{mid dose vs. high dose}
Values of the statistic JT near zero support
\(H_0 \colon \mu_0 = \mu_1 = \dots = \mu_K\) (the means are equal).
Large positive values of JT support
\(H_1 \colon \mu_0 \le \mu_1 \le \dots \le \mu_K\), with at least one strict inequality (an increasing trend).
Large negative values of JT support
\(H_1 \colon \mu_0 \ge \mu_1 \ge \dots \ge \mu_K\), with at least one strict inequality (a decreasing trend).
The JT trend test actually is testing hypotheses about population medians, but if the underlying probability distribution is symmetric, the population mean and the population median are equal to one another. The JT trend test is available in PROC FREQ of SAS.
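The construction of JT from pairwise sign sums can be sketched directly; the small data set below is hypothetical, and Python is used rather than SAS:

```python
# Jonckheere-Terpstra statistic as a sum of pairwise Mann-Whitney-Wilcoxon
# sign sums, following the definition above.

def sign(x):
    return (x > 0) - (x < 0)

def mww(lower_group, higher_group):
    # sum of sign(y' - y) over all pairs, higher-dose minus lower-dose
    return sum(sign(y2 - y1) for y1 in lower_group for y2 in higher_group)

def jt_statistic(groups):
    # groups: list of samples ordered from lowest to highest dose
    K = len(groups) - 1
    return sum(mww(groups[k], groups[kp])
               for k in range(K)
               for kp in range(k + 1, K + 1))

# hypothetical responses for placebo, low dose, and high dose
groups = [[27, 28, 31], [31, 33, 34], [38, 40, 41]]
jt = jt_statistic(groups)   # 26: a large positive value, increasing trend
```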
The parametric version of the JT trend test, based on the assumption of normal data, is to substitute the difference between sample means for the Mann-Whitney-Wilcoxon statistics. The numerator for the parametric test is as follows:
\(\sum_{k=0}^{K-1}\sum_{k'=k+1}^{K}(\bar{Y}_{k'}-\bar{Y}_{k})\)
Next, we assume that the \(K + 1\) groups have a homogeneous population variance, \(\sigma^2\) . The population variance is estimated by the pooled sample variance, \(s^2\) , and it has d degrees of freedom:
\(s^2=\frac{1}{d}\sum_{k=0}^{K}\sum_{i=1}^{n_k}(Y_{ki}-\bar{Y}_{k})^2, d=\sum_{k=0}^{K}(n_k-1)\)
Letting \(c_k = 2k - K, k = 0, 1, \dots , K\), the numerator reduces to:
\(\sum_{k=0}^{K}c_k \bar{Y}_{k}\)
Then the trend statistic is:
\(T=\left( \sum_{k=0}^{K}c_k \bar{Y}_{k} \right)\bigg/\left( \sqrt{s^2 \sum_{k=0}^{K}\dfrac{c_{k}^{2}}{n_{k}}} \right)\)
For example, if \(K = 3\) (placebo, low dose, mid dose, and high dose), then \(c_0 = -3, c_1 = -1, c_2 = 1, c_3 = 3\). Notice, however, that if there is an odd number of groups, the middle group has a coefficient of zero. For example, with \(K = 2\) (placebo, low dose, and high dose), \(c_0 = -2, c_1 = 0, c_2 = 2\), so the middle group does not contribute to the numerator. This is not ideal, and there are better trend tests than JT and T for continuous data.
To use the actual dose values (denoted as \(d_0, d_1, \dots , d_K\)) in the parametric test, set \(c_k = d_k - \text{mean}(d_0, d_1, \dots , d_K), k = 0, 1, \dots , K\).
The T trend statistic can be constructed by using the CONTRAST statement in SAS PROC GLM.
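As a cross-check of the formulas above, T can also be computed by hand; the sketch below (Python rather than SAS PROC GLM) uses the coefficients \(c_k = 2k - K\) and hypothetical data. Note that T is unchanged if the coefficients are rescaled, so these are equivalent to the -1.5, -0.5, 0.5, 1.5 used in the CONTRAST statement of the SAS example.

```python
import math

# Parametric trend statistic T with coefficients c_k = 2k - K,
# applied to small hypothetical samples.

def trend_statistic(groups):
    """groups: list of samples ordered from lowest to highest dose."""
    K = len(groups) - 1
    c = [2 * k - K for k in range(K + 1)]
    means = [sum(g) / len(g) for g in groups]
    # pooled variance estimate with d degrees of freedom
    d = sum(len(g) - 1 for g in groups)
    s2 = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / d
    numer = sum(ck * m for ck, m in zip(c, means))
    denom = math.sqrt(s2 * sum(ck ** 2 / len(g) for ck, g in zip(c, groups)))
    return numer / denom

# strongly increasing group means give a large positive T
t_stat = trend_statistic([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
```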
The JT trend test works well for binary and ordinal data, as well as being available for continuous data.
Another trend test for binary data is the Cochran-Armitage (CA) trend test. The difference between the JT and CA trend tests is that for the latter, the actual dose levels can be specified. In other words, instead of designating the dose levels as low, mid, or high, the actual numerical dose levels can be used in the CA trend test, such as 20 mg, 60 mg, and 180 mg.
The CA trend test, however, can yield unusual results if there is unequal spacing among the dose levels. If the dose levels are equally spaced and the sample sizes are equal (\(n_0 = n_1 = \dots = n_K\)), then the JT and CA trend tests yield exactly the same results. Both the spacing of the dose levels and the sample sizes should be considered when choosing the best test for the data.
SAS® Example
Constructing trend tests
This SAS example illustrates how to construct trend tests.
***********************************************************************
* This is a program that illustrates the use of PROC FREQ and PROC *
* GLM in SAS for performing trend tests. *
***********************************************************************;
proc format;
value groupfmt 0='Placebo' 1='20 mg' 2='60 mg' 3='180 mg';
value reactfmt 0='F' 1='S';
run;
data contin;
input group subject response;
cards;
0 01 27
0 02 28
0 03 27
0 04 31
0 05 34
0 06 32
1 01 31
1 02 35
1 03 34
1 04 32
1 05 31
1 06 33
2 01 32
2 02 33
2 03 30
2 04 34
2 05 37
2 06 36
3 01 40
3 02 39
3 03 41
3 04 38
3 05 42
3 06 43
;
run;
proc glm data=contin;
class group;
model response=group;
contrast 'Trend Test' group -1.5 -0.5 0.5 1.5;
title "Parametric Trend Test for Continuous Data";
run;
proc freq data=contin;
tables group*response/jt;
title "Jonckheere-Terpstra Trend Test for Continuous Data";
run;
data binary;
set contin;
if group=0 then dose=0;
if group=1 then dose=20;
if group=2 then dose=60;
if group=3 then dose=180;
if response<32 then react=0;
if response>=32 then react=1;
format react reactfmt.;
run;
proc freq data=binary;
tables react*group/jt trend;
exact jt trend;
title "Jonckheere-Terpstra and Cochran-Armitage Trend Tests for Binary Data";
title2 "Ordinal Scores";
run;
proc freq data=binary;
tables react*dose/jt trend;
exact jt trend;
title "Jonckheere-Terpstra and Cochran-Armitage Trend Tests for Binary Data";
title2 "Dose Scores";
run;
11.5 - Safety and Efficacy (Phase II) Studies: Survival Analysis
In many clinical trials involving serious diseases, such as cancer and AIDS, a primary objective is to evaluate the survival experience of the cohort. In clinical trials not involving serious diseases, survival may not be an outcome, but other time-to-event outcomes may be important. Examples include time to hospital discharge, time to disease relapse, time to getting another migraine, time to progression of disease, etc.
The Kaplan-Meier survival curve is a nonparametric technique for estimating the probability of survival, even in the presence of censoring (e.g. study is completed before the patient experiences the event), at any point in time. This statistical approach is nonparametric because it does not assume any particular distribution for the data, such as lognormal, exponential, or Weibull. It is a "robust" procedure because it is not adversely affected by one or more unusual data points.
In order to construct the Kaplan-Meier survival curve, the actual failure times need to be ordered from smallest to largest. In a sample of n patients, denote these failure times as \(t_1, t_2, \dots , t_K\). For convenience, let \(t_0 = 0\) denote the start time and let \(t_{K+1} = \infty\).
At the \(k^{th}\) failure time, \(t_k\), the number of failures, \(d_k\), is noted, as well as the number of patients who were at risk for failure immediately prior to \(t_k\), denoted \(n_k\). Notice that patients who are lost to follow-up (censored) prior to time \(t_k\) are not included in \(n_k\).
The algebraic formula for the Kaplan-Meier survival probability at time t is:
\(\hat{S}(t)=1, \quad t_0 \le t < t_1\)
\(\hat{S}(t)= \prod_{k'=1}^{k}\left( 1-\frac{d_{k'}}{n_{k'}} \right), \quad t_k \le t < t_{k+1}, \quad k=1, 2, \dots , K \)
The calculation of \(\hat{S}(t)\) uses conditional probability: at each failure time, the probability of surviving beyond that time is computed given survival up to that time, and these conditional probabilities are multiplied together. The Kaplan-Meier curve depicts \(\hat{S}(t)\), the estimated probability of surviving beyond time t.
An example with an initial sample of n = 100 patients is as follows:
k | \(t_k\) (days) | \(d_k\) | \(n_k\) | \(\hat{S}(t_k)\) | probability |
1 | 127 | 1 | 98 | 0.99 = (1 - 1/98) | probability of surviving beyond day 127 |
2 | 154 | 2 | 91 | 0.97 = (1 - 1/98)(1 - 2/91) | probability of surviving beyond day 154 |
3 | 195 | 1 | 84 | 0.96 = (1 - 1/98)(1 - 2/91)(1 - 1/84) | probability of surviving beyond day 195 |
4 | 221 | 3 | 75 | 0.92 = (1 - 1/98)(1 - 2/91)(1 - 1/84)(1 - 3/75) | probability of surviving beyond day 221 |
Note that the probability estimate does not change until a failure event occurs. Also, censored values do not affect the numerator, but do affect the denominator. Thus, the Kaplan-Meier survival curve gives the appearance of a step function when graphed.
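The worked table above can be reproduced with a few lines of code; a minimal sketch in Python (rather than SAS):

```python
# Kaplan-Meier estimate reproducing the worked table above: each event
# is (t_k, d_k, n_k) = (failure time, failures, number at risk).

def kaplan_meier(events):
    """Return [(t_k, S_hat(t_k))] by multiplying conditional survival terms."""
    s, curve = 1.0, []
    for t, d, n in events:
        s *= 1 - d / n          # conditional probability of surviving past t
        curve.append((t, s))
    return curve

curve = kaplan_meier([(127, 1, 98), (154, 2, 91), (195, 1, 84), (221, 3, 75)])
```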
A graphical display of the Kaplan-Meier survival curve is as follows:
Each step down represents the occurrence of an event.
11.6 - Comparative Treatment Efficacy (Phase III) Trials
For comparative treatment efficacy (CTE) trials, the primary endpoints often are measured on a continuous scale. The sample mean (sample standard deviation), the sample median (sample interquartile range), or the sample geometric mean (sample coefficient of variation) serve as reasonable descriptive statistics in such circumstances.
The sample mean (sample standard deviation) is suitable if the data are normally distributed or symmetric without heavy tails. The sample median (sample interquartile range) is suitable for symmetric or asymmetric data. The sample geometric mean (sample coefficient of variation) is suitable when the data are log-normally distributed.
Usually, two-sample t-tests or Wilcoxon rank tests are applied to compare the two randomized groups. In some instances, baseline measurements (prior to randomized treatment assignment) of the primary endpoints are taken.
Suppose \(Y_{i1}\) and \(Y_{i2}\) denote the baseline and final measurements of the endpoint, respectively, for the \(i^{th}\) subject, \(i = 1, 2, \dots, n\). Instead of statistically analyzing the \(Y_{i2}\) values alone, there could be an increase in precision from analyzing the change (or gain) in response, namely, \(Y_i = Y_{i2} - Y_{i1}\).
Suppose that the variance for each \(Y_{i1}\) and \(Y_{i2}\) is \(\sigma^2\) and that the correlation between \(Y_{i1}\) and \(Y_{i2}\) is \(ρ\) (we assume that subjects are independent of each other but that the pair of measurements within each subject are correlated).
This leads to
\(Var(Y_{i2} - Y_{i1}) = Var(Y_{i2}) + Var(Y_{i1}) - 2Cov(Y_{i2},Y_{i1}) = 2\sigma^2(1 - \rho)\)
Therefore,
\(Var(Y_{i2}) = \sigma^2 \text{ and } Var(Y_{i2} - Y_{i1}) = 2\sigma^2(1 - \rho)\)
If \(\rho > \dfrac{1}{2}\), which often is the case for repeated measurements within patients, then \(Var(Y_{i2} - Y_{i1}) < Var(Y_{i2})\). Thus, there may be more precision if the \(Y_{i2} - Y_{i1}\) are analyzed instead of the \(Y_{i2}\). In effect, each patient serves as his or her own control: because the interest lies in the change that occurs, the baseline measurement is subtracted from the treatment-period measurement for each patient. A two-sample t test or the Wilcoxon rank sum test can be applied to the change-from-baseline measurements if the CTE trial consists of two randomized groups, such as placebo and an experimental therapy.
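The variance formula makes the precision gain explicit; a small sketch in Python, with hypothetical values of \(\sigma^2\) and \(\rho\):

```python
# Variance of the change score Y2 - Y1 when Var(Y1) = Var(Y2) = sigma2
# and Corr(Y1, Y2) = rho, per the derivation above.  The sigma2 and rho
# values below are hypothetical.

def var_change(sigma2, rho):
    return 2 * sigma2 * (1 - rho)

sigma2 = 4.0
high_corr = var_change(sigma2, 0.8)   # about 1.6: smaller than Var(Y2) = 4
low_corr = var_change(sigma2, 0.3)    # about 5.6: larger than Var(Y2) = 4
```

With \(\rho > 1/2\) the change score is less variable than the final measurement alone; with \(\rho < 1/2\) it is more variable.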
An alternative approach with baseline measurements is analysis of covariance (ANCOVA). In this situation, the baseline measurement, \(Y_{i1}\), serves as a covariate, so that the final measurement for a subject is adjusted by the baseline measurement. A linear model that describes this for a two-armed trial with placebo and experimental treatment groups is as follows. The expected value for the \(i^{th}\) patient, \(i = 1, 2, \dots, n\), is:
\( E(Y_{i2}) = \mu_P + T_i( \mu_E - \mu_P) + Y_{i1}\beta \)
where \(\mu_P\) is the population mean for the placebo group, \(\mu_E\) is the population mean for the experimental treatment group, \(T_i = 0\) if the \(i^{th}\) patient is in the placebo group and 1 if in the experimental treatment group, and \(\beta\) is the slope for the baseline measurement.
The expectations for subjects in the placebo group and experimental treatment group, respectively, can be rewritten as:
\( E(Y_{i2} - Y_{i1}\beta) = \mu_P \text{ and } E(Y_{i2} - Y_{i1}\beta) = \mu_E \)
These expectations are analogous to the expectations for the change-from baseline measurements:
\( E(Y_{i2} - Y_{i1}) = \mu_P \text{ and } E(Y_{i2} - Y_{i1}) = \mu_E\)
The only difference between the two approaches is that in the change-from-baseline analysis, \(\beta\) is set equal to 1.0, whereas in the ANCOVA approach, \(\beta\) is estimated from the data and may differ from 1.0. Thus, the ANCOVA approach is more flexible and can yield slightly more statistical power and efficiency.
11.7 - Comparing Survival Curves
If the primary endpoint in a CTE trial is a time-to-event variable, then it will be of interest to compare the survival curves of the randomized treatment arms. Again, we will focus on a nonparametric approach that corresponds to comparing the Kaplan-Meier survival curves rather than a parametric approach.
The Mantel-Haenszel test can be adapted to compare two groups, say P and E for placebo and experimental treatment. In this situation, the Mantel-Haenszel test is called the logrank test.
The assumptions for the logrank test are that (1) the censoring patterns are the same for the two treatment groups, and (2) the hazard functions for the two treatment groups are proportional.
For each of the K distinct failure times across the two randomized groups at times \(t_1, t_2, \dots , t_K\), a 2 × 2 table is constructed. For failure time \(t_k , k = 1, 2, … , K\), the table is:
Placebo | Exp Treat | |
# events | \(d_{Pk}\) | \(d_{Ek}\) |
# non events | \(n_{Pk} - d_{Pk}\) | \(n_{Ek} - d_{Ek}\) |
The logrank statistic constructs an observed minus expected score, under the assumption that the null hypothesis of equal event rates is true, for each of the K tables and then sums over all tables:
\(O-E=\sum_{k=1}^{K}\left( \frac{n_{Pk}d_{Ek}-n_{Ek}d_{Pk}}{n_{Pk}+n_{Ek}} \right)\)
The variance expression for the O - E score is as follows:
\(V_L=Var(O-E)=\sum_{k=1}^{K}\left( \frac{(d_{Pk}+d_{Ek})(n_{Pk}+n_{Ek}-d_{Pk}-d_{Ek})n_{Pk}n_{Ek}}{(n_{Pk}+n_{Ek}-1)(n_{Pk}+n_{Ek})^2} \right)\)
Then the logrank statistic is:
\(Z_L=(O-E)/\sqrt{V_L}\)
which has an approximate standard normal distribution.
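The logrank computation can be sketched directly from the two formulas above; Python is used rather than SAS, and the single-table test case at the end is hypothetical:

```python
import math

# Logrank O - E score, its variance V_L, and the normal statistic Z_L,
# computed from per-failure-time 2x2 tables as in the formulas above.
# Each tuple is (d_P, n_P, d_E, n_E) at one distinct failure time.

def logrank_z(tables):
    o_minus_e = sum((nP * dE - nE * dP) / (nP + nE)
                    for dP, nP, dE, nE in tables)
    v = sum((dP + dE) * (nP + nE - dP - dE) * nP * nE
            / ((nP + nE - 1) * (nP + nE) ** 2)
            for dP, nP, dE, nE in tables)
    return o_minus_e / math.sqrt(v)

# hypothetical single failure time: 1 placebo death, 2 at risk per group
z = logrank_z([(1, 2, 0, 2)])   # O - E = -0.5, V_L = 0.25, Z_L = -1.0
```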
The generalized Wilcoxon test also is a nonparametric test for comparing survival curves and it is an extension of the Wilcoxon rank-sum test in the presence of censoring. It also requires that the censoring patterns for the two treatment groups be the same, but it does not assume proportional hazards.
The first step in constructing the generalized Wilcoxon statistic is to pool the two samples of survival times (including censored values) and order them from lowest to highest. For the \(i^{th}\) observation in the ordered sample with survival (or censored) time \(t_i\), construct a score, \(U_i\), which represents the number of survival (or censored) times known to be less than \(t_i\) minus the number known to be greater than \(t_i\). The \(U_i\) are summed over the experimental treatment group and a variance is calculated, i.e.,
\(U=\sum_{i=1}^{n_E}U_i \text{ and } V_U = Var(U)=\left( \frac{n_Pn_E}{(n_P+n_E)(n_P+n_E-1)}\right)\sum_{i=1}^{n_P+n_E}U_{i}^{2}\)
such that:
\(Z_U=U/\sqrt{V_U}\)
has an approximate standard normal distribution.
An example of constructing the \(U_i\) scores ("+" reflects censoring):
\(t_i\) | Group | #\( < t_i\) | #\( > t_i\) | \(U_i\) |
6 | Exp Treat | 0 | 7 | -7 |
10 | Placebo | 1 | 6 | -5 |
10+ | Exp Treat | 2 | 0 | 2 |
12 | Exp Treat | 2 | 4 | -2 |
15+ | Exp Treat | 3 | 0 | 3 |
17 | Placebo | 3 | 2 | 1 |
21 | Placebo | 4 | 1 | 3 |
25+ | Placebo | 5 | 0 | 5 |
Then U = (-7) + 2 + (-2) + 3 = -4.
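The scoring rule can be implemented directly; the sketch below (in Python) reproduces the \(U_i\) column and the sum U = -4 for this example, and also evaluates \(V_U\):

```python
# Gehan scores U_i for the pooled-sample example above.  Each observation
# is (time, censored); censored=True means the true time exceeds `time`.

def gehan_scores(obs):
    scores = []
    for t_i, c_i in obs:
        # j is definitely less than i if j is a death strictly earlier,
        # or a death at i's censoring time (i's true time is later)
        less = sum(1 for t_j, c_j in obs
                   if not c_j and (t_j < t_i or (t_j == t_i and c_i)))
        # j is definitely greater than i only when i is a death: any later
        # time, or a censoring at the same time, must exceed t_i
        greater = sum(1 for t_j, c_j in obs
                      if not c_i and (t_j > t_i or (t_j == t_i and c_j)))
        scores.append(less - greater)
    return scores

obs = [(6, False), (10, False), (10, True), (12, False),
       (15, True), (17, False), (21, False), (25, True)]
group = ['E', 'P', 'E', 'E', 'E', 'P', 'P', 'P']

scores = gehan_scores(obs)                              # [-7, -5, 2, -2, 3, 1, 3, 5]
U = sum(s for s, g in zip(scores, group) if g == 'E')   # -4
n_P, n_E = group.count('P'), group.count('E')
n = n_P + n_E
V_U = n_P * n_E * sum(s * s for s in scores) / (n * (n - 1))   # 36.0
```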
SAS® Example
Using PROC LIFETEST in SAS to construct Kaplan-Meier survival curves and test statistics for comparing survival curves
A safety and efficacy study was conducted in 83 patients with malignant mesothelioma, an uncommon lung cancer that is strongly associated with asbestos exposure. Patients underwent one of three types of surgery, namely, biopsy, limited resection, and extrapleural pneumonectomy (EPP). Treatment assignment was nonrandomized and based on the extent of disease at the time of diagnosis, so strong selection bias with respect to the surgical procedure is possible in this example.
***********************************************************************
* This is a program that illustrates the use of PROC LIFETEST in SAS *
* to construct Kaplan-Meier survival curves and test statistics for *
* comparing survival curves. *
* *
* The sample data set is based on the results from an SE trial on 83 *
* patients with malignant mesothelioma, an uncommon lung cancer that *
* is strongly associated with asbestos exposure. Patients underwent *
* one of three types of surgery, namely, biopsy, limited resection, *
* and extrapleural pneumonectomy (EPP). Treatment assignment was *
* based on the extent of disease at the time of diagnosis. *
***********************************************************************;
proc format;
value sexfmt 0='female' 1='male';
value psfmt 0='low' 1='high';
value wtchgfmt 1='no' 2='yes';
value surgfmt 1='biopsy' 2='limited resection' 3='EPP';
value eventfmt 0='no' 1='yes';
run;
data mesoth;
  input age sex ps hist wtchg surg pftime prog stime dead;
  label age='Age'
        sex='Sex'
        ps='Performance Status'
        hist='Histologic Subtype'
        wtchg='Weight Change at DX'
        surg='Surgery Type'
        pftime='Progression-Free Time'
        prog='Progression Event'
        stime='Survival Time'
        dead='Death Event';
  format sex sexfmt.
         ps psfmt.
         wtchg wtchgfmt.
         surg surgfmt.
         prog eventfmt.
         dead eventfmt.;
cards;
60 1 1 136 1 3 394 1 823 1
59 1 0 136 2 3 1338 0 1338 0
51 0 0 130 1 1 184 1 270 1
73 1 1 136 1 3 320 0 320 1
74 1 0 136 2 1 168 0 168 1
39 0 0 136 1 1 36 1 247 1
46 1 1 131 1 3 552 1 694 0
71 1 0 136 1 1 133 1 316 1
69 1 0 136 1 1 175 1 725 0
49 1 0 131 1 1 327 0 327 1
69 1 0 131 1 2 0 0 0 1
72 1 0 131 1 1 676 1 963 0
44 0 0 130 2 2 223 1 265 1
45 1 0 136 2 2 184 1 237 1
57 1 0 132 1 2 145 1 176 1
60 0 1 131 1 1 316 0 316 1
22 1 1 131 1 2 87 1 310 1
46 0 1 131 1 1 135 1 166 1
60 1 0 131 1 3 1 1 28 1
72 1 0 131 1 2 199 1 730 1
65 1 0 131 1 3 39 0 39 1
65 1 1 131 1 2 61 1 116 1
60 1 0 131 1 3 17 0 17 1
64 1 0 131 2 3 799 1 1229 1
61 1 0 131 2 1 61 1 294 1
38 1 0 131 1 1 176 1 322 1
65 1 1 136 1 3 6 0 6 1
73 0 1 131 1 2 292 1 422 1
74 1 0 136 2 2 22 1 22 1
76 1 0 136 1 1 106 1 375 1
57 1 1 131 1 3 248 1 302 1
60 0 0 . 1 1 63 1 365 1
56 1 0 136 1 1 145 1 387 1
62 0 0 136 1 1 104 1 327 1
60 1 0 131 1 1 20 1 247 1
67 0 0 131 1 1 181 1 669 1
64 1 0 131 1 2 89 1 948 1
67 1 1 136 1 1 0 1 400 1
56 0 1 131 1 2 724 1 1074 0
52 1 0 160 2 1 62 1 137 1
56 1 0 131 1 3 93 1 210 1
44 1 0 136 1 3 402 1 648 1
50 0 0 136 2 2 141 1 520 1
63 1 0 . 2 1 156 1 304 1
68 1 1 131 1 2 265 1 349 1
50 1 0 . 2 3 305 1 317 1
41 0 1 131 1 1 181 1 395 1
60 1 0 131 1 1 274 1 503 1
65 1 0 136 2 2 20 1 20 1
47 1 1 131 1 3 411 1 679 0
46 1 1 131 1 2 624 0 624 0
70 1 1 131 1 2 278 1 617 0
58 1 0 136 1 1 20 1 85 1
57 1 1 132 1 3 112 1 139 1
75 1 0 132 2 2 47 1 47 1
66 1 1 136 1 3 294 1 523 1
77 1 0 . 1 1 126 1 157 1
65 0 0 . 2 1 117 1 545 0
46 0 0 131 1 1 63 1 218 1
71 0 1 132 2 1 139 0 139 1
61 1 0 136 1 1 538 1 1170 0
58 1 0 131 1 3 390 1 722 1
49 1 1 136 1 3 1102 0 1102 0
50 1 0 136 1 3 166 1 182 1
73 1 0 136 1 2 58 1 136 1
44 1 0 136 1 1 406 0 406 1
47 0 1 131 1 3 1123 0 1123 0
68 1 0 136 1 1 1009 1 1029 0
66 1 0 132 1 2 37 1 112 1
46 1 1 131 1 1 104 1 764 1
56 1 1 136 1 2 33 1 225 1
68 1 1 136 1 1 20 1 122 1
59 1 0 136 1 2 73 1 165 1
58 0 0 131 1 1 4 0 4 1
66 1 1 132 2 2 205 1 361 1
82 1 0 160 1 1 78 0 78 1
73 1 0 131 1 1 1265 0 1265 1
57 0 0 130 1 2 273 1 318 1
72 1 1 136 2 1 2 1 362 1
69 1 1 . 1 2 1093 0 1093 0
64 0 1 130 1 1 475 0 475 1
65 1 1 130 1 2 292 0 292 1
72 1 1 130 1 2 324 1 499 0
;
run;
proc print data=mesoth;
  title 'Mesothelioma Example';
run;
proc lifetest data=mesoth plots=(survival);
  strata surg;
  time stime*dead(0);
  title2 'Comparison of Surgery Types According to Survival Time';
run;
The primary outcome variable was time to death (survival). SAS PROC LIFETEST constructs the Kaplan-Meier survival curve for each surgery group and compares the survival curves via the logrank test (p = 0.48) and the generalized Wilcoxon test (p = 0.63).
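The product-limit (Kaplan-Meier) estimate that PROC LIFETEST computes can be sketched in a few lines of Python. This is a minimal illustration of the method with made-up data, not the SAS implementation or the mesothelioma data above: at each distinct death time t, the current survival estimate is multiplied by (1 − d_t / n_t), where d_t is the number of deaths at t and n_t is the number of patients still at risk.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates S(t) at each distinct death time.

    times  : observed times (death or censoring)
    events : 1 = death observed, 0 = censored
    Returns a list of (t, S(t)) pairs.
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    s, curve = 1.0, []
    for t in sorted(deaths):
        n_at_risk = sum(1 for u in times if u >= t)  # still under observation at t
        s *= 1 - deaths[t] / n_at_risk               # product-limit step
        curve.append((t, s))
    return curve

# Hypothetical sample: 10 patients, two censored (events flag = 0)
times  = [6, 6, 6, 7, 10, 13, 16, 22, 23, 6]
events = [1, 1, 1, 1, 0,  1,  1,  1,  1,  0]
print(kaplan_meier(times, events))
```

With these data, three of ten patients die at t = 6, so the curve drops first to 0.7, then steps down at each later death time.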
Strength of Evidence
Although p-values are useful for hypothesis tests that are specified a priori, they provide poor summaries of clinical effects. In particular, they do not convey the magnitude of a clinical effect. The size of a p-value depends on the magnitude of the estimated treatment effect and its estimated variability (also a function of sample size). Thus, the p-value partially reflects the size of the trial, which has no biological interpretation. In addition, the p-value can mask the magnitude of the treatment effect, which does have biological importance. P-values only quantify the type I error and do not characterize the biologically important effects in the trial. Thus, p-values should not be used to describe the strength of evidence in a trial. Investigators have to look at the magnitude of the treatment effect.
Confidence intervals are more appropriate for describing the strength of evidence in a clinical trial, although they too are affected by the sample size. Most major journals now require this approach, as a confidence interval is far more informative than a p-value alone.
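As an illustration of reporting an effect size with a confidence interval, a 95% CI for an odds ratio from a 2×2 table can be computed from the log odds ratio and its standard error. The counts below are hypothetical, not data from any trial in this lesson:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI for a 2x2 table [[a, b], [c, d]].

    Uses the normal approximation on the log scale:
    log(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d)
    """
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_hat) - z * se)
    hi = math.exp(math.log(or_hat) + z * se)
    return or_hat, lo, hi

# Hypothetical counts: 20/80 responders on treatment vs 10/90 on control
print(odds_ratio_ci(20, 80, 10, 90))  # OR = 2.25
```

Here the interval runs from just below 1 to about 5.1, which conveys far more about the plausible size of the effect than a borderline p-value would.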
11.8 - Special Methods of Analysis
One of the most difficult statistical tasks is assessing the precision of an estimator, i.e., determining the variance of an estimator can be more difficult than determining the appropriate estimator. In complicated situations, the bootstrap method can be applied to estimate the variance of an estimator. The bootstrap is essentially a resampling plan.
For example, suppose an investigator collects a sample of N observations, denoted as \(Y_1, Y_2, \dots , Y_N\), and wants to estimate the median, \(\eta\), and get an expression for its variance. If the investigator does not want to make any assumptions about the distribution of the sample, then an explicit expression for the variance of the sample median does not exist.
The bootstrap can be used to construct a variance estimate of the sample median.
The bootstrap process consists of constructing B data sets, each with N observations, from the original data set. Each bootstrap sample is constructed by sampling with replacement from the original data set. This means that when constructing a bootstrap sample, N observations are generated one at a time, where each \(Y_i\) has \(\dfrac{1}{N}\) probability of being selected on each draw. Here is an example of the resampling of the original data:
Original sample: 17, 25, 16, 32, 27, 19, 25, 23, 22, 30
Bootstrap sample #1 | 30,22,27,25,25,23,32,27,19,22 |
Bootstrap sample #2 | 25,16,17,23,30,16,22,19,32,30 |
... | ... |
Bootstrap sample #1000 | 19,32,22,16,25,16,30,22,23,17 |
Thus, for \(b = 1, \dots,B\), the bootstrap sample \(Y_{b1}, Y_{b2}, \dots , Y_{bN}\) is constructed and the sample median within the bth bootstrap sample is formed as:
\(\hat{\eta}_b= \text{ median }(Y_{b1},Y_{b2}, ... , Y_{bN})\)
From the B estimates of the median we construct the estimated variance as:
\(S^2=\dfrac{1}{B-1}\sum_{b=1}^{B}(\hat{\eta}_b-\bar{\eta})^2 \text{ where } \bar{\eta}=\frac{1}{B}\sum_{b=1}^{B}\hat{\eta}_b\)
which gives a sense of how the medians vary across the B bootstrap samples. The variance estimate can then be used to construct a Z statistic for hypothesis testing, i.e.,
\(Z=(\hat{\eta}-\eta)/S\)
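The entire procedure can be sketched in Python using the ten observations from the example above, with B = 1000 resamples (the seed is fixed only to make this sketch reproducible):

```python
import random
import statistics

random.seed(42)  # reproducible resampling for this illustration

original = [17, 25, 16, 32, 27, 19, 25, 23, 22, 30]
B = 1000

# Each bootstrap sample draws N observations with replacement,
# so every original value has probability 1/N on every draw.
boot_medians = [
    statistics.median(random.choices(original, k=len(original)))
    for _ in range(B)
]

S2 = statistics.variance(boot_medians)  # sample variance, divides by B - 1
print(round(statistics.mean(boot_medians), 2), round(S2, 2))
```

The sample median of the original data is 24; the B bootstrap medians scatter around it, and their sample variance S² is the bootstrap estimate of the variance of the median.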
Some statisticians were initially leery of this approach, which essentially uses one sample to create many other samples, i.e., "pulling oneself up by one's bootstraps." Over time, however, the bootstrap has been shown to have sound statistical properties. A disadvantage is that the random sampling with replacement can produce slight variations in results from one run to the next; the FDA, for instance, requires definitive results. In short, the bootstrap is a nonparametric approach for estimating the variance of an estimator.
Exploratory or Hypothesis-Generating Analyses
Clinical trial data provide the opportunity for exploratory analyses, which are analyses in addition to those specified by the primary objectives in the protocol. A trial design is usually not well-suited for all of the exploratory analyses that are performed, so the results may not have much validity.
The results from exploratory analyses should not be regarded as confirmatory, but rather as hypothesis-generating for future research. As a general rule, the same data should not be used both to generate a new hypothesis and to test that hypothesis. Unfortunately, many investigators do not follow this principle.
Data analyzed in sufficient quantity and detail can almost always be made to yield some apparent effect. A few statistical sayings attest to this, such as "the data will confess to anything if tortured enough." It has been well documented that increasing the number of hypothesis tests inflates the Type I error rate. Exploratory analyses typically fall into this category, and the chance of finding statistically significant results when none truly exist can be very high.
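The inflation is easy to quantify: with k independent tests, each at level α = 0.05, the probability of at least one false positive is 1 − (1 − α)^k. A quick sketch:

```python
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k  # P(at least one false positive among k tests)
    print(f"{k:2d} tests: {p_any:.3f}")
```

With 10 tests the familywise error rate is already about 0.40, and with 20 tests it exceeds 0.64, even though no true effect exists.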
Subset analyses are a form of exploratory analysis that are very popular with clinical trial data. For example, after performing the primary statistical analyses, the investigators might decide to compare treatment groups within certain subsets, such as male subjects, female subjects, minority subjects, subjects over the age of 50, subjects with serum cholesterol above 220, etc. Unless it is planned ahead of time, such analyses should remain exploratory.
11.9 - Summary
In this lesson, among other things, we learned how to:
- State the objectives of a pharmacokinetic model.
- Use a SAS program to calculate a confidence interval for an odds ratio.
- Use a SAS program to perform a Mantel-Haenszel analysis to estimate an odds ratio adjusted for strata effects.
- Recognize when odds-ratios or relative risks differ significantly between groups.
- Modify a SAS program to perform Jonckheere-Terpstra (JT) and Cochran-Armitage tests for trend.
- Interpret a Kaplan-Meier survival curve.
- Interpret SAS output comparing survival curves.
- Describe the process of bootstrapping to estimate the variability of an estimator.