Lesson 4: Bias and Random Error

Overview

Error is defined as the difference between the true value of a measurement and the recorded value of a measurement. There are many sources of error in collecting clinical data. Error can be described as random or systematic.

Random error is also known as variability, random variation, or ‘noise in the system’. The heterogeneity in the human population leads to relatively large random variation in clinical trials.

Systematic error or bias refers to deviations that are not due to chance alone. The simplest example occurs with a measuring device that is improperly calibrated so that it consistently overestimates (or underestimates) the measurements by X units.

Random error has no preferred direction, so we expect that averaging over a large number of observations will yield a net effect of zero. The estimate may be imprecise, but not inaccurate. The impact of random error, imprecision, can be minimized with large sample sizes.

Bias, on the other hand, has a net direction and magnitude so that averaging over a large number of observations does not eliminate its effect. In fact, bias can be large enough to invalidate any conclusions. Increasing the sample size is not going to help. In human studies, bias can be subtle and difficult to detect. Even the suspicion of bias can lead to the judgment that a study is invalid. Thus, the design of clinical trials focuses on removing known biases.
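The contrast between the two kinds of error can be sketched with a small simulation. This is only an illustration: the true value of 100, the noise standard deviation of 10, and the bias of 5 units are hypothetical numbers chosen for the example, and `simulate_mean` is a helper name introduced here.

```python
import random
import statistics

def simulate_mean(n, true_value=100.0, noise_sd=10.0, bias=0.0, seed=0):
    """Average n simulated readings of true_value with random noise and a fixed bias.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    readings = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)

for n in (10, 1_000, 100_000):
    unbiased = simulate_mean(n)
    biased = simulate_mean(n, bias=5.0)
    # The error of the unbiased average shrinks toward 0 as n grows;
    # the error of the biased average settles near the bias (5 units) instead.
    print(f"n={n:>6}: error without bias = {unbiased - 100.0:+.3f}, "
          f"error with bias = {biased - 100.0:+.3f}")
```

In this sketch, the error of the unbiased average shrinks as n grows, while the biased average stays roughly 5 units off no matter how many readings are averaged.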

Random error corresponds to imprecision, and bias to inaccuracy. Here is a diagram that will attempt to differentiate between imprecision and inaccuracy.

[Figure: 'bulls eye' graphic illustrating imprecision versus inaccuracy]

See the difference between these two terms? OK, let's explore these further!

Objectives

Upon completion of this lesson, you should be able to:

Distinguish between random error and bias in collecting clinical data.
State how the significance level and power of a statistical test are related to random error.
Accurately interpret a confidence interval for a parameter.

4.1 - Random Error

Random error (variability, imprecision) can be overcome by increasing the sample size. This is illustrated in this section via hypothesis testing and confidence intervals, two accepted forms of statistical inference.

Review of Hypothesis Testing

In hypothesis testing, a null hypothesis and an alternative hypothesis are formed. Typically, the null hypothesis reflects the lack of an effect and the alternative hypothesis reflects the presence of an effect (supporting the research hypothesis). The investigator needs to have sufficient evidence, based on data collected in a study, to reject the null hypothesis in favor of the alternative hypothesis.

Suppose an investigator is conducting a two-armed clinical trial in which subjects are randomized to group A or group B, and the outcome of interest is the change in serum cholesterol after 8 weeks. Because the outcome is measured on a continuous scale, the hypotheses are stated as:

\(H_0\colon \mu_A = \mu_B \) versus \(H_1\colon \mu_A \ne \mu_B\)

where \(\mu_{A} \text{ and } \mu_{B}\) represent the population means for groups A and B, respectively.

The alternative hypothesis of \(H_1\colon \mu_{A} \ne \mu_{B}\) is labeled a “two-sided alternative” because it does not indicate whether A is better than B or vice versa. Rather, it just indicates that A and B are different. A “one-sided alternative” of \(H_1\colon \mu_{A}< \mu_{B}\) (or \(H_1\colon \mu_{A} > \mu_{B}\)) is possible, but it is more conservative to use the two-sided alternative.

The investigator conducts a study to test his hypothesis with 40 subjects in each of group A and group B \(\left(n_{A} = 40 \text{ and } n_{B} = 40\right)\). The investigator estimates the population means via the sample means (labeled \(\bar{x}_A\) and \(\bar{x}_B\), respectively). Suppose the average changes that we observed are \(\bar{x}_A = 7.3\) and \(\bar{x}_B = 4.8 \text{ mg/dl}\). Do these data provide enough evidence to reject the null hypothesis that the mean changes in the two populations are equal? (The question cannot be answered yet. We do not know if this is a statistically significant difference!)

If the data approximately follow a normal distribution or are from large enough samples, then a two-sample t test is appropriate for comparing groups A and B where:

\(t = (\bar{x}_A - \bar{x}_B) / (\text{standard error of } \bar{x}_A - \bar{x}_B)\).

We can think of the two-sample t test as representing a signal-to-noise ratio and ask whether the signal is large enough relative to the noise. In the example, \(\bar{x}_A = 7.3\) and \(\bar{x}_B = 4.8 \text{ mg/dl}\). If the standard error of \(\bar{x}_A - \bar{x}_B\) is 1.2 mg/dl, then:

\( t_{obs} = (7.3 - 4.8) / 1.2 \approx 2.1\)

But what does this value mean?

Each t value has associated probabilities. In this case, we want to know the probability of observing a t value as extreme or more extreme than the t value actually observed, if the null hypothesis is true. This is the p-value. At the completion of the study, a statistical test is performed and its corresponding p-value calculated. If the p-value \(< \alpha\), then \(H_0\) is rejected in favor of \(H_1\).

Two types of errors can be made in testing hypotheses: rejecting the null hypothesis when it is true (a Type I error) or failing to reject the null hypothesis when it is false (a Type II error). The probability of making a Type I error, represented by \(\alpha\) (the significance level), is determined by the investigator prior to the onset of the study. Typically, \(\alpha\) is set at a low value, say 0.01 or 0.05.

Here is a table that presents these options.

Decision	Reality: \(H_0\) is true	Reality: \(H_0\) is false
Reject \(H_0\) (conclude \(H_1\))	Type I error	Correct decision
Fail to reject \(H_0\)	Correct decision	Type II error

In our example, the p-value = \(P\left(|t| > 2.1\right) = 0.04\)

Thus, the null hypothesis of equal mean change in the two populations is rejected at the 0.05 significance level. We conclude that the treatments differ in the mean change in serum cholesterol at 8 weeks.
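The test in this example can be retraced in a few lines. This is a rough sketch only: it replaces the t distribution with a standard normal approximation (an assumption made here to stay within the Python standard library), so the p-value it prints is close to, but not exactly, the 0.04 quoted above.

```python
from statistics import NormalDist

mean_a, mean_b = 7.3, 4.8   # observed mean changes (mg/dl), from the example
std_error = 1.2             # standard error of the difference (mg/dl), from the example
alpha = 0.05                # significance level chosen before the study

# Signal-to-noise ratio: observed difference divided by its standard error
t_obs = (mean_a - mean_b) / std_error

# ASSUMPTION: a normal approximation stands in for the t distribution here
p_value = 2 * (1 - NormalDist().cdf(abs(t_obs)))

print(f"t_obs = {t_obs:.2f}")
print(f"approximate two-sided p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With 78 degrees of freedom the t and normal distributions are close, which is why the approximation lands near the reported value.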

Note that \(\beta\) (the probability of not rejecting \(H_0\) when it is false) did not play a role in the test of hypothesis.

The importance of \(\beta\) came into play during the design phase when the investigator attempted to determine the appropriate sample size for the study. To do so, the investigator had to decide on the effect size of interest, i.e., a clinically meaningful difference between groups A and B in the average change in cholesterol at 8 weeks. The statistician cannot determine this but can help the researcher decide whether he has the resources to have a reasonable chance of observing the desired effect or should rethink his proposed study design.

The effect size is expressed as: \(\delta = \mu_{A} - \mu_{B}\).

The sample size should be determined such that there exists good statistical power (\(1 - \beta = 0.9\) or \(0.8\), i.e., \(\beta = 0.1\) or \(0.2\)) for detecting this effect size with a test of hypothesis that has significance level \(\alpha\).

A sample size formula that can be used for a two-sided, two-sample test with \(\alpha = 0.05\) and \(\beta = 0.1\) (90% statistical power) is:

\(n_A = n_B = 21\sigma^{2}/\delta^{2}\)

where σ = the population standard deviation (more detailed information will be discussed in a later lesson).

Note that the sample size increases as σ increases (noise increases).

Note that the sample size increases as \(\delta\) decreases (effect size decreases).

In the serum cholesterol example, the investigator had selected a meaningful difference, \(\delta = 3.0 \text{ mg/dl}\) and located a similar study in the literature that reported \(\sigma = 4.0 \text{ mg/dl}\). Then:

\(n_A = n_B = 21\sigma^{2}/\delta^{2} = (21 \times 16) / 9 \approx 37.3 \)

Rounding up gives 38 subjects per group. The investigator therefore randomized 40 subjects to each of group A and group B to assure 90% power for detecting an effect size that would have clinical relevance.
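The sample size calculation can be sketched as follows. The lesson gives only the constant 21; the code assumes it comes from the usual normal-approximation formula \(2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2\), which evaluates to about 21 for \(\alpha = 0.05\) and 90% power. The function name `per_group_sample_size` is introduced here for illustration.

```python
import math
from statistics import NormalDist

def per_group_sample_size(sigma, delta, alpha=0.05, power=0.90):
    """Return (multiplier, n per group) for a two-sided, two-sample comparison.

    ASSUMPTION: uses the normal-approximation formula
    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 1.28 for 90% power
    multiplier = 2 * (z_alpha + z_beta) ** 2
    # Round up, since a fraction of a subject cannot be enrolled
    n = math.ceil(multiplier * sigma**2 / delta**2)
    return multiplier, n

multiplier, n = per_group_sample_size(sigma=4.0, delta=3.0)
print(f"multiplier = {multiplier:.2f}, subjects per group = {n}")
```

Changing `power` to 0.80 in this sketch gives a smaller multiplier, which shows how the required sample size depends on the chosen \(\beta\).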

Many studies suffer from low statistical power (large Type II error) because the investigators do not perform sample size calculations.
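To see how quickly power erodes with small samples, here is a minimal sketch that approximates power for several group sizes, using the values from the serum cholesterol design (\(\sigma = 4.0\), \(\delta = 3.0\) mg/dl). It assumes a normal approximation with known \(\sigma\) and equal group sizes; `approximate_power` is a name introduced for this example.

```python
import math
from statistics import NormalDist

def approximate_power(n_per_group, sigma, delta, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test with equal group sizes.

    ASSUMPTION: sigma is known and the test statistic is normally distributed.
    """
    z = NormalDist()
    std_error = sigma * math.sqrt(2 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / std_error   # mean of the test statistic when the true difference is delta
    # Probability that the statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (10, 20, 40, 80):
    power = approximate_power(n, sigma=4.0, delta=3.0)
    print(f"n per group = {n:>2}: approximate power = {power:.2f}")
```

Under these assumptions, a study with 10 subjects per group would miss a true 3.0 mg/dl difference more often than it would detect it.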

If a study has very large sample sizes, then it may yield a statistically significant result without any clinical meaning. Suppose in the serum cholesterol example that \(\bar{x}_A = 7.3\) and \(\bar{x}_B = 7.1 \text{ mg/dl}\), with \(n_{A} = n_{B} = 5,000\). The two-sample t test may yield a p-value = 0.001, but \(\bar{x}_A - \bar{x}_B = 7.3 - 7.1 = 0.2 \text{ mg/dl}\) is not clinically interesting.

Confidence Intervals

A confidence interval provides a plausible range of values for a population measure. Instead of just reporting \(\bar{x}_A - \bar{x}_B\) as the sample estimate of \(\mu_{A} - \mu_{B}\), a range of values can be reported using a confidence interval.

The confidence interval is constructed in a manner such that it provides a high percentage of “confidence” (95% is commonly used) that the true value of \(\mu_{A} - \mu_{B}\) lies within it. More precisely, if the study were repeated many times, about 95% of the intervals constructed in this manner would contain the true value.

If the data approximately follow a bell-shaped normal distribution, then a 95% confidence interval for \(\mu_{A} - \mu_{B}\) is

\((\bar{x}_A - \bar{x}_B) \pm \left \{1.96 \times (\text{standard error of } \bar{x}_A - \bar{x}_B)\right \}\)

In the serum cholesterol example, \( (\bar{x}_A - \bar{x}_B) = 7.3 - 4.8 = 2.5 \text{ mg/dl}\) and the standard error = \(1.2 \text{ mg/dl}\). Thus, the approximate 95% confidence interval is:

\(2.5 \pm (1.96 \times 1.2) \approx \left [ 0.1, 4.9 \right ] \)

Note that the 95% confidence interval does not contain 0, which is consistent with the results of the 0.05-level hypothesis test (p-value = 0.04). 'No difference' is not a plausible value for the difference between the treatments.

Notice also that the length of the confidence interval depends on the standard error. The standard error decreases as the sample size increases, so the confidence interval gets narrower as the sample size increases (hence, greater precision).
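The interval calculation, and the way it narrows with sample size, can be sketched as follows. The first call reproduces the example's numbers. The loop is hypothetical: it assumes the same standard deviations and simply rescales the example's standard error of 1.2 mg/dl (at 40 subjects per group) by \(\sqrt{40/n}\).

```python
import math
from statistics import NormalDist

def confidence_interval(diff, std_error, level=0.95):
    """Normal-approximation confidence interval for a difference in means."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
    half_width = z * std_error
    return diff - half_width, diff + half_width

# Values from the serum cholesterol example
low, high = confidence_interval(diff=2.5, std_error=1.2)
print(f"95% CI: [{low:.1f}, {high:.1f}] mg/dl")

# HYPOTHETICAL: same observed difference and variability, larger groups.
# The standard error of a difference in means scales as 1 / sqrt(n).
for n in (40, 160, 640):
    se_n = 1.2 * math.sqrt(40 / n)
    lo_n, hi_n = confidence_interval(2.5, se_n)
    print(f"n per group = {n:>3}: SE = {se_n:.2f}, interval width = {hi_n - lo_n:.2f}")
```

Each fourfold increase in the group size halves the standard error in this sketch, and therefore halves the width of the interval.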

A confidence interval is actually more informative than testing a hypothesis. Not only does it indicate whether \(H_0\) can be rejected, but it also provides a plausible range of values for the population measure. Many of the major medical journals request the inclusion of confidence intervals within submitted reports and published articles.

4.2 - Clinical Biases

If a bias is small relative to the random error, then we do not expect it to be a large component of the total error. A strong bias can yield a point estimate that is very distant from the true value. Remember the 'bulls eye' graphic? Investigators seldom know the direction and magnitude of bias, so adjustments to the estimators are not possible.

There are many sources of bias in clinical studies:

Selection bias
Procedure selection bias
Post-entry exclusion bias
Bias due to selective loss of data
Assessment bias

1. Selection Bias

Selection bias refers to selecting a sample that is not representative of the population because of the method used to select the sample. Selection bias in the study cohort can diminish the external validity of the study findings. A study with external validity yields results that are useful in the general population. Suppose an investigator decides to recruit only hospital employees in a study to compare asthma medications. This sample might be convenient, but such a cohort is not likely to be representative of the general population. The hospital employees may be more health-conscious and conscientious in taking medications than others. Perhaps they are better at managing their environment to prevent attacks. The convenient sample easily produces bias. How would you estimate the magnitude of this bias? An undisputed estimate is unlikely to be found, and the study will be criticized because of the potential bias.

If the trial is randomized with a control group, however, something may be salvaged. Randomized controls increase the internal validity of a study. Randomization can also provide external validity for treatment group differences. Selection bias should affect all randomized groups equally, so in taking differences between treatment groups, the bias is removed via subtraction. Randomization in the presence of selection bias cannot provide external validity for absolute treatment effects. The graph below illustrates these concepts.

The estimates of the response from the sample are clearly biased below the population values. However, the observed difference between treatment and control is of the same magnitude as that in the population. In other words, it could be that the observed treatment difference accurately reflects the population difference, even though the observations within the control and treatment groups are biased.
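The subtraction argument can be sketched with a small simulation. Everything here is hypothetical: the population means of 50 and 60, the selection shift of 8 units downward, and the noise level are invented for the illustration, and the sketch assumes the selection bias is additive and identical in both randomized groups.

```python
import random
import statistics

rng = random.Random(1)

# HYPOTHETICAL population mean responses for control and treatment
population_control, population_treatment = 50.0, 60.0
# HYPOTHETICAL additive shift caused by recruiting a non-representative sample;
# ASSUMPTION: randomization makes it affect both groups equally
selection_shift = -8.0
n = 5_000

control = [population_control + selection_shift + rng.gauss(0, 5) for _ in range(n)]
treatment = [population_treatment + selection_shift + rng.gauss(0, 5) for _ in range(n)]

mean_c = statistics.fmean(control)
mean_t = statistics.fmean(treatment)
observed_difference = mean_t - mean_c

print(f"control mean:   sample {mean_c:.1f} vs population {population_control:.1f}")
print(f"treatment mean: sample {mean_t:.1f} vs population {population_treatment:.1f}")
print(f"difference:     sample {observed_difference:.1f} vs population "
      f"{population_treatment - population_control:.1f}")
```

Both group means are biased below the population values, yet their difference is close to the population difference. If the selection mechanism changed the size of the treatment effect itself, the cancellation in this sketch would no longer hold.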

2. Procedure Selection Bias

Procedure selection bias, a likely result when patients or investigators decide on treatment assignment, can lead to extremely large biases. The investigator may consciously or subconsciously assign particular treatments to specific types of patients. Randomization is the primary design feature that removes this bias.

3. Post-entry Exclusion Bias

Post-entry exclusion bias can occur when the exclusion criteria for subjects are modified after examination of some or all of the data. Some enrolled subjects may be recategorized as ineligible and removed from the study. In the past, this may have been done for the purposes of manufacturing statistically significant results but would be regarded as an unethical practice now.

4. Bias Due to Selective Loss of Data

Bias due to selective loss of data is related to post-entry exclusion bias. In this case, data from selected subjects are eliminated from the statistical analyses. Protocol violations (including adding on other medications, changing medications or withdrawal from therapy) and other situations may cause an investigator to request an analysis using only the data from those who adhered to the protocol or who completed the study on their assigned therapy.

The latter two types of biases can be extreme. Therefore, statisticians prefer that intention-to-treat analyses be performed as the main statistical analysis.

In an intention-to-treat analysis, all randomized subjects are included in the data analysis, regardless of protocol violations or lack of compliance. Though it may seem unreasonable to include data from a patient who simply refused to take the study medication or violated the protocol in a serious manner, the intention-to-treat analysis usually prevents more bias than it introduces. Once all the patients are randomized to therapy, use all of the data collected. Other analyses may supplement the intention-to-treat analysis, perhaps substantiating that protocol violations did not affect the overall inferences, but the analysis including all subjects randomized should be primary.

5. Assessment Bias

As discussed earlier, clinical studies that rely on patient self-assessment or physician assessment of patient status are susceptible to assessment bias. In some circumstances, such as in measuring pain or symptoms, there are no alternatives, so attempts should be made to be as objective as possible and invoke randomization and blinding. What is a mild cough for one person might be characterized as a moderate cough by another patient. Not knowing whether or not they received the treatment (blinding) when making these subjective evaluations will help to minimize this self-assessment or assessment bias.

Well-designed and well-conducted clinical trials can eliminate or minimize biases.

Key design features that achieve this goal include:

Randomization (minimizes procedure selection bias)
Masking (minimizes assessment bias)
Concurrent controls (minimizes treatment-time confounding and/or adjusts for disease remission/progression; in the graph below, both treatment and control had an increase in response, but the treatment group experienced a greater increase)
Objective assessments (minimizes assessment bias)
Active follow-up and endpoint ascertainment (minimizes assessment bias)
No post hoc exclusions (minimizes post-entry exclusion bias)

4.3 - Statistical Biases

For a point estimator, statistical bias is defined as the difference between the mathematical expectation of the estimator and the parameter to be estimated.

Statistical bias can result from methods of analysis or estimation. For example, if the statistical analysis does not account for important prognostic factors (variables that are known to affect the outcome variable), then it is possible that the estimated treatment effects will be biased. Fortunately, many statistical biases can be corrected, whereas design flaws lead to biases that cannot be corrected.

The simplest example of statistical bias is in the estimation of the variance in the one-sample situation with \(Y_1, \dots , Y_n\) denoting independent and identically distributed random variables and \(\bar{Y}\) denoting their sample mean. Define:

\(s^2=\frac{1}{n-1}\sum_{i=1}^{n}\left ( Y_i -\bar{Y} \right )^2\)

and

\(v^2=\frac{1}{n}\sum_{i=1}^{n}\left ( Y_i -\bar{Y} \right )^2 \)

The statistic \(s^2\) is unbiased because its mathematical expectation is the population variance, \(\sigma^2\). The statistic \(v^2\) is biased because its mathematical expectation is \(\dfrac{\sigma^2 (n-1)}{n}\). The statistic \(v^2\) tends to underestimate the population variance.

Thus, the bias of \(v^2\) is \(\dfrac{\sigma^2(n-1)}{n} -\sigma^2 = - \dfrac{\sigma^2}{n}\). As the sample size, n, gets larger, the bias becomes negligible.
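The bias of \(v^2\) can be checked empirically with a short simulation. This is a sketch under stated assumptions: the data are drawn from a normal distribution with \(\sigma^2 = 4\) and samples of size \(n = 5\), both hypothetical choices. The standard library functions `statistics.variance` and `statistics.pvariance` divide by \(n-1\) and \(n\), respectively, so they play the roles of \(s^2\) and \(v^2\).

```python
import random
import statistics

rng = random.Random(42)
sigma2 = 4.0    # HYPOTHETICAL population variance
n = 5           # small sample size, so the bias is easy to see
reps = 10_000   # number of simulated samples

s2_values, v2_values = [], []
for _ in range(reps):
    sample = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    s2_values.append(statistics.variance(sample))    # s^2: divides by n - 1
    v2_values.append(statistics.pvariance(sample))   # v^2: divides by n

mean_s2 = statistics.fmean(s2_values)
mean_v2 = statistics.fmean(v2_values)

print(f"average s^2 = {mean_s2:.2f} (target sigma^2 = {sigma2})")
print(f"average v^2 = {mean_v2:.2f} (theory: sigma^2 (n-1)/n = {sigma2 * (n - 1) / n})")
```

Averaged over many samples, \(s^2\) lands near \(\sigma^2\) while \(v^2\) lands near \(\sigma^2 (n-1)/n\), which is smaller by about \(\sigma^2/n\).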

4.4 - Summary

In this lesson, among other things, we learned:

to distinguish between random error and bias in collecting clinical data.
to state how the significance level and power of a statistical test are related to random error.
to accurately interpret a confidence interval for a parameter.
