# Lesson 17: Medical Diagnostic Testing

## Overview

A diagnostic test is any approach used to gather clinical information for the purpose of making a clinical decision (i.e., diagnosis). Some examples of diagnostic tests include X-rays, biopsies, pregnancy tests, medical histories, and results from physical examinations.

From a statistical point of view there are two points to keep in mind:

1. the clinical decision-making process is based on probability;
2. the goal of a diagnostic test is to move the estimated probability of disease toward either end of the probability scale (i.e., 0 rules out disease, 1 confirms the disease).

Here is an example taken from Greenberg et al. (2000, Medical Epidemiology, Third Edition). A 54-year-old woman visits her family physician for an annual check-up. The physician observes that:

• she had no illnesses during the preceding year and there is no family history of breast cancer,
• her physical exam is unremarkable, (nothing unusual is apparent),
• her breast exam is normal (no signs of a palpable mass), and
• her pelvic and rectal exams are unremarkable.

Based on the woman's age and medical history, the initial (prior) probability estimate of breast cancer is 0.003. The physician recommends that the woman have a mammogram, due to her age. Unfortunately, the results of the mammogram are abnormal. This yields a modification of the woman's prior probability of breast cancer from 0.003 to 0.13 (notice the Bayesian flavor of this approach - prior probability modified via existing data). Next, the woman is referred to a surgeon who agrees that the physical breast exam is normal. The surgeon consults with a radiologist and they decide that the woman should undergo fine needle aspiration (FNA) of the breast abnormality detected by the mammogram (diagnostic test #2). The FNA specimen reveals abnormal cells, which again revises the probability of breast cancer, from 0.13 to 0.64. Finally, the woman is scheduled for a breast biopsy the following week to get a definitive diagnosis.

Ideally, diagnostic tests always would be correct, non-invasive, and inflict no side effects. If this were the case, a positive test result would unequivocally indicate the presence of disease and a negative result would indicate the absence of disease. Realistically, however, every diagnostic test is fallible.

## Objectives

Upon completion of this lesson, you should be able to:

• calculate and provide confidence intervals for the sensitivity and specificity of a diagnostic test,
• calculate accuracy and predictive values of a diagnostic test,
• state the relationship of prevalence of disease to the sensitivity, specificity and predictive values of a diagnostic test,
• test whether sensitivity or specificity of 2 tests are significantly different, whether the results come from a study in two groups of patients or one group of patients tested with both tests, and
• select an appropriate cut-off for a positive test result, given an ROC curve, for different cost ratios of false positive/false negative results.

# 17.1 - Analysis of Diagnostic Tests

To begin, let's consider a simple test which has only two possible outcomes, namely, positive and negative. When a test is applied to a group of patients, some with the disease and some without the disease, four groups can result, as summarized in the following 2 × 2 table:

| | Disease | No Disease |
|---|---|---|
| Test Positive | a (true positives) | b (false positives) |
| Test Negative | c (false negatives) | d (true negatives) |

a (true-positives) = individuals with the disease, and for whom the test is positive

b (false-positives) = individuals without the disease, but for whom the test is positive

c (false-negatives) = individuals with the disease, but for whom the test is negative

d (true-negatives) = individuals without the disease, and for whom the test is negative

a + c = total number of individuals with the disease

b + d = total number of individuals without disease

The "Gold Standard" is the method used to obtain a definitive diagnosis for a particular disease; it may be biopsy, surgery, autopsy, or an acknowledged standard. Gold standards are used to define true disease status, against which the results of a new diagnostic test are compared. The table below lists gold standards for several target disorders. Some of these are quite invasive, which is a major reason why new diagnostic procedures are being developed.

| Target Disorder | Gold Standard |
|---|---|
| breast cancer | excisional biopsy |
| prostate cancer | transrectal biopsy |
| coronary stenosis | coronary angiography |
| myocardial infarction | catheterization |
| strep throat | throat culture |

# 17.2 - Describing Diagnostic Tests

The following concepts have been developed to describe the performance of a diagnostic test relative to the gold standard; these concepts are measures of the validity of a diagnostic test.

**Sensitivity** is the probability that an individual with the disease of interest has a positive test. It is estimated from the sample as a/(a+c).

**Specificity** is the probability that an individual without the disease of interest has a negative test. It is estimated from the sample as d/(b+d).

**Accuracy** is the probability that the diagnostic test yields the correct determination. It is estimated from the sample as (a+d)/(a+b+c+d).

Tests with high sensitivity are useful clinically to rule out a disease. A negative result for a very sensitive test would virtually exclude the possibility that the individual has the disease of interest. A test with high sensitivity also yields a low proportion of false-negatives. Sensitivity is also referred to as "positive in disease" or "sensitive to disease".

Tests with high specificity are useful clinically to confirm the presence of a disease. A positive result for a very specific test would give strong evidence in favor of diagnosing the disease of interest. A test with high specificity also yields a low proportion of false-positives. Specificity is also referred to as "negative in health" or "specific to health".

Sensitivity and specificity are, in theory, stable for all groups of patients.

In a study comparing FNA to the gold standard (excisional biopsy), 114 women with normal physical examinations (nonpalpable masses) and abnormal mammograms received a FNA followed by surgical excisional biopsy of the same breast (Bibbo M, et al: Stereotaxic fine needle aspiration cytology of clinically occult malignant and premalignant breast lesions. Acta Cytol 1988; 32:193-201.)

| | Cancer | No Cancer |
|---|---|---|
| FNA Positive | 14 | 8 |
| FNA Negative | 1 | 91 |

Sensitivity = 14/15 = 0.93 or 93%
Specificity = 91/99 = 0.92 or 92%
Accuracy = 105/114 = 0.92 or 92%

## SAS® Example

### Using PROC FREQ in SAS for determining an exact confidence interval for sensitivity and specificity

Point estimates for sensitivity and specificity are based on proportions. Therefore, we can compute confidence intervals using binomial theory. See SAS Example (18.1_sensitivity_specifi.sas) below for a SAS program that calculates exact and asymptotic confidence intervals for sensitivity and specificity.

```sas
***********************************************************************
* This is a program that illustrates the use of PROC FREQ in SAS for  *
* determining an exact confidence interval for sensitivity and        *
* specificity.                                                        *
***********************************************************************;

proc format;
  value yesnofmt 1='yes' 2='no';
run;

data sensitivity;
  input positive count;
  format positive yesnofmt.;
  cards;
1 14
2 1
;
run;

proc freq data=sensitivity;
  tables positive/binomial alpha=0.05;
  weight count;
  title "Exact and Asymptotic 95% Confidence Intervals for Sensitivity";
run;

data specificity;
  input negative count;
  format negative yesnofmt.;
  cards;
1 91
2 8
;
run;

proc freq data=specificity;
  tables negative/binomial alpha=0.05;
  weight count;
  title "Exact and Asymptotic 95% Confidence Intervals for Specificity";
run;
```


For the FNA study, only 15 women with cancer, as diagnosed by the gold standard, were studied. The rule for using the asymptotic confidence interval fails for sensitivity because np(1 - p) = 0.9765 < 5 (the rule does hold for specificity).
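The asymptotic (Wald) interval that PROC FREQ reports is just $$\hat{p} \pm z_{0.975}\sqrt{\hat{p}(1-\hat{p})/n}$$, clipped to [0, 1]. A minimal Python sketch (for illustration; the exact Clopper-Pearson limits require beta quantiles and are left to PROC FREQ):

```python
import math

def wald_ci(successes, n, z=1.959964):
    """Asymptotic (Wald) 95% CI for a binomial proportion, clipped to [0, 1]."""
    p = successes / n
    ase = math.sqrt(p * (1 - p) / n)  # asymptotic standard error
    return max(0.0, p - z * ase), min(1.0, p + z * ase)

sens_lo, sens_hi = wald_ci(14, 15)  # sensitivity: 14 of 15 cancers test positive
spec_lo, spec_hi = wald_ci(91, 99)  # specificity: 91 of 99 non-cancers test negative
print(round(sens_lo, 4), round(sens_hi, 4))  # matches PROC FREQ: 0.8071 1.0
print(round(spec_lo, 4), round(spec_hi, 4))  # matches PROC FREQ: 0.8655 0.9729
```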

As the output shows below, the exact 95% confidence intervals for sensitivity and specificity are (0.680, 0.998) and (0.847, 0.965), respectively.

```
Exact and Asymptotic 95% Confidence Intervals for Sensitivity

The FREQ Procedure

positive   Frequency   Percent   Cumulative Frequency   Cumulative Percent
yes        14          93.33     14                     93.33
no         1           6.67      15                     100.00

Binomial Proportion for positive = yes

Proportion              0.9333
ASE                     0.0644
95% Lower Conf Limit    0.8071
95% Upper Conf Limit    1.0000

Exact Conf Limits
95% Lower Conf Limit    0.6805
95% Upper Conf Limit    0.9983

Test of H0: Proportion = 0.5

ASE Under H0            0.1291
Z                       3.3566
One-Sided Pr > Z        0.0004
Two-Sided Pr > |Z|      0.0008

Sample Size = 15
```

```
Exact and Asymptotic 95% Confidence Intervals for Specificity

The FREQ Procedure

negative   Frequency   Percent   Cumulative Frequency   Cumulative Percent
yes        91          91.92     91                     91.92
no         8           8.08      99                     100.00

Binomial Proportion for negative = yes

Proportion              0.9192
ASE                     0.0274
95% Lower Conf Limit    0.8655
95% Upper Conf Limit    0.9729

Exact Conf Limits
95% Lower Conf Limit    0.8470
95% Upper Conf Limit    0.9645

Test of H0: Proportion = 0.5

ASE Under H0            0.0503
Z                       8.3418
One-Sided Pr > Z        <.0001
Two-Sided Pr > |Z|      <.0001

Sample Size = 99
```

# 17.3 - Estimating the Probability of Disease

Sensitivity and specificity describe the accuracy of a test. In a clinical setting, we do not know who has the disease and who does not - that is why diagnostic tests are used. We would like to be able to estimate the probability of disease based on the outcome of one or more diagnostic tests. The following measures address this idea.

Prevalence is the probability of having the disease, also called the prior probability of having the disease. It is estimated from the sample as $$\dfrac{\left(a+c\right)}{\left(a+b+c+d\right)}$$.

Positive Predictive Value (PV+) is the probability of disease in an individual with a positive test result. It is estimated as $$\dfrac{a}{\left(a+b\right)}$$.

Negative Predictive Value (PV-) is the probability of not having the disease when the test result is negative. It is estimated as $$\dfrac{d}{\left(c+d\right)}$$.

In the FNA study of 114 women with nonpalpable masses and abnormal mammograms,

$$prevalence = \dfrac{15}{114} = 0.13$$

$$PV+ = \dfrac{14}{\left(14+8\right)} = 0.64$$

$$PV - = \dfrac{91}{\left(1+91\right)} = 0.99$$

Thus, a woman's prior probability of having the disease is 0.13 and is modified to 0.64 if she has a positive test result. A woman's prior probability of not having the disease is 0.87 and is modified to 0.99 if she has a negative test result.
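The prevalence and predictive values above come straight from the table margins; a minimal Python sketch (for illustration):

```python
def predictive_values(a, b, c, d):
    """Prevalence and predictive values from a 2x2 table; valid only when
    the sample is representative (not a case-control sample)."""
    prevalence = (a + c) / (a + b + c + d)
    pv_pos = a / (a + b)  # P(disease | positive test)
    pv_neg = d / (c + d)  # P(no disease | negative test)
    return prevalence, pv_pos, pv_neg

prev, pv_pos, pv_neg = predictive_values(14, 8, 1, 91)  # FNA study counts
print(round(prev, 2), round(pv_pos, 2), round(pv_neg, 2))  # 0.13 0.64 0.99
```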

If the disease under study is rare, the investigator may decide to invoke a case-control design for evaluating the diagnostic test, e.g., recruit 50 patients with the disease and 50 controls. Obviously, prevalence cannot be estimated from a case-control study because it does not represent a random sample from the general population.

Predictive values allow us to determine the usefulness of a test and they vary with the sensitivity and specificity of a test. If all other characteristics held constant, then:

1. as sensitivity of a test increases, PV - increases and
2. as specificity of a test increases, PV+ increases.

Predictive values vary with the prevalence of the disease in the population being tested or the pre-test probability of disease in a given individual.

Sensitivity, specificity, and prevalence can be used in a clinical setting to estimate post-test probabilities (predictive values), even though physicians work with one patient at a time, not entire populations of patients. Three pieces of information are necessary prior to performing the test, namely, (1) either the prevalence of the disease or the prior probability of disease, (2) sensitivity, and (3) specificity.

Then, formulae for PV+ and PV- are:

$$PV+ = \dfrac{\text{Prevalence}\times\text{Sensitivity}}{(\text{Prevalence}\times\text{Sensitivity})+\left\{(1-\text{Prevalence})\times (1-\text{Specificity}) \right\}}$$

$$PV- = \dfrac{(1-\text{Prevalence})\times\text{Specificity}}{\left\{(1-\text{Prevalence})\times\text{Specificity}\right\}+\left\{\text{Prevalence}\times (1-\text{Sensitivity}) \right\}}$$

Although PV+ = 14/(14+8) = 0.64 and PV - = 91/(1+91) = 0.99 can be calculated directly from the 2 × 2 data table because the women constituted a random sample, the above formulae yield the same results:

$$PV+ = \dfrac{(0.13)(0.93)}{{(0.13)(0.93) + (0.87)(0.08)}} = 0.64$$

$$PV- = \dfrac{(0.87)(0.92)}{{(0.87)(0.92) + (0.13)(0.07)}} = 0.99$$

The following example is taken from Sackett et al (1985, Clinical Epidemiology ). Suppose a patient with the following characteristics visits a physician:

• 45-year-old man
• ambulatory with episodic chest pain
• no coronary risk factors except smoking one pack of cigarettes per day
• 3-week history of substernal and precordial pain - stabbing and fleeting
• physical exam shows a single costochondral junction that is slightly tender, but does not reproduce the patient's pain

From this information, the physician estimates an intermediate pre-test (prior) probability of 60% that this patient has significant coronary artery narrowing.

The physician is not sure whether the patient should undergo an exercise electrocardiogram (ECG). How useful would this test be for this patient?

Suppose it is known from the literature that the sensitivity and specificity of the exercise ECG in coronary artery stenosis (as compared to the gold standard of coronary arteriography) are 60% and 91%, respectively.

Then:

$$PV+ = \dfrac{(0.6)(0.6)}{{(0.6)(0.6) + (0.4)(0.09)}} = 0.91$$

$$PV - = \dfrac{(0.4)(0.91)}{{(0.4)(0.91) + (0.6)(0.4)}} = 0.60$$
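Both worked examples can be checked against the PV+ and PV- formulae with a short Python function (for illustration; the FNA inputs are passed as exact fractions to avoid rounding error):

```python
def pv_from_bayes(prevalence, sensitivity, specificity):
    """Post-test probabilities from prevalence, sensitivity, and
    specificity, via the two formulae above (Bayes' theorem)."""
    pv_pos = (prevalence * sensitivity) / (
        prevalence * sensitivity + (1 - prevalence) * (1 - specificity))
    pv_neg = ((1 - prevalence) * specificity) / (
        (1 - prevalence) * specificity + prevalence * (1 - sensitivity))
    return pv_pos, pv_neg

# FNA study: prevalence 15/114, sensitivity 14/15, specificity 91/99
fna = pv_from_bayes(15 / 114, 14 / 15, 91 / 99)
print([round(x, 2) for x in fna])  # [0.64, 0.99]

# exercise ECG example
ecg = pv_from_bayes(0.60, 0.60, 0.91)
print([round(x, 2) for x in ecg])  # [0.91, 0.6]
```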

An additional test characteristic reported in the medical literature is the likelihood ratio, which is the probability of a particular test result (+ or -) in patients with the disease divided by the probability of the result in patients without the disease. There exists one likelihood ratio for a positive test (LR+) and one for a negative test (LR-). Likelihood ratios express how many times more (or less) likely the test result is found in diseased versus non-diseased individuals:

$$LR+ = \dfrac{\text{Sensitivity}}{\left(1 - \text{Specificity}\right)}$$

$$LR - = \dfrac{\left(1 - \text{Sensitivity}\right)}{\text{Specificity}}$$

From the FNA study in 114 women with nonpalpable masses and abnormal mammograms, LR+ = 0.933/0.081 = 11.52 and LR - = 0.067/0.919 = 0.07. Thus, positive FNA results are 11.52 times more likely in women with cancer as compared to those without, and negative FNA results are .07 times as likely in women with cancer as compared to those without.
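The likelihood ratios are a one-line computation; a Python sketch (for illustration; using the unrounded sample fractions gives LR+ = 11.55, whereas the text's 11.52 comes from rounding the inputs to three decimals first):

```python
def likelihood_ratios(sensitivity, specificity):
    """LR+ and LR- as defined above."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity

# FNA study, unrounded sample fractions
lr_pos, lr_neg = likelihood_ratios(14 / 15, 91 / 99)
print(round(lr_pos, 2), round(lr_neg, 2))  # 11.55 0.07
```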

# 17.4 - Comparing Two Diagnostic Tests

Suppose that we want to compare sensitivity and specificity for two diagnostic tests. Let $$p_1$$ denote the test characteristic for diagnostic test #1 and let $$p_2$$ = test characteristic for diagnostic test #2.

The appropriate statistical test depends on the setting. If diagnostic tests were studied on two independent groups of patients, then two-sample tests for binomial proportions are appropriate (chi-square, Fisher's exact test). If both diagnostic tests were performed on each patient, then paired data result and methods that account for the correlated binary outcomes are necessary (McNemar's test).

Suppose two different diagnostic tests are performed in two independent samples of individuals using the same gold standard. The following 2 × 2 tables result:

**Diagnostic Test #1**

| | Disease | No Disease |
|---|---|---|
| Positive | 82 | 30 |
| Negative | 18 | 70 |

**Diagnostic Test #2**

| | Disease | No Disease |
|---|---|---|
| Positive | 140 | 10 |
| Negative | 60 | 90 |

Suppose that sensitivity is the statistic of interest. The estimates of sensitivity are $$p_1 = \dfrac{82}{100} = 0.82$$ and $$p_2 = \dfrac{140}{200} = 0.70$$ for diagnostic test #1 and diagnostic test #2, respectively. The following SAS program will provide confidence intervals for the sensitivity for each test as well as comparison of the tests with regard to sensitivity.

## SAS® Example

### Using PROC FREQ in SAS for comparing two diagnostic tests based on data from two samples

18.2_comparing_diagnostic.sas

```sas
***********************************************************************
* This is a program that illustrates the use of PROC FREQ in SAS for  *
* comparing two diagnostic tests based on data from two samples.      *
***********************************************************************;

proc format;
  value yesnofmt 1='yes' 2='no';
run;

data sensitivity_diag1;
  input positive count;
  format positive yesnofmt.;
  cards;
1 82
2 18
;
run;

proc freq data=sensitivity_diag1;
  tables positive/binomial alpha=0.05;
  weight count;
  title "Exact and Asymptotic 95% Confidence Intervals for Sensitivity with Diagnostic Test #1";
run;

data sensitivity_diag2;
  input positive count;
  format positive yesnofmt.;
  cards;
1 140
2  60
;
run;

proc freq data=sensitivity_diag2;
  tables positive/binomial alpha=0.05;
  weight count;
  title "Exact and Asymptotic 95% Confidence Intervals for Sensitivity with Diagnostic Test #2";
run;

data comparison;
  input test positive count;
  format positive yesnofmt.;
  cards;
1 1  82
1 2  18
2 1 140
2 2  60
;
run;

proc freq data=comparison;
  tables positive*test/chisq;
  exact chisq;
  weight count;
  title "Exact and Asymptotic Tests for Comparing Sensitivities";
run;
```


Run the program and look at the output. Do you see the exact 95% confidence intervals for the two diagnostic tests as (0.73, 0.89) and (0.63, 0.76), respectively?

The SAS program also indicates that the p-value = 0.0262 from Fisher's exact test for testing $$H_0 \colon p_1 = p_2$$.

Thus, diagnostic test #1 has a significantly better sensitivity than diagnostic test #2.
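The two-group comparison can also be cross-checked without SAS. Here is a pure-Python sketch of the Pearson chi-square test for a 2 × 2 table (for 1 degree of freedom the p-value is erfc(√(X²/2)); the result agrees closely with Fisher's exact p = 0.0262):

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for a 2x2 table
    [[a, b], [c, d]]; returns (statistic, two-sided p-value) with 1 df."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(stat / 2))  # chi-square(1 df) survival function
    return stat, p

# rows: test positive / test negative; columns: diagnostic test #1 / #2
stat, p = chi_square_2x2(82, 140, 18, 60)
print(round(stat, 2), round(p, 4))  # 4.99 0.0255
```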

## SAS® Example

### Using PROC FREQ in SAS for comparing two diagnostic tests based on data from one sample

Suppose both diagnostic tests (test #1 and test #2) are applied to a given set of individuals, some with the disease (by the gold standard) and some without the disease.

As an example, data can be summarized in a 2 × 2 table for the 100 diseased patients as follows:

| | Test #2 Positive | Test #2 Negative |
|---|---|---|
| Test #1 Positive | 30 | 35 |
| Test #1 Negative | 23 | 12 |

The appropriate test statistic for this situation is McNemar's test. The patients with a (+, +) result and the patients with a (-, -) result do not distinguish between the two diagnostic tests. The only information for comparing the sensitivities of the two diagnostic tests comes from those patients with a (+, -) or (-, +) result.

Testing that the sensitivities are equal, i.e., $$H_0 \colon p_1 = p_2$$, is equivalent to testing

$$H_0 \colon p = P(\text{preferring diagnostic test 1 over diagnostic test 2}) = \tfrac{1}{2}$$

In the above example, there are N = 58 discordant pairs, and 35 of the 58 display a (+, -) result, so the estimated binomial probability is 35/58 = 0.60. The exact p-value is 0.148 from McNemar's test (see SAS Example 18.3_comparing_diagnostic.sas below).

```sas
***********************************************************************
* This is a program that illustrates the use of PROC FREQ in SAS for  *
* comparing two diagnostic tests based on data from one sample.       *
***********************************************************************;

proc format;
  value testfmt 1='positive' 2='negative';
run;

data comparison;
  input test1 test2 count;
  format test1 test2 testfmt.;
  cards;
1 1 30
1 2 35
2 1 23
2 2 12
;
run;

proc freq data=comparison;
  tables test1*test2/agree;
  weight count;
  exact mcnem;
  title "McNemar's Test for Comparing Sensitivities";
run;
```


Thus, the two diagnostic tests are not significantly different with respect to sensitivity.
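McNemar's exact test is just a two-sided binomial test of the discordant pairs against p = 1/2, so it is easy to reproduce; a Python sketch (for illustration):

```python
from math import comb

def mcnemar_exact(n12, n21):
    """Exact two-sided McNemar p-value: binomial test of the discordant
    pairs against p = 1/2."""
    n = n12 + n21
    k = max(n12, n21)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = mcnemar_exact(35, 23)  # (+, -) and (-, +) discordant pairs
print(round(p, 3))  # 0.148, as reported by PROC FREQ
```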

# 17.5 - Selecting a Positivity Criterion

Methods for calculating sensitivity and specificity depend on test outcomes that are dichotomous. Many lab tests and other diagnostic tools, however, are measured on a numerical scale. In this case, sensitivity and specificity depend on where the cutoff point is made between positive and negative.

The positivity criterion is the cutoff value on a numerical scale that separates normal values from abnormal values. It determines which test results are considered positive (indicative of disease) and negative (disease-free). Because the distributions of test values for diseased and disease-free individuals are likely to overlap, there will be false-positive and false-negative results. When defining a positivity criterion, it is important to consider which mistake is worse.

Suppose a high value of the diagnostic test (a positive result) is indicative of disease. If a low value is selected for the cutoff point, nearly all of the diseased individuals will have a positive result, so this cutoff will yield a good sensitivity. Unfortunately, many of the healthy individuals also will have a positive result (false positives), so this cutoff will yield a poor specificity.

Now suppose a greater value is selected for the cutoff point. Many of the diseased individuals will have a negative result (false negatives), so this cutoff will yield a poor sensitivity. On the other hand, nearly all of the healthy individuals will have a negative result, so this cutoff will yield a good specificity.

When the consequences for missing a case are potentially grave, choose a value for the positivity criterion that minimizes the number of false-negatives. For example, in neonatal PKU screening, a false-negative result may delay essential dietary intervention until mental retardation is evident. False-positive results, on the other hand, are usually identified during follow-up testing.

When false-positive results may lead to a risky treatment, choose a value for the positivity criterion that minimizes the number of false-positive results. For example, false-positive results indicating certain types of cancer can lead to chemotherapy which can suppress the patient's immune system and leave the patient open to infection and other side effects.

An ROC curve (Receiver Operating Characteristic) is a graphical representation of the relationship between sensitivity and specificity for a diagnostic test measured on a numerical scale. The ROC curve consists of a plot of sensitivity (true-positives) versus 1 - specificity (false-positives) for several choices of the positivity criterion. PROC LOGISTIC of SAS provides a means for constructing ROC curves.

The figure below depicts an ROC curve (drawn with x's). The point in the upper left corner of the figure, (0, 1), represents a perfect test, in which sensitivity and specificity both are 1. When false-positive and false-negative results are equally problematic, there are two choices:

1. Set the positivity criterion to the point on the ROC curve closest to the upper left corner. (This will also be closest to the dashed line, as the cutoff in the figure indicates.) or
2. Set the positivity criterion to the point on the ROC curve farthest (vertical distance) from the line of chance (Youden index).

When false-positive results are more undesirable, set the positivity criterion to the point farthest left on the ROC curve (increase specificity). If instead, false-negative results are more undesirable, set the positivity criterion to a point farther right on the ROC curve (increase sensitivity).
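The Youden-index rule can be sketched in a few lines of Python; the scores below are hypothetical toy data, not from any study:

```python
def best_youden_cutoff(diseased, healthy):
    """Scan candidate cutoffs (positive if score >= cutoff) and return the
    one maximizing Youden's J = sensitivity + specificity - 1."""
    best_cut, best_j = None, -1.0
    for cut in sorted(set(diseased) | set(healthy)):
        sens = sum(s >= cut for s in diseased) / len(diseased)
        spec = sum(s < cut for s in healthy) / len(healthy)
        j = sens + spec - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

# hypothetical toy scores (higher = more indicative of disease)
diseased = [6, 7, 8, 9, 10, 11]
healthy = [1, 2, 3, 4, 5, 8]
cutoff, j = best_youden_cutoff(diseased, healthy)
print(cutoff, round(j, 2))  # 6 0.83
```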

## SAS® Example

### Using PROC LOGISTIC to develop the ROC curve

In the ACRN SOCS trial, the investigators wanted to determine if low values of the methacholine $$PC_{20}$$ at baseline are predictive of significant asthma exacerbations. The methacholine $$PC_{20}$$ is a measure of how reactive a person's airways are to an irritant (methacholine) - a low value of the $$PC_{20}$$ corresponds to a high level of airway reactivity.

Here is the SAS code that was used: ACRN_SOCS_trial.sas

Unfortunately, $$log_2 \left(\text{methacholine } PC_{20}\right)$$ is not statistically significant in predicting the occurrence of significant asthma exacerbation $$\left(p = 0.27\right)$$ and the ROC curve is very close to the line of identity.

# 17.6 - Summary

In this lesson, among other things, we learned how to:

• calculate and provide confidence intervals for the sensitivity and specificity of a diagnostic test,
• calculate accuracy and predictive values of a diagnostic test,
• state the relationship of prevalence of disease to the sensitivity, specificity and predictive values of a diagnostic test,
• test whether sensitivity or specificity of 2 tests are significantly different, whether the results come from a study in two groups of patients or one group of patients tested with both tests, and
• select an appropriate cut-off for a positive test result, given an ROC curve, for different cost ratios of false positive/false negative results.
