Lesson 17: Medical Diagnostic Testing
Overview
A diagnostic test is any approach used to gather clinical information for the purpose of making a clinical decision (i.e., a diagnosis). Some examples of diagnostic tests include X-rays, biopsies, pregnancy tests, medical histories, and results from physical examinations.
From a statistical point of view there are two points to keep in mind:
 the clinical decision-making process is based on probability;
 the goal of a diagnostic test is to move the estimated probability of disease toward either end of the probability scale (i.e., 0 rules out disease, 1 confirms the disease).
Here is an example taken from Greenberg et al. (2000, Medical Epidemiology, Third Edition). A 54-year-old woman visits her family physician for an annual checkup. The physician observes that:
 she had no illnesses during the preceding year and there is no family history of breast cancer,
 her physical exam is unremarkable (nothing unusual is apparent),
 her breast exam is normal (no signs of a palpable mass), and
 her pelvic and rectal exams are unremarkable.
Based on the woman's age and medical history, the initial (prior) probability estimate of breast cancer is 0.003. Due to her age, the physician recommends that the woman have a mammogram (diagnostic test #1). Unfortunately, the results of the mammogram are abnormal, which modifies the woman's prior probability of breast cancer from 0.003 to 0.13 (notice the Bayesian flavor of this approach: a prior probability revised in light of new data). Next, the woman is referred to a surgeon, who agrees that the physical breast exam is normal. The surgeon consults with a radiologist, and they decide that the woman should undergo fine needle aspiration (FNA) of the abnormality detected by the mammogram (diagnostic test #2). The FNA specimen reveals abnormal cells, which again revises the probability of breast cancer, from 0.13 to 0.64. Finally, the woman is scheduled for a breast biopsy the following week to get a definitive diagnosis.
Ideally, diagnostic tests would always be correct, noninvasive, and free of side effects. If this were the case, a positive test result would unequivocally indicate the presence of disease and a negative result would indicate the absence of disease. Realistically, however, every diagnostic test is fallible.
Objectives
 calculate and provide confidence intervals for the sensitivity and specificity of a diagnostic test,
 calculate accuracy and predictive values of a diagnostic test,
 state the relationship of prevalence of disease to the sensitivity, specificity and predictive values of a diagnostic test,
 test whether sensitivity or specificity of 2 tests are significantly different, whether the results come from a study in two groups of patients or one group of patients tested with both tests, and
 select an appropriate cutoff for a positive test result, given an ROC curve, for different cost ratios of false positive/false negative results.
17.1  Analysis of Diagnostic Tests
To begin, let's consider a simple test which has only two possible outcomes, namely, positive and negative. When a test is applied to a group of patients, some with the disease and some without the disease, four groups can result, as summarized in the following 2 × 2 table:
                 Disease               No Disease
Test Positive    a (true positives)    b (false positives)
Test Negative    c (false negatives)   d (true negatives)
a (true positives) = individuals with the disease, and for whom the test is positive
b (false positives) = individuals without the disease, but for whom the test is positive
c (false negatives) = individuals with the disease, but for whom the test is negative
d (true negatives) = individuals without the disease, and for whom the test is negative
a + c = total number of individuals with the disease
b + d = total number of individuals without the disease
The "Gold Standard" is the method used to obtain a definitive diagnosis for a particular disease; it may be biopsy, surgery, autopsy, or an acknowledged standard. Gold standards are used to define true disease status, against which the results of a new diagnostic test are compared. The table below lists some definitive diagnostic tests that confirm whether or not a patient has the disease. Some of these are quite invasive, and this is a major reason why new diagnostic procedures are being developed.
Target Disorder          Gold Standard
breast cancer            excisional biopsy
prostate cancer          transrectal biopsy
coronary stenosis        coronary angiography
myocardial infarction    catheterization
strep throat             throat culture
17.2  Describing Diagnostic Tests
The following concepts have been developed to describe the performance of a diagnostic test relative to the gold standard; these concepts are measures of the validity of a diagnostic test.
Sensitivity is the probability that an individual with the disease of interest has a positive test. It is estimated from the sample as a/(a + c).

Specificity is the probability that an individual without the disease of interest has a negative test. It is estimated from the sample as d/(b + d).

Accuracy is the probability that the diagnostic test yields the correct determination. It is estimated from the sample as (a + d)/(a + b + c + d).
Tests with high sensitivity are useful clinically to rule out a disease. A negative result on a very sensitive test virtually excludes the possibility that the individual has the disease of interest. A test with high sensitivity also yields a low proportion of false negatives. Sensitivity is also referred to as "positive in disease" or "sensitive to disease".
Tests with high specificity are useful clinically to confirm the presence of a disease. A positive result on a very specific test gives strong evidence in favor of diagnosing the disease of interest. A test with high specificity also yields a low proportion of false positives. Specificity is also referred to as "negative in health" or "specific to health".
Sensitivity and specificity are, in theory, stable for all groups of patients.
In a study comparing FNA to the gold standard (excisional biopsy), 114 women with normal physical examinations (non-palpable masses) and abnormal mammograms received a FNA followed by surgical excisional biopsy of the same breast (Bibbo M, et al: Stereotaxic fine needle aspiration cytology of clinically occult malignant and premalignant breast lesions. Acta Cytol 1988; 32:193-201).
                Cancer    No Cancer
FNA Positive    14        8
FNA Negative    1         91
Sensitivity = 14/15 = 0.93 or 93%
Specificity = 91/99 = 0.92 or 92%
Accuracy = 105/114 = 0.92 or 92%
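These three calculations are simple enough to check directly. The short sketch below does so in Python (rather than the SAS used elsewhere in this lesson) using the FNA counts above; the function name is ours, not part of any package.

```python
# Sensitivity, specificity, and accuracy from a 2x2 table of test results
# versus a gold standard; a quick check of the FNA figures above.

def diagnostic_summary(a, b, c, d):
    """a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    sensitivity = a / (a + c)                 # P(test + | disease)
    specificity = d / (b + d)                 # P(test - | no disease)
    accuracy = (a + d) / (a + b + c + d)      # P(correct determination)
    return sensitivity, specificity, accuracy

sens, spec, acc = diagnostic_summary(a=14, b=8, c=1, d=91)
print(round(sens, 2), round(spec, 2), round(acc, 2))  # 0.93 0.92 0.92
```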
SAS® Example
Using PROC FREQ in SAS for determining an exact confidence interval for sensitivity and specificity
Point estimates for sensitivity and specificity are based on proportions. Therefore, we can compute confidence intervals using binomial theory. See SAS Example (18.1_sensitivity_specifi.sas) below for a SAS program that calculates exact and asymptotic confidence intervals for sensitivity and specificity.
***********************************************************************
* This is a program that illustrates the use of PROC FREQ in SAS for  *
* determining an exact confidence interval for sensitivity and        *
* specificity.                                                        *
***********************************************************************;

proc format;
   value yesnofmt 1='yes' 2='no';
run;

data sensitivity;
   input positive count;
   format positive yesnofmt.;
   cards;
1 14
2 1
;
run;

proc freq data=sensitivity;
   tables positive / binomial alpha=0.05;
   weight count;
   title "Exact and Asymptotic 95% Confidence Intervals for Sensitivity";
run;

data specificity;
   input negative count;
   format negative yesnofmt.;
   cards;
1 91
2 8
;
run;

proc freq data=specificity;
   tables negative / binomial alpha=0.05;
   weight count;
   title "Exact and Asymptotic 95% Confidence Intervals for Specificity";
run;
For the FNA study, only 15 women with cancer, as diagnosed by the gold standard, were studied. The rule for using the asymptotic confidence interval fails for sensitivity because np(1 - p) = 0.9765 < 5 (the rule does hold for specificity).
As the output shows below, the exact 95% confidence intervals for sensitivity and specificity are (0.680, 0.998) and (0.847, 0.965), respectively.
Exact and Asymptotic 95% Confidence Intervals for Sensitivity

The FREQ Procedure

positive    Frequency    Percent    Cumulative Frequency    Cumulative Percent
yes         14           93.33      14                      93.33
no          1            6.67       15                      100.00

Binomial Proportion
Proportion               0.9333
ASE                      0.0644
95% Lower Conf Limit     0.8071
95% Upper Conf Limit     1.0000

Exact Conf Limits
95% Lower Conf Limit     0.6805
95% Upper Conf Limit     0.9983

Test of H0: Proportion = 0.5
ASE Under H0             0.1291
Z                        3.3566
One-sided Pr > Z         0.0004
Two-sided Pr > |Z|       0.0008

Sample Size = 15
Exact and Asymptotic 95% Confidence Intervals for Specificity

The FREQ Procedure

negative    Frequency    Percent    Cumulative Frequency    Cumulative Percent
yes         91           91.92      91                      91.92
no          8            8.08       99                      100.00

Binomial Proportion
Proportion               0.9192
ASE                      0.0274
95% Lower Conf Limit     0.8655
95% Upper Conf Limit     0.9729

Exact Conf Limits
95% Lower Conf Limit     0.8470
95% Upper Conf Limit     0.9645

Test of H0: Proportion = 0.5
ASE Under H0             0.0503
Z                        8.3418
One-sided Pr > Z         <.0001
Two-sided Pr > |Z|       <.0001

Sample Size = 99
17.3  Estimating the Probability of Disease
Sensitivity and specificity describe the accuracy of a test. In a clinical setting, however, we do not know who has the disease and who does not; that is why diagnostic tests are used. We would like to be able to estimate the probability of disease based on the outcome of one or more diagnostic tests. The following measures address this idea.
Prevalence is the probability of having the disease, also called the prior probability of having the disease. It is estimated from the sample as \(\dfrac{\left(a+c\right)}{\left(a+b+c+d\right)}\).
Positive Predictive Value (PV+) is the probability of disease in an individual with a positive test result. It is estimated as \(\dfrac{a}{\left(a+b\right)}\).
Negative Predictive Value (PV-) is the probability of not having the disease when the test result is negative. It is estimated as \(\dfrac{d}{\left(c+d\right)}\).
In the FNA study of 114 women with nonpalpable masses and abnormal mammograms,
\(prevalence = \dfrac{15}{114} = 0.13\)
\(PV+ = \dfrac{14}{\left(14+8\right)} = 0.64\)
\(PV- = \dfrac{91}{\left(1+91\right)} = 0.99\)
Thus, a woman's prior probability of having the disease is 0.13, and it is modified to 0.64 if she has a positive test result. A woman's prior probability of not having the disease is 0.87, and it is modified to 0.99 if she has a negative test result.
If the disease under study is rare, the investigator may decide to invoke a case-control design for evaluating the diagnostic test, e.g., recruit 50 patients with the disease and 50 controls. Obviously, prevalence cannot be estimated from a case-control study because the sample does not represent a random sample from the general population.
Predictive values allow us to determine the usefulness of a test, and they vary with the sensitivity and specificity of a test. If all other characteristics are held constant, then:
 as the sensitivity of a test increases, PV- increases, and
 as the specificity of a test increases, PV+ increases.
Predictive values vary with the prevalence of the disease in the population being tested or the pretest probability of disease in a given individual.
Sensitivity, specificity, and prevalence can be used in a clinical setting to estimate posttest probabilities (predictive values), even though physicians work with one patient at a time, not entire populations of patients. Three pieces of information are necessary prior to performing the test, namely, (1) either the prevalence of the disease or the prior probability of disease, (2) sensitivity, and (3) specificity.
Then, formulae for PV+ and PV- are:
\(PV+ = \dfrac{\text{Prevalence}\times\text{Sensitivity}}{(\text{Prevalence}\times\text{Sensitivity})+\left\{(1-\text{Prevalence})\times (1-\text{Specificity}) \right\}}\)
\(PV- = \dfrac{(1-\text{Prevalence})\times\text{Specificity}}{\left\{(1-\text{Prevalence})\times\text{Specificity}\right\}+\left\{\text{Prevalence}\times (1-\text{Sensitivity}) \right\}}\)
Although PV+ = 14/(14+8) = 0.64 and PV- = 91/(1+91) = 0.99 can be calculated directly from the 2 × 2 data table because the women constituted a random sample, the above formulae yield the same results:
\(PV+ = \dfrac{(0.13)(0.93)}{(0.13)(0.93) + (0.87)(0.08)} = 0.64\)
\(PV- = \dfrac{(0.87)(0.92)}{(0.87)(0.92) + (0.13)(0.07)} = 0.99\)
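The agreement between the two routes can be verified exactly. The sketch below, in Python rather than the lesson's SAS, evaluates the PV+ and PV- formulas with exact fractions from the FNA study so that no rounding creeps in before the final step; it reproduces the direct-count answers 14/(14+8) and 91/(1+91).

```python
from fractions import Fraction as F

# Predictive values from prevalence, sensitivity, and specificity,
# using the formulas above with exact fractions from the FNA study.
prev = F(15, 114)   # prevalence
sens = F(14, 15)    # sensitivity
spec = F(91, 99)    # specificity

pv_pos = (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))
pv_neg = ((1 - prev) * spec) / ((1 - prev) * spec + prev * (1 - sens))

# pv_pos reduces to 14/22 = 0.636..., pv_neg to 91/92 = 0.989...
print(float(pv_pos), float(pv_neg))
```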
The following example is taken from Sackett et al. (1985, Clinical Epidemiology). Suppose a patient with the following characteristics visits a physician:
 45-year-old man
 ambulatory with episodic chest pain
 no coronary risk factors except smoking one pack of cigarettes per day
 3-week history of substernal and precordial pain, stabbing and fleeting
 physical exam shows a single costochondral junction that is slightly tender, but does not reproduce the patient's pain
From this information, the physician estimates an intermediate pretest (prior) probability of 60% that this patient has significant coronary artery narrowing.
The physician is not sure whether the patient should undergo an exercise electrocardiogram (ECG). How useful would this test be for this patient?
Suppose it is known from the literature that the sensitivity and specificity of the exercise ECG in coronary artery stenosis (as compared to the gold standard of coronary arteriography) are 60% and 91%, respectively.
Then:
\(PV+ = \dfrac{(0.6)(0.6)}{(0.6)(0.6) + (0.4)(0.09)} = 0.91\)
\(PV- = \dfrac{(0.4)(0.91)}{(0.4)(0.91) + (0.6)(0.4)} = 0.60\)
An additional test characteristic reported in the medical literature is the likelihood ratio, which is the probability of a particular test result (+ or -) in patients with the disease divided by the probability of that result in patients without the disease. There exists one likelihood ratio for a positive test (LR+) and one for a negative test (LR-). Likelihood ratios express how many times more (or less) likely a test result is found in diseased versus nondiseased individuals:
\(LR+ = \dfrac{\text{Sensitivity}}{1 - \text{Specificity}}\)
\(LR- = \dfrac{1 - \text{Sensitivity}}{\text{Specificity}}\)
From the FNA study in 114 women with non-palpable masses and abnormal mammograms, LR+ = 0.933/0.081 = 11.52 and LR- = 0.067/0.919 = 0.07. Thus, positive FNA results are 11.52 times more likely in women with cancer than in those without, and negative FNA results are 0.07 times as likely in women with cancer as in those without.
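Likelihood ratios connect directly to the predictive values through the odds form of Bayes' rule: post-test odds = pre-test odds × LR. The sketch below, in Python rather than SAS, checks this for the FNA study. Note that with unrounded sensitivity and specificity LR+ comes out 11.55; the 11.52 quoted above reflects rounding sensitivity and specificity to three decimals first.

```python
# Likelihood ratios for the FNA study, plus the odds form of Bayes' rule.
sens = 14 / 15
spec = 91 / 99

lr_pos = sens / (1 - spec)     # likelihood ratio for a positive test
lr_neg = (1 - sens) / spec     # likelihood ratio for a negative test

prev = 15 / 114                # prior (pre-test) probability of cancer
pretest_odds = prev / (1 - prev)
posttest_odds = pretest_odds * lr_pos
pv_pos = posttest_odds / (1 + posttest_odds)   # recovers PV+ = 0.64

print(round(lr_pos, 2), round(lr_neg, 2), round(pv_pos, 2))  # 11.55 0.07 0.64
```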
17.4  Comparing Two Diagnostic Tests
Suppose that we want to compare sensitivity and specificity for two diagnostic tests. Let \(p_1\) denote the test characteristic for diagnostic test #1 and let \(p_2\) denote the test characteristic for diagnostic test #2.
The appropriate statistical test depends on the setting. If the diagnostic tests were studied on two independent groups of patients, then two-sample tests for binomial proportions are appropriate (chi-square test, Fisher's exact test). If both diagnostic tests were performed on each patient, then the data are paired and methods that account for correlated binary outcomes are necessary (McNemar's test).
Suppose two different diagnostic tests are performed in two independent samples of individuals using the same gold standard. The following 2 × 2 tables result:
Diagnostic Test #1    Disease    No Disease
Positive              82         30
Negative              18         70

Diagnostic Test #2    Disease    No Disease
Positive              140        10
Negative              60         90
Suppose that sensitivity is the statistic of interest. The estimates of sensitivity are \(p_1 = \dfrac{82}{100} = 0.82\) and \(p_2 = \dfrac{140}{200} = 0.70\) for diagnostic test #1 and diagnostic test #2, respectively. The following SAS program will provide confidence intervals for the sensitivity for each test as well as comparison of the tests with regard to sensitivity.
SAS® Example
Using PROC FREQ in SAS for comparing two diagnostic tests based on data from two samples
***********************************************************************
* This is a program that illustrates the use of PROC FREQ in SAS for  *
* comparing two diagnostic tests based on data from two samples.      *
***********************************************************************;

proc format;
   value yesnofmt 1='yes' 2='no';
run;

data sensitivity_diag1;
   input positive count;
   format positive yesnofmt.;
   cards;
1 82
2 18
;
run;

proc freq data=sensitivity_diag1;
   tables positive / binomial alpha=0.05;
   weight count;
   title "Exact and Asymptotic 95% Confidence Intervals for Sensitivity with Diagnostic Test #1";
run;

data sensitivity_diag2;
   input positive count;
   format positive yesnofmt.;
   cards;
1 140
2 60
;
run;

proc freq data=sensitivity_diag2;
   tables positive / binomial alpha=0.05;
   weight count;
   title "Exact and Asymptotic 95% Confidence Intervals for Sensitivity with Diagnostic Test #2";
run;

data comparison;
   input test positive count;
   format positive yesnofmt.;
   cards;
1 1 82
1 2 18
2 1 140
2 2 60
;
run;

proc freq data=comparison;
   tables positive*test / chisq;
   exact chisq;
   weight count;
   title "Exact and Asymptotic Tests for Comparing Sensitivities";
run;
Run the program and look at the output. Do you see that the exact 95% confidence intervals for the sensitivities of the two diagnostic tests are (0.73, 0.89) and (0.63, 0.76), respectively?
The SAS output also indicates that the p-value = 0.0262 from Fisher's exact test for testing \(H_0 \colon p_1 = p_2\).
Thus, diagnostic test #1 has a significantly better sensitivity than diagnostic test #2.
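To show what PROC FREQ is doing under the hood, here is a from-scratch two-sided Fisher's exact test for the sensitivity comparison (82/100 vs. 140/200), written in Python. It uses the usual convention of summing the probabilities of all tables no more probable than the observed one; this is an illustrative sketch, and in practice SAS or a statistics library would be used.

```python
from math import comb
from fractions import Fraction as F

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1 = a + b                     # first row total
    n = a + b + c + d              # grand total
    c1 = a + c                     # first column total
    denom = comb(n, r1)

    def prob(x):                   # hypergeometric P(first cell = x)
        return F(comb(c1, x) * comb(n - c1, r1 - x), denom)

    p_obs = prob(a)
    lo, hi = max(0, r1 - (n - c1)), min(r1, c1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Sum all table probabilities no larger than the observed table's.
    return float(sum(q for q in probs if q <= p_obs))

# Rows: test #1 vs. test #2 (diseased patients); columns: positive, negative.
p_value = fisher_exact_2x2(82, 18, 140, 60)
print(round(p_value, 4))  # close to the 0.0262 reported by PROC FREQ
```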
SAS® Example
Using PROC FREQ in SAS for comparing two diagnostic tests based on data from one sample
Suppose both diagnostic tests (test #1 and test #2) are applied to a given set of individuals, some with the disease (by the gold standard) and some without the disease.
As an example, data can be summarized in a 2 × 2 table for the 100 diseased patients as follows:
                      Diagnostic Test #2
Diagnostic Test #1    Positive    Negative
Positive              30          35
Negative              23          12
The appropriate test statistic for this situation is McNemar's test. The patients with a (+, +) result and the patients with a (-, -) result do not distinguish between the two diagnostic tests. The only information for comparing the sensitivities of the two diagnostic tests comes from those patients with a (+, -) or (-, +) result.
Testing that the sensitivities are equal, i.e., \(H_0 \colon p_1 = p_2\), is equivalent to testing
\(H_0 \colon p = \Pr(\text{preferring diagnostic test #1 over diagnostic test #2}) = \tfrac{1}{2}\)
In the above example, there are N = 35 + 23 = 58 discordant pairs, and 35 of the 58 display a (+, -) result, so the estimated binomial probability is 35/58 = 0.60. The exact p-value is 0.148 from McNemar's test (see SAS Example 18.3_comparing_diagnostic.sas below).
***********************************************************************
* This is a program that illustrates the use of PROC FREQ in SAS for  *
* comparing two diagnostic tests based on data from one sample.       *
***********************************************************************;

proc format;
   value testfmt 1='positive' 2='negative';
run;

data comparison;
   input test1 test2 count;
   format test1 test2 testfmt.;
   cards;
1 1 30
1 2 35
2 1 23
2 2 12
;
run;

proc freq data=comparison;
   tables test1*test2 / agree;
   weight count;
   exact mcnem;
   title "McNemar's Test for Comparing Sensitivities";
run;
Thus, the two diagnostic tests are not significantly different with respect to sensitivity.
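The exact McNemar p-value is just a binomial tail calculation, which can be sketched directly. In the Python sketch below (illustrative, not the lesson's SAS), the 35 discordant (+, -) results out of n = 35 + 23 = 58 are compared to Binomial(58, 1/2), and the two-sided p-value doubles the smaller tail probability (capped at 1).

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the two discordant
    cell counts b and c of a paired 2x2 table."""
    n = b + c
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 1/2); by symmetry this equals the
    # upper tail P(X >= n - k), so doubling covers both tails exactly.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_value = mcnemar_exact(35, 23)
print(round(p_value, 3))  # about 0.148, matching the text
```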
17.5  Selecting a Positivity Criterion
Methods for calculating sensitivity and specificity depend on test outcomes that are dichotomous. Many lab tests and other diagnostic tools, however, are measured on a numerical scale. In this case, sensitivity and specificity depend on where the cutoff point is made between positive and negative.
The positivity criterion is the cutoff value on a numerical scale that separates normal values from abnormal values. It determines which test results are considered positive (indicative of disease) and which negative (disease-free). Because the distributions of test values for diseased and disease-free individuals are likely to overlap, there will be false-positive and false-negative results. When defining a positivity criterion, it is important to consider which mistake is worse.
Suppose a high value of the diagnostic test (a positive result) is indicative of disease, and a relatively high cutoff value is chosen. That cutoff will yield poor sensitivity because many of the diseased individuals will have a negative result (false negatives). On the other hand, nearly all of the healthy individuals will have a negative result, so it will yield good specificity.
Now suppose a smaller value is selected for the cutoff point. That cutoff will yield good sensitivity because nearly all of the diseased individuals will have a positive result. Unfortunately, many of the healthy individuals also will have a positive result (false positives), so it will yield poor specificity.
When the consequences of missing a case are potentially grave, choose a value for the positivity criterion that minimizes the number of false negatives. For example, in neonatal PKU screening, a false-negative result may delay essential dietary intervention until mental retardation is evident. False-positive results, on the other hand, are usually identified during follow-up testing.
When false-positive results may lead to a risky treatment, choose a value for the positivity criterion that minimizes the number of false positives. For example, a false-positive result indicating certain types of cancer can lead to chemotherapy, which can suppress the patient's immune system and leave the patient open to infection and other side effects.
An ROC (Receiver Operating Characteristic) curve is a graphical representation of the relationship between sensitivity and specificity for a diagnostic test measured on a numerical scale. The ROC curve consists of a plot of sensitivity (true-positive rate) versus 1 - specificity (false-positive rate) for several choices of the positivity criterion. PROC LOGISTIC in SAS provides a means for constructing ROC curves.
The figure below depicts an ROC curve (drawn with x's). The point in the upper left corner of the figure, (0, 1), represents a perfect test, in which sensitivity and specificity both are 1. When false-positive and false-negative results are equally problematic, there are two choices:
1. Set the positivity criterion to the point on the ROC curve closest to the upper left corner. (This will also be closest to the dashed line, as the cutoff in the figure indicates.)
2. Set the positivity criterion to the point on the ROC curve farthest (in vertical distance) from the line of chance (the Youden index).
When false-positive results are more undesirable, set the positivity criterion to a point farther left on the ROC curve (increasing specificity). If instead false-negative results are more undesirable, set the positivity criterion to a point farther right on the ROC curve (increasing sensitivity).
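The Youden-index rule is easy to apply once the (sensitivity, specificity) pairs along the ROC curve are tabulated. The Python sketch below illustrates it; the candidate cutoffs and rates are hypothetical values invented for illustration, not from a real ROC analysis.

```python
# Choosing a positivity criterion by the Youden index
# J = sensitivity + specificity - 1, i.e., the ROC point farthest
# (vertically) from the line of chance.

# Hypothetical (cutoff, sensitivity, specificity) triples along an ROC curve.
roc_points = [
    (1.0, 0.99, 0.40),
    (2.0, 0.95, 0.60),
    (3.0, 0.88, 0.80),
    (4.0, 0.70, 0.92),
    (5.0, 0.50, 0.98),
]

def best_cutoff_youden(points):
    """Return the (cutoff, sens, spec) triple maximizing J = sens + spec - 1."""
    return max(points, key=lambda t: t[1] + t[2] - 1)

cutoff, sens, spec = best_cutoff_youden(roc_points)
print(cutoff, round(sens + spec - 1, 2))  # 3.0 0.68
```

Here the cutoff 3.0 wins because its J of 0.68 beats the other candidates' 0.39, 0.55, 0.62, and 0.48; lowering the winning cutoff would trade specificity for sensitivity, and raising it would do the reverse, exactly as described above.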
SAS® Example
Using PROC LOGISTIC to develop the ROC curve
In the ACRN SOCS trial, the investigators wanted to determine if low values of the methacholine \(PC_{20}\) at baseline are predictive of significant asthma exacerbations. The methacholine \(PC_{20}\) is a measure of how reactive a person's airways are to an irritant (methacholine); a low value of the \(PC_{20}\) corresponds to a high level of airway reactivity.
Here is the SAS code that was used: ACRN_SOCS_trial.sas
Unfortunately, \(log_2 \left(\text{methacholine } PC_{20}\right)\) is not statistically significant in predicting the occurrence of significant asthma exacerbations \(\left(p = 0.27\right)\), and the ROC curve is very close to the line of identity.
17.6  Summary
In this lesson, among other things, we learned how to:
 calculate and provide confidence intervals for the sensitivity and specificity of a diagnostic test,
 calculate accuracy and predictive values of a diagnostic test,
 state the relationship of prevalence of disease to the sensitivity, specificity and predictive values of a diagnostic test,
 test whether sensitivity or specificity of 2 tests are significantly different, whether the results come from a study in two groups of patients or one group of patients tested with both tests, and
 select an appropriate cutoff for a positive test result, given an ROC curve, for different cost ratios of false positive/false negative results.