12.6 - Table Statistics
There are a few possible statistical table options available in the FREQ procedure. Unfortunately, we don't have the luxury of digressing and discussing in this course when it is appropriate to use each of the statistics. We merely make ourselves aware of their existence in SAS should we need them in analyzing our data in the future. The options include:
- CHISQ, which requests a chi-square test of homogeneity or independence. Reported measures include the phi coefficient, the contingency coefficient, and Cramer's V. For 2×2 tables, Fisher's exact test is also included.
- MEASURES, which requests a basic set of measures of association and their standard errors. Measures include Pearson and Spearman correlation coefficients, gamma, Kendall's tau-b, Stuart's tau-c, Somers' D, lambda, and uncertainty coefficients. For 2×2 tables, odds ratios and risk ratios (with their confidence intervals) are reported.
- CMH, which requests Cochran-Mantel-Haenszel statistics, which test for association between a row and column variable after adjusting for all of the other variables in the TABLES statement. For 2×2 tables, the Breslow-Day test for homogeneity of odds ratios is reported as well.
- ALL, which requests all of the tests and measures provided by the CHISQ, MEASURES, and CMH options.
- EXACT, which requests Fisher's exact test for tables larger than 2×2. Note that this can take painfully long to compute for large tables.
Of course, you can always refer to the SAS Help and Documentation for the technical details of the statistical tests and measures.
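To make the syntax concrete, here is a minimal sketch of how these options are attached to a TABLES statement: they follow a forward slash after the table request. The data set work.toy, its variable names, and its cell counts are all made up for illustration and have nothing to do with the ICDB data.

```sas
/* Hypothetical 2x2 table entered as cell counts; the data set name, */
/* variable names, and counts are invented for illustration only.    */
DATA work.toy;
   input exposure $ outcome $ count;
   DATALINES;
yes yes 12
yes no 8
no yes 5
no no 15
;
RUN;

PROC FREQ data=work.toy;
   title 'Toy 2x2 table: CHISQ and MEASURES';
   weight count;                              /* each record is a cell count */
   tables exposure*outcome / chisq measures;  /* options follow the slash    */
RUN;
```

Because this toy table is 2×2, the CHISQ output should also include Fisher's exact test, and the MEASURES output should include odds ratio and risk ratio estimates, as described in the list above.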
Example 12.14
For the remaining examples in this lesson, we need an analysis data set with which to work. The analysis data set that we'll use contains (just!) four variables pulled from the ICDB background data set (with which we've already been working!), the ICDB cystoscopy data set, and the ICDB symptoms data set. The following program merely displays the variable names of, and the first 15 observations in, the analysis data set:
OPTIONS NOFMTERR;
PROC CONTENTS data = icdb.analysis position;
title 'The Analysis data set';
RUN;
PROC PRINT data = icdb.analysis (OBS = 15);
title 'The Analysis data set';
RUN;
Launch and run the SAS program. From the CONTENTS procedure output, you should see that the analysis data set contains 638 observations and the following four variables:
# | Variable | Type | Len | Format | Informat | Label |
---|---|---|---|---|---|---|
1 | subj | Num | 8 | 11. | 11. | patient ID number |
2 | CYST_HB | Num | 8 | | | |
3 | SYM_1 | Num | 8 | SYMPTSEV. | 4. | severity of urinary symptoms |
4 | ctr | Num | 8 | | | |
Besides the variable names, this list tells us that the sym_1 variable has a permanent format called symptsev. associated with it. You might recall that if we don't have access to the permanent formats catalog — which we don't — then we'll run into trouble when trying to use the sym_1 variable. That's why we specified the NOFMTERR option before trying to use the analysis data set. The NOFMTERR option allows us to use the analysis data set. We just can't take advantage of its variables' permanent formats.
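As a hedged alternative to the global NOFMTERR option, a FORMAT statement that names a variable but no format should remove that variable's format association for the duration of a single step. The sketch below is an assumption about how one might apply that here, not part of the original program; the title text is made up.

```sas
/* Sketch only: a FORMAT statement with no format name removes the */
/* format from sym_1 for this step, so symptsev. is not needed.    */
PROC PRINT data = icdb.analysis (OBS = 15);
   format sym_1;
   title 'The Analysis data set (sym_1 printed without its format)';
RUN;
```

NOFMTERR is the more convenient choice when many steps touch the data set, which is presumably why it is the approach used throughout this lesson.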
Perhaps you'd appreciate an explanation of what the four variables are:
- subj: subject ID number
- ctr: clinical center number (1, 2, 3, 4, 5)
- cyst_hb: indicates whether or not a hydrodistention and/or biopsy was performed (0 = a local cystoscopy with no hydrodistention or biopsy, 1 = hydrodistention and biopsy, and 2 = hydrodistention only)
- sym_1: indicates the subject's general assessment of the severity of his/her urinary symptoms (1 = not severe at all to 5 = extremely severe)
It might also help to review the first 15 observations just to get a feel for what the data looks like:
Obs | subj | CYST_HB | SYM_1 | ctr |
---|---|---|---|---|
1 | 110027 | 1 | 4 | 1 |
2 | 110029 | . | 4 | 1 |
3 | 110039 | . | 3 | 1 |
4 | 110040 | . | 3 | 1 |
5 | 110045 | . | 3 | 1 |
6 | 110049 | . | 3 | 1 |
7 | 110051 | . | 4 | 1 |
8 | 110052 | 1 | 4 | 1 |
9 | 110053 | . | 3 | 1 |
10 | 110055 | . | 3 | 1 |
11 | 110057 | . | . | 1 |
12 | 110058 | 1 | 4 | 1 |
13 | 110059 | . | 4 | 1 |
14 | 110060 | . | 3 | 1 |
15 | 110062 | 1 | 4 | 1 |
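The listing shows that cyst_hb is missing for many subjects. A quick way to see how many, sketched here with the MISSING tables option that is also used in the later examples, is a set of one-way frequency tables. The title text is an assumption; everything else uses names already introduced.

```sas
OPTIONS NOFMTERR;   /* the symptsev. format catalog is not available */

PROC FREQ data = icdb.analysis;
   title 'One-way counts for the analysis variables';
   /* MISSING treats missing values as a level, so they are counted */
   tables ctr cyst_hb sym_1 / missing;
RUN;
```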
Now, as promised, let's use the analysis data set to illustrate some of the statistical tests and measures of association you can request from SAS while using the FREQ procedure.
Example 12.15
Some clinical centers may be more likely to perform a cystoscopy (which is a fairly invasive procedure) on their patients than other clinical centers. One could imagine this happening for a variety of reasons. For instance, one clinical center might have more severe patients thereby justifying more invasive procedures. Or perhaps the patients attending a particular clinical center might be better off financially and therefore more willing to pay for additional procedures. In any case, suppose we are interested in testing whether or not there is an association between performing a cystoscopy (cyst_hb) and clinical center (ctr). A chi-square test between the two variables would help us answer our research question. The following FREQ procedure illustrates the use of the CHISQ tables option in order to obtain the value of the chi-square test statistic and its associated P-value (as well as a few other useful statistics and P-values):
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chi-square Test of Hospital and Cystoscopy Procedure: CHISQ';
format cyst_hb cystfmt.;
tables ctr*cyst_hb/nopercent nocol missing chisq;
RUN;
ctr | CYST_HB | ||||
---|---|---|---|---|---|
Frequency | |||||
Row Pct | Nothing | Local | Both | Hydro | Total |
1 | 96 | 0 | 10 | 0 | 106 |
90.57 | 0.00 | 9.43 | 0.00 | ||
2 | 77 | 9 | 70 | 2 | 158 |
48.73 | 5.70 | 44.30 | 1.27 | ||
3 | 38 | 0 | 14 | 0 | 52 |
73.08 | 0.00 | 26.92 | 0.00 | ||
4 | 101 | 16 | 65 | 5 | 187 |
54.01 | 8.56 | 34.76 | 2.67 | ||
5 | 86 | 1 | 41 | 7 | 135 |
63.70 | 0.74 | 30.37 | 5.19 | ||
Total | 398 | 26 | 200 | 14 | 638 |
Statistic | DF | Value | Prob |
---|---|---|---|
Chi-Square | 12 | 77.2068 | <.0001 |
Likelihood Ratio Chi-Square | 12 | 89.2250 | <.0001 |
Mantel-Haenszel Chi-Square | 1 | . | . |
Phi Coefficient | | 0.3479 | |
Contingency Coefficient | | 0.3286 | |
Cramer's V | | 0.2008 | |
WARNING: 35% of the cells have expected counts less than 5. Chi-Square may not be a valid test.
Sample Size 638
Of course, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the cyst_hb variable more meaningful to us.
The TABLES statement first requests a two-way table between the (row) variable ctr and the (column) variable cyst_hb. Because we are interested in including the missing values ("Nothing") in the analysis, we include the MISSING tables option. When doing this kind of analysis, I like to use the NOPERCENT and NOCOL tables options to help declutter the two-way frequency table. Finally, the CHISQ tables option tells SAS to calculate the chi-square statistic and its P-value (as well as a few other statistics).
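Three of those "other statistics" are simple functions of the chi-square statistic. As a worked check using the standard textbook definitions, with the chi-square value of 77.2068 and sample size n = 638 from the output above, and r = 5 rows and c = 4 columns, the reported values can be reproduced as follows:

```latex
\begin{aligned}
\phi &= \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{77.2068}{638}} \approx 0.3479 \\[4pt]
C &= \sqrt{\frac{\chi^2}{\chi^2 + n}} = \sqrt{\frac{77.2068}{77.2068 + 638}} \approx 0.3286 \\[4pt]
V &= \sqrt{\frac{\chi^2 / n}{\min(r-1,\; c-1)}} = \frac{0.3479}{\sqrt{3}} \approx 0.2008
\end{aligned}
```

These agree with the phi coefficient, contingency coefficient, and Cramer's V shown in the output.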
Launch and run the SAS program, and review the resulting output. You should first see a two-way table (with five rows for ctr and four columns for cyst_hb). Then, you should see a list of six different statistics, of which the first one is the chi-square test statistic. Here, the value of the statistic is 77.2 (rounded) with 12 degrees of freedom (DF) and a P-value that is less than 0.0001. Assuming that the expected cell counts are large enough for the chi-square test to be valid (SAS warns that 35% of the cells have expected counts less than 5), the P-value tells us that it is highly unlikely that we would observe the data we did under the assumption that ctr and cyst_hb are not associated. Therefore, we conclude that ctr and cyst_hb are associated.
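Since the validity of the test hinges on the expected cell counts, it can help to look at them directly. The following is a sketch, not part of the original example, that adds the EXPECTED tables option to the same request and suppresses the row percentages with NOROW so that each cell shows only its observed and expected frequency. The title text is an assumption.

```sas
PROC FORMAT;
   value cystfmt 0 = 'Local'
                 1 = 'Both'
                 2 = 'Hydro'
                 OTHER = 'Nothing';
RUN;

PROC FREQ data=icdb.analysis;
   title 'Observed and Expected Counts: Hospital by Cystoscopy Procedure';
   format cyst_hb cystfmt.;
   /* EXPECTED prints the expected frequency under independence in each cell */
   tables ctr*cyst_hb/nopercent nocol norow missing expected chisq;
RUN;
```

Cells whose expected frequency is below 5 are the ones counted in the warning message.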
Example 12.16
If we are interested in quantifying the association between clinical center and performing a cystoscopy, then we would want to take advantage of the MEASURES tables option. The following FREQ procedure illustrates the use of the MEASURES tables option to obtain a basic set of measures of association and their standard errors:
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES';
format cyst_hb cystfmt.;
tables ctr*cyst_hb/nopercent nocol missing measures;
RUN;
Again, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the cyst_hb variable more meaningful to us.
Launch and run the SAS program, and review the resulting output. You should again first see a two-way table (with five rows for ctr and four columns for cyst_hb):
ctr | CYST_HB | ||||
---|---|---|---|---|---|
Frequency | |||||
Row Pct | Nothing | Local | Both | Hydro | Total |
1 | 96 | 0 | 10 | 0 | 106 |
90.57 | 0.00 | 9.43 | 0.00 | ||
2 | 77 | 9 | 70 | 2 | 158 |
48.73 | 5.70 | 44.30 | 1.27 | ||
3 | 38 | 0 | 14 | 0 | 52 |
73.08 | 0.00 | 26.92 | 0.00 | ||
4 | 101 | 16 | 65 | 5 | 187 |
54.01 | 8.56 | 34.76 | 2.67 | ||
5 | 86 | 1 | 41 | 7 | 135 |
63.70 | 0.74 | 30.37 | 5.19 | ||
Total | 398 | 26 | 200 | 14 | 638 |
Then, you should see a list of thirteen different statistics lumped into five groups:
Statistic | Value | ASE |
---|---|---|
Gamma | 0.1697 | 0.0534 |
Kendall's Tau-b | 0.1071 | 0.0339 |
Stuart's Tau-c | 0.0897 | 0.0286 |
Somers' D C|R | 0.0870 | 0.0276 |
Somers' D R|C | 0.1318 | 0.0417 |
Pearson Correlation | . | . |
Spearman Correlation | 0.1206 | 0.0381 |
Lambda Asymmetric C|R | 0.0000 | 0.0000 |
Lambda Asymmetric R|C | 0.0155 | 0.0267 |
Lambda Symmetric | 0.0101 | 0.0175 |
Uncertainty Coefficient C|R | 0.0802 | 0.0133 |
Uncertainty Coefficient R|C | 0.0455 | 0.0079 |
Uncertainty Coefficient Symmetric | 0.0581 | 0.0099 |
The first statistic is Gamma, the second is Kendall's Tau-b, ..., and the last is Uncertainty Coefficient Symmetric. The column labeled "Value" is the value calculated for the given statistic for this data set, and the column labeled "ASE" is the calculated (asymptotic) standard error of the statistic.
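The ASE is what turns a point estimate into an interval. For instance, a rough 95% interval for Gamma is 0.1697 ± 1.96 × 0.0534, or about 0.065 to 0.274. Rather than doing this by hand, one can ask SAS for the limits with the CL tables option, as in this sketch (the title text is an assumption; the rest mirrors the program above):

```sas
PROC FORMAT;
   value cystfmt 0 = 'Local'
                 1 = 'Both'
                 2 = 'Hydro'
                 OTHER = 'Nothing';
RUN;

PROC FREQ data=icdb.analysis;
   title 'Measures of Association with Confidence Limits';
   format cyst_hb cystfmt.;
   /* CL adds asymptotic confidence limits for the MEASURES statistics */
   tables ctr*cyst_hb/nopercent nocol missing measures cl;
RUN;
```

The limits are asymptotic, so they deserve the same caution as the chi-square test when cell counts are small.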
Example 12.17
The Cochran-Mantel-Haenszel test allows us to test for the association between two categorical variables while adjusting for a third categorical variable. To request that SAS perform such a test, we must use the CMH tables option. The following FREQ procedure uses the CMH tables option to request the Cochran-Mantel-Haenszel test for association between the two variables cyst_hb and sym_1 while adjusting for the third variable ctr:
OPTIONS NOFMTERR;
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chi-square Test of Hospital and Cystoscopy Procedure: CMH';
title2 'Adjusting for Ctr';
format cyst_hb cystfmt.;
tables ctr*cyst_hb*sym_1/nopercent nocol cmh;
RUN;
As always, we put the stratifying variable — in this case, ctr — in the first position of the tables statement. Then, we put the two variables between which we are testing for association — in this case, cyst_hb and sym_1 — in the second and third positions. Again, the NOPERCENT and NOCOL tables options are used to help declutter the resulting frequency tables. Finally, the CMH tables option gives us the Cochran-Mantel-Haenszel test for association.
Launch and run the SAS program, and review the resulting output. You should first see a set of five two-way tables (one for each level of ctr). Here's what the table looks like for ctr = 2:
CYST_HB | SYM_1 (severity of urinary symptoms) | |||||
---|---|---|---|---|---|---|
Frequency | ||||||
Row Pct | 1 | 2 | 3 | 4 | 5 | Total |
Local | 1 | 2 | 2 | 3 | 1 | 9 |
11.11 | 22.22 | 22.22 | 33.33 | 11.11 | ||
Both | 1 | 5 | 22 | 30 | 12 | 70 |
1.43 | 7.14 | 31.43 | 42.86 | 17.14 | ||
Hydro | 0 | 0 | 0 | 2 | 0 | 2 |
0.00 | 0.00 | 0.00 | 100.00 | 0.00 | ||
Total | 2 | 7 | 24 | 35 | 13 | 81 |
Then, you should see a set of three Cochran-Mantel-Haenszel statistics — the first one is labeled Nonzero Correlation, the second one is labeled Row Mean Scores Differ, and the third one is labeled General Association:
Statistic | Alternative Hypothesis | DF | Value | Prob |
---|---|---|---|---|
1 | Nonzero Correlation | 1 | 2.6013 | 0.1068 |
2 | Row Mean Scores Differ | 2 | 2.9560 | 0.2281 |
3 | General Association | 8 | 6.9245 | 0.5448 |
The column labeled "DF" contains the given statistic's degrees of freedom, the column labeled "Value" contains the value calculated for the given statistic for the data set, and the column labeled "Prob" contains the statistic's P-value.
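If only the summary statistics are of interest and the five stratum tables are just clutter, the NOPRINT tables option should suppress the crosstabulation tables while still displaying the requested statistics. The following is a sketch of that variation, with a made-up title; no FORMAT statement is needed because no tables are displayed.

```sas
OPTIONS NOFMTERR;   /* the symptsev. format catalog is not available */

PROC FREQ data=icdb.analysis;
   title 'CMH Statistics Only, Adjusting for Ctr';
   /* ctr is the stratifying variable; NOPRINT hides the stratum tables */
   tables ctr*cyst_hb*sym_1 / cmh noprint;
RUN;
```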