Lesson 25: Analyzing Categorical Data
In this lesson, we investigate how to use the FREQ procedure to conduct various statistical analyses on categorical data that can be summarized in two-way frequency tables.
Upon completing this lesson, you should be able to:
- use the FREQ procedure to create a two-way frequency table using raw data
- use the FREQ procedure's CHISQ option to tell SAS to calculate chi-square statistics for a two-way frequency table
- know the possible shortcuts you can use in the FREQ procedure's TABLES statement to request multiple frequency tables
- use the FREQ procedure's WEIGHT statement to create a two-way frequency table using summarized data
- use the FREQ procedure's AGREE option to request McNemar's chi-square statistic as well as the Kappa statistic
- use the FREQ procedure's CMH option to request that odds ratios be calculated for a case-control study
- use the FREQ procedure's CMH option to request that relative risks be calculated for a cohort study
- use the FREQ procedure's CHISQ option to tell SAS to calculate the Mantel-Haenszel chi-square statistic so that you can test for linear trend in a 2 × N frequency table
- use the FREQ procedure's ALL option to tell SAS to calculate the Mantel-Haenszel chi-square statistic for stratified 2 × 2 tables
Textbook reading: Sections G, H, I, and L through Q in Chapter 3 of the textbook.
25.1 - Lesson Notes
G. Two-way Frequency Tables
Page 89. The table on the bottom of page 89 is incorrect. The cells in the Dewey row are missing the last row of numbers corresponding to column percent. This is what the table output should look like:
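The FREQ Procedure
Table of Candid by Gender
| Candid | Statistic | F | M | Total |
| Dewey | Frequency | 70 | 40 | 110 |
| Dewey | Percent | 38.89 | 22.22 | 61.11 |
| Dewey | Row Pct | 63.64 | 36.36 | |
| Dewey | Col Pct | 70.00 | 50.00 | |
| Truman | Frequency | 30 | 40 | 70 |
| Truman | Percent | 16.67 | 22.22 | 38.89 |
| Truman | Row Pct | 42.86 | 57.14 | |
| Truman | Col Pct | 30.00 | 50.00 | |
| Total | Frequency | 100 | 80 | 180 |
| Total | Percent | 55.56 | 44.44 | 100.00 |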
Using this corrected table, you can now see the correct frequency counts, percentages, row percentages and column percentages in each cell. For example, the Dewey-Male cell tells us:
- 40 of the 180 people sampled were males who preferred Dewey.
- 22.22% of the 180 people sampled — that's 40 divided by 180 — were males who preferred Dewey.
- Of the 110 people in the sample who preferred Dewey, 40 — that is, 36.36% — were male.
- Of the 80 males in the sample, 40 — that is, 50.00% — preferred Dewey.
Page 90. The null and alternative hypotheses here are:
- Null hypothesis: There is no relationship between gender and preference.
- Alternative hypothesis: There is a relationship between gender and preference.
The chi-square statistic's P-value (0.0062) tells us that, if the null hypothesis were true, it is highly unlikely that we'd obtain by chance alone a difference between the observed counts and the expected counts (as summarized by the chi-square statistic) as extreme as the one we did. The P-value is very small, much smaller than 0.05. Therefore, we can reject the null hypothesis in favor of the alternative hypothesis. There is sufficient evidence at the 0.05 level to conclude that there is a relationship between gender and preference.
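As a rough check on where that P-value comes from, the chi-square statistic for a 2 × 2 table can be worked out by hand with the usual shortcut formula. The worked equation below uses the counts entered in the elect data set in Section I; the labels a, b, c and d are introduced here only for illustration (a = 70 and b = 40 in the Dewey row, c = 30 and d = 40 in the Truman row, n = 180).

```latex
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
       = \frac{180\,(70 \cdot 40 - 40 \cdot 30)^2}{110 \cdot 70 \cdot 100 \cdot 80}
       = \frac{460{,}800{,}000}{61{,}600{,}000}
       \approx 7.48
```

A chi-square value of about 7.48 on 1 degree of freedom corresponds to a P-value of roughly 0.006, which is in line with the 0.0062 quoted above.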
I. Computing Chi-Square From Frequency Counts
Page 92. I find that I use the WEIGHT statement often. Whenever you don't have the original raw data available, but instead have the data already summarized in tables (as you might see on the evening news!), you have to use a WEIGHT statement so that SAS treats each record as the number of observations given by the count variable, rather than as a single observation, when it calculates the chi-square statistic. Here's the code I used to create the corrected table above in Section G:
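DATA elect;
   /* one record per Gender-by-Candid cell; Count is the cell frequency */
   input Gender $ Candid $ Count;
   DATALINES;
F Dewey 70
M Dewey 40
F Truman 30
M Truman 40
;
RUN;

PROC FREQ data = elect;
   table Candid*Gender / chisq;
   /* weight each record by its cell frequency */
   weight count;
RUN;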
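If you'd like to see the expected counts that the chi-square statistic compares against, one option is to add EXPECTED to the TABLES statement. The sketch below repeats the elect data step so that it runs on its own; NOROW, NOCOL and NOPERCENT are added only to keep the cells uncluttered and are not part of the original program.

```sas
DATA elect;
   input Gender $ Candid $ Count;
   DATALINES;
F Dewey 70
M Dewey 40
F Truman 30
M Truman 40
;
RUN;

PROC FREQ data = elect;
   /* EXPECTED prints each cell's expected count under independence;   */
   /* NOROW, NOCOL and NOPERCENT suppress the three percentage lines   */
   table Candid*Gender / chisq expected norow nocol nopercent;
   weight Count;
RUN;
```

Each expected count is just (row total × column total) / overall total. For the Dewey-F cell, for instance, that is 110 × 100 / 180, or about 61.11, compared with an observed count of 70.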
L. McNemar's Test for Paired Data
Page 98. Without stating so, the authors compare the obtained P-value of 0.0253 to a small pre-set significance level, 0.05 say. Since 0.0253 is smaller than 0.05, they reject the null hypothesis and conclude that the advertising campaign was effective. Two comments here: (1) If the authors or any statistician draw conclusions based on a P-value without stating a significance level, you can probably assume that they are thinking about a 0.05 level. (2) There is nothing etched in stone that says you have to use a 0.05 level. You may have sound scientific reasons to use a smaller value, 0.01 say, or a larger value, 0.10 say. The important thing is that you report what you use when drawing your conclusions.
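For reference, here is a minimal sketch of how McNemar's test might be requested from summarized paired data. The data set name paired, the variable names Before, After and Count, and all four counts are made up for illustration; they are not the textbook's data.

```sas
/* Hypothetical paired data: one record per before/after combination, */
/* with Count giving the number of subjects in that combination.      */
DATA paired;
   INPUT Before $ After $ Count;
   DATALINES;
No No 30
No Yes 15
Yes No 5
Yes Yes 50
;
RUN;

PROC FREQ data = paired;
   /* AGREE requests McNemar's test and the kappa statistic */
   TABLES Before*After / agree;
   WEIGHT Count;
RUN;
```

Only the discordant pairs (No-Yes and Yes-No) enter McNemar's statistic. With these made-up counts it works out to (15 − 5)² / (15 + 5) = 5 on 1 degree of freedom.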
N. Odds Ratios
Page 101. The authors calculate the odds ratio to be 3.25. We interpret such an odds ratio in this way: the odds of a case being exposed to benzene are 3.25 times the odds of a control being exposed to benzene.
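The 3.25 can be reproduced by hand from the counts used in the program for page 102 (50 exposed and 100 unexposed cases; 20 exposed and 130 unexposed controls). The worked equation below is just that arithmetic, written out as the odds of exposure among cases divided by the odds of exposure among controls.

```latex
\widehat{OR}
  = \frac{\text{odds of exposure among cases}}{\text{odds of exposure among controls}}
  = \frac{50 / 100}{20 / 130}
  = \frac{50 \cdot 130}{100 \cdot 20}
  = \frac{6500}{2000}
  = 3.25
```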
Page 102. If the authors didn't use the trick of using 1-Yes in place of Yes, and 2-No in place of No, this is what their program would look like:
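DATA odds;
   /* one record per Outcome-by-Exposure cell; Count is the cell frequency */
   INPUT Outcome $ Exposure $ Count;
   DATALINES;
Case Yes 50
Case No 100
Control Yes 20
Control No 130
;
RUN;

PROC FREQ data = odds;
   /* CMH requests the odds ratio and relative risk estimates */
   TABLE Exposure*Outcome / chisq cmh;
   WEIGHT Count;
RUN;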
Note that the Exposure values are entered as Yes and No rather than, respectively, 1-Yes and 2-No. When you launch and run this program, this is what the odds ratio portion of the output looks like:
Estimates of the Common Relative Risk (Row1/Row2)
| Type of Study | Method | Value | 95% Confidence Limits |
Total Sample Size = 300
Now, the odds ratio is reported to be 0.3077. That's because the cells in the two-way table are now flip-flopped:
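The FREQ Procedure
Table of Exposure by Outcome
| Exposure | Statistic | Case | Control | Total |
| No | Frequency | 100 | 130 | 230 |
| No | Percent | 33.33 | 43.33 | 76.67 |
| No | Row Pct | 43.48 | 56.52 | |
| No | Col Pct | 66.67 | 86.67 | |
| Yes | Frequency | 50 | 20 | 70 |
| Yes | Percent | 16.67 | 6.67 | 23.33 |
| Yes | Row Pct | 71.43 | 28.57 | |
| Yes | Col Pct | 33.33 | 13.33 | |
| Total | Frequency | 150 | 150 | 300 |
| Total | Percent | 50.00 | 50.00 | 100.00 |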
Note that the No row appears first here, whereas in the text the 1-Yes row does. Here, we'd have to interpret the odds ratio this way: the odds of a case not being exposed are 0.3077 times the odds of a control not being exposed. Do you agree that this interpretation is a little more awkward and a lot less helpful? Incidentally, you should note that 0.3077 is just the reciprocal of 3.25. That is, 1 divided by 0.3077 equals 3.25.
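Renaming the values to 1-Yes and 2-No is the authors' trick. Another possibility, sketched below as an alternative and not taken from the textbook, is the PROC FREQ statement's ORDER=DATA option, which orders the levels by their first appearance in the data set. Because Yes is read before No in these data lines, the Yes row should come first.

```sas
DATA odds;
   INPUT Outcome $ Exposure $ Count;
   DATALINES;
Case Yes 50
Case No 100
Control Yes 20
Control No 130
;
RUN;

/* ORDER=DATA: table rows and columns follow the order in which */
/* values first appear in the data set, not alphabetical order  */
PROC FREQ data = odds order = data;
   TABLE Exposure*Outcome / chisq cmh;
   WEIGHT Count;
RUN;
```

With the Yes row and the Case column first, the cross-product is (50 × 130) / (20 × 100) = 3.25, so the odds ratio reads in the more natural direction. Note that this approach depends on the order of the input records, which is one reason the 1-Yes and 2-No trick can be the more robust choice.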
Page 103. In the text below the output, the authors didn't quite report the 95% confidence interval for the odds ratio correctly. It should be (1.8189 to 5.807). We can be 95% confident that the true population odds ratio falls between 1.8189 and 5.807.
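These limits can be reproduced by hand, assuming they are the usual large-sample limits built on the log odds ratio (the notes don't say which method SAS used, so treat this as a consistency check, not a description of the procedure). The counts are the four cell frequencies from the odds data set: 50, 100, 20 and 130.

```latex
\widehat{OR} = \frac{50 \cdot 130}{100 \cdot 20} = 3.25,
\qquad
SE\bigl(\ln \widehat{OR}\bigr)
  = \sqrt{\frac{1}{50} + \frac{1}{100} + \frac{1}{20} + \frac{1}{130}}
  \approx 0.2961

\exp\bigl(\ln 3.25 \pm 1.96 \times 0.2961\bigr)
  = \exp(1.1787 \pm 0.5804)
  \approx (1.819,\ 5.807)
```

Up to rounding, that matches the interval of 1.8189 to 5.807 given above.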
O. Relative Risk
Page 106. The authors didn't quite report the 95% confidence interval for the relative risk correctly either. It should be (1.0761 to 3.7171). We can be 95% confident that the true population relative risk falls between 1.0761 and 3.7171.
P. Chi-Square Test for Trend
Page 108. The authors state that "there may be times when your table chi-square is not significant but, since the test for trend is using more information (the order of the columns), it may be significant." You can see it moving in that direction in this example. The P-value for the table chi-square is 0.0283, whereas the P-value for the M-H chi-square is 0.0074. That is, the P-value for the table chi-square test is almost four times as large as the P-value for the M-H chi-square test (0.0283 / 0.0074 is about 3.8). Hence, the M-H chi-square test produces a more significant result than the table chi-square test.
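Here is a minimal sketch of how such a trend test might be set up from summarized data. Everything in it is hypothetical: the data set name trend, the variables Outcome, Dose and Count, and the counts themselves are invented for illustration and are not the textbook's example. The ordered variable is coded numerically (1 to 4) so that the columns appear in their natural order.

```sas
/* Hypothetical 2 x 4 table: Outcome (Fail/Pass) by an ordered Dose level */
DATA trend;
   INPUT Outcome $ Dose Count;
   DATALINES;
Fail 1 10
Fail 2 15
Fail 3 20
Fail 4 25
Pass 1 40
Pass 2 35
Pass 3 30
Pass 4 25
;
RUN;

PROC FREQ data = trend;
   /* CHISQ prints the table chi-square and, in the same output, */
   /* the Mantel-Haenszel chi-square used as the test for trend  */
   TABLES Outcome*Dose / chisq;
   WEIGHT Count;
RUN;
```

In the output, compare the Chi-Square line (the table chi-square) with the Mantel-Haenszel Chi-Square line, which has 1 degree of freedom and uses the ordering of the columns.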
Q. Mantel-Haenszel Chi-Square for Stratified Tables and Meta-Analysis
Page 111. The authors committed another error when reading the output. The P-value for the Cochran-M-H statistic is 0.0004. The relative risk is 1.9775 and the 95% confidence interval for the relative risk is (1.3474, 2.9021).
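For orientation, a stratified analysis of this kind might be requested as sketched below. The data set name strata, the variables Study, Exposure, Outcome and Count, and all of the counts are hypothetical and chosen only to illustrate the layout; they will not reproduce the textbook's results. The 1-Yes and 2-No coding follows the authors' trick so that the exposed row and the event column come first.

```sas
/* Hypothetical data: two studies (strata), each a 2 x 2 table */
DATA strata;
   INPUT Study $ Exposure $ Outcome $ Count;
   DATALINES;
A 1-Yes 1-Yes 20
A 1-Yes 2-No 80
A 2-No 1-Yes 10
A 2-No 2-No 90
B 1-Yes 1-Yes 30
B 1-Yes 2-No 70
B 2-No 1-Yes 15
B 2-No 2-No 85
;
RUN;

PROC FREQ data = strata;
   /* The first variable in a three-way request is the stratification */
   /* variable; ALL requests the CHISQ, MEASURES and CMH output       */
   TABLES Study*Exposure*Outcome / all;
   WEIGHT Count;
RUN;
```

The Cochran-Mantel-Haenszel statistics and the common relative risk estimates appear after the individual stratum tables, in the summary section of the output.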
25.2 - Summary
In this lesson, we investigated how to use the FREQ procedure to conduct various statistical analyses on categorical data that can be summarized in two-way frequency tables.
The homework for this lesson will give you more practice with the FREQ procedure so that you become even more familiar with how it works. Then, you can use it to analyze your own categorical data!