Lesson 12: Summarizing Categorical Data

Lesson 12: Summarizing Categorical Data

Overview

In this lesson, we'll investigate the FREQ procedure as a tool for summarizing and analyzing categorical data. The procedure is a descriptive procedure, as well as a statistical procedure. It allows you to produce one-way to n-way frequency and cross tabulation tables. For two-way tables, the FREQ procedure also computes chi-square tests and measures of association. And, for n-way tables, the FREQ procedure also performs stratified analyses, computing statistics within as well as across strata. The FREQ procedure can also be used to output summary statistics, such as counts and percentages, to a SAS data set.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to use the FREQ procedure to summarize a data set numerically in a variety of ways, including:

  • create simple one-way, two-way, ... and n-way table summaries
  • use the NOCUM option to suppress the printing of cumulative statistics in a table
  • use the PAGE option to tell SAS to print only one table per page
  • know how to read the values from a two-way table created by the FREQ procedure
  • create two-way (and in general, n-way) tables using the available shortcuts
  • use the NOROW, NOCOL, and NOPERCENT options to suppress some of the default output in each of the cells of an n-way table
  • request additional output in each of the cells of an n-way table, such as EXPECTED, DEVIATION, and CELLCHI2
  • use the LIST and CROSSLIST tables options to print n-way tables in a list format rather than as crosstabulation tables
  • use a BY statement to tell SAS to perform an operation for each level of the BY-group
  • use the MISSING tables option tells SAS to treat missing values as nonmissing values, and to therefore include them in the calculation of the statistics
  • use the MISSPRINT option tells SAS to treat missing values as nonmissing values when printing the frequencies, but to not include them in the calculation of the statistics
  • use the FREQ procedure create new SAS data sets containing summary statistics of categorical variables
  • use the NOPRINT tables option to tell SAS to suppress printing the n-way crosstabulation
  • use the SPARSE tables option tells SAS to print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data
  • invoke statistics table options, such as CHISQ, MEASURES, CMH, ALL, and EXACT

12.1 - A Basic One-Way Table

12.1 - A Basic One-Way Table

By default, the FREQ procedure creates a one-way table that contains the frequency, percent, cumulative frequency, and cumulative percent of every value of every variable in the input data set. That every there is italicized with good reason ... the FREQ procedure doesn't care whether the variable is a character variable or a numeric variable. And, if a variable is numeric, the FREQ procedure doesn't care if it is a discrete numeric variable with just a few possible outcomes (number of siblings, say) or a continuous numeric variable with an infinite number of possible outcomes (weight, say). That means then if you rely on the default version of the FREQ procedure, it is possible to create lots and lots and lots of output. That's why we'll skip the default version and will jump right to the more practical version in which you restrict the number of tables SAS creates by using a TABLES statement.

The FREQ procedure takes the following generic form:

PROC FREQ options;
    tables ... /options;
RUN;

The TABLES statement tells SAS the specific frequency table(s) that you want to create. If you don't include a TABLES statement, then SAS creates a one-way frequency table for every variable in your input data set.

As you can see, there are two types of options, namely procedure options and table options. Procedure options, such as the typical "DATA=" option, must follow the PROC FREQ statement. Table options must be specified after a forward slash (/) in the TABLES statement. In either case, you can specify as many options as you would like.

Throughout this lesson, we'll use the ICDB background data set to illustrate the FREQ procedure. Right-click the link to save the data set to a convenient location on your computer.

Example 12.1

The following FREQ procedure illustrates the simplest practical example, namely a one-way frequency table of the variable sex, with no bells or whistles added:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
LIBNAME icdb 'C:\simon\icdb\data';

PROC FREQ data=icdb.back;
   title 'Frequency Count of SEX';
   tables sex;
RUN;

Launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the background data set. Then, run the program and review the output. You should see something along the lines of this basic one-way frequency table, in which, as promised, SAS reports the frequency, percent, cumulative frequency, and cumulative percent of each value of the sex variable:

Frequency Count of SEX
The FREQ Procedure
sex Frequency Percent Cumulative Frequency Cumulative Percent
1 56 8.78 56 8.78
2 582 91.22 638 100.00

This output tells us, for example, that 56 or 8.78% of the subjects in the ICDB Study are male (coded as sex = 1).

Example 12.2

Again, by default, SAS outputs frequency counts, percents, cumulative frequencies, and cumulative percents. The NOCUM table option suppresses the printing of the cumulative frequencies and cumulative percentages for one-way frequency tables. The following SAS code illustrates the NOCUM table option:

 PROC FREQ data=icdb.back;
    title 'Frequency Count of SEX: No Cumulative Stats';
    tables sex/nocum;
 RUN;
Frequency Count of SEX: No Cumulative Stats
The FREQ Procedure
sex Frequency Percent
1 56 8.78
2 528 91.22

Launch and run the SAS program. Review the output to convince yourself that indeed the cumulative frequencies and cumulative percentages are not printed in the table. The table contains only the number and percentage of each sex.

In any FREQ procedure, you can specify many variables in a TABLES statement. If the list is long, you may be able to use a shortcut to specify the list of variables. If you specify a TABLES statement using a numbered range of variables, such as:

tables var1-var4;

then SAS will create a one-way frequency table for the four variables named var1, var2, var3, and var4. If instead in your TABLES statement, you specify a range of variables by their position in the data set, such as:

tables sex--race;

then SAS will create a one-way frequency table for every variable that appears between the sex and race variables in the data set, namely in the case of the background data set, sex, state, country, and race. Recall that if you're not sure of the position of the variables in your data set, you can use the VARNUM option of the CONTENTS procedure to determine the position of the variables in a data set. (Incidentally, that is not a typo in the second TABLES statement ... it takes two dashes to specify a range of variables by their position in the data set.)

Rather than specifying many variables in a TABLES statement, you can specify many TABLES statements in a FREQ procedure. However you tell SAS to make multiple tables, you can use the PAGE option to tell SAS to print only one table per page. Otherwise, the FREQ procedure prints multiple tables per page as space permits.

Example 12.3

The following SAS program illustrates the creation of two one-way frequency tables in conjunction with the PAGE option:

 PROC FREQ data=icdb.back page;
    title 'Frequency Count of SEX and RACE';
    tables sex race;
 RUN;

Launch and run the SAS program. Review the output to convince yourself that indeed SAS creates two one-way frequency tables — one for the categorical variable sex and the other for the categorical variable race. Because the PAGE option was invoked, each table should be printed on a separate page. The first page should contain the frequency table for the sex variable:

Frequency Count of SEX and RACE
The FREQ Procedure
sex Frequency Percent Cumulative Frequency Cumulative Percent
1 56 8.78 56 8.78
2 528 91.22 638 100.00

and the second page should contain the frequency table for the race variable:

Frequency Count of SEX and RACE
The FREQ Procedure
race Frequency Percent Cumulative Frequency Cumulative Percent
1 2 0.31 2 0.31
2 7 1.10 9 1.41
3 29 4.55 38 5.96
4 593 92.95 631 98.90
5 3 0.47 634 99.37
6 2 0.31 636 99.69
7 1 0.16 637 99.87
8 1 0.16 638 100.00

Incidentally, you might also want to notice that, not surprisingly, the order in which the variables appear in the TABLES statement determines the order in which they appear in the output.

Example 12.4

As is the case for many SAS procedures, you can use a BY statement to tell SAS to perform an operation for each level of a BY group. The following program tells SAS to create a one-way frequency table for the variable ed_level for each level of the variable sex:

 PROC SORT data=icbd.back out=s_back;
    by sex;
 RUN;
    
PROC FREQ data=s_back;
    title 'Frequency Count of Education Level within Each Level of Sex';
    tables ed_level;
    by sex;
RUN;
 

As is always the case, the SORT procedure merely prepares the background data set for BY-group processing. The SORT procedure tells SAS to sort the icdb.back data set by sex, and to store the results in a new data set called s_back. Then, as you can see, the FREQ procedure is invoked with a BY statement ("by sex") in addition to the TABLES statement ("tables ed_level"). Launch and run the SAS program. Review the output to convince yourself that SAS creates two one-way frequency tables of education level (ed_level) — one for males (sex= 1):

Frequency Count of Education Level within Each Level of Sex
sex = 1
ed_level Frequency Percent Cumulative Frequency Cumulative Percent
1 4 7.14 4 7.14
2 7 12.50 11 19.64
3 12 21.43 23 41.07
4 20 35.71 43 76.79
5 13 23.21 56 100.00

and one for females (sex = 2):

Frequency Count of Education Level within Each Level of Sex
sex = 2
ed_level Frequency Percent Cumulative Frequency Cumulative Percent
1 7 1.20 7 1.20
2 22 3.78 29 4.98
3 220 37.80 249 42.78
4 229 39.35 478 82.13
5 104 17.87 582 100.00

So far, in each of the examples we have looked at, no missing values existed. When they do exist, SAS by default excludes them from your requested frequency tables. Instead, SAS prints the "Frequency Missing" below each table. You can instead opt to use the MISSING tables option, in which you tell SAS to instead treat missing values as nonmissing values, and to therefore include them in the calculation of percentages and other statistics. Or you can opt to use the MISSPRINT option to tell SAS to treat missing values as nonmissing values when printing the frequencies but do not include them in the calculation of the statistics.

Example 12.5

The following SAS program illustrates the MISSING and MISSPRINT options on the variable state in the background data set:

 PROC FREQ data=icdb.back;
    title 'One-way Table of State: with MISSING Option';
    tables state/missing;
 RUN;

 PROC FREQ data=icdb.back;
    title 'One-way Table of State: with MISSPRINT Option';
    tables state/missprint;
 RUN;

Launch and run the SAS program, and review the resulting output. The first few rows of output from the FREQ procedure with the MISSING tables option should look something like this:

One-way Table of State: with MISSING Option
The FREQ Procedure
sex Frequency Percent Cumulative Frequency Cumulative Percent
. 42 6.58 42 6.58
1 4 0.63 46 7.21
3 1 0.16 47 7.37
4 5 0.78 52 8.15

As you can see, the first row tells us that 42 subjects did not report the state in which they live. Because the MISSING option was used, SAS also tells us the 42 subjects comprise 6.58% of the subjects in the data set. SAS also includes the 42 subjects in the calculation of the cumulative percentage.

On the other hand, the first few rows of output from the FREQ procedure with the MISSPRINT tables option should look something like this:

One-way Table of State: with MISSPRINT Option
The FREQ Procedure
sex Frequency Percent Cumulative Frequency Cumulative Percent
. 42 . . .
1 4 0.67 4 0.67
3 1 0.17 5 0.84
4 5 0.84 10 1.68

As you can see, the first row again tells us that 42 subjects did not report the state in which they live. In this case, however, because the MISSPRINT option was specified, SAS stops there. That is, SAS does not include the subjects in any of its calculations of the percent, cumulative frequency or cumulative percent.


12.2 - Two-way and N-way Tables

12.2 - Two-way and N-way Tables

So far, we have only used the FREQ procedure to create one-way frequency tables. However, it is often helpful to create crosstabulation tables in which the frequencies are determined for more than one variable at a time. For example, if we were interested in knowing how the percentage of voters favoring Barack Obama differs for various age groups, we'd need to create a two-way crosstabulation table between the two categorical variables candidate (Obama, McCain) and agegroup (18-29, 30-44, 45-54, 55-70, more than 70). We'll investigate such tables in this section.

Example 12.6

The following FREQ procedure illustrates the simplest example of telling SAS to create a two-way table, for the variables sex and ed_level, with no bells and whistles added:

PROC FREQ data=icdb.back;
   title 'Crosstabulation of Education Level and Sex';
   tables ed_level*sex;
RUN;
Crosstabulation of Education Level and RACE
The FREQ Procedure
table of ed_level by sex
Frequency  
Percent  
Row Pct  
Col Pct 1 2 Total
1 4 7 11
0.63 1.10 1.72
36.36 63.64  
7.14 1.20  
2 7 22 29
1.10 3.45 4.55
24.14 75.86  
12.50 3.78  
3 12 220 232
1.88 34.48 36.36
5.17 94.83  
21.43 37.80  
4 20 229 249
3.13 35.89 39.03
8.03 91.97  
35.71 39.35  
5 13 104 117
2.04 16.30 18.34
11.11 88.89  
23.21 17.87  
Total 56 582 638
8.78 91.22 100.00

As you can see, to tell SAS to create a two-way table of ed_level and sex, we merely use an asterisk (*) to join the two variables in the TABLES statement.

Launch and run the SAS program. Review the output to convince yourself that SAS created the requested two-way table. In general, the values of the variable appearing before the asterisk form the rows of the table, and the values of the variable appearing after the asterisk form the columns of the table. In this case, since ed_level appears before the asterisk in the TABLES statement, its values form the rows of the table. And, since sex appears after the asterisk, its values form the columns of the table.

When SAS creates two-way tables, each cell of the table contains, by default, the cell frequency, the cell percentage of the total frequency, the cell percentage of the row frequency, and the cell percentage of the column frequency. You might want to review the numbers in each of the cells of the table to make sure you understand what is what. In the upper left-hand corner of the table, SAS always prints a guide to the numbers appearing in each of the cells in the table. Here, SAS tells us that the first number in cell (i,j) is the number of subjects of ed_level i and sex j; the second number in cell (i,j) is the percentage of subjects of ed_level i and sex j; the third number in cell (i,j) is the percentage of subjects who are sex j given that they are ed_level i; and the fourth number in cell (i,j) is the percentage of subjects who are ed_level i given that they are sex j. For example, for the cell in which ed_level = 4 and sex = 2, SAS tells us that: 229 of the subjects in the data set are ed_level 4 and sex 2; 35.89% of the subjects in the data set are ed_level 4 and sex 2; 91.97% of the subjects in the data set who are ed_level 4 are sex 2; and 39.35% of the subjects in the data set who are sex 2 are ed_level 4.

A little note about shortcuts. If you have many different two-way tables to create, you can use a variety of shortcuts. For example, the TABLES statement:

tables a*(b c);

tells SAS to create a two-way table between variables a and b (a*b) and a two-way table between variables a and c (a*c). The TABLES statement:

tables (a b)*(c d);

tells SAS to create four two-way tables, namely: a*c, b*c, a*d, and b*d. The TABLES statement:

tables (a b c)*d;

tells SAS to create three two-way tables, namely: a*d, b*d, and c*d.

Example 12.7

For a frequency analysis of more than two variables, we can use the FREQ procedure to create n-way crosstabulation tables. In that case, a series of two-way tables is created, with a table for each level of the other variable(s). The following program creates a three-way table of sex, job_chng, and ed_level:

PROC FREQ data=icdb.back;
   title '3-way Table of Sex, Job Change, and Ed. Level';
   tables sex*job_chng*ed_level;
RUN;

As you can see, to tell SAS to create a three-way table of sex, job_chng, and ed_level, we use an asterisk (*) to join the three variables in the TABLES statement. The order of the variables is important. In n-way tables, the last two variables of the TABLES statement become the rows and columns of the two-way tables. Variables that precede the last two variables in the TABLES statement stratify the crosstabulation tables. So, in this case, we should expect SAS to create two two-way tables of job_chng and ed_level, one for when sex = 1 and one for when sex = 2.

Launch and run the SAS program, and review the output to convince yourself that SAS created the three-way table as described. You should see that, indeed, SAS created one two-way table of job_chng and ed_level for when sex = 1:

Crosstabulation of Sex, Job Change, and Education Level
The FREQ Procedure
Table 1 of job_chng by ed_level
Controlling for sex=1

job_chng ed_level

Frequency  
Percent  
Row Pct  
Col Pct 1 2 3 4 5 Total
0 4 6 11 15 12 48
7.84 11.76 21.57 29.41 23.53 94.12
8.33 12.50 22.92 31.25 25.00  
100.00 100.00 100.00 88.24 92.31  
1 0 0 0 2 1 3
0.00 0.00 0.00 3.92 1.96 5.88
0.00 0.00 0.00 66.67 33.33  
0.00 0.00 0.00 11.76 7.69  
Total 4 6 11 17 13 51
7.84 11.76 21.57 33.33 25.49 100.00

Frequency Missing = 107

and one two-way table of job_chng and ed_level for when sex = 2:

table 2 of job_chng by ed_level
Controlling for sex=2

job_chng ed_level

Frequency  
Percent  
Row Pct  
Col Pct 1 2 3 4 5 Total
0 4 14 148 163 74 403
0.84 2.95 31.16 34.32 15.58 84.84
0.99 3.47 36.72 40.45 18.36  
80.00 82.35 85.06 83.16 89.16  
1 1 3 26 33 9 72
0.21 0.63 5.47 6.95 1.89 15.16
1.39 4.17 36.11 45.83 12.50  
20.00 17.65 14.94 16.84 10.84  
Total 5 17 174 196 83 475
1.05 3.58 36.63 41.26 17.47 100.00

Frequency Missing = 107

It probably goes without saying that, in general, n-way tables can generate lots of output.


12.3 - Cell Statistics

12.3 - Cell Statistics

As you now know, when SAS creates two-way tables, each cell of the table contains, by default, the cell frequency, the (joint) cell percentage, the conditional row percentage, and the conditional column percentage. As usual, you don't have to accept the default ... you can tell SAS to suppress some of the default statistics it displays ... or you can tell SAS to display alternative statistics.

For n-way crosstabulations, you can suppress the default output in each of the cells by using any (or all) of the following table options:

  • NOFREQ suppresses the printing of the cell freqencies
  • NOROW suppresses the printing of the row percentages
  • NOCOL suppresses the printing of the column percentages, and
  • NOPERCENT suppresses the printing of the (joint) cell percentages

Example 12.8

The following SAS program illustrates the NOROW, NOCOL, and NOPERCENT table options:

 PROC FREQ data=icdb.back;
    title 'Crosstabulation of SEX and RACE: No percents';
    tables race*sex/norow nocol nopercent;
 RUN;
Crosstabulation of SEX and RACE: No percents
The FREQ Procedure
Table of race by sex
race sex
Frequency 1 2 Total
1 0 0 2
2 2 3 7
3 1 28 29
4 51 542 593
5 0 3 3
6 0 2 2
7 1 0 1
8 0 1 1
Total 56 582 638

Launch and run the SAS program. Review the output to convince yourself that SAS did indeed suppress the default row percentages, column percentages, and cell percentages. Note, too, that SAS changed the guide in the upper left-hand corner of the table to reflect the new situation — the guide now tells us that the lone number reported in each cell is the cell frequency.

Instead of suppressing output in an n-way table, we can request additional output in each of the cells by using the following options:

  • EXPECTED, to print the expected cell frequencies under the null hypothesis of independence
  • DEVIATION, to print the deviation of the cell frequency from the expected cell frequency (under the null hypothesis of independence)
  • CELLCHI2, to print each cell's contribution to the total chi-squared statistic

That is, CELLCHI2 is defined as:

\(\text{CELLCHI2} = \dfrac{\left(\text{frequency} - \text{expected}\right)^2}{\text{expected}}\)

Example 12.9

In creating the two-way table between race and sex, the following FREQ procedure requests that the EXPECTED and CELLCHI2 statistics be printed, while at the same time suppressing the printing of the joint, row, and column percentages:

 PROC FREQ data=icdb.back;
   title 'Crosstabulation of SEX and RACE: With Expecteds';
   tables race*sex/expected cellchi2 norow nocol nopercent;
 RUN;
Table of race by sex
race sex
Frequency  
Expected  
Cell Chi-Square 1 2 Total
1 0 2 2
0.1755 1.8245  
0.1755 0.0169  
2 3 4 7
0.6144 6.3856  
9.2624 0.8912  
3 1 28 29
2.5455 26.455  
0.9383 0.0903  
4 51 542 593
52.05 540.95  
0.0212 0.002  
5 0 3 3
0.2633 2.7367  
0.2633 0.0253  
6 0 2 2
0.1755 1.8245  
0.1755 0.0169  
7 0 2 2
0.0878 0.9122  
9.4806 09.122  
8 0 1 1
0.0878 0.9122  
0.0878 0.0084  
Total 56 582 638

Launch and run the SAS program. Review the output to convince yourself that SAS did indeed suppress the printing of the joint, row, and column percentages, while adding the expected cell frequencies, as well as the cell's contribution to the chi-square statistic, to the cell output. Note, again, that SAS changed the guide in the upper left-hand corner of the table to reflect the new situation — the guide now tells us that the second number in each cell is the expected cell frequency and the third number is the cell's contribution to the chi-square statistic.


12.4 - Changing the Table Format

12.4 - Changing the Table Format

When three or more variables are specified, the multiple levels of n-way tables can produce lots of output. Such bulky and complex crosstabulations are often easier to read as a continuous list. To tell SAS to generate list output for crosstabulations rather than the default tables, simply use the LIST tables option.

Example 12.10

The following FREQ procedure illustrates the LIST option:

PROC FREQ data=icdb.back;
    title 'Crosstabulation of SEX and RACE: In List Format';
    tables sex*race/list;
 RUN;
Crosstabulation of SEX and RACE: in List Format
The FREQ Procedure
sex race Frequency Percent Cumulative Frequency Cumulative Percent
1 2 3 0.47 3 0.47
1 3 1 0.16 4 0.63
1 4 51 7.99 55 8.62
1 7 1 0.16 56 8.78
2 1 2 0.31 58 9.09
2 2 4 0.63 62 9.72
2 3 28 4.39 90 14.11
2 4 542 84.95 632 99.06
2 5 3 0.47 635 99.53
2 6 2 0.31 637 99.84
2 8 1 0.16 638 100.00

Launch and run the SAS program. Review the output to convince yourself that rather than creating a two-way table, SAS prints the requested two-way table as a list. I tend to prefer the LIST format for such cross-tabulations, as long as I'm not in need of the (conditional) row and column percentages (which, as you can see, are not reported in LIST format).

Incidentally, the LIST option cannot be used when the TABLES statement requests statistical tests or measures of association to be included in the output. (We'll be discussing such tests and measures later in this lesson.) Also, expected cell frequencies are not printed when the LIST option is specified, even if the EXPECTED option is invoked.

An alternative to the LIST option is the CROSSLIST table option. When you add the CROSSLIST option to your TABLES statement, SAS displays crosstabulation tables in the standard Output Delivery System (ODS) column format. The benefit of doing so probably won't make much sense until we learn about ODS in Stat 481. The option creates a table that has a table definition that you can customize by using the TEMPLATE procedure. For now, let's just be satisfied with looking at an example to see what the output looks like when you use the CROSSLIST option.

Example 12.11

The following FREQ procedure illustrates the CROSSLIST option:

 PROC FREQ data=icdb.back;
    title 'Crosstabulation of SEX and RACE: In Crosslist Format';
    tables sex*race/crosslist;
 RUN;
sex race Frequency Percent Row Percent Column Percent
1 1 0 0.00 0.00 0.00
2 3 0.47 5.36 42.86
3 1 0.16 1.79 3.45
4 51 7.99 91.07 8.60
5 0 0.00 0.00 0.00
6 0 0.00 0.00 0.00
7 1 0.16 1.79 100.00
8 0 0.00 0.00 0.00
Total 56 8.78 100.00  
-------------------------------------------------------------------
2 1 2 0.31 0.34 100.00
2 4 0.63 0.69 57.14
3 28 4.36 4.81 96.55
4 542 84.95 93.13 91.40
5 3 0.47 0.52 100.00
6 2 0.31 0.34 100.00
7 0 0.00 0.00 0.00
8 1 0.16 0.17 100.00
Total 582 91.22 100.00  
-------------------------------------------------------------------
Total 1 2 0.31   100.00
2 7 1.10   100.00
3 29 4.55   100.00
4 593 92.95   100.00
5 3 0.47   100.00
6 2 0.31   100.00
7 1 0.16   100.00
8 1 0.16   100.00
Total 638 100.00    

Launch and run the SAS program, and review the output to familiarize yourself with the appearance of the output when you use the CROSSLIST option.


12.5 - Creating Output Data Sets

12.5 - Creating Output Data Sets

Just as the UNIVARIATE, MEANS and SUMMARY procedures can be used to create new SAS data sets containing summary statistics of numeric variables, the FREQ procedure can be used to create new SAS data sets containing summary statistics of categorical variables. This can be a very useful tool when manipulating and subsequently analyzing data.

Example 12.12

The following FREQ procedure tells SAS to create an output data set which contains the counts and percentages for each combination of the variables sex and race:

 PROC FREQ data=icdb.back;
     tables sex*race/out=sexfreq noprint;
 RUN;
 
 PROC PRINT;
    title 'Output Dataset: SEXFREQ';
 RUN;
Output Dataset: SEXFREQ
Obs sex race COUNT PERCENT
1 1 2 3 0.4702
2 1 3 1 0.1567
3 1 4 51 7.9937
4 1 7 1 0.1567
5 2 1 2 0.3135
6 2 2 4 0.6270
7 2 3 28 4.3887
8 2 4 542 84.9530
9 2 5 3 0.4702
10 2 6 2 0.3135
11 2 8 1 0.1567

The NOPRINT tables option tells SAS to suppress printing the two-way crosstabulation of sex*race. Instead, the results are dumped to the temporary data set sexfreq that is specified in the OUT= tables option. The data set in the OUT= option becomes the current data set. Therefore, since the PRINT procedure does not have a DATA= option to tell SAS the name of the data set to print, SAS prints the current data set, namely sexfreq.

Launch and run the SAS program. Review the output to convince yourself that the sexfreq data set contains one record for each combination of sex and race that occurs in the data set. And, note that the data set contains four variables, sex, race, COUNT (whose value is the number of subjects in the data set of that sex and race), and PERCENT (whose value is the percentage of subjects in the data set of that sex and race).

Example 12.13

The SPARSE tables option tells SAS to print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data. This option affects printouts under the LIST option and output data sets only. The following SAS code illustrates use of the SPARSE option in the creation of an output data set called sexfreq:

PROC FREQ data=icdb.back;
     tables sex*race/out=sexfreq noprint sparse;
 RUN;
 
 PROC PRINT;
    title 'Output Dataset: SEXFREQ with SPARSE option';
 RUN;
Output Dataset: SEXFREQ with SPARSE option
Obs sex race COUNT PERCENT
1 1 1 0 0.0000
2 1 2 3 0.4702
3 1 3 1 0.1567
4 1 4 51 7.9937
5 1 5 0 0.0000
6 1 6 0 0.0000
7 1 7 1 0.1567
8 1 8 0 0.0000
9 2 1 2 0.3135
10 2 2 4 0.6270
11 2 3 28 4.3887
12 2 4 542 84.9530
13 2 5 3 0.4702
14 2 6 2 0.3135
15 2 7 0 0.0000
16 2 8 1 0.1567

The only difference between this code and the previous code is the inclusion of the SPARSE option. Launch and run the SAS program. Review the output to convince yourself that the sexfreq data set now contains one record for each possible combination of sex and race regardless of whether or not the combination occurs in the data set. That is, since there are 2 sexes and 8 races, the sexfreq data set contains 16 records — one for each of the combinations. For example, there is now a record in the sexfreq data set corresponding to sex = 1 and race = 1 even though there are no subjects in the data set of that sex and race combination (COUNT = 0).


12.6 - Table Statistics

12.6 - Table Statistics

There are a few possible statistical table options available in the FREQ procedure. Unfortunately, we don't have the luxury of digressing and discussing in this course when it is appropriate to use each of the statistics. We merely make ourselves aware of their existence in SAS should we need them in analyzing our data in the future. The options include:

  • CHISQ, which requests a chi-square test of homogeneity or independence. Reported measures include the phi coefficient, the contingency coefficient, and Cramer's V. For 2×2 tables, Fisher's exact test is also included.
  • MEASURES, which requests a basic set of measures of association and their standard errors. Measures include Pearson and Spearman correlation coefficients, gamma, Kendall's tau-b, Stuart's tau-c, Somer's D, lambda, and uncertainty coefficients. For 2×2 tables, odds ratios and risk ratios (with their confidence intervals) are reported.
  • CMH, which requests Cochran-Mantel-Haenszel statistsics, which test for association between a row and column variable after adjusting for all of the other variables in the tables statement. For 2×2 tables, Breslow's test for homogeneity of odds ratios is reported as well.
  • ALL, which requests all of the tests and measures provided by the CHISQ, MEASURE, and CMH options.
  • EXACT, which requests Fisher's Exact Test for tables larger than 2×2 tables. Note that this can take painfully long to compute for large tables.

Of course, you can always refer to the SAS Help and Documentation for the technical details of the statistical tests and measures.

Example 12.14

For the remaining examples in this lesson, we need an analysis data set with which to work. The analysis data set (click to save!) that we'll use contains (just!) four variables pulled from the ICDB background data set (with which we've already been working!), the ICDB cystoscopy data set and the ICDB symptoms data set. The following program merely displays the variable names of, and the first 15 observations in, the analysis data set:

 OPTIONS NOFMTERR;

 PROC CONTENTS data = icdb.analysis position;
   title 'The Analysis data set';
 RUN;

 PROC PRINT data = icdb.analysis (OBS = 15);
   title 'The Analysis data set';
 RUN;

Launch and run the SAS program. From the CONTENTS procedure output, you should see that the analysis data set contains 638 observations and the following four variables:

Variables in Creation Order
# Variable Type Len Format Informat Label
1 subj Num 8 11. 11. patient ID number
2 CYST_HB Num 8      
3 SYM_1 Num 8 SYMPTSEV. 4. severity of urinary symptoms
4 ctr Num 8      

Besides the variable names, this list tells us that the sym_1 variable has a permanent format called symptsev. associated with it. You might recall that if we don't have access to the permanent formats catalog — which we don't — then we'll run into trouble when trying to use the sym_1 variable. That's why we specified the NOFMTERR option before trying to use the analysis data set. The NOFMTERR option allows us to use the analysis data set. We just can't take advantage of its variables' permanent formats.

Perhaps you'd appreciate an explanation of what the four variables are? Here goes:

  • subj: subject ID number
  • ctr: clinical center number (1, 2, 3, 4, 5)
  • cyst_hb: indicates whether or not a hydrodistention and/or biopsy was performed (0 = a local cystoscopy with no hydrodistention or biopsy, 1 = hydrodistention and biopsy, and 2 = hydrodistention only)
  • sym_1: which indicates the subject's general assessment of the severity of his/her urinary symptoms (1 = not severe at all to 5 = extremely severe)

It might also help to review the first 15 observations just to get a feel for what the data look like:

The Analysis data set
Obs subj CYST_HB SYM_1 ctr
1 110027 1 4 1
2 110029 . 4 1
3 110039 . 3 1
4 110040 . 3 1
5 110045 . 3 1
6 11049 . 3 1
7 110051 . 4 1
8 110052 1 4 1
9 110053 . 3 1
10 110055 . 3 1
11 110057 . . 1
12 110058 1 4 1
13 110059 . 4 1
14 110060 . 3 1
15 110062 1 4 1

Now, as promised, let's use the analysis data set to illustrate some of the statistical tests and measures of association you can request from SAS while using the FREQ procedure.

Example 12.15

Some clinical centers may be more likely to perform a cystoscopy (which is a fairly invasive procedure) on their patients than other clinical centers. One could imagine this happening for a variety of reasons. For instance, one clinical center might have more severe patients thereby justifying more invasive procedures. Or perhaps the patients attending a particular clinical center might be better off financially and therefore more willing to pay for additional procedures. In any case, suppose we are interested in testing whether or not there is an association between performing a cystoscopy (cyst_hb) and clinical center (ctr). A chi-square test between the two variables would help us answer our research question. The following FREQ procedure illustrates the use of the CHISQ tables option in order to obtain the value of the chi-square test statistic and its associated P-value (as well as a few other useful statistics and P-values):

 PROC FORMAT;
    value cystfmt 0     = 'Local'
                  1     = 'Both'
                  2     = 'Hydro'
                  OTHER = 'Nothing';
 RUN;

 PROC FREQ data=icdb.analysis;
   title 'Chi-square Test of Hospital and Cystoscopy Procedure: CHISQ';
   format cyst_hb cystfmt.;
   tables ctr*cyst_hb/nopercent nocol missing chisq;
 RUN;
Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES
The FREQ Procedure
Table of ctr by CYST_HB
ctr CYST_HB
Frequency  
Row Pct Nothing Local Both Hydro Total
1 96 0 10 0 106
90.57 0.00 9.43 0.00  
2 77 0 14 0 52
48.73 5.70 44.30 1.27  
3 38 0 14 0 52
73.08 0.00 26.92 0.00  
4 101 16 65 5 187
54.01 8.56 34.76 2.67  
5 86 1 41 7 135
63.70 0.74 30.37 5.19  
Total 398 26 200 14 638
Statistics for Table of ctr by CYST_HB
Statistics DF Value Prob
Chi-Square 12 77.2068 <.001
Likelihood Ratio Chi-Square 12 89.2250 <.001
Mantel-Haenszel Chi-Square 1 . .
Phi Coefficient   0.3479  
contingency Coefficient   0.3286  
Cramer's V   0.2008  

WARNING: 35% of the cell's have expected counts less then 5. Chi-square may not be a valid test

Sample Size 638

Of course, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the cyst_hb variables more meaningful to us.

The TABLES statement first requests a two-way table between the (row) variable ctr and the (column) variable cyst_hb. Because we are interested in including the missing values ("Nothing") in the analysis, we include the MISSING tables option. When doing this kind of analysis, I like to use the NOPERCENT and NOCOL tables options to help declutter the two-way frequency table. Finally, the CHISQ tables option tells SAS to calculate the chi-square statistics and its P-value (as well as a few other statistics).

Launch and run the SAS program, and review the resulting output. You should first see a two-way table (with five rows for ctr and four columns for cyst_hb). Then, you should see a list of six different statistics, of which the first one is the chi-square test statistic. Here, the value of the statistic is 77.2 (rounded) with 12 degrees of freedom (DF) and a P-value that is less than 0.0001. (Assuming that there is enough non-missing data and therefore that the chi-square test is valid), the P-value tells us that it is highly unlikely that we would observe the data we did under the assumption that ctr and cyst_hb are not associated. Therefore, we conclude that ctr and cyst_hb are associated.

Example 12.16

If we are interested in quantifying the association between clinical center and performing a cystoscopy, then we would want to take advantage of the MEASURES tables option. The following FREQ procedure illustrates the use of the MEASURES tables option to obtain a basic set of measures of association and their standard errors:

PROC FORMAT;
    value cystfmt 0     = 'Local'
                  1     = 'Both'
                  2     = 'Hydro'
                  OTHER = 'Nothing';
 RUN;

PROC FREQ data=icdb.analysis;
    title 'Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES';
	format cyst_hb cystfmt.;
    tables ctr*cyst_hb/nopercent nocol missing measures;
 RUN;

Again, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the cyst_hb variables more meaningful to us.

Launch and run the SAS program, and review the resulting output. You should again first see a two-way table (with five rows for ctr and four columns for cyst_hb):

Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES
The FREQ Procedure
Table of ctr by CYST_HB
ctr CYST_HB
Frequency  
Row Pct Nothing Local Both Hydro Total
1 4 7     11
7.14 1.20      
2 7 22     29
12.50 3.78      
3 12 220     232
21.43 37.80      
4 20 229     249
35.71 39.35      
5 13 104     117
23.21 17.87      
Total 8.78 91.22     100.00

Then, you should see a list of thirteen different statistics lumped into five groups:

Statistics for Table of ctr by CYST_HB
Statistic Value ASE
Gama 0.1697 0.0534
Kendall's Tau-c 0.1071 0.0339
Stuart's Tau-c 0.0897 0.0286
     
Somers' D C|R 0.0870 0.0276
Somers' D R|C 0.1318 0.0417
     
Pearson Correlation . .
Spearman Correlation 0.1206 0.0381
     
Lambda Asymmetric C|R 0.0000 0.0000
Lambda Asymmetric R|C 0.0155 0.0267
Lambda Symmetric 0.0101 0.0175
     
Uncertainty Coefficient C|R 0.0802 0.0133
Uncertainty Coefficient R|C 0.0455 0.0079
Uncertainty Coefficient Symmetric 0.0581 0.0099

The first statistic is Gamma, the second is Kendall's Tau-b, ..., and the last is Uncertainty Coefficient Symmetric. The column labeled "Value" is the value calculated for the given statistic for this data set, and the column labeled "ASE" is the calculated (asymptotic) standard error of the statistic.

Example 12.17

The Cochran-Mantel-Haenszel test allows us to test for the association between two categorical variables while adjusting for a third categorical variable. To request that SAS performs such a test, we must use the CMH tables option. The following CMH tables option requests the Cochran-Mantel-Haenszel to test for association between the two variables cyst_hb and sym_1 while adjusting for the third variable ctr:

OPTIONS NOFMTERR;

PROC FORMAT;
    value cystfmt 0     = 'Local'
                  1     = 'Both'
                  2     = 'Hydro'
                  OTHER = 'Nothing';
 RUN;

PROC FREQ data=icdb.analysis;
   title 'Chi-square Test of Hospital and Cystoscopy Procedure: CMH';
   title2 'Adjusting for Ctr';
   format cyst_hb cystfmt.;
   tables ctr*cyst_hb*sym_1/nopercent nocol cmh;
 RUN;

As always, we put the stratifying variable — in this case, ctr — in the first position of the tables statement. Then, we put the two variables between which we are testing for association — in this case, cyst_hb and sym_1 — in the second and third positions. Again, the NOPERCENT and NOCOL tables options are used to help declutter the resulting frequency tables. Finally, the CMH tables option gives us the Cochran-Mantel-Haenszel test for association.

Launch and run the SAS program, and review the resulting output. You should first see a set of five two-way tables (one for each level of ctr). Here's what the table looks like for ctr = 2:

Table 2 of CYST_HB by SYM_1
Controlling for ctr=2
CYST_HB SYM_1 (severity of urinary symptoms)
Frequency  
Row Pct 1 2 3 4 5 Total
Local 1 2 2 3 1 9
11.11 22.22 22.22 33.33 11.11  
Both 1 5 22 30 12 70
1.43 7.14 31.43 42.86 17.14  
Hydro 0 0 0 2 0 2
0.00 0.00 0.00 100.00 0.00  
Total 2 7 24 35 13 81

Then, you should see a set of three Cochran-Mantel-Haenszel statistics — the first one is labeled Nonzero Correlation, the second one is labeled Row Mean Scores Differ, and the third one is labeled General Association:

Chi-square test of Hospital and Cystoscopy Procedure: CMH
Adjusting for Ctr
The FREQ Procedure
Summary Statistics for CYST_HB by SYM_1
Controlling for ctr
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistics Alternative Hypothesis DF Value Prob
1 Nonzero Correlation 1 2.6013 0.1068
2 Row Mean Scores Differ 2 2.9560 0.2281
3 General Association 8 6.9245 0.5448

The column labeled "DF" contains the given statistic's degrees of freedom, the column labeled "Value" contains the value calculated for the given statistic for the data set, and the column labeled "Prob" contains the statistic's P-value.


12.7 - Summary

12.7 - Summary

In this lesson, we learned about various features and options of the FREQ procedure.

The homework for this lesson will give you practice using these features and options.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility