Lesson 12: Summarizing Categorical Data
Lesson 12: Summarizing Categorical DataOverview
In this lesson, we'll investigate the FREQ procedure as a tool for summarizing and analyzing categorical data. The procedure is a descriptive procedure, as well as a statistical procedure. It allows you to produce oneway to nway frequency and crosstabulation tables. For twoway tables, the FREQ procedure also computes chisquare tests and measures of association. And, for nway tables, the FREQ procedure also performs stratified analyses, computing statistics within and across strata. The FREQ procedure can also output summary statistics, such as counts and percentages, to a SAS data set.
Objectives
Upon completing this lesson, you should be able to use the FREQ procedure to summarize a data set numerically in a variety of ways, including:
 create simple oneway, twoway, ... and nway table summaries
 suppress the printing of cumulative statistics in a table using the NOCUM option
 print only one table per page using the PAGE option
 read the values from a twoway table created by the FREQ procedure
 create twoway (and in general, nway) tables using the available shortcuts
 suppress some of the default output in each of the cells of an nway table using the NOROW, NOCOL, and NOPERCENT options
 request additional output in each of the cells of an nway table, such as EXPECTED, DEVIATION, and CELLCHI2
 print nway tables in a list format rather than as crosstabulation tables using the LIST and CROSSLIST tables options
 perform an operation for each level of the BY group using a BY statement
 treat missing values as nonmissing values, and therefore include them in the calculation of the statistics using the MISSING tables option
 treat missing values as nonmissing values when printing the frequencies, but do not include them in the calculation of the statistics using the MISSPRINT option
 create new SAS data sets containing summary statistics of categorical variables using the FREQ procedure
 suppress printing the nway crosstabulation using the NOPRINT tables option
 print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data using the SPARSE tables option
 invoke statistics table options, such as CHISQ, MEASURES, CMH, ALL, and EXACT
12.1  A Basic OneWay Table
12.1  A Basic OneWay TableBy default, the FREQ procedure creates a oneway table that contains the frequency, percent, cumulative frequency, and cumulative percent of every value of every variable in the input data set. Every is italicized with good reason ... the FREQ procedure doesn't care whether the variable is a character variable or a numeric variable. And, if a variable is numeric, the FREQ procedure doesn't care if it is a discrete numeric variable with just a few possible outcomes (number of siblings, say) or a continuous numeric variable with an infinite number of possible outcomes (weight, say). That means then if you rely on the default version of the FREQ procedure, it is possible to create lots and lots and lots of output. That's why we'll skip the default version and will jump right to the more practical version in which you restrict the number of tables SAS creates by using a TABLES statement.
The FREQ procedure takes the following generic form:
PROC FREQ options;
tables ... /options;
RUN;
The TABLES statement tells SAS the specific frequency table(s) that you want to create. If you don't include a TABLES statement, then SAS creates a oneway frequency table for every variable in your input data set.
As you can see, there are two types of options, namely procedure options and table options. Procedure options, such as the typical "DATA=" option, must follow the PROC FREQ statement. Table options must be specified after a forward slash (/) in the TABLES statement. In either case, you can specify as many options as you would like.
Throughout this lesson, we'll use the ICDB background data set to illustrate the FREQ procedure. Rightclick the link to save the data set to a convenient location on your computer.
Example 12.1
The following FREQ procedure illustrates the simplest practical example, namely a oneway frequency table of the variable sex, with no bells or whistles added:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
LIBNAME icdb 'C:\simon\icdb\data';
PROC FREQ data=icdb.back;
title 'Frequency Count of SEX';
tables sex;
RUN;
Launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the background data set. Then, run the program and review the output. You should see something along the lines of this basic oneway frequency table, in which, as promised, SAS reports the frequency, percent, cumulative frequency, and cumulative percent of each value of the sex variable:
sex  Frequency  Percent  Cumulative Frequency  Cumulative Percent 

1  56  8.78  56  8.78 
2  582  91.22  638  100.00 
This output tells us, for example, that 56 or 8.78% of the subjects in the ICDB Study are male (coded as sex = 1).
Example 12.2
Again, by default, SAS outputs frequency counts, percents, cumulative frequencies, and cumulative percents. The NOCUM table option suppresses the printing of the cumulative frequencies and cumulative percentages for oneway frequency tables. The following SAS code illustrates the NOCUM table option:
PROC FREQ data=icdb.back;
title 'Frequency Count of SEX: No Cumulative Stats';
tables sex/nocum;
RUN;
sex  Frequency  Percent 

1  56  8.78 
2  528  91.22 
Launch and run the SAS program. Review the output to convince yourself that indeed the cumulative frequencies and cumulative percentages are not printed in the table. The table contains only the number and percentage of each sex.
In any FREQ procedure, you can specify many variables in a TABLES statement. If the list is long, you may be able to use a shortcut to specify the list of variables. If you specify a TABLES statement using a numbered range of variables, such as:
tables var1var4;
then SAS will create a oneway frequency table for the four variables named var1, var2, var3, and var4. If instead in your TABLES statement, you specify a range of variables by their position in the data set, such as:
tables sexrace;
then SAS will create a oneway frequency table for every variable that appears between the sex and race variables in the data set, namely in the case of the background data set, sex, state, country, and race. Recall that if you're not sure of the position of the variables in your data set, you can use the VARNUM option of the CONTENTS procedure to determine the position of the variables in a data set. (Incidentally, that is not a typo in the second TABLES statement ... it takes two dashes to specify a range of variables by their position in the data set.)
Rather than specifying many variables in a TABLES statement, you can specify many TABLES statements in a FREQ procedure. However you tell SAS to make multiple tables, you can use the PAGE option to tell SAS to print only one table per page. Otherwise, the FREQ procedure prints multiple tables per page as space permits.
Example 12.3
The following SAS program illustrates the creation of two oneway frequency tables in conjunction with the PAGE option:
PROC FREQ data=icdb.back page;
title 'Frequency Count of SEX and RACE';
tables sex race;
RUN;
Launch and run the SAS program. Review the output to convince yourself that indeed SAS creates two oneway frequency tables — one for the categorical variable sex and the other for the categorical variable race. Because the PAGE option was invoked, each table should be printed on a separate page. The first page should contain the frequency table for the sex variable:
sex  Frequency  Percent  Cumulative Frequency  Cumulative Percent 

1  56  8.78  56  8.78 
2  528  91.22  638  100.00 
and the second page should contain the frequency table for the race variable:
race  Frequency  Percent  Cumulative Frequency  Cumulative Percent 

1  2  0.31  2  0.31 
2  7  1.10  9  1.41 
3  29  4.55  38  5.96 
4  593  92.95  631  98.90 
5  3  0.47  634  99.37 
6  2  0.31  636  99.69 
7  1  0.16  637  99.87 
8  1  0.16  638  100.00 
Incidentally, you might also want to notice that, not surprisingly, the order in which the variables appear in the TABLES statement determines the order in which they appear in the output.
Example 12.4
As is the case for many SAS procedures, you can use a BY statement to tell SAS to perform an operation for each level of a BY group. The following program tells SAS to create a oneway frequency table for the variable ed_level for each level of the variable sex:
PROC SORT data=icbd.back out=s_back;
by sex;
RUN;
PROC FREQ data=s_back;
title 'Frequency Count of Education Level within Each Level of Sex';
tables ed_level;
by sex;
RUN;
As is always the case, the SORT procedure merely prepares the background data set for BYgroup processing. The SORT procedure tells SAS to sort the icdb.back data set by sex, and to store the results in a new data set called s_back. Then, as you can see, the FREQ procedure is invoked with a BY statement ("by sex") in addition to the TABLES statement ("tables ed_level"). Launch and run the SAS program. Review the output to convince yourself that SAS creates two oneway frequency tables of education level (ed_level) — one for males (sex= 1):
ed_level  Frequency  Percent  Cumulative Frequency  Cumulative Percent 

1  4  7.14  4  7.14 
2  7  12.50  11  19.64 
3  12  21.43  23  41.07 
4  20  35.71  43  76.79 
5  13  23.21  56  100.00 
and one for females (sex = 2):
ed_level  Frequency  Percent  Cumulative Frequency  Cumulative Percent 

1  7  1.20  7  1.20 
2  22  3.78  29  4.98 
3  220  37.80  249  42.78 
4  229  39.35  478  82.13 
5  104  17.87  582  100.00 
So far, in each of the examples we have looked at, no missing values exist. When they do exist, SAS by default excludes them from your requested frequency tables. Instead, SAS prints the "Frequency Missing" below each table. You can instead opt to use the MISSING tables option, in which you tell SAS to treat missing values as nonmissing values and to therefore include them in the calculation of percentages and other statistics. Or you can opt to use the MISSPRINT option to tell SAS to treat missing values as nonmissing values when printing the frequencies but do not include them in the calculation of the statistics.
Example 12.5
The following SAS program illustrates the MISSING and MISSPRINT options on the variable state in the background data set:
PROC FREQ data=icdb.back;
title 'Oneway Table of State: with MISSING Option';
tables state/missing;
RUN;
PROC FREQ data=icdb.back;
title 'Oneway Table of State: with MISSPRINT Option';
tables state/missprint;
RUN;
Launch and run the SAS program, and review the resulting output. The first few rows of output from the FREQ procedure with the MISSING tables option should look something like this:
sex  Frequency  Percent  Cumulative Frequency  Cumulative Percent 

.  42  6.58  42  6.58 
1  4  0.63  46  7.21 
3  1  0.16  47  7.37 
4  5  0.78  52  8.15 
As you can see, the first row tells us that 42 subjects did not report the state in which they live. Because the MISSING option was used, SAS also tells us that 42 subjects comprise 6.58% of the subjects in the data set. SAS also includes the 42 subjects in the calculation of the cumulative percentage.
On the other hand, the first few rows of output from the FREQ procedure with the MISSPRINT tables option should look something like this:
sex  Frequency  Percent  Cumulative Frequency  Cumulative Percent 

.  42  .  .  . 
1  4  0.67  4  0.67 
3  1  0.17  5  0.84 
4  5  0.84  10  1.68 
As you can see, the first row again tells us that 42 subjects did not report the state in which they live. In this case, however, because the MISSPRINT option was specified, SAS stops there. That is, SAS does not include the subjects in any of its calculations of the percent, cumulative frequency, or cumulative percent.
12.2  Twoway and Nway Tables
12.2  Twoway and Nway TablesSo far, we have only used the FREQ procedure to create oneway frequency tables. However, it is often helpful to create crosstabulation tables in which the frequencies are determined for more than one variable at a time. For example, if we were interested in knowing how the percentage of voters favoring Barack Obama differs for various age groups, we'd need to create a twoway crosstabulation table between the two categorical variables candidate (Obama, McCain) and agegroup (1829, 3044, 4554, 5570, more than 70). We'll investigate such tables in this section.
Example 12.6
The following FREQ procedure illustrates the simplest example of telling SAS to create a twoway table, for the variables sex and ed_level, with no bells and whistles added:
PROC FREQ data=icdb.back;
title 'Crosstabulation of Education Level and Sex';
tables ed_level*sex;
RUN;


Table of ed_level by sex  

ed_level  sex  
1  2  Total  
1 


 
2 


 
3 


 
4 


 
5 


 
Total 



As you can see, to tell SAS to create a twoway table of ed_level and sex, we merely use an asterisk (*) to join the two variables in the TABLES statement.
Launch and run the SAS program. Review the output to convince yourself that SAS created the requested twoway table. In general, the values of the variable appearing before the asterisk form the rows of the table, and the values of the variable appearing after the asterisk form the columns of the table. In this case, since ed_level appears before the asterisk in the TABLES statement, its values form the rows of the table. And, since sex appears after the asterisk, its values form the columns of the table.
When SAS creates twoway tables, each cell of the table contains, by default, the cell frequency, the cell percentage of the total frequency, the cell percentage of the row frequency, and the cell percentage of the column frequency. You might want to review the numbers in each of the cells of the table to make sure you understand what is what. In the upper lefthand corner of the table, SAS always prints a guide to the numbers appearing in each of the cells in the table. Here, SAS tells us that the first number in cell (i,j) is the number of subjects of ed_level i and sex j; the second number in cell (i,j) is the percentage of subjects of ed_level i and sex j; the third number in cell (i,j) is the percentage of subjects who are sex j given that they are ed_level i; and the fourth number in cell (i,j) is the percentage of subjects who are ed_level i given that they are sex j. For example, for the cell in which ed_level = 4 and sex = 2, SAS tells us that: 229 of the subjects in the data set are ed_level 4 and sex 2; 35.89% of the subjects in the data set are ed_level 4 and sex 2; 91.97% of the subjects in the data set who are ed_level 4 are sex 2; and 39.35% of the subjects in the data set who are sex 2 are ed_level 4.
A little note about shortcuts. If you have many different twoway tables to create, you can use a variety of shortcuts. For example, the TABLES statement:
tables a*(b c);
tells SAS to create a twoway table between variables a and b (a*b) and a twoway table between variables a and c (a*c). The TABLES statement:
tables (a b)*(c d);
tells SAS to create four twoway tables, namely: a*c, b*c, a*d, and b*d. The TABLES statement:
tables (a b c)*d;
tells SAS to create three twoway tables, namely: a*d, b*d, and c*d.
Example 12.7
For a frequency analysis of more than two variables, we can use the FREQ procedure to create nway crosstabulation tables. In that case, a series of twoway tables is created, with a table for each level of the other variable(s). The following program creates a threeway table of sex, job_chng, and ed_level:
PROC FREQ data=icdb.back;
title '3way Table of Sex, Job Change, and Ed. Level';
tables sex*job_chng*ed_level;
RUN;
As you can see, to tell SAS to create a threeway table of sex, job_chng, and ed_level, we use an asterisk (*) to join the three variables in the TABLES statement. The order of the variables is important. In nway tables, the last two variables of the TABLES statement become the rows and columns of the twoway tables. Variables that precede the last two variables in the TABLES statement stratify the crosstabulation tables. So, in this case, we should expect SAS to create two twoway tables of job_chng and ed_level, one for when sex = 1 and one for when sex = 2.
Launch and run the SAS program, and review the output to convince yourself that SAS created the threeway table as described. You should see that, indeed, SAS created one twoway table of job_chng and ed_level for when sex = 1:
Frequency  

Percent  
Row Pct  
Col Pct  1  2  3  4  5  Total 
0  4  6  11  15  12  48 
7.84  11.76  21.57  29.41  23.53  94.12  
8.33  12.50  22.92  31.25  25.00  
100.00  100.00  100.00  88.24  92.31  
1  0  0  0  2  1  3 
0.00  0.00  0.00  3.92  1.96  5.88  
0.00  0.00  0.00  66.67  33.33  
0.00  0.00  0.00  11.76  7.69  
Total  4  6  11  17  13  51 
7.84  11.76  21.57  33.33  25.49  100.00 
Frequency Missing = 107
and one twoway table of job_chng and ed_level for when sex = 2:
Frequency  

Percent  
Row Pct  
Col Pct  1  2  3  4  5  Total 
0  4  14  148  163  74  403 
0.84  2.95  31.16  34.32  15.58  84.84  
0.99  3.47  36.72  40.45  18.36  
80.00  82.35  85.06  83.16  89.16  
1  1  3  26  33  9  72 
0.21  0.63  5.47  6.95  1.89  15.16  
1.39  4.17  36.11  45.83  12.50  
20.00  17.65  14.94  16.84  10.84  
Total  5  17  174  196  83  475 
1.05  3.58  36.63  41.26  17.47  100.00 
Frequency Missing = 107
It probably goes without saying that, in general, nway tables can generate lots of output.
12.3  Cell Statistics
12.3  Cell StatisticsAs you now know, when SAS creates twoway tables, each cell of the table contains, by default, the cell frequency, the (joint) cell percentage, the conditional row percentage, and the conditional column percentage. As usual, you don't have to accept the default ... you can tell SAS to suppress some of the default statistics it displays ... or you can tell SAS to display alternative statistics.
For nway crosstabulations, you can suppress the default output in each of the cells by using any (or all) of the following table options:
 NOFREQ suppresses the printing of the cell frequencies
 NOROW suppresses the printing of the row percentages
 NOCOL suppresses the printing of the column percentages, and
 NOPERCENT suppresses the printing of the (joint) cell percentages
Example 12.8
The following SAS program illustrates the NOROW, NOCOL, and NOPERCENT table options:
PROC FREQ data=icdb.back;
title 'Crosstabulation of SEX and RACE: No percents';
tables race*sex/norow nocol nopercent;
RUN;
Frequency  sex  

race  1  2  Total 
1  0  0  2 
2  2  3  7 
3  1  28  29 
4  51  542  593 
5  0  3  3 
6  0  2  2 
7  1  0  1 
8  0  1  1 
Total  56  582  638 
Launch and run the SAS program. Review the output to convince yourself that SAS did indeed suppress the default row percentages, column percentages, and cell percentages. Note, too, that SAS changed the guide in the upper lefthand corner of the table to reflect the new situation — the guide now tells us that the lone number reported in each cell is the cell frequency.
Instead of suppressing output in an nway table, we can request additional output in each of the cells by using the following options:
 EXPECTED, to print the expected cell frequencies under the null hypothesis of independence
 DEVIATION, to print the deviation of the cell frequency from the expected cell frequency (under the null hypothesis of independence)
 CELLCHI2, to print each cell's contribution to the total chisquared statistic
That is, CELLCHI2 is defined as:
\(\text{CELLCHI2} = \dfrac{\left(\text{frequency}  \text{expected}\right)^2}{\text{expected}}\)
Example 12.9
In creating the twoway table between race and sex, the following FREQ procedure requests that the EXPECTED and CELLCHI2 statistics be printed, while at the same time suppressing the printing of the joint, row, and column percentages:
PROC FREQ data=icdb.back;
title 'Crosstabulation of SEX and RACE: With Expecteds';
tables race*sex/expected cellchi2 norow nocol nopercent;
RUN;
race  sex  

Frequency  
Expected  
Cell ChiSquare  1  2  Total 
1  0  2  2 
0.1755  1.8245  
0.1755  0.0169  
2  3  4  7 
0.6144  6.3856  
9.2624  0.8912  
3  1  28  29 
2.5455  26.455  
0.9383  0.0903  
4  51  542  593 
52.05  540.95  
0.0212  0.002  
5  0  3  3 
0.2633  2.7367  
0.2633  0.0253  
6  0  2  2 
0.1755  1.8245  
0.1755  0.0169  
7  0  2  2 
0.0878  0.9122  
9.4806  09.122  
8  0  1  1 
0.0878  0.9122  
0.0878  0.0084  
Total  56  582  638 
Launch and run the SAS program. Review the output to convince yourself that SAS did indeed suppress the printing of the joint, row, and column percentages, while adding the expected cell frequencies, as well as the cell's contribution to the chisquare statistic, to the cell output. Note, again, that SAS changed the guide in the upper lefthand corner of the table to reflect the new situation — the guide now tells us that the second number in each cell is the expected cell frequency and the third number is the cell's contribution to the chisquare statistic.
12.4  Changing the Table Format
12.4  Changing the Table FormatWhen three or more variables are specified, the multiple levels of nway tables can produce lots of output. Such bulky and complex crosstabulations are often easier to read as a continuous list. To tell SAS to generate list output for crosstabulations rather than the default tables, simply use the LIST tables option.
Example 12.10
The following FREQ procedure illustrates the LIST option:
PROC FREQ data=icdb.back;
title 'Crosstabulation of SEX and RACE: In List Format';
tables sex*race/list;
RUN;
sex  race  Frequency  Percent  Cumulative Frequency  Cumulative Percent 

1  2  3  0.47  3  0.47 
1  3  1  0.16  4  0.63 
1  4  51  7.99  55  8.62 
1  7  1  0.16  56  8.78 
2  1  2  0.31  58  9.09 
2  2  4  0.63  62  9.72 
2  3  28  4.39  90  14.11 
2  4  542  84.95  632  99.06 
2  5  3  0.47  635  99.53 
2  6  2  0.31  637  99.84 
2  8  1  0.16  638  100.00 
Launch and run the SAS program. Review the output to convince yourself that rather than creating a twoway table, SAS prints the requested twoway table as a list. I tend to prefer the LIST format for such crosstabulations, as long as I'm not in need of the (conditional) row and column percentages (which, as you can see, are not reported in LIST format).
Incidentally, the LIST option cannot be used when the TABLES statement requests statistical tests or measures of association to be included in the output. (We'll be discussing such tests and measures later in this lesson.) Also, expected cell frequencies are not printed when the LIST option is specified, even if the EXPECTED option is invoked.
An alternative to the LIST option is the CROSSLIST table option. When you add the CROSSLIST option to your TABLES statement, SAS displays crosstabulation tables in the standard Output Delivery System (ODS) column format. The benefit of doing so probably won't make much sense until we learn about ODS in Stat 481. The option creates a table that has a table definition that you can customize by using the TEMPLATE procedure. For now, let's just be satisfied with looking at an example to see what the output looks like when you use the CROSSLIST option.
Example 12.11
The following FREQ procedure illustrates the CROSSLIST option:
PROC FREQ data=icdb.back;
title 'Crosstabulation of SEX and RACE: In Crosslist Format';
tables sex*race/crosslist;
RUN;
Table of sex by race  

sex  race  Frequency  Percent  Row  Column 
1  1  0  0.00  0.00  0.00 
2  3  0.47  5.36  42.86  
3  1  0.16  1.79  3.45  
4  51  7.99  91.07  8.60  
5  0  0.00  0.00  0.00  
6  0  0.00  0.00  0.00  
7  1  0.16  1.79  100.00  
8  0  0.00  0.00  0.00  
Total  56  8.78  100.00  
2  1  2  0.31  0.34  100.00 
2  4  0.63  0.69  57.14  
3  28  4.39  4.81  96.55  
4  542  84.95  93.13  91.40  
5  3  0.47  0.52  100.00  
6  2  0.31  0.34  100.00  
7  0  0.00  0.00  0.00  
8  1  0.16  0.17  100.00  
Total  582  91.22  100.00  
Total  1  2  0.31  100.00  
2  7  1.10  100.00  
3  29  4.55  100.00  
4  593  92.95  100.00  
5  3  0.47  100.00  
6  2  0.31  100.00  
7  1  0.16  100.00  
8  1  0.16  100.00  
Total  638  100.00 
Launch and run the SAS program, and review the output to familiarize yourself with the appearance of the output when you use the CROSSLIST option.
12.5  Creating Output Data Sets
12.5  Creating Output Data SetsJust as the UNIVARIATE, MEANS and SUMMARY procedures can be used to create new SAS data sets containing summary statistics of numeric variables, the FREQ procedure can be used to create new SAS data sets containing summary statistics of categorical variables. This can be a very useful tool when manipulating and subsequently analyzing data.
Example 12.12
The following FREQ procedure tells SAS to create an output data set that contains the counts and percentages for each combination of the variables sex and race:
PROC FREQ data=icdb.back;
tables sex*race/out=sexfreq noprint;
RUN;
PROC PRINT;
title 'Output Dataset: SEXFREQ';
RUN;
Obs  sex  race  COUNT  PERCENT 

1  1  2  3  0.4702 
2  1  3  1  0.1567 
3  1  4  51  7.9937 
4  1  7  1  0.1567 
5  2  1  2  0.3135 
6  2  2  4  0.6270 
7  2  3  28  4.3887 
8  2  4  542  84.9530 
9  2  5  3  0.4702 
10  2  6  2  0.3135 
11  2  8  1  0.1567 
The NOPRINT tables option tells SAS to suppress printing the twoway crosstabulation of sex*race. Instead, the results are dumped to the temporary data set sexfreq that is specified in the OUT= tables option. The data set in the OUT= option becomes the current data set. Therefore, since the PRINT procedure does not have a DATA= option to tell SAS the name of the data set to print, SAS prints the current data set, namely sexfreq.
Launch and run the SAS program. Review the output to convince yourself that the sexfreq data set contains one record for each combination of sex and race that occurs in the data set. Note that the data set contains four variables, sex, race, COUNT (whose value is the number of subjects in the data set of that sex and race), and PERCENT (whose value is the percentage of subjects in the data set of that sex and race).
Example 12.13
The SPARSE tables option tells SAS to print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data. This option affects printouts under the LIST option and output data sets only. The following SAS code illustrates the use of the SPARSE option in the creation of an output data set called sexfreq:
PROC FREQ data=icdb.back;
tables sex*race/out=sexfreq noprint sparse;
RUN;
PROC PRINT;
title 'Output Dataset: SEXFREQ with SPARSE option';
RUN;
Obs  sex  race  COUNT  PERCENT 

1  1  1  0  0.0000 
2  1  2  3  0.4702 
3  1  3  1  0.1567 
4  1  4  51  7.9937 
5  1  5  0  0.0000 
6  1  6  0  0.0000 
7  1  7  1  0.1567 
8  1  8  0  0.0000 
9  2  1  2  0.3135 
10  2  2  4  0.6270 
11  2  3  28  4.3887 
12  2  4  542  84.9530 
13  2  5  3  0.4702 
14  2  6  2  0.3135 
15  2  7  0  0.0000 
16  2  8  1  0.1567 
The only difference between this code and the previous code is the inclusion of the SPARSE option. Launch and run the SAS program. Review the output to convince yourself that the sexfreq data set now contains one record for each possible combination of sex and race regardless of whether or not the combination occurs in the data set. That is, since there are 2 sexes and 8 races, the sexfreq data set contains 16 records — one for each of the combinations. For example, there is now a record in the sexfreq data set corresponding to sex = 1 and race = 1 even though there are no subjects in the data set of that sex and race combination (COUNT = 0).
12.6  Table Statistics
12.6  Table StatisticsThere are a few possible statistical table options available in the FREQ procedure. Unfortunately, we don't have the luxury of digressing and discussing in this course when it is appropriate to use each of the statistics. We merely make ourselves aware of their existence in SAS should we need them in analyzing our data in the future. The options include:
 CHISQ, which requests a chisquare test of homogeneity or independence. Reported measures include the phi coefficient, the contingency coefficient, and Cramer's V. For 2×2 tables, Fisher's exact test is also included.
 MEASURES, which requests a basic set of measures of association and their standard errors. Measures include Pearson and Spearman correlation coefficients, gamma, Kendall's taub, Stuart's tauc, Somer's D, lambda, and uncertainty coefficients. For 2×2 tables, odds ratios and risk ratios (with their confidence intervals) are reported.
 CMH, which requests CochranMantelHaenszel statistics, which test for association between a row and column variable after adjusting for all of the other variables in the tables statement. For 2×2 tables, Breslow's test for homogeneity of odds ratios is reported as well.
 ALL, which requests all of the tests and measures provided by the CHISQ, MEASURE, and CMH options.
 EXACT, which requests Fisher's Exact Test for tables larger than 2×2 tables. Note that this can take painfully long to compute for large tables.
Of course, you can always refer to the SAS Help and Documentation for the technical details of the statistical tests and measures.
Example 12.14
For the remaining examples in this lesson, we need an analysis data set with which to work. The analysis data set (click to save!) that we'll use contains (just!) four variables pulled from the ICDB background data set (with which we've already been working!), the ICDB cystoscopy data set, and the ICDB symptoms data set. The following program merely displays the variable names of, and the first 15 observations in, the analysis data set:
OPTIONS NOFMTERR;
PROC CONTENTS data = icdb.analysis position;
title 'The Analysis data set';
RUN;
PROC PRINT data = icdb.analysis (OBS = 15);
title 'The Analysis data set';
RUN;
Launch and run the SAS program. From the CONTENTS procedure output, you should see that the analysis data set contains 638 observations and the following four variables:
#  Variable  Type  Len  Format  Informat  Label 

1  subj  Num  8  11.  11.  patient ID number 
2  CYST_HB  Num  8  
3  SYM_1  Num  8  SYMPTSEV.  4.  severity of urinary symptoms 
4  ctr  Num  8 
Besides the variable names, this list tells us that the sym_1 variable has a permanent format called symptsev. associated with it. You might recall that if we don't have access to the permanent formats catalog — which we don't — then we'll run into trouble when trying to use the sym_1 variable. That's why we specified the NOFMTERR option before trying to use the analysis data set. The NOFMTERR option allows us to use the analysis data set. We just can't take advantage of its variables' permanent formats.
Perhaps you'd appreciate an explanation of what the four variables are:
 subj: subject ID number
 ctr: clinical center number (1, 2, 3, 4, 5)
 cyst_hb: indicates whether or not a hydrodistention and/or biopsy was performed (0 = a local cystoscopy with no hydrodistention or biopsy, 1 = hydrodistention and biopsy, and 2 = hydrodistention only)
 sym_1: which indicates the subject's general assessment of the severity of his/her urinary symptoms (1 = not severe at all to 5 = extremely severe)
It might also help to review the first 15 observations just to get a feel for what the data looks like:
Obs  subj  CYST_HB  SYM_1  ctr 

1  110027  1  4  1 
2  110029  .  4  1 
3  110039  .  3  1 
4  110040  .  3  1 
5  110045  .  3  1 
6  11049  .  3  1 
7  110051  .  4  1 
8  110052  1  4  1 
9  110053  .  3  1 
10  110055  .  3  1 
11  110057  .  .  1 
12  110058  1  4  1 
13  110059  .  4  1 
14  110060  .  3  1 
15  110062  1  4  1 
Now, as promised, let's use the analysis data set to illustrate some of the statistical tests and measures of association you can request from SAS while using the FREQ procedure.
Example 12.15
Some clinical centers may be more likely to perform a cystoscopy (which is a fairly invasive procedure) on their patients than other clinical centers. One could imagine this happening for a variety of reasons. For instance, one clinical center might have more severe patients thereby justifying more invasive procedures. Or perhaps the patients attending a particular clinical center might be better off financially and therefore more willing to pay for additional procedures. In any case, suppose we are interested in testing whether or not there is an association between performing a cystoscopy (cyst_hb) and clinical center (ctr). A chisquare test between the two variables would help us answer our research question. The following FREQ procedure illustrates the use of the CHISQ tables option in order to obtain the value of the chisquare test statistic and its associated Pvalue (as well as a few other useful statistics and Pvalues):
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chisquare Test of Hospital and Cystoscopy Procedure: CHISQ';
format cyst_hb cystfmt.;
tables ctr*cyst_hb/nopercent nocol missing chisq;
RUN;
ctr  CYST_HB  

Frequency  
Row Pct  Nothing  Local  Both  Hydro  Total 
1  96  0  10  0  106 
90.57  0.00  9.43  0.00  
2  77  0  14  0  52 
48.73  5.70  44.30  1.27  
3  38  0  14  0  52 
73.08  0.00  26.92  0.00  
4  101  16  65  5  187 
54.01  8.56  34.76  2.67  
5  86  1  41  7  135 
63.70  0.74  30.37  5.19  
Total  398  26  200  14  638 
Statistics  DF  Value  Prob 

ChiSquare  12  77.2068  <.001 
Likelihood Ratio ChiSquare  12  89.2250  <.001 
MantelHaenszel ChiSquare  1  .  . 
Phi Coefficient  0.3479  
contingency Coefficient  0.3286  
Cramer's V  0.2008 
WARNING: 35% of the cell's have expected counts less then 5. Chisquare may not be a valid test
Sample Size 638
Of course, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the cyst_hb variables more meaningful to us.
The TABLES statement first requests a twoway table between the (row) variable ctr and the (column) variable cyst_hb. Because we are interested in including the missing values ("Nothing") in the analysis, we include the MISSING tables option. When doing this kind of analysis, I like to use the NOPERCENT and NOCOL tables options to help declutter the twoway frequency table. Finally, the CHISQ tables option tells SAS to calculate the chisquare statistics and its Pvalue (as well as a few other statistics).
Launch and run the SAS program, and review the resulting output. You should first see a twoway table (with five rows for ctr and four columns for cyst_hb). Then, you should see a list of six different statistics, of which the first one is the chisquare test statistic. Here, the value of the statistic is 77.2 (rounded) with 12 degrees of freedom (DF) and a Pvalue that is less than 0.0001. (Assuming that there is enough nonmissing data and therefore that the chisquare test is valid), the Pvalue tells us that it is highly unlikely that we would observe the data we did under the assumption that ctr and cyst_hb are not associated. Therefore, we conclude that ctr and cyst_hb are associated.
Example 12.16
If we are interested in quantifying the association between clinical center and performing a cystoscopy, then we would want to take advantage of the MEASURES tables option. The following FREQ procedure illustrates the use of the MEASURES tables option to obtain a basic set of measures of association and their standard errors:
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chisquare Test of Hospital and Cystoscopy Procedure: MEASURES';
format cyst_hb cystfmt.;
tables ctr*cyst_hb/nopercent nocol missing measures;
RUN;
Again, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the cyst_hb variables more meaningful to us.
Launch and run the SAS program, and review the resulting output. You should again first see a twoway table (with five rows for ctr and four columns for cyst_hb):
ctr  CYST_HB  

Frequency  
Row Pct  Nothing  Local  Both  Hydro  Total 
1  4  7  11  
7.14  1.20  
2  7  22  29  
12.50  3.78  
3  12  220  232  
21.43  37.80  
4  20  229  249  
35.71  39.35  
5  13  104  117  
23.21  17.87  
Total  8.78  91.22  100.00 
Then, you should see a list of thirteen different statistics lumped into five groups:
Statistic  Value  ASE 

Gama  0.1697  0.0534 
Kendall's Tauc  0.1071  0.0339 
Stuart's Tauc  0.0897  0.0286 
Somers' D CR  0.0870  0.0276 
Somers' D RC  0.1318  0.0417 
Pearson Correlation  .  . 
Spearman Correlation  0.1206  0.0381 
Lambda Asymmetric CR  0.0000  0.0000 
Lambda Asymmetric RC  0.0155  0.0267 
Lambda Symmetric  0.0101  0.0175 
Uncertainty Coefficient CR  0.0802  0.0133 
Uncertainty Coefficient RC  0.0455  0.0079 
Uncertainty Coefficient Symmetric  0.0581  0.0099 
The first statistic is Gamma, the second is Kendall's Taub, ..., and the last is Uncertainty Coefficient Symmetric. The column labeled "Value" is the value calculated for the given statistic for this data set, and the column labeled "ASE" is the calculated (asymptotic) standard error of the statistic.
Example 12.17
The CochranMantelHaenszel test allows us to test for the association between two categorical variables while adjusting for a third categorical variable. To request that SAS performs such a test, we must use the CMH tables option. The following CMH tables option requests the CochranMantelHaenszel to test for association between the two variables cyst_hb and sym_1 while adjusting for the third variable ctr:
OPTIONS NOFMTERR;
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chisquare Test of Hospital and Cystoscopy Procedure: CMH';
title2 'Adjusting for Ctr';
format cyst_hb cystfmt.;
tables ctr*cyst_hb*sym_1/nopercent nocol cmh;
RUN;
As always, we put the stratifying variable — in this case, ctr — in the first position of the tables statement. Then, we put the two variables between which we are testing for association — in this case, cyst_hb and sym_1 — in the second and third positions. Again, the NOPERCENT and NOCOL tables options are used to help declutter the resulting frequency tables. Finally, the CMH tables option gives us the CochranMantelHaenszel test for association.
Launch and run the SAS program, and review the resulting output. You should first see a set of five twoway tables (one for each level of ctr). Here's what the table looks like for ctr = 2:
CYST_HB  SYM_1 (severity of urinary symptoms)  

Frequency  
Row Pct  1  2  3  4  5  Total 
Local  1  2  2  3  1  9 
11.11  22.22  22.22  33.33  11.11  
Both  1  5  22  30  12  70 
1.43  7.14  31.43  42.86  17.14  
Hydro  0  0  0  2  0  2 
0.00  0.00  0.00  100.00  0.00  
Total  2  7  24  35  13  81 
Then, you should see a set of three CochranMantelHaenszel statistics — the first one is labeled Nonzero Correlation, the second one is labeled Row Mean Scores Differ, and the third one is labeled General Association:
Statistics  Alternative Hypothesis  DF  Value  Prob 

1  Nonzero Correlation  1  2.6013  0.1068 
2  Row Mean Scores Differ  2  2.9560  0.2281 
3  General Association  8  6.9245  0.5448 
The column labeled "DF" contains the given statistic's degrees of freedom, the column labeled "Value" contains the value calculated for the given statistic for the data set, and the column labeled "Prob" contains the statistic's Pvalue.
12.7  Summary
12.7  SummaryIn this lesson, we learned about various features and options of the FREQ procedure.
The homework for this lesson will give you practice using these features and options.