# Lesson 12: Summarizing Categorical Data

Lesson 12: Summarizing Categorical Data## Overview

In this lesson, we'll investigate the FREQ procedure as a tool for summarizing and analyzing categorical data. The procedure is a descriptive procedure, as well as a statistical procedure. It allows you to produce one-way to *n*-way frequency and cross tabulation tables. For two-way tables, the FREQ procedure also computes chi-square tests and measures of association. And, for *n*-way tables, the FREQ procedure also performs stratified analyses, computing statistics within as well as across strata. The FREQ procedure can also be used to output summary statistics, such as counts and percentages, to a SAS data set.

## Objectives

Upon completing this lesson, you should be able to use the FREQ procedure to summarize a data set numerically in a variety of ways, including:

- create simple one-way, two-way, ... and
*n*-way table summaries - use the NOCUM option to suppress the printing of cumulative statistics in a table
- use the PAGE option to tell SAS to print only one table per page
- know how to read the values from a two-way table created by the FREQ procedure
- create two-way (and in general,
*n*-way) tables using the available shortcuts - use the NOROW, NOCOL, and NOPERCENT options to suppress some of the default output in each of the cells of an
*n*-way table - request additional output in each of the cells of an
*n*-way table, such as EXPECTED, DEVIATION, and CELLCHI2 - use the LIST and CROSSLIST tables options to print
*n*-way tables in a list format rather than as crosstabulation tables - use a BY statement to tell SAS to perform an operation for each level of the BY-group
- use the MISSING tables option tells SAS to treat missing values as nonmissing values, and to therefore include them in the calculation of the statistics
- use the MISSPRINT option tells SAS to treat missing values as nonmissing values when printing the frequencies, but to not include them in the calculation of the statistics
- use the FREQ procedure create new SAS data sets containing summary statistics of categorical variables
- use the NOPRINT tables option to tell SAS to suppress printing the
*n*-way crosstabulation - use the SPARSE tables option tells SAS to print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data
- invoke statistics table options, such as CHISQ, MEASURES, CMH, ALL, and EXACT

# 12.1 - A Basic One-Way Table

12.1 - A Basic One-Way TableBy default, the FREQ procedure creates a one-way table that contains the **frequency**, **percent**, **cumulative frequency**, and **cumulative percent** of every value of *every* variable in the input data set. That *every* there is italicized with good reason ... the FREQ procedure doesn't care whether the variable is a character variable or a numeric variable. And, if a variable is numeric, the FREQ procedure doesn't care if it is a discrete numeric variable with just a few possible outcomes (number of siblings, say) or a continuous numeric variable with an infinite number of possible outcomes (weight, say). That means then if you rely on the default version of the FREQ procedure, it is possible to create lots and lots and lots of output. That's why we'll skip the default version and will jump right to the more practical version in which you restrict the number of tables SAS creates by using a TABLES statement.

The FREQ procedure takes the following generic form:

```
PROC FREQ options;
tables ... /options;
RUN;
```

The TABLES statement tells SAS the specific frequency table(s) that you want to create. If you don't include a TABLES statement, then SAS creates a one-way frequency table for every variable in your input data set.

As you can see, there are two types of options, namely **procedure options** and **table options**. Procedure options, such as the typical "DATA=" option, must follow the PROC FREQ statement. Table options must be specified after a forward slash (/) in the TABLES statement. In either case, you can specify as many options as you would like.

Throughout this lesson, we'll use the ICDB background data set to illustrate the FREQ procedure. Right-click the link to save the data set to a convenient location on your computer.

## Example 12.1

The following FREQ procedure illustrates the simplest practical example, namely a one-way frequency table of the variable `sex`, with no bells or whistles added:

```
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
LIBNAME icdb 'C:\simon\icdb\data';
PROC FREQ data=icdb.back;
title 'Frequency Count of SEX';
tables sex;
RUN;
```

Launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the background data set. Then, run * * the program and review the output. You should see something along the lines of this basic one-way frequency table, in which, as promised, SAS reports the frequency, percent, cumulative frequency, and cumulative percent of each value of the `sex` variable:

##### Frequency Count of SEX

##### The FREQ Procedure

sex | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|

1 | 56 | 8.78 | 56 | 8.78 |

2 | 582 | 91.22 | 638 | 100.00 |

This output tells us, for example, that 56 or 8.78% of the subjects in the ICDB Study are male (coded as` sex` = 1).

## Example 12.2

Again, by default, SAS outputs frequency counts, percents, cumulative frequencies, and cumulative percents. The NOCUM table option suppresses the printing of the cumulative frequencies and cumulative percentages for one-way frequency tables. The following SAS code illustrates the NOCUM table option:

```
PROC FREQ data=icdb.back;
title 'Frequency Count of SEX: No Cumulative Stats';
tables sex/nocum;
RUN;
```

##### Frequency Count of SEX: No Cumulative Stats

##### The FREQ Procedure

sex | Frequency | Percent |
---|---|---|

1 | 56 | 8.78 |

2 | 528 | 91.22 |

Launch and run * * the SAS program. Review the output to convince yourself that indeed the cumulative frequencies and cumulative percentages are not printed in the table. The table contains only the number and percentage of each `sex`.

In any FREQ procedure, you can specify many variables in a TABLES statement. If the list is long, you may be able to use a shortcut to specify the list of variables. If you specify a TABLES statement using a numbered range of variables, such as:

`tables var1-var4;`

then SAS will create a one-way frequency table for the four variables named `var1`, `var2`, `var3`, and `var4`. If instead in your TABLES statement, you specify a range of variables by their position in the data set, such as:

`tables sex--race;`

then SAS will create a one-way frequency table for every variable that appears between the `sex` and `race` variables in the data set, namely in the case of the background data set, `sex`, `state`, `country`, and `race`. Recall that if you're not sure of the position of the variables in your data set, you can use the VARNUM option of the CONTENTS procedure to determine the position of the variables in a data set. (Incidentally, that is *not* a typo in the second TABLES statement ... it takes two dashes to specify a range of variables by their position in the data set.)

Rather than specifying many variables in a TABLES statement, you can specify many TABLES statements in a FREQ procedure. However you tell SAS to make multiple tables, you can use the PAGE option to tell SAS to print only one table per page. Otherwise, the FREQ procedure prints multiple tables per page as space permits.

## Example 12.3

The following SAS program illustrates the creation of two one-way frequency tables in conjunction with the PAGE option:

```
PROC FREQ data=icdb.back page;
title 'Frequency Count of SEX and RACE';
tables sex race;
RUN;
```

Launch and run * * the SAS program. Review the output to convince yourself that indeed SAS creates two one-way frequency tables — one for the categorical variable `sex` and the other for the categorical variable `race`. Because the PAGE option was invoked, each table should be printed on a separate page. The first page should contain the frequency table for the `sex` variable:

##### Frequency Count of SEX and RACE

##### The FREQ Procedure

sex | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|

1 | 56 | 8.78 | 56 | 8.78 |

2 | 528 | 91.22 | 638 | 100.00 |

and the second page should contain the frequency table for the `race` variable:

##### Frequency Count of SEX and RACE

##### The FREQ Procedure

race | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|

1 | 2 | 0.31 | 2 | 0.31 |

2 | 7 | 1.10 | 9 | 1.41 |

3 | 29 | 4.55 | 38 | 5.96 |

4 | 593 | 92.95 | 631 | 98.90 |

5 | 3 | 0.47 | 634 | 99.37 |

6 | 2 | 0.31 | 636 | 99.69 |

7 | 1 | 0.16 | 637 | 99.87 |

8 | 1 | 0.16 | 638 | 100.00 |

Incidentally, you might also want to notice that, not surprisingly, the order in which the variables appear in the TABLES statement determines the order in which they appear in the output.

## Example 12.4

As is the case for many SAS procedures, you can use a BY statement to tell SAS to perform an operation for each level of a BY group. The following program tells SAS to create a one-way frequency table for the variable *ed_level* for each level of the variable *sex*:

```
PROC SORT data=icbd.back out=s_back;
by sex;
RUN;
PROC FREQ data=s_back;
title 'Frequency Count of Education Level within Each Level of Sex';
tables ed_level;
by sex;
RUN;
```

As is always the case, the SORT procedure merely prepares the background data set for BY-group processing. The SORT procedure tells SAS to sort the *icdb.back* data set by *sex*, and to store the results in a new data set called *s_back*. Then, as you can see, the FREQ procedure is invoked with a BY statement ("by sex") in addition to the TABLES statement ("tables ed_level"). Launch and run * * the SAS program. Review the output to convince yourself that SAS creates two one-way frequency tables of education level (*ed_level*) — one for males (*sex*= 1):

##### Frequency Count of Education Level within Each Level of Sex

##### sex = 1

ed_level | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|

1 | 4 | 7.14 | 4 | 7.14 |

2 | 7 | 12.50 | 11 | 19.64 |

3 | 12 | 21.43 | 23 | 41.07 |

4 | 20 | 35.71 | 43 | 76.79 |

5 | 13 | 23.21 | 56 | 100.00 |

and one for females (*sex* = 2):

##### Frequency Count of Education Level within Each Level of Sex

##### sex = 2

ed_level | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|

1 | 7 | 1.20 | 7 | 1.20 |

2 | 22 | 3.78 | 29 | 4.98 |

3 | 220 | 37.80 | 249 | 42.78 |

4 | 229 | 39.35 | 478 | 82.13 |

5 | 104 | 17.87 | 582 | 100.00 |

So far, in each of the examples we have looked at, no missing values existed. When they do exist, SAS by default excludes them from your requested frequency tables. Instead, SAS prints the "**Frequency Missing**" below each table. You can instead opt to use the MISSING tables option, in which you tell SAS to instead treat missing values as nonmissing values, and to therefore include them in the calculation of percentages and other statistics. Or you can opt to use the MISSPRINT option to tell SAS to treat missing values as nonmissing values when printing the frequencies but do not include them in the calculation of the statistics.

## Example 12.5

The following SAS program illustrates the MISSING and MISSPRINT options on the variable `state` in the background data set:

```
PROC FREQ data=icdb.back;
title 'One-way Table of State: with MISSING Option';
tables state/missing;
RUN;
PROC FREQ data=icdb.back;
title 'One-way Table of State: with MISSPRINT Option';
tables state/missprint;
RUN;
```

Launch and run * * the SAS program, and review the resulting output. The first few rows of output from the FREQ procedure with the MISSING tables option should look something like this:

##### One-way Table of State: with MISSING Option

##### The FREQ Procedure

sex | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|

. | 42 | 6.58 | 42 | 6.58 |

1 | 4 | 0.63 | 46 | 7.21 |

3 | 1 | 0.16 | 47 | 7.37 |

4 | 5 | 0.78 | 52 | 8.15 |

As you can see, the first row tells us that 42 subjects did not report the state in which they live. Because the MISSING option was used, SAS also tells us the 42 subjects comprise 6.58% of the subjects in the data set. SAS also includes the 42 subjects in the calculation of the cumulative percentage.

On the other hand, the first few rows of output from the FREQ procedure with the MISSPRINT tables option should look something like this:

##### One-way Table of State: with MISSPRINT Option

##### The FREQ Procedure

sex | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|

. | 42 | . | . | . |

1 | 4 | 0.67 | 4 | 0.67 |

3 | 1 | 0.17 | 5 | 0.84 |

4 | 5 | 0.84 | 10 | 1.68 |

As you can see, the first row again tells us that 42 subjects did not report the state in which they live. In this case, however, because the MISSPRINT option was specified, SAS stops there. That is, SAS does not include the subjects in any of its calculations of the percent, cumulative frequency or cumulative percent.

# 12.2 - Two-way and N-way Tables

12.2 - Two-way and N-way TablesSo far, we have only used the FREQ procedure to create one-way frequency tables. However, it is often helpful to create crosstabulation tables in which the frequencies are determined for more than one variable at a time. For example, if we were interested in knowing how the percentage of voters favoring Barack Obama differs for various age groups, we'd need to create a two-way crosstabulation table between the two categorical variables *candidate* (Obama, McCain) and *agegroup* (18-29, 30-44, 45-54, 55-70, more than 70). We'll investigate such tables in this section.

## Example 12.6

The following FREQ procedure illustrates the simplest example of telling SAS to create a two-way table, for the variables `sex` and `ed_level`, with no bells and whistles added:

```
PROC FREQ data=icdb.back;
title 'Crosstabulation of Education Level and Sex';
tables ed_level*sex;
RUN;
```

##### Crosstabulation of Education Level and RACE

##### The FREQ Procedure

##### table of ed_level by sex

Frequency | |||
---|---|---|---|

Percent | |||

Row Pct | |||

Col Pct | 1 | 2 | Total |

1 | 4 | 7 | 11 |

0.63 | 1.10 | 1.72 | |

36.36 | 63.64 | ||

7.14 | 1.20 | ||

2 | 7 | 22 | 29 |

1.10 | 3.45 | 4.55 | |

24.14 | 75.86 | ||

12.50 | 3.78 | ||

3 | 12 | 220 | 232 |

1.88 | 34.48 | 36.36 | |

5.17 | 94.83 | ||

21.43 | 37.80 | ||

4 | 20 | 229 | 249 |

3.13 | 35.89 | 39.03 | |

8.03 | 91.97 | ||

35.71 | 39.35 | ||

5 | 13 | 104 | 117 |

2.04 | 16.30 | 18.34 | |

11.11 | 88.89 | ||

23.21 | 17.87 | ||

Total | 56 | 582 | 638 |

8.78 | 91.22 | 100.00 |

As you can see, to tell SAS to create a two-way table of `ed_level` and `sex`, we merely use an asterisk (*) to join the two variables in the TABLES statement.

Launch and run * * the SAS program. Review the output to convince yourself that SAS created the requested two-way table. In general, the values of the variable appearing before the asterisk form the rows of the table, and the values of the variable appearing after the asterisk form the columns of the table. In this case, since `ed_level` appears before the asterisk in the TABLES statement, its values form the rows of the table. And, since `sex` appears after the asterisk, its values form the columns of the table.

When SAS creates two-way tables, each cell of the table contains, by default, the cell frequency, the cell percentage of the total frequency, the cell percentage of the row frequency, and the cell percentage of the column frequency. You might want to review the numbers in each of the cells of the table to make sure you understand what is what. In the upper left-hand corner of the table, SAS always prints a guide to the numbers appearing in each of the cells in the table. Here, SAS tells us that the first number in cell (`i`,`j`) is the number of subjects of `ed_level` `i` and `sex` `j`; the second number in cell (`i`,`j`) is the percentage of subjects of `ed_level` `i` and `sex` `j`; the third number in cell (`i`,`j`) is the percentage of subjects who are `sex` `j` given that they are `ed_level` `i`; and the fourth number in cell (`i`,`j`) is the percentage of subjects who are `ed_level` `i` given that they are `sex` `j`. For example, for the cell in which `ed_level ` = 4 and `sex` = 2, SAS tells us that: 229 of the subjects in the data set are `ed_level` 4 and `sex` 2; 35.89% of the subjects in the data set are `ed_level` 4 and `sex` 2; 91.97% of the subjects in the data set who are `ed_level` 4 are `sex` 2; and 39.35% of the subjects in the data set who are `sex` 2 are `ed_level` 4.

**A little note about shortcuts.** If you have many different two-way tables to create, you can use a variety of shortcuts. For example, the TABLES statement:

`tables a*(b c);`

tells SAS to create a two-way table between variables `a` and `b` (a*b) and a two-way table between variables `a `and `c` (a*c). The TABLES statement:

`tables (a b)*(c d);`

tells SAS to create four two-way tables, namely: a*c, b*c, a*d, and b*d. The TABLES statement:

`tables (a b c)*d;`

tells SAS to create three two-way tables, namely: a*d, b*d, and c*d.

## Example 12.7

For a frequency analysis of more than two variables, we can use the FREQ procedure to create `n`-way crosstabulation tables. In that case, a series of two-way tables is created, with a table for each level of the other variable(s). The following program creates a three-way table of `sex`, `job_chng`, and `ed_level`:

```
PROC FREQ data=icdb.back;
title '3-way Table of Sex, Job Change, and Ed. Level';
tables sex*job_chng*ed_level;
RUN;
```

As you can see, to tell SAS to create a three-way table of `sex`,` job_chng, and ed_level, we use an asterisk (*) to join the three variables in the TABLES statement. The order of the variables is important. In n-way tables, the last two variables of the TABLES statement become the rows and columns of the two-way tables. Variables that precede the last two variables in the TABLES statement stratify the crosstabulation tables. So, in this case, we should expect SAS to create two two-way tables of job_chng and ed_level, one for when sex = 1 and one for when sex = 2. `

Launch and run * * the SAS program, and review the output to convince yourself that SAS created the three-way table as described. You should see that, indeed, SAS created one two-way table of `job_chng` and `ed_level` for when `sex` = 1:

##### Crosstabulation of Sex, Job Change, and Education Level

##### The FREQ Procedure

##### Table 1 of job_chng by ed_level

##### Controlling for sex=1

job_chng ed_level

Frequency | ||||||
---|---|---|---|---|---|---|

Percent | ||||||

Row Pct | ||||||

Col Pct | 1 | 2 | 3 | 4 | 5 | Total |

0 | 4 | 6 | 11 | 15 | 12 | 48 |

7.84 | 11.76 | 21.57 | 29.41 | 23.53 | 94.12 | |

8.33 | 12.50 | 22.92 | 31.25 | 25.00 | ||

100.00 | 100.00 | 100.00 | 88.24 | 92.31 | ||

1 | 0 | 0 | 0 | 2 | 1 | 3 |

0.00 | 0.00 | 0.00 | 3.92 | 1.96 | 5.88 | |

0.00 | 0.00 | 0.00 | 66.67 | 33.33 | ||

0.00 | 0.00 | 0.00 | 11.76 | 7.69 | ||

Total | 4 | 6 | 11 | 17 | 13 | 51 |

7.84 | 11.76 | 21.57 | 33.33 | 25.49 | 100.00 |

Frequency Missing = 107

and one two-way table of `job_chng` and `ed_level` for when `sex` = 2:

##### table 2 of job_chng by ed_level

##### Controlling for sex=2

job_chng ed_level

Frequency | ||||||
---|---|---|---|---|---|---|

Percent | ||||||

Row Pct | ||||||

Col Pct | 1 | 2 | 3 | 4 | 5 | Total |

0 | 4 | 14 | 148 | 163 | 74 | 403 |

0.84 | 2.95 | 31.16 | 34.32 | 15.58 | 84.84 | |

0.99 | 3.47 | 36.72 | 40.45 | 18.36 | ||

80.00 | 82.35 | 85.06 | 83.16 | 89.16 | ||

1 | 1 | 3 | 26 | 33 | 9 | 72 |

0.21 | 0.63 | 5.47 | 6.95 | 1.89 | 15.16 | |

1.39 | 4.17 | 36.11 | 45.83 | 12.50 | ||

20.00 | 17.65 | 14.94 | 16.84 | 10.84 | ||

Total | 5 | 17 | 174 | 196 | 83 | 475 |

1.05 | 3.58 | 36.63 | 41.26 | 17.47 | 100.00 |

Frequency Missing = 107

It probably goes without saying that, in general, `n`-way tables can generate lots of output.

# 12.3 - Cell Statistics

12.3 - Cell StatisticsAs you now know, when SAS creates two-way tables, each cell of the table contains, by default, the cell frequency, the (joint) cell percentage, the conditional row percentage, and the conditional column percentage. As usual, you don't have to accept the default ... you can tell SAS to suppress some of the default statistics it displays ... or you can tell SAS to display alternative statistics.

For *n*-way crosstabulations, you can suppress the default output in each of the cells by using any (or all) of the following table options:

**NOFREQ**suppresses the printing of the cell freqencies**NOROW**suppresses the printing of the row percentages**NOCOL**suppresses the printing of the column percentages, and**NOPERCENT**suppresses the printing of the (joint) cell percentages

## Example 12.8

The following SAS program illustrates the NOROW, NOCOL, and NOPERCENT table options:

```
PROC FREQ data=icdb.back;
title 'Crosstabulation of SEX and RACE: No percents';
tables race*sex/norow nocol nopercent;
RUN;
```

##### Crosstabulation of SEX and RACE: No percents

##### The FREQ Procedure

##### Table of race by sex

race | sex | ||
---|---|---|---|

Frequency | 1 | 2 | Total |

1 | 0 | 0 | 2 |

2 | 2 | 3 | 7 |

3 | 1 | 28 | 29 |

4 | 51 | 542 | 593 |

5 | 0 | 3 | 3 |

6 | 0 | 2 | 2 |

7 | 1 | 0 | 1 |

8 | 0 | 1 | 1 |

Total | 56 | 582 | 638 |

Launch and run * * the SAS program. Review the output to convince yourself that SAS did indeed suppress the default row percentages, column percentages, and cell percentages. Note, too, that SAS changed the guide in the upper left-hand corner of the table to reflect the new situation — the guide now tells us that the lone number reported in each cell is the cell frequency.

Instead of suppressing output in an *n*-way table, we can request additional output in each of the cells by using the following options:

- EXPECTED, to print the expected cell frequencies under the null hypothesis of independence
- DEVIATION, to print the deviation of the cell frequency from the expected cell frequency (under the null hypothesis of independence)
- CELLCHI2, to print each cell's contribution to the total chi-squared statistic

That is, CELLCHI2 is defined as:

\(\text{CELLCHI2} = \dfrac{\left(\text{frequency} - \text{expected}\right)^2}{\text{expected}}\)

## Example 12.9

In creating the two-way table between `race` and `sex`, the following FREQ procedure requests that the EXPECTED and CELLCHI2 statistics be printed, while at the same time suppressing the printing of the joint, row, and column percentages:

```
PROC FREQ data=icdb.back;
title 'Crosstabulation of SEX and RACE: With Expecteds';
tables race*sex/expected cellchi2 norow nocol nopercent;
RUN;
```

##### Table of race by sex

race | sex | ||
---|---|---|---|

Frequency | |||

Expected | |||

Cell Chi-Square | 1 | 2 | Total |

1 | 0 | 2 | 2 |

0.1755 | 1.8245 | ||

0.1755 | 0.0169 | ||

2 | 3 | 4 | 7 |

0.6144 | 6.3856 | ||

9.2624 | 0.8912 | ||

3 | 1 | 28 | 29 |

2.5455 | 26.455 | ||

0.9383 | 0.0903 | ||

4 | 51 | 542 | 593 |

52.05 | 540.95 | ||

0.0212 | 0.002 | ||

5 | 0 | 3 | 3 |

0.2633 | 2.7367 | ||

0.2633 | 0.0253 | ||

6 | 0 | 2 | 2 |

0.1755 | 1.8245 | ||

0.1755 | 0.0169 | ||

7 | 0 | 2 | 2 |

0.0878 | 0.9122 | ||

9.4806 | 09.122 | ||

8 | 0 | 1 | 1 |

0.0878 | 0.9122 | ||

0.0878 | 0.0084 | ||

Total | 56 | 582 | 638 |

Launch and run * * the SAS program. Review the output to convince yourself that SAS did indeed suppress the printing of the joint, row, and column percentages, while adding the expected cell frequencies, as well as the cell's contribution to the chi-square statistic, to the cell output. Note, again, that SAS changed the guide in the upper left-hand corner of the table to reflect the new situation — the guide now tells us that the second number in each cell is the expected cell frequency and the third number is the cell's contribution to the chi-square statistic.

# 12.4 - Changing the Table Format

12.4 - Changing the Table FormatWhen three or more variables are specified, the multiple levels of *n*-way tables can produce lots of output. Such bulky and complex crosstabulations are often easier to read as a continuous list. To tell SAS to generate list output for crosstabulations rather than the default tables, simply use the LIST tables option.

## Example 12.10

The following FREQ procedure illustrates the LIST option:

```
PROC FREQ data=icdb.back;
title 'Crosstabulation of SEX and RACE: In List Format';
tables sex*race/list;
RUN;
```

##### Crosstabulation of SEX and RACE: in List Format

##### The FREQ Procedure

sex | race | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|---|

1 | 2 | 3 | 0.47 | 3 | 0.47 |

1 | 3 | 1 | 0.16 | 4 | 0.63 |

1 | 4 | 51 | 7.99 | 55 | 8.62 |

1 | 7 | 1 | 0.16 | 56 | 8.78 |

2 | 1 | 2 | 0.31 | 58 | 9.09 |

2 | 2 | 4 | 0.63 | 62 | 9.72 |

2 | 3 | 28 | 4.39 | 90 | 14.11 |

2 | 4 | 542 | 84.95 | 632 | 99.06 |

2 | 5 | 3 | 0.47 | 635 | 99.53 |

2 | 6 | 2 | 0.31 | 637 | 99.84 |

2 | 8 | 1 | 0.16 | 638 | 100.00 |

Launch and run * * the SAS program. Review the output to convince yourself that rather than creating a two-way table, SAS prints the requested two-way table as a list. I tend to prefer the LIST format for such cross-tabulations, as long as I'm not in need of the (conditional) row and column percentages (which, as you can see, are not reported in LIST format).

Incidentally, the LIST option cannot be used when the TABLES statement requests statistical tests or measures of association to be included in the output. (We'll be discussing such tests and measures later in this lesson.) Also, expected cell frequencies are not printed when the LIST option is specified, even if the EXPECTED option is invoked.

An alternative to the LIST option is the CROSSLIST table option. When you add the CROSSLIST option to your TABLES statement, SAS displays crosstabulation tables in the standard Output Delivery System (ODS) column format. The benefit of doing so probably won't make much sense until we learn about ODS in Stat 481. The option creates a table that has a table definition that you can customize by using the TEMPLATE procedure. For now, let's just be satisfied with looking at an example to see what the output looks like when you use the CROSSLIST option.

## Example 12.11

The following FREQ procedure illustrates the CROSSLIST option:

```
PROC FREQ data=icdb.back;
title 'Crosstabulation of SEX and RACE: In Crosslist Format';
tables sex*race/crosslist;
RUN;
```

sex | race | Frequency | Percent | Row Percent | Column Percent |
---|---|---|---|---|---|

1 |
1 | 0 | 0.00 | 0.00 | 0.00 |

2 | 3 | 0.47 | 5.36 | 42.86 | |

3 | 1 | 0.16 | 1.79 | 3.45 | |

4 | 51 | 7.99 | 91.07 | 8.60 | |

5 | 0 | 0.00 | 0.00 | 0.00 | |

6 | 0 | 0.00 | 0.00 | 0.00 | |

7 | 1 | 0.16 | 1.79 | 100.00 | |

8 | 0 | 0.00 | 0.00 | 0.00 | |

Total |
56 | 8.78 | 100.00 | ||

------------------------------------------------------------------- |
|||||

2 |
1 | 2 | 0.31 | 0.34 | 100.00 |

2 | 4 | 0.63 | 0.69 | 57.14 | |

3 | 28 | 4.36 | 4.81 | 96.55 | |

4 | 542 | 84.95 | 93.13 | 91.40 | |

5 | 3 | 0.47 | 0.52 | 100.00 | |

6 | 2 | 0.31 | 0.34 | 100.00 | |

7 | 0 | 0.00 | 0.00 | 0.00 | |

8 | 1 | 0.16 | 0.17 | 100.00 | |

Total |
582 | 91.22 | 100.00 | ||

------------------------------------------------------------------- |
|||||

Total |
1 | 2 | 0.31 | 100.00 | |

2 | 7 | 1.10 | 100.00 | ||

3 | 29 | 4.55 | 100.00 | ||

4 | 593 | 92.95 | 100.00 | ||

5 | 3 | 0.47 | 100.00 | ||

6 | 2 | 0.31 | 100.00 | ||

7 | 1 | 0.16 | 100.00 | ||

8 | 1 | 0.16 | 100.00 | ||

Total |
638 | 100.00 |

Launch and run * * the SAS program, and review the output to familiarize yourself with the appearance of the output when you use the CROSSLIST option.

# 12.5 - Creating Output Data Sets

12.5 - Creating Output Data SetsJust as the UNIVARIATE, MEANS and SUMMARY procedures can be used to create new SAS data sets containing summary statistics of numeric variables, the FREQ procedure can be used to create new SAS data sets containing summary statistics of categorical variables. This can be a very useful tool when manipulating and subsequently analyzing data.

## Example 12.12

The following FREQ procedure tells SAS to create an output data set which contains the counts and percentages for each combination of the variables `sex` and `race`:

```
PROC FREQ data=icdb.back;
tables sex*race/out=sexfreq noprint;
RUN;
PROC PRINT;
title 'Output Dataset: SEXFREQ';
RUN;
```

##### Output Dataset: SEXFREQ

Obs | sex | race | COUNT | PERCENT |
---|---|---|---|---|

1 | 1 | 2 | 3 | 0.4702 |

2 | 1 | 3 | 1 | 0.1567 |

3 | 1 | 4 | 51 | 7.9937 |

4 | 1 | 7 | 1 | 0.1567 |

5 | 2 | 1 | 2 | 0.3135 |

6 | 2 | 2 | 4 | 0.6270 |

7 | 2 | 3 | 28 | 4.3887 |

8 | 2 | 4 | 542 | 84.9530 |

9 | 2 | 5 | 3 | 0.4702 |

10 | 2 | 6 | 2 | 0.3135 |

11 | 2 | 8 | 1 | 0.1567 |

The NOPRINT tables option tells SAS to suppress printing the two-way crosstabulation of `sex`*`race`. Instead, the results are dumped to the temporary data set `sexfreq` that is specified in the OUT= tables option. The data set in the OUT= option becomes the current data set. Therefore, since the PRINT procedure does not have a DATA= option to tell SAS the name of the data set to print, SAS prints the current data set, namely `sexfreq`.

Launch and run * * the SAS program. Review the output to convince yourself that the `sexfreq` data set contains one record for each combination of `sex` and `race` that occurs in the data set. And, note that the data set contains four variables, `sex`, `race`, `COUNT` (whose value is the number of subjects in the data set of that sex and race), and `PERCENT` (whose value is the percentage of subjects in the data set of that sex and race)**.**

## Example 12.13

The SPARSE tables option tells SAS to print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data. **This option affects printouts under the LIST option and output data sets only**. The following SAS code illustrates use of the SPARSE option in the creation of an output data set called `sexfreq`:

```
PROC FREQ data=icdb.back;
tables sex*race/out=sexfreq noprint sparse;
RUN;
PROC PRINT;
title 'Output Dataset: SEXFREQ with SPARSE option';
RUN;
```

##### Output Dataset: SEXFREQ with SPARSE option

Obs | sex | race | COUNT | PERCENT |
---|---|---|---|---|

1 | 1 | 1 | 0 | 0.0000 |

2 | 1 | 2 | 3 | 0.4702 |

3 | 1 | 3 | 1 | 0.1567 |

4 | 1 | 4 | 51 | 7.9937 |

5 | 1 | 5 | 0 | 0.0000 |

6 | 1 | 6 | 0 | 0.0000 |

7 | 1 | 7 | 1 | 0.1567 |

8 | 1 | 8 | 0 | 0.0000 |

9 | 2 | 1 | 2 | 0.3135 |

10 | 2 | 2 | 4 | 0.6270 |

11 | 2 | 3 | 28 | 4.3887 |

12 | 2 | 4 | 542 | 84.9530 |

13 | 2 | 5 | 3 | 0.4702 |

14 | 2 | 6 | 2 | 0.3135 |

15 | 2 | 7 | 0 | 0.0000 |

16 | 2 | 8 | 1 | 0.1567 |

The only difference between this code and the previous code is the inclusion of the SPARSE option. Launch and run * * the SAS program. Review the output to convince yourself that the `sexfreq` data set now contains one record for each possible combination of `sex` and `race` regardless of whether or not the combination occurs in the data set. That is, since there are 2 sexes and 8 races, the `sexfreq` data set contains 16 records — one for each of the combinations. For example, there is now a record in the `sexfreq` data set corresponding to sex = 1 and race = 1 even though there are no subjects in the data set of that sex and race combination (`COUNT` = 0).

# 12.6 - Table Statistics

12.6 - Table StatisticsThere are a few possible statistical table options available in the FREQ procedure. Unfortunately, we don't have the luxury of digressing and discussing in this course when it is appropriate to use each of the statistics. We merely make ourselves aware of their existence in SAS should we need them in analyzing our data in the future. The options include:

- CHISQ, which requests a chi-square test of homogeneity or independence. Reported measures include the phi coefficient, the contingency coefficient, and Cramer's V. For 2×2 tables, Fisher's exact test is also included.
- MEASURES, which requests a basic set of measures of association and their standard errors. Measures include Pearson and Spearman correlation coefficients, gamma, Kendall's tau-b, Stuart's tau-c, Somer's D, lambda, and uncertainty coefficients. For 2×2 tables, odds ratios and risk ratios (with their confidence intervals) are reported.
- CMH, which requests Cochran-Mantel-Haenszel statistsics, which test for association between a row and column variable after adjusting for all of the other variables in the tables statement. For 2×2 tables, Breslow's test for homogeneity of odds ratios is reported as well.
- ALL, which requests all of the tests and measures provided by the CHISQ, MEASURE, and CMH options.
- EXACT, which requests Fisher's Exact Test for tables larger than 2×2 tables. Note that this can take painfully long to compute for large tables.

Of course, you can always refer to the SAS Help and Documentation for the technical details of the statistical tests and measures.

## Example 12.14

For the remaining examples in this lesson, we need an analysis data set with which to work. The analysis data set (click to save!) that we'll use contains (just!) four variables pulled from the ICDB background data set (with which we've already been working!), the ICDB cystoscopy data set and the ICDB symptoms data set. The following program merely displays the variable names of, and the first 15 observations in, the `analysis` data set:

```
OPTIONS NOFMTERR;
PROC CONTENTS data = icdb.analysis position;
title 'The Analysis data set';
RUN;
PROC PRINT data = icdb.analysis (OBS = 15);
title 'The Analysis data set';
RUN;
```

Launch and run * * the SAS program. From the CONTENTS procedure output, you should see that the `analysis` data set contains 638 observations and the following four variables:

##### Variables in Creation Order

# | Variable | Type | Len | Format | Informat | Label |
---|---|---|---|---|---|---|

1 | subj | Num | 8 | 11. | 11. | patient ID number |

2 | CYST_HB | Num | 8 | |||

3 | SYM_1 | Num | 8 | SYMPTSEV. | 4. | severity of urinary symptoms |

4 | ctr | Num | 8 |

Besides the variable names, this list tells us that the sym_1 variable has a permanent format called `symptsev.` associated with it. You might recall that if we don't have access to the permanent formats catalog — which we don't — then we'll run into trouble when trying to use the sym_1 variable. That's why we specified the NOFMTERR option before trying to use the analysis data set. The NOFMTERR option allows us to use the `analysis` data set. We just can't take advantage of its variables' permanent formats.

Perhaps you'd appreciate an explanation of what the four variables are? Here goes:

`subj`: subject ID number`ctr`: clinical center number (1, 2, 3, 4, 5)`cyst_hb`: indicates whether or not a hydrodistention and/or biopsy was performed (0 = a local cystoscopy with no hydrodistention or biopsy, 1 = hydrodistention and biopsy, and 2 = hydrodistention only)`sym_1`: which indicates the subject's general assessment of the severity of his/her urinary symptoms (1 = not severe at all to 5 = extremely severe)

It might also help to review the first 15 observations just to get a feel for what the data look like:

##### The Analysis data set

Obs | subj | CYST_HB | SYM_1 | ctr |
---|---|---|---|---|

1 | 110027 | 1 | 4 | 1 |

2 | 110029 | . | 4 | 1 |

3 | 110039 | . | 3 | 1 |

4 | 110040 | . | 3 | 1 |

5 | 110045 | . | 3 | 1 |

6 | 11049 | . | 3 | 1 |

7 | 110051 | . | 4 | 1 |

8 | 110052 | 1 | 4 | 1 |

9 | 110053 | . | 3 | 1 |

10 | 110055 | . | 3 | 1 |

11 | 110057 | . | . | 1 |

12 | 110058 | 1 | 4 | 1 |

13 | 110059 | . | 4 | 1 |

14 | 110060 | . | 3 | 1 |

15 | 110062 | 1 | 4 | 1 |

Now, as promised, let's use the `analysis` data set to illustrate some of the statistical tests and measures of association you can request from SAS while using the FREQ procedure.

## Example 12.15

Some clinical centers may be more likely to perform a cystoscopy (which is a fairly invasive procedure) on their patients than other clinical centers. One could imagine this happening for a variety of reasons. For instance, one clinical center might have more severe patients thereby justifying more invasive procedures. Or perhaps the patients attending a particular clinical center might be better off financially and therefore more willing to pay for additional procedures. In any case, suppose we are interested in testing whether or not there is an association between performing a cystoscopy (`cyst_hb`) and clinical center (`ctr`). A chi-square test between the two variables would help us answer our research question. The following FREQ procedure illustrates the use of the CHISQ tables option in order to obtain the value of the chi-square test statistic and its associated `P`-value (as well as a few other useful statistics and `P`-values):

```
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chi-square Test of Hospital and Cystoscopy Procedure: CHISQ';
format cyst_hb cystfmt.;
tables ctr*cyst_hb/nopercent nocol missing chisq;
RUN;
```

##### Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES

##### The FREQ Procedure

##### Table of ctr by CYST_HB

ctr | CYST_HB | ||||
---|---|---|---|---|---|

Frequency | |||||

Row Pct | Nothing | Local | Both | Hydro | Total |

1 | 96 | 0 | 10 | 0 | 106 |

90.57 | 0.00 | 9.43 | 0.00 | ||

2 | 77 | 0 | 14 | 0 | 52 |

48.73 | 5.70 | 44.30 | 1.27 | ||

3 | 38 | 0 | 14 | 0 | 52 |

73.08 | 0.00 | 26.92 | 0.00 | ||

4 | 101 | 16 | 65 | 5 | 187 |

54.01 | 8.56 | 34.76 | 2.67 | ||

5 | 86 | 1 | 41 | 7 | 135 |

63.70 | 0.74 | 30.37 | 5.19 | ||

Total | 398 | 26 | 200 | 14 | 638 |

##### Statistics for Table of ctr by CYST_HB

Statistics | DF | Value | Prob |
---|---|---|---|

Chi-Square | 12 | 77.2068 | <.001 |

Likelihood Ratio Chi-Square | 12 | 89.2250 | <.001 |

Mantel-Haenszel Chi-Square | 1 | . | . |

Phi Coefficient | 0.3479 | ||

contingency Coefficient | 0.3286 | ||

Cramer's V | 0.2008 |

WARNING: 35% of the cell's have expected counts less then 5. Chi-square may not be a valid test

Sample Size 638

Of course, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the `cyst_hb` variables more meaningful to us.

The TABLES statement first requests a two-way table between the (row) variable `ctr` and the (column) variable `cyst_hb`. Because we are interested in including the missing values ("Nothing") in the analysis, we include the MISSING tables option. When doing this kind of analysis, I like to use the NOPERCENT and NOCOL tables options to help declutter the two-way frequency table. Finally, the CHISQ tables option tells SAS to calculate the chi-square statistics and its `P`-value (as well as a few other statistics).

Launch and run * * the SAS program, and review the resulting output. You should first see a two-way table (with five rows for `ctr` and four columns for `cyst_hb`). Then, you should see a list of six different statistics, of which the first one is the chi-square test statistic. Here, the value of the statistic is 77.2 (rounded) with 12 degrees of freedom (DF) and a `P`-value that is less than 0.0001. (Assuming that there is enough non-missing data and therefore that the chi-square test is valid), the `P`-value tells us that it is highly unlikely that we would observe the data we did under the assumption that `ctr` and `cyst_hb` are not associated. Therefore, we conclude that `ctr `and `cyst_hb` are associated.

## Example 12.16

If we are interested in quantifying the association between clinical center and performing a cystoscopy, then we would want to take advantage of the MEASURES tables option. The following FREQ procedure illustrates the use of the MEASURES tables option to obtain a basic set of measures of association and their standard errors:

```
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES';
format cyst_hb cystfmt.;
tables ctr*cyst_hb/nopercent nocol missing measures;
RUN;
```

Again, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the *cyst_hb* variables more meaningful to us.

Launch and run * * the SAS program, and review the resulting output. You should again first see a two-way table (with five rows for *ctr* and four columns for *cyst_hb*):

##### Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES

##### The FREQ Procedure

##### Table of ctr by CYST_HB

ctr | CYST_HB | ||||
---|---|---|---|---|---|

Frequency | |||||

Row Pct | Nothing | Local | Both | Hydro | Total |

1 | 4 | 7 | 11 | ||

7.14 | 1.20 | ||||

2 | 7 | 22 | 29 | ||

12.50 | 3.78 | ||||

3 | 12 | 220 | 232 | ||

21.43 | 37.80 | ||||

4 | 20 | 229 | 249 | ||

35.71 | 39.35 | ||||

5 | 13 | 104 | 117 | ||

23.21 | 17.87 | ||||

Total | 8.78 | 91.22 | 100.00 |

Then, you should see a list of thirteen different statistics lumped into five groups:

##### Statistics for Table of ctr by CYST_HB

Statistic | Value | ASE |
---|---|---|

Gama | 0.1697 | 0.0534 |

Kendall's Tau-c | 0.1071 | 0.0339 |

Stuart's Tau-c | 0.0897 | 0.0286 |

Somers' D C|R | 0.0870 | 0.0276 |

Somers' D R|C | 0.1318 | 0.0417 |

Pearson Correlation | . | . |

Spearman Correlation | 0.1206 | 0.0381 |

Lambda Asymmetric C|R | 0.0000 | 0.0000 |

Lambda Asymmetric R|C | 0.0155 | 0.0267 |

Lambda Symmetric | 0.0101 | 0.0175 |

Uncertainty Coefficient C|R | 0.0802 | 0.0133 |

Uncertainty Coefficient R|C | 0.0455 | 0.0079 |

Uncertainty Coefficient Symmetric | 0.0581 | 0.0099 |

The first statistic is Gamma, the second is Kendall's Tau-b, ..., and the last is Uncertainty Coefficient Symmetric. The column labeled "Value" is the value calculated for the given statistic for this data set, and the column labeled "ASE" is the calculated (asymptotic) standard error of the statistic.

## Example 12.17

The Cochran-Mantel-Haenszel test allows us to test for the association between two categorical variables while adjusting for a third categorical variable. To request that SAS performs such a test, we must use the CMH tables option. The following CMH tables option requests the Cochran-Mantel-Haenszel to test for association between the two variables `cyst_hb` and `sym_1` while adjusting for the third variable `ctr`:

```
OPTIONS NOFMTERR;
PROC FORMAT;
value cystfmt 0 = 'Local'
1 = 'Both'
2 = 'Hydro'
OTHER = 'Nothing';
RUN;
PROC FREQ data=icdb.analysis;
title 'Chi-square Test of Hospital and Cystoscopy Procedure: CMH';
title2 'Adjusting for Ctr';
format cyst_hb cystfmt.;
tables ctr*cyst_hb*sym_1/nopercent nocol cmh;
RUN;
```

As always, we put the stratifying variable — in this case, `ctr` — in the first position of the tables statement. Then, we put the two variables between which we are testing for association — in this case, `cyst_hb `and` sym_1` — in the second and third positions. Again, the NOPERCENT and NOCOL tables options are used to help declutter the resulting frequency tables. Finally, the CMH tables option gives us the Cochran-Mantel-Haenszel test for association.

Launch and run * * the SAS program, and review the resulting output. You should first see a set of five two-way tables (one for each level of `ctr`). Here's what the table looks like for `ctr` = 2:

##### Table 2 of CYST_HB by SYM_1

##### Controlling for ctr=2

CYST_HB | SYM_1 (severity of urinary symptoms) | |||||
---|---|---|---|---|---|---|

Frequency | ||||||

Row Pct | 1 | 2 | 3 | 4 | 5 | Total |

Local |
1 | 2 | 2 | 3 | 1 | 9 |

11.11 | 22.22 | 22.22 | 33.33 | 11.11 | ||

Both |
1 | 5 | 22 | 30 | 12 | 70 |

1.43 | 7.14 | 31.43 | 42.86 | 17.14 | ||

Hydro |
0 | 0 | 0 | 2 | 0 | 2 |

0.00 | 0.00 | 0.00 | 100.00 | 0.00 | ||

Total |
2 | 7 | 24 | 35 | 13 | 81 |

Then, you should see a set of three Cochran-Mantel-Haenszel statistics — the first one is labeled Nonzero Correlation, the second one is labeled Row Mean Scores Differ, and the third one is labeled General Association:

##### Chi-square test of Hospital and Cystoscopy Procedure: CMH

##### Adjusting for Ctr

##### The FREQ Procedure

##### Summary Statistics for CYST_HB by SYM_1

##### Controlling for ctr

##### Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistics | Alternative Hypothesis | DF | Value | Prob |
---|---|---|---|---|

1 | Nonzero Correlation | 1 | 2.6013 | 0.1068 |

2 | Row Mean Scores Differ | 2 | 2.9560 | 0.2281 |

3 | General Association | 8 | 6.9245 | 0.5448 |

The column labeled "DF" contains the given statistic's degrees of freedom, the column labeled "Value" contains the value calculated for the given statistic for the data set, and the column labeled "Prob" contains the statistic's `P`-value.

# 12.7 - Summary

12.7 - SummaryIn this lesson, we learned about various features and options of the FREQ procedure.

The homework for this lesson will give you practice using these features and options.