Lesson 12: Summarizing Categorical Data

Lesson 12: Summarizing Categorical Data

Overview

In this lesson, we'll investigate the FREQ procedure as a tool for summarizing and analyzing categorical data. The procedure is a descriptive procedure, as well as a statistical procedure. It allows you to produce one-way to n-way frequency and cross-tabulation tables. For two-way tables, the FREQ procedure also computes chi-square tests and measures of association. And, for n-way tables, the FREQ procedure also performs stratified analyses, computing statistics within and across strata. The FREQ procedure can also output summary statistics, such as counts and percentages, to a SAS data set.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to use the FREQ procedure to summarize a data set numerically in a variety of ways, including:

  • create simple one-way, two-way, ... and n-way table summaries
  • suppress the printing of cumulative statistics in a table using the NOCUM option
  • print only one table per page using the PAGE option
  • read the values from a two-way table created by the FREQ procedure
  • create two-way (and in general, n-way) tables using the available shortcuts
  • suppress some of the default output in each of the cells of an n-way table using the NOROW, NOCOL, and NOPERCENT options
  • request additional output in each of the cells of an n-way table, such as EXPECTED, DEVIATION, and CELLCHI2
  • print n-way tables in a list format rather than as crosstabulation tables using the LIST and CROSSLIST tables options
  • perform an operation for each level of the BY group using a BY statement
  • treat missing values as non-missing values, and therefore include them in the calculation of the statistics using the MISSING tables option 
  • treat missing values as non-missing values when printing the frequencies, but do not include them in the calculation of the statistics using the MISSPRINT option 
  • create new SAS data sets containing summary statistics of categorical variables using the FREQ procedure
  • suppress printing the n-way crosstabulation using the NOPRINT tables option
  • print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data using the SPARSE tables option
  • invoke statistics table options, such as CHISQ, MEASURES, CMH, ALL, and EXACT

12.1 - A Basic One-Way Table

12.1 - A Basic One-Way Table

By default, the FREQ procedure creates a one-way table that contains the frequency, percent, cumulative frequency, and cumulative percent of every value of every variable in the input data set. Every is italicized with good reason ... the FREQ procedure doesn't care whether the variable is a character variable or a numeric variable. And, if a variable is numeric, the FREQ procedure doesn't care if it is a discrete numeric variable with just a few possible outcomes (number of siblings, say) or a continuous numeric variable with an infinite number of possible outcomes (weight, say). That means then if you rely on the default version of the FREQ procedure, it is possible to create lots and lots and lots of output. That's why we'll skip the default version and will jump right to the more practical version in which you restrict the number of tables SAS creates by using a TABLES statement.

The FREQ procedure takes the following generic form:

PROC FREQ options;
    tables ... /options;
RUN;

The TABLES statement tells SAS the specific frequency table(s) that you want to create. If you don't include a TABLES statement, then SAS creates a one-way frequency table for every variable in your input data set.

As you can see, there are two types of options, namely procedure options and table options. Procedure options, such as the typical "DATA=" option, must follow the PROC FREQ statement. Table options must be specified after a forward slash (/) in the TABLES statement. In either case, you can specify as many options as you would like.

Throughout this lesson, we'll use the ICDB background data set to illustrate the FREQ procedure. Right-click the link to save the data set to a convenient location on your computer.

Example 12.1

The following FREQ procedure illustrates the simplest practical example, namely a one-way frequency table of the variable sex, with no bells or whistles added:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
LIBNAME icdb 'C:\simon\icdb\data';
PROC FREQ data=icdb.back;
   title 'Frequency Count of SEX';
   tables sex;
RUN;

Launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the background data set. Then, run  the program and review the output. You should see something along the lines of this basic one-way frequency table, in which, as promised, SAS reports the frequency, percent, cumulative frequency, and cumulative percent of each value of the sex variable:

Frequency Count of SEX
The FREQ Procedure

sex

Frequency

Percent

Cumulative Frequency

Cumulative Percent

1

56

8.78

56

8.78

2

582

91.22

638

100.00

This output tells us, for example, that 56 or 8.78% of the subjects in the ICDB Study are male (coded as sex = 1).

Example 12.2

Again, by default, SAS outputs frequency counts, percents, cumulative frequencies, and cumulative percents. The NOCUM table option suppresses the printing of the cumulative frequencies and cumulative percentages for one-way frequency tables. The following SAS code illustrates the NOCUM table option:

 PROC FREQ data=icdb.back;
    title 'Frequency Count of SEX: No Cumulative Stats';
    tables sex/nocum;
 RUN;

Frequency Count of SEX: No Cumulative Stats
The FREQ Procedure

sex

Frequency

Percent

1

56

8.78

2

528

91.22

Launch and run  the SAS program. Review the output to convince yourself that indeed the cumulative frequencies and cumulative percentages are not printed in the table. The table contains only the number and percentage of each sex.

In any FREQ procedure, you can specify many variables in a TABLES statement. If the list is long, you may be able to use a shortcut to specify the list of variables. If you specify a TABLES statement using a numbered range of variables, such as:

tables var1-var4;

then SAS will create a one-way frequency table for the four variables named var1, var2, var3, and var4. If instead in your TABLES statement, you specify a range of variables by their position in the data set, such as:

tables sex--race;

then SAS will create a one-way frequency table for every variable that appears between the sex and race variables in the data set, namely in the case of the background data set, sex, state, country, and race. Recall that if you're not sure of the position of the variables in your data set, you can use the VARNUM option of the CONTENTS procedure to determine the position of the variables in a data set. (Incidentally, that is not a typo in the second TABLES statement ... it takes two dashes to specify a range of variables by their position in the data set.)

Rather than specifying many variables in a TABLES statement, you can specify many TABLES statements in a FREQ procedure. However you tell SAS to make multiple tables, you can use the PAGE option to tell SAS to print only one table per page. Otherwise, the FREQ procedure prints multiple tables per page as space permits.

Example 12.3

The following SAS program illustrates the creation of two one-way frequency tables in conjunction with the PAGE option:

 PROC FREQ data=icdb.back page;
    title 'Frequency Count of SEX and RACE';
    tables sex race;
 RUN;

Launch and run  the SAS program. Review the output to convince yourself that indeed SAS creates two one-way frequency tables — one for the categorical variable sex and the other for the categorical variable race. Because the PAGE option was invoked, each table should be printed on a separate page. The first page should contain the frequency table for the sex variable:

Frequency Count of SEX and RACE
The FREQ Procedure

sex

Frequency

Percent

Cumulative Frequency

Cumulative Percent

1

56

8.78

56

8.78

2

528

91.22

638

100.00

and the second page should contain the frequency table for the race variable:

Frequency Count of SEX and RACE
The FREQ Procedure

race

Frequency

Percent

Cumulative Frequency

Cumulative Percent

1

2

0.31

2

0.31

2

7

1.10

9

1.41

3

29

4.55

38

5.96

4

593

92.95

631

98.90

5

3

0.47

634

99.37

6

2

0.31

636

99.69

7

1

0.16

637

99.87

8

1

0.16

638

100.00

Incidentally, you might also want to notice that, not surprisingly, the order in which the variables appear in the TABLES statement determines the order in which they appear in the output.

Example 12.4

As is the case for many SAS procedures, you can use a BY statement to tell SAS to perform an operation for each level of a BY group. The following program tells SAS to create a one-way frequency table for the variable ed_level for each level of the variable sex:

 PROC SORT data=icbd.back out=s_back;
    by sex;
 RUN;
    
PROC FREQ data=s_back;
    title 'Frequency Count of Education Level within Each Level of Sex';
    tables ed_level;
    by sex;
RUN;
 

As is always the case, the SORT procedure merely prepares the background data set for BY-group processing. The SORT procedure tells SAS to sort the icdb.back data set by sex, and to store the results in a new data set called s_back. Then, as you can see, the FREQ procedure is invoked with a BY statement ("by sex") in addition to the TABLES statement ("tables ed_level"). Launch and run  the SAS program. Review the output to convince yourself that SAS creates two one-way frequency tables of education level (ed_level) — one for males (sex= 1):

Frequency Count of Education Level within Each Level of Sex
sex = 1

ed_level

Frequency

Percent

Cumulative Frequency

Cumulative Percent

1

4

7.14

4

7.14

2

7

12.50

11

19.64

3

12

21.43

23

41.07

4

20

35.71

43

76.79

5

13

23.21

56

100.00

and one for females (sex = 2):

Frequency Count of Education Level within Each Level of Sex
sex = 2

ed_level

Frequency

Percent

Cumulative Frequency

Cumulative Percent

1

7

1.20

7

1.20

2

22

3.78

29

4.98

3

220

37.80

249

42.78

4

229

39.35

478

82.13

5

104

17.87

582

100.00

So far, in each of the examples we have looked at, no missing values exist. When they do exist, SAS by default excludes them from your requested frequency tables. Instead, SAS prints the "Frequency Missing" below each table. You can instead opt to use the MISSING tables option, in which you tell SAS to treat missing values as non-missing values and to therefore include them in the calculation of percentages and other statistics. Or you can opt to use the MISSPRINT option to tell SAS to treat missing values as non-missing values when printing the frequencies but do not include them in the calculation of the statistics.

Example 12.5

The following SAS program illustrates the MISSING and MISSPRINT options on the variable state in the background data set:

 PROC FREQ data=icdb.back;
        title 'One-way Table of State: with MISSING Option';
        tables state/missing;
     RUN;
     PROC FREQ data=icdb.back;
        title 'One-way Table of State: with MISSPRINT Option';
        tables state/missprint;
     RUN;
    

Launch and run  the SAS program, and review the resulting output. The first few rows of output from the FREQ procedure with the MISSING tables option should look something like this:

One-way Table of State: with MISSING Option
The FREQ Procedure

sex

Frequency

Percent

Cumulative Frequency

Cumulative Percent

.

42

6.58

42

6.58

1

4

0.63

46

7.21

3

1

0.16

47

7.37

4

5

0.78

52

8.15

As you can see, the first row tells us that 42 subjects did not report the state in which they live. Because the MISSING option was used, SAS also tells us that 42 subjects comprise 6.58% of the subjects in the data set. SAS also includes the 42 subjects in the calculation of the cumulative percentage.

On the other hand, the first few rows of output from the FREQ procedure with the MISSPRINT tables option should look something like this:

One-way Table of State: with MISSPRINT Option
The FREQ Procedure

sex

Frequency

Percent

Cumulative Frequency

Cumulative Percent

.

42

.

.

.

1

4

0.67

4

0.67

3

1

0.17

5

0.84

4

5

0.84

10

1.68

As you can see, the first row again tells us that 42 subjects did not report the state in which they live. In this case, however, because the MISSPRINT option was specified, SAS stops there. That is, SAS does not include the subjects in any of its calculations of the percent, cumulative frequency, or cumulative percent.


12.2 - Two-way and N-way Tables

12.2 - Two-way and N-way Tables

So far, we have only used the FREQ procedure to create one-way frequency tables. However, it is often helpful to create crosstabulation tables in which the frequencies are determined for more than one variable at a time. For example, if we were interested in knowing how the percentage of voters favoring Barack Obama differs for various age groups, we'd need to create a two-way crosstabulation table between the two categorical variables candidate (Obama, McCain) and agegroup (18-29, 30-44, 45-54, 55-70, more than 70). We'll investigate such tables in this section.

Example 12.6

The following FREQ procedure illustrates the simplest example of telling SAS to create a two-way table, for the variables sex and ed_level, with no bells and whistles added:

PROC FREQ data=icdb.back;
   title 'Crosstabulation of Education Level and Sex';
   tables ed_level*sex;
RUN;

 
Crosstabulation of Education Level and RACE
The FREQ Procedure

Frequency

Percent

Row Pct

Col Pct


Table of ed_level by sex

ed_level

sex

1

2

Total

1

4

0.63

36.36

7.14

7

1.10

63.64

1.20

11

1.72

 
 

2

7

1.10

24.14

12.50

22

3.45

75.86

3.78

29

4.55

 
 

3

12

1.88

5.17

21.43

220

34.48

94.83

37.80

232

36.36

 
 

4

20

3.13

8.03

35.71

229

35.89

91.97

39.35

249

39.03

 
 

5

13

2.04

11.11

23.21

104

16.30

88.89

17.87

117

18.34

 
 

Total

56

8.78

582

91.22

638

100.00

As you can see, to tell SAS to create a two-way table of ed_level and sex, we merely use an asterisk (*) to join the two variables in the TABLES statement.

Launch and run  the SAS program. Review the output to convince yourself that SAS created the requested two-way table. In general, the values of the variable appearing before the asterisk form the rows of the table, and the values of the variable appearing after the asterisk form the columns of the table. In this case, since ed_level appears before the asterisk in the TABLES statement, its values form the rows of the table. And, since sex appears after the asterisk, its values form the columns of the table.

When SAS creates two-way tables, each cell of the table contains, by default, the cell frequency, the cell percentage of the total frequency, the cell percentage of the row frequency, and the cell percentage of the column frequency. You might want to review the numbers in each of the cells of the table to make sure you understand what is what. In the upper left-hand corner of the table, SAS always prints a guide to the numbers appearing in each of the cells in the table. Here, SAS tells us that the first number in cell (i,j) is the number of subjects of ed_level i and sex j; the second number in cell (i,j) is the percentage of subjects of ed_level i and sex j; the third number in cell (i,j) is the percentage of subjects who are sex j given that they are ed_level i; and the fourth number in cell (i,j) is the percentage of subjects who are ed_level i given that they are sex j. For example, for the cell in which ed_level = 4 and sex = 2, SAS tells us that: 229 of the subjects in the data set are ed_level 4 and sex 2; 35.89% of the subjects in the data set are ed_level 4 and sex 2; 91.97% of the subjects in the data set who are ed_level 4 are sex 2; and 39.35% of the subjects in the data set who are sex 2 are ed_level 4.

A little note about shortcuts. If you have many different two-way tables to create, you can use a variety of shortcuts. For example, the TABLES statement:

tables a*(b c);

tells SAS to create a two-way table between variables a and b (a*b) and a two-way table between variables a and c (a*c). The TABLES statement:

tables (a b)*(c d);

tells SAS to create four two-way tables, namely: a*c, b*c, a*d, and b*d. The TABLES statement:

tables (a b c)*d;

tells SAS to create three two-way tables, namely: a*d, b*d, and c*d.

Example 12.7

For a frequency analysis of more than two variables, we can use the FREQ procedure to create n-way crosstabulation tables. In that case, a series of two-way tables is created, with a table for each level of the other variable(s). The following program creates a three-way table of sex, job_chng, and ed_level:

PROC FREQ data=icdb.back;
   title '3-way Table of Sex, Job Change, and Ed. Level';
   tables sex*job_chng*ed_level;
RUN;

As you can see, to tell SAS to create a three-way table of sex, job_chng, and ed_level, we use an asterisk (*) to join the three variables in the TABLES statement. The order of the variables is important. In n-way tables, the last two variables of the TABLES statement become the rows and columns of the two-way tables. Variables that precede the last two variables in the TABLES statement stratify the crosstabulation tables. So, in this case, we should expect SAS to create two two-way tables of job_chng and ed_level, one for when sex = 1 and one for when sex = 2.

Launch and run  the SAS program, and review the output to convince yourself that SAS created the three-way table as described. You should see that, indeed, SAS created one two-way table of job_chng and ed_level for when sex = 1:

Crosstabulation of Sex, Job Change, and Education Level
The FREQ Procedure

Table 1 of job_chng by ed_level
Controlling for sex=1

Frequency

 

Percent

 

Row Pct

 

Col Pct

1

2

3

4

5

Total

0

4

6

11

15

12

48

7.84

11.76

21.57

29.41

23.53

94.12

8.33

12.50

22.92

31.25

25.00

 

100.00

100.00

100.00

88.24

92.31

 

1

0

0

0

2

1

3

0.00

0.00

0.00

3.92

1.96

5.88

0.00

0.00

0.00

66.67

33.33

 

0.00

0.00

0.00

11.76

7.69

 

Total

4

6

11

17

13

51

7.84

11.76

21.57

33.33

25.49

100.00

Frequency Missing = 107

and one two-way table of job_chng and ed_level for when sex = 2:

table 2 of job_chng by ed_level
Controlling for sex=2

job_chng ed_level
 

Frequency

 

Percent

 

Row Pct

 

Col Pct

1

2

3

4

5

Total

0

4

14

148

163

74

403

0.84

2.95

31.16

34.32

15.58

84.84

0.99

3.47

36.72

40.45

18.36

 

80.00

82.35

85.06

83.16

89.16

 

1

1

3

26

33

9

72

0.21

0.63

5.47

6.95

1.89

15.16

1.39

4.17

36.11

45.83

12.50

 

20.00

17.65

14.94

16.84

10.84

 

Total

5

17

174

196

83

475

1.05

3.58

36.63

41.26

17.47

100.00

Frequency Missing = 107

It probably goes without saying that, in general, n-way tables can generate lots of output.


12.3 - Cell Statistics

12.3 - Cell Statistics

As you now know, when SAS creates two-way tables, each cell of the table contains, by default, the cell frequency, the (joint) cell percentage, the conditional row percentage, and the conditional column percentage. As usual, you don't have to accept the default ... you can tell SAS to suppress some of the default statistics it displays ... or you can tell SAS to display alternative statistics.

For n-way crosstabulations, you can suppress the default output in each of the cells by using any (or all) of the following table options:

  • NOFREQ suppresses the printing of the cell frequencies
  • NOROW suppresses the printing of the row percentages
  • NOCOL suppresses the printing of the column percentages, and
  • NOPERCENT suppresses the printing of the (joint) cell percentages

Example 12.8

The following SAS program illustrates the NOROW, NOCOL, and NOPERCENT table options:

 PROC FREQ data=icdb.back;
    title 'Crosstabulation of SEX and RACE: No percents';
    tables race*sex/norow nocol nopercent;
 RUN;

Crosstabulation of SEX and RACE: No percents
The FREQ Procedure
Table of race by sex

Frequency

sex

race

1

2

Total

1

0

0

2

2

2

3

7

3

1

28

29

4

51

542

593

5

0

3

3

6

0

2

2

7

1

0

1

8

0

1

1

Total

56

582

638

Launch and run  the SAS program. Review the output to convince yourself that SAS did indeed suppress the default row percentages, column percentages, and cell percentages. Note, too, that SAS changed the guide in the upper left-hand corner of the table to reflect the new situation — the guide now tells us that the lone number reported in each cell is the cell frequency.

Instead of suppressing output in an n-way table, we can request additional output in each of the cells by using the following options:

  • EXPECTED, to print the expected cell frequencies under the null hypothesis of independence
  • DEVIATION, to print the deviation of the cell frequency from the expected cell frequency (under the null hypothesis of independence)
  • CELLCHI2, to print each cell's contribution to the total chi-squared statistic

That is, CELLCHI2 is defined as:

\(\text{CELLCHI2} = \dfrac{\left(\text{frequency} - \text{expected}\right)^2}{\text{expected}}\)

Example 12.9

In creating the two-way table between race and sex, the following FREQ procedure requests that the EXPECTED and CELLCHI2 statistics be printed, while at the same time suppressing the printing of the joint, row, and column percentages:

 PROC FREQ data=icdb.back;
   title 'Crosstabulation of SEX and RACE: With Expecteds';
   tables race*sex/expected cellchi2 norow nocol nopercent;
 RUN;

Table of race by sex

race

sex

Frequency

 

Expected

 

Cell Chi-Square

1

2

Total

1

0

2

2

0.1755

1.8245

 

0.1755

0.0169

 

2

3

4

7

0.6144

6.3856

 

9.2624

0.8912

 

3

1

28

29

2.5455

26.455

 

0.9383

0.0903

 

4

51

542

593

52.05

540.95

 

0.0212

0.002

 

5

0

3

3

0.2633

2.7367

 

0.2633

0.0253

 

6

0

2

2

0.1755

1.8245

 

0.1755

0.0169

 

7

0

2

2

0.0878

0.9122

 

9.4806

09.122

 

8

0

1

1

0.0878

0.9122

 

0.0878

0.0084

 

Total

56

582

638

Launch and run  the SAS program. Review the output to convince yourself that SAS did indeed suppress the printing of the joint, row, and column percentages, while adding the expected cell frequencies, as well as the cell's contribution to the chi-square statistic, to the cell output. Note, again, that SAS changed the guide in the upper left-hand corner of the table to reflect the new situation — the guide now tells us that the second number in each cell is the expected cell frequency and the third number is the cell's contribution to the chi-square statistic.


12.4 - Changing the Table Format

12.4 - Changing the Table Format

When three or more variables are specified, the multiple levels of n-way tables can produce lots of output. Such bulky and complex crosstabulations are often easier to read as a continuous list. To tell SAS to generate list output for crosstabulations rather than the default tables, simply use the LIST tables option.

Example 12.10

The following FREQ procedure illustrates the LIST option:

PROC FREQ data=icdb.back;
    title 'Crosstabulation of SEX and RACE: In List Format';
    tables sex*race/list;
 RUN;

Crosstabulation of SEX and RACE: in List Format
The FREQ Procedure

sex

race

Frequency

Percent

Cumulative Frequency

Cumulative Percent

1

2

3

0.47

3

0.47

1

3

1

0.16

4

0.63

1

4

51

7.99

55

8.62

1

7

1

0.16

56

8.78

2

1

2

0.31

58

9.09

2

2

4

0.63

62

9.72

2

3

28

4.39

90

14.11

2

4

542

84.95

632

99.06

2

5

3

0.47

635

99.53

2

6

2

0.31

637

99.84

2

8

1

0.16

638

100.00

Launch and run  the SAS program. Review the output to convince yourself that rather than creating a two-way table, SAS prints the requested two-way table as a list. I tend to prefer the LIST format for such cross-tabulations, as long as I'm not in need of the (conditional) row and column percentages (which, as you can see, are not reported in LIST format).

Incidentally, the LIST option cannot be used when the TABLES statement requests statistical tests or measures of association to be included in the output. (We'll be discussing such tests and measures later in this lesson.) Also, expected cell frequencies are not printed when the LIST option is specified, even if the EXPECTED option is invoked.

An alternative to the LIST option is the CROSSLIST table option. When you add the CROSSLIST option to your TABLES statement, SAS displays crosstabulation tables in the standard Output Delivery System (ODS) column format. The benefit of doing so probably won't make much sense until we learn about ODS in Stat 481. The option creates a table that has a table definition that you can customize by using the TEMPLATE procedure. For now, let's just be satisfied with looking at an example to see what the output looks like when you use the CROSSLIST option.

Example 12.11

The following FREQ procedure illustrates the CROSSLIST option:

 PROC FREQ data=icdb.back;
    title 'Crosstabulation of SEX and RACE: In Crosslist Format';
    tables sex*race/crosslist;
 RUN;

Crosstabulation of SEX and RACE: In Crosslist Format
The FREQ Procedure

Table of sex by race

sex

race

Frequency

Percent

Row
Percent

Column
Percent

1

1

0

0.00

0.00

0.00

 

2

3

0.47

5.36

42.86

 

3

1

0.16

1.79

3.45

 

4

51

7.99

91.07

8.60

 

5

0

0.00

0.00

0.00

 

6

0

0.00

0.00

0.00

 

7

1

0.16

1.79

100.00

 

8

0

0.00

0.00

0.00

 

Total

56

8.78

100.00

 

2

1

2

0.31

0.34

100.00

 

2

4

0.63

0.69

57.14

 

3

28

4.39

4.81

96.55

 

4

542

84.95

93.13

91.40

 

5

3

0.47

0.52

100.00

 

6

2

0.31

0.34

100.00

 

7

0

0.00

0.00

0.00

 

8

1

0.16

0.17

100.00

 

Total

582

91.22

100.00

 

Total

1

2

0.31

 

100.00

 

2

7

1.10

 

100.00

 

3

29

4.55

 

100.00

 

4

593

92.95

 

100.00

 

5

3

0.47

 

100.00

 

6

2

0.31

 

100.00

 

7

1

0.16

 

100.00

 

8

1

0.16

 

100.00

 

Total

638

100.00

  

Launch and run  the SAS program, and review the output to familiarize yourself with the appearance of the output when you use the CROSSLIST option.


12.5 - Creating Output Data Sets

12.5 - Creating Output Data Sets

Just as the UNIVARIATE, MEANS and SUMMARY procedures can be used to create new SAS data sets containing summary statistics of numeric variables, the FREQ procedure can be used to create new SAS data sets containing summary statistics of categorical variables. This can be a very useful tool when manipulating and subsequently analyzing data.

Example 12.12

The following FREQ procedure tells SAS to create an output data set that contains the counts and percentages for each combination of the variables sex and race:

 PROC FREQ data=icdb.back;
     tables sex*race/out=sexfreq noprint;
 RUN;
 
 PROC PRINT;
    title 'Output Dataset: SEXFREQ';
 RUN;

Output Dataset: SEXFREQ

Obs

sex

race

COUNT

PERCENT

1

1

2

3

0.4702

2

1

3

1

0.1567

3

1

4

51

7.9937

4

1

7

1

0.1567

5

2

1

2

0.3135

6

2

2

4

0.6270

7

2

3

28

4.3887

8

2

4

542

84.9530

9

2

5

3

0.4702

10

2

6

2

0.3135

11

2

8

1

0.1567

The NOPRINT tables option tells SAS to suppress printing the two-way crosstabulation of sex*race. Instead, the results are dumped to the temporary data set sexfreq that is specified in the OUT= tables option. The data set in the OUT= option becomes the current data set. Therefore, since the PRINT procedure does not have a DATA= option to tell SAS the name of the data set to print, SAS prints the current data set, namely sexfreq.

Launch and run  the SAS program. Review the output to convince yourself that the sexfreq data set contains one record for each combination of sex and race that occurs in the data set. Note that the data set contains four variables, sex, race, COUNT (whose value is the number of subjects in the data set of that sex and race), and PERCENT (whose value is the percentage of subjects in the data set of that sex and race).

Example 12.13

The SPARSE tables option tells SAS to print information about all possible combinations of levels of the variables in the table request, even when some combinations of levels do not occur in the data. This option affects printouts under the LIST option and output data sets only. The following SAS code illustrates the use of the SPARSE option in the creation of an output data set called sexfreq:

PROC FREQ data=icdb.back;
     tables sex*race/out=sexfreq noprint sparse;
 RUN;
 
 PROC PRINT;
    title 'Output Dataset: SEXFREQ with SPARSE option';
 RUN;

Output Dataset: SEXFREQ with SPARSE option

Obs

sex

race

COUNT

PERCENT

1

1

1

0

0.0000

2

1

2

3

0.4702

3

1

3

1

0.1567

4

1

4

51

7.9937

5

1

5

0

0.0000

6

1

6

0

0.0000

7

1

7

1

0.1567

8

1

8

0

0.0000

9

2

1

2

0.3135

10

2

2

4

0.6270

11

2

3

28

4.3887

12

2

4

542

84.9530

13

2

5

3

0.4702

14

2

6

2

0.3135

15

2

7

0

0.0000

16

2

8

1

0.1567

The only difference between this code and the previous code is the inclusion of the SPARSE option. Launch and run  the SAS program. Review the output to convince yourself that the sexfreq data set now contains one record for each possible combination of sex and race regardless of whether or not the combination occurs in the data set. That is, since there are 2 sexes and 8 races, the sexfreq data set contains 16 records — one for each of the combinations. For example, there is now a record in the sexfreq data set corresponding to sex = 1 and race = 1 even though there are no subjects in the data set of that sex and race combination (COUNT = 0).


12.6 - Table Statistics

12.6 - Table Statistics

There are a few possible statistical table options available in the FREQ procedure. Unfortunately, we don't have the luxury of digressing and discussing in this course when it is appropriate to use each of the statistics. We merely make ourselves aware of their existence in SAS should we need them in analyzing our data in the future. The options include:

  • CHISQ, which requests a chi-square test of homogeneity or independence. Reported measures include the phi coefficient, the contingency coefficient, and Cramer's V. For 2×2 tables, Fisher's exact test is also included.
  • MEASURES, which requests a basic set of measures of association and their standard errors. Measures include Pearson and Spearman correlation coefficients, gamma, Kendall's tau-b, Stuart's tau-c, Somer's D, lambda, and uncertainty coefficients. For 2×2 tables, odds ratios and risk ratios (with their confidence intervals) are reported.
  • CMH, which requests Cochran-Mantel-Haenszel statistics, which test for association between a row and column variable after adjusting for all of the other variables in the tables statement. For 2×2 tables, Breslow's test for homogeneity of odds ratios is reported as well.
  • ALL, which requests all of the tests and measures provided by the CHISQ, MEASURE, and CMH options.
  • EXACT, which requests Fisher's Exact Test for tables larger than 2×2 tables. Note that this can take painfully long to compute for large tables.

Of course, you can always refer to the SAS Help and Documentation for the technical details of the statistical tests and measures.

Example 12.14

For the remaining examples in this lesson, we need an analysis data set with which to work. The analysis data set (click to save!) that we'll use contains (just!) four variables pulled from the ICDB background data set (with which we've already been working!), the ICDB cystoscopy data set, and the ICDB symptoms data set. The following program merely displays the variable names of, and the first 15 observations in, the analysis data set:

 OPTIONS NOFMTERR;
 PROC CONTENTS data = icdb.analysis position;
   title 'The Analysis data set';
 RUN;
 PROC PRINT data = icdb.analysis (OBS = 15);
   title 'The Analysis data set';
 RUN;

Launch and run  the SAS program. From the CONTENTS procedure output, you should see that the analysis data set contains 638 observations and the following four variables:

Variables in Creation Order

#

Variable

Type

Len

Format

Informat

Label

1

subj

Num

8

11.

11.

patient ID number

2

CYST_HB

Num

8

   

3

SYM_1

Num

8

SYMPTSEV.

4.

severity of urinary symptoms

4

ctr

Num

8

   

Besides the variable names, this list tells us that the sym_1 variable has a permanent format called symptsev. associated with it. You might recall that if we don't have access to the permanent formats catalog — which we don't — then we'll run into trouble when trying to use the sym_1 variable. That's why we specified the NOFMTERR option before trying to use the analysis data set. The NOFMTERR option allows us to use the analysis data set. We just can't take advantage of its variables' permanent formats.

Perhaps you'd appreciate an explanation of what the four variables are:

  • subj: subject ID number
  • ctr: clinical center number (1, 2, 3, 4, 5)
  • cyst_hb: indicates whether or not a hydrodistention and/or biopsy was performed (0 = a local cystoscopy with no hydrodistention or biopsy, 1 = hydrodistention and biopsy, and 2 = hydrodistention only)
  • sym_1: which indicates the subject's general assessment of the severity of his/her urinary symptoms (1 = not severe at all to 5 = extremely severe)

It might also help to review the first 15 observations just to get a feel for what the data looks like:

The Analysis data set

Obs

subj

CYST_HB

SYM_1

ctr

1

110027

1

4

1

2

110029

.

4

1

3

110039

.

3

1

4

110040

.

3

1

5

110045

.

3

1

6

11049

.

3

1

7

110051

.

4

1

8

110052

1

4

1

9

110053

.

3

1

10

110055

.

3

1

11

110057

.

.

1

12

110058

1

4

1

13

110059

.

4

1

14

110060

.

3

1

15

110062

1

4

1

Now, as promised, let's use the analysis data set to illustrate some of the statistical tests and measures of association you can request from SAS while using the FREQ procedure.

Example 12.15

Some clinical centers may be more likely to perform a cystoscopy (which is a fairly invasive procedure) on their patients than other clinical centers. One could imagine this happening for a variety of reasons. For instance, one clinical center might have more severe patients thereby justifying more invasive procedures. Or perhaps the patients attending a particular clinical center might be better off financially and therefore more willing to pay for additional procedures. In any case, suppose we are interested in testing whether or not there is an association between performing a cystoscopy (cyst_hb) and clinical center (ctr). A chi-square test between the two variables would help us answer our research question. The following FREQ procedure illustrates the use of the CHISQ tables option in order to obtain the value of the chi-square test statistic and its associated P-value (as well as a few other useful statistics and P-values):

 PROC FORMAT;
    value cystfmt 0     = 'Local'
                  1     = 'Both'
                  2     = 'Hydro'
                  OTHER = 'Nothing';
 RUN;
 PROC FREQ data=icdb.analysis;
   title 'Chi-square Test of Hospital and Cystoscopy Procedure: CHISQ';
   format cyst_hb cystfmt.;
   tables ctr*cyst_hb/nopercent nocol missing chisq;
 RUN;

Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES
The FREQ Procedure
Table of ctr by CYST_HB

ctr

CYST_HB

Frequency

 

Row Pct

Nothing

Local

Both

Hydro

Total

1

96

0

10

0

106

90.57

0.00

9.43

0.00

 

2

77

0

14

0

52

48.73

5.70

44.30

1.27

 

3

38

0

14

0

52

73.08

0.00

26.92

0.00

 

4

101

16

65

5

187

54.01

8.56

34.76

2.67

 

5

86

1

41

7

135

63.70

0.74

30.37

5.19

 

Total

398

26

200

14

638

Statistics for Table of ctr by CYST_HB

Statistics

DF

Value

Prob

Chi-Square

12

77.2068

<.001

Likelihood Ratio Chi-Square

12

89.2250

<.001

Mantel-Haenszel Chi-Square

1

.

.

Phi Coefficient

 

0.3479

 

contingency Coefficient

 

0.3286

 

Cramer's V

 

0.2008

 

WARNING: 35% of the cell's have expected counts less then 5. Chi-square may not be a valid test

Sample Size 638

Of course, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the cyst_hb variables more meaningful to us.

The TABLES statement first requests a two-way table between the (row) variable ctr and the (column) variable cyst_hb. Because we are interested in including the missing values ("Nothing") in the analysis, we include the MISSING tables option. When doing this kind of analysis, I like to use the NOPERCENT and NOCOL tables options to help declutter the two-way frequency table. Finally, the CHISQ tables option tells SAS to calculate the chi-square statistics and its P-value (as well as a few other statistics).

Launch and run  the SAS program, and review the resulting output. You should first see a two-way table (with five rows for ctr and four columns for cyst_hb). Then, you should see a list of six different statistics, of which the first one is the chi-square test statistic. Here, the value of the statistic is 77.2 (rounded) with 12 degrees of freedom (DF) and a P-value that is less than 0.0001. (Assuming that there is enough non-missing data and therefore that the chi-square test is valid), the P-value tells us that it is highly unlikely that we would observe the data we did under the assumption that ctr and cyst_hb are not associated. Therefore, we conclude that ctr and cyst_hb are associated.

Example 12.16

If we are interested in quantifying the association between clinical center and performing a cystoscopy, then we would want to take advantage of the MEASURES tables option. The following FREQ procedure illustrates the use of the MEASURES tables option to obtain a basic set of measures of association and their standard errors:

PROC FORMAT;
    value cystfmt 0     = 'Local'
                  1     = 'Both'
                  2     = 'Hydro'
                  OTHER = 'Nothing';
 RUN;
PROC FREQ data=icdb.analysis;
    title 'Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES';
	format cyst_hb cystfmt.;
    tables ctr*cyst_hb/nopercent nocol missing measures;
 RUN;

Again, the FORMAT procedure and the FORMAT statement that appears in the FREQ procedure are used just to make the displayed values of the cyst_hb variables more meaningful to us.

Launch and run  the SAS program, and review the resulting output. You should again first see a two-way table (with five rows for ctr and four columns for cyst_hb):

Chi-square Test of Hospital and Cystoscopy Procedure: MEASURES
The FREQ Procedure
Table of ctr by CYST_HB

ctr

CYST_HB

Frequency

 

Row Pct

Nothing

Local

Both

Hydro

Total

1

4

7

  

11

7.14

1.20

   

2

7

22

  

29

12.50

3.78

   

3

12

220

  

232

21.43

37.80

   

4

20

229

  

249

35.71

39.35

   

5

13

104

  

117

23.21

17.87

   

Total

8.78

91.22

  

100.00

Then, you should see a list of thirteen different statistics lumped into five groups:

Statistics for Table of ctr by CYST_HB

Statistic

Value

ASE

Gama

0.1697

0.0534

Kendall's Tau-c

0.1071

0.0339

Stuart's Tau-c

0.0897

0.0286

   

Somers' D C|R

0.0870

0.0276

Somers' D R|C

0.1318

0.0417

   

Pearson Correlation

.

.

Spearman Correlation

0.1206

0.0381

   

Lambda Asymmetric C|R

0.0000

0.0000

Lambda Asymmetric R|C

0.0155

0.0267

Lambda Symmetric

0.0101

0.0175

   

Uncertainty Coefficient C|R

0.0802

0.0133

Uncertainty Coefficient R|C

0.0455

0.0079

Uncertainty Coefficient Symmetric

0.0581

0.0099

The first statistic is Gamma, the second is Kendall's Tau-b, ..., and the last is Uncertainty Coefficient Symmetric. The column labeled "Value" is the value calculated for the given statistic for this data set, and the column labeled "ASE" is the calculated (asymptotic) standard error of the statistic.

Example 12.17

The Cochran-Mantel-Haenszel test allows us to test for the association between two categorical variables while adjusting for a third categorical variable. To request that SAS performs such a test, we must use the CMH tables option. The following CMH tables option requests the Cochran-Mantel-Haenszel to test for association between the two variables cyst_hb and sym_1 while adjusting for the third variable ctr:

OPTIONS NOFMTERR;
PROC FORMAT;
    value cystfmt 0     = 'Local'
                  1     = 'Both'
                  2     = 'Hydro'
                  OTHER = 'Nothing';
 RUN;
PROC FREQ data=icdb.analysis;
   title 'Chi-square Test of Hospital and Cystoscopy Procedure: CMH';
   title2 'Adjusting for Ctr';
   format cyst_hb cystfmt.;
   tables ctr*cyst_hb*sym_1/nopercent nocol cmh;
 RUN;

As always, we put the stratifying variable — in this case, ctr — in the first position of the tables statement. Then, we put the two variables between which we are testing for association — in this case, cyst_hb and sym_1 — in the second and third positions. Again, the NOPERCENT and NOCOL tables options are used to help declutter the resulting frequency tables. Finally, the CMH tables option gives us the Cochran-Mantel-Haenszel test for association.

Launch and run  the SAS program, and review the resulting output. You should first see a set of five two-way tables (one for each level of ctr). Here's what the table looks like for ctr = 2:

Table 2 of CYST_HB by SYM_1
Controlling for ctr=2

CYST_HB

SYM_1 (severity of urinary symptoms)

Frequency

 

Row Pct

1

2

3

4

5

Total

Local

1

2

2

3

1

9

11.11

22.22

22.22

33.33

11.11

 

Both

1

5

22

30

12

70

1.43

7.14

31.43

42.86

17.14

 

Hydro

0

0

0

2

0

2

0.00

0.00

0.00

100.00

0.00

 

Total

2

7

24

35

13

81

Then, you should see a set of three Cochran-Mantel-Haenszel statistics — the first one is labeled Nonzero Correlation, the second one is labeled Row Mean Scores Differ, and the third one is labeled General Association:

Chi-square test of Hospital and Cystoscopy Procedure: CMH
Adjusting for Ctr
The FREQ Procedure
Summary Statistics for CYST_HB by SYM_1
Controlling for ctr
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistics

Alternative Hypothesis

DF

Value

Prob

1

Nonzero Correlation

1

2.6013

0.1068

2

Row Mean Scores Differ

2

2.9560

0.2281

3

General Association

8

6.9245

0.5448

The column labeled "DF" contains the given statistic's degrees of freedom, the column labeled "Value" contains the value calculated for the given statistic for the data set, and the column labeled "Prob" contains the statistic's P-value.


12.7 - Summary

12.7 - Summary

In this lesson, we learned about various features and options of the FREQ procedure.

The homework for this lesson will give you practice using these features and options.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility