# Lesson 11: Summarizing Continuous Data

## Overview

As you now know, in addition to creating pretty reports, the REPORT procedure can be used to calculate some basic descriptive statistics. A number of other SAS procedures, however, are designed specifically to produce a variety of descriptive statistics and to display them in meaningful reports. The four procedures I have in mind are the MEANS, SUMMARY, UNIVARIATE, and FREQ procedures.

The FREQ procedure is used to summarize discrete data values, and therefore can be used to calculate summary statistics such as the percentage of people with blue eyes and the number of elm trees succumbing to Dutch elm disease. We'll learn about the FREQ procedure in the next (and final!) lesson.

The MEANS, SUMMARY, and UNIVARIATE procedures are used to summarize continuous numeric values, and therefore can be used to calculate statistics, such as mean height, median salary, and minimum mileage. We'll learn about these three procedures in this (the next to final!) lesson.

We'll work mostly with the MEANS procedure. Then, since the SUMMARY and UNIVARIATE procedures share many of the MEANS procedure's options and statements, we'll spend less time on them. The greatest difference among the three procedures is that the UNIVARIATE procedure calculates a few additional statistics not available in the MEANS and SUMMARY procedures. If you do not need those additional statistics, however, it is much more efficient to use the MEANS or SUMMARY procedure.

All three of the procedures take the following generic form:

```
PROC PROCNAME options;
statement1;
statement2;
etc;
RUN;
```

where, not surprisingly, PROCNAME stands for the name of the procedure, and is therefore either MEANS, SUMMARY, or UNIVARIATE.

## Objectives

Upon completing this lesson, you should be able to use the three procedures that are available in SAS (MEANS, SUMMARY, and UNIVARIATE) to compute various basic descriptive statistics on the numeric variables in a data set. In particular, you should be able to:

- use the VAR statement to tell SAS which numeric variables to analyze
- use the various statistic keywords to tell SAS which summary statistics to calculate
- use the MAXDEC= and FW= options to tell SAS how to format the report containing the summary statistics
- use the NOPRINT option of the MEANS procedure to suppress printing of the default report
- use the PRINT option of the SUMMARY procedure to generate a report containing the summary statistics
- use the BY statement to tell SAS to perform a separate analysis for each BY-group created by the variables appearing in the BY statement
- use the CLASS statement in the MEANS and SUMMARY procedures to tell SAS to form subgroups before calculating summary statistics
- use the OUTPUT statement to create a data set containing summary statistics rather than the standard printed output
- use the MEANS and PLOT procedures to create a quick-and-dirty interaction plot
- use the NORMAL option of the UNIVARIATE procedure to tell SAS to compute four different "test for normality" statistics
- use the PLOT option of the UNIVARIATE procedure to tell SAS to create a histogram, boxplot, and a normal probability plot
- use the ID statement in the UNIVARIATE procedure to tell SAS to use the values of the variable indicated in the ID statement to indicate the five largest and five smallest observations

# 11.1 - The MEANS and SUMMARY Procedures

In this section, we'll learn the syntax of the simplest MEANS and SUMMARY procedures, as well as familiarize ourselves with the output they generate.

## Example 11.1

Throughout our investigation of the MEANS, SUMMARY, and UNIVARIATE procedures, we'll use the hematology data set arising from the ICDB Study. The following program tells SAS to display the contents, and print the first 15 observations, of the data set:

```
OPTIONS PS = 58 LS = 80 NODATE NONUMBER;
LIBNAME icdb 'C:\Simon\Stat480WC\fa08\11means\sasndata';
PROC CONTENTS data = icdb.hem2 position;
RUN;
PROC PRINT data = icdb.hem2 (OBS = 15);
RUN;
```

##### First 15 Observations of icdb.hem2

Obs | subj | hosp | wbc | rbc | hemog | hcrit | mcv | mch | mchc |
---|---|---|---|---|---|---|---|---|---|
1 | 110027 | 11 | 7.5 | 4.38 | 13.8 | 40.9 | 93.3 | 31.5 | 33.7 |
2 | 11027 | 11 | 7.6 | 5.20 | 15.2 | 45.8 | 88.0 | 29.2 | 33.1 |
3 | 110039 | 11 | 7.5 | 4.33 | 13.1 | 39.4 | 91.0 | 30.2 | 33.2 |
4 | 110040 | 11 | 8.3 | 4.52 | 12.4 | 38.1 | 84.2 | 27.4 | 32.5 |
5 | 110045 | 11 | 8.9 | 4.72 | 14.6 | 42.7 | 90.4 | 30.9 | 34.1 |
6 | 110049 | 11 | 6.2 | 4.71 | 13.8 | 41.7 | 88.5 | 29.2 | 33.0 |
7 | 110051 | 11 | 6.4 | 4.56 | 13.0 | 37.9 | 83.1 | 28.5 | 34.3 |
8 | 110052 | 11 | 7.1 | 3.69 | 12.5 | 35.6 | 97.2 | 33.8 | 33.8 |
9 | 110053 | 11 | 7.4 | 4.47 | 14.4 | 43.6 | 97.2 | 32.2 | 33.0 |
10 | 110055 | 11 | 6.1 | 4.34 | 12.8 | 38.2 | 88.1 | 29.6 | 33.6 |
11 | 110057 | 11 | 9.5 | 4.70 | 13.4 | 40.5 | 86.0 | 28.4 | 33.0 |
12 | 110058 | 11 | 6.5 | 3.76 | 11.6 | 34.2 | 91.0 | 30.7 | 33.8 |
13 | 110059 | 11 | 7.5 | 4.29 | 12.3 | 36.8 | 85.7 | 28.6 | 33.4 |
14 | 110060 | 11 | 7.6 | 4.57 | 13.8 | 42.0 | 91.8 | 30.1 | 32.8 |
15 | 110062 | 11 | 4.6 | 4.87 | 13.9 | 42.9 | 88.2 | 28.5 | 32.3 |

First, save the hematology data set to a convenient location on your computer. Then, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Finally, run the program. You may recall that the CONTENTS procedure's POSITION option tells SAS to display the contents of the data set in the order in which the variables appear in the data set. Therefore, you should see output that looks something like this:

##### The CONTENTS Procedure

###### Variables in Creation Order

\# | Variable | Type | Len |
---|---|---|---|
1 | subj | Num | 8 |
2 | hosp | Num | 8 |
3 | wbc | Num | 8 |
4 | rbc | Num | 8 |
5 | hemog | Num | 8 |
6 | hcrit | Num | 8 |
7 | mcv | Num | 8 |
8 | mch | Num | 8 |
9 | mchc | Num | 8 |

The first two variables, `subj` and `hosp`, tell us the subject number and at what hospital the subject's data were collected. The remaining variables, `wbc`, `rbc`, `hemog`, ... are the blood data variables of most interest. For example, the variables `wbc` and `rbc` contain the subject's white blood cell and red blood cell counts, respectively. The really important thing to note when reviewing the output is that all of the blood data variables are continuous numeric variables, which lend themselves perfectly to a descriptive analysis using the MEANS procedure.

## Example 11.2

The MEANS procedure can include many statements and options for specifying the desired statistics. For the sake of simplicity, we'll start out with the most basic form of the MEANS procedure. The following program simply tells SAS to display basic summary statistics for each numeric variable in the *icdb.hem2* data set:

```
PROC MEANS data = icdb.hem2;
RUN;
```

##### The MEANS Procedure

Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
subj | 635 | 327199.50 | 144410.20 | 10027.00 | 520098.00 |
hosp | 635 | 32.7133858 | 14.4426330 | 11.0000000 | 52.0000000 |
wbc | 635 | 7.1276850 | 1.9019097 | 3.0000000 | 14.2000000 |
rbc | 635 | 4.4350079 | 0.3941710 | 3.1200000 | 5.9500000 |
hemog | 635 | 13.4696063 | 1.9019097 | 3.0000000 | 14.2000000 |
hcrit | 635 | 39.4653543 | 3.1623819 | 29.7000000 | 51.4000000 |
mcv | 635 | 89.1184252 | 4.5190963 | 65.0000000 | 106.0000000 |
mch | 634 | 30.4537855 | 1.7232248 | 22.0000000 | 37.0000000 |
mchc | 634 | 34.1524290 | 0.7562054 | 31.6000000 | 36.7000000 |

Launch and run the SAS program, and review the output to familiarize yourself with the summary statistics that the MEANS procedure calculates by default. As you can see, in its most basic form, the MEANS procedure prints N (the number of nonmissing values), the mean, the standard deviation, and the minimum and maximum values of every numeric variable in the data set.

In most cases, you probably don't want SAS to calculate summary statistics for every numeric variable in your data set. Instead, you'll probably just want to focus on a few important variables. For our hematology data set, for example, it doesn't make much sense for SAS to calculate summary statistics for the `subj` and `hosp` variables. After all, how does it help us to know that the average `subj` number is 327199.5?

## Example 11.3

The following program uses the MEANS procedure's VAR statement to restrict SAS to summarizing just the seven blood data variables in the *icdb.hem2* data set:

```
PROC MEANS data = icdb.hem2;
var wbc rbc hemog hcrit mcv mch mchc;
RUN;
```

##### The MEANS Procedure

Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
wbc | 635 | 7.1276850 | 1.9019097 | 3.0000000 | 14.2000000 |
rbc | 635 | 4.4350079 | 0.3941710 | 3.1200000 | 5.9500000 |
hemog | 635 | 13.4696063 | 1.9019097 | 3.0000000 | 14.2000000 |
hcrit | 635 | 39.4653543 | 3.1623819 | 29.7000000 | 51.4000000 |
mcv | 635 | 89.1184252 | 4.5190963 | 65.0000000 | 106.0000000 |
mch | 634 | 30.4537855 | 1.7232248 | 22.0000000 | 37.0000000 |
mchc | 634 | 34.1524290 | 0.7562054 | 31.6000000 | 36.7000000 |

Launch and run the SAS program, and review the output to convince yourself that the `subj` and `hosp` variables have been excluded from the analysis.

The other thing you might notice about the output is that there are many more decimal places displayed than are necessary. By default, SAS uses the `best.` format to display values in reports created by the MEANS procedure. In a technical sense, it means that SAS chooses the format that provides the most information about the summary statistics while maintaining a default field width of 12. In a practical sense, it means that often too many decimal places are displayed.

## Example 11.4

The following program uses the MEANS procedure's MAXDEC= option to set the maximum number of decimal places displayed to 2, and the FW= option to set the field width used to print each statistic to 10:

```
PROC MEANS data = icdb.hem2 MAXDEC = 2 FW = 10;
var wbc rbc hemog hcrit mcv mch mchc;
RUN;
```

##### The MEANS Procedure

Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
wbc | 635 | 7.13 | 1.90 | 3.00 | 14.20 |
rbc | 635 | 4.44 | 0.39 | 3.12 | 5.95 |
hemog | 635 | 13.47 | 1.11 | 9.90 | 17.70 |
hcrit | 635 | 39.47 | 3.16 | 29.70 | 51.40 |
mcv | 635 | 89.12 | 4.52 | 65.00 | 106.00 |
mch | 634 | 30.45 | 1.72 | 22.00 | 37.00 |
mchc | 634 | 34.15 | 0.76 | 31.60 | 36.70 |

Launch and run the SAS program, and review the output to convince yourself that the maximum number of decimal places and field widths have been modified as claimed. Let's check out the SUMMARY procedure now.

## Example 11.5

The following program is identical to the program in the previous example *except* for two things:

- The MEANS keyword has been replaced with the SUMMARY keyword
- The PRINT option has been added to the PROC statement:

```
PROC SUMMARY data = icdb.hem2 MAXDEC = 2 FW = 10 PRINT;
var wbc rbc hemog hcrit mcv mch mchc;
RUN;
```

##### The SUMMARY Procedure

Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
wbc | 635 | 7.13 | 1.90 | 3.00 | 14.20 |
rbc | 635 | 4.44 | 0.39 | 3.12 | 5.95 |
hemog | 635 | 13.47 | 1.11 | 9.90 | 17.70 |
hcrit | 635 | 39.47 | 3.16 | 29.70 | 51.40 |
mcv | 635 | 89.12 | 4.52 | 65.00 | 106.00 |
mch | 634 | 30.45 | 1.72 | 22.00 | 37.00 |
mchc | 634 | 34.15 | 0.76 | 31.60 | 36.70 |

The MEANS and SUMMARY procedures perform the same functions *except* for the default setting of the PRINT option. By default, the MEANS procedure produces printed output, while the SUMMARY procedure does not. With the MEANS procedure, you have to use the NOPRINT option to suppress printing, while with the SUMMARY procedure, you have to use the PRINT option to get a printed report.

Launch and run the SAS program, and review the output to convince yourself that there is no difference between the two reports created by the MEANS and SUMMARY procedures.

Oops, wait a second here ... if you're not careful, there actually is a difference. The VAR statement in the above program tells SAS which of the (numeric) variables to summarize. If you do not include a VAR statement in the SUMMARY procedure, SAS merely gives a simple count of the number of observations in the data set. To convince yourself of this, delete the VAR statement, and re-run the SAS program. You should see output that looks something like this:

##### The SUMMARY Procedure

N Obs |
---|
635 |
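For reference, a minimal sketch of the modified program follows. It is simply the Example 11.5 program with the VAR statement deleted, and it assumes that the `icdb` libref from Example 11.1 is still assigned in your SAS session:

```sas
/* Example 11.5 with the VAR statement removed.                  */
/* With no VAR statement, PROC SUMMARY reports only the number   */
/* of observations (N Obs) rather than per-variable statistics.  */
PROC SUMMARY data = icdb.hem2 MAXDEC = 2 FW = 10 PRINT;
RUN;
```

If you swap the SUMMARY keyword back to MEANS in this sketch, you should again get the default statistics for every numeric variable, as in Example 11.2.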

# 11.2 - Specifying Statistics

The default statistics that the MEANS procedure produces (N, mean, standard deviation, minimum, and maximum) might not be the ones that you actually need. You might prefer to limit your output to just the mean and standard deviation of the values. Or you might want to compute a completely different statistic, such as the median or range of values.

In order to tell SAS to calculate summary statistics other than those calculated by default, simply place the desired statistics keywords as options in the PROC MEANS statement.

## Example 11.6

The following program tells SAS to calculate and display the sum, range and median of the red blood cell counts appearing in the *icdb.hem2* data set:

```
PROC MEANS data=icdb.hem2 fw=10 maxdec=2 sum range median;
var rbc;
RUN;
```

##### The MEANS Procedure

###### Analysis variable: rbc

Sum | Range | Median |
---|---|---|
2816.23 | 2.83 | 4.41 |

Launch and run the SAS program, and review the output to convince yourself that the report is generated as described. You might want to note, in particular, that when you specify a statistic in the PROC MEANS statement, the default statistics are not produced. Incidentally, you can generate the exact same report using the SUMMARY procedure, provided you again add the PRINT option to the PROC statement.
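As a sketch of that last remark, the SUMMARY version of the program might look like the following. It assumes the `icdb` libref from Example 11.1 is still assigned; the only changes from Example 11.6 are the procedure name and the added PRINT option:

```sas
/* Same request as Example 11.6, using PROC SUMMARY.           */
/* PRINT is needed because SUMMARY prints nothing by default.  */
PROC SUMMARY data=icdb.hem2 fw=10 maxdec=2 sum range median PRINT;
   var rbc;
RUN;
```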

The following keywords can be used with the MEANS and SUMMARY procedures to compute statistics:

##### Descriptive Statistics

Keyword | Description |
---|---|
CLM | Two-sided confidence limit for the mean |
CSS | Corrected sum of squares |
CV | Coefficient of variation |
KURT | Kurtosis |
LCLM | One-sided confidence limit below the mean |
MAX | Maximum value |
MEAN | Average value |
MIN | Minimum value |
N | No. of observations with non-missing values |
NMISS | No. of observations with missing values |
RANGE | Range |
SKEW | Skewness |
STD | Standard deviation |
STDERR | Standard error of the mean |
SUM | Sum |
SUMWGT | Sum of the Weight variable values |
UCLM | One-sided confidence limit above the mean |
USS | Uncorrected sum of squares |
VAR | Variance |

##### Quantile Statistics

Keyword | Description |
---|---|
MEDIAN or P50 | Median or 50th percentile |
P1 | 1st percentile |
P5 | 5th percentile |
P10 | 10th percentile |
Q1 or P25 | Lower quartile or 25th percentile |
Q3 or P75 | Upper quartile or 75th percentile |
P90 | 90th percentile |
P95 | 95th percentile |
P99 | 99th percentile |
QRANGE | Difference between upper and lower quartiles: Q3-Q1 |

##### Hypothesis Testing

Keyword | Description |
---|---|
PROBT | Probability of a greater absolute value for the t value |
T | Student's t for testing that the population mean is 0 |
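To illustrate how these keywords can be combined, here is a hedged sketch of two requests against the hematology data. The choice of variables and keywords is arbitrary and is ours, not part of the original examples; the sketch assumes the `icdb` libref from Example 11.1 is still assigned:

```sas
/* Descriptive keywords: counts of nonmissing and missing values, */
/* the mean, its standard error, and two-sided confidence limits  */
/* for the mean.                                                  */
PROC MEANS data=icdb.hem2 fw=10 maxdec=2 n nmiss mean stderr clm;
   var mch mchc;
RUN;

/* Quantile keywords: quartiles, median, and interquartile range */
PROC MEANS data=icdb.hem2 fw=10 maxdec=2 q1 median q3 qrange;
   var wbc;
RUN;
```

Since the earlier reports show N = 634 for `mch` and `mchc` in a data set of 635 observations, the first report should show one missing value for each of those two variables.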

# 11.3 - Group Processing

All of the examples we've looked at so far have involved summarizing all of the observations in the data set. In many cases, we'll instead want to tell SAS to calculate summary statistics for certain subgroups. For example, it makes more sense to calculate the average height for males and females separately rather than calculating an average height of all individuals together. In this section, we'll investigate two ways of producing summary statistics for subgroups. One approach involves using a CLASS statement, and the other involves using a BY statement. As you'll soon see, the approach you choose to use will depend most on how you'd like your final report to look.

## Example 11.7

The following program uses the VAR and CLASS statements to tell SAS to calculate the default summary statistics of the `rbc`, `wbc`, and `hcrit` variables separately for each of the nine `hosp` values:

```
PROC MEANS data=icdb.hem2 fw=10 maxdec=2;
var rbc wbc hcrit;
class hosp;
RUN;
```

##### The MEANS Procedure

| hosp | N Obs | Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|---|---|
| 11 | 106 | rbc | 106 | 4.41 | 0.42 | 3.47 | 5.55 |
| | | wbc | 106 | 7.11 | 1.92 | 3.30 | 13.10 |
| | | hcrit | 106 | 39.78 | 3.30 | 32.80 | 48.90 |
| 21 | 108 | rbc | 108 | 4.43 | 0.39 | 3.33 | 5.35 |
| | | wbc | 108 | 7.37 | 1.94 | 3.10 | 12.60 |
| | | hcrit | 108 | 39.28 | 2.90 | 33.00 | 48.00 |
| 22 | 42 | rbc | 42 | 4.40 | 0.43 | 3.12 | 5.12 |
| | | wbc | 42 | 7.37 | 2.15 | 3.90 | 14.20 |
| | | hcrit | 42 | 38.80 | 3.09 | 31.00 | 45.30 |
| 23 | 6 | rbc | 6 | 4.28 | 0.45 | 3.64 | 4.94 |
| | | wbc | 6 | 5.17 | 1.37 | 3.30 | 6.60 |
| | | hcrit | 6 | 39.37 | 3.31 | 35.90 | 44.80 |
| 31 | 52 | rbc | 52 | 4.42 | 0.41 | 3.55 | 5.62 |
| | | wbc | 52 | 7.50 | 1.87 | 4.00 | 13.20 |
| | | hcrit | 52 | 39.28 | 3.34 | 33.90 | 47.80 |
| 41 | 92 | rbc | 92 | 4.50 | 0.44 | 3.54 | 5.49 |
| | | wbc | 92 | 7.11 | 1.92 | 3.30 | 13.10 |
| | | hcrit | 92 | 40.19 | 3.51 | 32.50 | 49.90 |
| 42 | 95 | rbc | 95 | 4.40 | 0.33 | 3.63 | 5.95 |
| | | wbc | 95 | 7.01 | 1.79 | 3.30 | 12.00 |
| | | hcrit | 95 | 39.14 | 2.66 | 33.90 | 51.40 |
| 51 | 65 | rbc | 65 | 4.50 | 0.42 | 3.80 | 5.70 |
| | | wbc | 65 | 7.01 | 1.79 | 3.30 | 12.00 |
| | | hcrit | 65 | 39.14 | 2.66 | 33.90 | 51.40 |
| 52 | 69 | rbc | 69 | 4.43 | 0.32 | 3.75 | 5.40 |
| | | wbc | 69 | 6.74 | 1.66 | 3.90 | 10.20 |
| | | hcrit | 69 | 38.81 | 2.89 | 29.70 | 45.00 |

First, you should note that the variables appearing in the CLASS statement need not be character variables. Here, we use the numeric variable `hosp` to break up the 635 observations in the *icdb.hem2* data set into nine subgroups. When CLASS variables are numeric, they should of course contain a limited number of discrete values that represent meaningful subgroups. Otherwise, you will be certain to generate an awful lot of useless output.

Now, launch and run the SAS program, and review the output to convince yourself that the report is generated as described. As you can see, the MEANS procedure does not generate statistics for the CLASS variables. Their values are instead used only to categorize the data.

Let's see what happens when our CLASS statement contains more than one variable.

## Example 11.8

The following program reads some data on national parks into a temporary SAS data set called `parks`, and then uses the MEANS procedure's VAR and CLASS statements to tell SAS to sum the number of museums and camping facilities for each combination of the `Type` and `Region` variables:

```
DATA parks;
/* ParkName is read from columns 1-21, so the data lines must be padded */
input ParkName $ 1-21 Type $ Region $ Museums Camping;
DATALINES;
Dinosaur              NM West 2 6
Ellis Island          NM East 1 0
Everglades            NP East 5 2
Grand Canyon          NP West 5 3
Great Smoky Mountains NP East 3 10
Hawaii Volcanoes      NP West 2 2
Lava Beds             NM West 1 1
Statue of Liberty     NM East 1 0
Theodore Roosevelt    NP West 2 2
Yellowstone           NP West 9 11
Yosemite              NP West 2 13
;
RUN;
PROC MEANS data = parks fw = 10 maxdec = 0 sum;
var museums camping;
class type region;
RUN;
```

##### The MEANS Procedure

| Type | Region | N Obs | Variable | Sum |
|---|---|---|---|---|
| NM | East | 2 | Museums | 2 |
| | | | Camping | 0 |
| | West | 2 | Museums | 3 |
| | | | Camping | 7 |
| NP | East | 2 | Museums | 8 |
| | | | Camping | 12 |
| | West | 5 | Museums | 20 |
| | | | Camping | 31 |

Now, launch and run the SAS program, and review the output. You should see that, for example, SAS determined that the number of museums in National Monuments in the East is 2. The number of museums in National Monuments in the West is 3. And so on.

It is probably more important here to note how SAS processed the CLASS statement. As you can see, the `Type` variable appears first and the `Region` variable appears second in the CLASS statement. For that reason, the `Type` variable appears first and the `Region` variable appears second in the output. In general, the order of the variables in the CLASS statement determines their order in the output table. To convince yourself of this, you might want to change the order of the variables as they appear in the CLASS statement, and re-run the SAS program to see what you get.
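The reordering suggested above can be sketched as follows. This assumes the temporary `parks` data set from the Example 11.8 program already exists in your session; the only change is the order of the variables in the CLASS statement:

```sas
/* Same request as Example 11.8, but with Region listed before  */
/* Type in the CLASS statement, so Region should become the     */
/* leftmost (outermost) classification column in the report.    */
PROC MEANS data = parks fw = 10 maxdec = 0 sum;
   var museums camping;
   class region type;
RUN;
```

The sums themselves should not change; only the arrangement of the rows and the order of the two classification columns should differ.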

## Example 11.9

Like the CLASS statement, the BY statement specifies variables to use for categorizing observations. The following program uses the MEANS procedure's BY statement to categorize the observations in the `parks` data set into four subgroups, as determined by the `Type` and `Region` variables, before calculating the sum, minimum and maximum of the `museums` and `camping` values for each of the subgroups:

```
PROC SORT data = parks out = srtdparks;
by type region;
RUN;
PROC MEANS data = srtdparks fw = 10 maxdec = 0 sum min max;
var museums camping;
by type region;
RUN;
```

##### ---Type=NM Region=East---

###### The MEANS Procedure

Variable | Sum | Minimum | Maximum |
---|---|---|---|
Museums | 2 | 1 | 1 |
Camping | 0 | 0 | 0 |

##### ---Type=NM Region=West---

Variable | Sum | Minimum | Maximum |
---|---|---|---|
Museums | 3 | 1 | 2 |
Camping | 7 | 1 | 6 |

##### ---Type=NP Region=East---

Variable | Sum | Minimum | Maximum |
---|---|---|---|
Museums | 8 | 3 | 5 |
Camping | 12 | 2 | 10 |

##### ---Type=NP Region=West---

Variable | Sum | Minimum | Maximum |
---|---|---|---|
Museums | 20 | 2 | 9 |
Camping | 31 | 2 | 13 |

You might want to just go ahead and launch and run the SAS program to see what the report looks like when you use a BY statement instead of a CLASS statement to form the subgroups. You might recall that when you use a CLASS statement, SAS generates a single large table containing all of the summary statistics. As you can see in the output here, when you instead use a BY statement, SAS generates a separate table for each combination of the `Type` and `Region` variables. To be more specific, SAS creates four tables here: one for `Type` = NM and `Region` = East, one for `Type` = NM and `Region` = West, one for `Type` = NP and `Region` = East, and one for `Type` = NP and `Region` = West.

Of course, there is one thing we've not addressed so far in this code: the SORT procedure. Unlike CLASS processing, BY-group processing requires that your data be sorted in the order of the variables that appear in the BY statement. If the observations in your data set are not sorted in order by the variables appearing in the BY statement, then you have to use the SORT procedure to sort your data set before using it in the MEANS procedure. Don't forget that if you don't specify an output data set using the OUT= option, then the SORT procedure overwrites your initial data set with the newly sorted observations. Here, our SORT procedure tells SAS to sort the `parks` data set by the `Type` and `Region` variables, and to store the sorted data set in a new data set called `srtdparks`.
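To see the overwriting behavior for yourself without disturbing `parks`, you could try something like the following sketch. The data set name `parkscopy` is made up for this illustration, and the sketch assumes the `parks` data set from Example 11.8 exists:

```sas
/* Make a scratch copy so that the original parks data set is untouched */
DATA parkscopy;
   set parks;
RUN;

/* No OUT= option here, so the sorted observations replace parkscopy itself */
PROC SORT data = parkscopy;
   by type region;
RUN;

/* The copy should now be in Type-Region order */
PROC PRINT data = parkscopy;
RUN;
```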

In closing off our discussion of group processing, we should probably discuss which approach — the CLASS statement or the BY statement — is more appropriate. My personal opinion is that it's all a matter of preference. If you prefer to see your summary statistics in one large table, then you should use the CLASS statement. If you instead prefer to see your summary statistics in a bunch of smaller tables, then you should use the BY statement. My personal opinion doesn't take into account the efficiency of your program, however. The advantage of the CLASS statement is that it is easier to use since you need not sort the data first. The advantage of the BY statement is that it can be more efficient when you are categorizing data by many variables.

# 11.4 - Creating Summarized Data Sets

There are many situations, when performing statistical analyses on continuous data, in which you want to create a data set whose observations contain summary statistics rather than observations containing the original raw data. For example, you might want to create a graph that compares the average weight loss of subjects at, say, ten different weight loss clinics. One way of creating such a graph is to first create a data set that contains ten observations (one for each of the clinics) and an average weight loss variable. The MEANS procedure's OUTPUT statement, in conjunction with the NOPRINT option, provides the mechanism to create such a data set rather than the standard printed output.

The NOPRINT option tells SAS to suppress all printed output. The OUTPUT statement, which tells SAS to create the output data set, in general, takes the form:

`OUTPUT OUT=`*dsn* **keyword1**=*name1* **keyword2**=*name2* ...;

where *dsn* is the name of the data set you want to create, and **keyword1** is the first statistic you want dumped to the output data set and *name1* is the name you want to call the variable in the data set representing that first statistic. Similarly, **keyword2** is the second statistic you want dumped to the output data set and *name2* is the name you want to call the variable in the data set representing that second statistic. And so on. When you use the OUTPUT statement without specifying any keywords, the default summary statistics N, MEAN, STD, MIN, and MAX are produced for all of the numeric variables or for all of the variables that are listed in the VAR statement.
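As a hedged illustration of that last point, the sketch below uses an OUTPUT statement with no statistic keywords. The data set name `defaultstats` is made up for this illustration, and the sketch assumes the `icdb` libref from Example 11.1 is still assigned:

```sas
/* OUTPUT statement with OUT= only: no statistic keywords requested */
PROC MEANS data=icdb.hem2 NOPRINT;
   var rbc wbc hcrit;
   output out = defaultstats;
RUN;

/* Print the result to see how the default statistics are stored */
PROC PRINT data = defaultstats;
RUN;
```

When you print `defaultstats`, you should find one observation per default statistic, with a variable called `_STAT_` identifying which statistic (N, MIN, MAX, MEAN, or STD) each observation holds, and with the analysis variables keeping their original names.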

## Example 11.10

The following program uses the MEANS procedure's OUTPUT statement (and NOPRINT option) to create a temporary data set called `hospsummary` that has one observation for each hospital that contains summary statistics for the hospital:

```
PROC MEANS data=icdb.hem2 NOPRINT;
var rbc wbc hcrit;
class hosp;
output out = hospsummary
mean = MeanRBC MeanWBC MeanHCRIT
median = MedianRBC MedianWBC MedianHCRIT;
RUN;
PROC PRINT;
title 'Hospital Statistics';
RUN;
```

##### Hospital Statistics

Obs | hosp | `_TYPE_` | `_FREQ_` | MeanRBC | MeanWBC | MeanHCRIT | MedianRBC | MedianWBC | MedianHCRIT |
---|---|---|---|---|---|---|---|---|---|
1 | . | 0 | 635 | 4.43501 | 7.12769 | 39.4654 | 4.410 | 7.0 | 39.30 |
2 | 11 | 1 | 106 | 4.41321 | 7.10660 | 39.7821 | 4.405 | 7.2 | 39.35 |
3 | 21 | 1 | 108 | 4.42991 | 7.36944 | 39.2769 | 4.440 | 7.6 | 39.00 |
4 | 22 | 1 | 42 | 4.39571 | 7.37071 | 38.8024 | 4.445 | 7.1 | 39.40 |
5 | 23 | 1 | 6 | 4.27667 | 5.16667 | 39.3667 | 4.235 | 5.4 | 39.10 |
6 | 31 | 1 | 52 | 4.42135 | 7.50212 | 39.2846 | 4.375 | 7.5 | 39.10 |
7 | 41 | 1 | 92 | 4.50207 | 7.00435 | 40.1859 | 4.455 | 6.8 | 40.20 |
8 | 42 | 1 | 95 | 4.39726 | 7.00632 | 39.1358 | 4.350 | 6.9 | 39.30 |
9 | 51 | 1 | 65 | 4.50000 | 7.24615 | 39.9969 | 4.500 | 7.1 | 40.00 |
10 | 52 | 1 | 69 | 4.42580 | 6.74203 | 38.8145 | 4.410 | 6.7 | 38.50 |

Let's first review the code. The VAR statement tells SAS the three variables — `rbc`, `wbc`, and `hcrit` — that we want summarized. The CLASS statement tells SAS that we want to categorize the observations by the value of the `hosp` variable. The OUT= portion of the OUTPUT statement tells SAS that we want to create a temporary data set called `hospsummary`. The MEAN= portion of the OUTPUT statement tells SAS to calculate the average of the `rbc`, `wbc`, and `hcrit` values and store the results, respectively, in three new variables called `MeanRBC`, `MeanWBC`, and `MeanHCRIT`. The MEDIAN= portion of the OUTPUT statement tells SAS to calculate the median of the `rbc`, `wbc`, and `hcrit` values and store the results, respectively, in three new variables called `MedianRBC`, `MedianWBC`, and `MedianHCRIT`. Note that, for each keyword, the variables must be listed in the same order as they appear in the VAR statement.

The NOPRINT option of the PROC MEANS statement tells SAS to suppress printing of the summary statistics. We must then use the PRINT procedure to tell SAS to print the contents of the `hospsummary` data set. Because the PROC PRINT statement contains no DATA= option, SAS prints the most recently created data set. Since the MEANS procedure's OUTPUT statement has just created `hospsummary`, that is the data set that is printed.

Now, launch and run the SAS program and review the output to make sure you understand the summarized data set we created. As we'd expect, the data set contains the `hosp` variable and the six requested variables, `MeanRBC`, `MeanWBC`, ..., `MedianHCRIT`, that contain the summary statistics. As you can see, the data set also contains two additional variables, `_TYPE_` and `_FREQ_`.

Whenever you use a CLASS statement to create an output data set containing statistics on subgroups, SAS automatically creates these two additional variables. Not surprisingly, the `_FREQ_` variable indicates the number of observations contributing to each of the statistics calculated. The `_TYPE_` variable indicates what kind of a summary statistic each of the observations in `hospsummary` contains. You can see that, here, `_TYPE_` takes on two possible values 0 and 1. When `_TYPE_` = 1, it means that the summary statistic is at the subgroup (`hosp`) level. That's why you'll see that `_TYPE_` = 1 for nine of the observations in `hospsummary` — one for each hospital. All we really wanted here were these nine observations, but SAS had to complicate matters by giving us this "bonus" observation in which `_TYPE_` = 0. When `_TYPE_` = 0, it means that the summary statistics are overall summary statistics. That's why for the one observation in which `_TYPE_` = 0, you'll see that `_FREQ_` = 635. That tells us that all of the observations in `icdb.hem2` went into calculating the means and medians for that observation in `hospsummary`. It should also make sense then that `hosp` = . for that observation. Ugh, this is sounding messy!
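If all you want are the nine hospital-level observations, one possible way to tidy things up is the NWAY option of the PROC MEANS statement, which keeps only the observations with the highest `_TYPE_` value. The sketch below is offered under that assumption; the data set name `hosponly` is made up for this illustration, and the `icdb` libref from Example 11.1 is assumed to be assigned:

```sas
/* NWAY keeps only the highest _TYPE_ level (here, _TYPE_ = 1), */
/* so the overall "bonus" observation should be left out.       */
PROC MEANS data=icdb.hem2 NOPRINT NWAY;
   var rbc wbc hcrit;
   class hosp;
   output out = hosponly
          mean = MeanRBC MeanWBC MeanHCRIT
          median = MedianRBC MedianWBC MedianHCRIT;
RUN;

PROC PRINT data = hosponly;
   title 'Hospital Statistics';
RUN;
```

An alternative with the same intent would be to filter the output data set with a WHERE= data set option, as in `output out = hosponly (where = (_TYPE_ = 1)) ...`.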

## Example 11.11

You can also create a summarized data set using the SUMMARY procedure. The following program is identical to the program in the previous example *except* for two things:

- The MEANS keyword has been replaced with the SUMMARY keyword
- The NOPRINT option has been removed from the PROC statement:

```
PROC SUMMARY data=icdb.hem2;
var rbc wbc hcrit;
class hosp;
output out = hospsummary
mean = MeanRBC MeanWBC MeanHCRIT
median = MedianRBC MedianWBC MedianHCRIT;
RUN;
PROC PRINT;
title 'Hospital Statistics';
RUN;
```

##### Hospital Statistics

Obs | hosp | `_TYPE_` | `_FREQ_` | MeanRBC | MeanWBC | MeanHCRIT | MedianRBC | MedianWBC | MedianHCRIT |
---|---|---|---|---|---|---|---|---|---|
1 | . | 0 | 635 | 4.43501 | 7.12769 | 39.4654 | 4.410 | 7.0 | 39.30 |
2 | 11 | 1 | 106 | 4.41321 | 7.10660 | 39.7821 | 4.405 | 7.2 | 39.35 |
3 | 21 | 1 | 108 | 4.42991 | 7.36944 | 39.2769 | 4.440 | 7.6 | 39.00 |
4 | 22 | 1 | 42 | 4.39571 | 7.37071 | 38.8024 | 4.445 | 7.1 | 39.40 |
5 | 23 | 1 | 6 | 4.27667 | 5.16667 | 39.3667 | 4.235 | 5.4 | 39.10 |
6 | 31 | 1 | 52 | 4.42135 | 7.50212 | 39.2846 | 4.375 | 7.5 | 39.10 |
7 | 41 | 1 | 92 | 4.50207 | 7.00435 | 40.1859 | 4.455 | 6.8 | 40.20 |
8 | 42 | 1 | 95 | 4.39726 | 7.00632 | 39.1358 | 4.350 | 6.9 | 39.30 |
9 | 51 | 1 | 65 | 4.50000 | 7.24615 | 39.9969 | 4.500 | 7.1 | 40.00 |
10 | 52 | 1 | 69 | 4.42580 | 6.74203 | 38.8145 | 4.410 | 6.7 | 38.50 |

There's nothing really new here. This example should just reinforce the fundamental difference between the SUMMARY and MEANS procedures. The SUMMARY procedure by default does not print output. That's why it is not necessary to use a NOPRINT option to tell SAS to suppress printing of output. This example should also reinforce the fundamental similarity between the SUMMARY and MEANS procedures, namely that the two procedures use identical syntax and produce identical output. Launch and run the SAS program, and review the output to convince yourself that there is no difference between the two data sets created by the MEANS and SUMMARY procedures.

## Example 11.12

You can also create a summarized data set similar to the hospsummary data set created in the previous two examples by using a BY statement instead of a CLASS statement. The following program does just that:

```
PROC SORT data = icdb.hem2 out = srtdhem2;
by hosp;
RUN;
PROC MEANS data=srtdhem2 NOPRINT;
var rbc wbc hcrit;
by hosp;
output out = hospsummary
mean = MeanRBC MeanWBC MeanHCRIT
median = MedianRBC MedianWBC MedianHCRIT;
RUN;
PROC PRINT;
title 'Hospital Statistics';
RUN;
```

##### Hospital Statistics

Obs | hosp | _TYPE_ | _FREQ_ | MeanRBC | MeanWBC | MeanHCRIT | MedianRBC | MedianWBC | MedianHCRIT |
---|---|---|---|---|---|---|---|---|---|
1 | 11 | 0 | 106 | 4.41321 | 7.10660 | 39.7821 | 4.405 | 7.2 | 39.35 |
2 | 21 | 0 | 108 | 4.42991 | 7.36944 | 39.2769 | 4.440 | 7.6 | 39.00 |
3 | 22 | 0 | 42 | 4.39571 | 7.37071 | 38.8024 | 4.445 | 7.1 | 39.40 |
4 | 23 | 0 | 6 | 4.27667 | 5.16667 | 39.3667 | 4.235 | 5.4 | 39.10 |
5 | 31 | 0 | 52 | 4.42135 | 7.50212 | 39.2846 | 4.375 | 7.5 | 39.10 |
6 | 41 | 0 | 92 | 4.50207 | 7.00435 | 40.1859 | 4.455 | 6.8 | 40.20 |
7 | 42 | 0 | 95 | 4.39726 | 7.00632 | 39.1358 | 4.350 | 6.9 | 39.30 |
8 | 51 | 0 | 65 | 4.50000 | 7.24615 | 39.9969 | 4.500 | 7.1 | 40.00 |
9 | 52 | 0 | 69 | 4.42580 | 6.74203 | 38.8145 | 4.410 | 6.7 | 38.50 |

As you can see, the only difference between this program and that in Example 11.10 is that the CLASS statement was replaced by a BY statement. Because of that, of course, we had to add a SORT procedure to sort the data in *icdb.hem2* by `hosp`. Launch and run the SAS program, and review the output to convince yourself that there is not much of a difference between the resulting `hospsummary` data set here and that in Examples 11.10 and 11.11.

Well, okay, here `_TYPE_` = 0 for every observation, even though each observation contains summary statistics at the subgroup level. That's because `_TYPE_` only keeps track of CLASS variables, and this program doesn't have any. If the meaning of `_TYPE_` now seems a bit confusing, don't worry about it much! There is always SAS Help and Documentation available if you're dying to learn more about it. The more important thing to note here is that the MEANS procedure summarizes each BY group as an independent subset of the input data, and therefore SAS does not produce any sort of overall summarization as it does when using the CLASS statement.
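If you prefer the CLASS statement (no sorting required) but want only the subgroup-level rows that the BY statement gives you, one option is to filter on `_TYPE_` when printing. Here is a sketch, again assuming the `icdb` libref; `classsummary` is a hypothetical data set name:

```sas
PROC MEANS data=icdb.hem2 NOPRINT;
   var rbc wbc hcrit;
   class hosp;
   output out = classsummary        /* hypothetical name */
          mean = MeanRBC MeanWBC MeanHCRIT
          median = MedianRBC MedianWBC MedianHCRIT;
RUN;

/* _TYPE_ = 0 is the overall row; _TYPE_ = 1 rows are the hosp subgroups */
PROC PRINT data=classsummary;
   where _TYPE_ = 1;
   title 'Hospital Statistics (subgroup rows only)';
RUN;
```

The printed rows should then match the BY-group results above, apart from the value of `_TYPE_`.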

# 11.5 - Interaction Plots

A great example of a situation in which you need to create a summarized data set is when you want to create an interaction plot. We'll take a look at such an example in this section. If you haven't taken a course on analysis of variance yet, such as Stat 502, and therefore don't yet know what an interaction plot is, don't fret. You'll get the basic idea here.

## Example 11.13

The following program uses data from the ICDB Background data set to illustrate how to create a simple plot to depict whether an interaction exists between two class variables, `sex` and `race`, when the analysis variable of interest is education level (`ed_level`):

```
PROC SORT data=icdb.back out=back;
by sex race;
RUN;
PROC MEANS data=back noprint;
by sex race;
var ed_level;
output out=meaned mean=mn_edlev;
RUN;
PROC PRINT;
title 'Mean Education Level for Sex and Race combinations';
RUN;
PROC PLOT data=meaned;
title 'Interaction Plot of SEX, RACE, and Mean Education Level';
plot mn_edlev*race=sex;
RUN;
```

Let's review the code. The SORT procedure merely prepares the Background data set for BY-group processing. The MEANS procedure calculates the mean education level ("var ed_level") for each `sex` and `race` combination ("by sex race"). The OUTPUT statement tells SAS to dump the results into a new data set called `meaned`. The PRINT procedure of course tells SAS to print the `meaned` data set, which as you'll see when you run the code, looks like this:

##### Mean Education Level for Sex and Race combinations

Obs | sex | race | _TYPE_ | _FREQ_ | mn_edlev |
---|---|---|---|---|---|
1 | 1 | 2 | 0 | 3 | 4.66667 |
2 | 1 | 3 | 0 | 1 | 5.00000 |
3 | 1 | 4 | 0 | 51 | 3.47059 |
4 | 1 | 7 | 0 | 1 | 3.00000 |
5 | 2 | 1 | 0 | 2 | 4.00000 |
6 | 2 | 2 | 0 | 4 | 3.75000 |
7 | 2 | 3 | 0 | 28 | 3.42857 |
8 | 2 | 4 | 0 | 542 | 3.70849 |
9 | 2 | 5 | 0 | 3 | 3.33333 |
10 | 2 | 6 | 0 | 2 | 2.50000 |
11 | 2 | 8 | 0 | 1 | 3.00000 |

As we'd expect, the data set contains one observation for each `sex` and `race` combination that appears in the data. The primary variable is `mn_edlev`, the average education level of the subjects of that `sex` and `race` combination. Once the `meaned` data set is created, all we need to do is use the means in the data set to create an interaction plot. The PLOT procedure tells SAS to plot the mean education level (`mn_edlev`) on the `y`-axis and race (`race`) on the `x`-axis. The "=sex" part of the PLOT statement tells SAS to mark each (`race`, `mn_edlev`) point with the value of the variable `sex`.

Before you run this program, you'll need to right-click on the link for the background data set to download and save it to your computer. You should store it in the same location in which you saved the permanent *icdb.hem2* data set. Launch and run the SAS program, and review the resulting plot. You should see the interaction plot as advertised.
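The PLOT procedure produces a text-based plot. If you'd rather have a graphical version, the SGPLOT procedure can draw roughly the same picture from the same `meaned` data set. This is an alternative to the PLOT step above, not part of the original example, and it assumes a SAS release that includes SGPLOT (9.2 or later):

```sas
/* Assumes the meaned data set created by the PROC MEANS step above */
PROC SGPLOT data=meaned;
   title 'Interaction Plot of SEX, RACE, and Mean Education Level';
   /* one line per value of sex, connecting the means across race */
   series x=race y=mn_edlev / group=sex markers;
RUN;
```

Lines that are far from parallel hint at an interaction, although here several of the means are based on only one to three subjects, so the picture should be read with caution.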

# 11.6 - The UNIVARIATE procedure

In this section, we take a brief look at the UNIVARIATE procedure just so we can see how its output differs from that of the MEANS and SUMMARY procedures.

## Example 11.14

The following program illustrates (almost) the simplest version of the UNIVARIATE procedure. It tells SAS to perform a univariate analysis on the red blood cell count (`rbc`) variable in the *icdb.hem2* data set:

```
PROC UNIVARIATE data = icdb.hem2;
title 'Univariate Analysis of RBC';
var rbc;
RUN;
```

The simplest version of the UNIVARIATE procedure would be one in which no VAR statement is present. Then, SAS would perform a univariate analysis for each numeric variable in the data set. The DATA= option merely tells SAS on which data set you want to do a univariate analysis. As always, if the DATA= option is absent, SAS performs the analysis on the most recently created data set. The VAR statement tells SAS to perform a univariate analysis on the variable `rbc`.
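For comparison, the simplest version described above would look like the following sketch, which assumes the same `icdb` libref. Be prepared for a lot of output, since every numeric variable in the data set gets its own analysis:

```sas
/* No VAR statement: every numeric variable in icdb.hem2 is analyzed */
PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of All Numeric Variables';
RUN;
```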

Launch and run the program and review the output to familiarize yourself with the kinds of summary statistics the UNIVARIATE procedure calculates. You should see five major sections in the output with the following headings: **Moments**, **Basic Statistical Measures**, **Tests for Location Mu0 = 0**, **Quantiles**, and **Extreme Observations**. Here's what the first section, **Moments**, looks like:

##### Univariate Analysis of RBC

##### The UNIVARIATE Procedure

##### Variable: rbc

##### Moments

Moment | Value | Moment | Value |
---|---|---|---|
N | 635 | Sum Weights | 635 |
Mean | 4.43500787 | Sum Observations | 2816.23 |
Std Deviation | 0.394171 | Variance | 0.15537078 |
Skewness | 0.29025297 | Kurtosis | 0.51198988 |
Uncorrected SS | 12588.5073 | Corrected SS | 98.505075 |
Coeff Variation | 8.88771825 | Std Error Mean | 0.0156422 |

and the fourth section:

##### Quantiles (Definition 5)

Quantile | Estimate |
---|---|
100% Max | 5.95 |
99% | 5.41 |
95% | 5.12 |
90% | 4.92 |
75% Q3 | 4.69 |
50% Median | 4.41 |
25% Q1 | 4.17 |
10% | 3.97 |
5% | 3.82 |
1% | 3.55 |
0% Min | 3.12 |

and the fifth and final section:

##### ----Lowest----

Value | Obs |
---|---|
3.12 | 218 |
3.33 | 152 |
3.35 | 227 |
3.47 | 72 |
3.54 | 365 |

##### ----Highest----

Value | Obs |
---|---|
5.59 | 369 |
5.55 | 33 |
5.62 | 286 |
5.70 | 517 |
5.95 | 465 |

With an introductory statistics course in your background, the output should be mostly self-explanatory. For example, the output tells us that the average ("Mean") red blood cell count of the 635 subjects ("N") in the data set is 4.435 with a standard deviation of 0.394. The median ("50% Median") red blood cell count is 4.41. The smallest red blood cell count in the data set is 3.12 (observation #218), while the largest is 5.95 (observation #465).
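To see how this output differs from that of the MEANS procedure, you could ask MEANS for the handful of statistics quoted above. A minimal sketch, assuming the same `icdb` libref and using standard MEANS statistic keywords; the title text is made up for illustration:

```sas
/* Request only the statistics discussed above, rounded to 3 decimals */
PROC MEANS data = icdb.hem2 n mean std median min max maxdec=3;
   title 'MEANS Procedure Summary of RBC';
   var rbc;
RUN;
```

The MEANS output is a single compact table, whereas the UNIVARIATE output also reports the extreme observations and a fuller set of quantiles.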

## Example 11.15

When you specify the NORMAL option, SAS will compute four different test statistics for the null hypothesis that the values of the variable specified in the VAR statement are a random sample from a normal distribution. The four test statistics calculated and presented in the output are: **Shapiro-Wilk**, **Kolmogorov-Smirnov**, **Cramer-von Mises**, and **Anderson-Darling**.

When you specify the PLOT option, SAS will produce a histogram, a box plot, and a normal probability plot for each variable specified in the VAR statement. (Note that in SAS 9.4, you need to disable "ODS Graphics" to get the plots in the listing output. To do so, go to the menu `Tools` > `Options` > `Preferences`, and uncheck the box `Use ODS Graphics` under the tab `Results`. Click `OK`.)

If you have a BY statement specified as well, SAS will produce each of these plots for each level of the BY statement.

The following UNIVARIATE procedure illustrates the NORMAL and PLOT options on the variable `rbc` of the hematology data set:

```
PROC UNIVARIATE data = icdb.hem2 NORMAL PLOT;
title 'Univariate Analysis of RBC with NORMAL and PLOT Options';
var rbc;
RUN;
```

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from the NORMAL and PLOT options. You should see a new section called **Tests for Normality** that contains the four "test for normality" test statistics and corresponding P-values:

##### Tests for Normality

Test | Statistic | | p Value | |
---|---|---|---|---|
Shapiro-Wilk | W | 0.992948 | Pr < W | 0.0044 |
Kolmogorov-Smirnov | D | 0.033851 | Pr > D | 0.0771 |
Cramer-von Mises | W-Sq | 0.145326 | Pr > W-Sq | 0.0279 |
Anderson-Darling | A-Sq | 1.070646 | Pr > A-Sq | 0.0085 |

At the end of the output, you should see the histogram and box plot, as well as the normal probability plot for the `rbc` variable.
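As noted earlier in this example, adding a BY statement produces the tests and plots once for each BY group. Here is a sketch that does so by hospital, assuming the `hosp` variable and the sorted copy `srtdhem2` used in Example 11.12:

```sas
/* BY-group processing requires the data to be sorted by hosp */
PROC SORT data = icdb.hem2 out = srtdhem2;
   by hosp;
RUN;

/* One full set of tests and plots is produced for each hospital */
PROC UNIVARIATE data = srtdhem2 NORMAL PLOT;
   title 'Univariate Analysis of RBC by Hospital';
   var rbc;
   by hosp;
RUN;
```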

## Example 11.16

When you use the UNIVARIATE procedure's ID statement, SAS uses the values of the variable specified in the ID statement to indicate the five largest and five smallest observations rather than the (usually meaningless) observation number. The following UNIVARIATE procedure uses the subject number (`subj`) to indicate extreme values of red blood cell count (`rbc`):

```
PROC UNIVARIATE data = icdb.hem2;
title 'Univariate Analysis of RBC with ID Option';
var rbc;
id subj;
RUN;
```

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from using the ID statement. In Example 11.14, the UNIVARIATE output indicated that observation #218 has the smallest red blood cell count in the data set (3.12), while observation #465 has the largest (5.95). Now, because of the use of the subject number as an ID variable ("id subj"):

##### ------Lowest------

Value | SUBJ | Obs |
---|---|---|
3.12 | 220007 | 218 |
3.33 | 210057 | 152 |
3.35 | 220021 | 227 |
3.47 | 110134 | 72 |
3.54 | 410059 | 365 |

##### ------Highest------

Value | SUBJ | Obs |
---|---|---|
5.59 | 410063 | 369 |
5.55 | 110086 | 33 |
5.62 | 310092 | 286 |
5.70 | 510026 | 517 |
5.95 | 420074 | 465 |

SAS reports the more helpful information that subject 220007 has the smallest red blood cell count, while subject 420074 has the largest.

You shouldn't be surprised to learn that the UNIVARIATE procedure can do much more than what we can address now. Just as the BY statement can be used in the MEANS and SUMMARY procedures to categorize the observations in the input data set into subgroups, so can a BY statement be used in the UNIVARIATE procedure. And, just as an OUTPUT statement can be used in the MEANS and SUMMARY procedures to create summarized data sets, so can an OUTPUT statement be used in the UNIVARIATE procedure. For more information about the functionality and syntax of the UNIVARIATE procedure, see the SAS Help and Documentation.
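As a small taste of the OUTPUT statement in the UNIVARIATE procedure, here is a minimal sketch that saves a few statistics for `rbc` to a data set instead of printing the usual report. It assumes the same `icdb` libref; the data set name `rbcstats` and the new variable names are hypothetical:

```sas
/* NOPRINT suppresses the usual report; only the data set is created */
PROC UNIVARIATE data = icdb.hem2 NOPRINT;
   var rbc;
   output out = rbcstats            /* hypothetical name */
          n = N_RBC mean = MeanRBC median = MedianRBC std = StdRBC;
RUN;

PROC PRINT data = rbcstats;
   title 'RBC Statistics from the UNIVARIATE Procedure';
RUN;
```

Without a BY statement, the resulting data set should contain a single observation summarizing all of the subjects.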

# 11.7 - Summary

In this lesson, we learned about three procedures available in SAS (MEANS, SUMMARY, and UNIVARIATE) that calculate basic descriptive statistics.

The homework for this lesson will give you practice with these techniques.