11.1 - The MEANS and SUMMARY Procedures

In this section, we'll learn the syntax of the simplest forms of the MEANS and SUMMARY procedures, and familiarize ourselves with the output they generate.

Example 11.1

Throughout our investigation of the MEANS, SUMMARY, and UNIVARIATE procedures, we'll use the hematology data set arising from the ICDB Study. The following program tells SAS to display the contents, and print the first 15 observations, of the data set:

OPTIONS PS = 58 LS = 80 NODATE NONUMBER;

LIBNAME icdb 'C:\Simon\Stat480WC\fa08\11means\sasndata';

PROC CONTENTS data = icdb.hem2 position;
RUN;

PROC PRINT data = icdb.hem2 (OBS = 15);
RUN;
Obs subj hosp wbc rbc hemog hcrit mcv mch mchc
1 110027 11 7.5 4.38 13.8 40.9 93.3 31.5 33.7
2 11027 11 7.6 5.20 15.2 45.8 88.0 29.2 33.1
3 110039 11 7.5 4.33 13.1 39.4 91.0 30.2 33.2
4 110040 11 8.3 4.52 12.4 38.1 84.2 27.4 32.5
5 110045 11 8.9 4.72 14.6 42.7 90.4 30.9 34.1
6 110049 11 6.2 4.71 13.8 41.7 88.5 29.2 33.0
7 110051 11 6.4 4.56 13.0 37.9 83.1 28.5 34.3
8 110052 11 7.1 3.69 12.5 35.6 97.2 33.8 33.8
9 110053 11 7.4 4.47 14.4 43.6 97.2 32.2 33.0
10 110055 11 6.1 4.34 12.8 38.2 88.1 29.6 33.6
11 110057 11 9.5 4.70 13.4 40.5 86.0 28.4 33.0
12 110058 11 6.5 3.76 11.6 34.2 91.0 30.7 33.8
13 110059 11 7.5 4.29 12.3 36.8 85.7 28.6 33.4
14 110060 11 7.6 4.57 13.8 42.0 91.8 30.1 32.8
15 110062 11 4.6 4.87 13.9 42.9 88.2 28.5 32.3

First, click the link to save the hematology data set to a convenient location on your computer. Then, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Finally, run the program. You may recall that the CONTENTS procedure's POSITION option tells SAS to display the contents of the data set in the order in which the variables appear in the data set. Therefore, you should see output that looks something like this:

The CONTENTS Procedure
Variables in Creation Order
# Variable Type Len
1 subj Num 8
2 hosp Num 8
3 wbc Num 8
4 rbc Num 8
5 hemog Num 8
6 hcrit Num 8
7 mcv Num 8
8 mch Num 8
9 mchc Num 8

The first two variables, subj and hosp, tell us the subject number and at what hospital the subject's data were collected. The remaining variables, wbc, rbc, hemog, ... are the blood data variables of most interest. For example, the variables wbc and rbc contain the subject's white blood cell and red blood cell counts, respectively. The really important thing to note when reviewing the output is that all of the blood data variables are continuous numeric variables, which lend themselves perfectly to a descriptive analysis using the MEANS procedure.

Example 11.2

The MEANS procedure can include many statements and options for specifying the desired statistics. For the sake of simplicity, we'll start out with the most basic form of the MEANS procedure. The following program simply tells SAS to display basic summary statistics for each numeric variable in the icdb.hem2 data set:

PROC MEANS data = icdb.hem2;
RUN;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
subj 635 327199.50 144410.20 10027.00 520098.00
hosp 635 32.7133858 14.4426330 11.0000000 52.0000000
wbc 635 7.1276850 1.9019097 3.0000000 14.2000000
rbc 635 4.4350079 0.3941710 3.1200000 5.9500000
hemog 635 13.4696063 1.11 9.9000000 17.7000000
hcrit 635 39.4653543 3.1623819 29.7000000 51.4000000
mcv 635 89.1184252 4.5190963 65.0000000 106.0000000
mch 634 30.4537855 1.7232248 22.0000000 37.0000000
mchc 634 34.1524290 0.7562054 31.6000000 36.7000000

Launch and run the SAS program, and review the output to familiarize yourself with the summary statistics that the MEANS procedure calculates by default. As you can see, in its most basic form, the MEANS procedure prints N (the number of nonmissing values), the mean, the standard deviation, and the minimum and maximum values of every numeric variable in the data set.
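To make the defaults concrete, the following sketch asks for the same five statistics explicitly, by listing their keywords (N, MEAN, STD, MIN, MAX) in the PROC MEANS statement. It assumes the icdb libref has already been assigned by the LIBNAME statement in Example 11.1, and it should produce the same report as the program above:

```sas
/* Request the five default statistics by keyword.              */
/* Assumes the icdb libref from Example 11.1 is still assigned. */
PROC MEANS data = icdb.hem2 N MEAN STD MIN MAX;
RUN;
```

Listing the keywords is not necessary here, but it shows where the default report comes from.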

In most cases, you probably don't want SAS to calculate summary statistics for every numeric variable in your data set. Instead, you'll probably just want to focus on a few important variables. For our hematology data set, for example, it doesn't make much sense for SAS to calculate summary statistics for the subj and hosp variables. After all, how does it help us to know that the average subj number is 327199.5?

Example 11.3

The following program uses the MEANS procedure's VAR statement to restrict SAS to summarizing just the seven blood data variables in the icdb.hem2 data set:

PROC MEANS data = icdb.hem2;
   var wbc rbc hemog hcrit mcv mch mchc;
RUN;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
wbc 635 7.1276850 1.9019097 3.0000000 14.2000000
rbc 635 4.4350079 0.3941710 3.1200000 5.9500000
hemog 635 13.4696063 1.11 9.9000000 17.7000000
hcrit 635 39.4653543 3.1623819 29.7000000 51.4000000
mcv 635 89.1184252 4.5190963 65.0000000 106.0000000
mch 634 30.4537855 1.7232248 22.0000000 37.0000000
mchc 634 34.1524290 0.7562054 31.6000000 36.7000000

Launch and run the SAS program, and review the output to convince yourself that the subj and hosp variables have been excluded from the analysis.

The other thing you might notice about the output is that there are many more decimal places displayed than are necessary. By default, SAS uses the best. format to display values in reports created by the MEANS procedure. In a technical sense, it means that SAS chooses the format that provides the most information about the summary statistics while maintaining a default field width of 12. In a practical sense, it means that often too many decimal places are displayed.

Example 11.4

The following program uses the MEANS procedure's MAXDEC= option to set the maximum number of decimal places displayed to 2, and the FW= option to set the field width used to display each statistic to 10:

PROC MEANS data = icdb.hem2 MAXDEC = 2 FW = 10;
   var wbc rbc hemog hcrit mcv mch mchc;
RUN;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
wbc 635 7.13 1.90 3.00 14.20
rbc 635 4.44 0.39 3.12 5.95
hemog 635 13.47 1.11 9.90 17.70
hcrit 635 39.47 3.16 29.70 51.40
mcv 635 89.12 4.52 65.00 106.00
mch 634 30.45 1.72 22.00 37.00
mchc 634 34.15 0.76 31.60 36.70

Launch and run the SAS program, and review the output to convince yourself that the maximum number of decimal places and field widths have been modified as claimed. Let's check out the SUMMARY procedure now.

Example 11.5

The following program is identical to the program in the previous example except for two things:

  1. The MEANS keyword has been replaced with the SUMMARY keyword
  2. The PRINT option has been added to the PROC statement:
PROC SUMMARY data = icdb.hem2 MAXDEC = 2 FW = 10 PRINT;
   var wbc rbc hemog hcrit mcv mch mchc;
RUN;
The SUMMARY Procedure
Variable N Mean Std Dev Minimum Maximum
wbc 635 7.13 1.90 3.00 14.20
rbc 635 4.44 0.39 3.12 5.95
hemog 635 13.47 1.11 9.90 17.70
hcrit 635 39.47 3.16 29.70 51.40
mcv 635 89.12 4.52 65.00 106.00
mch 634 30.45 1.72 22.00 37.00
mchc 634 34.15 0.76 31.60 36.70

The MEANS and SUMMARY procedures perform the same functions except for the default setting of the PRINT option. By default, the MEANS procedure produces printed output, while the SUMMARY procedure does not. With the MEANS procedure, you have to use the NOPRINT option to suppress printing, while with the SUMMARY procedure, you have to use the PRINT option to get a printed report.

Launch and run the SAS program, and review the output to convince yourself that there is no difference between the two reports created by the MEANS and SUMMARY procedures.
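As a rough illustration of why you might want to suppress the printed report, the following sketch uses the NOPRINT option on the MEANS procedure together with an OUTPUT statement, so that the means are saved to a data set instead of being printed. The data set name work.hem_means is a hypothetical name chosen just for this illustration, and the icdb libref is assumed to be assigned as in Example 11.1:

```sas
/* Sketch: suppress the printed report and save the means to a data set. */
/* work.hem_means is a hypothetical data set name used for illustration. */
PROC MEANS data = icdb.hem2 NOPRINT;
   var wbc rbc hemog hcrit mcv mch mchc;
   output out = work.hem_means mean = ;
RUN;

/* Print the saved statistics to confirm that they were calculated */
PROC PRINT data = work.hem_means;
RUN;
```

Replacing the MEANS keyword with the SUMMARY keyword and dropping NOPRINT should behave the same way, since the SUMMARY procedure does not print by default.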

Oops, wait a second here... if you're not careful, there is actually a difference. The VAR statement in the Example 11.5 program tells SAS which of the (numeric) variables to summarize. If you do not include a VAR statement in the SUMMARY procedure, SAS merely gives a simple count of the number of observations in the data set. To convince yourself of this, delete the VAR statement from the Example 11.5 program, and re-run it. You should see output that looks something like this:

The SUMMARY Procedure
N Obs
635
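For reference, the Example 11.5 program with its VAR statement deleted would look something like this (again assuming the icdb libref from Example 11.1 is assigned):

```sas
/* Example 11.5 with the VAR statement deleted: */
/* SAS reports only the number of observations  */
PROC SUMMARY data = icdb.hem2 MAXDEC = 2 FW = 10 PRINT;
RUN;
```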
