11.1 - The MEANS and SUMMARY Procedures
In this section, we'll learn the syntax of the simplest MEANS and SUMMARY procedures, as well as familiarize ourselves with the output they generate.
Example 11.1
Throughout our investigation of the MEANS, SUMMARY, and UNIVARIATE procedures, we'll use the hematology data set arising from the ICDB Study. The following program tells SAS to display the contents, and print the first 15 observations, of the data set:
OPTIONS PS = 58 LS = 80 NODATE NONUMBER;
LIBNAME icdb 'C:\Yourdrivename\Stat480WC\sasndata';
PROC CONTENTS data = icdb.hem2 position;
RUN;
PROC PRINT data = icdb.hem2 (OBS = 15);
RUN;
Obs | subj | hosp | wbc | rbc | hemog | hcrit | mcv | mch | mchc |
---|---|---|---|---|---|---|---|---|---|
1 | 110027 | 11 | 7.5 | 4.38 | 13.8 | 40.9 | 93.3 | 31.5 | 33.7 |
2 | 11027 | 11 | 7.6 | 5.20 | 15.2 | 45.8 | 88.0 | 29.2 | 33.1 |
3 | 110039 | 11 | 7.5 | 4.33 | 13.1 | 39.4 | 91.0 | 30.2 | 33.2 |
4 | 110040 | 11 | 8.3 | 4.52 | 12.4 | 38.1 | 84.2 | 27.4 | 32.5 |
5 | 110045 | 11 | 8.9 | 4.72 | 14.6 | 42.7 | 90.4 | 30.9 | 34.1 |
6 | 110049 | 11 | 6.2 | 4.71 | 13.8 | 41.7 | 88.5 | 29.2 | 33.0 |
7 | 110051 | 11 | 6.4 | 4.56 | 13.0 | 37.9 | 83.1 | 28.5 | 34.3 |
8 | 110052 | 11 | 7.1 | 3.69 | 12.5 | 35.6 | 97.2 | 33.8 | 33.8 |
9 | 110053 | 11 | 7.4 | 4.47 | 14.4 | 43.6 | 97.2 | 32.2 | 33.0 |
10 | 110055 | 11 | 6.1 | 4.34 | 12.8 | 38.2 | 88.1 | 29.6 | 33.6 |
11 | 110057 | 11 | 9.5 | 4.70 | 13.4 | 40.5 | 86.0 | 28.4 | 33.0 |
12 | 110058 | 11 | 6.5 | 3.76 | 11.6 | 34.2 | 91.0 | 30.7 | 33.8 |
13 | 110059 | 11 | 7.5 | 4.29 | 12.3 | 36.8 | 85.7 | 28.6 | 33.4 |
14 | 110060 | 11 | 7.6 | 4.57 | 13.8 | 42.0 | 91.8 | 30.1 | 32.8 |
15 | 110062 | 11 | 4.6 | 4.87 | 13.9 | 42.9 | 88.2 | 28.5 | 32.3 |
First, click the link to save the hematology data set to a convenient location on your computer. Then, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Finally, run the program. You may recall that the CONTENTS procedure's POSITION option tells SAS to display the contents of the data set in the order in which the variables appear in the data set. Therefore, you should see output that looks something like this:
# | Variable | Type | Len |
---|---|---|---|
1 | subj | Num | 8 |
2 | hosp | Num | 8 |
3 | wbc | Num | 8 |
4 | rbc | Num | 8 |
5 | hemog | Num | 8 |
6 | hcrit | Num | 8 |
7 | mcv | Num | 8 |
8 | mch | Num | 8 |
9 | mchc | Num | 8 |
The first two variables, subj and hosp, tell us the subject number and the hospital at which the subject's data were collected. The remaining variables, wbc, rbc, hemog, ..., are the blood data variables of most interest. For example, the variables wbc and rbc contain the subject's white blood cell and red blood cell counts, respectively. The really important thing to note when reviewing the output is that all of the blood data variables are continuous numeric variables, which lend themselves perfectly to a descriptive analysis using the MEANS procedure.
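Returning briefly to the CONTENTS step: the procedure also accepts a VARNUM option, which requests the same listing of variables in order of their position in the data set. The following is only a minimal sketch, and it assumes that the icdb libref from Example 11.1 is still assigned in your SAS session:

```sas
/* Assumes the icdb libref was already assigned by the LIBNAME statement in Example 11.1 */
PROC CONTENTS data = icdb.hem2 VARNUM;
RUN;
```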
Example 11.2
The MEANS procedure can include many statements and options for specifying the desired statistics. For the sake of simplicity, we'll start out with the most basic form of the MEANS procedure. The following program simply tells SAS to display basic summary statistics for each numeric variable in the icdb.hem2 data set:
PROC MEANS data = icdb.hem2;
RUN;
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
subj | 635 | 327199.50 | 144410.20 | 10027.00 | 520098.00 |
hosp | 635 | 32.7133858 | 14.4426330 | 11.0000000 | 52.0000000 |
wbc | 635 | 7.1276850 | 1.9019097 | 3.0000000 | 14.2000000 |
rbc | 635 | 4.4350079 | 0.3941710 | 3.1200000 | 5.9500000 |
hemog | 635 | 13.4696063 | 1.11 | 9.9000000 | 17.7000000 |
hcrit | 635 | 39.4653543 | 3.1623819 | 29.7000000 | 51.4000000 |
mcv | 635 | 89.1184252 | 4.5190963 | 65.0000000 | 106.0000000 |
mch | 634 | 30.4537855 | 1.7232248 | 22.0000000 | 37.0000000 |
mchc | 634 | 34.1524290 | 0.7562054 | 31.6000000 | 36.7000000 |
Launch and run the SAS program, and review the output to familiarize yourself with the summary statistics that the MEANS procedure calculates by default. As you can see, in its most basic form, the MEANS procedure prints N (the number of nonmissing values), the mean, the standard deviation, and the minimum and maximum values of every numeric variable in the data set.
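The five default statistics each correspond to a statistic keyword that can be listed on the PROC MEANS statement. As a hedged sketch (again assuming the icdb libref is assigned), spelling out the keywords N, MEAN, STD, MIN, and MAX should produce the same report as the program above:

```sas
/* Requesting the default statistics explicitly by keyword; */
/* the report should match the one produced with no keywords at all */
PROC MEANS data = icdb.hem2 N MEAN STD MIN MAX;
RUN;
```

Listing keywords this way is also how you would ask for a different set of statistics, since only the statistics you name are displayed.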
In most cases, you probably don't want SAS to calculate summary statistics for every numeric variable in your data set. Instead, you'll probably just want to focus on a few important variables. For our hematology data set, for example, it doesn't make much sense for SAS to calculate summary statistics for the subj and hosp variables. After all, how does it help us to know that the average subj number is 327199.5?
Example 11.3
The following program uses the MEANS procedure's VAR statement to restrict SAS to summarizing just the seven blood data variables in the icdb.hem2 data set:
PROC MEANS data = icdb.hem2;
var wbc rbc hemog hcrit mcv mch mchc;
RUN;
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
wbc | 635 | 7.1276850 | 1.9019097 | 3.0000000 | 14.2000000 |
rbc | 635 | 4.4350079 | 0.3941710 | 3.1200000 | 5.9500000 |
hemog | 635 | 13.4696063 | 1.11 | 9.9000000 | 17.7000000 |
hcrit | 635 | 39.4653543 | 3.1623819 | 29.7000000 | 51.4000000 |
mcv | 635 | 89.1184252 | 4.5190963 | 65.0000000 | 106.0000000 |
mch | 634 | 30.4537855 | 1.7232248 | 22.0000000 | 37.0000000 |
mchc | 634 | 34.1524290 | 0.7562054 | 31.6000000 | 36.7000000 |
Launch and run the SAS program, and review the output to convince yourself that the subj and hosp variables have been excluded from the analysis.
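Because the seven blood data variables sit next to one another in the data set (positions 3 through 9 in the CONTENTS output), the VAR statement could also be written with a name range list. This is just a sketch of an alternative, and it depends on the variable order being exactly as shown in Example 11.1:

```sas
/* wbc--mchc selects every variable from wbc through mchc in data set order, */
/* which is the same seven variables listed individually above */
PROC MEANS data = icdb.hem2;
   var wbc--mchc;
RUN;
```

The explicit list in the program above is the safer choice if the variable order might change, since a name range list silently picks up whatever lies between its two endpoints.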
The other thing you might notice about the output is that many more decimal places are displayed than are necessary. By default, the MEANS procedure uses the BESTw. format, with a field width of 12, to display values in its reports. In a technical sense, that means SAS chooses the representation that provides the most information about each summary statistic within that field width. In a practical sense, it means that too many decimal places are often displayed.
Example 11.4
The following program uses the MEANS procedure's MAXDEC= option to limit the number of decimal places displayed to 2, and the FW= option to set the field width used to display each statistic to 10:
PROC MEANS data = icdb.hem2 MAXDEC = 2 FW = 10;
var wbc rbc hemog hcrit mcv mch mchc;
RUN;
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
wbc | 635 | 7.13 | 1.90 | 3.00 | 14.20 |
rbc | 635 | 4.44 | 0.39 | 3.12 | 5.95 |
hemog | 635 | 13.47 | 1.11 | 9.90 | 17.70 |
hcrit | 635 | 39.47 | 3.16 | 29.70 | 51.40 |
mcv | 635 | 89.12 | 4.52 | 65.00 | 106.00 |
mch | 634 | 30.45 | 1.72 | 22.00 | 37.00 |
mchc | 634 | 34.15 | 0.76 | 31.60 | 36.70 |
Launch and run the SAS program, and review the output to convince yourself that the maximum number of decimal places and field widths have been modified as claimed. Let's check out the SUMMARY procedure now.
Example 11.5
The following program is identical to the program in the previous example except for two things:
- The MEANS keyword has been replaced with the SUMMARY keyword
- The PRINT option has been added to the PROC statement:
PROC SUMMARY data = icdb.hem2 MAXDEC = 2 FW = 10 PRINT;
var wbc rbc hemog hcrit mcv mch mchc;
RUN;
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
wbc | 635 | 7.13 | 1.90 | 3.00 | 14.20 |
rbc | 635 | 4.44 | 0.39 | 3.12 | 5.95 |
hemog | 635 | 13.47 | 1.11 | 9.90 | 17.70 |
hcrit | 635 | 39.47 | 3.16 | 29.70 | 51.40 |
mcv | 635 | 89.12 | 4.52 | 65.00 | 106.00 |
mch | 634 | 30.45 | 1.72 | 22.00 | 37.00 |
mchc | 634 | 34.15 | 0.76 | 31.60 | 36.70 |
The MEANS and SUMMARY procedures perform the same functions except for the default setting of the PRINT option. By default, the MEANS procedure produces printed output, while the SUMMARY procedure does not. With the MEANS procedure, you have to use the NOPRINT option to suppress printing, while with the SUMMARY procedure, you have to use the PRINT option to get a printed report.
Launch and run the SAS program, and review the output to convince yourself that there is no difference between the two reports created by the MEANS and SUMMARY procedures.
But wait a second: if you're not careful, there actually is a difference. The VAR statement in the above program tells SAS which of the (numeric) variables to summarize. If you do not include a VAR statement in the SUMMARY procedure, SAS merely gives a simple count of the number of observations in the data set. To convince yourself of this, delete the VAR statement, and re-run the SAS program. You should see output that looks something like this:
N Obs |
---|
635 |
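To make the contrast concrete, the two steps below run both procedures without a VAR statement. This is a minimal sketch that reuses the options from Examples 11.4 and 11.5 and assumes the icdb libref is still assigned:

```sas
/* Without a VAR statement, PROC SUMMARY reports only the number of observations */
PROC SUMMARY data = icdb.hem2 MAXDEC = 2 FW = 10 PRINT;
RUN;

/* Without a VAR statement, PROC MEANS summarizes every numeric variable, */
/* including subj and hosp, as in Example 11.2 */
PROC MEANS data = icdb.hem2 MAXDEC = 2 FW = 10;
RUN;
```

The first step should yield the single N Obs count shown above, while the second should yield a full table of summary statistics, so the VAR statement matters more for the SUMMARY procedure than for the MEANS procedure.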