11.4 - Creating Summarized Data Sets

11.4 - Creating Summarized Data Sets

There are many situations, when performing statistical analyses on continuous data, in which you want to create a data set whose observations contain summary statistics rather than observations containing the original raw data. For example, you might want to create a graph that compares the average weight loss of subjects at, say, ten different weight loss clinics. One way of creating such a graph is to first create a data set that contains ten observations — one for each of the clinics — and an average weight loss variable. The MEANS procedure's OUTPUT statement, in conjunction with the NOPRINT option, provides the mechanism to create such a data set rather than the standard printed output.

The NOPRINT option tells SAS to suppress all printed output. The OUTPUT statement, which tells SAS to create the output data set, in general, takes the form:

OUTPUT OUT=dsn keyword1=name1 keyword2=name2 ....;

where dsn is the name of the data set you want to create, and keyword1 is the first statistic you want dumped to the output data set and name1 is the name you want to call the variable in the data set representing that first statistic. Similarly, keyword2 is the second statistic you want dumped to the output data set and name2 is the name you want to call the variable in the data set representing that second statistic. And so on. When you use the OUTPUT statement without specifying any keywords, the default summary statistics N, MEAN, STD, MIN, and MAX are produced for all of the numeric variables or for all of the variables that are listed in the VAR statement.

Example 11.10

The following program uses the MEANS procedure's OUTPUT statement (and NOPRINT option) to create a temporary data set called hospsummary that has one observation for each hospital that contains summary statistics for the hospital:

  PROC MEANS data=icdb.hem2 NOPRINT;
      var rbc wbc hcrit;
    class hosp;
    output out = hospsummary
           mean = MeanRBC MeanWBC MeanHCRIT
         median = MedianRBC MedianWBC MedianHCRIT;
  RUN;
  
  PROC PRINT;
    title 'Hospital Statistics';
  RUN;
Hospital Statistics
Obs hosp _Type_ _FREQ_ MeanRBC MeanWBC MeanHCRIT MedianRBC MedianWBC MedianHCRIT
1 . 0 635 4.43501 7.12769 39.4654 4.410 7.0 39.30
2 11 1 106 4.41321 7.10660 39.7821 4.405 7.2 39.35
3 21 1 108 4.42991 7.36944 39.2769 4.440 7.6 39.00
4 22 1 42 4.39571 7.37071 38.8024 4.445 7.1 39.40
5 23 1 6 4.2767 5.16667 39.3667 4.235 5.4 39.10
6 31 1 52 4.42135 7.50212 39.2846 4.375 7.5 39.10
7 41 1 92 4.50207 7.00435 40.1859 4.455 6.8 40.20
8 42 1 95 4.39726 7.00632 39.1358 4.350 6.9 39.30
9 51 1 65 4.50000 7.24615 39.9969 4.500 7.1 40.00
10 52 1 69 4.42580 6.74203 38.8145 4.410 6.7 38.50

Let's first review the code. The VAR statement tells SAS the three variables — rbc, wbc, and hcrit — that we want summarized. The CLASS statement tells SAS that we want to categorize the observations by the value of the hosp variable. The OUT= portion of the OUTPUT statement tells SAS that we want to create a temporary data set called hospsummary. The MEAN= portion of the OUTPUT statement tells SAS to calculate the average of the rbc, wbc, and hcrit values and store the results, respectively, in three new variables called MeanRBC, MeanWBC, and MeanHCRIT. The MEDIAN= portion of the OUTPUT statement tells SAS to calculate the median of the rbc, wbc, and hcrit values and store the results, respectively, in three new variables called MedianRBC, MedianWBC, and MedianHCRIT. Note that, for each keyword, the variables must be listed in the same order as they appear in the VAR statement.

The NOPRINT option of the PROC MEANS statement tells SAS to suppress printing of the summary statistics. We must use the PRINT procedure then to tell SAS to print contents of the hospsummary data set. Because the PROC PRINT statement contains no DATA= option, SAS prints the current data set. The data set created by the MEANS procedure becomes the most recent data set. Therefore, the hospsummary data set is the one that is printed.

Now, launch and run the SAS program and review the output to make sure you understand the summarized data set we created. As we'd expect, the data set contains the hosp variable and the six requested variables, MeanRBC, MeanWBC, ..., MedianHCRIT, that contain the summary statistics. As you can see, the data set also contains two additional variables, _TYPE_ and _FREQ_.

Whenever you use a CLASS statement to create an output data set containing statistics on subgroups, SAS automatically creates these two additional variables. Not surprisingly, the _FREQ_ variable indicates the number of observations contributing to each of the statistics calculated. The _TYPE_ variable indicates what kind of a summary statistic each of the observations in hospsummary contains. You can see that, here, _TYPE_ takes on two possible values 0 and 1. When _TYPE_ = 1, it means that the summary statistic is at the subgroup (hosp) level. That's why you'll see that _TYPE_ = 1 for nine of the observations in hospsummary — one for each hospital. All we really wanted here were these nine observations, but SAS had to complicate matters by giving us this "bonus" observation in which _TYPE_ = 0. When _TYPE_ = 0, it means that the summary statistics are overall summary statistics. That's why for the one observation in which _TYPE_ = 0, you'll see that _FREQ_ = 635. That tells us that all of the observations in icdb.hem2 went into calculating the means and medians for that observation in hospsummary. It should also make sense then that hosp = . for that observation. Ugh, this is sounding messy!

Example 11.11

You can also create a summarized data set using the SUMMARY procedure. The following program is identical to the program in the previous example except for two things:

  1. The MEANS keyword has been replaced with the SUMMARY keyword
  2. The NOPRINT option has been removed from the PROC statement:
PROC SUMMARY data=icdb.hem2;
    var rbc wbc hcrit;
	class hosp;
	output out = hospsummary 
	       mean = MeanRBC MeanWBC MeanHCRIT
		   median = MedianRBC MedianWBC MedianHCRIT;
RUN;

PROC PRINT;
  title 'Hospital Statistics';
RUN;
Hospital Statistics
Obs hosp _Type_ _FREQ_ MeanRBC MeanWBC MeanHCRIT MedianRBC MedianWBC MedianHCRIT
1 . 0 635 4.43501 7.12769 39.4654 4.410 7.0 39.30
2 11 1 106 4.41321 7.10660 39.7821 4.405 7.2 39.35
3 21 1 108 4.42991 7.36944 39.2769 4.440 7.6 39.00
4 22 1 42 4.39571 7.37071 38.8024 4.445 7.1 39.40
5 23 1 6 4.2767 5.16667 39.3667 4.235 5.4 39.10
6 31 1 52 4.42135 7.50212 39.2846 4.375 7.5 39.10
7 41 1 92 4.50207 7.00435 40.1859 4.455 6.8 40.20
8 42 1 95 4.39726 7.00632 39.1358 4.350 6.9 39.30
9 51 1 65 4.50000 7.24615 39.9969 4.500 7.1 40.00
10 52 1 69 4.42580 6.74203 38.8145 4.410 6.7 38.50

There's nothing really new here. This example should just reinforce the fundamental difference between the SUMMARY and MEANS procedure. The SUMMARY procedure by default does not print output. That's why it is not necessary to use a NOPRINT option to tell SAS to suppress printing of output. This example should also reinforce the fundamental similarity between the SUMMARY and MEANS procedure, namely that the two procedures use identical syntax and produce identical output. Launch and run the SAS program, and review the output to convince yourself that there is no difference between the two data sets created by the MEANS and SUMMARY procedures.

Example 11.12

You can also create a summarized data set similar to the hospsummary data set created in the previous two examples by using a BY statement instead of a CLASS statement. The following program does just that:

PROC SORT data = icdb.hem2 out = srtdhem2;
   by hosp;
RUN;

PROC MEANS data=srtdhem2 NOPRINT;
    var rbc wbc hcrit;
	by hosp;
	output out = hospsummary 
	       mean = MeanRBC MeanWBC MeanHCRIT
		   median = MedianRBC MedianWBC MedianHCRIT;
RUN;

PROC PRINT;
  title 'Hospital Statistics';
RUN;
Hospital Statistics
Obs hosp _Type_ _FREQ_ MeanRBC MeanWBC MeanHCRIT MedianRBC MedianWBC MedianHCRIT
1 11 0 106 4.41321 7.10660 39.7821 4.405 7.2 39.35
2 21 0 108 4.42991 7.36944 39.2769 4.440 7.6 39.00
3 22 0 42 4.39571 7.37071 38.8024 4.445 7.1 39.40
4 23 0 6 4.27667 5.16667 39.3667 .235 5.4 39.10
5 31 0 52 4.42135 7.50212 39.2846 4.375 7.5 39.10
6 41 0 92 4.50207 7.00435 40.1859 4.455 6.8 40.20
7 42 0 95 4.39726 7.00632 39.1358 4.350 6.9 39.30
8 51 0 65 4.50000 7.24615 39.9969 4.500 7.1 40.00
8 52 0 69 4.42580 6.74203 38.8145 4.410 6.7 38.50

As you can see, the only difference between this program and that in Example 11.10 is that the CLASS statement was replaced by a BY statement, and of course because of that, we had to add a SORT procedure so we could sort the data in icdb.hem2 by hosp. Launch and run the SAS program, and review the output to convince yourself that there is not much of a difference between the resulting hospsummary data set here and that in Examples 11.10 and 11.11.

Well, okay, here _TYPE_ = 0 means that all of the observations here contain summary statistics at the subgroup level. The meaning of _TYPE_ should now seem very confusing to you. Fortunately, we don't need to worry about it much! There is always SAS Help and Documentation available if you're dying to learn more about it. The more important thing to note here is that the MEANS procedure summarizes each BY group as an independent subset of the input data, and therefore, SAS does not produce any sort of overall summarization as it does when using the CLASS statement.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility