11.3 - Group Processing

All of the examples we've looked at so far have involved summarizing all of the observations in the data set. In many cases, we'll instead want to tell SAS to calculate summary statistics for certain subgroups. For example, it makes more sense to calculate the average height for males and females separately rather than calculating an average height of all individuals together. In this section, we'll investigate two ways of producing summary statistics for subgroups. One approach involves using a CLASS statement, and the other involves using a BY statement. As you'll soon see, the approach you choose to use will depend most on how you'd like your final report to look.

Example 11.7

The following program uses the VAR and CLASS statements to tell SAS to calculate the default summary statistics of the rbc, wbc, and hcrit variables separately for each of the nine hosp values:

PROC MEANS data=icdb.hem2 fw=10 maxdec=2;
   var rbc wbc hcrit;
   class hosp;
RUN;

The MEANS Procedure

hosp   N Obs   Variable     N      Mean   Std Dev   Minimum   Maximum
11       106   rbc        106      4.41      0.42      3.47      5.55
               wbc        106      7.11      1.92      3.30     13.10
               hcrit      106     39.78      3.30     32.80     48.90
21       108   rbc        108      4.43      0.39      3.33      5.35
               wbc        108      7.37      1.94      3.10     12.60
               hcrit      108     39.28      2.90     33.00     48.00
22        42   rbc         42      4.40      0.43      3.12      5.12
               wbc         42      7.37      2.15      3.90     14.20
               hcrit       42     38.80      3.09     31.00     45.30
23         6   rbc          6      4.28      0.45      3.64      4.94
               wbc          6      5.17      1.37      3.30      6.60
               hcrit        6     39.37      3.31     35.90     44.80
31        52   rbc         52      4.42      0.41      3.55      5.62
               wbc         52      7.50      1.87      4.00     13.20
               hcrit       52     39.28      3.34     33.90     47.80
41        92   rbc         92      4.50      0.44      3.54      5.49
               wbc         92      7.11      1.92      3.30     13.10
               hcrit       92     40.19      3.51     32.50     49.90
42        95   rbc         95      4.40      0.33      3.63      5.95
               wbc         95      7.01      1.79      3.30     12.00
               hcrit       95     39.14      2.66     33.90     51.40
51        65   rbc         65      4.50      0.42      3.80      5.70
               wbc         65      7.01      1.79      3.30     12.00
               hcrit       65     39.14      2.66     33.90     51.40
52        69   rbc         69      4.43      0.32      3.75      5.40
               wbc         69      6.74      1.66      3.90     10.20
               hcrit       69     38.81      2.89     29.70     45.00

First, you should note that the variables appearing in the CLASS statement need not be character variables. Here, we use the numeric variable hosp to break up the 635 observations in the icdb.hem2 data set into nine subgroups. When CLASS variables are numeric, they should of course contain a limited number of discrete values that represent meaningful subgroups. Otherwise, you will be certain to generate an awful lot of useless output.

Now, launch and run the SAS program, and review the output to convince yourself that the report is generated as described. As you can see, the MEANS procedure does not generate statistics for the CLASS variables. Their values are instead used only to categorize the data.

Let's see what happens when our CLASS statement contains more than one variable.

Example 11.8

The following program reads some data on national parks into a temporary SAS data set called parks, and then uses the MEANS procedure's VAR and CLASS statements to tell SAS to sum the number of museums and camping facilities for each combination of the Type and Region variables:

DATA parks;
     input ParkName $ 1-21 Type $ Region $ Museums Camping;
     DATALINES;
Dinosaur              NM West 2  6
Ellis Island          NM East 1  0
Everglades            NP East 5  2
Grand Canyon          NP West 5  3
Great Smoky Mountains NP East 3 10
Hawaii Volcanoes      NP West 2  2
Lava Beds             NM West 1  1
Statue of Liberty     NM East 1  0
Theodore Roosevelt    NP West 2  2
Yellowstone           NP West 9 11
Yosemite              NP West 2 13
;
RUN;

PROC MEANS data = parks fw = 10 maxdec = 0 sum;
   var museums camping;
   class type region;
RUN;

The MEANS Procedure

Type   Region   N Obs   Variable   Sum
NM     East         2   Museums      2
                        Camping      0
       West         2   Museums      3
                        Camping      7
NP     East         2   Museums      8
                        Camping     12
       West         5   Museums     20
                        Camping     31

Now, launch and run the SAS program, and review the output. You should see that, for example, SAS determined that the number of museums in National Monuments in the East is 2. The number of museums in National Monuments in the West is 3. And so on.

It is probably more important here to note how SAS processed the CLASS statement. As you can see, the Type variable appears first and the Region variable appears second in the CLASS statement. For that reason, the Type variable appears first and the Region variable appears second in the output. In general, the order of the variables in the CLASS statement determines their order in the output table. To convince yourself of this, you might want to change the order of the variables as they appear in the CLASS statement, and re-run the SAS program to see what you get.
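
One way to try the suggested experiment is sketched below. It assumes the temporary parks data set from the DATA step above still exists in your SAS session, and it changes nothing except the order of the variables in the CLASS statement:

```sas
/* Assumes the temporary data set parks created above is still available */
PROC MEANS data = parks fw = 10 maxdec = 0 sum;
   var museums camping;
   class region type;   /* Region is now listed first, Type second */
RUN;
```

With this ordering, you should expect Region to become the leftmost column of the report, with the Type values nested within each region. The sums themselves should not change, because the same four combinations of Type and Region are being summarized.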

Example 11.9

Like the CLASS statement, the BY statement specifies variables to use for categorizing observations. The following program uses the MEANS procedure's BY statement to categorize the observations in the parks data set into four subgroups, as determined by the Type and Region variables, before calculating the sum, minimum and maximum of the museums and camping values for each of the subgroups:

PROC SORT data = parks out = srtdparks;
   by type region;
RUN;

PROC MEANS data = srtdparks fw = 10 maxdec = 0 sum min max;
   var museums camping;
   by type region;
RUN;

---Type=NM Region=East---

The MEANS Procedure

Variable   Sum   Minimum   Maximum
Museums      2         1         1
Camping      0         0         0

---Type=NM Region=West---

Variable   Sum   Minimum   Maximum
Museums      3         1         2
Camping      7         1         6

---Type=NP Region=East---

Variable   Sum   Minimum   Maximum
Museums      8         3         5
Camping     12         2        10

---Type=NP Region=West---

Variable   Sum   Minimum   Maximum
Museums     20         2         9
Camping     31         2        13

Go ahead and launch and run the SAS program to see what the report looks like when you use a BY statement instead of a CLASS statement to form the subgroups. Recall that when you use a CLASS statement, SAS generates a single large table containing all of the summary statistics. As you can see in the output here, when you instead use a BY statement, SAS generates a separate table for each combination of the Type and Region variables. To be more specific, SAS creates four tables here: one for Type = NM and Region = East, one for Type = NM and Region = West, one for Type = NP and Region = East, and one for Type = NP and Region = West.

Of course, there is one thing we've not yet addressed in this code: the SORT procedure. Unlike CLASS processing, BY-group processing requires that your data already be sorted in the order of the variables that appear in the BY statement. If the observations in your data set are not sorted in order by the variables appearing in the BY statement, then you have to use the SORT procedure to sort your data set before using it in the MEANS procedure. Don't forget that if you don't specify an output data set using the OUT= option, then the SORT procedure overwrites your initial data set with the newly sorted observations. Here, our SORT procedure tells SAS to sort the parks data set by the Type and Region variables, and to store the sorted data set in a new data set called srtdparks.
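
To see the overwriting behavior for yourself without disturbing the original parks data set, you could try something like the following sketch. The data set name parkscopy is made up for this illustration; everything else mirrors the program above:

```sas
/* parkscopy is a hypothetical scratch copy, so parks itself is left untouched */
DATA parkscopy;
   set parks;
RUN;

/* No OUT= option here: the sorted observations replace parkscopy itself */
PROC SORT data = parkscopy;
   by type region;
RUN;

/* parkscopy is now in Type, Region order, so the BY statement is satisfied */
PROC MEANS data = parkscopy fw = 10 maxdec = 0 sum min max;
   var museums camping;
   by type region;
RUN;
```

If you were to skip the SORT step and run the MEANS procedure with the same BY statement on the unsorted parks data set, you should expect SAS to stop with an error in the log, since the Type values in parks are not in sorted order.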

In closing our discussion of group processing, we should probably discuss which approach, the CLASS statement or the BY statement, is more appropriate. My personal opinion is that it's all a matter of preference. If you prefer to see your summary statistics in one large table, then you should use the CLASS statement. If you instead prefer to see your summary statistics in several smaller tables, then you should use the BY statement. My personal opinion doesn't take into account the efficiency of your program, however. The advantage of the CLASS statement is that it is easier to use, since you need not sort the data first. The advantage of the BY statement is that it can be more efficient when you are categorizing data by many variables.
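
The two statements can also appear in the same MEANS procedure, which gives a report layout somewhere between the two shown above. The following is only a sketch, and it assumes the sorted srtdparks data set from Example 11.9 is still available:

```sas
/* Assumes srtdparks (sorted by Type and Region) from Example 11.9 exists */
PROC MEANS data = srtdparks fw = 10 maxdec = 0 sum;
   var museums camping;
   by type;        /* one table per Type value; data must be sorted by Type */
   class region;   /* within each table, one set of rows per Region value */
RUN;
```

Here you should expect two tables, one for Type = NM and one for Type = NP, each with the Region values listed as rows. Only the BY variable needs to be in sorted order; the CLASS variable does not.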