11.3 - Group Processing

All of the examples we've looked at so far have involved summarizing all of the observations in the data set. In many cases, we'll instead want to tell SAS to calculate summary statistics for certain subgroups. For example, it makes more sense to calculate the average height for males and females separately rather than calculating the average height of all individuals together. In this section, we'll investigate two ways of producing summary statistics for subgroups. One approach involves using a CLASS statement, and the other involves using a BY statement. As you'll soon see, the approach you choose to use will depend most on how you'd like your final report to look.

Example 11.7

The following program uses the VAR and CLASS statements to tell SAS to calculate the default summary statistics of the rbc, wbc, and hcrit variables separately for each of the nine hosp values:

PROC MEANS data=icdb.hem2 fw=10 maxdec=2;
   var rbc wbc hcrit;
   class hosp;
RUN;

The MEANS Procedure
hosp	N Obs	Variable	N	Mean	Std Dev	Minimum	Maximum
11	106	rbc	106	4.41	0.42	3.47	5.55
		wbc		7.11	1.92	3.30	13.10
		hcrit		39.78	3.30	32.80	48.90
21	108	rbc	108	4.43	0.39	3.33	5.35
		wbc		7.37	1.94	3.10	12.60
		hcrit		39.28	2.90	33.00	48.00
22	42	rbc	42	4.40	0.43	3.12	5.12
		wbc		7.37	2.15	3.90	14.20
		hcrit		38.80	3.09	31.00	45.30
23	6	rbc	6	4.28	0.45	3.64	4.94
		wbc		5.17	1.37	3.30	6.60
		hcrit		39.37	3.31	35.90	44.80
31	52	rbc	52	4.42	0.41	3.55	5.62
		wbc		7.50	1.87	4.00	13.20
		hcrit		39.28	3.34	33.90	47.80
41	92	rbc	92	4.50	0.44	3.54	5.49
		wbc		7.00	1.93	3.00	12.50
		hcrit		40.19	3.51	32.50	49.90
42	95	rbc	95	4.40	0.33	3.63	5.95
		wbc		7.01	1.79	3.30	12.00
		hcrit		39.14	2.66	33.90	51.40
51	65	rbc	65	4.50	0.42	3.80	5.70
		wbc		7.25	1.95	3.50	11.50
		hcrit		40.00	3.50	33.00	49.00
52	69	rbc	69	4.43	0.32	3.75	5.40
		wbc		6.74	1.66	3.90	10.20
		hcrit		38.81	2.89	29.70	45.00

First, you should note that the variables appearing in the CLASS statement need not be character variables. Here, we use the numeric variable hosp to break up the 635 observations in the icdb.hem2 data set into nine subgroups. When CLASS variables are numeric, they should of course contain a limited number of discrete values that represent meaningful subgroups. Otherwise, you will be certain to generate an awful lot of useless output.

Now, launch and run the SAS program, and review the output to convince yourself that the report is generated as described. As you can see, the MEANS procedure does not generate statistics for the CLASS variables. Their values are instead used only to categorize the data.
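As a further small sketch of a numeric CLASS variable, the following program assumes that the sashelp.class sample data set that ships with SAS is available in your session. That data set is not part of this lesson's data and is used here only for illustration. Its age variable is numeric but takes only a handful of whole-number values, so it forms meaningful subgroups:

```sas
/* Assumption: the SASHELP.CLASS sample data set is available.    */
/* age is numeric with only a few distinct values, so it works    */
/* well as a CLASS variable; height and weight are the analysis   */
/* variables summarized within each age group.                    */
PROC MEANS data = sashelp.class fw = 10 maxdec = 1;
   var height weight;
   class age;
RUN;
```

By contrast, putting a continuous variable such as height in the CLASS statement would tend to create nearly one subgroup per observation, which is the useless output warned about above.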

Let's see what happens when our CLASS statement contains more than one variable.

Example 11.8

The following program reads some data on national parks into a temporary SAS data set called parks, and then uses the MEANS procedure's VAR and CLASS statements to tell SAS to sum the number of museums and camping facilities for each combination of the Type and Region variables:

DATA parks;
     input ParkName $ 1-21 Type $ Region $ Museums Camping;
     DATALINES;
Dinosaur              NM West 2  6
Ellis Island          NM East 1  0
Everglades            NP East 5  2
Grand Canyon          NP West 5  3
Great Smoky Mountains NP East 3 10
Hawaii Volcanoes      NP West 2  2
Lava Beds             NM West 1  1
Statue of Liberty     NM East 1  0
Theodore Roosevelt    NP West 2  2
Yellowstone           NP West 9 11
Yosemite              NP West 2 13
     ;
RUN;

PROC MEANS data = parks fw = 10 maxdec = 0 sum;
   var museums camping;
   class type region;
RUN;

National Parks
The MEANS Procedure
Type	Region	N Obs	Variable	Sum
NM	East	2	Museums	2
	East	2	Camping	0
	West	2	Museums	3
	West	2	Camping	7
NP	East	2	Museums	8
	East	2	Camping	12
	West	5	Museums	20
	West	5	Camping	31

Now, launch and run the SAS program, and review the output. You should see that, for example, SAS determined that the number of museums in National Monuments in the East is 2. The number of museums in National Monuments in the West is 3. And so on.

More important here is how SAS processed the CLASS statement. The Type variable appears first and the Region variable appears second in the CLASS statement, so Type appears first and Region appears second in the output. In general, the order of the variables in the CLASS statement determines their order in the output table. To convince yourself of this, you might want to change the order of the variables as they appear in the CLASS statement and re-run the SAS program to see what you get.
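One way to try that experiment is sketched below. It assumes the temporary parks data set created by the DATA step in this example still exists in your session. The only change from the original program is the order of the variables in the CLASS statement:

```sas
/* Assumption: the temporary data set parks from this example exists. */
/* Region is now listed before Type in the CLASS statement.           */
PROC MEANS data = parks fw = 10 maxdec = 0 sum;
   var museums camping;
   class region type;
RUN;
```

If the rule described above holds, Region should now be the leftmost column of the report, with the Type values nested within each region.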

Example 11.9

Like the CLASS statement, the BY statement specifies variables to use for categorizing observations. The following program uses the MEANS procedure's BY statement to categorize the observations in the parks data set into four subgroups, as determined by the Type and Region variables, before calculating the sum, minimum, and maximum of the museums and camping values for each of the subgroups:

PROC SORT data = parks out = srtdparks;
   by type region;
RUN;
PROC MEANS data = srtdparks fw = 10 maxdec = 0 sum min max;
   var museums camping;
   by type region;
RUN;

The MEANS Procedure
---Type=NM Region=East---
Variable	Sum	Minimum	Maximum
Museums	2	1	1
Camping	0	0	0

---Type=NM Region=West---
Variable	Sum	Minimum	Maximum
Museums	3	1	2
Camping	7	1	6

---Type=NP Region=East---
Variable	Sum	Minimum	Maximum
Museums	8	3	5
Camping	12	2	10

---Type=NP Region=West---
Variable	Sum	Minimum	Maximum
Museums	20	2	9
Camping	31	2	13

Go ahead and launch and run the SAS program to see what the report looks like when you use a BY statement instead of a CLASS statement to form the subgroups. Recall that when you use a CLASS statement, SAS generates a single large table containing all of the summary statistics. As you can see in the output here, when you instead use a BY statement, SAS generates a separate table for each combination of the Type and Region variables. To be more specific, SAS creates four tables here: one for Type = NM and Region = East, one for Type = NM and Region = West, one for Type = NP and Region = East, and one for Type = NP and Region = West.
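For a side-by-side comparison of the two report layouts, here is a sketch that requests the same three statistics with a CLASS statement instead. It assumes the temporary parks data set from Example 11.8 is still available; because CLASS processing does not need sorted data, it reads parks rather than srtdparks:

```sas
/* Assumption: the temporary data set parks from Example 11.8 exists. */
/* Same statistics as the BY-statement program, but the subgroups     */
/* are formed with CLASS, so the results appear in a single table.    */
PROC MEANS data = parks fw = 10 maxdec = 0 sum min max;
   var museums camping;
   class type region;
RUN;
```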

Of course, there is one thing in this code that we've not yet addressed: the SORT procedure. Unlike CLASS processing, BY-group processing requires that your data be sorted in the order of the variables that appear in the BY statement. If the observations in your data set are not sorted in order by the variables appearing in the BY statement, then you have to use the SORT procedure to sort your data set before using it in the MEANS procedure. Don't forget that if you don't specify an output data set using the OUT= option, then the SORT procedure overwrites your initial data set with the newly sorted observations. Here, our SORT procedure tells SAS to sort the parks data set by the Type and Region variables and to store the sorted data set in a new data set called srtdparks.
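To make the role of the OUT= option concrete, here is a sketch of the same analysis with the OUT= option omitted. It assumes the temporary parks data set from Example 11.8 exists. Because no output data set is named, the sorted observations replace the original contents of parks, and the MEANS procedure then reads parks directly:

```sas
/* Assumption: the temporary data set parks from Example 11.8 exists. */
/* No OUT= option: the sorted observations overwrite parks itself.    */
PROC SORT data = parks;
   by type region;
RUN;

/* parks is now in Type, Region order, so it can be used with BY. */
PROC MEANS data = parks fw = 10 maxdec = 0 sum min max;
   var museums camping;
   by type region;
RUN;
```

Overwriting is harmless for a small temporary data set like this one, but naming a separate output data set with OUT= is the safer habit when the original order of a data set matters.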

In closing off our discussion of group processing, we should probably discuss which approach, the CLASS statement or the BY statement, is more appropriate. My personal opinion is that it's largely a matter of preference. If you prefer to see your summary statistics in one large table, then you should use the CLASS statement. If you prefer to see your summary statistics in several smaller tables, then you should use the BY statement. That opinion doesn't take into account the efficiency of your program, however. The advantage of the CLASS statement is that it is easier to use, since you need not sort the data first. The advantage of the BY statement is that it can be more efficient when you are categorizing data by many variables.