32.5 - Summarizing and Grouping Data

In previous sections, we use the SQL procedure to generate detailed reports. Sometimes, the summarized report is also necessary for us to explore data. To do that, we will need summary functions and/or the GROUP BY clause in PROC SQL.

Many summary functions that are used in other SAS steps can also work well in PROC SQL. Below is the table of summary functions you can request:

Summary function Description
AVG, MEAN mean or average of values
COUNT, FREQ, N number of non-missing values
CSS the corrected sum of squares
CV coefficient of variation(percent)
MAX largest value
MIN smallest value
NMISS number of missing values
PRT probability of a greater absolute value of the student's t
RANGE range of values
STD standard deviation
STDERR standard error of the mean
SUM sum of values
T student's t-value for testing the hypothesis
USS the uncorrected sum of squares
VAR variance

Note: some functions have multiple names. The first listed is the SQL name.

Next, we will work through examples to see how these functions perform calculations in PROC SQL. Along the way, the GROUP BY clause will be introduced and work with the functions.

In this section, the data from survey.sas7bdat will be used for all examples. First, make sure the file is downloaded to your computer and revise the libname as needed.

Example 32.16 Section

The following program uses the AVG() function to calculate the mean scores of SAT math and verbal test:

PROC SQL;
	select avg(SATM) as average_Math,
		avg(SATV) as average_Verbal
	from stat482.survey;
QUIT

average_Math average_Verbal
599.0046 580.3256

First launch and run the SAS program. When checking the output you will see two overall average scores have been calculated for SATM and SATV separately. There is only one observation in the output window.

Let’s review the function in the code. In this case, either MEAN() or AVG() can be used to calculate the average. Only one argument (column) is inside the function AVG(). So the statistic is calculated across all rows for one column.

AVG(STAM)—the overall average score of SATM

AVG(STAV)—the overall average score of SATV

arrows pointing down along the STAM and STAV average columns in the SAS tableview

Quite simple, right? Let’s add one more argument to the function. Can you guess how many observations will be in the output?

Example 32.17 Section

In the following program, two columns are the arguments of the function MEAN():

PROC SQL;
	select mean(SATM, SATV) as average
	from stat482.survey;
QUIT;

average
700
600
470
635
560
665
690
595
655
560
690
.
515
700
500
620
650
675
660
615
450
650
675
665
620
560
575
605
680
450
630
350
625
650
525
645
450
665
560
.
690
640
615
565
695
620
635
620
630
450
600
610
455
605
680
550
600
587.5
775
545
615
670
575
625
587.5
425
600
580
545
605
655
570
575
475
600
580
585
530
670
660
610
620
570
690
710
555
575
.
550
445
725
575
560
675
670
650
535
530
650
505
530
625
475
.
595
560
580
525
495
650
555
.
600
530
600
555
525
675
635
675
475
590
545
435
490
645
625
610
520
640
765
420
500
625
585
625
610
570
565
570
.
610
525
620
475
590
600
630
670
640
590
550
590
485
.
660
400
540
540
700
615
710
595
625
530
610
535
605
660
600
500
680
585
640
620
595
510
650
660
610
575
550
505
.
600
530
720
610
590
540
530
.
500
620
555
520
650
525
630
620
710
500
555
640
550
625
655
475
580
625
610
525
425
545
640
640
510
550
610
480
600
595
.
640
635
575

We changed the program a little bit. Both SATM and SATV are put inside the function as arguments. Launch and run the SAS program. You will see there are 226 observations, which is the same as in the original survey data.

If you add more than one column as arguments of summary functions, SAS will perform the calculation across the columns for each row to generate the above output.

arrows pointing to the right across the SATM and SATV rows

In this case, the summary function is not performing aggregation anymore. SAS then looks for a like-named function in BASE SAS. If yes, the calculation will be performed for each row; if not, an error message will be output in the log window. You can try to change MEAN() to AVG() to see what will happen.

ERROR: Function AVG could not be located.

Example 32.18 Section

The following program uses only one argument for MEAN(), but add one more column in the SELECT clause:

PROC SQL;
	select Gender,
		mean(SATM) as average_Math
	from stat482.survey;
QUIT;

Gender average_Math
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Female 599.0046
Female 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Male 599.0046
Female 599.0046
Male 599.0046

In the above program, the the SELECT statement changed again. This time, only one argument is for the MEAN() function to calculate the overall average score of SAT math grades. Outside the function, another column has been selected as well. What output will it produce?

Launch and run the SAS program. You may be surprised that the output contains 226 rows. Review the output you will see two things that have been done by the above code:

  1. It calculated the overall average math score for all students.
  2. It displayed all rows in the output because Gender is not an argument of MEAN() function.

Note that the overall average math score is just repeated for each row. You can find a message like the one below in the log window. When you submit such a program, SAS calculate the statistic first. Then merge it back with other columns. That’s how “remerging” happens.

NOTE: The query requires remerging summary statistics back with the original data

The above result is not what we wanted. Now, let’s see how to use the GROUP BY clause to make it reasonable.

Example 32.19 Section

The following example calculates the average SAT math score for each gender group:

PROC SQL;
	select Gender,
				mean(SATM) as average_Math
	from stat482.survey
	group by Gender;
QUIT;

Gender average_Math
Female 589.5082
Male 611.3298

The above program seems identical to the program in the previous example except for one more clause: GROUP BY. Finally, we get it right and obtain the desired result: the average SAT math scores for female and male students. Of course, you can make further use of GROUP BY by adding multiple columns. Let’s find out with the next example.

Example 32.20 Section

The following program uses both Gender and SmokeCigarettes in the GROUP BY clause to calculate the average SAT math scores:

PROC SQL;
	select Gender,
		SmokeCigarettes,
		mean(SATM) as average_Math
	from stat482.survey
	group by 1, 2;
QUIT;

Gender SmokeCigarettes average_Math
Female No 589.6552
Female Yes 586.6667
Male No 613.2353
Male Yes 593.3333

Launch and run the SAS program, then review the output. As you can see, the average math scores are calculated for each smoking group (Yes or No) inside each gender group (Female or Male).

Just one more thing about this program, the columns can also be referred to by their locations in the SELECT clause as in the WHERE clause. Here, 1 and 2 are used to refer to Gender and SmokeCigarettes.

Next, we will pay attention to one special summary function in SQL, which is COUNT(). You can use the COUN() function to count the non-missing values.

Example 32.21 Section

The following example count the number of rows in survey data, the number of non-missing records for math and verbal test scores, and the distinct values of gender:

PROC SQL;
	select count(*) as No_obs,
		count(SATM) as No_Math_records,
		count(SATV) as No_Verbal_records,
		count(distinct Gender) as Gender_group
	from stat482.survey;
QUIT;

No_obs No_Math_records No_Verbal_records Gender_group
226 216 215 2

The above code reveals three different common ways of using the COUNT() function.

  1. Count(*) is to count total number of rows in a table. COUNT() is the only function that allows you to use * as an argument.
  2. Count(column) is to count the number of non-missing values in a column. In the program, we count the number of non-missing values for math and verbal scores.
  3. Count(distinct column) is to count the total number of unique values in a column. In the above example, we count the number of gender categories.

Launch and run the SAS program, then review the output. With knowledge of some of the missing values inside the table, we are not surprised to see the first three numbers unmatched. The total number of rows in survey data is 226. The total numbers of non-missing values of math and verbal scores are 216 and 215, separately. Both numbers are less than 226, which means there are missing values in each column, and SATV has one more value missing. There are only two categories in Gender, Male and Female. So the last count is 2.