32.5 - Summarizing and Grouping Data32.5 - Summarizing and Grouping Data
In previous sections, we use the SQL procedure to generate detailed reports. Sometimes, the summarized report is also necessary for us to explore data. To do that, we are going to need summary functions and/or the GROUP BY clause in PROC SQL.
Many summary functions that are used in other SAS steps can also work well in PROC SQL. Below is the table of summary functions you can request:
|AVG, MEAN||mean or average of values|
|COUNT, FREQ, N||number of non-missing values|
|CSS||corrected sum of squares|
|CV||coefficient of variation(percent)|
|NMISS||number of missing values|
|PRT||probability of a greater absolute value of student's t|
|RANGE||range of values|
|STDERR||standard error of the mean|
|SUM||sum of values|
|T||student's t value for testing hypothesis|
|USS||uncorrected sum of squares|
Note: some functions have multiple names. The first listed is the SQL name.
Next we will work through examples to see how these functions perform calculations in PROC SQL. Along the way, the GROUP BY clause will be introduced and work with the functions.
In this section, the data from survey.sas7bdat will be used for all examples. First, make sure that the file is downloaded to your own computer and revise the libname as needed.
The following program uses the AVG() function to calculate the mean scores of SAT math and verbal test:
PROC SQL; select avg(SATM) as average_Math, avg(SATV) as average_Verbal from stat482.survey; QUIT
First launch and run the SAS program. When checking the output you will see two overall average scores have been calculated for SATM and SATV separately. There is only one observation in the output window.
Let’s review the function in the code. To calculate average, either MEAN() or AVG() can be used in this case. Note that there is only one argument (column) inside the function AVG(). So the statistic is calculated across all rows for one column.
AVG(STAM)—the overall average score of SATM
AVG(STAV)—the overall average score of SATV
Quite simple, right? Let’s add one more argument into the function. Can you guess how many observations will be in the output?
In the following program, two columns are the arguments of the function MEAN():
PROC SQL; select mean(SATM, SATV) as average from stat482.survey; QUIT;
We changed the program a little bit. Both SATM and SATV are put inside the function as arguments. Launch and run the SAS program. You will see there are 226 observations, which is the same as in the original survey data.
If you add more than one column as arguments of summary functions, SAS will perform the calculation across the columns for each row to generate the above output.
In this case, the summary function is not performing aggregation anymore. SAS then looks for a like-named function in BASE SAS. If yes, the calculation will be performed for each row; if not, an error message will be output in the log window. You can try to change MEAN() to AVG() to see what will happen.
ERROR: Function AVG could not be located.
The following program uses only one argument for MEAN(), but add one more column in the SELECT clause:
PROC SQL; select Gender, mean(SATM) as average_Math from stat482.survey; QUIT;
In the above program, the the SELECT statement changed again. This time, only one argument is for the MEAN() function to calculate the overall average score of SAT math grades. Outside the function, another column has been selected as well. What output will it produce?
Launch and run the SAS program. You may be surprised that the output contains 226 rows. Review the output you will see two things that have been done by the above code:
- It calculated the overall average math score for all students.
- It displayed all rows in the output because Gender is not an argument of MEAN() function.
Note that the overall average math score is just repeated for each row. You can find a message like the one below in the log window. When you submit such a program, SAS calculate the statistic first. Then merge it back with other columns. That’s how “remerging” happens.
NOTE: The query requires remerging summary statistics back with the original data
The above result is not what we wanted. Now, let’s see how to use the GROUP BY clause to make it reasonable.
The following example calculates the average SAT math score for each gender group:
PROC SQL; select Gender, mean(SATM) as average_Math from stat482.survey group by Gender; QUIT;
The above program seems identical to the program in the previous example except for one more clause: GROUP BY. Finally, we get it right and obtain the desired result: the average SAT math scores for female and male students. Of course, you can make further use of GROUP BY by adding multiple columns. Let’s find out with the next example.
The following program uses both Gender and SmokeCigarettes in the GROUP BY clause to calculate the average SAT math scores:
PROC SQL; select Gender, SmokeCigarettes, mean(SATM) as average_Math from stat482.survey group by 1, 2; QUIT;
Launch and run the SAS program, then review the output. As you can see, the average math scores are calculated for each smoking group (Yes or No) inside each gender group (Female or Male).
Just one more thing about this program, the columns can also be referred to by their locations in the SELECT clause as in the WHERE clause. Here, 1 and 2 are used to refer to Gender and SmokeCigarettes.
Next, we will pay attention to one special summary function in SQL, which is COUNT(). You can use the COUN() function to count the non-missing values.
The following example count the number of rows in survey data, the number of non-missing records for math and verbal test scores, and the distinct values of gender:
PROC SQL; select count(*) as No_obs, count(SATM) as No_Math_records, count(SATV) as No_Verbal_records, count(distinct Gender) as Gender_group from stat482.survey; QUIT;
The above code reveals three different common ways of using the COUNT() function.
- Count(*) is to count total number of rows in a table. COUNT() is the only function that allows you to use * as an argument.
- Count(column) is to count the number of non-missing values in a column. In the program, we count the number of non-missing values for math and verbal scores.
- Count(distinct column) is to count the total number of unique values in a column. In the above example, we count the number of gender categories.
Launch and run the SAS program, then review the output. With knowledge of some of the missing values inside the table, we are not surprised to see the first three numbers unmatched. The total number of rows in survey data is 226. The total numbers of non-missing values of math and verbal scores are 216 and 215, separately. Both numbers are less than 226, which means there are missing values in each column, and SATV has one more value missing. There are only two categories in Gender, Male and Female. So the last count is 2.