11.6 - The UNIVARIATE procedure
In this section, we take a brief look at the UNIVARIATE procedure just so we can see how its output differs from that of the MEANS and SUMMARY procedures.
Example 11.14
The following program illustrates (almost) the simplest version of the UNIVARIATE procedure. It tells SAS to perform a univariate analysis on the red blood cell count (rbc) variable in the icdb.hem2 data set:
PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of RBC';
   var rbc;
RUN;
The DATA= option tells SAS which data set to analyze. As always, if the DATA= option is absent, SAS performs the analysis on the most recently created data set. The VAR statement tells SAS to perform a univariate analysis on the variable rbc. The simplest version of the UNIVARIATE procedure would be one in which no VAR statement is present; SAS would then perform a univariate analysis for each numeric variable in the data set.
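As a minimal sketch of the case with no VAR statement, the program below runs the UNIVARIATE procedure on sashelp.class, a small sample data set that ships with SAS. That data set is an assumption made here so the sketch runs without the icdb library; it is not part of this lesson's data. Because there is no VAR statement, SAS should produce a separate analysis for each numeric variable in the data set (Age, Height, and Weight):

```sas
/* Assumption: sashelp.class (a SAS-supplied sample data set)
   stands in for icdb.hem2 so that no libref is needed */
PROC UNIVARIATE data = sashelp.class;
   title 'Univariate Analysis of All Numeric Variables';
   /* No VAR statement: every numeric variable is analyzed */
RUN;
```

Adding a VAR statement, such as "var height;", would restrict the output to that one variable.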
Launch and run the program and review the output to familiarize yourself with the kinds of summary statistics the UNIVARIATE procedure calculates. You should see five major sections in the output with the following headings: Moments, Basic Statistical Measures, Tests for Location: Mu0=0, Quantiles, and Extreme Observations. Here's what the first section, Moments, looks like:
Moment | Value | Moment | Value |
---|---|---|---|
N | 635 | Sum Weights | 635 |
Mean | 4.43500787 | Sum Observations | 2816.23 |
Std Deviation | 0.394171 | Variance | 0.15537078 |
Skewness | 0.29025297 | Kurtosis | 0.51198988 |
Uncorrected SS | 12588.5073 | Corrected SS | 98.505075 |
Coeff Variation | 8.88771825 | Std Error Mean | 0.0156422 |
and the fourth section:
Quantile | Estimate |
---|---|
100% Max | 5.95 |
99% | 5.41 |
95% | 5.12 |
90% | 4.92 |
75% Q3 | 4.69 |
50% Median | 4.41 |
25% Q1 | 4.17 |
10% | 3.97 |
5% | 3.82 |
1% | 3.55 |
0% Min | 3.12 |
and the fifth and final section:
Lowest Value | Obs |
---|---|
3.12 | 218 |
3.33 | 152 |
3.35 | 227 |
3.47 | 72 |
3.54 | 365 |
Highest Value | Obs |
---|---|
5.59 | 369 |
5.55 | 33 |
5.62 | 286 |
5.70 | 517 |
5.95 | 465 |
With an introductory statistics course in your background, the output should be mostly self-explanatory. For example, the output tells us that the average ("Mean") red blood cell count of the 635 subjects ("N") in the data set is 4.435 with a standard deviation of 0.394. The median ("50% Median") red blood cell count is 4.41. The smallest red blood cell count in the data set is 3.12 (observation #218), while the largest is 5.95 (observation #465).
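Several of the Moments entries are simple functions of one another, which you can check by hand. The DATA step below is a sketch of that check: it copies N, Sum Observations, and Corrected SS from the output above and recomputes the other quantities using the standard textbook formulas. The variable names are made up for illustration:

```sas
DATA _NULL_;
   /* Values copied from the Moments section of the output above */
   nobs      = 635;          /* N */
   total     = 2816.23;      /* Sum Observations */
   corr_ss   = 98.505075;    /* Corrected SS */

   mean_rbc  = total / nobs;               /* Mean */
   var_rbc   = corr_ss / (nobs - 1);       /* Variance */
   sd_rbc    = sqrt(var_rbc);              /* Std Deviation */
   se_mean   = sd_rbc / sqrt(nobs);        /* Std Error Mean */
   cv_rbc    = 100 * sd_rbc / mean_rbc;    /* Coeff Variation (a percentage) */

   /* Write the results to the log for comparison with the output */
   put mean_rbc= var_rbc= sd_rbc= se_mean= cv_rbc=;
RUN;
```

The values written to the log should agree with the corresponding Moments entries up to rounding.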
Example 11.15
When you specify the NORMAL option, SAS will compute four different test statistics for the null hypothesis that the values of the variable specified in the VAR statement are a random sample from a normal distribution. The four test statistics calculated and presented in the output are Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling.
When you specify the PLOT option, SAS will produce a histogram, a box plot, and a normal probability plot for each variable specified in the VAR statement. (Note that in SAS 9.4, you need to disable ODS Graphics to get the plots in the listing output. To do so, go to the menu Tools > Options > Preferences, and uncheck the Use ODS Graphics box under the Results tab. Click OK.)
If you have a BY statement specified as well, SAS will produce each of these plots for each level of the BY statement.
The following UNIVARIATE procedure illustrates the NORMAL and PLOT options on the variable rbc of the hematology data set:
PROC UNIVARIATE data = icdb.hem2 NORMAL PLOT;
   title 'Univariate Analysis of RBC with NORMAL and PLOT Options';
   var rbc;
RUN;
Launch and run the SAS program. Review the output to familiarize yourself with the changes in the UNIVARIATE output that arise from the NORMAL and PLOT options. You should see a new section called Tests for Normality that contains the four normality test statistics and their corresponding p-values:
Test | Statistic | | p Value | |
---|---|---|---|---|
Shapiro-Wilk | W | 0.992948 | Pr < W | 0.0044 |
Kolmogorov-Smirnov | D | 0.033851 | Pr > D | 0.0771 |
Cramer-von Mises | W-Sq | 0.145326 | Pr > W-Sq | 0.0279 |
Anderson-Darling | A-Sq | 1.070646 | Pr > A-Sq | 0.0085 |
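If only the normality tests are of interest, one option is to limit the displayed output with an ODS SELECT statement. The program below is a sketch rather than part of this lesson's examples: it assumes sashelp.class (a sample data set that ships with SAS) in place of icdb.hem2, and it relies on TestsForNormality being the ODS table name for this section of the UNIVARIATE output:

```sas
/* Assumption: sashelp.class stands in for icdb.hem2 */

/* Display only the Tests for Normality section of the next procedure */
ODS SELECT TestsForNormality;

PROC UNIVARIATE data = sashelp.class NORMAL;
   title 'Normality Tests Only';
   var height;
RUN;
```

Each p-value is then compared with your chosen significance level. In the rbc output above, three of the four p-values fall below 0.05, while the Kolmogorov-Smirnov p-value (0.0771) does not, so the four tests do not all lead to the same conclusion at that level.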
At the end of the output, you should see the histogram and box plot:
                       Histogram                   #     Boxplot
5.9+*                                              1        0
   .*                                              2        0
   .**                                             6        0
   .*******                                       19        |
   .*******                                       21        |
   .*******************                           57        |
   .**********************************           101     +-----+
4.5+******************************************   125     *--+--*
   .*******************************************  128     |     |
   .**********************************           100     +-----+
   .*****************                             51        |
   .*****                                         15        |
   .**                                             6        |
   .*                                              2        0
3.1+*                                              1        0
    ----+----+----+----+----+----+----+----+---
    * may represent up to 3 counts
as well as the normal probability plot for the rbc variable:
The UNIVARIATE Procedure
Variable: rbc
Normal Probability Plot: a line-printer plot of rbc (vertical axis, 3.1 to 5.9) against standard normal quantiles (horizontal axis, -2 to +2), in which asterisks mark the data values and plus signs mark the reference line.
Example 11.16
When you use the UNIVARIATE procedure's ID statement, SAS uses the values of the variable specified in the ID statement to label the five largest and five smallest observations, in addition to the (usually meaningless) observation number. The following UNIVARIATE procedure uses the subject number (subj) to identify the extreme values of red blood cell count (rbc):
PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of RBC with ID Option';
   var rbc;
   id subj;
RUN;
Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from using the ID statement. In Example 11.14, the UNIVARIATE output indicated that observation #218 has the smallest red blood cell count in the data set (3.12), while observation #465 has the largest (5.95). Now, because of the use of the subject number as an ID variable ("id subj"):
Lowest Value | SUBJ | Obs |
---|---|---|
3.12 | 220007 | 218 |
3.33 | 210057 | 152 |
3.35 | 220021 | 227 |
3.47 | 110134 | 72 |
3.54 | 410059 | 365 |
Highest Value | SUBJ | Obs |
---|---|---|
5.59 | 410063 | 369 |
5.55 | 110086 | 33 |
5.62 | 310092 | 286 |
5.70 | 510026 | 517 |
5.95 | 420074 | 465 |
SAS reports the more helpful information that subject 220007 has the smallest red blood cell count, while subject 420074 has the largest.
You shouldn't be surprised to learn that the UNIVARIATE procedure can do much more than what we can address now. Just as the BY statement can be used in the MEANS and SUMMARY procedures to categorize the observations in the input data set into subgroups, so can a BY statement be used in the UNIVARIATE procedure. And, just as an OUTPUT statement can be used in the MEANS and SUMMARY procedures to create summarized data sets, so can an OUTPUT statement be used in the UNIVARIATE procedure. For more information about the functionality and syntax of the UNIVARIATE procedure, see the SAS Help and Documentation.
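As a rough sketch of how the BY and OUTPUT statements might be combined, the program below summarizes one variable within subgroups and saves the results to a data set. It assumes sashelp.class (a sample data set that ships with SAS) in place of this lesson's data, and the output data set and variable names (work.height_stats, n_height, and so on) are made up for illustration:

```sas
/* Assumption: sashelp.class stands in for the lesson's data.
   A BY statement requires the data to be sorted by the BY variable. */
PROC SORT data = sashelp.class out = work.class_sorted;
   by sex;
RUN;

/* NOPRINT suppresses the usual printed output;
   OUTPUT saves the requested statistics, one observation per BY group */
PROC UNIVARIATE data = work.class_sorted NOPRINT;
   by sex;
   var height;
   output out = work.height_stats
          n = n_height mean = mean_height
          std = sd_height median = median_height;
RUN;

PROC PRINT data = work.height_stats;
   title 'Height Summary by Sex from PROC UNIVARIATE';
RUN;
```

The printed data set should contain one row for each value of sex, holding the BY variable and the four requested statistics.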