11.6 - The UNIVARIATE procedure

In this section, we take a brief look at the UNIVARIATE procedure just so we can see how its output differs from that of the MEANS and SUMMARY procedures.

Example 11.14

The following program illustrates (almost) the simplest version of the UNIVARIATE procedure. It tells SAS to perform a univariate analysis on the red blood cell count (rbc) variable in the icdb.hem2 data set:

PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of RBC';
   var rbc;
RUN;

The simplest version of the UNIVARIATE procedure would be one in which no VAR statement is present. Then, SAS would perform a univariate analysis for each numeric variable in the data set. The DATA= option merely tells SAS which data set you want to analyze. As always, if the DATA= option is absent, SAS performs the analysis on the most recently created data set. The VAR statement tells SAS to perform a univariate analysis on the variable rbc.
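
As a minimal sketch of the no-VAR form just described, the following step omits the VAR statement, so SAS analyzes every numeric variable in icdb.hem2. It assumes the icdb libref has already been assigned, as in the example above; the title text is only illustrative.

```sas
/* Assumes the icdb libref is already assigned, as in the example above */
PROC UNIVARIATE data = icdb.hem2;
   /* No VAR statement: every numeric variable in the data set is analyzed */
   title 'Univariate Analysis of All Numeric Variables';
RUN;
```

Expect one full set of output sections per numeric variable, which can get long for wide data sets.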

Launch and run the program and review the output to familiarize yourself with the kinds of summary statistics the UNIVARIATE procedure calculates. You should see five major sections in the output with the following headings: Moments, Basic Statistical Measures, Tests for Location: Mu0=0, Quantiles, and Extreme Observations. Here's what the first section, Moments, looks like:

Univariate Analysis of RBC
The UNIVARIATE Procedure
Variable: rbc

Moments
N                        635    Sum Weights               635
Mean              4.43500787    Sum Observations      2816.23
Std Deviation       0.394171    Variance           0.15537078
Skewness          0.29025297    Kurtosis           0.51198988
Uncorrected SS    12588.5073    Corrected SS        98.505075
Coeff Variation   8.88771825    Std Error Mean      0.0156422

and the fourth section:

Quantiles (Definition 5)
Quantile Estimate
100% Max 5.95
99% 5.41
95% 5.12
90% 4.92
75% Q3 4.69
50% Median 4.41
25% Q1 4.17
10% 3.97
5% 3.82
1% 3.55
0% Min 3.12

and the fifth and final section:

----Lowest----
Value Obs
3.12 218
3.33 152
3.35 227
3.47 72
3.54 365
----Highest----
Value Obs
5.59 369
5.55 33
5.62 286
5.70 517
5.95 465

With an introductory statistics course in your background, the output should be mostly self-explanatory. For example, the output tells us that the average ("Mean") red blood cell count of the 635 subjects ("N") in the data set is 4.435 with a standard deviation of 0.394. The median ("50% Median") red blood cell count is 4.41. The smallest red blood cell count in the data set is 3.12 (observation #218), while the largest is 5.95 (observation #465).
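
Several entries in the Moments table are simple functions of N, the mean, and the standard deviation. The DATA step below is a rough check, not part of the original program: it recomputes four of them from the printed values and writes the results to the log. The variable names (n_obs, xbar, sd, and so on) are made up for this illustration.

```sas
/* Recompute selected Moments entries from the printed N, Mean, and Std Deviation */
DATA _null_;
   n_obs = 635;            /* N             */
   xbar  = 4.43500787;     /* Mean          */
   sd    = 0.394171;       /* Std Deviation */

   std_err      = sd / sqrt(n_obs);         /* Std Error Mean  = s / sqrt(N)       */
   cv           = 100 * sd / xbar;          /* Coeff Variation = 100 * s / mean    */
   variance     = sd**2;                    /* Variance        = s squared         */
   corrected_ss = (n_obs - 1) * variance;   /* Corrected SS    = (N - 1) * s squared */

   put std_err= cv= variance= corrected_ss=;   /* write the results to the log */
RUN;
```

The values in the log should agree with the corresponding Moments entries up to rounding, since the inputs here are themselves rounded.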

Example 11.15

When you specify the NORMAL option, SAS will compute four different test statistics for the null hypothesis that the values of the variable specified in the VAR statement are a random sample from a normal distribution. The four test statistics calculated and presented in the output are: Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling.

When you specify the PLOT option, SAS will produce a histogram, a box plot, and a normal probability plot for each variable specified in the VAR statement. (Note that in SAS 9.4, you need to disable "ODS Graphics" to get the plots in the listing output. To do so, go to Tools > Options > Preferences, select the Results tab, and uncheck the Use ODS Graphics box. Then click OK.)

If you have a BY statement specified as well, SAS will produce each of these plots for each level of the BY statement.

The following UNIVARIATE procedure illustrates the NORMAL and PLOT options on the variable rbc of the hematology data set:

PROC UNIVARIATE data = icdb.hem2 NORMAL PLOT;
   title 'Univariate Analysis of RBC with NORMAL and PLOT Options';
   var rbc;
RUN;

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from the NORMAL and PLOT options. You should see a new section called Tests for Normality that contains the four "test for normality" test statistics and corresponding P-values:

Tests for Normality
Test                  --Statistic--       ----p Value-----
Shapiro-Wilk          W       0.992948    Pr < W       0.0044
Kolmogorov-Smirnov    D       0.033851    Pr > D       0.0771
Cramer-von Mises      W-Sq    0.145326    Pr > W-Sq    0.0279
Anderson-Darling      A-Sq    1.070646    Pr > A-Sq    0.0085

At the end of the output, you should see the histogram and box plot, as well as the normal probability plot for the rbc variable.
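
If you prefer to stay in code rather than use the Preferences dialog, one commonly used alternative is the ODS GRAPHICS OFF statement. The sketch below assumes you want the text-based plots sent to the LISTING destination. Whether that destination is open by default depends on your SAS configuration, so it is opened explicitly here.

```sas
ods listing;         /* open the LISTING destination (it may already be open) */
ods graphics off;    /* turn off ODS Graphics for the procedures that follow  */

PROC UNIVARIATE data = icdb.hem2 NORMAL PLOT;
   title 'Univariate Analysis of RBC with NORMAL and PLOT Options';
   var rbc;
RUN;

ods graphics on;     /* restore ODS Graphics afterwards, if desired */
```

Unlike the menu setting, these statements affect only the current session from the point where they are submitted.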

Example 11.16

When you use the UNIVARIATE procedure's ID statement, SAS labels the five largest and five smallest observations with the values of the variable specified in the ID statement, in addition to the (usually meaningless) observation number. The following UNIVARIATE procedure uses the subject number (subj) to identify the extreme values of red blood cell count (rbc):

PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of RBC with ID Option';
   var rbc;
   id subj;
RUN;

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from using the ID statement. In Example 11.14, the UNIVARIATE output indicated that observation #218 has the smallest red blood cell count in the data set (3.12), while observation #465 has the largest (5.95). Now, because the subject number is used as an ID variable ("id subj"):

------Lowest------
Value SUBJ Obs
3.12 220007 218
3.33 210057 152
3.35 220021 227
3.47 110134 72
3.54 410059 365
------Highest------
Value SUBJ Obs
5.59 410063 369
5.55 110086 33
5.62 310092 286
5.70 510026 517
5.95 420074 465

SAS reports the more helpful information that subject 220007 has the smallest red blood cell count, while subject 420074 has the largest.

You shouldn't be surprised to learn that the UNIVARIATE procedure can do much more than what we can address now. Just as the BY statement can be used in the MEANS and SUMMARY procedures to categorize the observations in the input data set into subgroups, so can a BY statement be used in the UNIVARIATE procedure. And, just as an OUTPUT statement can be used in the MEANS and SUMMARY procedures to create summarized data sets, so can an OUTPUT statement be used in the UNIVARIATE procedure. For more information about the functionality and syntax of the UNIVARIATE procedure, see the SAS Help and Documentation.
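
As a rough sketch of the BY and OUTPUT statements mentioned above, the following program summarizes rbc within subgroups and saves the results to a data set. The grouping variable hosp and the data set names hem2_sorted and rbc_stats are hypothetical, chosen only for illustration, so substitute a variable that actually exists in your data. As with the MEANS and SUMMARY procedures, the input must be sorted by the BY variable first.

```sas
/* hosp is a hypothetical grouping variable; replace it with one in your data */
PROC SORT data = icdb.hem2 out = hem2_sorted;
   by hosp;
RUN;

/* NOPRINT suppresses the usual printed output; only the data set is created */
PROC UNIVARIATE data = hem2_sorted noprint;
   var rbc;
   by hosp;
   output out = rbc_stats
          n = n_rbc  mean = mean_rbc  median = median_rbc  std = sd_rbc;
RUN;

/* One row per BY group, with the requested statistics as columns */
PROC PRINT data = rbc_stats;
   title 'RBC Summary Statistics by Group';
RUN;
```

Dropping the NOPRINT option would also print the full univariate output for each BY group.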