11.6 - The UNIVARIATE procedure

In this section, we take a brief look at the UNIVARIATE procedure just so we can see how its output differs from that of the MEANS and SUMMARY procedures.

Example 11.14

The following program illustrates (almost) the simplest version of the UNIVARIATE procedure. It tells SAS to perform a univariate analysis on the red blood cell count (rbc) variable in the icdb.hem2 data set:

PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of RBC';
   var rbc;
RUN;

The simplest version of the UNIVARIATE procedure would be one with no VAR statement, in which case SAS performs a univariate analysis on every numeric variable in the data set. The DATA= option tells SAS which data set to analyze. As always, if the DATA= option is absent, SAS analyzes the most recently created data set. Here, the VAR statement restricts the analysis to the variable rbc.
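As a minimal sketch of that simplest form, the program below omits the VAR statement, so SAS should analyze every numeric variable in icdb.hem2. The title text is our own choice, and the sketch assumes the icdb library has already been assigned, as in the program above.

```sas
/* Sketch: no VAR statement, so every numeric variable is analyzed. */
/* Assumes the icdb libref is already assigned.                     */
PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of All Numeric Variables';
RUN;
```

Expect one full set of output sections per numeric variable, which can get long for a wide data set.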

Launch and run the program, and review the output to familiarize yourself with the kinds of summary statistics the UNIVARIATE procedure calculates. You should see five major sections in the output with the following headings: Moments, Basic Statistical Measures, Tests for Location Mu0 = 0, Quantiles, and Extreme Observations. Here's what the first section, Moments, looks like:

Univariate Analysis of RBC
The UNIVARIATE Procedure
Variable: rbc
Moments
N	635	Sum Weights	635
Mean	4.43500787	Sum Observations	2816.23
Std Deviation	0.394171	Variance	0.15537078
Skewness	0.29025297	Kurtosis	0.51198988
Uncorrected SS	12588.5073	Corrected SS	98.505075
Coeff Variation	8.88771825	Std Error Mean	0.0156422

and the fourth section:

Quantiles (Definition 5)
Quantile	Estimate
100% Max	5.95
99%	5.41
95%	5.12
90%	4.92
75% Q3	4.69
50% Median	4.41
25% Q1	4.17
10%	3.97
5%	3.82
1%	3.55
0% Min	3.12

and the fifth and final section:

----Lowest----
Value	Obs
3.12	218
3.33	152
3.35	227
3.47	72
3.54	365

----Highest----
Value	Obs
5.59	369
5.55	33
5.62	286
5.70	517
5.95	465

With an introductory statistics course in your background, the output should be mostly self-explanatory. For example, the output tells us that the average ("Mean") red blood cell count of the 635 subjects ("N") in the data set is 4.435 with a standard deviation of 0.394. The median ("50% Median") red blood cell count is 4.41. The smallest red blood cell count in the data set is 3.12 (observation #218), while the largest is 5.95 (observation #465).
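Several entries in the Moments table are simple functions of N, Mean, and Std Deviation. The following DATA step is only an illustrative check, not part of the original example: it recomputes the variance, coefficient of variation, and standard error of the mean from the values reported above. The variable names (n, mean, std, se_mean, and so on) are our own.

```sas
/* Sketch: recompute three Moments entries from the reported values. */
DATA _NULL_;
   n    = 635;           /* N from the Moments table             */
   mean = 4.43500787;    /* Mean from the Moments table          */
   std  = 0.394171;      /* Std Deviation from the Moments table */

   variance = std**2;            /* Variance = (Std Deviation)**2          */
   cv       = 100 * std / mean;  /* Coeff Variation, as a percentage       */
   se_mean  = std / sqrt(n);     /* Std Error Mean = Std Deviation/sqrt(N) */

   put variance= cv= se_mean=;   /* write the results to the log */
RUN;
```

The values written to the log should agree, up to rounding, with the Variance, Coeff Variation, and Std Error Mean entries in the table.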

Example 11.15

When you specify the NORMAL option, SAS will compute four different test statistics for the null hypothesis that the values of the variable specified in the VAR statement are a random sample from a normal distribution. The four test statistics calculated and presented in the output are Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling.

When you specify the PLOT option, SAS will produce a histogram, a box plot, and a normal probability plot for each variable specified in the VAR statement. (Note that in SAS 9.4, you need to disable ODS Graphics to get these text-based plots in the listing output. To do so, go to Tools > Options > Preferences, select the Results tab, uncheck the Use ODS Graphics box, and click OK.)
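If you would rather not change the preference through the menus, the ODS GRAPHICS statement offers a programmatic route. The sketch below assumes that turning ODS Graphics off in code has the same effect on the PLOT output as unchecking the box:

```sas
/* Sketch: switch ODS Graphics off in code rather than via the menus. */
ODS GRAPHICS OFF;

PROC UNIVARIATE data = icdb.hem2 PLOT;
   var rbc;
RUN;

ODS GRAPHICS ON;    /* restore ODS Graphics afterwards, if desired */
```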

If you also specify a BY statement, SAS will produce each of these plots separately for each BY group.
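As a rough sketch of BY-group processing, the program below assumes that the data set contains a grouping variable named hosp (a hypothetical name used here for illustration; substitute whatever grouping variable your data set has). Because a BY statement expects the data to be sorted by the BY variable, the sketch sorts first into a temporary data set that we have called hem2sorted.

```sas
/* Sketch: hosp is an assumed grouping variable; hem2sorted is our own name. */
PROC SORT data = icdb.hem2 out = hem2sorted;
   by hosp;
RUN;

PROC UNIVARIATE data = hem2sorted PLOT;
   title 'Univariate Analysis of RBC by Hospital';
   by hosp;     /* one full set of output and plots per value of hosp */
   var rbc;
RUN;
```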

The following UNIVARIATE procedure illustrates the NORMAL and PLOT options on the variable rbc of the hematology data set:

PROC UNIVARIATE data = icdb.hem2 NORMAL PLOT;
   title 'Univariate Analysis of RBC with NORMAL and PLOT Options';
   var rbc;
RUN;

Launch and run the SAS program. Review the output to familiarize yourself with the changes in the UNIVARIATE output that arise from the NORMAL and PLOT options. You should see a new section called Tests for Normality that contains the four test statistics and their corresponding p-values:

Tests for Normality
Test	--Statistic--		----p Value-----
Shapiro-Wilk	W	0.992948	Pr < W	0.0044
Kolmogorov-Smirnov	D	0.033851	Pr > D	0.0771
Cramer-von Mises	W-Sq	0.145326	Pr > W-Sq	0.0279
Anderson-Darling	A-Sq	1.070646	Pr > A-Sq	0.0085

At the end of the output, you should see the histogram and box plot:

                         Histogram                       #             Boxplot  
     5.9+*                                               1                0     
        .*                                               2                0     
        .**                                              6                0     
        .*******                                        19                |     
        .*******                                        21                |     
        .*******************                            57                |     
        .**********************************            101             +-----+  
     4.5+******************************************    125             *--+--*  
        .*******************************************   128             |     |  
        .**********************************            100             +-----+  
        .*****************                              51                |     
        .*****                                          15                |     
        .**                                              6                |     
        .*                                               2                0     
     3.1+*                                               1                0     
         ----+----+----+----+----+----+----+----+---                            
         * may represent up to 3 counts

as well as the normal probability plot for the rbc variable:

The UNIVARIATE Procedure
Variable: rbc

                                 Normal Probability Plot                        
               5.9+                                                  *          
                  |                                                  *          
                  |                                               ****          
                  |                                          ******++           
                  |                                       ****++                
                  |                                   *****                     
                  |                              ******                         
               4.5+                        ******                               
                  |                   ******                                    
                  |             *******                                         
                  |       *******                                               
                  |   *****+                                                    
                  |****                                                         
                  |*                                                            
               3.1+*                                                            
                   +----+----+----+----+----+----+----+----+----+----+          
                       -2        -1         0        +1        +2

Example 11.16

When you use the UNIVARIATE procedure's ID statement, SAS uses the values of the variable specified in the ID statement to indicate the five largest and five smallest observations rather than the (usually meaningless) observation number. The following UNIVARIATE procedure uses the subject number (subj) to indicate extreme values of red blood cell count (rbc):

PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of RBC with ID Option';
   var rbc;
   id subj;
RUN;

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from using the ID statement. In Example 11.14, the UNIVARIATE output indicated that observation #218 has the smallest red blood cell count in the data set (3.12), while observation #465 has the largest (5.95). Now, because of the use of the subject number as an ID variable ("id subj"):

------Lowest------
Value	SUBJ	Obs
3.12	220007	218
3.33	210057	152
3.35	220021	227
3.47	110134	72
3.54	410059	365

------Highest------
Value	SUBJ	Obs
5.59	410063	369
5.55	110086	33
5.62	310092	286
5.70	510026	517
5.95	420074	465

SAS reports the more helpful information that subject 220007 has the smallest red blood cell count, while subject 420074 has the largest.

You shouldn't be surprised to learn that the UNIVARIATE procedure can do much more than what we can address now. Just as the BY statement can be used in the MEANS and SUMMARY procedures to categorize the observations in the input data set into subgroups, so can a BY statement be used in the UNIVARIATE procedure. And, just as an OUTPUT statement can be used in the MEANS and SUMMARY procedures to create summarized data sets, so can an OUTPUT statement be used in the UNIVARIATE procedure. For more information about the functionality and syntax of the UNIVARIATE procedure, see the SAS Help and Documentation.
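For instance, a minimal sketch of the OUTPUT statement might look like the following. The output data set name (rbcstats) and the variable names on the right-hand side of each keyword are our own choices. The NOPRINT option suppresses the usual printed output, so only the summary data set is created.

```sas
/* Sketch: save selected statistics for rbc in a one-observation data set. */
PROC UNIVARIATE data = icdb.hem2 NOPRINT;
   var rbc;
   output out = rbcstats
          n = n_rbc  mean = mean_rbc  std = std_rbc  median = median_rbc;
RUN;

PROC PRINT data = rbcstats;
   title 'Summary Statistics for RBC Saved by the OUTPUT Statement';
RUN;
```

Adding a BY statement to a program like this would yield one observation per BY group rather than a single observation.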
