11.6 - The UNIVARIATE procedure

In this section, we take a brief look at the UNIVARIATE procedure just so we can see how its output differs from that of the MEANS and SUMMARY procedures.

Example 11.14

The following program illustrates (almost) the simplest version of the UNIVARIATE procedure. It tells SAS to perform a univariate analysis on the red blood cell count (rbc) variable in the icdb.hem2 data set:

PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of RBC';
   var rbc;
RUN;

The simplest version of the UNIVARIATE procedure would be one in which no VAR statement is present. Then, SAS would perform a univariate analysis for each numeric variable in the data set. The DATA= option merely tells SAS on which data set you want to do a univariate analysis. As always, if the DATA= option is absent, SAS performs the analysis on the most recently created data set. The VAR statement tells SAS to perform a univariate analysis on the variable rbc.
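To make the "simplest version" described above concrete, here is a minimal sketch with the VAR statement dropped. It assumes, as in Example 11.14, that the icdb libref has already been assigned; the title text is only illustrative. Because every numeric variable in the data set is analyzed, including any numeric identifier variables, the output can be long.

```sas
* No VAR statement: every numeric variable in icdb.hem2 is analyzed;
PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of All Numeric Variables';
RUN;
```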

Launch and run the Example 11.14 program and review the output to familiarize yourself with the kinds of summary statistics the UNIVARIATE procedure calculates. You should see five major sections in the output with the following headings: Moments, Basic Statistical Measures, Tests for Location: Mu0=0, Quantiles, and Extreme Observations. Here's what the first section of the output, Moments, looks like:

Univariate Analysis of RBC
The UNIVARIATE Procedure
Variable: rbc

Moments
N                       635    Sum Weights              635
Mean             4.43500787    Sum Observations     2816.23
Std Deviation      0.394171    Variance          0.15537078
Skewness         0.29025297    Kurtosis          0.51198988
Uncorrected SS   12588.5073    Corrected SS       98.505075
Coeff Variation  8.88771825    Std Error Mean     0.0156422

and the fourth section:

Quantiles (Definition 5)
Quantile      Estimate
100% Max          5.95
99%               5.41
95%               5.12
90%               4.92
75% Q3            4.69
50% Median        4.41
25% Q1            4.17
10%               3.97
5%                3.82
1%                3.55
0% Min            3.12

and the fifth and final section:

----Lowest----      ----Highest----
Value      Obs      Value       Obs
 3.12      218       5.59       369
 3.33      152       5.55        33
 3.35      227       5.62       286
 3.47       72       5.70       517
 3.54      365       5.95       465

With an introductory statistics course in your background, the output should be mostly self-explanatory. For example, the output tells us that the average ("Mean") red blood cell count of the 635 subjects ("N") in the data set is 4.435 with a standard deviation of 0.394. The median ("50% Median") red blood cell count is 4.41. The smallest red blood cell count in the data set is 3.12 (observation #218), while the largest is 5.95 (observation #465).
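For comparison with the MEANS procedure, the sketch below requests the same handful of statistics quoted above. It assumes the icdb libref is assigned as in Example 11.14. The statistic keywords (N, MEAN, STD, MEDIAN, MIN, MAX) and the MAXDEC= option are standard PROC MEANS syntax, and the title is illustrative.

```sas
* Request only the statistics discussed above, rounded to three decimals;
PROC MEANS data = icdb.hem2 n mean std median min max maxdec = 3;
   title 'RBC Summary from PROC MEANS for Comparison';
   var rbc;
RUN;
```

The printed MEANS output reports the minimum and maximum, but unlike the UNIVARIATE output it does not show which observations they come from.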

Example 11.15

When you specify the NORMAL option, SAS will compute four different test statistics for the null hypothesis that the values of the variable specified in the VAR statement are a random sample from a normal distribution. The four test statistics calculated and presented in the output are Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling.

When you specify the PLOT option, SAS will produce a histogram, a box plot, and a normal probability plot for each variable specified in the VAR statement. (Note that in SAS 9.4, you need to disable "ODS Graphics" to get the plots in the listing output. To do so, go to the menu Tools > Options > Preferences and, under the Results tab, uncheck the Use ODS Graphics box. Click OK.)

If you have a BY statement specified as well, SAS will produce each of these plots for each level of the BY variable.

The following UNIVARIATE procedure illustrates the NORMAL and PLOT options on the variable rbc of the hematology data set:

PROC UNIVARIATE data = icdb.hem2 NORMAL PLOT;
   title 'Univariate Analysis of RBC with NORMAL and PLOT Options';
   var rbc;
RUN;

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from the NORMAL and PLOT options. You should see a new section called Tests for Normality that contains the four "test for normality" test statistics and corresponding P-values:

Tests for Normality
Test                  --Statistic--     ----p Value-----
Shapiro-Wilk          W     0.992948    Pr < W      0.0044
Kolmogorov-Smirnov    D     0.033851    Pr > D      0.0771
Cramer-von Mises      W-Sq  0.145326    Pr > W-Sq   0.0279
Anderson-Darling      A-Sq  1.070646    Pr > A-Sq   0.0085

At the end of the output, you should see the histogram and box plot:

                         Histogram                       #             Boxplot  
     5.9+*                                               1                0     
        .*                                               2                0     
        .**                                              6                0     
        .*******                                        19                |     
        .*******                                        21                |     
        .*******************                            57                |     
        .**********************************            101             +-----+  
     4.5+******************************************    125             *--+--*  
        .*******************************************   128             |     |  
        .**********************************            100             +-----+  
        .*****************                              51                |     
        .*****                                          15                |     
        .**                                              6                |     
        .*                                               2                0     
     3.1+*                                               1                0     
         ----+----+----+----+----+----+----+----+---                            
         * may represent up to 3 counts                                         
                                                                                
                                                                                

as well as the normal probability plot for the rbc variable:

The UNIVARIATE Procedure
Variable: rbc

                                 Normal Probability Plot                        
               5.9+                                                  *          
                  |                                                  *          
                  |                                               ****          
                  |                                          ******++           
                  |                                       ****++                
                  |                                   *****                     
                  |                              ******                         
               4.5+                        ******                               
                  |                   ******                                    
                  |             *******                                         
                  |       *******                                               
                  |   *****+                                                    
                  |****                                                         
                  |*                                                            
               3.1+*                                                            
                   +----+----+----+----+----+----+----+----+----+----+          
                       -2        -1         0        +1        +2               
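The text-based plots above depend on ODS Graphics being disabled. As an alternative to the Preferences dialog, the following sketch turns ODS Graphics off in code with the ODS GRAPHICS statement before running the procedure. It assumes the same icdb.hem2 data set. Whether the plots appear exactly as shown may depend on your SAS version and output destination.

```sas
* Turn ODS Graphics off so that PLOT falls back to text-based plots;
ODS GRAPHICS OFF;

PROC UNIVARIATE data = icdb.hem2 NORMAL PLOT;
   title 'Univariate Analysis of RBC with NORMAL and PLOT Options';
   var rbc;
RUN;

* Optionally turn ODS Graphics back on for later procedures;
ODS GRAPHICS ON;
```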
                                                                                
                                                                                

Example 11.16

When you use the UNIVARIATE procedure's ID statement, SAS uses the values of the variable specified in the ID statement to indicate the five largest and five smallest observations rather than the (usually meaningless) observation number. The following UNIVARIATE procedure uses the subject number (subj) to indicate extreme values of red blood cell count (rbc):

PROC UNIVARIATE data = icdb.hem2;
   title 'Univariate Analysis of RBC with ID Option';
   var rbc;
   id subj;
RUN;

Launch and run the SAS program. Review the output to familiarize yourself with the change in the UNIVARIATE output that arises from using the ID statement. In Example 11.14, the UNIVARIATE output indicated that observation #218 has the smallest red blood cell count in the data set (3.12), while observation #465 has the largest (5.95). Now, because the subject number is used as an ID variable ("id subj"), the Extreme Observations section also lists the subject number for each extreme value:

------Lowest------            ------Highest------
Value     SUBJ    Obs         Value     SUBJ    Obs
 3.12   220007    218          5.59   410063    369
 3.33   210057    152          5.55   110086     33
 3.35   220021    227          5.62   310092    286
 3.47   110134     72          5.70   510026    517
 3.54   410059    365          5.95   420074    465

SAS reports the more helpful information that subject 220007 has the smallest red blood cell count, while subject 420074 has the largest.

You shouldn't be surprised to learn that the UNIVARIATE procedure can do much more than what we can address now. Just as the BY statement can be used in the MEANS and SUMMARY procedures to categorize the observations in the input data set into subgroups, so can a BY statement be used in the UNIVARIATE procedure. And, just as an OUTPUT statement can be used in the MEANS and SUMMARY procedures to create summarized data sets, so can an OUTPUT statement be used in the UNIVARIATE procedure. For more information about the functionality and syntax of the UNIVARIATE procedure, see the SAS Help and Documentation.
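As a rough sketch of the BY and OUTPUT statements mentioned above, the program below summarizes rbc within subgroups and saves the results to a data set. The grouping variable hosp and the names hem2_sorted, rbc_summary, n_rbc, mean_rbc, median_rbc, and std_rbc are hypothetical, chosen only for illustration. Substitute a grouping variable that actually exists in your copy of icdb.hem2.

```sas
* Sort by the (assumed) grouping variable, since BY-group processing expects sorted data;
PROC SORT data = icdb.hem2 out = hem2_sorted;
   by hosp;
RUN;

* NOPRINT suppresses the printed output, leaving only the summary data set;
PROC UNIVARIATE data = hem2_sorted NOPRINT;
   by hosp;
   var rbc;
   output out = rbc_summary
          n = n_rbc  mean = mean_rbc  median = median_rbc  std = std_rbc;
RUN;

* Print the summary data set: one row per BY group;
PROC PRINT data = rbc_summary;
   title 'RBC Summary Statistics by Group';
RUN;
```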

