1.4 - Example: Descriptive Statistics

Example 1-5: Women's Health Survey (Descriptive Statistics)

Let us take a look at an example. In 1985, the USDA commissioned a study of women’s nutrition. Nutrient intake was measured for a random sample of 737 women aged 25-50 years. The following variables were measured:

  • Calcium (mg)
  • Iron (mg)
  • Protein (g)
  • Vitamin A (μg)
  • Vitamin C (mg)

Using Technology

We will use SAS to carry out these calculations.

Download the data file: nutrient.csv

The lines of this program are saved in a simple text file with a .sas file extension. If you have SAS installed on the machine to which you have downloaded this file, opening the file should launch SAS and load the program within the SAS application. Marking up a printout of the SAS program is also a good strategy for learning how this program is put together.



options ls=78;   /*This sets the line size, the maximum number of characters per line of output, to 78.*/
title "Example: Nutrient Intake Data - Descriptive Statistics";   
/*This sets a title that will appear on each page of the output until it's changed.*/
data nutrient;   /*This defines a data set called 'nutrient'.*/
  infile "D:\Statistics\STAT 505\data\nutrient.csv" firstobs=2 delimiter=',';   /*SAS will look in this path for the nutrient.csv file. 'firstobs=2' starts reading at the second line of the file.*/
  input id calcium iron protein a c;   /*This is where we provide names for the variables in order of the columns in the data set. If any were categorical (not the case here), we would need to put a '$' character after its name.*/
run;
proc means;   /*By default this reports N, the mean, the standard deviation, the minimum, and the maximum.*/
  var calcium iron protein a c;   /*If not all variables are of interest, we can specify here the ones we want to work with.*/
run;
proc corr pearson cov;   /*The 'pearson' option specifies the Pearson correlation to be computed. The 'cov' option requests the sample covariance matrix.*/
  var calcium iron protein a c;   /*If not all variables are of interest, we can specify here the ones we want to work with.*/
run;

The first part of the SAS output (download below) shows the results of the Means Procedure, proc means. Because SAS output is usually a relatively long document, printing these pages and marking them with notes is highly recommended.

Example: Nutrient Intake Data - Descriptive Statistics

The MEANS Procedure

Summary statistics

Variable N Mean Std Dev Minimum Maximum

Download the SAS Output file: nutrient2.lst

The first column of the Means Procedure table above gives the variable name. The second column reports the sample size. This is then followed by the sample means (third column) and the sample standard deviations (fourth column) for each variable. I have copied these values into the table below. I have also rounded these numbers a bit to make them easier to use for this example.
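For readers who want to check these summary columns outside SAS, the following is a minimal Python sketch of the same five statistics (N, mean, standard deviation, minimum, maximum). It assumes the file layout implied by the SAS input statement (id, calcium, iron, protein, a, c) with a header row. The three data rows and the names `SAMPLE_CSV` and `describe` are invented for illustration and are not taken from nutrient.csv.

```python
import csv
import io
import statistics

# Hypothetical rows in the layout the SAS input statement assumes.
# These numbers are made up for illustration; they are NOT from nutrient.csv.
SAMPLE_CSV = """id,calcium,iron,protein,a,c
1,500.0,10.0,60.0,800.0,70.0
2,700.0,12.0,80.0,400.0,90.0
3,600.0,8.0,70.0,1200.0,50.0
"""


def describe(text, columns):
    """Return {column: (n, mean, std_dev, minimum, maximum)}."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {}
    for name in columns:
        values = [float(row[name]) for row in rows]
        summary[name] = (
            len(values),
            statistics.mean(values),
            statistics.stdev(values),  # sample standard deviation, divisor n - 1
            min(values),
            max(values),
        )
    return summary


summary = describe(SAMPLE_CSV, ["calcium", "iron", "protein", "a", "c"])
print("Variable   N      Mean   Std Dev   Minimum   Maximum")
for name, (n, mean, sd, lo, hi) in summary.items():
    print(f"{name:8s} {n:3d} {mean:9.2f} {sd:9.2f} {lo:9.2f} {hi:9.2f}")
```

To try this on the real data, the contents of the downloaded file could be passed to `describe` in place of `SAMPLE_CSV`, provided the header names in the file match the column names used here.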

Here are the steps to find the descriptive statistics for the Women's Nutrition dataset in Minitab:

Descriptive Statistics in Minitab

  1. Go to File > Open > Worksheet [open nutrient_tf.csv]
  2. Stat > Basic Statistics > Display Descriptive Statistics
    1. Highlight and select C2 through C6 and choose ‘Select’ to move the variables into the window on the right.
    2. Select ‘Statistics...’, and check the boxes for the statistics of interest.
    3. OK > OK


Descriptive Statistics

A summary of the descriptive statistics is given here for ease of reference.

Variable Mean Standard Deviation
Calcium 624.0 mg 397.3 mg
Iron 11.1 mg 6.0 mg
Protein 65.8 g 30.6 g
Vitamin A 839.6 μg 1634.0 μg
Vitamin C 78.9 mg 73.6 mg

Notice that the standard deviations are large relative to their respective means, especially for Vitamins A and C. This would indicate a high variability among women in nutrient intake. However, whether the standard deviations are relatively large or not will depend on the context of the application. Skill in interpreting the statistical analysis depends very much on the researcher's subject matter knowledge.
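One common way to put "large relative to the mean" on a unit-free scale is the coefficient of variation, the standard deviation divided by the mean. The lesson itself does not use this measure; the sketch below is only an illustration, computed from the rounded values in the summary table.

```python
# (mean, standard deviation) copied from the rounded summary table.
stats = {
    "Calcium": (624.0, 397.3),
    "Iron": (11.1, 6.0),
    "Protein": (65.8, 30.6),
    "Vitamin A": (839.6, 1634.0),
    "Vitamin C": (78.9, 73.6),
}

# Coefficient of variation: standard deviation divided by the mean.
cv = {name: sd / mean for name, (mean, sd) in stats.items()}

# List the nutrients from most to least variable relative to the mean.
for name, value in sorted(cv.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} CV = {value:.2f}")
```

On this scale Vitamin A stands out, with a standard deviation of nearly twice its mean, followed by Vitamin C, which agrees with the observation in the text.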

The variance-covariance matrix is also copied into the matrix below.

\[S = \left(\begin{array}{rrrrr}157829.4 & 940.1 & 6075.8 & 102411.1 & 6701.6 \\ 940.1 & 35.8 & 114.1 & 2383.2 & 137.7 \\ 6075.8 & 114.1 & 934.9 & 7330.1 & 477.2 \\ 102411.1 & 2383.2 & 7330.1 & 2668452.4 & 22063.3 \\ 6701.6 & 137.7 & 477.2 & 22063.3 & 5416.3 \end{array}\right)\]


  • The sample variances are given by the diagonal elements of S. For example, the variance of iron intake is \(s_{2}^{2} = 35.8\) mg².
  • The covariances are given by the off-diagonal elements of S. For example, the covariance between calcium and iron intake is \(s_{12} = 940.1\) mg².
  • Note that the covariances are all positive, indicating that the daily intake of each nutrient tends to increase with increased intake of the remaining nutrients.

Because the covariance \(s_{12}\) is positive, we see that calcium intake tends to increase with increasing iron intake. The strength of this positive association can only be judged by comparing \(s_{12}\) to the product of the sample standard deviations for calcium and iron. This comparison is most readily accomplished by looking at the sample correlation between the two variables.
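The comparison of a covariance with the product of the two standard deviations can be sketched in a few lines of Python. The block below divides each element of S by the product of the corresponding standard deviations, \(r_{jk} = s_{jk} / (s_j s_k)\). The matrix entries are the rounded values shown above, so the results should only be expected to agree with the SAS correlations to about three decimal places. The function name `cov_to_corr` is introduced here for illustration.

```python
import math

names = ["Calcium", "Iron", "Protein", "Vit. A", "Vit. C"]

# Variance-covariance matrix S, copied from the rounded values in the text.
S = [
    [157829.4, 940.1, 6075.8, 102411.1, 6701.6],
    [940.1, 35.8, 114.1, 2383.2, 137.7],
    [6075.8, 114.1, 934.9, 7330.1, 477.2],
    [102411.1, 2383.2, 7330.1, 2668452.4, 22063.3],
    [6701.6, 137.7, 477.2, 22063.3, 5416.3],
]


def cov_to_corr(cov):
    """Convert a covariance matrix to a correlation matrix."""
    p = len(cov)
    # Standard deviations are the square roots of the diagonal elements.
    sd = [math.sqrt(cov[i][i]) for i in range(p)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]


R = cov_to_corr(S)
for name, row in zip(names, R):
    print(f"{name:8s}", " ".join(f"{value:6.3f}" for value in row))
```

For calcium and iron this gives \(940.1 / \sqrt{157829.4 \times 35.8}\), which is approximately 0.395.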

Sample Correlations

The sample correlations are included in the table below.

  Calcium Iron Protein Vit. A Vit. C
Calcium 1.000 0.395 0.500 0.158 0.229
Iron 0.395 1.000 0.623 0.244 0.313
Protein 0.500 0.623 1.000 0.147 0.212
Vit. A 0.158 0.244 0.147 1.000 0.184
Vit. C 0.229 0.313 0.212 0.184 1.000

Here we can see that the correlation of each variable with itself is equal to one, and the off-diagonal elements give the correlations between each pair of variables.

Generally, we look for the strongest correlations first. The results above suggest that protein, iron, and calcium are all positively associated. Each of these three nutrients tends to increase with increasing values of the remaining two.
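Looking for the strongest correlations first amounts to sorting the distinct pairs by their correlation. A minimal Python sketch, using the values from the correlation table above (the variable names `R` and `pairs` are chosen here for illustration):

```python
from itertools import combinations

names = ["Calcium", "Iron", "Protein", "Vit. A", "Vit. C"]

# Sample correlation matrix, copied from the table in the text.
R = [
    [1.000, 0.395, 0.500, 0.158, 0.229],
    [0.395, 1.000, 0.623, 0.244, 0.313],
    [0.500, 0.623, 1.000, 0.147, 0.212],
    [0.158, 0.244, 0.147, 1.000, 0.184],
    [0.229, 0.313, 0.212, 0.184, 1.000],
]

# One entry per distinct pair (upper triangle), strongest correlation first.
pairs = sorted(
    ((R[i][j], names[i], names[j]) for i, j in combinations(range(len(names)), 2)),
    reverse=True,
)

for r, first, second in pairs:
    print(f"{first:8s} {second:8s} {r:.3f}")
```

The three pairs at the top of the list involve only calcium, iron, and protein, which is the pattern described in the text.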

The coefficient of determination is another measure of association and is simply equal to the square of the correlation. For example, in this case, the coefficient of determination between protein and iron is \((0.623)^2\) or about 0.388.

\[r^2_{23} = 0.62337^2 = 0.38859\]

This says that about 39% of the variation in iron intake is explained by protein intake. Or, conversely, about 39% of the variation in protein intake is explained by iron intake. Both interpretations are equivalent.
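The arithmetic can be checked with a short Python sketch. Because the correlation is symmetric, \(r_{23} = r_{32}\), the squared value is the same whichever variable is treated as the one being explained. The function name `coefficient_of_determination` is introduced here for illustration.

```python
def coefficient_of_determination(r):
    """Square of a sample correlation between two variables."""
    return r ** 2


r_iron_protein = 0.62337  # correlation between iron and protein quoted in the text
r_protein_iron = r_iron_protein  # correlation is symmetric in its two arguments

r_squared = coefficient_of_determination(r_iron_protein)
print(f"r^2 = {r_squared:.5f} (about {100 * r_squared:.0f}%)")
```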