Lesson 1: Measures of Central Tendency, Dispersion and Association

Overview

A partial description of the joint distribution of the data is provided here. Three aspects of the data are of importance, the first two of which you should already be familiar with from univariate statistics. These are:

  1. Central Tendency: What is a typical value for each variable?
  2. Dispersion: How far apart are the individual observations from a central value for a given variable?
  3. Association: This might (or might not!) be a new measure for you. When more than one variable is studied together, how does each variable relate to the remaining variables? How are the variables simultaneously related to one another? Are they positively or negatively related?

Statistics, as a subject matter, is the science and art of using sample information to make generalizations about populations.

Population
A population is the collection of all people, plants, animals, or objects of interest about which we wish to make statistical inferences (generalizations). The population may also be viewed as the collection of all possible random draws from a stochastic model; for example, independent draws from a normal distribution with a given population mean and population variance.
Population Parameter
A population parameter is a numerical characteristic of a population. In nearly all statistical problems we do not know the value of a parameter because we do not measure the entire population. We use sample data to make an inference about the value of a parameter.
Sample
A sample is the subset of the population that we actually measure or observe.
Sample Statistic
A sample statistic is a numerical characteristic of a sample. A sample statistic estimates the unknown value of a population parameter. Information collected from sample statistics is sometimes referred to as Descriptive Statistics.

Here is the notation that will be used:

\(X_{ij}\) = Observation for variable \(j\) in subject \(i\).

\(p\) = Number of variables

\(n\) = Number of subjects

In the example to come, we'll have data on 737 people (subjects) and 5 nutritional outcomes (variables). So,

\(p\) = 5 variables

\(n\) = 737 subjects

In multivariate statistics we will always be working with vectors of observations. So in this case we are going to arrange the data for the p variables on each subject into a vector. In the expression below, \(\textbf{X}_i\) is the vector of observations for the \(i^{th}\) subject, \(i\) = 1 to \(n\) (737). Therefore, the data for the \(j^{th}\) variable will be located in the \(j^{th}\) element of this subject's vector, \(j\) = 1 to \(p\) (5).

\[\mathbf{X}_i = \left(\begin{array}{l}X_{i1}\\X_{i2}\\ \vdots \\ X_{ip}\end{array}\right)\]

Objectives

Upon successful completion of this lesson, you should be able to:

  • interpret measures of central tendency, dispersion, and association;
  • calculate sample means, variances, covariances, and correlations using a hand calculator;
  • use software like SAS or Minitab to compute sample means, variances, covariances, and correlations.

1.1 - Measures of Central Tendency

Central Tendency: The Mean Vector

Throughout this course, we’ll use the ordinary notations for the mean of a variable. That is, the symbol \(\mu\) is used to represent a (theoretical) population mean and the symbol \(\bar{x}\) is used to represent a sample mean computed from observed data. In the multivariate setting, we add subscripts to these symbols to indicate the specific variable for which the mean is being given. For instance, \(\mu_1\) represents the population mean for variable \(X_1\) and \(\bar{x}_{1}\) denotes a sample mean based on observed data for variable \(X_{1}\).

The population mean is the measure of central tendency for the population. Here, the population mean for variable \(j\) is

\[\mu_j = E(X_{j})\]

The notation \(E\) stands for statistical expectation; here \(E(X_{j})\) is the mean of \(X_{j}\) over all members of the population, or equivalently, over all random draws from a stochastic model. For example, \(\mu_j = E(X_{j})\) may be the mean of a normal variable.

The population mean \(\mu_j\) for variable \(j\) can be estimated by the sample mean

\[\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n}X_{ij}\]

Note! The sample mean \(\bar{x}_{j}\), because it is a function of our random data, is also going to have a mean itself. In fact, the population mean of the sample mean is equal to the population mean \(\mu_j\); i.e.,\[E(\bar{x}_j) = \mu_j \]

Therefore, \(\bar{x}_{j}\) is unbiased for \(\mu_j\).

Another way of saying this is that the mean of the \(\bar{x}_{j}\)’s over all possible samples of size \(n\) is equal to \(\mu_j\).

Recall that the population mean vector is \(\boldsymbol{\mu}\), the collection of the population means for each of the different variables.

\(\boldsymbol{\mu} = \left(\begin{array}{c} \mu_1 \\ \mu_2\\ \vdots\\ \mu_p \end{array}\right)\)

We can estimate this population mean vector, \(\boldsymbol{\mu}\), by \(\mathbf{\bar{x}}\). This is obtained by collecting the sample means from each of the variables in a single vector. This is shown below.

\(\mathbf{\bar{x}} = \left(\begin{array}{c}\bar{x}_1\\ \bar{x}_2\\ \vdots \\ \bar{x}_p\end{array}\right) = \left(\begin{array}{c}\frac{1}{n}\sum_{i=1}^{n}X_{i1}\\ \frac{1}{n}\sum_{i=1}^{n}X_{i2}\\ \vdots \\ \frac{1}{n}\sum_{i=1}^{n}X_{ip}\end{array}\right) = \frac{1}{n}\sum_{i=1}^{n}\textbf{X}_i\)

Just as the sample means, \(\bar{x}\), for the individual variables are unbiased for their respective population means, the sample mean vector is unbiased for the population mean vector.

\(E(\mathbf{\bar{x}}) = E\left(\begin{array}{c}\bar{x}_1\\\bar{x}_2\\ \vdots \\\bar{x}_p\end{array}\right) = \left(\begin{array}{c}E(\bar{x}_1)\\E(\bar{x}_2)\\ \vdots \\E(\bar{x}_p)\end{array}\right)=\left(\begin{array}{c}\mu_1\\\mu_2\\\vdots\\\mu_p\end{array}\right)=\boldsymbol{\mu}\)
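To make these formulas concrete, here is a minimal SAS/IML sketch that computes a sample mean vector as the average of the \(n\) data vectors. The four bivariate observations are the same ones used in Example 1-2 later in this lesson; the matrix name x is just illustrative.

proc iml;
  x = {6 3,
       10 4,
       12 7,
       12 6};        /*rows are subjects (n = 4), columns are variables (p = 2)*/
  n = nrow(x);
  xbar = x[+,] / n;  /*column sums divided by n: the sample mean vector (as a row)*/
  print xbar;        /*prints 10 and 5*/
run;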


1.2 - Measures of Dispersion

Dispersion: Variance, Standard Deviation

Variance
A variance measures the degree of spread (dispersion) in a variable’s values.

Theoretically, a population variance is the average squared difference between a variable’s values and the mean for that variable.

Population Variance
The population variance for variable \(X_j\) is
\(\sigma_j^2 = E(X_j-\mu_j)^2\)

Note that the squared residual \((X_{j}-\mu_{j})^2\) is a function of the random variable \(X_{j}\). Therefore, the squared residual itself is random and has a population mean. The population variance is thus the population mean of the squared residual. We see that if the data tend to be far away from the mean, the squared residual will tend to be large, and hence the population variance will also be large. Conversely, if the data tend to be close to the mean, the squared residual will tend to be small, and hence the population variance will also be small.

Sample Variance
The population variance \(\sigma _{j}^{2}\) can be estimated by the sample variance
\begin{align} s_j^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)^2\\&= \frac{\sum_{i=1}^{n}X_{ij}^2- n \bar{x}_j^2 }{n-1} \\&=\frac{\sum_{i=1}^{n}X_{ij}^2-\left(\left(\sum_{i=1}^{n}X_{ij}\right)^2/n\right)}{n-1} \end{align}

The first expression in this formula is most suitable for interpreting the sample variance. We see that it is a function of the squared residuals; that is, take the difference between the individual observations and their sample mean, and then square the result. Here, we may observe that if observations tend to be far away from their sample means, then the squared residuals and hence the sample variance will also tend to be large.

If on the other hand, the observations tend to be close to their respective sample means, then the squared differences between the data and their means will be small, resulting in a small sample variance value for that variable.

The last part of the expression above gives the formula that is most suitable for computation, either by hand or by a computer! Since the sample variance is a function of the random data, the sample variance itself is a random quantity, and so has a population mean. In fact, the population mean of the sample variance is equal to the population variance:

\[E(s_j^2) = \sigma_j^2\]

That is, the sample variance \(s _{j}^{2}\) is unbiased for the population variance \(\sigma _{j}^{2}\).

Our textbook (Johnson and Wichern, 6th ed.) uses a sample variance formula derived using maximum likelihood estimation principles. In this formula, the division is by \(n\) rather than \(n-1\).

\[s_j^2 = \frac{\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)^2}{n}\]

Example 1-1: Pulse Rates

Suppose that we have observed the following \(n =\) 5 resting pulse rates: 64, 68, 74, 76, 78

Find the sample mean, variance and standard deviation.

Answer

The sample mean is \(\bar{x} = \dfrac{64+68+74+76+78}{5}=72\).

The maximum likelihood estimate of the variance, the one consistent with our text, is

\begin{align} s^2 &= \frac{(64-72)^2+(68-72)^2+(74-72)^2+(76-72)^2+(78-72)^2}{5}\\&=\frac{136}{5} \\&= 27.2 \end{align}

The standard deviation based on this method is \(s=\sqrt{27.2}=5.215\).

The more commonly used variance estimate, the one given by statistical software, would be \(\frac{136}{5-1}=34\). The standard deviation would be \(s = \sqrt{34}=5.83\).
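For those working in SAS, here is a minimal sketch that reproduces both estimates; the data set name pulse is illustrative. The vardef=n option tells proc means to divide by \(n\) rather than the default \(n - 1\).

data pulse;
  input rate;
  datalines;
64
68
74
76
78
;
run;
proc means data=pulse n mean var std;            /*default divisor n-1: variance 34*/
  var rate;
run;
proc means data=pulse vardef=n n mean var std;   /*divisor n: variance 27.2*/
  var rate;
run;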


1.3 - Measures of Association

Measures of Association: Covariance, Correlation

Association is concerned with how each variable is related to the other variable(s). The first measure that we will consider is the covariance between two variables j and k.

Population covariance is a measure of the association between pairs of variables in a population.

Population Covariance
The population covariance between variables j and k is
\(\sigma_{jk} = E\{(X_{ij}-\mu_j)(X_{ik}-\mu_k)\}\quad\mbox{for }i=1,\ldots,n\)

Note that the product of the residuals \( \left( X_{ij}  - \mu_{j} \right) \) and \( \left(X_{ik} - \mu_{k} \right) \) for variables j and k, respectively, is a function of the random variables \(X_{ij}\) and \(X_{ik}\). Therefore, \(\left( X _ { i j } - \mu _ { j } \right) \left( X _ { i k } - \mu _ { k } \right)\) is itself random, and has a population mean. The population covariance is defined to be the population mean of this product of residuals. We see that if either both variables are greater than their respective means, or if they are both less than their respective means, then the product of the residuals will be positive. Thus, if the value of variable j tends to be greater than its mean when the value of variable k is larger than its mean, and if the value of variable j tends to be less than its mean when the value of variable k is smaller than its mean, then the covariance will be positive. Positive population covariances mean that the two variables are positively associated; variable j tends to increase with increasing values of variable k.

A negative association can also occur. If one variable tends to be greater than its mean when the other variable is less than its mean, the product of the residuals will be negative, and you will obtain a negative population covariance. Variable j will tend to decrease with increasing values of variable k.

The population covariance \(\sigma_{jk}\) between variables j and k can be estimated by the sample covariance.

Sample Covariance
This can be calculated by
\begin{align} s_{jk} &= \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij}-\bar{x}_j)(X_{ik}-\bar{x}_k)\\&=\frac{\sum_{i=1}^{n}X_{ij}X_{ik}-(\sum_{i=1}^{n}X_{ij})(\sum_{i=1}^{n}X_{ik})/n}{n-1} \end{align}

Just as in the formula for the variance, we have two expressions that make up this formula. The first half of the formula is most suitable for understanding the interpretation of the sample covariance, and the second half of the formula is used for calculation.

Looking at the first half of the expression, the product inside the sum is the residual differences between variable j and its mean times the residual differences between variable k and its mean. We can see that if either both variables tend to be greater than their respective means or less than their respective means, then the product of the residuals will tend to be positive leading to a positive sample covariance.

Conversely, if one variable takes values that are greater than its mean when the other variable takes a value less than its mean, then the product will take a negative value. In the end, when you add up this product over all of the observations, the result is a negative covariance.

So, in effect, a positive covariance indicates a positive association between the variables j and k, and a negative covariance indicates a negative association.

For computational purposes, we will use the second half of the formula. For each subject, the product of the two variables is obtained, and then the products are summed to obtain the first term in the numerator. The second term in the numerator is obtained by taking the product of the sums of variables over the n subjects, then dividing the results by the sample size n. The difference between the first and second terms is then divided by n -1 to obtain the covariance value.

Again, the sample covariance is a function of the random data, and hence, is random itself. As before, the population mean of the sample covariance \(s_{jk}\) is equal to the population covariance \(\sigma_{jk}\); i.e.,

\(E(s_{jk})=\sigma_{jk}\)

That is, the sample covariance \(s_{jk}\) is unbiased for the population covariance \(\sigma_{jk}\).

The sample covariance is a measure of the association between a pair of variables:

\(s_{jk}\) = 0

This implies that the two variables are uncorrelated. (Note that this does not necessarily imply independence; we'll get back to this later.)

\(s_{jk}\) > 0

This implies that the two variables are positively correlated; i.e., values of variable j tend to increase with increasing values of variable k. The larger the covariance, the stronger the positive association between the two variables.

\(s_{jk}\) < 0

This implies that the two variables are negatively correlated; i.e., values of variable j tend to decrease with increasing values of variable k. The smaller the covariance, the stronger the negative association between the two variables.

Recall that we collected all of the population means of the p variables into a mean vector. Likewise, the population variances and covariances can be collected into the population variance-covariance matrix, also known as the population dispersion matrix:

Population variance-covariance matrix
\(\Sigma = \left(\begin{array}{cccc}\sigma^2_1 & \sigma_{12} & \dots & \sigma_{1p}\\ \sigma_{21} & \sigma^2_{2} & \dots & \sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p1} & \sigma_{p2} & \dots &\sigma^2_p\end{array}\right)\)

Note that the population variances appear along the diagonal of this matrix, and the covariances appear in the off-diagonal elements. So, the covariance between variables j and k will appear in row j and column k of this matrix.

The population variance-covariance matrix may be estimated by the sample variance-covariance matrix. The population variances and covariances in the above population variance-covariance matrix are replaced by the corresponding sample variances and covariances to obtain the sample variance-covariance matrix:

Sample variance-covariance matrix
\( S = \left(\begin{array}{cccc}s^2_1 & s_{12} & \dots & s_{1p}\\ s_{21} & s^2_2 & \dots & s_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \dots & s^2_{p}\end{array}\right)\)

Note that the sample variances appear along the diagonal of this matrix and the covariances appear in the off-diagonal elements. So the covariance between variables j and k will appear in the jk-th element of this matrix.

Note!

S (the sample variance-covariance matrix) is symmetric; i.e., \(s_{jk}\) = \(s_{kj}\) .

S is unbiased for the population variance-covariance matrix \(\Sigma\); i.e.,

\(E(S) = \left(\begin{array}{cccc} E(s^2_1) & E(s_{12}) & \dots & E(s_{1p}) \\ E(s_{21}) & E(s^2_{2}) & \dots & E(s_{2p})\\ \vdots & \vdots & \ddots & \vdots \\ E(s_{p1}) & E(s_{p2}) & \dots & E(s^2_p)\end{array}\right)=\Sigma\).

Because this matrix is a function of our random data, the elements of this matrix are also going to be random, and the matrix, on the whole, is random as well. The statement 'S is unbiased' means that the mean of each element of S is equal to the corresponding element of the population matrix \(\Sigma\).

In matrix notation, the sample variance-covariance matrix may be computed using the following expressions:

\begin{align} S &= \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{x})(X_i-\bar{x})'\\ &= \frac{\sum_{i=1}^{n}X_iX_i'-(\sum_{i=1}^{n}X_i)(\sum_{i=1}^{n}X_i)'/n}{n-1} \end{align}

Just as we have seen in the previous formulas, the first half of the formula is used in interpretation, and the second half of the formula is what is used for calculation purposes.

Looking at the second term you can see that the first term in the numerator involves taking the data vector for each subject and multiplying by its transpose. The resulting matrices are then added over the n subjects. To obtain the second term in the numerator, first compute the sum of the data vectors over the n subjects, then take the resulting vector and multiply by its transpose; then divide the resulting matrix by the number of subjects n. Take the difference between the two terms in the numerator and divide by n - 1.
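As a sketch of how this matrix formula can be carried out in SAS/IML, the following code applies the computational form of S to the small data set of Example 1-2 below; the variable names are illustrative.

proc iml;
  x = {6 3,
       10 4,
       12 7,
       12 6};                            /*n = 4 subjects (rows), p = 2 variables (columns)*/
  n = nrow(x);
  sumx = x[+,];                          /*1 x p row vector of column sums*/
  s = (x`*x - sumx`*sumx/n)/(n - 1);     /*second expression in the formula above*/
  print s;                               /*the 2 x 2 sample variance-covariance matrix*/
run;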

Example 1-2

Suppose that we have observed the following n = 4 observations for variables \(x_{1}\) and \(x_{2}\).

\(x_{1}\)   \(x_{2}\)
 6           3
10           4
12           7
12           6
Answer

The sample means are \(\bar{x}_1\) = 10 and \(\bar{x}_2\) = 5. The sample covariance is the sum of the products of deviations from the means, divided by \(n - 1\):

\begin{align} s_{12}&=\dfrac{(6-10)(3-5)+(10-10)(4-5)+(12-10)(7-5)+(12-10)(6-5)}{4-1}\\&=\dfrac{8+0+4+2}{4-1}=4.67 \end{align}

The positive value reflects the fact that as \(x_{1}\) increases, \(x_{2}\) also tends to increase. 
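This calculation is easy to verify in SAS; a minimal sketch, with the observations keyed in via datalines and illustrative variable names:

data example;
  input x1 x2;
  datalines;
6 3
10 4
12 7
12 6
;
run;
proc corr data=example cov;   /*the 'cov' option prints the sample covariance matrix; s12 = 4.67*/
  var x1 x2;
run;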

Note! The magnitude of the covariance value is not particularly helpful as it is a function of the magnitudes (scales) of the two variables. This quantity is a function of the variability of the two variables, and so, it is hard to tease out the effects of the association between the two variables from the effects of their dispersions.

Note, however, that the covariance between variables j and k must lie between the product of the standard deviations of variables j and k, and the negative of that same product:

\(-s_j s_k \le s_{jk} \le s_j s_k\)

Example 1-3: Body Measurements (Covariance)

In an undergraduate statistics class, n = 30 females reported their heights (inches), and also measured their left forearm length (cm), left foot length (cm), and head circumference (cm). The sample variance-covariance matrix is the following:

Covariances: Height, LeftArm, LeftFoot, HeadCirc

           Height   LeftArm   LeftFoot   HeadCirc
Height      8.740     3.022      2.772      0.289
LeftArm     3.022     2.402      1.234      0.223
LeftFoot    2.772     1.234      1.908      0.118
HeadCirc    0.289     0.223      0.118      3.434
Analysis

Notice that the matrix has four rows and four columns because there are four variables being considered. Also, notice that the matrix is symmetric.

Here are a few examples of the information in the matrix:

  • The variance of the height variable is 8.74. Thus the standard deviation is \(\sqrt{8.74} = 2.956\).
  • The variance of the left foot measurement is 1.908 (in the 3rd diagonal element). Thus the standard deviation for this variable is \(\sqrt{1.908}=1.381\).
  • The covariance between height and left arm is 3.022, found in the 1st row, 2nd column and also in the 2nd row, 1st column.
  • The covariance between left foot and left arm is 1.234, found in the 3rd row, 2nd column and also in the 2nd row, 3rd column.

All covariance values are positive so all pairwise associations are positive. But, the magnitudes do not tell us about the strength of the associations. To assess the strength of an association, we use correlation values. This suggests an alternative measure of association. 

Correlation Matrix

Correlation
The population correlation is defined to be equal to the population covariance divided by the product of the population standard deviations:
\(\rho_{jk} = \dfrac{\sigma_{jk}}{\sigma_j\sigma_k}\)

The population correlation may be estimated by substituting into the formula the sample covariances and standard deviations:

\(r_{jk}=\dfrac{s_{jk}}{s_js_k}=\dfrac{\sum_{i=1}^{n}X_{ij}X_{ik}-(\sum_{i=1}^{n}X_{ij})(\sum_{i=1}^{n}X_{ik})/n}{\sqrt{\{\sum_{i=1}^{n}X^2_{ij}-(\sum_{i=1}^{n}X_{ij})^2/n\}\{\sum_{i=1}^{n}X^2_{ik}-(\sum_{i=1}^{n}X_{ik})^2/n\}}}\)

It is essential to note that the population and the sample correlation must lie between -1 and 1.

\(-1 \le \rho_{jk} \le 1\)

\(-1 \le r_{jk} \le 1\)

Therefore:

  • \(\rho_{jk}\) = 0 indicates, as you might expect, that the two variables are uncorrelated;
  • \(\rho_{jk}\) close to +1 indicates a strong positive dependence;
  • \(\rho_{jk}\) close to -1 indicates a strong negative dependence.

Sample correlation coefficients also have a similar interpretation.

For a collection of p variables, the correlation matrix is a p × p matrix that displays the correlations between pairs of variables. For instance, the value in the \(j^{th}\) row and \(k^{th}\) column gives the correlation between variables \(x_{j}\) and \(x_{k}\). The correlation matrix is symmetric so that the value in the \(k^{th}\) row and \(j^{th}\) column is also the correlation between variables \(x_{j}\) and \(x_{k}\). The diagonal elements of the correlation matrix are all identically equal to 1.

Sample Correlation Matrix
The sample correlation matrix is denoted as R.
\(\textbf{R} = \left(\begin{array}{cccc} 1 & r_{12} & \dots & r_{1p}\\ r_{21} & 1 & \dots & r_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \dots & 1\end{array}\right)\)

Example 1-4: Body Measurements (Correlations)

The following covariance matrix shows the pairwise covariances for the height, left forearm, left foot, and head circumference measurements of n = 30 female college students.

Covariances: Height, LeftArm, LeftFoot, HeadCirc

           Height   LeftArm   LeftFoot   HeadCirc
Height      8.740     3.022      2.772      0.289
LeftArm     3.022     2.402      1.234      0.223
LeftFoot    2.772     1.234      1.908      0.118
HeadCirc    0.289     0.223      0.118      3.434
Analysis

Here are two examples of calculating a correlation coefficient:

  • The correlation between height and left forearm is \(\dfrac{3.022}{\sqrt{8.74}\sqrt{2.402}}=0.66\).
  • The correlation between head circumference and left foot is \(\dfrac{0.118}{\sqrt{3.434}\sqrt{1.908}}=0.046\).
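Rather than dividing each covariance by hand, the whole matrix can be converted at once. Here is a minimal SAS/IML sketch using the covariance matrix above:

proc iml;
  s = {8.740 3.022 2.772 0.289,
       3.022 2.402 1.234 0.223,
       2.772 1.234 1.908 0.118,
       0.289 0.223 0.118 3.434};   /*sample covariance matrix from above*/
  d = sqrt(vecdiag(s));            /*column vector of sample standard deviations*/
  r = s/(d*d`);                    /*elementwise division: r_jk = s_jk/(s_j s_k)*/
  print r[format=8.3];
run;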

The complete sample correlation matrix for this example is the following:

Correlations: Height, LeftArm, LeftFoot, HeadCirc

           Height   LeftArm   LeftFoot   HeadCirc
Height      1        0.66       0.68       0.053
LeftArm     0.66     1          0.58       0.078
LeftFoot    0.68     0.58       1          0.046
HeadCirc    0.053    0.078      0.046      1

Overall, we see moderately strong linear associations among the variables height, left arm, and left foot and relatively weak (almost 0) associations between head circumference and the other three variables.

In practice, use scatter plots of the variables to understand the associations between variables fully. It is not a good idea to rely on correlations without seeing the plots. Correlation values are affected by outliers and curvilinearity.


1.4 - Example: Descriptive Statistics

Example 1-5: Women's Health Survey (Descriptive Statistics)

Let us take a look at an example. In 1985, the USDA commissioned a study of women’s nutrition. Nutrient intake was measured for a random sample of 737 women aged 25-50 years. The following variables were measured:

  • Calcium (mg)
  • Iron (mg)
  • Protein (g)
  • Vitamin A (μg)
  • Vitamin C (mg)

Using Technology

We will use a SAS program to carry out the calculations that we would like to see.

Download the data file: nutrient.csv

The lines of this program are saved in a simple text file with a .sas file extension. If you have SAS installed on the machine on which you have downloaded this file, it should launch SAS and open the program within the SAS application. Marking up a printout of the SAS program is also a good strategy for learning how this program is put together.

 


options ls=78;   /*This sets the max number of lines per page to 78.*/
title "Example: Nutrient Intake Data - Descriptive Statistics";   
/*This sets a title that will appear on each page of the output until it's changed.*/
data nutrient;   /*This defines a data set called 'nutrient'.*/
  infile "D:\Statistics\STAT 505\data\nutrient.csv" firstobs=2 delimiter=',';   /*SAS will look in this path for the nutrient.csv file.*/
  input id calcium iron protein a c;   /*This is where we provide names for the variables in order of the columns in the data set. If any were categorical (not the case here), we would need to put a '$' character after its name.*/
  run;   
proc means;   
  var calcium iron protein a c;   /*If not all variables are of interest, we can specify here the ones we want to work with.*/
  run;   
proc corr pearson cov;   /*The 'pearson' option specifies the pearson correlation to be computed. The 'cov' option requests the sample covariance matrix.*/
  var calcium iron protein a c;   /*If not all variables are of interest, we can specify here the ones we want to work with.*/
  run;

The first part of this SAS output (download below) is the results of the Means Procedure - proc means. Because the SAS output is usually a relatively long document, printing these pages of output and marking them with notes is highly recommended, if not required!

Example: Nutrient Intake Data - Descriptive Statistics

The MEANS Procedure

Variable      N            Mean        Std Dev        Minimum        Maximum
calcium     737     624.0492537    397.2775401      7.4400000        2866.44
iron        737      11.1298996      5.9841905              0     58.6680000
protein     737      65.8034410     30.5757564              0    251.0120000
a           737     839.6353460        1633.54              0       34434.27
c           737      78.9284464     73.5952721              0    433.3390000

Download the SAS Output file: nutrient2.lst

The first column of the Means Procedure table above gives the variable name. The second column reports the sample size. This is then followed by the sample means (third column) and the sample standard deviations (fourth column) for each variable. I have copied these values into the table below. I have also rounded these numbers a bit to make them easier to use for this example.

Here are the steps to find the descriptive statistics for the Women's Nutrition dataset in Minitab:

Descriptive Statistics in Minitab

  1. Go to File > Open > Worksheet [open nutrient_tf.csv]
  2. Stat > Basic Statistics > Display Descriptive Statistics
    1. Highlight and select C2 through C6 and choose ‘Select’ to move the variables into the window on the right.
    2. Select ‘Statistics...’, and check the boxes for the statistics of interest.
    3. OK > OK

Analysis

Descriptive Statistics

A summary of the descriptive statistics is given here for ease of reference.

Variable     Mean       Standard Deviation
Calcium      624.0 mg   397.3 mg
Iron         11.1 mg    6.0 mg
Protein      65.8 g     30.6 g
Vitamin A    839.6 μg   1634.0 μg
Vitamin C    78.9 mg    73.6 mg

Notice that the standard deviations are large relative to their respective means, especially for Vitamin A and Vitamin C. This would indicate high variability among women in nutrient intake. However, whether the standard deviations are relatively large or not will depend on the context of the application. Skill in interpreting the statistical analysis depends very much on the researcher's subject matter knowledge.

The variance-covariance matrix is also copied into the matrix below.

\[S = \left(\begin{array}{rrrrr}157829.4 & 940.1 & 6075.8 & 102411.1 & 6701.6 \\ 940.1 & 35.8 & 114.1 & 2383.2 & 137.7 \\ 6075.8 & 114.1 & 934.9 & 7330.1 & 477.2 \\ 102411.1 & 2383.2 & 7330.1 & 2668452.4 & 22063.3 \\ 6701.6 & 137.7 & 477.2 & 22063.3 & 5416.3 \end{array}\right)\]

Interpretation

  • The sample variances are given by the diagonal elements of S. For example, the variance of iron intake is \(s_{2}^{2} = 35.8\) mg².
  • The covariances are given by the off-diagonal elements of S. For example, the covariance between calcium and iron intake is \(s_{12} = 940.1\).
  • Note that the covariances are all positive, indicating that the daily intake of each nutrient increases with increased intake of the remaining nutrients.

Because the covariance between calcium and iron is positive, we see that calcium intake tends to increase with increasing iron intake. The strength of this positive association can only be judged by comparing \(s_{12}\) to the product of the sample standard deviations for calcium and iron. This comparison is most readily accomplished by looking at the sample correlation between the two variables.

Sample Correlations

The sample correlations are included in the table below.

           Calcium   Iron    Protein   Vit. A   Vit. C
Calcium     1.000    0.395    0.500     0.158    0.229
Iron        0.395    1.000    0.623     0.244    0.313
Protein     0.500    0.623    1.000     0.147    0.212
Vit. A      0.158    0.244    0.147     1.000    0.184
Vit. C      0.229    0.313    0.212     0.184    1.000

Here we can see that the correlation between each variable and itself is equal to one (the diagonal elements), and the off-diagonal elements give the correlation between each pair of variables.

Generally, we look for the strongest correlations first. The results above suggest that protein, iron, and calcium are all positively associated. Each of these three nutrients increases with increasing values of the remaining two.

The coefficient of determination is another measure of association and is simply equal to the square of the correlation. For example, in this case, the coefficient of determination between protein and iron is \((0.623)^2\) or about 0.388.

\[r^2_{23} = 0.62337^2 = 0.38859\]

This says that about 39% of the variation in iron intake is explained by protein intake. Or, conversely, about 39% of the variation in protein intake is explained by iron intake. Both interpretations are equivalent.


1.5 - Additional Measures of Dispersion

Overall Measures of Dispersion

Sometimes it is also useful to have an overall measure of dispersion in the data, one that includes all of the variables simultaneously rather than one at a time. So far, we have looked at the individual variables and their variances to measure dispersion one variable at a time. Here we are going to look at measures of dispersion of all variables together, in particular measures of the total variation.

The variance \(\sigma _{j}^{2}\) measures the dispersion of an individual variable Xj. The following two are used to measure the dispersion of all variables together.

  • Total Variation
  • Generalized Variance

To understand total variation we first must find the trace of a square matrix. A square matrix is a matrix that has an equal number of columns and rows. Important examples of square matrices include the variance-covariance and correlation matrices.

Trace of an n x n Matrix
The trace of an n x n matrix \(\mathbf{A}\) is
\(trace(\textbf{A}) = \sum_{i=1}^{n}a_{ii}\)

For instance, in a 10 x 10 matrix, the trace is the sum of the diagonal elements.

Total Variation of a Random Vector, \(\mathbf{X}\)

The total variation, therefore, of a random vector \(\mathbf{X}\) is simply the trace of the population variance-covariance matrix.

\(trace (\Sigma) = \sigma^2_1 + \sigma^2_2 +\dots+ \sigma^2_p\)

Thus, the total variation is equal to the sum of the population variances.

The total variation can be estimated by:

\(trace(S) = s^2_1+s^2_2+\dots +s^2_p\)

The total variation is of interest for principal components analysis and factor analysis and we will look at these concepts later in this course.
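In SAS/IML, the total variation can be computed directly with the trace function; a minimal sketch, using the 2 × 2 covariance matrix of height and left arm from Example 1-3:

proc iml;
  s = {8.740 3.022,
       3.022 2.402};
  totvar = trace(s);   /*sum of the diagonal elements: 8.740 + 2.402 = 11.142*/
  print totvar;
run;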

Example 1-6: Women's Health Survey (Variance)

Let us use the data from the USDA women’s health survey again to illustrate this. We have taken the variances for each of the variables from the software output and have placed them in the table below.

Variable Variance
Calcium 157829.4
Iron 35.8
Protein 934.9
Vitamin A 2668452.4
Vitamin C 5416.3
Total 2832668.8

The total variation for the nutrient intake data is determined by simply adding up all of the variances for each of the individual variables. The total variation equals 2,832,668.8. This is a very large number.

Note! The problem with total variation is that it does not take into account correlations among the variables.

Interpreting Correlation

These plots show simulated data for pairs of variables with different levels of correlation. In each case, the variances for both variables are equal to 1, so that the total variation is 2.

When the correlation r = 0, then we see a shotgun-blast pattern of points, widely dispersed over the entire range of the plot.

[Scatter plot of simulated data with r = 0.0]

Increasing the correlation to r = 0.7, we see an oval-shaped pattern. Note that the points are not as widely dispersed.

[Scatter plot of simulated data with r = 0.7]

Increasing the correlation to r = 0.9, we see that the points fall along a 45-degree line, and are even less dispersed.

[Scatter plot of simulated data with r = 0.9]

Thus, the dispersion of points decreases with increasing correlation. But, in all cases, the total variation is the same. The total variation does not take into account the correlation between the two variables.

Fixing the variances, the scatter of the data will tend to decrease as \(| r | \rightarrow 1\).

The Determinant

To take into account the correlations among pairs of variables an alternative measure of overall variance is suggested. This measure takes a large value when the various variables show very little correlation among themselves. In contrast, this measure takes a small value if the variables show a very strong correlation among themselves, either positive or negative. This particular measure of dispersion is the generalized variance. In order to define the generalized variance, we first define the determinant of the matrix.

We will start simple with a 2 x 2 matrix and then we will move on to more general definitions for larger matrices.

Let us consider the determinant of a 2 x 2 matrix \(\mathbf{B}\) as shown below. Here we can see that it is the product of the two diagonal elements minus the product of the off-diagonal elements.

\(|\textbf{B}| =\left|\begin{array}{cc}b_{11} & b_{12}\\ b_{21} & b_{22}\end{array}\right| = b_{11}b_{22}-b_{12}b_{21}\)

Here is an example using a simple matrix with the elements 5, 1, 2, and 4. The product of the diagonal elements, 5 × 4, minus the product of the off-diagonal elements, 1 × 2, yields a determinant of 18:

\(\left|\begin{array}{cc}5 & 1\\2 & 4\end{array}\right| = 5 \times 4 - 1\times 2= 20-2 =18\)
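The same arithmetic can be checked in SAS/IML with the det function:

proc iml;
  b = {5 1,
       2 4};
  d = det(b);   /*5*4 - 1*2 = 18*/
  print d;
run;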

Determinant of a General \(p\ x\ p\) Matrix \(\mathbf{B}\)

More generally the determinant of a general p x p matrix \(\mathbf{B}\) is given by the expression shown below:

\(|\textbf{B}| = \sum_{j=1}^{p}(-1)^{j+1}b_{1j}|\textbf{B}_{1j}|\)

The expression involves a sum over the elements of the first row of \(\mathbf{B}\). Note that these elements are denoted by \(b_{1j}\). These are pre-multiplied by \(-1\) raised to the \((j + 1)^{th}\) power, so basically we are going to have alternating plus and minus signs in our sum. The matrix \(\mathbf{B}_{1j}\) is obtained by deleting row 1 and column \(j\) from the matrix \(\mathbf{B}\).
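For instance, for a 3 × 3 matrix, the expansion along the first row is

\[|\textbf{B}| = b_{11}\left|\begin{array}{cc}b_{22} & b_{23}\\ b_{32} & b_{33}\end{array}\right| - b_{12}\left|\begin{array}{cc}b_{21} & b_{23}\\ b_{31} & b_{33}\end{array}\right| + b_{13}\left|\begin{array}{cc}b_{21} & b_{22}\\ b_{31} & b_{32}\end{array}\right|\]

where each 2 × 2 determinant is computed as above.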

By definition, the generalized variance of a random vector \(\mathbf{X}\) is equal to \(|\Sigma|\), the determinant of the variance-covariance matrix. The generalized variance can be estimated by calculating \(|S|\), the determinant of the sample variance-covariance matrix.


1.6 - Example: Generalized Variance

Example 1-7: Women's Health Survey (Generalized Variance)

Find and interpret the generalized variance for the Women's Health Survey data.

Using Technology

The generalized variance for the Women's Health Survey data can be calculated using the SAS program below.

Download the data file here: nutrient.csv

 


options ls=78;   /*This sets the max number of lines per page to 78.*/
title "Example: Nutrient Intake Data - Generalized Variance";   /*This sets a title that will 
appear on each page of the output until it's changed.*/
data nutrient;   /*This defines a data set called 'nutrient'.*/
  infile "D:\Statistics\STAT 505\data\nutrient.csv" firstobs=2 delimiter=',';   /*SAS will look in this path for the 
  nutrient.csv file.*/
  input id calcium iron protein a c;   /*This is where we provide names for the variables 
  in order of the columns in the data set. If any were categorical (not the case here), 
  we would need to put a '$' character after its name.*/
  run;   
proc iml;   /*The iml procedure allows for many general calculations to be made, including 
matrix operations.*/
  start genvar;   /*This defines a SAS module that can be called to compute the 
  generalized variance. The lines of code below are executed when 'genvar' is called, 
  and both the sample covariance matrix and the generalized variance are printed.*/
    one=j(nrow(x),1,1);   /*This defines a column vector of 1s. The size is determined by 
    'x', which is a variable that is defined outside the module below.*/
    ident=i(nrow(x));   /*This creates an identity matrix with the same number of rows 
    as x.*/
    s=x`*(ident-one*one`/nrow(x))*x/(nrow(x)-1.0);   /*This is the sample covariance 
    matrix, which is an unbiased estimate of the population covariance matrix.*/
    genvar=det(s);   /*The generalized variance is the determinant of the sample 
    covariance matrix.*/
    print s genvar;   /*This is the statement that prints both the sample covariance 
    matrix and the generalized variance.*/
  finish;   /*This ends the 'genvar' module definition. The module hasn't run yet and 
  won't be called until we define the 'x' argument below.*/
  use nutrient;   /*This makes the variables from the 'nutrient' data set available 
  for use in this iml environment.*/
  read all var{calcium iron protein a c} into x;   /*This creates a vector x 
  consisting of the variables specified. This vector is what will be used in the 
  'genvar' module defined above.*/
  run genvar;   /*This statement calls the 'genvar' module, which we defined above.*/

Generalized Variance using Minitab

  1. Download the 'Determat.mac’ macro file and save it to your computer.
  2. File > Run Script, and then choose 'Minitab Macro' for type. Then choose ‘OK’.
  3. Stat > Basic Statistics > Covariance
    1. Highlight and select C3, C4, and C6 and choose ‘Select’ to move these three variables into the window on the right. Only these variables are chosen for this particular example because if all six variables are used, the value of the generalized variance is too large to be displayed.
    2. Check the box for ‘Store matrix’.
    3. Select ‘OK’. No results are displayed at this point.
  4. Data > Display Data
    1. Highlight and select M1 and click ‘Select’ to move it into the window on the right.
    2. Select ‘OK’ to display the sample covariance matrix.
  5. View > Command Line/History to show the command line window on the right side.
    1. In the command line window, type ‘%Determat M1’ without quotes.
    2. Select ‘Run’ near the lower-right corner of the command line window. The generalized variance is displayed in the data display area.

Analysis

The output from the programs reports the sample variance/covariance matrix.

Example: Nutrient Intake Data - Generalized Variance

S                                                               GENVAR

157829.44   940.08944   6075.8163   102411.13   6701.616       2.83E19
940.08944   35.810536   114.05803   2383.1534   137.67199
6075.8163   114.05803   934.87688   7330.0515   477.19978
102411.13   2383.1534   7330.0515   2668452.4   22063.249
6701.616    137.67199   477.19978   22063.249   5416.2641

You should compare this output with the sample variance/covariance matrix output obtained from the corr procedure in our last program, nutrient2. You will see that we have the exact same numbers that were presented before. The generalized variance is the single entry in the far upper right-hand corner of the output. Here we see that the generalized variance is:

\[|S| = 2.83 \times 10^{19}\]

Interpretation

In terms of interpreting the generalized variance, the larger the generalized variance the more dispersed the data are. Note that the volume of space occupied by the cloud of data points is going to be proportional to the square root of the generalized variance.

In this example...

\[\sqrt{|S|} = 5.37 \times 10^9\]

This represents a very large volume of space. Again, the interpretation of this particular number depends largely on subject matter knowledge. In this case, we cannot say whether this is a particularly large number unless we know more about women's nutrition.


1.7 - Summary

In this lesson we learned how to:

  • interpret various measures of central tendency, dispersion, and association;
  • compute sample means, variances, covariances, and correlations using a hand calculator;
  • use software like SAS and Minitab to compute sample means, variances, covariances, and correlations.
