# 4.4 - Multivariate Normality and Outliers

## Q-Q Plot for Evaluating Multivariate Normality and Outliers

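If \(\textbf{x}\) has a multivariate normal distribution with mean vector \(\mathbf{\mu}\) and covariance matrix \(\Sigma\), the variable \(d^2 = (\textbf{x}-\mathbf{\mu})'\Sigma^{-1}(\textbf{x}-\mathbf{\mu})\) has a chi-square distribution with *p* degrees of freedom. For “large” samples, the observed Mahalanobis distances, computed with the sample mean vector and sample covariance matrix in place of \(\mathbf{\mu}\) and \(\Sigma\), have an approximate chi-square distribution. This result can be used to evaluate (subjectively) whether a data point may be an outlier and whether observed data may have a multivariate normal distribution.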
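A minimal sketch of how the observed distances might be computed, assuming the data are held as a list of rows and that the sample mean vector and sample covariance matrix (divisor *n* − 1) stand in for \(\mathbf{\mu}\) and \(\Sigma\). The function names and the toy data are illustrative only and are not part of the course materials; the sketch uses only the Python standard library.

```python
def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]


def sample_covariance(rows):
    """Sample covariance matrix with divisor n - 1."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    cov = [[0.0] * p for _ in range(p)]
    for r in rows:
        dev = [r[j] - m[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                cov[a][b] += dev[a] * dev[b]
    return [[c / (n - 1) for c in row] for row in cov]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            f = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * k
    for i in reversed(range(k)):
        s = aug[i][k] - sum(aug[i][j] * z[j] for j in range(i + 1, k))
        z[i] = s / aug[i][i]
    return z


def squared_mahalanobis(rows):
    """d^2 for every row, using the sample mean and sample covariance."""
    m = mean_vector(rows)
    s = sample_covariance(rows)
    out = []
    for r in rows:
        dev = [x - mj for x, mj in zip(r, m)]
        z = solve(s, dev)  # z = S^{-1} (x - xbar), without forming the inverse
        out.append(sum(d * zi for d, zi in zip(dev, z)))
    return out


# Toy data (hypothetical): n = 5 observations on p = 2 variables.
toy = [[2.0, 1.0], [3.0, 4.0], [5.0, 3.0], [6.0, 7.0], [9.0, 5.0]]
d2 = squared_mahalanobis(toy)
```

One handy sanity check on such a computation: when the covariance matrix uses the divisor *n* − 1, the squared distances for the sample always sum to (*n* − 1)*p*.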

A Q-Q plot can be used to picture the Mahalanobis distances for the sample. The basic idea is the same as for a normal probability plot. For multivariate data, we plot the ordered Mahalanobis distances versus estimated quantiles (percentiles) for a sample of size *n* from a chi-square distribution with *p* degrees of freedom. The plot should resemble a straight line for data from a multivariate normal distribution. Outliers will show up as points on the upper right side of the plot for which the Mahalanobis distance is notably greater than the chi-square quantile value.

**Determining the Quantiles**

- The \(i^{th}\) estimated quantile is determined as the chi-square value (with df = *p*) for which the cumulative probability is (*i* − 0.5) / *n*.
- To determine the full set of estimated chi-square quantiles, this is done for each value of *i* from 1 to *n*.
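The quantile rule above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the method used by any particular package: the chi-square CDF is evaluated with the standard series for the regularized lower incomplete gamma function, the quantile is found by bisection, and the names `chi2_cdf`, `chi2_quantile`, and `qq_pairs` are hypothetical helpers introduced here.

```python
import math


def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom."""
    if x <= 0.0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    # Series: P(a, t) = t^a e^{-t} / Gamma(a) * sum_k t^k / (a (a+1) ... (a+k))
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= t / (a + k)
        total += term
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))


def chi2_quantile(prob, df):
    """Chi-square value whose cumulative probability is prob (bisection)."""
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < prob:  # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def qq_pairs(d2, p):
    """Pair the i-th ordered distance with the quantile at (i - 0.5) / n."""
    n = len(d2)
    ordered = sorted(d2)
    return [(chi2_quantile((i - 0.5) / n, p), ordered[i - 1])
            for i in range(1, n + 1)]
```

Each pair returned by `qq_pairs` is one point of the Q-Q plot, with the chi-square quantile as the horizontal coordinate and the ordered distance as the vertical coordinate.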

## Example 4-2: Q-Q Plot for Board Stiffness Data

This example reproduces Example 4.14 in the text (page 187). For each of *n* = 30 boards, there are *p* = 4 measurements of board stiffness. Each measurement was done using a different method.

A SAS plot of the Mahalanobis distances is given below. The distances are on the vertical axis and the chi-square quantiles are on the horizontal axis. At the right side of the plot, the points bend upward. This indicates possible outliers (and a possible violation of multivariate normality). In particular, the final point has \(d^{2} \approx 16\), whereas the quantile value on the horizontal axis is about 12.5. The next-to-last point on the plot might also be an outlier. A printout of the distances, before they were ordered for the plot, shows that the two possible outliers are boards 16 and 9, respectively.

#### Using SAS

The SAS code used to produce the graph just given follows.

The data step reads the dataset.

Download the SAS program here: Q_Qplot.sas

The `proc princomp` step calculates the principal components and stores the standardized principal components in a dataset named pcresult.

The next data step calculates the Mahalanobis distances and keeps them in a dataset named mahal.

The `proc print` step prints the distances (with observation numbers).

Next, we sort the mahal dataset in order of the distances. This is to prepare for the Q-Q plot.

In the next data step, we compute estimated quantiles of a chi-square distribution with df = 4. In the `prb =` line, the value 30 is the sample size, and in the `cinv` function the value 4 is the df (because we have 4 variables).
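As a rough hand check on those two numbers, the largest quantile (*i* = 30) can be worked out directly, because a chi-square distribution with 4 degrees of freedom has a closed-form CDF. The closed form is a standard result for even degrees of freedom; it is used here only for illustration and is not something the SAS program relies on.

\[
F(x) = 1 - e^{-x/2}\left(1 + \frac{x}{2}\right), \qquad \text{prb}_{30} = \frac{30 - 0.5}{30} \approx 0.9833
\]

Solving \(F(x) = 0.9833\) numerically gives \(x \approx 12.09\), which is the value the `cinv` function should return, up to rounding, for the largest ordered distance.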

Finally, the `gplot` procedure plots distances versus chi-square quantiles.

#### Using Minitab

View the video below to walk through how to produce a Q-Q plot for the board stiffness dataset using Minitab.