4.4 - Multivariate Normality and Outliers

Q-Q Plot for Evaluating Multivariate Normality and Outliers

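If \(\textbf{x}\) has a multivariate normal distribution with mean vector \(\mathbf{\mu}\) and covariance matrix \(\Sigma\), the variable \(d^2 = (\textbf{x}-\mathbf{\mu})'\Sigma^{-1}(\textbf{x}-\mathbf{\mu})\) has a chi-square distribution with p degrees of freedom. For “large” samples, the observed Mahalanobis distances, computed with the sample mean vector and sample covariance matrix in place of \(\mathbf{\mu}\) and \(\Sigma\), have an approximate chi-square distribution. This result can be used to evaluate (subjectively) whether a data point may be an outlier and whether observed data may have a multivariate normal distribution.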
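As a rough illustration of the observed distances, the following Python sketch computes \(d^2\) for every row of a data set, using the sample mean vector and the sample covariance matrix (divisor n - 1). It is not part of the course's SAS program. The function names and the toy data are made up for illustration, and only the standard library is used so that the sketch runs anywhere; in practice a linear algebra library such as NumPy would be the more usual choice.

```python
from statistics import fmean


def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    return [fmean(col) for col in zip(*rows)]


def sample_covariance(rows):
    """Sample covariance matrix with divisor n - 1."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    centered = [[x - mj for x, mj in zip(row, m)] for row in rows]
    return [[sum(r[j] * r[k] for r in centered) / (n - 1) for k in range(p)]
            for j in range(p)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each row from the sample mean."""
    m = mean_vector(rows)
    s = sample_covariance(rows)
    distances = []
    for row in rows:
        diff = [x - mj for x, mj in zip(row, m)]
        w = solve(s, diff)  # w = S^{-1} (x - xbar), without forming S^{-1}
        distances.append(sum(d * wi for d, wi in zip(diff, w)))
    return distances


# Toy data (made up, NOT the board stiffness data): n = 5 rows, p = 2 columns.
toy_rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [6.0, 7.0]]
toy_d2 = mahalanobis_sq(toy_rows)
```

A handy sanity check on any implementation: when the sample covariance matrix uses the divisor n - 1, the squared distances always sum to (n - 1)p.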

A Q-Q plot can be used to display the Mahalanobis distances for the sample. The basic idea is the same as for a normal probability plot. For multivariate data, we plot the ordered Mahalanobis distances versus estimated quantiles (percentiles) for a sample of size n from a chi-square distribution with p degrees of freedom. The plot should resemble a straight line for data from a multivariate normal distribution. Outliers will show up as points on the upper right side of the plot for which the Mahalanobis distance is notably greater than the chi-square quantile value.

Determining the Quantiles

  • The \(i^{th}\) estimated quantile is determined as the chi-square value (with df = p) for which the cumulative probability is (i - 0.5) / n.
  • To determine the full set of estimated chi-square quantiles, this is done for each value of i from 1 to n.
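The two steps above can be sketched in Python as follows. This is an illustrative sketch, not the course's SAS code: the function names are hypothetical, and the chi-square quantile is obtained by bisection on a series expansion of the cumulative distribution function so that only the standard library is needed. With SciPy installed, `scipy.stats.chi2.ppf` would normally be used instead.

```python
import math


def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma
    function, which is adequate for the moderate x values seen in Q-Q plots.
    """
    if x <= 0:
        return 0.0
    a = df / 2.0
    z = x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > total * 1e-15:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_quantile(prob, df):
    """Value x with chi2_cdf(x, df) = prob, for 0 < prob < 1, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < prob:  # expand the bracket until it contains x
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def qq_quantiles(n, p):
    """Estimated chi-square quantiles at probabilities (i - 0.5) / n, i = 1..n."""
    return [chi2_quantile((i - 0.5) / n, p) for i in range(1, n + 1)]


# Example call: quantiles for a sample of n = 30 observations on p = 4 variables.
quantiles = qq_quantiles(30, 4)
```

The returned list is already in increasing order, so it can be paired directly with the sorted Mahalanobis distances.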

Example 4-2: Q-Q Plot for Board Stiffness Data

This example reproduces Example 4.14 in the text (page 187). For each of n = 30 boards, there are p = 4 measurements of board stiffness. Each measurement was done using a different method.

A SAS plot of the Mahalanobis distances is given below. The distances are on the vertical axis and the chi-square quantiles are on the horizontal axis. At the right side of the plot, the points bend upward. This indicates possible outliers (and a possible violation of multivariate normality). In particular, the final point has \(d^{2} \approx 16\) whereas the quantile value on the horizontal axis is about 12.1. The next-to-last point on the plot might also be an outlier. A printout of the distances, before they were ordered for the plot, shows that the two possible outliers are boards 16 and 9, respectively.

Figure: SAS plot of the Mahalanobis distances (vertical axis) versus chi-square quantiles (horizontal axis)

Using SAS

The SAS code used to produce the graph above works through the following steps.

The data step reads the dataset.

Download the SAS program here: Q_Qplot.sas

The proc princomp calculates the principal components and stores the standardized principal components in a dataset named pcresult.

The next data step calculates the Mahalanobis distances and keeps them in a dataset named mahal.
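The reason principal components are useful here can be shown with a short derivation. It assumes that "standardized" means each principal component score is scaled to unit variance; the symbols \(S\) (sample covariance matrix), \(\lambda_k\), \(\mathbf{e}_k\) (its eigenvalues and orthonormal eigenvectors), and \(z_k\) are introduced for this sketch. Because the program itself is not reproduced here, this shows the reasoning, not the exact statements in the program. From the spectral decomposition of \(S\),

\[
S = \sum_{k=1}^{p} \lambda_k \mathbf{e}_k \mathbf{e}_k', \qquad
S^{-1} = \sum_{k=1}^{p} \frac{1}{\lambda_k} \mathbf{e}_k \mathbf{e}_k',
\]

so the observed squared distance for an observation \(\textbf{x}\) is

\[
d^2 = (\textbf{x}-\bar{\textbf{x}})' S^{-1} (\textbf{x}-\bar{\textbf{x}})
    = \sum_{k=1}^{p} \frac{\left(\mathbf{e}_k'(\textbf{x}-\bar{\textbf{x}})\right)^2}{\lambda_k}
    = \sum_{k=1}^{p} z_k^2,
\qquad z_k = \frac{\mathbf{e}_k'(\textbf{x}-\bar{\textbf{x}})}{\sqrt{\lambda_k}}.
\]

In words, the squared Mahalanobis distance equals the sum of the squared standardized principal component scores. The same identity holds when the components are extracted from the correlation matrix, because the Mahalanobis distance is unchanged by rescaling the individual variables.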

The proc print will print the distances (with observation numbers).

Next, we sort the mahal dataset in order of the distances. This is to prepare for the Q-Q plot.

In the next data step, we compute estimated quantiles of a chi-square distribution with df = 4. In the prb = line, the value 30 is the sample size and in the cinv function the value 4 is the df (because we have 4 variables).
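The sort step and the probability calculation can be mirrored in a few lines of Python. This is a sketch under stated assumptions: `qq_pairs` is a hypothetical helper and the distances in the example are made up, not the board stiffness results. Keeping the original observation number alongside each distance makes it possible to trace an extreme point on the plot back to its row in the data.

```python
def qq_pairs(distances):
    """Sort squared distances and attach the probability (rank - 0.5) / n.

    Returns a list of (observation number, distance, cumulative probability)
    tuples, ordered from the smallest to the largest distance. Observation
    numbers start at 1 and refer to the order of the input list.
    """
    n = len(distances)
    ordered = sorted(enumerate(distances, start=1), key=lambda pair: pair[1])
    return [(obs, d2, (rank - 0.5) / n)
            for rank, (obs, d2) in enumerate(ordered, start=1)]


# Made-up distances for three observations, for illustration only.
for obs, d2, prob in qq_pairs([2.0, 9.5, 0.7]):
    print(obs, d2, round(prob, 4))
```

Each cumulative probability would then be converted to a chi-square quantile with df equal to the number of variables, which is the role the cinv function plays in the SAS program.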

Finally, the gplot procedure plots distances versus chi-square quantiles.

Using Minitab

View the video below to walk through how to produce a Q-Q plot for the board stiffness dataset using Minitab.

