Lesson 3: Graphical Display of Multivariate Data

Overview

An important step in the analysis of any dataset is Exploratory Data Analysis (EDA), including the graphical display of data.

Why do we look at graphical displays of the data? Graphical displays may help us to:

  • suggest a plausible model for the data,
  • assess the validity of model assumptions,
  • detect outliers, or
  • suggest plausible normalizing transformations.

Many multivariate methods assume that the data have a multivariate normal distribution. Exploratory data analysis through the graphical display of data may be used to assess the normality of data. If evidence is found that the data are not normally distributed, then graphical methods may be applied to determine appropriate normalizing transformations for the data.

In this course, we use SAS and Minitab to demonstrate graphical methods and, later, other applications. Where possible, SAS and Minitab diagrams are shown side by side. Where a diagram requires extensive instructions, separate tabs are provided for SAS and Minitab.

Objectives

Upon completion of this lesson, you should be able to:

  • Identify and interpret graphical methods for summarizing multivariate data including histograms, scatterplot matrices, and rotating 3-dimensional scatterplots;
  • Produce graphics using interactive data analysis in SAS and Minitab;
  • Understand when transformations of the data should be applied and what specific transformations should be considered;
  • Identify unusual observations (outliers), and understand the issues involved in handling outliers once they are detected.

3.1 - Graphical Methods

Example 3-1: Women’s Health Survey (Graphing)

Let us take a look again at the nutrition data. In 1985, the USDA commissioned a study of women’s nutrition. Nutrient intake was measured for a random sample of 737 women aged 25-50 years. The following variables were measured:

  • Calcium (mg)
  • Iron (mg)
  • Protein (g)
  • Vitamin A (μg)
  • Vitamin C (mg)

We can read the data with the SAS program below. Various transformed variables are also created at this step for inspection. Here are some different ways we could look at these data graphically using SAS (and Minitab).

Download the SAS program: nutrient2.sas

Univariate Cases

Using Histograms we can:

  • Assess Normality
  • Find Normalizing Transformations
  • Detect Outliers

Here we have a histogram (produced in SAS) for the daily intake of calcium. Note that the data appear to be skewed to the right, suggesting that calcium is not normally distributed. This suggests that a normalizing transformation should be considered.

Histogram for calcium (SAS output, the UNIVARIATE Procedure)

Common transformations include:

  • Square Root (often used with count data)
  • Quarter Root
  • Log (either natural or base 10)

The square root transformation is the weakest of the above transformations, while the log transformation is the strongest. In practice, it is generally a good idea to try all three transformations to see which appears to yield the most symmetric distribution.

The following shows histograms for the raw data (calcium), the square-root transformation (S_calciu), the quarter-root transformation (S_S_calc), and the log transformation (L_calciu). With increasingly stronger transformations of the data, the distribution shifts from being skewed to the right to being skewed to the left. Here, the square-root transformed data are still slightly skewed to the right, suggesting that the square-root transformation is not strong enough. In contrast, the log-transformed data are skewed to the left, suggesting that the log transformation is too strong. The quarter-root transformation results in the most symmetric distribution, suggesting that this transformation is the most appropriate for these data.

Histograms (four panels): calcium, L_calciu, S_calciu, S_S_calc
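
As a rough numerical companion to the histograms, the sample skewness can be compared across the candidate transformations. The sketch below is in Python rather than SAS or Minitab, uses only the standard library, and runs on made-up intake values, not the survey data. The function name sample_skewness and the dictionary labels are illustrative assumptions. A skewness near zero suggests a roughly symmetric distribution, positive values suggest right skew, and negative values suggest left skew.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed daily intakes (mg); NOT the survey data.
intake = [120, 250, 310, 400, 480, 520, 610, 700, 850, 1100, 1500, 2400]

# Candidate transformations, ordered from weakest to strongest.
transforms = {
    "raw": lambda x: x,
    "square root": lambda x: x ** 0.5,
    "quarter root": lambda x: x ** 0.25,
    "log": lambda x: math.log(x),
}

skewness = {
    name: sample_skewness([f(x) for x in intake])
    for name, f in transforms.items()
}

for name, value in skewness.items():
    print(f"{name:>12}: skewness = {value:+.3f}")
```

Skewness is only a single-number summary, so it is best read alongside the histograms rather than in place of them.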

In practice, histograms should be plotted for each of the variables, and transformations should be applied as needed. There is no 'best' transformation for all datasets.

Bivariate Cases

Using Scatter Plots we can:

  • Describe relationships between pairs of variables
  • Assess linearity
  • Find Linearizing Transformations
  • Detect Outliers

Here we have a scatterplot (produced in Minitab) in which calcium is plotted against iron. This plot suggests that the daily intake of calcium tends to increase with increasing daily intake of iron. If the data have a bivariate normal distribution, then the scatterplot should be approximately elliptical. However, the points appear to fan out from the origin, suggesting that the data are not bivariate normal.

scatterplot of iron vs calcium

After applying quarter-root transformations to calcium and iron, we obtain a scatter of points that appear more elliptical. Moreover, it appears that the relationship between the transformed variables is approximately linear. The point in the lower left-hand corner appears to be an unusual observation or outlier. Upon closer examination, it was found that this woman reported zero daily intake of iron. Since this is very unlikely to be correct, we might justifiably remove this observation from the data set.
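
A simple screen for implausible values, such as the zero iron intake noted above, can be sketched as follows. This is a Python illustration using only the standard library, with made-up records rather than the survey data. The record layout and the helper name flag_nonpositive are assumptions for this example. The sketch only flags records for review; whether to remove one remains a judgment call.

```python
# Hypothetical records (id, calcium in mg, iron in mg); NOT the survey data.
records = [
    {"id": 1, "calcium": 620.0, "iron": 11.2},
    {"id": 2, "calcium": 480.0, "iron": 0.0},   # zero intake: implausible
    {"id": 3, "calcium": 1310.0, "iron": 18.5},
    {"id": 4, "calcium": 95.0, "iron": 4.1},
]


def flag_nonpositive(rows, variables):
    """Return (id, variable) pairs whose recorded value is zero or negative."""
    return [
        (row["id"], var)
        for row in rows
        for var in variables
        if row[var] <= 0
    ]


flagged = flag_nonpositive(records, ["calcium", "iron"])

# Report the flagged records so they can be checked against field notes.
for record_id, var in flagged:
    print(f"record {record_id}: {var} is zero or negative, review before analysis")
```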


Outliers

Note! It is not appropriate to remove an observation from the data just because it is an outlier. Consider, for example, the ozone hole in the Antarctic. For years, NASA had been flying polar-orbiting satellites designed to measure ozone in the upper atmosphere without detecting an ozone hole. Then, one day, a scientist visiting the Antarctic pointed an instrument straight up into the sky and found evidence of an ozone hole. What happened? It turned out that the software used to process the NASA satellite data had a routine for automatically removing outliers. In this case, all observations with unusually low ozone levels were automatically removed by this routine. A close review of the raw, unprocessed data confirmed that there was an ozone hole.

The above is a special case, where the outliers are the most interesting observations. In general, outliers are removed only if there is a compelling reason to believe that something is wrong with the individual observations; e.g., if the observation is deemed to be impossible, as in the case of a zero daily intake of iron. This underscores the need to have good field or lab notes with details on the data collection process. Lab notes may indicate that something went wrong with an individual observation; e.g., a laboratory sample may have been dropped on the floor, leading to contamination. If such a sample results in an outlier, then that sample may legitimately be removed from the data.

Outliers often have a greater influence on the results of data analyses than the remaining observations. For example, outliers have a strong influence on the calculation of the sample mean. If outliers are detected, and there is no corroborating evidence to suggest that they should be removed, then resistant statistical techniques should be applied. By resistant techniques, we mean techniques that are not easily influenced by outliers. For example, the sample median is not sensitive to outliers, and so may be calculated in place of the sample mean if we suspect that the sample mean gives a misleading picture. Outlier-resistant methods go well beyond the scope of this course. If outliers are detected, then you should consult with a statistician.
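
The difference in sensitivity between the sample mean and the sample median can be seen in a small sketch. It is written in Python with the standard library's statistics module, and the intake values are made up for illustration, not taken from the survey. It adds one unusually large value to a sample and reports how far each summary moves.

```python
import statistics

# Hypothetical daily iron intakes (mg); NOT the survey data.
typical = [9.5, 10.1, 11.0, 11.8, 12.4, 13.0, 14.2]
with_outlier = typical + [95.0]  # one unusually large value added

# How far does each summary move when the outlier is included?
mean_shift = statistics.mean(with_outlier) - statistics.mean(typical)
median_shift = statistics.median(with_outlier) - statistics.median(typical)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

The mean moves much further than the median because every observation contributes its full value to the mean, while the median depends only on the middle of the ordered sample.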

Trivariate Cases

Using Rotating Scatter Plots we can:

  • Describe relationships among three variables
  • Detect Outliers

Using Technology

 

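options ls=78;
title "Example: Nutrient Intake Data - Descriptive Statistics";

/* Read the nutrient data and create transformed variables for inspection */
data nutrient;
  infile "D:\Statistics\STAT 505\data\nutrient.csv" firstobs=2 delimiter=',';
  input id calcium iron protein a c;   /* a = vitamin A, c = vitamin C */
  L_calciu = log(calcium);             /* natural log */
  S_calciu = calcium**.5;              /* square root */
  S_S_calc = calcium**.25;             /* quarter root */
  L_iron = log(iron);
  S_S_iron = iron**.25;
  L_prot = log(protein);
  S_S_prot = protein**.25;
  S_S_a = a**.25;
  S_S_c = c**.25;
run;

/* Histograms of calcium on the raw, square-root, quarter-root, and log scales */
proc univariate data=nutrient;
  histogram calcium S_calciu S_S_calc L_calciu;
run;

/* 3-D scatterplot of calcium by iron and protein */
proc g3d data=nutrient;
  scatter iron*protein=calcium / rotate=60;
run;
quit;

/* Correlations and scatterplot matrix for the quarter-root variables */
proc corr data=nutrient plots(maxpoints=75000)=matrix;
  var S_S_calc S_S_iron S_S_prot S_S_a S_S_c;
run;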

Using rotating scatter plots in SAS:

3-D scatter plot of calcium by iron and protein

By rotating a 3-dimensional scatterplot, the illusion of three dimensions can be achieved. Here, we are looking to see if the cloud of points is approximately elliptical in shape.

Creating a 3D Scatter plot in Minitab for L_calc, L_iron and L_prot.

  1. Select Graph > 3D Scatter Plot
  2. The default is already Simple, so click OK.
  3. In Z, enter L_iron. In Y, enter L_prot. In X, enter L_calc.
  4. Click OK.

Note: The plot (shown below) can be rotated using the 3D Graph Tools that appear with the plot. If the toolbar does not appear, choose Tools > Toolbars and check 3D Graph Tools.

Minitab 3D scatterplot with the 3D Graph Tools

 

Multivariate Cases

Using a Matrix of Scatter Plots we can:

  • Look at all of the relationships between pairs of variables in one group of plots
  • Describe relationships among three or more variables

Here, we have a matrix of scatterplots for quarter-root transformed data on all variables. Note that each variable appears to be positively related to the remaining variables. However, the strength of that relationship depends on which pair of variables is considered. For example, quarter-root iron is strongly related to quarter-root protein, but the relationship between quarter-root calcium and quarter-root vitamin C is not very strong.
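
The pairwise relationships shown in a scatterplot matrix can also be summarized numerically with a correlation matrix. The sketch below uses Python and the standard library only, with made-up intakes rather than the survey data. The helper pearson and the variable names are assumptions for this example. Each variable is quarter-root transformed before the correlations are computed, mirroring the plots.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical intakes for six women; NOT the survey data.
data = {
    "calcium": [320.0, 540.0, 610.0, 800.0, 950.0, 1400.0],
    "iron": [6.0, 9.5, 8.0, 12.0, 15.5, 20.0],
    "protein": [38.0, 55.0, 52.0, 70.0, 84.0, 110.0],
}

# Quarter-root transform each variable, as in the scatterplot matrix.
transformed = {name: [v ** 0.25 for v in values] for name, values in data.items()}

names = list(transformed)
corr = {
    (a, b): pearson(transformed[a], transformed[b])
    for a in names
    for b in names
}

# Print the matrix row by row.
for a in names:
    print(a.ljust(8), "  ".join(f"{corr[(a, b)]:+.2f}" for b in names))
```

A correlation coefficient measures only the linear part of a relationship, so the plots are still needed to check for curvature and outliers.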

Using Technology

Matrix of scatterplots generated using SAS.

/* S_S_a and S_S_c are the quarter-root vitamin A and vitamin C variables from the data step */
proc sgscatter data=nutrient;
  title "Scatterplot Matrix for Nutrition Data";
  matrix S_S_calc S_S_iron S_S_prot S_S_a S_S_c;
run;

Scatter Plot Matrix

Creating a matrix of scatterplots for S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC in Minitab.

  1. Select Graph > Matrix Plot
  2. The default is already Simple, so click OK.
  3. Under Graph variables, enter S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC.
  4. Click OK.

A matrix plot for S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC


3.2 - Summary

In this lesson we learned:

  • How to interpret graphical displays of multivariate data;
  • How to determine the most appropriate normalizing transformation of the data;
  • How to detect outliers;
  • How to use software to produce multivariate graphics;
  • When outliers should be removed from the data, and when they should be retained.
