Lesson 3: Graphical Display of Multivariate Data
Overview
An important step in the analysis of any dataset is Exploratory Data Analysis (EDA), including the graphical display of data.
Why do we look at graphical displays of the data? Graphical displays may:
- suggest a plausible model for the data,
- help assess the validity of model assumptions,
- reveal outliers, or
- suggest plausible normalizing transformations.
Many multivariate methods assume that the data have a multivariate normal distribution. Exploratory data analysis through the graphical display of data may be used to assess the normality of data. If evidence is found that the data are not normally distributed, then graphical methods may be applied to determine appropriate normalizing transformations for the data.
In this course, we will use SAS and Minitab to demonstrate graphical methods, and later for other applications. SAS and Minitab output is shown side by side where possible. Where a display requires extensive instructions, they are given separately for SAS and Minitab.
Objectives
- Identify and interpret graphical methods for summarizing multivariate data including histograms, scatterplot matrices, and rotating 3-dimensional scatterplots;
- Produce graphics using interactive data analysis in SAS and Minitab;
- Understand when transformations of the data should be applied and what specific transformations should be considered;
- Learn how to identify unusual observations (outliers), and understand issues regarding how outliers should be handled if they are detected.
3.1 - Graphical Methods
Example 3-1: Women’s Health Survey (Graphing)
Let us take a look again at the nutrition data. In 1985, the USDA commissioned a study of women’s nutrition. Nutrient intake was measured for a random sample of 737 women aged 25-50 years. The following variables were measured:
- Calcium (mg)
- Iron (mg)
- Protein (g)
- Vitamin A (μg)
- Vitamin C (mg)
The SAS program below reads the data and, in the same step, creates several transformed variables for inspection. Here are some different ways we could look at these data graphically using SAS (and Minitab).
Download the SAS program: nutrient2.sas
Univariate Cases
Using Histograms we can:
- Assess Normality
- Find Normalizing Transformations
- Detect Outliers
Here we have a histogram (produced in SAS) for the daily intake of calcium. Note that the data appear to be skewed to the right, suggesting that calcium is not normally distributed. This suggests that a normalizing transformation should be considered.
Histogram of calcium (output of the UNIVARIATE procedure)
Common transformations include:
- Square Root (often used with count data)
- Quarter Root
- Log (either natural or base 10)
The square root transformation is the weakest of the above transformations, while the log transformation is the strongest. In practice, it is generally a good idea to try all three transformations to see which appears to yield the most symmetric distribution.
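The ordering from weakest to strongest can be seen with a small worked example. The two values below (100 and 10000) are hypothetical and chosen only for easy arithmetic; they are not taken from the nutrient data. The last column shows how far apart the two values remain, as a ratio, after each transformation. A stronger transformation pulls large values in more, which is what reduces right skew.

```latex
\[
\begin{array}{lccc}
\text{Transformation} & x = 100 & x = 10000 & \text{ratio} \\
\hline
x \ \text{(raw)}  & 100          & 10000 & 100 \\
x^{1/2}           & 10           & 100   & 10 \\
x^{1/4}           & \approx 3.16 & 10    & \approx 3.16 \\
\log_{10} x       & 2            & 4     & 2
\end{array}
\]
```

The raw values differ by a factor of 100, the square roots by a factor of 10, the quarter roots by a factor of about 3.16, and the base-10 logs by a factor of only 2.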
The following shows histograms for the raw data (calcium), square-root transformation (S_calciu), quarter-root transformation (S_S_calc), and log transformation (L_calciu). With increasingly stronger transformations of the data, the distribution shifts from being skewed to the right to being skewed to the left. Here, the square-root transformed data are still slightly skewed to the right, suggesting that the square-root transformation is not strong enough. In contrast, the log-transformed data are skewed to the left, suggesting that the log transformation is too strong. The quarter-root transformation results in the most symmetric distribution, suggesting that this transformation is the most appropriate for these data.
In practice, histograms should be plotted for each of the variables, and transformations should be applied as needed. There is no 'best' transformation for all datasets.
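As a rough numeric complement to the histograms, the sample skewness can be compared across the candidate transformations. The following is a minimal sketch, assuming the nutrient dataset and the variable names created by the data step in this lesson. A skewness near zero is consistent with a roughly symmetric distribution, but this is only a supplement to looking at the plots, not a formal test of normality.

```sas
/* Sketch: compare sample skewness across candidate transformations. */
/* Assumes the nutrient dataset and variable names from this lesson. */
proc means data=nutrient n mean skewness maxdec=3;
  var calcium S_calciu S_S_calc L_calciu;
run;
```

Positive skewness corresponds to a long right tail and negative skewness to a long left tail, so the sign can be read alongside the histograms for each variable.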
Bivariate Cases
Using Scatter Plots we can:
- Describe relationships between pairs of variables
- Assess linearity
- Find Linearizing Transformations
- Detect Outliers
Here we have a scatterplot (produced in Minitab) in which calcium is plotted against iron. This plot suggests that the daily intake of calcium tends to increase with the daily intake of iron. If the data have a bivariate normal distribution, then the cloud of points should be approximately elliptical. However, the points appear to fan out from the origin, suggesting that the data are not bivariate normal.
After applying quarter-root transformations to calcium and iron, we obtain a scatter of points that appear more elliptical. Moreover, it appears that the relationship between the transformed variables is approximately linear. The point in the lower left-hand corner appears to be an unusual observation or outlier. Upon closer examination, it was found that this woman reported zero daily intake of iron. Since this is very unlikely to be correct, we might justifiably remove this observation from the data set.
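A comparable plot can be sketched in SAS, together with a listing of any records that report zero iron intake. This assumes the nutrient dataset and variable names from the data step in this lesson; it is one possible way to do this, not the procedure used to produce the Minitab plot described above.

```sas
/* Sketch: scatter plot of quarter-root calcium against quarter-root iron. */
proc sgplot data=nutrient;
  scatter x=S_S_iron y=S_S_calc;
run;

/* List records with a reported iron intake of zero for closer review. */
proc print data=nutrient;
  where iron = 0;
  var id calcium iron protein a c;
run;
```

Listing the suspect records before removing anything keeps the decision to drop an observation explicit and reviewable.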
Outliers
In some situations, the outliers are themselves the most interesting observations. In general, outliers should be removed only if there is a compelling reason to believe that something is wrong with the individual observation, for example, if the value is deemed impossible, as with a reported zero daily intake of iron. This underscores the need for good field or lab notes with details on the data collection process. Such notes may indicate that something went wrong with an individual observation; for example, a laboratory sample may have been dropped on the floor, leading to contamination. If such a sample results in an outlier, then that sample may legitimately be removed from the data.
Outliers often have a greater influence on the results of data analyses than the remaining observations. For example, outliers have a strong influence on the calculation of the sample mean. If outliers are detected, and there is no corroborating evidence to suggest that they should be removed, then resistant statistical techniques should be applied. Here, by resistant techniques, we mean techniques or processes that are not easily influenced by outliers. For example, the sample median is not sensitive to outliers, and so may be calculated in place of the sample mean, if we believe that there is a possibility that the sample mean may give a wrong picture. Outlier-resistant methods go well beyond the scope of this course. If outliers are detected, then you should consult with a statistician.
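To see how much outliers or skewness pull on the mean, the sample mean and sample median can be printed side by side. This is a minimal sketch, assuming the nutrient dataset from the data step in this lesson; a large gap between the two statistics for a variable is a hint, not proof, that a few extreme values are influencing the mean.

```sas
/* Sketch: compare the sample mean with the outlier-resistant median. */
proc means data=nutrient n mean median maxdec=2;
  var calcium iron protein a c;
run;
```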
Trivariate Cases
Using Rotating Scatter Plots we can:
- Describe relationships among three variables
- Detect Outliers
Using Technology
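options ls=78;
title "Example: Nutrient Intake Data - Descriptive Statistics";

/* Read the raw data and create transformed versions of each variable. */
/* Note: log() returns a missing value when its argument is zero.      */
data nutrient;
  infile "D:\Statistics\STAT 505\data\nutrient.csv" firstobs=2 delimiter=',';
  input id calcium iron protein a c;
  /* Calcium: log, square root, and quarter root */
  L_calciu = log(calcium);
  S_calciu = calcium**.5;
  S_S_calc = calcium**.25;
  /* Iron and protein: log and quarter root */
  L_iron = log(iron);
  S_S_iron = iron**.25;
  L_prot = log(protein);
  S_S_prot = protein**.25;
  /* Vitamins A and C: quarter root */
  S_S_a = a**.25;
  S_S_c = c**.25;
run;

/* Histograms of calcium on the raw and transformed scales */
proc univariate data=nutrient;
  histogram calcium S_calciu S_S_calc L_calciu;
run;

/* 3-D scatter plot of calcium by iron and protein */
proc g3d data=nutrient;
  scatter iron*protein=calcium / rotate=60;
run;
quit;

/* Correlations and scatterplot matrix for the quarter-root variables */
proc corr data=nutrient plots(maxpoints=75000)=matrix;
  var S_S_calc S_S_iron S_S_prot S_S_a S_S_c;
run;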
Using rotating scatter plots in SAS:
3-D scatter plot of calcium by iron and protein
By rotating a 3-dimensional scatterplot, the illusion of three dimensions can be achieved. Here, we are looking to see if the cloud of points is approximately elliptical in shape.
Creating a 3D Scatter plot in Minitab for L_calc, L_iron and L_prot.
- Select Graph > 3D Scatter Plot
- The default is already Simple, so click OK.
- In Z, enter L_iron. In Y, enter L_prot. In X, enter L_calc.
- Click OK.
Note: The plot (shown below) can be rotated using the 3D Graph tools that appear with the plot. If it does not appear, choose Tools > Toolbars and check 3D Graph Tools.
Multivariate Cases
Using a Matrix of Scatter Plots we can:
- Look at all of the relationships between pairs of variables in one group of plots
- Describe relationships among three or more variables
Here, we have a matrix of scatterplots for quarter-root transformed data on all variables. Note that each variable appears to be positively related to the remaining variables. However, the strength of that relationship depends on which pair of variables is considered. For example, quarter-root iron is strongly related to quarter-root protein, but the relationship between quarter-root calcium and quarter-root vitamin C is not very strong.
Using Technology
Matrix of scatterplots generated using SAS.
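/* Variable names match those created in the data step above */
/* (S_S_a and S_S_c are quarter-root vitamin A and vitamin C). */
proc sgscatter data=nutrient;
  title "Scatterplot Matrix for Nutrition Data";
  matrix S_S_calc S_S_iron S_S_prot S_S_a S_S_c;
run;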
Creating a matrix of scatterplots for S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC in Minitab.
- Select Graph > Matrix Plot
- The default is already Simple, so click OK.
- Under Graph variables, enter S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC.
- Click OK.
A matrix plot for S_S_calc, S_S_iron, S_S_prot, S_S_vitA, and S_S_vitC
3.2 - Summary
In this lesson we learned about:
- How to interpret graphical displays of multivariate data;
- How to determine the most appropriate normalizing transformation of the data;
- How to detect outliers;
- How to use software to produce multivariate graphics;
- Issues regarding when outliers should be removed from the data, and when they should be retained.