Lesson 11: Principal Components Analysis (PCA)

Overview

Sometimes data are collected on a large number of variables from a single population. As an example, consider the Places Rated dataset below.

Example 11-1: Places Rated

In the Places Rated Almanac, Boyer and Savageau rated 329 communities according to the following nine criteria:

  1. Climate and Terrain
  2. Housing
  3. Health Care & the Environment
  4. Crime
  5. Transportation
  6. Education
  7. The Arts
  8. Recreation
  9. Economics

Note! Within the dataset, except for housing and crime, the higher the score the better. For housing and crime, the lower the score the better. Some communities might rate better in the arts, while others might rate better in other areas, such as a lower crime rate or good educational opportunities.

Objective

With a large number of variables, the dispersion (variance-covariance) matrix may be too large to study and interpret properly. There would be too many pairwise correlations between the variables to consider. Graphical displays may also not be particularly helpful when the data set is very large. With 12 variables, for example, there are 66 pairwise correlations and 220 possible three-dimensional scatterplots.
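The counts behind this argument are simple binomial coefficients: p variables give "p choose 2" pairwise correlations and "p choose 3" three-dimensional scatterplots. The short Python sketch below checks the figures for 12 variables and for the nine Places Rated criteria. The function name display_counts is hypothetical and is used here only for illustration.

```python
from math import comb


def display_counts(p):
    """Return (pairwise correlations, three-dimensional scatterplots) for p variables.

    Hypothetical helper for illustration: both counts are binomial coefficients.
    """
    return comb(p, 2), comb(p, 3)


# 9 = number of Places Rated criteria; 12 = the example used in the text.
for p in (9, 12):
    pairs, triples = display_counts(p)
    print(f"{p} variables: {pairs} pairwise correlations, "
          f"{triples} three-dimensional scatterplots")
```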

To interpret the data in a more meaningful form, it is necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component.
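As a rough illustration of what "a few linear combinations" means, the Python sketch below computes principal components from the eigenvectors of the sample variance-covariance matrix. It is only a sketch under stated assumptions: it uses NumPy rather than the SAS and Minitab procedures covered in this lesson, the data are synthetic (randomly generated with the same dimensions as Places Rated, not the actual ratings), and the function name principal_components is hypothetical.

```python
import numpy as np


def principal_components(X):
    """Sketch of PCA based on the sample variance-covariance matrix.

    Returns eigenvalues (component variances, largest first), eigenvectors
    (one column of coefficients per component), and component scores.
    """
    Xc = X - X.mean(axis=0)                # center each variable
    S = np.cov(Xc, rowvar=False)           # sample variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder so the largest comes first
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    scores = Xc @ eigvecs                  # each column is one linear combination
    return eigvals, eigvecs, scores


# Synthetic data (an assumption for illustration, NOT the Places Rated data):
# 329 observations on 9 variables driven mostly by 2 underlying dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(329, 2))
mixing = rng.normal(size=(2, 9))
X = latent @ mixing + 0.1 * rng.normal(size=(329, 9))

eigvals, eigvecs, scores = principal_components(X)
explained = eigvals / eigvals.sum()        # proportion of total variance
print(np.round(explained, 3))
```

In this synthetic setup the first two proportions should account for most of the total variation, which is the situation in which replacing nine variables by a couple of components is reasonable.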

(There is another very useful data reduction technique called Factor Analysis discussed in a subsequent lesson.)

Objectives

Upon completion of this lesson, you should be able to:

  • Perform a principal components analysis using SAS and Minitab;
  • Assess how many principal components are needed;
  • Interpret principal component scores and describe a subject with a high or low score;
  • Determine when a principal component analysis should be based on the variance-covariance matrix or the correlation matrix;
  • Use principal component scores in further analyses.