11.3 - Example: Places Rated

Example 11-2: Places Rated

We will use the Places Rated Almanac data (Boyer and Savageau) which rates 329 communities according to nine criteria:

  1. Climate and Terrain
  2. Housing
  3. Health Care & Environment
  4. Crime
  5. Transportation
  6. Education
  7. The Arts
  8. Recreation
  9. Economics

 

Notes

  • The data for many of the variables are strongly skewed to the right.
  • The log transformation was used to normalize the data.
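The effect of the log transformation on right-skewed data can be illustrated with a small sketch. The ratings below are made-up numbers, not values from the Places Rated data, and the use of base-10 logarithms is an assumption; the notes above do not say which base was used.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# HYPOTHETICAL right-skewed ratings (not from places.txt).
raw = [100, 150, 200, 300, 500, 1000, 5000]

# ASSUMPTION: base-10 logarithm; any base reduces the skew in the same way.
logged = [math.log10(v) for v in raw]

skew_raw = skewness(raw)
skew_logged = skewness(logged)

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_logged:.2f}")
```

The skewness of the logged values is noticeably smaller than that of the raw values, which is the sense in which the transformation makes the data closer to normal.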

Download the text file that contains the data here: places.txt

Using SAS

The SAS program will implement the principal component procedures:

Download the SAS program here: places.sas

View the video explanation of the SAS code.

When you examine the output, the first thing that SAS does is provide summary information. There are 329 observations representing the 329 communities in our dataset and 9 variables. This is followed by simple statistics that report the means and standard deviations for each variable.

Below this is the variance-covariance matrix for the data. You should be able to see that the variance reported for climate is 0.01289.

What we really need to focus on here are the eigenvalues of the variance-covariance matrix. In the SAS output, the eigenvalues are ranked from largest to smallest. These values appear in Table 1 below for discussion.

Using Minitab

View the video below to see how to perform a principal components analysis of the places_rated.txt data using the Minitab statistical software application.

Data Analysis

Step 1: Examine the eigenvalues to determine how many principal components should be considered:

Table 1. Eigenvalues and the proportion of variation explained by the principal components.

Component Eigenvalue Proportion Cumulative
1 0.3775 0.7227 0.7227
2 0.0511 0.0977 0.8204
3 0.0279 0.0535 0.8739
4 0.0230 0.0440 0.9178
5 0.0168 0.0321 0.9500
6 0.0120 0.0229 0.9728
7 0.0085 0.0162 0.9890
8 0.0039 0.0075 0.9966
9 0.0018 0.0034 1.0000
Total 0.5225    

If you add up all of these eigenvalues, you get the total variance of 0.5225.

The proportion of variation explained by each eigenvalue is given in the third column. For example, 0.3775 divided by 0.5225 is approximately 0.72, so about 72% of the variation is explained by the first eigenvalue. The cumulative proportion is the running total of the successive proportions of variation explained. For instance, 0.7227 plus 0.0977 equals 0.8204, and so forth. Therefore, about 82% of the variation is explained by the first two eigenvalues together.
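The arithmetic behind Table 1 can be checked with a short Python sketch. The eigenvalues are copied from the table as rounded to four decimal places, so the computed proportions may differ from the printed ones in the last digit. The variable names are illustrative only.

```python
# Eigenvalues from Table 1, rounded to four decimal places.
eigenvalues = [0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
               0.0120, 0.0085, 0.0039, 0.0018]

# The total variance is the sum of the eigenvalues.
total_variance = sum(eigenvalues)

# Proportion of variation explained by each component.
proportions = [ev / total_variance for ev in eigenvalues]

# Cumulative proportion: running total of the proportions.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

print(f"Total variance: {total_variance:.4f}")
print("Component  Eigenvalue  Proportion  Cumulative")
for i, (ev, p, c) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"{i:<9}  {ev:.4f}      {p:.4f}      {c:.4f}")
```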

Next, we need to look at successive differences between the eigenvalues. Subtracting the second eigenvalue, 0.0511, from the first eigenvalue, 0.3775, we get a difference of 0.3264. The difference between the second and third eigenvalues is 0.0232; the next difference is 0.0049. Subsequent differences are similarly small. A sharp drop from one eigenvalue to the next may serve as another indicator of how many eigenvalues to consider.
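The successive differences can be listed in a few lines of Python. This is a minimal sketch using the rounded eigenvalues from Table 1; the names are illustrative only.

```python
# Eigenvalues from Table 1, ordered from largest to smallest.
eigenvalues = [0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
               0.0120, 0.0085, 0.0039, 0.0018]

# Difference between each eigenvalue and the next one.
differences = [round(a - b, 4) for a, b in zip(eigenvalues, eigenvalues[1:])]

for i, d in enumerate(differences, start=1):
    print(f"eigenvalue {i} - eigenvalue {i + 1} = {d:.4f}")

# Position (1-based) of the sharpest drop.
sharpest_drop = differences.index(max(differences)) + 1
print(f"Sharpest drop follows component {sharpest_drop}")
```

The first difference dwarfs all the others, which are each well under 0.03.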

The first three principal components explain 87% of the variation. This is an acceptably large percentage.

An alternative method for determining the number of principal components is to look at a scree plot. With the eigenvalues ordered from largest to smallest, a scree plot is the plot of \(\hat{\lambda}_i\) versus \(i\). The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size. The following plot is made in Minitab.

Figure: The scree plot for the variables without standardization (covariance matrix).

As you can see, we could have stopped at the second principal component, but we continued to the third. Relatively speaking, the contribution of the third component is small compared to that of the second component.

Step 2: Next, we compute the principal component scores. For example, the first principal component can be computed using the elements of the first eigenvector:

\begin{align}\hat{Y}_1 & =  0.0351 \times (\text{climate}) + 0.0933 \times (\text{housing}) + 0.4078 \times (\text{health})\\ & + 0.1004 \times (\text{crime}) + 0.1501 \times (\text{transportation}) + 0.0321 \times (\text{education}) \\ & + 0.8743 \times (\text{arts}) + 0.1590 \times (\text{recreation}) + 0.0195 \times (\text{economy})\end{align}

In order to complete this formula and compute the principal component for the individual community of interest, plug in that community's values for each of these variables. A fairly standard procedure is to use the difference between the variables and their sample means rather than the raw data. This is known as a translation of the random variables. Translation does not affect the interpretations because the variances of the original variables are the same as those of the translated variables.
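As a sketch of this computation, the Python code below evaluates the first principal component score for one community using mean-centered values. The coefficients are those of the formula above; the community's values and the sample means are hypothetical placeholders, not numbers from the data file.

```python
# Coefficients of the first eigenvector, in the order used in the formula.
variables = ["climate", "housing", "health", "crime", "transportation",
             "education", "arts", "recreation", "economy"]
coefficients = [0.0351, 0.0933, 0.4078, 0.1004, 0.1501,
                0.0321, 0.8743, 0.1590, 0.0195]

# HYPOTHETICAL log-scale values for one community (not from the data file).
community = [2.72, 3.79, 2.37, 2.96, 3.61, 3.44, 3.00, 3.15, 3.88]
# HYPOTHETICAL sample means of the nine log-transformed variables.
sample_means = [2.72, 3.90, 2.95, 2.95, 3.60, 3.44, 3.20, 3.24, 3.74]

def pc_score(values, means, coefs):
    """Weighted sum of the differences between values and means."""
    return sum(c * (v - m) for c, v, m in zip(coefs, values, means))

# Score computed from mean-centered (translated) values.
score_1 = pc_score(community, sample_means, coefficients)

# Translation only shifts every score by the same constant:
# the uncentered score minus the score of the mean vector.
zeros = [0.0] * len(variables)
raw_score = pc_score(community, zeros, coefficients)
mean_score = pc_score(sample_means, zeros, coefficients)

print(f"Centered first-component score: {score_1:.4f}")
print(f"Raw score minus score of means: {raw_score - mean_score:.4f}")
```

Because centering shifts every community's score by the same constant, the spread of the scores, and hence their variance, is unchanged.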

The magnitudes of the coefficients give the contributions of each variable to that component. However, the magnitudes of the coefficients also depend on the variances of the corresponding variables.
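A two-variable illustration, separate from the Places Rated data, shows this dependence. The sketch below uses the closed-form angle of the first eigenvector of a 2 × 2 covariance matrix; the function name and the variances and covariances are hypothetical choices made for the example.

```python
import math

def first_eigenvector_2x2(var_x, var_y, cov_xy):
    """Unit eigenvector for the largest eigenvalue of
    [[var_x, cov_xy], [cov_xy, var_y]], via the principal-axis angle."""
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return math.cos(theta), math.sin(theta)

# HYPOTHETICAL case 1: equal variances (1 and 1), covariance 0.5.
equal = first_eigenvector_2x2(1.0, 1.0, 0.5)

# HYPOTHETICAL case 2: the same variables after multiplying x by 10,
# so var(x) = 100 and cov(x, y) = 5, while var(y) stays at 1.
rescaled = first_eigenvector_2x2(100.0, 1.0, 5.0)

print(f"Equal variances:  x = {equal[0]:.4f}, y = {equal[1]:.4f}")
print(f"x rescaled by 10: x = {rescaled[0]:.4f}, y = {rescaled[1]:.4f}")
```

With equal variances the two coefficients are the same. After rescaling, the coefficient on the high-variance variable is close to 1 and the other is close to 0, even though the relationship between the variables has not changed.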

