In the previous example, we applied a principal component analysis to the raw data. As noted in our earlier discussion, an analysis of the raw data tends to give more emphasis to variables with higher variances than to variables with lower variances. In effect, the results depend on the units used to measure each variable. This implies that a principal component analysis should be applied to the raw data only if all variables have the same units of measurement, and even then only if you wish to give the variables with higher variances more weight in the analysis.
One example of this situation arises in an ecological setting where you have counts of different species of organisms at a number of sample sites. Here, one may want to give more weight to the more common species. Because more common species also tend to show higher variances, analyzing the raw data gives them more emphasis. If you were instead to carry out a principal component analysis on standardized counts, all species would be weighted equally regardless of how abundant they are, and you might find some very rare species entering as significant contributors to the analysis. This may or may not be desirable. Decisions of this type should be made together with a scientist from the field.
Summary
- The results of the principal component analysis depend on the measurement scales.
- Variables with the highest sample variances tend to be emphasized in the first few principal components.
- Principal component analysis using the covariance matrix should only be considered if all of the variables have the same units of measurement.
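The dependence on measurement scale can be seen in a small numerical sketch. The data below are simulated and purely hypothetical (the variable names `length_m` and `weight_kg` are assumptions for illustration). The sketch compares the first eigenvector of the covariance matrix and of the correlation matrix before and after converting one variable from meters to centimeters.

```python
import numpy as np

# Simulated, hypothetical data: two positively correlated variables.
rng = np.random.default_rng(0)
n = 200
length_m = rng.normal(2.0, 0.5, n)
weight_kg = 3.0 * length_m + rng.normal(0.0, 1.0, n)

def first_pc(matrix):
    """Eigenvector for the largest eigenvalue of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(matrix)      # eigenvalues in ascending order
    v = vecs[:, -1]                          # last column = largest eigenvalue
    # Fix the sign so that the largest-magnitude coefficient is positive.
    return v if v[np.argmax(np.abs(v))] > 0 else -v

X_m = np.column_stack([length_m, weight_kg])           # length in meters
X_cm = np.column_stack([length_m * 100.0, weight_kg])  # length in centimeters

# Covariance-based first component: changes with the units.
cov_pc_m = first_pc(np.cov(X_m, rowvar=False))
cov_pc_cm = first_pc(np.cov(X_cm, rowvar=False))

# Correlation-based first component: unaffected by the units.
cor_pc_m = first_pc(np.corrcoef(X_m, rowvar=False))
cor_pc_cm = first_pc(np.corrcoef(X_cm, rowvar=False))

print("covariance, meters:      ", cov_pc_m)
print("covariance, centimeters: ", cov_pc_cm)
print("correlation, meters:     ", cor_pc_m)
print("correlation, centimeters:", cor_pc_cm)
```

In this sketch the covariance-based component is dominated by whichever variable has the larger variance in the chosen units, while the correlation-based component is the same under both choices of units.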
If the variables have different units of measurement (e.g., pounds, feet, gallons), or if we wish each variable to receive equal weight in the analysis, then the variables should be standardized before conducting a principal component analysis. To standardize a variable, subtract the mean and divide by the standard deviation:
\(Z_{ij} = \frac{X_{ij}-\bar{x}_j}{s_j}\)
where
- \(X_{ij}\) = Data for variable j in sample unit i
- \(\bar{x}_{j}\) = Sample mean for variable j
- \(s_j\) = Sample standard deviation for variable j
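The standardization formula can be sketched in NumPy as follows. The data matrix `X` is hypothetical (rows are sample units, columns are variables), and the helper name `standardize` is an assumption for illustration. The sample standard deviation is computed with the n - 1 divisor (`ddof=1`).

```python
import numpy as np

def standardize(X):
    """Return column-wise z-scores: (X_ij - xbar_j) / s_j."""
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)           # sample mean of each variable
    s = X.std(axis=0, ddof=1)       # sample standard deviation (n - 1 divisor)
    return (X - xbar) / s

# Hypothetical data: 5 sample units, 3 variables in different units.
X = np.array([
    [150.0, 5.2, 12.0],
    [162.0, 5.6, 15.5],
    [171.0, 5.9, 11.0],
    [145.0, 5.1, 18.0],
    [180.0, 6.1, 14.5],
])

Z = standardize(X)
print(Z.mean(axis=0))               # each column mean is approximately 0
print(Z.std(axis=0, ddof=1))        # each column standard deviation is approximately 1
```

After standardization every variable has sample mean 0 and sample variance 1, so the covariance matrix of `Z` equals the correlation matrix of `X`.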
Principal Component Analysis Procedure with Standardized Data
The principal components are calculated by first obtaining the eigenvalues of the sample correlation matrix R. We denote these eigenvalues by
\(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p\)
and the corresponding eigenvectors by
\(\mathbf{\hat{e}}_1, \mathbf{\hat{e}}_2, \dots, \mathbf{\hat{e}}_p\)
The estimated principal component scores are calculated using formulas similar to those used before, except that the standardized data replace the raw data:
\begin{align} \hat{Y}_1 & = \hat{e}_{11}Z_1 + \hat{e}_{12}Z_2 + \dots + \hat{e}_{1p}Z_p \\ \hat{Y}_2 & = \hat{e}_{21}Z_1 + \hat{e}_{22}Z_2 + \dots + \hat{e}_{2p}Z_p \\&\vdots\\ \hat{Y}_p & = \hat{e}_{p1}Z_1 + \hat{e}_{p2}Z_2 + \dots + \hat{e}_{pp}Z_p \\ \end{align}
The rest of the procedure and the interpretations follow as discussed before.
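The procedure with standardized data can be sketched as below. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `pca_correlation` and the simulated data are hypothetical. It uses `numpy.linalg.eigh` on the sample correlation matrix R and multiplies the standardized data by the eigenvectors to obtain the scores.

```python
import numpy as np

def pca_correlation(X):
    """PCA on standardized data via the sample correlation matrix.

    Returns eigenvalues (descending), eigenvectors (as columns, in the
    same order), and the matrix of principal component scores.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data
    R = np.corrcoef(X, rowvar=False)                   # sample correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # reorder to descending
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    scores = Z @ eigvecs    # column k holds the scores for component k + 1
    return eigvals, eigvecs, scores

# Simulated, hypothetical data: 50 sample units, 4 variables on different scales.
rng = np.random.default_rng(1)
base = rng.normal(size=(50, 1))
X = np.column_stack([
    1000.0 * base[:, 0] + rng.normal(scale=300.0, size=50),
    0.01 * base[:, 0] + rng.normal(scale=0.01, size=50),
    rng.normal(loc=20.0, scale=5.0, size=50),
    rng.normal(loc=-3.0, scale=0.5, size=50),
])

eigvals, eigvecs, scores = pca_correlation(X)
print("eigenvalues:", eigvals)
print("proportion of variation:", eigvals / eigvals.sum())
```

Two checks follow from the algebra. The eigenvalues sum to p, the number of variables, because each standardized variable has variance 1. The sample variance of each column of `scores` equals the corresponding eigenvalue.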