## How do we find the coefficients \(\boldsymbol{e_{ij}}\) for a principal component?

The solution involves the eigenvalues and eigenvectors of the variance-covariance matrix \(Σ\).

Let \(\lambda_1\) through \(\lambda_p\) denote the eigenvalues of the variance-covariance matrix \(\Sigma\). These are ordered so that \(\lambda_1\) is the largest eigenvalue and \(\lambda_p\) is the smallest:

\(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p\)

Let the vectors \(\boldsymbol{e}_1\) through \(\boldsymbol{e}_p\)

\(\boldsymbol{e}_1 , \boldsymbol{e}_2 , \dots , \boldsymbol{e}_p\)

denote the corresponding eigenvectors. It turns out that the elements of these eigenvectors are the coefficients of our principal components.

The variance for the *i*th principal component is equal to the *i*th eigenvalue.

\(\text{var}(Y_i) = \text{var}(e_{i1}X_1 + e_{i2}X_2 + \dots + e_{ip}X_p) = \lambda_i\)

Moreover, the principal components are uncorrelated with one another:

\(\text{cov}(Y_i, Y_j) = 0 \text{ for } i \neq j\)

The variance-covariance matrix may be written as a function of the eigenvalues and their corresponding eigenvectors; this is the content of the Spectral Decomposition Theorem. This result will become useful later when we investigate factor analysis.
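As a quick numerical sketch of these facts, we can eigen-decompose a small covariance matrix with NumPy and check that each component's variance equals its eigenvalue and that distinct components are uncorrelated. The matrix below is purely illustrative, not from the text:

```python
import numpy as np

# A hypothetical 3x3 variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# np.linalg.eigh returns eigenvalues in ascending order for symmetric
# matrices; reverse so that lambda_1 >= lambda_2 >= ... >= lambda_p.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i holds e_i

# var(Y_i) = e_i' Sigma e_i = lambda_i for each component i, ...
for i in range(3):
    var_Yi = eigenvectors[:, i] @ Sigma @ eigenvectors[:, i]
    assert np.isclose(var_Yi, eigenvalues[i])

# ... and cov(Y_i, Y_j) = e_i' Sigma e_j = 0 for i != j.
cov_Y1_Y2 = eigenvectors[:, 0] @ Sigma @ eigenvectors[:, 1]
assert np.isclose(cov_Y1_Y2, 0.0)
```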

## Spectral Decomposition Theorem

The variance-covariance matrix can be written as the sum over the *p* eigenvalues, multiplied by the product of the corresponding eigenvector times its transpose as shown in the first expression below:

\begin{align} \Sigma & = \sum_{i=1}^{p}\lambda_i \mathbf{e}_i \mathbf{e}_i' \\ & \cong \sum_{i=1}^{k}\lambda_i \mathbf{e}_i\mathbf{e}_i'\end{align}

The second expression is a useful approximation if \(\lambda_{k+1}, \lambda_{k+2}, \dots , \lambda_{p}\) are small. We may approximate \(\Sigma\) by

\(\sum\limits_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\)

Again, this is more useful when we talk about factor analysis.
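A small sketch of the decomposition: below we construct a hypothetical covariance matrix with a deliberately small third eigenvalue (the numbers and the random orthogonal basis are assumptions for illustration), verify that the full sum over eigenpairs reproduces \(\Sigma\), and then check that the truncated \(k = 2\) sum is a close approximation:

```python
import numpy as np

# Build a hypothetical covariance matrix with a deliberately small third
# eigenvalue by choosing the eigenvalues directly (illustrative numbers).
rng = np.random.default_rng(0)
lam_true = np.array([5.0, 2.0, 0.05])
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthogonal eigenvectors
Sigma = Q @ np.diag(lam_true) @ Q.T

lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]                  # sort descending

# Full spectral decomposition: Sigma = sum_i lambda_i * e_i e_i'
Sigma_full = sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(3))
assert np.allclose(Sigma_full, Sigma)

# Truncating to k = 2 terms gives a close approximation because lambda_3
# is small; the spectral-norm error is exactly the discarded eigenvalue.
Sigma_2 = sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(2))
assert np.linalg.norm(Sigma - Sigma_2, 2) <= lam[2] + 1e-8
```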

Earlier in the course, we defined the total variation of \(\mathbf{X}\) as the trace of the variance-covariance matrix, that is the sum of the variances of the individual variables. This is also equal to the sum of the eigenvalues as shown below:

\begin{align} trace(\Sigma) & = \sigma^2_1 + \sigma^2_2 + \dots +\sigma^2_p \\ & = \lambda_1 + \lambda_2 + \dots + \lambda_p\end{align}

This will give us an interpretation of the components in terms of the amount of the full variation explained by each component. The proportion of variation explained by the *i*th principal component is then defined to be the eigenvalue for that component divided by the sum of the eigenvalues. In other words, the *i*th principal component explains the following proportion of the total variation:

\(\dfrac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_p}\)

A related quantity is the proportion of variation explained by the first *k* principal components. This is the sum of the first *k* eigenvalues divided by the total variation:

\(\dfrac{\lambda_1 + \lambda_2 + \dots + \lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_p}\)

Naturally, if the proportion of variation explained by the first *k* principal components is large, then not much information is lost by considering only the first *k* principal components.
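These proportions are simple to compute once the eigenvalues are in hand. The sketch below uses a hypothetical set of eigenvalues (illustrative values, not from the text) to show the per-component and cumulative proportions of variation explained:

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, already sorted descending.
lam = np.array([4.2, 1.5, 0.2, 0.1])

total = lam.sum()                       # equals trace(Sigma)
proportions = lam / total               # share explained by each component
cumulative = np.cumsum(proportions)     # share explained by the first k

# For example, the first two components explain (4.2 + 1.5) / 6.0 = 0.95
# of the total variation, so k = 2 would lose little information here.
assert np.isclose(cumulative[1], 5.7 / 6.0)
```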

## Why It May Be Possible to Reduce Dimensions

When we have correlation (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line, and that line could be used as a new (one-dimensional) axis to represent the variation among data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying "at most" two dimensions is that if there is a strong correlation between verbal and math, then it may be possible that there is only one true dimension to the data.
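The SAT example can be sketched numerically: with simulated verbal and math scores (arbitrary means and spreads, assumed for illustration), the exact relation total = verbal + math forces one eigenvalue of the sample covariance matrix to be numerically zero, confirming that the data have at most two true dimensions:

```python
import numpy as np

# Simulated SAT-style scores; the distributions are assumptions for
# illustration, not real data.
rng = np.random.default_rng(0)
verbal = rng.normal(500, 100, size=200)
math = rng.normal(500, 100, size=200)
total = verbal + math                   # exactly determined by the other two

X = np.column_stack([verbal, math, total])
S = np.cov(X, rowvar=False)

# One eigenvalue of S is (numerically) zero: the third principal
# component carries essentially no variation.
lam = np.sort(np.linalg.eigvalsh(S))[::-1]
assert abs(lam[2]) / lam[0] < 1e-8
```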

**Note!**

All of this is defined in terms of the population variance-covariance matrix \(\Sigma\), which is unknown. However, we may estimate \(\Sigma\) with the sample variance-covariance matrix:

\(\textbf{S} = \frac{1}{n-1} \sum\limits_{i=1}^{n}(\mathbf{X}_i-\bar{\textbf{x}})(\mathbf{X}_i-\bar{\textbf{x}})'\)
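As a sketch, the formula for **S** can be computed directly from a data matrix and compared against NumPy's built-in estimator (the simulated data are an assumption for illustration):

```python
import numpy as np

# Simulated data matrix: n = 50 observations on p = 3 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

# S = 1/(n-1) * sum_i (x_i - xbar)(x_i - xbar)', exactly as in the formula.
n = X.shape[0]
xbar = X.mean(axis=0)
S = sum(np.outer(x - xbar, x - xbar) for x in X) / (n - 1)

# Matches NumPy's built-in sample covariance estimator.
assert np.allclose(S, np.cov(X, rowvar=False))
```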

## Procedure

Compute the eigenvalues \(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p\) of the sample variance-covariance matrix **S**, and the corresponding eigenvectors \(\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \dots, \hat{\mathbf{e}}_p\).

Then we define the estimated principal components using the eigenvectors as the coefficients:

\begin{align} \hat{Y}_1 & = \hat{e}_{11}X_1 + \hat{e}_{12}X_2 + \dots + \hat{e}_{1p}X_p \\ \hat{Y}_2 & = \hat{e}_{21}X_1 + \hat{e}_{22}X_2 + \dots + \hat{e}_{2p}X_p \\&\vdots\\ \hat{Y}_p & = \hat{e}_{p1}X_1 + \hat{e}_{p2}X_2 + \dots + \hat{e}_{pp}X_p \\ \end{align}

Generally, we only retain the first *k* principal components. Here we must balance two conflicting desires:

- To obtain the simplest possible interpretation, we want *k* to be as small as possible. If we can explain most of the variation with just two principal components, this gives us a simple description of the data. When *k* is small, the first *k* components explain a large portion of the overall variation; if the first few components explain only a small amount of variation, we need more of them to reach a desired percentage of the total variance, resulting in a large *k*.
- To avoid loss of information, we want the proportion of variation explained by the first *k* principal components to be large, ideally as close to one as possible; i.e., we want

\(\dfrac{\hat{\lambda}_1 + \hat{\lambda}_2 + \dots + \hat{\lambda}_k}{\hat{\lambda}_1 + \hat{\lambda}_2 + \dots + \hat{\lambda}_p} \cong 1\)
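The whole procedure can be sketched end to end: estimate **S** from data, extract its eigenpairs, form the estimated component scores, and pick *k* from the cumulative explained proportion. The simulated data, the 95% threshold, and the column scalings below are all assumptions for illustration:

```python
import numpy as np

# Simulated data with deliberately unequal column spreads, so the first
# two components dominate (illustrative assumption).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) @ np.diag([3.0, 2.0, 0.5, 0.1])

# Step 1: eigenvalues and eigenvectors of the sample covariance matrix S.
S = np.cov(X, rowvar=False)
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]          # lambda_1 >= ... >= lambda_p

# Step 2: estimated principal component scores, Y_hat = (X - xbar) E.
Y = (X - X.mean(axis=0)) @ E

# The sample variances of the scores are exactly the eigenvalues of S.
assert np.allclose(np.var(Y, axis=0, ddof=1), lam)

# Step 3: choose the smallest k whose cumulative explained proportion
# reaches a chosen threshold (95% here, an arbitrary choice).
explained = np.cumsum(lam) / lam.sum()
k = int(np.argmax(explained >= 0.95)) + 1
assert explained[1] > 0.9               # first two components dominate
```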