11.2 - How do we find the coefficients?
How do we find the coefficients \(\boldsymbol{e_{ij}}\) for a principal component?
The solution involves the eigenvalues and eigenvectors of the variance-covariance matrix \(\Sigma\).
Let \(\lambda_1\) through \(\lambda_p\) denote the eigenvalues of the variance-covariance matrix \(\Sigma\), ordered so that \(\lambda_1\) is the largest eigenvalue and \(\lambda_p\) is the smallest:
\(\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p\)
Let the vectors \(\boldsymbol{e}_1\) through \(\boldsymbol{e}_p\)
\(\boldsymbol{e}_1 , \boldsymbol{e}_2 , \dots , \boldsymbol{e}_p\)
denote the corresponding eigenvectors. It turns out that the elements for these eigenvectors are the coefficients of our principal components.
The variance for the ith principal component is equal to the ith eigenvalue.
\(\text{var}(Y_i) = \text{var}(e_{i1}X_1 + e_{i2}X_2 + \dots + e_{ip}X_p) = \lambda_i\)
Moreover, the principal components are uncorrelated with one another:
\(\text{cov}(Y_i, Y_j) = 0 \text{ for } i \neq j\)
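As a quick numerical check of these two facts, here is a minimal sketch in Python with NumPy, using a small hypothetical \(3 \times 3\) covariance matrix (illustrative values only). It eigen-decomposes \(\Sigma\) and verifies that the covariance matrix of the components is diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

# Hypothetical 3x3 variance-covariance matrix (illustrative values only)
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# eigh returns eigenvalues in ascending order for symmetric matrices;
# reverse so that lambda_1 >= lambda_2 >= ... >= lambda_p
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]        # column i holds e_i

# The covariance matrix of the components Y = E'X is E' Sigma E,
# which is diagonal with the eigenvalues on the diagonal:
# var(Y_i) = lambda_i and cov(Y_i, Y_j) = 0 for i != j.
component_cov = eigenvectors.T @ Sigma @ eigenvectors
print(np.round(component_cov, 6))
print(np.round(eigenvalues, 6))
```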
The variance-covariance matrix may be written as a function of the eigenvalues and their corresponding eigenvectors, a result known as the Spectral Decomposition Theorem. This will become useful later when we investigate topics in factor analysis.
Spectral Decomposition Theorem
The variance-covariance matrix can be written as the sum over the p eigenvalues, multiplied by the product of the corresponding eigenvector times its transpose as shown in the first expression below:
\begin{align} \Sigma & = \sum_{i=1}^{p}\lambda_i \mathbf{e}_i \mathbf{e}_i' \\ & \cong \sum_{i=1}^{k}\lambda_i \mathbf{e}_i\mathbf{e}_i'\end{align}
The second expression is a useful approximation if \(\lambda_{k+1}, \lambda_{k+2}, \dots , \lambda_{p}\) are small. In that case we may approximate \(\Sigma\) by
\(\sum\limits_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\)
Again, this is more useful when we talk about factor analysis.
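To make the theorem concrete, here is a minimal sketch (assuming NumPy and the same hypothetical \(\Sigma\) as above) that reconstructs \(\Sigma\) exactly from all p eigenvalue/eigenvector pairs and then forms the rank-k approximation from only the k largest.

```python
import numpy as np

# Same hypothetical Sigma as in the previous sketch
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Full spectral decomposition: Sigma = sum_i lambda_i e_i e_i'
full = sum(lam * np.outer(e, e)
           for lam, e in zip(eigenvalues, eigenvectors.T))
print(np.allclose(full, Sigma))              # True

# Rank-k approximation using only the k largest eigenvalues
k = 2
approx = sum(lam * np.outer(e, e)
             for lam, e in zip(eigenvalues[:k], eigenvectors.T[:k]))
print(np.round(Sigma - approx, 3))           # small when the remaining eigenvalues are small
```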
Earlier in the course, we defined the total variation of \(\mathbf{X}\) as the trace of the variance-covariance matrix, that is the sum of the variances of the individual variables. This is also equal to the sum of the eigenvalues as shown below:
\begin{align} trace(\Sigma) & = \sigma^2_1 + \sigma^2_2 + \dots +\sigma^2_p \\ & = \lambda_1 + \lambda_2 + \dots + \lambda_p\end{align}
This will give us an interpretation of the components in terms of the amount of the full variation explained by each component. The proportion of variation explained by the ith principal component is then defined to be the eigenvalue for that component divided by the sum of the eigenvalues. In other words, the ith principal component explains the following proportion of the total variation:
\(\dfrac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_p}\)
A related quantity is the proportion of variation explained by the first k principal components. This is the sum of the first k eigenvalues divided by the total variation:
\(\dfrac{\lambda_1 + \lambda_2 + \dots + \lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_p}\)
Naturally, if the proportion of variation explained by the first k principal components is large, then not much information is lost by considering only the first k principal components.
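The following sketch computes both proportions from a hypothetical, already-ordered set of eigenvalues; the values are illustrative only.

```python
import numpy as np

# Hypothetical eigenvalues, already ordered from largest to smallest
eigenvalues = np.array([5.2, 2.1, 0.9, 0.5, 0.3])

# Proportion of total variation explained by each component:
# lambda_i / (lambda_1 + ... + lambda_p)
proportion = eigenvalues / eigenvalues.sum()

# Cumulative proportion explained by the first k components
cumulative = np.cumsum(proportion)

print(np.round(proportion, 3))
print(np.round(cumulative, 3))
```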
Why It May Be Possible to Reduce Dimensions
When we have a correlation (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line. That line could be used as a new (one-dimensional) axis to represent the variation among data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying “at most” two dimensions is that if there is a strong correlation between verbal and math, then it may be possible that there is only one true dimension to the data.
All of this is defined in terms of the population variance-covariance matrix \(\Sigma\), which is unknown. However, we may estimate \(\Sigma\) by the sample variance-covariance matrix given by the standard formula:
\(\textbf{S} = \frac{1}{n-1} \sum\limits_{i=1}^{n}(\mathbf{X}_i-\bar{\textbf{x}})(\mathbf{X}_i-\bar{\textbf{x}})'\)
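As a sketch, assuming NumPy and a hypothetical data matrix X with n rows (observations) and p columns (variables), S can be computed directly from this formula and checked against NumPy's built-in estimator.

```python
import numpy as np

# Hypothetical data matrix: n = 50 observations on p = 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# S = (1/(n-1)) * sum_i (X_i - x_bar)(X_i - x_bar)'
n = X.shape[0]
x_bar = X.mean(axis=0)
centered = X - x_bar
S = centered.T @ centered / (n - 1)

# Agrees with NumPy's built-in sample covariance estimator
print(np.allclose(S, np.cov(X, rowvar=False)))   # True
```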
Procedure
Compute the eigenvalues \(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p\) of the sample variance-covariance matrix S, and the corresponding eigenvectors \(\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \dots, \hat{\mathbf{e}}_p\).
Then we define the estimated principal components using the eigenvectors as the coefficients:
\begin{align} \hat{Y}_1 & = \hat{e}_{11}X_1 + \hat{e}_{12}X_2 + \dots + \hat{e}_{1p}X_p \\ \hat{Y}_2 & = \hat{e}_{21}X_1 + \hat{e}_{22}X_2 + \dots + \hat{e}_{2p}X_p \\&\vdots\\ \hat{Y}_p & = \hat{e}_{p1}X_1 + \hat{e}_{p2}X_2 + \dots + \hat{e}_{pp}X_p \\ \end{align}
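The sketch below carries out this procedure on the same hypothetical data matrix: it eigen-decomposes S, forms the component scores, and confirms that their sample variances equal the eigenvalues. (Here the data are centered before forming the components; this only shifts each \(\hat{Y}_i\) by a constant and does not change its variance.)

```python
import numpy as np

# Same hypothetical data matrix as in the previous sketch
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
S = np.cov(X, rowvar=False)

# Eigenvalues/eigenvectors of S, ordered from largest to smallest
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Estimated principal components: column i of `scores` holds
# Y_hat_i = e_hat_i1 X_1 + ... + e_hat_ip X_p, evaluated at each observation
# (computed on centered data, which leaves the variances unchanged)
scores = (X - X.mean(axis=0)) @ eigenvectors

# Sample variances of the components match the eigenvalues of S
print(np.round(np.var(scores, axis=0, ddof=1), 4))
print(np.round(eigenvalues, 4))
```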
Generally, we only retain the first k principal components. Here we must balance two conflicting desires:
- To obtain the simplest possible interpretation, we want k to be as small as possible. If we can explain most of the variation with just two principal components, then this gives us a simple description of the data. A small k is attainable when the first few components explain a large portion of the overall variation; if they explain only a small amount, we need more components to reach a desired percentage of the total variance, resulting in a large k. A small numerical sketch of this trade-off follows the list.
- To avoid loss of information, we want the proportion of variation explained by the first k principal components to be large, ideally as close to one as possible; i.e., we want
\(\dfrac{\hat{\lambda}_1 + \hat{\lambda}_2 + \dots + \hat{\lambda}_k}{\hat{\lambda}_1 + \hat{\lambda}_2 + \dots + \hat{\lambda}_p} \cong 1\)
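One common way to operationalize this balance, sketched below with hypothetical eigenvalues, is to keep the smallest k whose cumulative proportion reaches a chosen threshold; the 0.9 threshold here is an arbitrary illustrative choice, not a rule from the text.

```python
import numpy as np

# Hypothetical eigenvalues, ordered from largest to smallest
eigenvalues = np.array([5.2, 2.1, 0.9, 0.5, 0.3])
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()

# Keep the smallest k whose cumulative proportion reaches the threshold
# (0.9 is an arbitrary illustrative choice)
threshold = 0.9
k = int(np.argmax(cumulative >= threshold)) + 1

print(k)                            # 3 for these eigenvalues
print(np.round(cumulative, 3))
```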