13.1 - Setting the Stage for Canonical Correlation Analysis

What motivates canonical correlation analysis?

It is possible to create pairwise scatter plots of each variable in the first set (e.g., exercise variables) against each variable in the second set (e.g., health variables). But if the dimension of the first set is p and that of the second set is q, there will be pq such scatter plots, and it may be difficult, if not impossible, to look at all of these graphs together and interpret the results.

Similarly, you could compute all correlations between variables from the first set (e.g., exercise variables) and variables in the second set (e.g., health variables). However, interpretation is difficult when pq is large.
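To make the count concrete, here is a minimal Python sketch that computes all pq between-set correlations. The data are simulated stand-ins, and the sizes `n`, `p`, and `q` are arbitrary choices for illustration, not values from this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 3, 4          # hypothetical sample size and set dimensions
X = rng.normal(size=(n, p))  # stand-in for the first set of variables
Y = rng.normal(size=(n, q))  # stand-in for the second set of variables

# np.corrcoef on the stacked data returns a (p+q) x (p+q) correlation matrix;
# its upper-right p x q block holds the between-set correlations.
R = np.corrcoef(np.hstack([X, Y]), rowvar=False)
R_xy = R[:p, p:]

print(R_xy.shape)  # (3, 4): one entry per (X_k, Y_l) pair, pq in total
```

Even with only 3 and 4 variables there are already 12 correlations to inspect.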

Canonical Correlation Analysis allows us to summarize the relationships into fewer statistics while preserving the main facets of the relationships. In a way, the motivation for canonical correlation is very similar to principal component analysis. It is another dimension-reduction technique.

Canonical Variates

Let's begin with the notation:

We have two sets of variables, \(\textbf{X}\) and \(\textbf{Y}\).

Suppose we have p variables in set 1: \(\textbf{X} = \left(\begin{array}{c}X_1\\X_2\\\vdots\\ X_p\end{array}\right)\)

and suppose we have q variables in set 2: \(\textbf{Y} = \left(\begin{array}{c}Y_1\\Y_2\\\vdots\\ Y_q\end{array}\right)\)

We label the two sets so that X is the set with no more variables than Y, that is, \(p \le q\). This is done for computational convenience.

We look at linear combinations of the data, similar to principal components analysis. We define a set of linear combinations named U and V. U corresponds to the linear combinations from the first set of variables, X, and V corresponds to the second set of variables, Y. Each member of U is paired with a member of V. For example, \(U_{1}\) below is a linear combination of the p X variables and \(V_{1}\) is the corresponding linear combination of the q Y variables. Similarly, \(U_{2}\) is a linear combination of the p X variables, and \(V_{2}\) is the corresponding linear combination of the q Y variables. And, so on...

\begin{align} U_1 & = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p \\ U_2 & = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p \\ & \vdots \\ U_p & = a_{p1}X_1 +a_{p2}X_2 + \dots +a_{pp}X_p\\ & \\ V_1 & = b_{11}Y_1 + b_{12}Y_2 + \dots + b_{1q}Y_q \\ V_2 & = b_{21}Y_1 + b_{22}Y_2 + \dots +b_{2q}Y_q \\ & \vdots \\ V_p & = b_{p1}Y_1 +b_{p2}Y_2 + \dots +b_{pq}Y_q\end{align}

Thus define

\((U_i, V_i)\)

as the \(i^{th}\) canonical variate pair. \((U_{1}, V_{1})\) is the first canonical variate pair, \((U_{2}, V_{2})\) is the second canonical variate pair, and so on. With \(p \le q\) there are p canonical variate pairs.

We hope to find linear combinations that maximize the correlations between the members of each canonical variate pair.

We compute the variance of \(U_{i}\) with the following expression:

\(\text{var}(U_i) = \sum\limits_{k=1}^{p}\sum\limits_{l=1}^{p}a_{ik}a_{il}\text{cov}(X_k, X_l)\)

The coefficients \(a_{i1}\) through \(a_{ip}\) that appear in the double sum are the same coefficients that appear in the definition of \(U_{i}\). The covariance between the \(k^{th}\) and \(l^{th}\) X-variables is multiplied by the corresponding coefficients \(a_{ik}\) and \(a_{il}\) for the variate \(U_{i}\).

Similar calculations can be made for the variance of \(V_{j}\), where both sums now run over the q Y-variables:

\(\text{var}(V_j) = \sum\limits_{k=1}^{q} \sum\limits_{l=1}^{q} b_{jk}b_{jl}\text{cov}(Y_k, Y_l)\)

The covariance between \(U_{i}\) and \(V_{j}\) is:

\(\text{cov}(U_i, V_j) = \sum\limits_{k=1}^{p} \sum\limits_{l=1}^{q}a_{ik}b_{jl}\text{cov}(X_k, Y_l)\)

The correlation between \(U_{i}\) and \(V_{j}\) is calculated using the usual formula. We take the covariance between the two variables and divide it by the square root of the product of the variances:

\(\dfrac{\text{cov}(U_i, V_j)}{\sqrt{\text{var}(U_i) \text{var}(V_j)}}\)
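The double sums for the variances and the covariance can be checked numerically. The sketch below uses simulated data and arbitrary coefficient vectors `a` and `b` (all assumptions for illustration). It evaluates the sums from the sample covariance matrix and compares them with the variance and correlation computed directly from the linear combinations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 500, 2, 3                      # hypothetical sizes
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, q)) + rng.normal(size=(n, q))  # Y depends on X

# Sample covariance of the stacked variables, split into blocks.
S = np.cov(np.hstack([X, Y]), rowvar=False)
S_xx, S_yy, S_xy = S[:p, :p], S[p:, p:], S[:p, p:]

a = np.array([0.5, -1.0])        # arbitrary coefficients a_{i1}, ..., a_{ip}
b = np.array([1.0, 2.0, -0.5])   # arbitrary coefficients b_{j1}, ..., b_{jq}

# Double sum for var(U_i), written exactly as in the formula.
var_u_sum = sum(a[k] * a[l] * S_xx[k, l] for k in range(p) for l in range(p))

# The same quantities written as matrix products.
var_u = a @ S_xx @ a
var_v = b @ S_yy @ b
cov_uv = a @ S_xy @ b
corr_uv = cov_uv / np.sqrt(var_u * var_v)

# Direct computation from the linear combinations themselves.
U, V = X @ a, Y @ b
print(np.isclose(var_u_sum, np.var(U, ddof=1)))
print(np.isclose(corr_uv, np.corrcoef(U, V)[0, 1]))
```

Both comparisons agree because the sample covariance of a linear combination is exactly the corresponding double sum over the sample covariances.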

The canonical correlation is a specific type of correlation. The canonical correlation for the \(i^{th}\) canonical variate pair is simply the correlation between \(U_{i}\) and \(V_{i}\):

\(\rho^*_i = \dfrac{\text{cov}(U_i, V_i)}{\sqrt{\text{var}(U_i) \text{var}(V_i)}} \)

This is the quantity to maximize. We want to find linear combinations of the X's and linear combinations of the Y's that maximize the above correlation.

Canonical Variates Defined

Let us look at each of the p canonical variate pairs individually.

First canonical variate pair: \( \left( U _ { 1 } , V _ { 1 } \right)\)

The coefficients \(a_{11}, a_{12}, \dots, a_{1p}\) and \(b_{11}, b_{12}, \dots, b_{1q}\) are selected to maximize the canonical correlation \(\rho^*_1\) of the first canonical variate pair. This is subject to the constraint that variances of the two canonical variates in that pair are equal to one.

\(\text{var}(U_1) = \text{var}(V_1) = 1\)

This is required to obtain unique values for the coefficients.
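The reason is that a correlation does not change when either linear combination is multiplied by a positive constant, so without the constraint infinitely many coefficient vectors would give the same \(\rho^*_1\). The following sketch illustrates this with simulated data and arbitrary coefficients (assumptions for illustration only), and shows how dividing by the standard deviation produces a variate with variance one. Note that the unit-variance constraint still leaves the overall sign of each coefficient vector free.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300                                   # hypothetical sample size
X = rng.normal(size=(n, 2))
Y = X @ np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]]) + rng.normal(size=(n, 3))

a = np.array([1.0, 2.0])          # arbitrary coefficients for the X set
b = np.array([0.5, 1.0, -1.0])    # arbitrary coefficients for the Y set

def corr(u, v):
    """Sample correlation between two one-dimensional arrays."""
    return np.corrcoef(u, v)[0, 1]

r_original = corr(X @ a, Y @ b)
r_rescaled = corr(X @ (10.0 * a), Y @ (0.1 * b))  # rescaled coefficients

# Fix the scale by forcing var(U) = 1.
S_xx = np.cov(X, rowvar=False)
a_unit = a / np.sqrt(a @ S_xx @ a)
var_u_unit = np.var(X @ a_unit, ddof=1)

print(np.isclose(r_original, r_rescaled))  # True: correlation ignores scale
print(np.isclose(var_u_unit, 1.0))         # True: unit variance after rescaling
```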

Second canonical variate pair: \( \left( U _ { 2 } , V _ { 2 } \right)\)

Similarly, we want to find the coefficients \(a_{21}, a_{22}, \dots, a_{2p}\) and \(b_{21}, b_{22}, \dots, b_{2q}\) that maximize the canonical correlation \(\rho^*_2\) of the second canonical variate pair, \( \left( U _ { 2 } , V _ { 2 } \right)\). Again, we will maximize this canonical correlation subject to the constraint that the variances of the individual canonical variates are both equal to one. Furthermore, we require that \(U_{1}\) and \(U_{2}\) are uncorrelated and that \(V_{1}\) and \(V_{2}\) are uncorrelated. In addition, the cross pairs \( \left( U_{1} , V_{2} \right)\) and \( \left( U_{2} , V_{1} \right)\) must be uncorrelated. In summary, our constraints are:

\(\text{var}(U_2) = \text{var}(V_2) = 1\),

\(\text{cov}(U_1, U_2) = \text{cov}(V_1, V_2) = 0\),

\(\text{cov}(U_1, V_2) = \text{cov}(U_2, V_1) = 0\).

Basically, we require that all of the remaining correlations equal zero.

This procedure is repeated for each pair of canonical variates. In general, ...

\( i^{th} \) canonical variate pair: \( \left( U _ { i } , V _ { i } \right)\)

We want to find the coefficients \(a_{i1}, a_{i2}, \dots, a_{ip}\) and \(b_{i1}, b_{i2}, \dots, b_{iq}\) that maximize the canonical correlation \(\rho^*_i\) subject to the constraints that

\(\text{var}(U_i) = \text{var}(V_i) = 1\),

\(\text{cov}(U_1, U_i) = \text{cov}(V_1, V_i) = 0\),

\(\text{cov}(U_2, U_i) = \text{cov}(V_2, V_i) = 0\),

\(\vdots\)

\(\text{cov}(U_{i-1}, U_i) = \text{cov}(V_{i-1}, V_i) = 0\),

\(\text{cov}(U_1, V_i) = \text{cov}(U_i, V_1) = 0\),

\(\text{cov}(U_2, V_i) = \text{cov}(U_i, V_2) = 0\),

\(\vdots\)

\(\text{cov}(U_{i-1}, V_i) = \text{cov}(U_i, V_{i-1}) = 0\).

Again, we require all of the remaining correlations to be equal to zero.
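One common way to obtain coefficients that satisfy all of these constraints at once is to whiten each set of variables and take a singular value decomposition of the whitened cross-covariance matrix. The sketch below is a minimal NumPy illustration on simulated data, not the SAS procedure used in this lesson. The helper names `inv_sqrt` and `canonical_correlations` are hypothetical, and the code stores the coefficient vectors as columns of `A` and `B` rather than as rows.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def canonical_correlations(X, Y):
    """Return (rho, A, B) for data X (n x p) and Y (n x q), assuming p <= q.

    Column i of A holds the coefficients of U_{i+1}; column i of B holds
    the coefficients of V_{i+1}.
    """
    p = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    S_xx, S_yy, S_xy = S[:p, :p], S[p:, p:], S[:p, p:]
    Kx, Ky = inv_sqrt(S_xx), inv_sqrt(S_yy)
    # Singular values of the whitened cross-covariance are the canonical
    # correlations, returned in decreasing order.
    L, rho, Rt = np.linalg.svd(Kx @ S_xy @ Ky, full_matrices=False)
    A = Kx @ L        # p x p
    B = Ky @ Rt.T     # q x p
    return rho, A, B

rng = np.random.default_rng(3)
n, p, q = 400, 2, 3                      # hypothetical sizes with p <= q
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, q)) + rng.normal(size=(n, q))

rho, A, B = canonical_correlations(X, Y)
U, V = X @ A, Y @ B                      # canonical variates, one per column

# Covariance matrix of (U_1, ..., U_p, V_1, ..., V_p).
C = np.cov(np.hstack([U, V]), rowvar=False)
print(np.allclose(C[:p, :p], np.eye(p)))     # var(U_i) = 1, cov(U_i, U_j) = 0
print(np.allclose(C[p:, p:], np.eye(p)))     # var(V_i) = 1, cov(V_i, V_j) = 0
print(np.allclose(C[:p, p:], np.diag(rho)))  # cov(U_i, V_j) = 0 for i != j
```

The three checks correspond to the unit-variance constraints, the within-set zero covariances, and the between-set zero covariances listed above, with the canonical correlations on the diagonal of the last block.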

Next, let's see how this is carried out in SAS...