16.1 - Singular Value Decomposition
The most fundamental dimension reduction method is called the singular value decomposition or SVD. Oddly, statisticians don't seem to know much about this (although we use a related method, principal components analysis, very frequently). However in computer science and machine learning, SVD is one of the most important computational methods.
Our data consist of n samples and g feature values arranged as a g × n matrix which I will call M. (I will assume here that g>n but if not, the matrix can be transposed.) The features might be genes or protein binding sites, and the values are the normalized expression values. It is important to use the normalized values when using these dimension reduction methods. For microarray data, the values will be the usual normalized data. For sequencing data, the pseudo-counts computed by multiplying the actual counts by the normalization factors are used to normalize the samples to an equivalent library size. If the data are not normalized, the patterns that are extracted are often specific to the samples with the higher values, rather than being typical of the data as a whole.
Any g × n matrix M can be decomposed into 3 matrices U, D, V where
\[M = UDV^T\]
and:
- D is n × n diagonal with \(d_{11} \ge d_{22} \ge ... d_{nn} \ge 0 \) called the singular values.
- U is a g × n matrix called the left singular vectors or eigensamples U^{T}U = Identity
- V is a n × n matrix call the right singular vectors or eigenfeatures VV^{T }= V^{T}V = Identity
The columns of U are \(U_1, U_2, \cdots U_n\). \(U_i\) is the i^{th} eigensample. Each eigensample has an entry for each of the \(g\) features. Similarly the columns of \(V\) are \(V_1, V_2, \cdots V_n\). \(V_i\) is the i^{th} eigenfeature. Each eigenfeature has an entry for each of the n samples. The normalized singular values \(d_{ii}^2/ \sum_{j=1}^n d_{jj}^2 \) have an interpretation as the relative size of the pattern described by eigenfeature and eigensample i and are sometimes called the percent variance explained by the eigencomponents.
Another way to look at the decomposition of M which helps us to understand why the eigencomponents pick out patterns is
\[M=\sum_{i=1}^n d_{ii} U_i V_i^\top \]
This says that M is the sum of matrices each of which is the product of one eigensample and one eigengene, and weighted by the corresponding singular value. Since all the eigensamples and eigengenes have norm one, the size of the component matrix is determined by the singular values. So the first few components "explain" most of the patterns in the data matrix.
To see how this works, we will look at the primate brain data which we used in homework 5.
Principal Components Analysis
The SVD is a matrix decomposition, but it is not tied to any particular statistical method. A closely related method, Principal Components Analysis or PCA, is one of the most important methods in multivariate statistics.
Suppose M is the data matrix. The column-centered data matrix \(\tilde{M}\) is computed by subtracting the column mean from the entries of each column. The principal components of M are the eigenvectors of \(\hat{C}=\tilde{M}^\top \tilde{M}\) . If we consider each column of M to be a variable for which we have k observations, the diagonal elements of \(\hat{C}/(k-1)\) are the sample variances and. the off-diagonal elements are the sample covariances. Often the eigenvalues of \(\hat{C}\) or \(\hat{C}/(k-1)\) are called the variances.
Typically, in this set-up, the columns of the matrix are the features. Each row is considered an observation (of each of the features). \(\hat{C}/(k-1)\) is considered the empirical variance matrix of the data.
Now suppose that the SVD of \(\tilde{M}=\tilde{U}\tilde{D}\tilde{V}^\top\). \(\hat{C}=\tilde{M}^\top \tilde{M}=\tilde{V}\tilde{D}^2\tilde{V}^\top\). So the eigenvalues of \(\hat{C}\) are \(\tilde{d}_{ii}^2\) the squared singular values of \(\tilde{M}\) . And the eigenvectors of \(\hat{C}\) are the right singular vectors of \(\tilde{M}\) i.e. the eigenfeatures.
In PCA, we only compute \(\tilde{V}\) and \(\tilde{D}^2\). However, since \(\tilde{D}^2\) is diagonal, we can readily compute \(\tilde{D}^{-1}\). Thus we can readily compute \(\tilde{U}=\tilde{M}\tilde{V}\tilde{D}^{-1}\).
A commonly used plot is a plot of the data "on" two of the PCs. When we plot the data on the PCs, we are actually plotting the entries of the corresponding rows of \(\tilde{U}\). Often we do not bother to multiply by \(\tilde{D}^{-1}\) which is only a multiplicative scaling of the axes and plot the entries of the corresponding rows of \(\tilde{U}\tilde{D}\). e.g. When we plot the data on the first two PCs, we are actually plotting \((U_{i1},U_{i2})\).
There is a bit of confusion over terminology, as SVD is sometimes referred to as uncentered PCA, and PCA is sometimes referred to as the SVD of the column centered data (although PCA focuses only on the singular vectors corresponding to columns).
In bioinformatics, usually \(g >> n\) and the features are the rows of the matrix. To use PCA software such as prcomp and princomp in R, and to get the centering correct, we need to transpose the data matrix. This does not cause any computational problems (because the software actually uses the SVD to do the computation, rather than compute the empirical covariance matrix) but I find it confusing. I prefer to use SVD software and just keep track of whether I want to center the rows (for eigengenes) or columns (for eigensamples).
PCA or SVD?
PCA focuses on "explaining" the data matrix using the sample means plus the eigencomponents. When the column mean is far from the origin, the first right singular value is usually quite highly correlated with column mean - thus using PCA concentrates on the second, third and sometimes higher order singular vectors. This is a loss of information when the mean is informative for the process under study. On the other hand, when the scatterplot of the data is roughly elliptical, the PCs typically align with the major axes of the ellipse. Due to the uncorrelatedness constraint, if the mean is far from the origin, the first singular vector will be close to the mean and the others will be tilted away form the major axes of the ellipse. Thus the first singular vector will not be informative about the spread of the data, and the second and third singular values will not be in the most informative directions. Generally, PCA will be more informative, particularly as a method for plotting the data, than uncentered SVD.
"Raw" data or z-scores?
Recall that for gene expression clustering, we often prefer to use z-scores, which emphasize the up and down pattern of gene expression, rather than the magnitude of expression. Similarly, for eigen-analysis, we might prefer to use the gene-wise z-scores. When the genes are in the rows of the data matrix, this is the centered and scaled values of each row.