16.5 - Supervised Dimension Reduction


SVD and PCA are called unsupervised dimension reduction methods because they act only on the data matrix.  Often, in addition to our feature by sample matrix, we have other information about the samples, such as phenotypes, population subgroups and so on, which we want to predict from the feature by sample matrix.  Since we have many more features than samples, in order to do this in a meaningful way we need to reduce the number of features (i.e. dimension reduction).  One way to do this is to use the first few eigenfeatures as predictors.  This is called principal component regression.  However, because determining the eigenfeatures uses only the data matrix and not the other information, there is no guarantee that these eigenfeatures will be good predictors.
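A minimal sketch of principal component regression, using only NumPy on a small simulated feature by sample matrix (the data, the choice of k = 3 components, and the variable names here are illustrative assumptions, not part of the course notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 features (rows) by 20 samples (columns),
# with one continuous response value per sample.
n_features, n_samples = 100, 20
X = rng.normal(size=(n_features, n_samples))
y = rng.normal(size=n_samples)

# Center each feature, then take the SVD of the feature by sample matrix.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# The first k rows of Vt give each sample's scores on the first k
# eigenfeatures; use those scores as the predictors in least squares.
k = 3
Z = Vt[:k].T                                   # n_samples x k score matrix
Z1 = np.column_stack([np.ones(n_samples), Z])  # add an intercept column
beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)
y_hat = Z1 @ beta                              # fitted values
```

Note that `y` plays no role until the final regression step: the eigenfeatures are chosen before the response is ever consulted, which is exactly why they may turn out to be poor predictors.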

Methods that use the response variable to develop predictors are called Sufficient Dimension Reduction (SDR) methods.  When the response variable is categorical (such as brain region), the simplest method averages the samples within each category, as we did to find the eigenfeatures and eigentreatments.  When the response variable is continuous, such as milk production or biomass, we create categories by "slicing" the response variable into bins - e.g. weight groups.  Besides simply averaging the data within each bin or category and doing SVD, there are a number of more sophisticated methods, such as Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE), which find features useful for prediction.  One interesting property of such methods is that they can find features that are useful for nonlinear as well as linear regression.
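The slice-average-and-SVD idea described above can be sketched in a few lines of NumPy. This is only an illustration of the slicing step, not an implementation of SIR or SAVE; the simulated data, the choice of quartile slices, and the variable names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: a feature by sample matrix and a
# continuous response (e.g. milk production) for each sample.
n_features, n_samples = 50, 30
X = rng.normal(size=(n_features, n_samples))
y = rng.normal(size=n_samples)

# "Slice" the continuous response into bins (here, quartiles),
# assigning each sample a slice label 0..3.
n_slices = 4
edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
slice_id = np.clip(np.searchsorted(edges, y, side="right") - 1,
                   0, n_slices - 1)

# Average the centered samples within each slice ...
Xc = X - X.mean(axis=1, keepdims=True)
slice_means = np.column_stack(
    [Xc[:, slice_id == b].mean(axis=1) for b in range(n_slices)]
)

# ... and take the SVD of the slice means; the leading left singular
# vectors are candidate predictive directions in feature space.
U, s, Vt = np.linalg.svd(slice_means, full_matrices=False)
directions = U[:, :2]  # first two candidate directions
```

Because the slice means depend on `y`, the resulting directions are tied to the response, in contrast to ordinary PCA of `X` alone.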

A full discussion of SDR methods is beyond the scope of this course.  However, those interested can find more information at https://projecteuclid.org/euclid.ss/1185975631.