4.3 - 1995 - Two Huge Steps for Biological Inference


In 1995, the first microarray was "spotted" and hybridized, starting the "omics" revolution in biology.

Also in 1995, independently, Benjamini and Hochberg introduced the False Discovery Rate or FDR. Their idea was that for large m, we do not expect all of the null hypotheses to be true, and so we do not want to stringently control Pr(V > 0). Instead, we want to control the expected proportion of our discoveries that are false, counting that proportion as 0 when we make no discoveries. With V the number of false discoveries and R the total number of discoveries, that is \(FDR = E(V/R \mid R > 0)\,Pr(R > 0)\).

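Let q be the target FDR and let \(p_{(1)} \leq \dots \leq p_{(m)}\) be the ordered p-values. Benjamini and Hochberg proved that, for independent tests, finding the largest i for which \(p_{(i)} \leq qi/m\) and rejecting the hypotheses with the i smallest p-values controls FDR at level q. [1]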
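
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `benjamini_hochberg` and the p-values are made up for the example, and the search is written as a step-up search (take the largest rank that meets its cutoff).

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans, True where the BH step-up procedure rejects."""
    m = len(pvalues)
    # Positions of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    # Step-up search: k is the largest rank i with p_(i) <= q * i / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / m:
            k = rank
    # Reject the hypotheses with the k smallest p-values (none if k == 0).
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


# Made-up p-values, for illustration only.
pvals = [0.5, 0.012, 0.035, 0.011, 0.025]
bh_reject = benjamini_hochberg(pvals, q=0.05)
print(bh_reject)
```

In this toy example the smallest p-value, 0.011, is above its own cutoff of 0.05 x 1/5 = 0.01, yet it is still rejected because the p-values at ranks 2 to 4 meet their cutoffs. Only the test with p = 0.5 is retained.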

The Benjamini and Hochberg (BH) method is used extensively in bioinformatics and other "big data" disciplines. The original proof requires the tests to be independent. This is seldom true in "omics" data, for which our features may be gene expression levels or proteins that act together in pathways and so behave in a correlated way. However, in a follow-up paper, Benjamini and Yekutieli [2] showed that the BH procedure also controls FDR under certain types of positive dependence.

The BH procedure may not work well for highly correlated data, such as SNP frequencies for densely located SNPs. Considerable work has gone into developing FDR-controlling procedures for highly correlated data such as dense SNPs and neuroimaging data. The Benjamini and Yekutieli (BY) method [2] controls FDR for any correlation structure, but is much less powerful than the BH method.
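
To see where the loss of power comes from, here is a hedged sketch of the BY procedure. It assumes the form given in [2], in which the BH cutoffs are divided by the harmonic sum \(c(m) = \sum_{j=1}^{m} 1/j\); that detail is taken from the paper, not from the text above, and the function name and p-values are made up.

```python
def benjamini_yekutieli(pvalues, q=0.05):
    """Return a list of booleans, True where the BY procedure rejects."""
    m = len(pvalues)
    # Harmonic-sum correction c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / j for j in range(1, m + 1))
    # Positions of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    # Same step-up search as BH, but with the smaller cutoffs q * i / (m * c(m)).
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / (m * c_m):
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


# Made-up p-values, for illustration only.
pvals = [0.5, 0.012, 0.035, 0.011, 0.025]
by_reject = benjamini_yekutieli(pvals, q=0.05)
print(by_reject)
```

With m = 5 the correction is c(m) of about 2.28, so every cutoff shrinks by more than half. For these p-values BY rejects nothing, whereas the BH cutoffs \(qi/m\) would allow four rejections.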

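Although the BH procedure is meant to control FDR, not the FWER, it is common to report "BH-adjusted p-values", computed as \(p_{BH(i)} = \min(m p_{(i)}/i, 1)\). In practice the values are also made monotone by replacing each with the smallest value of \(m p_{(j)}/j\) over \(j \geq i\), so that rejecting whenever the adjusted p-value is at most q reproduces the BH procedure.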
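
The sketch below computes BH-adjusted p-values with the usual running-minimum step, which keeps the adjusted values monotone in the raw p-values. That step is an assumption borrowed from common implementations such as R's `p.adjust`; the function name and the p-values are hypothetical.

```python
def bh_adjusted_pvalues(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Positions of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    adjusted = [0.0] * m
    running_min = 1.0  # starting at 1 also caps every adjusted value at 1
    # Walk from the largest p-value down, carrying the running minimum of m * p_(i) / i.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, m * pvalues[idx] / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, for illustration only.
pvals = [0.5, 0.012, 0.035, 0.011, 0.025]
adj = bh_adjusted_pvalues(pvals)
significant = [a <= 0.05 for a in adj]
print(adj)
print(significant)
```

For the smallest p-value, 0.011, the raw formula gives 5 x 0.011 / 1 = 0.055, but the running minimum lowers it to 0.03, the value at rank 2. Comparing the adjusted values with q = 0.05 then flags every test except the one with p = 0.5.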

The BH procedure is more powerful than the Holm procedure.
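
A small comparison makes the difference concrete. The sketch assumes the standard form of Holm's step-down rule (reject while \(p_{(i)} \leq \alpha/(m-i+1)\), stopping at the first failure) and counts rejections for both procedures on the same made-up p-values. Keep in mind that the two procedures control different error rates, so the extra rejections from BH come with a weaker guarantee.

```python
def holm_rejections(pvalues, alpha=0.05):
    """Number of rejections under Holm's step-down procedure (controls FWER)."""
    m = len(pvalues)
    count = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p > alpha / (m - i + 1):
            break  # step-down: stop at the first p-value that fails
        count += 1
    return count


def bh_rejections(pvalues, q=0.05):
    """Number of rejections under the BH step-up procedure (controls FDR)."""
    m = len(pvalues)
    count = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= q * i / m:
            count = i  # step-up: remember the largest rank that passes
    return count


# Made-up p-values, for illustration only.
pvals = [0.001, 0.012, 0.02, 0.03, 0.5]
holm_count = holm_rejections(pvals, alpha=0.05)
bh_count = bh_rejections(pvals, q=0.05)
print(holm_count, bh_count)
```

Here Holm stops at the third p-value (0.02 is above 0.05/3) and rejects two hypotheses, while BH rejects four.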

All of the procedures could be made more powerful because we really only need to adjust for the tests in which the null hypothesis is true. If we knew the number of true null hypotheses, \(m_0\), we could adjust for it instead of m, giving us larger cut-off values. Fortunately, it turns out that when we have done many tests, it is fairly easy to estimate \(m_0\). There are many estimation methods.

FDR-controlling or FDR-estimating methods that estimate \(m_0\) and use it in place of m are called adaptive FDR methods.
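
As a rough illustration of the adaptive idea, the sketch below plugs an estimate of \(m_0\) into the BH cutoffs. The estimator is an assumption chosen for simplicity, not one named in the text: it is of the type proposed by Storey, counting the p-values above a threshold `lam` and dividing by `1 - lam`, on the reasoning that large p-values come mostly from true nulls. The function names, `lam = 0.5`, and the p-values are all hypothetical.

```python
def estimate_m0(pvalues, lam=0.5):
    """Estimate the number of true nulls from the p-values larger than lam."""
    m = len(pvalues)
    n_large = sum(1 for p in pvalues if p > lam)
    # Keep the estimate between 1 and m so the cutoffs stay well defined.
    return min(float(m), max(1.0, n_large / (1.0 - lam)))


def step_up_rejections(pvalues, q=0.05, m_eff=None):
    """BH-style step-up count, using m_eff in place of m when it is given."""
    if m_eff is None:
        m_eff = len(pvalues)
    count = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= q * i / m_eff:
            count = i
    return count


# Made-up p-values, for illustration only.
pvals = [0.001, 0.004, 0.012, 0.024, 0.03, 0.036, 0.6, 0.7, 0.8, 0.9]
m0_hat = estimate_m0(pvals, lam=0.5)
plain_count = step_up_rejections(pvals, q=0.05)
adaptive_count = step_up_rejections(pvals, q=0.05, m_eff=m0_hat)
print(m0_hat, plain_count, adaptive_count)
```

Four of the ten p-values exceed 0.5, so the estimate is 4 / 0.5 = 8 instead of m = 10. The larger cutoffs raise the number of rejections from three to six in this toy example.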

[1] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSSB, 57, 289-300. https://www.jstor.org/stable/2346101

[2] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165-1188. https://projecteuclid.org/download/pdf_1/euclid.aos/1013699998