Key Learning Goals for this Lesson:
An event that is rare when we have only one opportunity to observe it can become quite common when we observe thousands of opportunities. For example, when you roll 2 fair dice, double sixes come up only about 1 time in 36. But if you roll 3600 times, you expect about 100 rolls with double sixes.
The p-value is the probability of obtaining a result at least as extreme as the observed result if the null hypothesis is true. Suppose we accept p < 0.05 as "extreme". If we do 10,000 (independent) tests, and all the null hypotheses are true, we expect about 5% of the tests (i.e. about 500) to have p < 0.05.
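As a quick illustration in Python (the group size of 10 and the random seed are arbitrary choices), we can simulate 10,000 two-sample t-tests in which every null hypothesis is true and count how many reach p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)      # arbitrary seed for reproducibility
m, n = 10_000, 10                   # 10,000 independent tests, 10 observations per group

# Both samples come from the same N(0,1) distribution, so every null hypothesis is true.
pvals = np.array([stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
                  for _ in range(m)])

print((pvals < 0.05).sum())         # close to 0.05 * 10,000 = 500 false positives
```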
This is a huge problem in high throughput analysis, because we are usually doing thousands of tests. We do not want to waste our time following up false positive hypotheses. But if we use conventional p-value cut-offs, this will be inevitable.
This chapter discusses some approaches to correcting our inference methods when we are doing multiple tests.
The table below shows the possible correct and incorrect outcomes for a single test:
| | Not Significant | Significant |
| --- | --- | --- |
| \(H_0\) True | correct | Type I error |
| \(H_0\) False | Type II error | correct |
A type I error is called a false discovery. A type II error is called a false nondiscovery. Generally false discoveries are considered to be more serious than false nondiscoveries, although this is not always the case. Investigators usually follow up on discoveries, so that false discoveries can lead to misleading and expensive follow-up studies. But nondiscoveries are usually abandoned, so that false nondiscoveries can lead to missing potentially important results.
In high throughput studies we typically test each of our \(m\) features individually, leading to the following table:
| | Not Significant | Significant | Total |
| --- | --- | --- | --- |
| \(H_0\) True | U | V | \(m_0\) |
| \(H_0\) False | T | S | \(m-m_0\) |
| Total | W | R | \(m\) |
The total number of errors is T + V.
The false discovery proportion FDP is V / R.
The false discovery rate is the expected value of V / R, given that R > 0.
Similarly, the false nondiscovery proportion FNP is T / W.
The false nondiscovery rate is the expected value of T / W, given that W > 0.
\(\pi_0=m_0 / m\) is the proportion of null tests.
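A small simulation sketch can make the table concrete (the values of m, \(m_0\), the mean shift, and the sample size below are illustrative choices only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
m, m0, n, delta = 10_000, 9_000, 10, 2.0      # illustrative values
is_null = np.arange(m) < m0                    # first m0 features are truly null

# Non-null features have a mean shift of delta in the second sample.
pvals = np.array([stats.ttest_ind(rng.normal(size=n),
                                  rng.normal(loc=0.0 if null else delta, size=n)).pvalue
                  for null in is_null])

sig = pvals < 0.05
V = (sig & is_null).sum()        # false discoveries
S = (sig & ~is_null).sum()       # true discoveries
U = (~sig & is_null).sum()       # true non-discoveries
T = (~sig & ~is_null).sum()      # false non-discoveries
R, W = V + S, U + T
print(f"FDP = V/R = {V / R:.3f},  FNP = T/W = {T / W:.3f}")
```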
Before 1995, the objective of multiple testing correction was to control Pr(V > 0), the so-called Family-Wise Error Rate (FWER).
The problem is: as \(m_0\) grows, so does Pr(V > 0) for any fixed cut-off one would use to ascertain statistical significance.
It is easy to show that if you declare tests significant for \(p < \alpha\), then FWER \(\leq min(m_0\alpha,1)\).
The most commonly used method which controls FWER at level \(\alpha\) is called Bonferroni's method. It rejects the null hypothesis when \(p < \alpha / m\). (It would be better to use \(m_0\) but we don't know what it is - more on that later.)
The Bonferroni method is guaranteed to control FWER, but it has a big problem. It greatly reduces your power to detect real differences. For example, suppose the effect size is 2 and you are doing a t-test, rejecting for p < 0.05. With 10 observations per group, the power is 99%. Now suppose you have 1000 tests, and use the Bonferroni method. That means that to reject, we need p < 0.00005. The power is now only 29%. If you have 10 thousand tests (which is small for genomics studies) the power is only 10%.
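The power calculation above can be reproduced approximately with statsmodels, assuming the effect size is a standardized difference (Cohen's d) of 2; the exact values depend on the approximation used:

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()
for m in (1, 1_000, 10_000):
    alpha = 0.05 / m              # Bonferroni-adjusted significance level
    print(m, round(power.power(effect_size=2.0, nobs1=10, alpha=alpha), 2))
# should print values near 0.99, 0.29, and 0.10
```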
Sometimes the "Bonferroni-adjusted p-values are reported". They are just:
\(p_b=min(mp,1)\).
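In code, this is a one-liner (the function name here is just for illustration):

```python
import numpy as np

def bonferroni_adjust(pvals):
    """Return min(m * p, 1) for each of the m raw p-values."""
    pvals = np.asarray(pvals, dtype=float)
    return np.minimum(len(pvals) * pvals, 1.0)
```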
Another simple method, more powerful but less popular, uses the sorted p-values:
\[p_{(1)}\leq p_{(2)} \leq \cdots \leq p_{(m)}\]
Holm showed that the FWER is controlled by the following algorithm:
Starting from i = 1, compare \(p_{(i)}\) with \(\alpha / (m-i+1)\) and reject the corresponding null hypothesis; stop at the first i for which \(p_{(i)}\) exceeds its threshold.
The most significant test must therefore pass the Bonferroni criterion. However, if it is significant, the next most significant test is tested at a less stringent level. Heuristically, after rejecting the most significant test, we conclude that \(m_0 \leq m-1\) and use \(m-1\) for the next correction, and so on sequentially.
The Holm method is more powerful than the Bonferroni method, but it is still not very powerful. We can also compute "Holm-adjusted p-values" \(p_{h(i)} = min((m-i+1)p_{(i)},1)\), taking a running maximum over \(i\) so that the adjusted p-values are non-decreasing.
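Here is a minimal sketch of the Holm step-down rule and the adjusted p-values (the function name is illustrative; statsmodels.stats.multitest.multipletests with method='holm' computes the same adjustment):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down procedure: reject while p_(i) <= alpha / (m - i + 1)."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)                          # sort the p-values
    thresholds = alpha / (m - np.arange(m))            # alpha/m, alpha/(m-1), ..., alpha
    passed = pvals[order] <= thresholds
    k = m if passed.all() else int(np.argmin(passed))  # stop at the first failure
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    # Holm-adjusted p-values: (m - i + 1) * p_(i), made monotone with a running max.
    adj = np.minimum(np.maximum.accumulate((m - np.arange(m)) * pvals[order]), 1.0)
    return reject, adj[np.argsort(order)]              # adjusted values in original order
```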
In 1995, the first microarray was "spotted" and hybridized, starting the "omics" revolution in biology.
Also in 1995, independently, Benjamini and Hochberg conceived of the idea of False Discovery Rate or FDR. Their idea was that for large m, we do not expect all of the null hypotheses to be true, and so we do not want to stringently control Pr(V > 0). Instead, we want to control the expected proportion of our discoveries that are false assuming we make at least one discovery, that is FDR=E(V/R|R>0).
Let q be the target FDR. Benjamini and Hochberg proved that the following procedure controls FDR at level q: find the largest i such that \(p_{(i)} \leq qi/m\) and reject the null hypotheses corresponding to \(p_{(1)}, \ldots, p_{(i)}\). [1]
The Benjamini and Hochberg method is used extensively in bioinformatics and other "big data" disciplines. The original proof requires the tests to be independent. This is seldom true in "omics" data, for which our features may be gene expression levels or protein abundances that occur in pathways, inducing correlated behavior. However, in a follow-up paper, Benjamini and Yekutieli [2] showed that the procedure also controls FDR for certain types of positive correlation.
The BH procedure may not work so well for highly correlated data such as frequencies of densely located SNPs. Considerable work has gone into developing FDR-controlling procedures for such data, including dense SNPs and neuroimaging data. The Benjamini and Yekutieli (BY) method [2] controls FDR for any correlation structure, but is much less powerful than the BH method.
Although the BH procedure is meant to control FDR, not the FWER, "BH-adjusted p-values" computed as \(p_{BH(i)}= min(mp_{(i)}/i,1)\) (made monotone by taking a running minimum from the largest p-value downward) are often reported as adjusted p-values.
The BH procedure is more powerful than the Holm procedure.
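Here is a sketch of the BH step-up rule and the BH-adjusted p-values (again, the function name is illustrative; multipletests with method='fdr_bh' should agree):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up procedure at target FDR q."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    ranks = np.arange(1, m + 1)
    below = pvals[order] <= q * ranks / m              # p_(i) <= q * i / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.nonzero(below)[0].max()) + 1        # largest i satisfying the bound
        reject[order[:k]] = True
    # BH-adjusted p-values: m * p_(i) / i, made monotone from the largest p-value down.
    adj = np.minimum.accumulate((m / ranks * pvals[order])[::-1])[::-1]
    return reject, np.minimum(adj, 1.0)[np.argsort(order)]
```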
All of these procedures could be made more powerful, because we really only need to adjust for the tests that are truly null. If we knew \(m_0\) we could adjust for it instead of m, giving us larger cut-off values. Fortunately, it turns out that when we have done many tests, it is fairly easy to estimate \(m_0\). There are many estimation methods.
FDR controlling or estimation methods that estimate \(m_0\) and use it in place of m, are called adaptive FDR methods.
[1] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289-300. https://www.jstor.org/stable/2346101
[2] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165-1188. https://projecteuclid.org/download/pdf_1/euclid.aos/1013699998
The proportion of hypotheses that are truly null is an important parameter in many situations. For example, when comparing normal and diseased tissues, we might hypothesize that a greater number of genes change response in the normal tissue than in the diseased tissue when challenged by a toxin.
For multiple testing adjustments, the proportion of null hypotheses among the m tests is important, because we need only adjust for the \(m_0=\pi_0 m\) tests which are actually null. The other \(m-m_0\) can contribute only to the true discoveries.
Estimation of \(\pi_0\) starts with the observation that when the test statistic is continuous, the distribution of p-values for the null tests is uniform on the interval (0,1). This is because, for any significance level \(\alpha\), the expected proportion of null tests with p-value less than \(\alpha\) is \(\alpha\). By subtraction, the expected proportion of null tests with p-value in any interval inside (0,1) is just the length of the interval.
Below we see a histogram of p-values from t-tests on data simulated for 10,000 two-sample tests. For each two-sample test, both samples have the same mean and SD. Since the means are the same, the null hypothesis is true. We can see that the distribution is not quite uniform, but it is pretty close; these are "observed" values and so deviate somewhat from the ideal histogram. Below this histogram is a histogram in which the two samples have different means, chosen so that every test has the same power to detect the difference. As expected, the p-values are skewed toward small values. The percentage of p-values in this histogram less than \(\alpha\) is the power of the test when significance is declared at level \(\alpha\).
Notice that in the histogram of non-null p-values, there are very few p-values bigger than 0.2 and almost none bigger than 0.5.
Of course when we actually test many features, some of which have the same means in both samples and some of which do not, we get a mix of histograms of this type. We should get a histogram something like the one below. In each bin of the histogram, we do not know which tests come from the truly null hypotheses, and which from the truly non-null. But we do know that the large p-values come mostly from the null distribution, and that the histogram should be fairly flat where most of the p-values come from the null. The red line shows an estimate of what the part of the histogram coming from the null should look like (although it might be a bit lower than it should be, comparing with the top histogram above).
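A simulation sketch that produces histograms of this kind (the values of \(\pi_0\), the mean shift, and the sample size are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
m, pi0, n, delta = 10_000, 0.8, 10, 1.5       # illustrative values
m0 = int(pi0 * m)

def two_sample_p(shift):
    """p-value of a two-sample t-test with a given mean shift in the second sample."""
    return stats.ttest_ind(rng.normal(size=n), rng.normal(loc=shift, size=n)).pvalue

p_null = np.array([two_sample_p(0.0) for _ in range(m0)])        # null features
p_alt = np.array([two_sample_p(delta) for _ in range(m - m0)])   # non-null features

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, p, title in zip(axes, (p_null, p_alt, np.concatenate([p_null, p_alt])),
                        ("null tests", "non-null tests", "mixture")):
    ax.hist(p, bins=20)
    ax.set_title(title)
plt.show()
```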
There are several estimators of \(\pi_0\) based on this picture. Some try to estimate where the histogram flattens out. The two simplest are Storey's method and the Pounds and Cheng method.
The Pounds and Cheng [1] method is based on noting that the expected average of the null p-values is 0.5. They then assume that all of the non-null p-values are exactly 0. Then \(\hat{\pi}_0=2\times\text{average}(\text{p-value})\), because we expect the average p-value to be about \(\pi_0\times 0.5+(1-\pi_0)\times 0\).
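In code, the Pounds and Cheng estimate is a one-liner (capping at 1 is a natural safeguard, not part of the formula above):

```python
import numpy as np

def pi0_pounds_cheng(pvals):
    """Estimate pi0 as 2 * average p-value, capped at 1."""
    return min(2.0 * float(np.mean(pvals)), 1.0)
```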
Storey's method [2] uses the histogram directly, assuming that all p-values bigger than some cut-off \(\lambda\) come from the null distribution. Then we expect \(m_0(1-\lambda)\) of the tests to have p-value greater than \(\lambda\). We count the number \(m_\lambda\) of p-values in this region, giving us \(\hat{m}_0=m_\lambda /(1- \lambda)\) or \(\hat{\pi}_0 = m_\lambda / (m(1-\lambda))\). Sophisticated implementations of the method estimate a value for \(\lambda\), but \(\lambda=0.5\) works quite well in most cases.
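A sketch of Storey's estimator with a fixed \(\lambda\) (0.5 by default, as suggested above; the cap at 1 is again a safeguard):

```python
import numpy as np

def pi0_storey(pvals, lam=0.5):
    """Estimate pi0 as (# p-values > lambda) / (m * (1 - lambda))."""
    pvals = np.asarray(pvals, dtype=float)
    m_lambda = (pvals > lam).sum()
    return min(m_lambda / (len(pvals) * (1.0 - lam)), 1.0)
```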
[1] Pounds, S. and Cheng, C. (2004). Improving false discovery rate estimation. Bioinformatics, 20(11), 1737-1745. https://bioinformatics.oxfordjournals.org/content/20/11/1737.full.pdf
[2] Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64(3), 479-498. https://www.genomine.org/papers/directfdr.pdf
Storey's method also leads to a direct estimate of FDP. If we reject at level \(\alpha\) we expect the number of false discoveries to be \(\alpha m_0\). So the estimate of FDP is \(\alpha\hat{m}_0 / R\).
This leads directly to the Storey q-value [1], which is often interpreted either as an FDR-adjusted p-value or as the estimated FDP when we reject at any observed p-value p in the experiment.
We start by sorting the p-values as we do for the BH or Holm procedures.
Note that if we reject for \(p\leq p_{(i)}\) then the total rejections will be at least i (with equality unless two or more of the p-values are equal to \(p_{(i)}\)). Let R(\(\alpha\)) be the number of rejections when we reject for all \(p\leq\alpha\). Then define the q-values by:
\[q(1)= p_{(1)}\hat{m}_0/R(p_{(1)})\]
\[q(i+1)=max(q(i),\,p_{(i+1)}\hat{m}_0/R(p_{(i+1)}))\]
This associates a q-value with each feature, which estimates the FDP if you reject the null hypothesis for this feature and all features which are this significant or more. Often we pick a cut-off for the q-value and reject the null hypothesis for all features with q-value less than or equal to our cut-off.
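Here is a sketch of the q-value computation exactly as defined above, with \(\hat{m}_0 = \hat{\pi}_0 m\) taken from any of the \(\pi_0\) estimators sketched earlier (capping at 1 is an added safeguard):

```python
import numpy as np

def q_values(pvals, pi0_hat=1.0):
    """q(i) = max(q(i-1), p_(i) * m0_hat / R(p_(i))), returned in the original order."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    m0_hat = pi0_hat * m
    order = np.argsort(pvals)
    p_sorted = pvals[order]
    # R(p_(i)) = number of p-values <= p_(i); searchsorted handles ties correctly.
    R = np.searchsorted(p_sorted, p_sorted, side="right")
    q = np.maximum.accumulate(p_sorted * m0_hat / R)
    return np.minimum(q, 1.0)[np.argsort(order)]
```

For example, q_values(p, pi0_storey(p)) <= 0.05 selects the features whose estimated FDP is at most 5%.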
[1] Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035. https://projecteuclid.org/download/pdf_1/euclid.aos/1074290335
We have seen that the histogram of all the p-values of our features plays an important role in estimating FDR. This suggests that we should plot our p-values.
Below is a set of plots from a microarray experiment with multiple treatments. A test was done for each pair of treatments, leading to a large number of pairs, each giving a p-value for each feature. The p-values for each pair of treatments are shown below. Notice that they all have the expected shape, so the use of q-values is appropriate.
Sometimes our tests do NOT have the expected distribution. Below is the histogram of p-values from an experiment using an antigene microarray. The histogram is very informative - it tells me that the test statistic does not have the assumed null distribution, and so the computed p-values are not valid. After looking at the data with the microarray provider, we concluded that some of the statistical analysis steps were not appropriate. Another statistical analysis pipeline will need to be developed.
I always look at the histogram of p-values before interpreting the p-values, computing q-values, or estimating \(\pi_0\). There are many reasons that it might not have the ideal shape. If the data are counts (like sequencing data) the histogram has a different characteristic shape, which we will discuss when we discuss sequencing data. If the data are intensities, a hump at low p (rather than the peak near p=0) might indicate correlation among the tests, due to strong association of the features. Another possibility is that the test statistic does not have the assumed null distribution. For example, if a block design was used but the blocks were not accounted for in the statistical analysis, the degrees of freedom of the test statistic will be incorrect. Another possibility is that the data are highly skewed and the sample size is too small for the t-statistic to have a t-distribution.