2.9 - More About Tests: Power, False Discovery, Non-discovery


Typically when discussing p-values, we consider the probability of a rare event for a single comparison.  However, for "omics" data we are doing simultaneous tests on tens of thousands of variables.  An event that occurs 1% of the time is rare when you observe a single value from a sampling distribution, but almost certain to occur if you observe 100 thousand values.  If you purchase a lottery ticket, the chance that you win may be one in 40 million, but someone will win that lottery!  Rare things happen when you do a lot of tests, and we will need to take this into account.
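As a quick back-of-the-envelope illustration (the numbers are just the ones used above, and the calculation assumes the tests are independent), here is a short Python sketch of how "rare" a 1%-probability event is over 100,000 simultaneous tests:

```python
# Back-of-the-envelope: how "rare" is a 1% event over many independent tests?
p = 0.01          # probability of the event on any single test
m = 100_000       # number of simultaneous tests

expected_events = m * p                   # expected count of events
prob_at_least_one = 1 - (1 - p) ** m      # P(at least one event), assuming independence

print(f"Expected number of events: {expected_events:.0f}")      # 1000
print(f"P(at least one event):     {prob_at_least_one:.6f}")    # essentially 1
```

So an event that is rare for any one test is expected to happen about a thousand times across the whole screen.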

Power

When we did the permutation test and the t-test, we assumed that there was no difference in expression in order to compute the p-value. If the null hypothesis is true but we reject it, we make a mistake called a Type I error, or false detection. But that is not the only kind of mistake we could make. We could also fail to detect a gene that is truly differentially expressed. This is called a Type II error, or false non-detection.  The probability that we correctly detect a difference when the null hypothesis is false is called the power of the test.
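A small simulation sketch (illustrative, assuming normally distributed data and a Welch two-sample t-test) makes the Type I error concrete: when the null hypothesis is true, a test at level 0.05 still rejects about 5% of the time, and every one of those rejections is a false detection.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate many two-sample comparisons in which the null is TRUE
# (both groups drawn from the same distribution) and count false detections.
n_sims, n_per_group, alpha = 10_000, 5, 0.05
false_detections = 0
for _ in range(n_sims):
    x = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    y = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(x, y, equal_var=False)   # Welch two-sample t-test
    false_detections += (p < alpha)

print(f"Type I error rate: {false_detections / n_sims:.3f}  (should be near {alpha})")
```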

We would like to design our studies so that the probability of both type I and type II errors are small.

The biology of gene expression in normal and tumor tissues is not under our control.  Each gene has a population of expression values in normal tissues and a population of expression values in tumor tissues -- either the population means are the same (the null is true) or they are not (the null is false).  If there is truly a difference, we want |t*| to be large. So, what is under our control that can give us more power to reject the null when it is false?

Let's take another look at the equation for t*. What values in the equation below will make t* large?

\[t^*=\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n}+\frac{S_Y^2}{m}}}\]

One way would be to have a large numerator, \(\bar{X}-\bar{Y}\). However, we know that the mean of the sampling distribution of \(\bar{X}-\bar{Y}\) is \(\mu_X-\mu_Y\), the difference in mean expression between the two kinds of tissue, which is outside our control. (In some experiments we can influence this too, for example by selecting more extreme conditions such as higher exposures!)

On the other hand, we always have some control over the denominator because we select the sample sizes. The larger the sample sizes, the smaller the denominator. This is why statisticians are always saying, "Take more samples!"  We also have some control over the variances - we can often reduce variance by improving some aspect of our experiment.
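Here is a minimal sketch of the formula above in Python, using made-up log-expression values (means 7.0 and 7.5, standard deviation 1.0, so a true difference of 0.5) purely for illustration. The hand-computed t* matches scipy's Welch test, and as n grows the denominator shrinks, so |t*| tends to grow even though the true difference is unchanged.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def welch_t(x, y):
    """t* computed directly from the formula above."""
    num = x.mean() - y.mean()
    den = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return num / den

# Hypothetical log-expression values for one gene in two tissue types
for n in (3, 10, 30):
    x = rng.normal(loc=7.0, scale=1.0, size=n)   # normal tissue
    y = rng.normal(loc=7.5, scale=1.0, size=n)   # tumor tissue (true difference = 0.5)
    t_hand = welch_t(x, y)
    t_scipy, p = stats.ttest_ind(x, y, equal_var=False)  # Welch test uses the same formula
    print(f"n = {n:2d}: t* = {t_hand:6.2f} (scipy: {t_scipy:6.2f}), p = {p:.3f}")
```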

Science takes many resources - time, money, labor, and samples - all of which are limited.  Moreover, each individual experiment is generally only part of a sequence of studies which together give a bigger picture of the biology.  A good experimental design will try to balance all of these aspects, and this usually means that we cannot take all the samples that statistical principles would suggest.

In the long run, we want to make sure we have enough samples to have reasonable power to detect differences of a meaningful size. We won't be able to detect everything: features with very small differences might need huge sample sizes to be detected with reasonable probability. We have to decide what size of difference we want to detect and use a sample size that has a good chance of detecting it.
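A simulation sketch (again with invented numbers: differences of 0.25, 0.5, and 1 standard deviation, and uncorrected Welch tests at level 0.05) shows how the detection probability depends on both the size of the difference and the sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def power_sim(delta, n, sigma=1.0, alpha=0.05, n_sims=2000):
    """Estimated probability of detecting a true mean difference `delta`
    with n samples per group, using a Welch t-test at level alpha."""
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, sigma, size=n)
        y = rng.normal(delta, sigma, size=n)
        _, p = stats.ttest_ind(x, y, equal_var=False)
        hits += (p < alpha)
    return hits / n_sims

sample_sizes = (3, 10, 30, 100)
for delta in (0.25, 0.5, 1.0):          # hypothetical effect sizes (in SD units)
    powers = [power_sim(delta, n) for n in sample_sizes]
    print(f"difference = {delta:4.2f} SD:",
          "  ".join(f"n={n}: {pw:.2f}" for n, pw in zip(sample_sizes, powers)))
```

Small differences remain nearly undetectable at small n, while a 1-SD difference is detected reliably once the sample size is moderate.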

We also have two ways to control variability. One is experimental control - but we have to be careful about this. For instance, when we grow plants in a greenhouse instead of a field, many environmental influences that increase variance are removed from the picture. However, if you want to find out what happens in the field, the greenhouse is not going to be representative. We often control variability in unrealistic ways within our experiments. For example, in nutrition experiments we seldom allow subjects to eat whatever they want - subjects have to agree to eat all of their meals in the lab throughout the experiment so that we can remove other sources of variability. From this we try to say what will happen in the general population when they follow a dietary guideline.

Recently, concerns have been raised that overly stringent experimental control is contributing to irreproducible research.  The idea is that variability is being so tightly controlled that it is practically impossible to reproduce the experimental conditions, and that either these conditions are contributing to observed differences (or lack of differences) or the estimated variances are so small that there are false detections.  See for example [1].

The second way to reduce variability is to use better measurement instruments. This does not change the biological variability, but it does reduce measurement error. The smaller the measurement error, the less total variability we have. If we can't get rid of the measurement error, we can take multiple measurements of the same sample and average them. Averaging reduces the variability introduced by the measurement process without removing biological variability.  For example, people who do dilution series often run them in triplicate and then average the results, because dilution series are often 'noisy'.  However, there are limitations to this type of replication.  Generally it is more effective to take more biological samples than to take duplicate measurements of the same sample, unless the duplicate measurements are much less expensive.  [2]
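A small worked sketch makes this trade-off concrete. The variance components below (biological variance 1.0, technical variance 0.25) and the budget of 12 assays are assumptions chosen only for illustration; with n biological samples each measured k times and averaged, the variance of the group mean is (biological variance + technical variance / k) / n.

```python
# Variance of a group mean under two designs with the same number of assays.
# Assumed (hypothetical) variance components:
sigma2_bio  = 1.0    # biological (sample-to-sample) variance
sigma2_tech = 0.25   # technical (measurement) variance
assays = 12          # total measurements we can afford

# Design A: 4 biological samples, each measured 3 times and averaged
n_bio, k_tech = 4, 3
var_A = (sigma2_bio + sigma2_tech / k_tech) / n_bio

# Design B: 12 biological samples, each measured once
var_B = (sigma2_bio + sigma2_tech) / assays

print(f"Design A (4 samples x 3 technical reps): variance of the mean = {var_A:.3f}")
print(f"Design B (12 biological samples x 1):    variance of the mean = {var_B:.3f}")
```

Unless technical variance dominates or extra biological samples are much more expensive, spending the assays on more biological samples gives the smaller variance.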

It turns out that sample size is critical to everything. A larger sample size not only increases the power - the probability of rejecting the null hypothesis when it is false - but also reduces the false discovery rate. Increasing sample size also reduces the false non-discovery rate (FNR).

Increasing sample size is not always the best course of action, because we need to make the most of the resources we have. But experiments with sample sizes that are too small are a complete waste of resources, leading to mainly false detections and many false non-detections.
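To see both effects at once, here is a sketch of a toy screen: 10,000 independent genes, 10% of them truly differential with a 1-SD shift, tested with uncorrected Welch tests at level 0.05. All of these numbers are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Toy "omics" screen: 10,000 genes, 10% truly differential (shift = 1 SD).
n_genes, frac_diff, delta, alpha = 10_000, 0.10, 1.0, 0.05
is_diff = rng.random(n_genes) < frac_diff

for n in (3, 10, 30):
    x = rng.normal(0.0, 1.0, size=(n_genes, n))
    y = rng.normal(np.where(is_diff, delta, 0.0)[:, None], 1.0, size=(n_genes, n))
    _, p = stats.ttest_ind(x, y, axis=1, equal_var=False)   # one test per gene
    rejected = p < alpha
    fdp = (rejected & ~is_diff).sum() / max(rejected.sum(), 1)     # false discovery proportion
    fnp = (~rejected & is_diff).sum() / max((~rejected).sum(), 1)  # false non-discovery proportion
    print(f"n = {n:2d}: rejections = {rejected.sum():5d}, "
          f"false discovery prop. = {fdp:.2f}, false non-discovery prop. = {fnp:.2f}")
```

At n = 3 most rejections are false detections and many true differences are missed; at larger n both the false discovery and false non-discovery proportions drop.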

Another way to increase power is to improve the analysis. We can improve quite substantially on tests that consider only one feature at a time by using more sophisticated methods.  So far we have used frequentist methods.  Another approach to statistics is called Bayesian statistics. There are also frequentist methods, based on appealing approximations to Bayesian statistics, which can often produce much more powerful tests.  We will look at these methods next.

References

[1] Richter, S.H., Garner, J.P., Zipser, B., Lewejohann, L., Sachser, N., Touma, C., Schindler, B., Chourbaji, S., Brandwein, C., Gass, P., van Stipdonk, N., van der Harst, J., Spruijt, B., Võikar, V., Wolfer, D.P., & Würbel, H. (2011) Effect of population heterogenization on the reproducibility of mouse behavior: A multi-laboratory study. PLoS ONE, 6(1), e16461.

[2] Krzywinski, M., & Altman, N. (2015) Points of Significance: Sources of variation. Nature Methods, 12(1), 5-6. doi:10.1038/nmeth.3224