2.10 - Bayes, Empirical Bayes and Moderated Methods


Bayesian Methods

Statistical methods are divided broadly into two types: frequentist (or classical) and Bayesian.  Both approaches require data generated by a known randomization mechanism from some population with unknown parameters, with the objective of using the data to learn about those unknowns.  Both approaches also require some type of model of the population - e.g. the population is Normal, or the population has a finite mean.  However, in frequentist statistics, probabilities are assigned only as the long-run frequency of events when sampling from the population.  In Bayesian statistics, the information about the unknown parameters is also summarized by a probability distribution.

To understand the difference, consider tossing a coin 10 times to examine whether or not it is fair.  Suppose the outcome is 7 heads.  The frequentist computes the probability of the observed outcome given that the coin is fair (0.12) and the p-value (Prob(7 or more heads, or 7 or more tails | fair) = 0.34), and concludes that there is no evidence that the coin is unfair.  She might also produce a 95% confidence interval for the probability of a head (0.35, 0.93).  For the frequentist, there is no sense in asking the probability that the coin is fair - it either is or is not fair.  The frequentist makes statements about the probability of the sample after making an assumption about the population parameter (which in this case is the probability of tossing a head).
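For concreteness, here is a minimal sketch of these frequentist calculations in Python, assuming scipy (version 1.7 or later, which provides binomtest and its exact confidence interval):

```python
# Sketch of the frequentist calculations for the coin example
# (the numbers in the text: 0.12, 0.34, and the interval (0.35, 0.93)).
from scipy.stats import binom, binomtest

k, n, p_fair = 7, 10, 0.5

# Probability of exactly 7 heads in 10 tosses of a fair coin
print(binom.pmf(k, n, p_fair))                  # ~0.117

# Two-sided p-value: Prob(7 or more heads, or 7 or more tails | fair)
result = binomtest(k, n, p_fair, alternative='two-sided')
print(result.pvalue)                            # ~0.344

# Exact (Clopper-Pearson) 95% confidence interval for the heads probability
print(result.proportion_ci(confidence_level=0.95, method='exact'))
# ~ (0.348, 0.933)
```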

The Bayesian, by contrast, starts with general information about similar coins - on average they are fair, but each possible value between 0 and 1 has some probability, higher near 0.5 and quite low near 0 and 1.  This probability assessment of the heads proportion is called the prior distribution.  Each person might have their own assessment, based on personal experience, which is called a subjective prior.  Alternatively, a prior distribution might be selected based on good properties of the resulting estimates, called an objective prior.  The data are then observed.  For any particular heads proportion \(\pi\), the probability of 7 heads can be computed (called the likelihood).  The likelihood and the prior are then combined using Bayes' theorem to give the posterior distribution of \(\pi\), which is the probability distribution of the heads proportion given the data.  Since the observed proportion (7 heads in 10 tosses) is higher than 50%, the posterior will give higher probability than the prior to proportions greater than 1/2.  The Bayesian can compute things like Prob(coin is biased towards heads | 7 heads in 10 tosses), although she still cannot compute Prob(coin is exactly fair | 7 heads in 10 tosses), because the probability of any single value is zero.  More information about Bayes' theorem and Bayesian statistics, including more details about this example, can be found in [1] and [2].
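The Bayesian calculation can also be sketched numerically.  The Beta(10, 10) prior below is an illustrative assumption (peaked at 0.5, low near 0 and 1), not a choice made in the text; with a binomial likelihood, the posterior is again a Beta distribution:

```python
# Sketch of the Bayesian update for the coin example, assuming a
# Beta(10, 10) prior (an illustrative assumption, not specified in the text).
from scipy.stats import beta

a, b = 10, 10          # prior: Beta(a, b), centered at 0.5
heads, tails = 7, 3    # observed data

# Beta prior + binomial likelihood => Beta posterior (conjugate update)
post = beta(a + heads, b + tails)

# Prob(coin is biased towards heads | 7 heads in 10 tosses)
print(post.sf(0.5))    # posterior probability that pi > 0.5, ~0.77

# Prob(pi = 0.5 exactly) is zero under a continuous posterior,
# matching the remark in the text.
```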

In Bayesian statistics, unknown quantities are given a prior distribution.  When we are measuring only one feature, this is controversial - what is the "population" that a quantity like the population mean could come from?  In high-throughput analysis, however, this is a natural approach: each feature can be considered a member of a population of features.  So, for example, the mean expression of gene A (over all possible biological samples) can be thought of as a draw from the distribution of mean expression over all genes.  When doing differential expression analysis, we could put a prior distribution on \(\mu_X-\mu_Y\), or on the binary outcome: the gene is/is not differentially expressed.

Due to the information added by the prior, Bayesian analyses tend to be more "powerful" than frequentist analyses.  (I put "powerful" in quotes because, due to the different formulation, it does not mean the same thing to the Bayesian as to the frequentist.)  In addition, Bayesian methods directly address questions of interest to biologists, such as "What is the probability that this gene overexpresses in tumor samples?"  So why don't we use Bayesian methods all the time?

Unfortunately, Bayesian models are quite difficult to set up except in the simplest cases.  For example, Bayesian models are available to replace the t-tests we have already looked at, but more complex models, including analyses like one-way ANOVA, are difficult to specify because there are multiple dependent parameters.  (Recall that one-way ANOVA extends the two-sample t-test to 3 or more populations.  With G populations, there are G(G-1)/2 pairwise differences in means, and these differences and their dependencies would all need priors.)  Another problem is that investigators with different priors will draw different inferences from the same data, which seems contrary to the idea of objective evidence based on the data.  Although the influence of the prior can be shown to be overwhelmed by the data once there is sufficient data, sample sizes in many studies are too small for this to occur.

Software is available to replace t-tests with Bayesian tests in the simplest differential expression scenario.  However, because this software does not extend to more complex situations, we will not use it in this class.  For some problems, though, Bayesian methods provide powerful analysis tools.

Empirical Bayes

In high-throughput biology we have a population of features as well as a population of samples.  For each feature we can obtain an estimate of the parameters of interest, such as means, variances or differences in means.  The histogram of these estimates (over all the features) provides an estimate of the prior for the population of features, called the empirical prior.  This leads to a set of frequentist methods called empirical Bayes methods, which are more powerful than "one-feature-at-a-time" methods.
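As a sketch of this idea with simulated data (all distributions below are illustrative assumptions), the per-gene sample variances, pooled across thousands of genes, trace out an empirical prior:

```python
# Sketch: the histogram of per-gene sample variances, pooled across all
# genes, estimates the prior for the population of features.
# All data here are simulated; the distributions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples = 5000, 6

# Assume each gene has its own true variance, drawn from a population of
# variances (here, scaled inverse-chi-squared draws with 4 df, scale 0.05).
true_vars = 0.05 * 4 / rng.chisquare(df=4, size=n_genes)

# Observed expression: n_samples replicates per gene
data = rng.normal(0.0, np.sqrt(true_vars)[:, None], size=(n_genes, n_samples))

# Per-gene sample variances; their histogram is the empirical prior
s2 = data.var(axis=1, ddof=1)
hist, edges = np.histogram(s2, bins=50)
print(s2.mean(), np.median(s2))   # summaries of the empirical prior
print(hist[:10])                  # first few histogram counts
```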

The idea behind empirical Bayes methods is to use the Bayesian set-up but to estimate the prior from the population of all features.  Formally speaking, empirical Bayes methods are frequentist methods that produce p-values and confidence intervals.  However, because we have the empirical prior, we can also use some of the probabilistic ideas from Bayesian analysis.  We will be using empirical Bayes methods for differential expression analysis.
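To make this concrete, here is a rough sketch of a moderated t-statistic in the spirit of LIMMA, with each gene's variance shrunk toward a prior scale estimated from all genes.  The prior degrees of freedom d0 and the use of the median as the prior scale are simplifying assumptions for illustration; LIMMA itself estimates these quantities from the data:

```python
# Sketch of an empirical Bayes moderated t-statistic in the spirit of
# LIMMA: shrink each gene's variance toward a prior estimated from all
# genes. (d0 and the median prior scale are illustrative assumptions.)
import numpy as np

def moderated_t(x, y, d0=4.0):
    """x, y: genes-by-samples arrays for the two groups."""
    nx, ny = x.shape[1], y.shape[1]
    d = nx + ny - 2                                  # residual df per gene
    # Pooled per-gene sample variance
    s2 = ((nx - 1) * x.var(axis=1, ddof=1) +
          (ny - 1) * y.var(axis=1, ddof=1)) / d
    s0_2 = np.median(s2)                             # empirical prior scale
    # Shrink each gene's variance toward the prior (weighted average)
    s2_mod = (d0 * s0_2 + d * s2) / (d0 + d)
    se = np.sqrt(s2_mod * (1.0 / nx + 1.0 / ny))
    return (x.mean(axis=1) - y.mean(axis=1)) / se    # compare to t on d0+d df

# Usage with simulated data (illustrative only)
rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 3))
y = rng.normal(size=(1000, 3))
print(moderated_t(x, y)[:5])
```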

Moderated Methods

Empirical Bayes methods are related to another set of methods called moderated methods or James-Stein estimators.  These are based on a remarkable result by James and Stein: when estimating 3 or more quantities that can be expressed as expectations (which includes both population means and population variances), a weighted average of the sample estimate from each population and a quantity computed from all the populations is better, on average, than using just the data from that particular population.  (Translated into our setting, it means that if you want to know whether genes 1 through G differentially express, you should use the data from all the genes to make this determination for each gene, although for gene i its own expression levels will be weighted more heavily.)  This is called Stein's paradox.  Further information can be found in [3], as well as in a brief Wikipedia article.  The result is paradoxical because it does not matter whether the populations have anything in common - the result holds even if the quantities we want to estimate are the mean salaries of NFL players, the mean mass of galaxies, the mean cost of a kilo of apples in cities of a certain type, and mean air pollution indices.
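Here is a small sketch of the James-Stein idea for means, in its textbook positive-part form, assuming known unit-variance noise and simulated data:

```python
# Sketch of the (positive-part) James-Stein estimator: shrink each
# sample mean toward the grand mean. Assumes unit-variance noise;
# the data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
G = 50
theta = rng.normal(0.0, 1.0, size=G)     # true means (any 3+ quantities)
x = rng.normal(theta, 1.0)               # one noisy observation per mean

grand = x.mean()
resid = x - grand
# Shrinkage factor: weight between each observation and the grand mean
shrink = max(0.0, 1.0 - (G - 3) / np.sum(resid ** 2))
js = grand + shrink * resid

# On average, js has smaller total squared error than x itself
print(np.sum((x - theta) ** 2), np.sum((js - theta) ** 2))
```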

Moderated methods are very intuitive for "omics" data, since we always have many more than 3 features, and since the features form a genuine population the result seems less paradoxical. Empirical Bayes methods are moderated methods for which the weighting is generated by the empirical prior.  Methods that do not fit under the empirical Bayes umbrella rely on ad hoc weights chosen using other statistical methods.

Empirical Bayes and moderated methods have been popularized by a number of software packages first developed for differential expression analysis of gene expression microarrays, in particular LIMMA (an empirical Bayes method), SAM (a moderated method) and MAANOVA (a moderated method). 

We will start our analyses with microarray data. We will perform t-tests and then use empirical Bayes t-tests to gain power. The more power you have at a given p-value cutoff, the smaller both your false discovery rate and your false non-discovery rate will be.

We improve power by having adequate sample size, good experimental design and a good choice of statistical methodology.

References

[1] López Puga, J., Krzywinski, M., & Altman, N. (2015). Points of significance: Bayes' theorem. Nature Methods, 12(4), 277-278. PMID: 26005726.

[2] López Puga, J., Krzywinski, M., & Altman, N. (2015). Points of significance: Bayesian statistics. Nature Methods, 12(5), 377-378. doi:10.1038/nmeth.3368

[3] Efron, B., & Morris, C. (1977). Stein's paradox in statistics. Scientific American, 236, 119-127.