Lesson 6: Statistics for Differential Expression in Microarray Studies

Key Learning Goals for this Lesson:

Developing a design matrix
Understanding the moderated t-test
Understanding why increasing power improves both the FDR and FNR

After preprocessing and normalization, the data are in an expression matrix usually with genes in the rows and samples in the columns. The basic analysis is a test for each gene. We need to account for the experimental design in the analysis - for example we may have paired samples or technical replicates.

We are going to demonstrate with the ColonCA data.

6.1 - Two Condition Studies

The most basic differential expression study has two conditions. These could be:

treatment vs control
mutant vs wild type
exposure B versus exposure C

The first thing we will do is look at how this works on a two channel microarray. Because we find that the two channels on the same microarray are more similar than samples on different microarrays, we will used a paired analysis. Usually each array will have both conditions using a dye swap design. The dye effect will average out over the arrays, as long as the same number of samples for each treatment are labelled with each dye, so we do not need to create a special statistical model with a dye effect. However, M=log2(R)-log2(G), so for some microarrays M represents B-C and for others C-B and we do need to keep track of that.

dye swap diagram

For the colonCA data, let's assume that the 22 paired samples were paired on the arrays (i.e. the two tissues from the same subject were hybridized to the two channels on the same array) and that red is always reported before green. (When we have data from a lab or a data repository, it should be clear which samples are labeled with each dye.) There is a positive patient number for the normal tissue samples and a negative patient number with the same number with for the tumor tissue samples. This makes it easy for us to look at the data.

For example, we can see below that the first two columns are microarray 12 with the tumor in the red and the normal in the green.

Microarray channel example

For microarray 27 (columns 3 and 4), the normal is in the red and the tumor is in the green. This is exactly what we expect in a dye-swap design.

Microarray channel example 2

Typically, the data will be arranged in two matrices, one for the red channel on each array, and one for the green, as shown below. After normalization, we obtain M=log2(R)-log2(G). For the paired comparison, we reduce the data to d=log2(normal)-log2(tumor) for each patient (microarray). As shown below, on array 12, d=-M and on array 27 d=M.

Microarray channel example 3

To account for this, we make a "design matrix" which is a matrix with one column and 22 rows (one for each microarray). The entry is +1 if d=M and -1 if d=-M, so that the element-wise product of the design matrix and the M values is d. This simple operation allows us to compute the d values associated with every gene in a single matrix operation, giving us a matrix of the log2 expression difference between normal and tumor for every gene on every microarray as a matrix with genes in the rows and microarrays in the columns.

We can then compute the t-statistic for each row, giving us a paired t-test for every gene. Some of the t-statistics are shown below, along with the histogram of all of the 2000 t-statistics. If there is no differential expression, these t-statistics all come from the null distribution. Assuming that the difference in log2 expression values is approximately normal, the null distribution should be t with 21 d.f. for each of the genes.

Hsa. 3004	-2.057516
Hsa.13491	-1.169676
Hsa.13491.1	-1.590321
Hsa.37254	1.137496
Hsa. 541	-1.750747
Hsa. 20836	-0.851902

The results look fairly bell-shaped, but there might be some genes with t-statistics farther out in the tails than expected. To be sure, we need to compute the p-value for each test, based on the t on 21 d.f. as the reference null distribution.

Instead of computing a t-statistic for every gene using a loop, we will use the LIMMA package, which computes the mean difference and the s.e. of the mean difference efficiently using matrix operations. We will use the LIMMA package throughout this course.

Here are the results that we get using LIMMA.

	t.test	LIMMA (unmoderated)
Hsa. 3004	-2.057516	-2.057516
Hsa.13491	-1.169676	-1.169676
Hsa.13491.1	-1.590321	-1.590321
Hsa.37254	1.137496	1.137496
Hsa. 541	-1.750747	-1.750747
Hsa. 20836	-0.851902	-0.851902

One reason to use LIMMA is computational efficiency. ColonCA has only 2000 genes, so computing the t-statistic using a loop does not take very long. However, when we have 30 - 60 thousand genes, using a loop is quite slow. However, LIMMA has two other advantages. It turns out that the idea of using a design matrix allows for very flexible statistical analysis. Be changing the design matrix, we can use LIMMA for complicated experimental designs like 2-way ANOVA and randomized complete block designs. We can have a mix of paired and unpaired samples. We can do analysis of covariance (ANCOVA) or regression. As well, the default method in LIMMA uses an empirical Bayes estimate to "moderate" the standard deviation in the t-test denominator using the distribution of all the standard deviations. This can really improve the power especially with small samples. Recall that although power is the probability of correctly rejecting the null hypothesis when it is false, if we obtain more power for the same p-value then we actually improve the false discovery rate as well.

Below are the p-values from the 2000 paired t-tests using the ordinary t-statistic and the moderated t-statistic. Because our sample size (22) is relatively large, there is not a huge difference between the two test statistics, although if you look carefully you can see that the first bin is slightly higher for the moderated test (which is what you expect if there are some truly differentially expressing genes).

p-values and moderated p-values

The table below shows the estimated percentage of genes that do NOT differentially express \(\pi_0\)

	Pounds and Cheng \(\pi_0\)	Storey \(\pi_0\)	tests with p < 0.05
ordinary t	0.758	0.664	122
moderated t	0.761	0.679	178

Pounds and Cheng's method says that about 25% differentially expressed and about 75% don't. Storey's method says that roughly 33% differentially expressed. However, both methods estimate a slightly larger percentage of NON-differentially expressing genes using the moderated test. And yet, at p<0.05, there are almost 50% more "detections" using the moderated test. Why?

The moderated t has more power so there are fewer non-detections. Even the most conservative of the 4 estimates of \(\pi_0\) estimates that 25% or 500 genes differentially express. So, even if we reject at p<0.05 and ALL the discoveries are true, we have a substantial number of false nondiscoveries. The moderated method improves power and so reduces the false nondiscovery rate by using information from the population of genes to improve the estimate of the SD of expression for each gene.

What is a Moderated T-test?

The idea of moderation comes from a deep and surprising statistical result called Stein's paradox [1]. In its simplest form it states that when you are estimating 4 or more population means, the best estimator is not the 4 or more sample means, but a weighted average of the sample means and the "mean of the means". One way to find an appropriate weighting is to assume that the population means themselves come from a population (which is a Bayesian idea). This population is then the prior, and the best estimator of each population mean is the Bayes predictor, which is a weighted average of the prior mean and the sample mean. LIMMA uses an Empirical Bayes method which estimates the prior from the set of all feature(since we have 2000 of them - one for each gene).

LIMMA uses this Empirical Bayes method to moderate the sample variances, which are mean squared deviations (and so are a type of sample mean). Below is a plot of variance of the genes and beside it a gamma distribution with the mean and variance matching the average of variance of the sample variances. (Got that? We computed 2000 variances and then used them as a sample to compute a mean and variance.)

var of genes and gamma distribution plots

The shapes of these two plots is very similar (although the peak of the Gamma distribution is slightly lower.) LIMMA assumes that the variances come from a Gamma distribution and uses the variances computed for each gene to obtain the empirical estimates of the mean and variance of the Gamma. This Gamma distribution is the empirical prior distribution for the variances.

Now consider the plot below of the same Gamma distribution. The mean of the distribution is denoted by the vertical black line. The other 4 vertical lines are the TRUE (but unknown) variances of 4 genes. The curves of the same color are the sampling distributions of the sample variances for sample size 22 for a gene with this variance. Notice that the genes with high variance have a sampling distribution that is very spread out and that there are a lot of sample variances in the left tail. So, if we observe a gene with a large sample variance, its true variance is likely to be smaller. Conversely (and harder to see on this plot) the genes with variance smaller than the mean variance are have more sample variances below the true value. The moderated variance is a weighted average of the black line and the observed variance for each gene. This pulls the large variances down towards the mean variance (making the t-tests for those genes more powerful) and the small variances up towards the mean variance (making the t-tests for those genes less powerful). Moderation is somewhat like having a larger sample size for estimating the variance because the sample variance (on average) is more likely to be far from the population variance, while moderated variances are (on average) closer. The moderated variance also has an associated d.f. that is higher.

gamma distribution - interactive?

This empirical Bayes moderation turns out to be equivalent to having a larger sample size. So, as well as shrinking the sample variances towards the mean variance, the associated degrees of freedom increase. These are the d.f. of the null t-distribution for the moderated t-test. Recall that as the d.f. of a t-test increase, the critical values decrease. Using the moderated variance to estimate the SD has two effects on the t-test - the SD in the denominator of the test changes and the d.f. increase. The result is that for genes with large sample variance, power is increased in two ways - the moderated sample variance is smaller and the d.f. are larger. For genes with small sample variance, power is decreased by having a larger moderated sample variance and increased by having more d.f.

If you had a sample size of 100 instead of 22 all of the distributions would be narrower, and there is less shrinkage. As a result, the amount of moderation adjusts automatically to the amount of information available for each gene.

In summary, the moderated t-test is a t-test using the square root of the moderated variance as the SD instead of the sample variance. Having a better estimate of the standard deviation in turns out to be reflected in having more power for the tests. You would have the most power for that sample size if you actually knew the standard deviation, in which case you would use a normal distribution for the null distribution.

There are many different models for doing variance moderation: Cyber T and SAM are two of these products available in R which use moderated variances with a somewhat different justification. I prefer LIMMA and the associated statistical model because the clever empirical prior used by LIMMA leaves us with t-tests. Using other software you have to simulate to get the null distribution. Using LIMMA means that the t-test using the moderated standard deviation continues to have a t-distribution when the null is true, but there is a change in the d.f. associated with the moderated variance which is computed along with the moderated standard deviation. Simulations done by Smyth and others confirmed that the moderated t-test gains power. This means that the for any given false discovery rate, the false non-discovery rate will be smaller.

Doing the moderated tests with LIMMA

To perform the moderated tests with LIMMA we start with the normalized data. For two-channel microarrays, we usually use the normalize M as the data, which will give us a moderated paired t-test. This is appropriate, because hybridizing the two samples on the same microarray induces a pairing. As well, for the colonCA data, we have assumed that the two samples on the same array are also from the same patient.

We require a "design matrix" to determine what needs to be averaged for the test. When there are dye-swaps, some M values are B-C and others are C-B, so our design matrix is a single column of +/-1 with +1 for Normal-Cancer and -1 for Cancer-Normal. Denoting the design matrix as D and the column of M values as M, D^TM is the average Normal-Cancer value.

Given the data and the design matrix, LIMMA computes the difference in means the the appropriate SD in the one step, and then computes the moderated SD and the moderated t-test in the "eBayes" step. I then usually use Storey's method to compute q-values and come up with a final gene list.

6.2 - Two-Condition Studies on a 1 Channel MicroArray

In this section, we work with independent samples. We assume that the measurement platform is a 1-channel microarray such as an Affymetrix array, or a two channel 'reference design'in which one channel is devoted to a reference sample, so that M=sample-reference. We start with normalized data on the log2 scale, in a data matrix with genes in the rows and samples in the columns.

We will use the colonCA data again, but this time we use only the normal samples from the 22 patients with two samples, and only the cancer samples from the 18 patients that had only cancer samples. This gives us independent samples.

Our typical situation is that each sample is on separate array and the analysis will be a two sample t-test. Again, we can do this by doing t-tests for every row. But we prefer to do the empirical Bayes or 'moderated' t-test because to get a little more power.

In the paired case, we had an M value for each patient which represented the difference in expression, so our design matrix required only one column which gave the sign of M. In the unpaired case, some patients provided cancer samples and others provided normal samples. So we need a slightly different design matrix. I prefer to use what is called the treatment means model. The columns of the design matrix are indicator variables for each treatment group. So, in this case, we have 2 columns. One column has a 1 for each normal sample and a 0 for each tumor sample, while the other column has a 1 for each tumor sample and a 0 for each normal sample. Letting D be the design matrix and E the log2(Expression) matrix (after normalization) we find that M=ED is a 2000 x 2 matrix. The first column is the sample mean for the normal samples for each gene and the second column is the sample mean for the tumor samples for each gene. The ordinary (unmoderated) variances are the pooled within treatment variances.

To obtain the difference in means, we use a contrast matrix. If C=[1,-1] a 1x2 matrix, then MC^Tis a 1x2000 matrix with the difference in means for each gene.

In the final step we moderate the pooled within variances exactly as we did in the paired case, and compute the moderated two-sample t-statistics (using the pooled variances).

Below you can see the 2000 p-values from the usual two sample t-tests and a moderated two sample t-test. As we saw with the paired t-test, the moderation does not have a large effect, because the samples sizes (18 and 22) are relatively large. But we do expect some additional power due to the moderation.

p-values and moderated p-values

As we did before, we also estimated \(pi_0\), the percentage of genes that do not differentially expressed. We used the Pounds method which is the average p-value. We also use the Storey's method which uses the the area under the flat part of the curve. They are somewhat different but both show that there is quite a bit of differential expression. At most 65% of the genes do not differentially express and 35% do. As before, our nondetection rate is pretty high, since we expect about 35% of 2000 = 700 genes to differentially express. And, as before, we detect more "significant" genes with the moderated than with the unmoderated test, even though the estimate of \(\pi_0\) is quite similar.

	Pounds and Cheng \(\pi_0\)	Storey \(\pi_0\)	#q < 0.05
ordinary t	0.648	0.582	326
moderated t	0.648	0.588	340

So far we have done two different analyses (paired and unpaired), each using part of the data. The power of LIMMA and other "linear models" methods is that we can fit a model that uses all the data. We will look at that in Chapter 7.

6.3 - Blocked Samples

Actually the colonCA samples were measured on Affymetrix@ microarrays, which are 1-channel arrays. So the analysis of the paired samples should be handled as if the subjects are random blocks. The blocking factor is handled using the intrablock correlation, which is explained in Section 7.3 of the course notes.

One way to express the similarity of samples within blocks is to assume that there are two components of variability - the within block variability and the between block variability. A sample may be paired, in which case it is in a block of size 2 or unpaired, in which case it is in a block of size 1. All samples therefore have within block variability. Samples in the same block vary only by within block variability while samples in different blocks have both within and between block variability.

Another way of expressing the same idea is the intrablock correlation, which is the ratio of the within block variance to the total variance. Instead of estimating the 2 variance components, you then estimate the intrablock correlation and one of the variance components - usually the within block variability.

LIMMA uses the intrablock correlation and enforces a very severe shrinkage. It assumes that all the genes have the same intrablock correlation. However, they each have their own within block variability, which is computed using the Empirical Bayes method outlined in a previous section.

Computing the intrablock correlation is done in the LIMMA function duplicateCorrelation. The function takes as input the data, the blocking variable and the design matrix. We need the design matrix because we want to compute the residual correlation, so we need to include both the predictors (which are stored in the design matrix and are usually indicator variables for the treatments) and the blocks.

The first step, as with other shrinkage methods, is to compute the correlation for each gene. As with shrinkage of the variances, it is helpful to think about the population of correlations from which these correlations could come. It turns out that when the data are Normally distributed and all the genes have the same intrablock correlation \(\rho\), then \(Z=.5 log [(1+r)/(1-r)]=arctanh(r)\) is \(Normal(arctanh(\rho),1/sqrt(N-3))\) where \(N\) is the sample size. Under the assumption that the genes are independent (which is not quite true) with the same true intrablock correlation, the observed correlations are a sample from this distribution. So the arctanh transformed gene intrablock correlations are a sample from \(Normal(arctanh(\rho),1/sqrt(N-3))\).

This makes our life simple, because the only unknown for the distribution of the Z's is the population mean, and we know that the best estimator is the sample mean - i.e. the mean of the Z's. duplicateCorrelation computes the gene by gene correlations, transforms to the Z's and computes the mean on the arctanh scale. (Actually, duplicateCorrelation uses the 15% trimmed mean to avoid outliers in case the actual data are not Normally distributed.)

For the colonCA data we see that the genewise correlations are skewed.

correlation for colonCA data

The Z's look quite Normally distributed. The variance should be \(1/sqrt(2000-3)\), so the SD of the sampling distribution should be about 0.15. The actual SD of the 2000 sample correlations is 0.26. So, the trimming helps with the excess variability which shows up on the histogram as set of very short bins on both sides of the main peak. The estimated mean correlation is \(0.43=(tanh(mean(atanh(r)))\).