STAT 555

Lesson 7: Linear Models for Differential Expression in Microarray Studies

Key Learning Goals for this Lesson:
  • Understanding why we use statistical models
  • Understanding the linear model
  • Setting up the design matrix

Introduction

Linear models are among the most used statistical methods.   T-tests, ANOVA, ANCOVA and regression can all be formulated as special cases of linear models.  As well, a set of models called generalized linear models are - no surprise given the name - generalizations of the linear model and are also widely used for modeling and analysis.

In most sciences we have model systems that are simplifications of the complex systems that we would like to work with or which provide us with a system in which we can make manipulations which would not be possible in the "real" system.  For example, we use mouse models for human diseases because we can create inbred lines (which are much simpler than human lineages) and we can impose conditions that would be difficult or unethical to impose on human subjects, such as special diets or exposures to pathogens.

A mathematical model is also a simplification.  We can build quite complex models, for example of protein interaction networks, that allow us to set parameters and then simulate how the system behaves.  We then usually want to determine if the "real" system behaves in the same way.

Statistical models are often less refined than mathematical models in defining the mechanisms of a system, but include stochastic (random) terms that allow us to make probability statements.  This allows us to do statistical testing and compute predictions and confidence intervals.  Once we have a model for a system, the model can be used to determine appropriate study designs, sample sizes, etc.

Statistical models essentially have two components, the systematic or "fixed effects" component that describes the deterministic part of the system and the random component that describes the random parts of the system including biological and technical variation.  Usually, we are interested in estimating or testing the systematic effects caused by conditions such as genotype, developmental stage, environmental exposure, etc.

Every model includes some assumptions.  In general, the more data we have, the fewer assumptions we need to make, as we can use the data to estimate more model parameters.  For example, the model for the 2-sample t-test with pooled variance states that the samples have different means but the same variance.  If both samples are sufficiently large, we can use Welch's t-test which allows the samples to have different means and different variances. Another assumption of the t-test is that each sample comes from a population that is close to Normal.  

Sometimes we can manipulate the model to more closely resemble the data and sometimes we manipulate the data to more closely resemble the model.  For example, in linear models we usually assume that the noise is Normally distributed.  Gene expression data are usually skewed - taking logarithms of the data tends to make the noise more symmetric and hence closer to Normal.  In some cases we may do a sensitivity analysis of the model to determine sensitivity to violations of the model assumptions.  For example, this type of analysis tells us that the t-test is quite insensitive (robust) to many departures from Normality, but is quite sensitive to skewness.

Formulating a model can also guide the study design and the analysis of the resulting data.  For example, when we set up a study to determine if there are "treatment differences", our intention to use a t-test implies a model in which "treatment difference" actually implies "difference in treatment means" rather than e.g. differences in variability or skewness.  We use the model to determine the sample size required to achieve the desired power.  And of course, after the data are collected, we then use the t-test for the analysis, possibly after checking that the data are "close to Normal".

The linear model is one of the simplest models used in statistics.  It encompasses some models that you do not usually think of as "linear" such as ANOVA and polynomial trends.  In this Chapter we will learn more about linear models and how to set up a linear model for statistical analyses in R. 

7.1 - Linear Models

Systematic and Noise Models

Statisticians often think of data as having a systematic component and a noise component.  Let's assume for now that the data Y is the log2(expression) of a particular gene (in some unspecified units).  Then a commonly used and very useful model is

\[Y=\mu +\epsilon\]

where \(\mu\) is the systematic piece, the population mean, also written E(Y).  This is an unknown constant.

The other piece is \(\epsilon\), which is the random error or the noise. It may not be error in the colloquial sense, even though we call it \(\epsilon\) for error. It is more a mixture of measurement error and biological variation.

Straight Lines and Linear Models

For example, suppose the population is all undergraduate students at Penn State enrolled in Sept. of the current year (which is a large, but finite population).  This population has a mean weight \(\mu\) which can be determined by measuring the weight of each student and averaging.  The weight Y of a student has a deviation \(\epsilon= Y-\mu\). Even if we could perfectly measure the weights, biological variability among the students ensures that \(\epsilon\) has a distribution, often called the "error" distribution.  Because of the way that \(\epsilon\) was computed, we know that the mean of the error distribution is 0.

Now we know that weight varies with height, so we could consider the distribution of weights for students of a particular height.  For example, here is a plot of weight versus height for Stat 200 students in 1995 (not a random sample of Penn State students):

[Figure: weight versus height for Stat 200 students in 1995]

We see that the relationship between height and weight is approximately linear.  Recall that \(Z=\beta_0+\beta_1 X\) is the equation of a line with intercept \(\beta_0\) and slope \(\beta_1\). 

 

[Figure: the line \(Z=\beta_0+\beta_1 X\) with intercept \(\beta_0\) and slope \(\beta_1\)]

 

To take advantage of this, we might model weight as:

\[ Y=\beta_0+\beta_1 X+\epsilon\]

where

Y=weight

X=height

If we assume that for each fixed value of X, the errors have mean zero, then the mean of Y for fixed X (written E(Y|X) or \(\mu(X)\)) is \(\beta_0+\beta_1 X\).

Now suppose we have measured n students, obtaining \(x_i,y_i\) for i=1...n.  Let

x be the column vector with entries the n values of height

y be the column vector with entries the n values of weight

\(\epsilon\) be the column vector with entries \(\epsilon_i=y_i-(\beta_0+\beta_1 x_i)\)

\(\beta\) be the column vector with the 2 entries \(\beta_0\) and \(\beta_1\).

and 1 be a column vector with n entries, each of which is the number "1".

Finally, let X be the matrix with columns 1 and x.

Notice that X\(\beta\) is a column vector with entries \(\beta_0+\beta_1 x_i\).

So we can write y=X\(\beta +\epsilon\) in matrix notation.  Also, note that E(Y|X)=X\(\beta\).

X is called the design matrix.  It is a matrix with known entries which is a function of our data x - in this case a column of 1's and the column with the \(x_i\) values.  \(\beta\) is a vector of unknown parameters.

In mathematics and statistics, a function f of a vector \(\alpha\) is said to be a linear function if \(f(\alpha)=M\alpha\) for some known matrix M.  In statistics, we have a linear model when E(Y|X)=M(X)\(\beta\). That is, a linear model in statistics includes the case when E(Y|X) is a straight line, but also includes other cases in which M(X) is a known matrix which is a function of X.  Note that we are using "X" in three different ways: "x" is the vector of x-variables recorded on each of our samples, "X" is the abstract x-variable (which may be random or fixed in advance) and "X" is the design matrix.
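
As a concrete illustration, here is a minimal R sketch of the design matrix for the straight-line model; the height and weight values are made up purely for illustration:

    # Hypothetical heights (inches) and weights (pounds) for n = 5 students
    x <- c(63, 66, 68, 70, 73)
    y <- c(125, 140, 155, 160, 190)

    # The design matrix X: a column of 1's and the column of heights
    X <- model.matrix(~ x)
    X                # first column all 1's, second column the x values

    # Fit y = X beta + epsilon; coef() gives the estimates of beta_0 and beta_1
    fit <- lm(y ~ x)
    coef(fit)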

Recall that in Chapter 6 we defined two different design matrices, one for the paired t-test and the other for the two-sample t-test.  We will now show how these two tests relate to a linear model.

Paired and Two-sample t-tests Arise from Linear Models

Suppose our data arise as paired samples.  Since we are reserving X for other purposes, we will call the measured quantities Z and Y.  In a paired t-test, we change our data to D=Z-Y and test \(\mu_Z=\mu_Y\) or equivalently, \(\mu_D=\mu_Z-\mu_Y=0\) where the first equality is due to the fact that we can exchange the order of averaging and differencing, and the second equality is the null hypothesis.

Our model for the data is

\[D=\mu_D+\epsilon=\beta_0+\epsilon\]

so that \(\beta_0=\mu_D\) and \(\beta_1=0\).

The design matrix is just the column of ones and since \(\beta_1=0\) we can let \(\beta\) be the 1-vector \(\beta=\beta_0\).  Then the \(i^{th}\) entry of X\(\beta\)  is \(\beta_0\).  In this case, the test of \(\mu_D=0\) becomes a test of \(\beta_0=0\).

Note that for our two-channel microarray study, the design matrix was a column of +/-1.  This is because the data were M=log2(R)-log2(G), not Z-Y.  The plus and minus signs then yield that the \(i^{th}\) entry of X\(\beta\) is \(\beta_0\) if M=Z-Y and \(-\beta_0\) if M=Y-Z.

Now consider the two-sample t-test.  One way to set this up is to record (X,Y) where X is "1" if the sample comes from population 1 and "0" otherwise.  Then we want to test \(\mu_1=\mu_2\) or equivalently \(\mu_1-\mu_2=0\).  The model is 

\[Y=\beta_0+\beta_1X +\epsilon.\]

Then if the sample comes from population 1, E(Y|X)=\(\mu_1=\beta_0+\beta_1\) and if the sample comes from population 2, E(Y|X)=\(\mu_2=\beta_0\).  So \(\beta_1=\mu_1-\mu_2\) and our test of \(\mu_1-\mu_2=0\) becomes a test of \(\beta_1=0\).
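
A small R sketch (with made-up expression values) shows that the pooled two-sample t-test and this linear model are the same test:

    # Hypothetical log2(expression) for 3 samples from each population
    y <- c(8.1, 7.9, 8.4, 6.8, 7.2, 7.0)
    x <- c(1, 1, 1, 0, 0, 0)     # indicator: 1 = population 1, 0 = population 2

    # The row for x tests beta_1 = mu_1 - mu_2 = 0
    fit <- lm(y ~ x)
    summary(fit)$coefficients

    # The pooled two-sample t-test gives the same t-statistic and p-value
    t.test(y[x == 1], y[x == 0], var.equal = TRUE)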

Linear Models with Larger Design Matrices

The linear model is readily extended to p predictor variables \(X_1 \cdots X_p\) with observations \(x_{i1}\cdots x_{ip}\) by

\[Y=\beta_0+\beta_1X_1 +\beta_2 X_2 + \cdots +\beta_pX_p +\epsilon.\]

The design matrix then has p+1 columns: the first column is 1 and column j has entries \(x_{ij}\).

Now consider one-way ANOVA with p treatments.  Let \(X_j\) be the indicator variable which is 1 if the observation came from treatment j and 0 otherwise, for j=1...(p-1).  The design matrix then consists of a column of 1's and the p-1 indicator variables.  Notice that if the observation comes from treatment j, j<p then

\[E(Y|X_1\cdots X_{p-1})=\beta_0 + \beta_j\]

and if the observation comes from treatment p, then

\[E(Y|X_1\cdots X_{p-1})=\beta_0.\]

Letting \(\mu_j\) be the \(j^{th}\) treatment mean, this means that \(\beta_0=\mu_p\) and \(\beta_j=\mu_j-\mu_p\).

If the \(p^{th}\) treatment is the control, setting up the model this way is sensible.  We can test every treatment versus the control by testing \(\beta_j=0\).  However, often we do not have a control, or other comparisons are also of interest.  In that case, a slightly different model called the cell means model is more convenient. 

Cell Means Model

We remove the column of 1's from the design matrix, but we include an indicator for treatment p.  Now our design matrix has \(x_{ij}\) in column j and the model is

\[E(Y|X_1 \cdots X_p)=\beta_1X_1 +\beta_2 X_2 + \cdots +\beta_pX_p.\]

This gives \(\beta_j=\mu_j\).  We then use another matrix, called the contrast matrix, to set up the comparisons of interest such as \(\mu_1=\mu_2\) or \(\mu_1-\mu_2=\mu_3-\mu_4\). Contrast matrices are discussed in the next section.
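
In R, the two parameterizations correspond to keeping or dropping the intercept in the model formula.  A short sketch with hypothetical treatment labels:

    # Hypothetical labels for 8 samples from p = 4 treatments
    treatment <- factor(c("A", "A", "B", "B", "C", "C", "D", "D"))

    # Reference-level coding: intercept plus p - 1 indicators.  Note that R's
    # default baseline is the first factor level ("A"), not the last as in the
    # text, so here beta_0 = mu_A and beta_j = mu_j - mu_A.
    model.matrix(~ treatment)

    # Cell means coding: drop the intercept so each column is one treatment's
    # indicator and beta_j = mu_j
    model.matrix(~ 0 + treatment)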

 

Two-factor Designs Using Linear Models

Now suppose we have two factors, such as normal/tumor and gender.  Suppose we include two indicator variables, \(X_1=1\) for tumor tissue and \(X_1=0\) for normal, and \(X_2 = 1\) for females and \(X_2 = 0\) for males. Notice that we have 4 means, \(\mu_{TF}\), \(\mu_{TM}\), \(\mu_{NF}\) and  \(\mu_{NM}\).  Let

\[E(Y|X_1,X_2)=\beta_0+\beta_1 x_1+\beta_2 x_2.\]

Then \(\mu_{TF}=\beta_0+\beta_1+\beta_2\), \(\mu_{TM}=\beta_0+\beta_1\), \(\mu_{NF}=\beta_0+\beta_2\) and \(\mu_{NM}=\beta_0\).  We say that the effects are additive because the effect of gender and tissue type just add up.  See the plot below.

[Figure: additive effects of gender and tissue type]

 

The difference in expression between normal and tumor tissue is \(\beta_1\) for both men and women.  Possibly for colon cancer this might make sense.  However, for an estrogen-driven cancer the tumor versus normal difference in expression might differ between males and females. This model does not allow for this. It only allows for a fixed difference between normal and tumor samples (which is \(\beta_1\)), and a fixed difference between males and females (which is \(\beta_2\)).  A more general model allows synergy, anti-synergy or some other sort of interaction between the factors.

There are several design matrices that allow us to fit this more general linear model.  One thing we could do is use the cell means model, with an indicator variable for each tissue/gender combination.  Another method is called the interaction model.  The design matrix for this model has one additional column, which is the product of \(X_1\) and \(X_2\).  Notice that \(X_1X_2=1\) only when the sample is tumor and female.  Let

\[E(Y|X_1,X_2)=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_1 x_2\]

Then \(\mu_{TF}=\beta_0+\beta_1+\beta_2+\beta_3\), \(\mu_{TM}=\beta_0+\beta_1\), \(\mu_{NF}=\beta_0+\beta_2\) and \(\mu_{NM}=\beta_0\). And \(\mu_{TF}-\mu_{NF}=\beta_1+\beta_3\) while \(\mu_{TM}-\mu_{NM}=\beta_1\), so the difference between normal and tumor tissue can differ in men and women, as in the plot below.

[Figure: gender and tissue type effects with interaction]

 

\(\beta_3\) is called the interaction between \(X_1\) and \(X_2\). \(\beta_3=(\mu_{TF}-\mu_{NF})-(\mu_{TM}-\mu_{NM})=(\mu_{TF}-\mu_{TM})-(\mu_{NF}-\mu_{NM})\) so the interaction can be described as the difference in differential expression in normal and tumor tissue for men compared to women, or the difference in differential expression in men and women for normal versus tumor tissue.
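
A small R sketch of the additive and interaction design matrices for this 2 x 2 example (the factor values are made up):

    # Hypothetical factors for 8 samples: tissue (N = normal, T = tumor) and gender (M, F)
    tissue <- factor(c("N", "N", "N", "N", "T", "T", "T", "T"), levels = c("N", "T"))
    gender <- factor(c("M", "M", "F", "F", "M", "M", "F", "F"), levels = c("M", "F"))

    # Additive model: columns are the intercept, tissueT (X1) and genderF (X2)
    model.matrix(~ tissue + gender)

    # Interaction model: adds the product column tissueT:genderF, which is 1
    # only for female tumor samples (its coefficient is beta_3)
    model.matrix(~ tissue * gender)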

 

A More Complicated Experiment

What should we do if we want to include two stages or two different types of the cancer, as well as the normal samples? How would we handle this?

Another 0, 1 indicator could be put into place.

\(X_1 =1\) if the cancer is stage 1 (0 otherwise)
\(X_2 =1\) if the cancer is stage 2 (0 otherwise)

What about interaction? A tissue sample can come from a female and be either normal or tumor.  However, a sample cannot be both stage 1 and stage 2 cancer.  So there cannot be interaction between \(X_1\) and \(X_2\) in this case (i.e. there are no samples for which \(X_1X_2=1\)). 

Everything that we want to consider concerning gene expression should have a term in the model. We can include indicator variables for groups or variables where the ordering doesn't make sense.  We might also have continuous variables that give us linear effects, or we might have polynomial variables that give us curved effects. For instance, some responses we expect to go up with age and so we put in age without an indicator variable which would give us linear effect for age.  If there is a cyclic effect, such as cell cycles, we could put in a sine wave over time.
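
As a sketch, several of these term types can be combined in a single R model formula; all of the variables here are hypothetical:

    # A factor for genotype, a linear effect of age, and a sine wave for a daily cycle
    genotype <- factor(c("WT", "WT", "WT", "MUT", "MUT", "MUT"))
    age      <- c(4, 8, 12, 4, 8, 12)            # age in weeks
    hour     <- c(0, 8, 16, 0, 8, 16)            # time of sample collection

    model.matrix(~ genotype + age + sin(2 * pi * hour / 24))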

ANOVA and ANCOVA can also be written with linear models.

7.2 - Contrasts

We have seen how to express the population means in a t-test or ANOVA by a linear model.  We have also seen that some of the tests of interest in ANOVA can be expressed as tests of regression coefficients.  For other comparisons we will use contrasts.

What is a contrast? A contrast is essentially a difference in regression coefficients. We have seen that the regression coefficients can express a difference in means or a single mean, as well as the slope and intercept of a line. A contrast is a way of testing more general hypotheses about population means.

Suppose we have p different populations (treatments) and we have measured the expression level of the gene in a sample from each population.  The mean in population j is \(\mu_j\) and the sample mean is  \(\bar{Y}_j\).

Suppose \(c_1, c_2, \ldots ,c_p\) are numbers with \(\sum c_j=0\).  Then \(\sum c_j \mu_j\) is called a contrast of the population means.  Notice that if all the means are equal, then any contrast is zero.

We can estimate the population contrast by plugging in the sample means:

\[c_1 \bar{Y}_1 +c_2 \bar{Y}_2 + \cdots + c_p \bar{Y}_p \]

Notice that simple comparisons like \(\mu_1=\mu_2\) can be expressed as a contrast with \(c_1=1\) and \(c_2=-1\).  

However, contrasts can express more general comparisons.  For example, consider normal and tumor tissue in males and females with means \(\mu_{TF}\), \(\mu_{TM}\), \(\mu_{NF}\) and \(\mu_{NM}\).  We could look at the difference in gene expression in normal male and female tissue with coefficients \(c_{TF}=c_{TM}=0\), \(c_{NF}=1\) and \(c_{NM}=-1\) (i.e. \(\mu_{NF}-\mu_{NM}=0\)), or the difference in gene expression in males and females averaged over tissue type with \(c_{TF}=c_{NF}=.5\) and \(c_{TM}=c_{NM}=-.5\)  (i.e. \(\frac{\mu_{TF}+\mu_{NF}}{2}-\frac{\mu_{TM}+\mu_{NM}}{2}=0\)). The interaction between gender and tissue type can be expressed by \(c_{TF}=c_{NM}=1\) and \(c_{TM}=c_{NF}=-1\) (i.e. \((\mu_{TF}-\mu_{NF})-(\mu_{TM}-\mu_{NM})=0\)).  Any of these contrasts can be estimated by plugging in the sample means.  
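
For example, here is a small R sketch of estimating the interaction contrast by plugging in made-up sample means:

    # Hypothetical sample means for the four tissue/gender groups
    ybar <- c(TF = 9.1, TM = 8.0, NF = 7.6, NM = 7.5)

    # Coefficients for the interaction contrast (mu_TF - mu_NF) - (mu_TM - mu_NM)
    cc <- c(TF = 1, TM = -1, NF = -1, NM = 1)
    sum(cc)          # a contrast's coefficients must sum to zero
    sum(cc * ybar)   # estimated contrast: plug the sample means into sum(c_j * mu_j)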

To set up the model in a statistical package, you usually need to specify the factors and their interactions, as well as any blocking structure.  You do not have to set up the indicator variables.  However, you do need to figure out which contrasts you want to estimate and test, and how these are specified by the package.

If the experiment is complicated, for example if the experiment is unbalanced, has interactions, has multiple sources of noise, etc. the sample mean may not be the best estimate of the population mean.  In these cases, a better estimate of the mean can be obtained by using the maximum likelihood estimator.  Sophisticated statistical software allows you to specify the factors and error terms, and will compute appropriate estimates of the population means.    

Performing the tests is fairly simple. Statistical theory tells us that if each population is approximately Normal and has the same variance \(\sigma^2\), and the samples are independent, then if the contrast is actually 0:

\[t^*=\frac{\text{Estimated Contrast}}{SE(\text{Estimated Contrast})}\]

has a t-distribution with degrees of freedom determined from the MSE.  This is just a generalization of the usual one- and two-sample t-tests and is based on the same statistical theory.  The MSE is computed as in ANOVA, adjusted by using the moderated variance estimator.  This also changes the d.f.

In complicated situations you might have to do more than one test. For example, suppose there are three populations such as three genotypes. You could set up the null hypothesis that they have identical mean expression:

\[H_0: \mu_1 = \mu_2 =\mu_3 \]

However, you can't test this with one t-test. There are different ways that this could be set up. One way would be to set up two contrasts against \(\mu_2\):

\[\mu_1 - \mu_2 =0, \;\;\; \mu_2 - \mu_3 =0 \]

Or alternatively, another common way might be:

\[\mu_1 - \mu_2 =0, \;\;\; \mu_3 - (\mu_1+\mu_2)/2 =0 \]

If all the means are equal, all 4 of these contrasts are zero.  However, if the means are not all equal, the contrasts differ in value, and so the results from the second test in each pair might vary.

 Main Effects and Interactions

If you have taken a course that covers two-way or higher ANOVA, you have likely learned about main effects and interactions.  The F-tests used to determine if these terms are statistically significantly different from zero are based on accumulating information from contrasts.  The d.f. for these terms determines the number of contrasts required.  As the simplest example, we will consider a 2-factor ANOVA with 2 levels of each factor.  For this design, the main effects and interaction all have 1 d.f., so only one contrast is required.

Let's return to our example of normal and tumor tissue in males and females.  We have already seen the contrast for the interaction.  The contrast for the main effect of tissue type has coefficients \(c_{TF}=c_{TM}=\frac{1}{2}\), \(c_{NF}=c_{NM}=-\frac{1}{2}\) - i.e. it is the expression in tumor tissue averaged over male and female minus the expression in normal tissue averaged over male and female.  Similarly, the contrast for the main effect of gender has coefficients \(c_{TF}=c_{NF}=\frac{1}{2}\), \(c_{TM}=c_{NM}=-\frac{1}{2}\). The F-statistic for each effect is just the square of the t-statistic.
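
For this 2 x 2 example, the three contrasts can be collected into a contrast matrix.  A small R sketch (the column order TF, TM, NF, NM is chosen here for illustration):

    contrast.matrix <- rbind(
      tissue      = c( 0.5,  0.5, -0.5, -0.5),   # main effect of tissue type
      gender      = c( 0.5, -0.5,  0.5, -0.5),   # main effect of gender
      interaction = c( 1.0, -1.0, -1.0,  1.0))   # interaction
    colnames(contrast.matrix) <- c("TF", "TM", "NF", "NM")
    contrast.matrix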

In general, if there are p "treatments" or factor combinations, we can express the null hypothesis of no difference in the means using p-1 linearly independent contrasts. These can be combined to form the ANOVA F-test (which is why the treatment sum of squares has p-1 d.f.).  Although the LIMMA software does compute the F-test, we usually use the tests based on the individual contrasts.  The ANOVA approach is used in MAANOVA, which is also available in Bioconductor (as well as in the ANOVA and MIXED routines in SAS, SPSS, etc.).

Using the Cell Means Model

While many popular statistical packages use the main effects and interactions set-up for ANOVA, along with F-tests, the software that we will use for differential analysis makes heavy use of contrasts.  This includes the software for differential expression and differential binding (i.e. ChIP) as measured both on microarrays and sequencing platforms.  Since I find it simplest to think in terms of contrasts of the means, I prefer to use the cell means model which was discussed in Section 7.1.  The simplest way to set this up is to have an indicator variable which labels the treatment combination for each sample.  For example, suppose I have tumor/normal tissue, gender and doses 1, 2, 3 of a drug.  I label the treatments as TF1, TM1, ... NM3.  There are 12 different labels -- one for each of the p=12 treatments.  If I have 3 replicates, I will have a factor of length 36 with the labels for each sample in the order in which they appear in the data matrix.   

If the labels are stored in a character vector, or as a factor, then when the vector is used in a model formula, R knows to create the indicator variables for each treatment.  This makes it simple to set up the cell means model as seen in Section 7.1.  You must remember to remove the intercept, because the default in the software is to include an intercept term.
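
As a sketch (with hypothetical labels), the factor and the cell means design for this 12-treatment example might be set up as:

    # Hypothetical labels: tissue (T/N) x gender (F/M) x dose (1/2/3), 3 replicates each
    treat <- factor(rep(c("TF1", "TM1", "NF1", "NM1",
                          "TF2", "TM2", "NF2", "NM2",
                          "TF3", "TM3", "NF3", "NM3"), each = 3))
    length(treat)        # 36 samples

    # Cell means design: remove the intercept so there is one column per treatment
    design <- model.matrix(~ 0 + treat)
    dim(design)          # 36 rows, 12 columns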

Steps in Fitting the Linear Model and Contrasts

The software we use actually has 3 steps: 

  1. Set up and fit the model.
  2. Set up and fit the contrasts
  3. Estimate the variance using the ANOVA MSE and do the t-tests for the contrasts.
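
A minimal LIMMA sketch of these three steps, assuming exprs.mat is a matrix of log2 expression values (genes in rows, samples in columns) and design is the cell means design built above; the contrast name is hypothetical:

    library(limma)

    # 1. Set up and fit the model
    fit <- lmFit(exprs.mat, design)

    # 2. Set up and fit the contrasts (here a single hypothetical contrast)
    contrast.matrix <- makeContrasts(TvsN.F1 = treatTF1 - treatNF1, levels = design)
    fit2 <- contrasts.fit(fit, contrast.matrix)

    # 3. Moderated variance estimates and t-tests for the contrasts
    fit2 <- eBayes(fit2)
    topTable(fit2, coef = "TvsN.F1")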

Estimating Variance

An important part of the ANOVA and regression models that we usually use is that the variance is the same in all the populations.  When this is true, a pooled estimate of the variance is an improvement on the individual sample variances.  Each sample variance is the average squared deviation from its sample mean.  The pooled sample variance arises from taking the squared deviations of each observation from its own sample mean and averaging over all the samples.

For example, in the plot below, each boxplot represents the samples from one treatment combination, with 4 treatments in all.  The sample means have been added to each box (blue horizontal line) and for treatment 3 the 5 observations \(Y_{31}\cdots Y_{35}\) are shown as red dots.  The residuals (also called deviations) from the sample mean are the red lines connecting the observations to the mean.  The within treatment estimate of variance is the sum of squared lengths of these lines, divided by \(n_i-1\) where \(n_i\) is the sample size (number of biological replicates) within the treatment.  

[Figure: boxplots of the four treatment groups (brain tissue samples), with sample means and the residuals for treatment 3 marked]

We need to have replicates in order to compute the deviations, because if there is only one observation in the treatment, then the sample mean is the same as the sample value and the deviation is zero.

The pooled estimate of the variance is the sum of the squared deviations using all the treatments, divided by the sum of the \((n_i-1)\), which comes out to \(n_T - p\) where \(n_T\) is the total number of observations and p is the number of treatments.  This is the MSE of the ANOVA model.
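
A small R sketch of this computation, using made-up data for p = 3 treatments:

    # Hypothetical observations and treatment labels
    y     <- c(5.1, 5.4, 4.9, 6.2, 6.0, 7.1, 7.4, 7.0, 7.3)
    group <- factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3))

    dev <- y - ave(y, group)                           # deviations from each treatment's sample mean
    MSE <- sum(dev^2) / (length(y) - nlevels(group))   # pooled variance = MSE
    MSE

    # The same value appears as the residual mean square in the ANOVA table
    summary(aov(y ~ group))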

For a regression situation (for a predictor like age for instance) we do not need to have replicates for each value of the predictor. This is because the mean for each value of the predictor is estimated as the corresponding point on the line, and the deviations from the line are used in the sum of squared deviations. This is the MSE of the regression model.

[Figure: regression plot showing the deviations from the fitted line]

Our final step in estimating the variance will be shrinkage using the empirical Bayes model as discussed in Chapter 6.

7.3 - Noise Components

While most data have multiple sources of noise, we only need to model these sources when some samples share a noise component. Samples with a shared noise component are more similar than samples which have a different component of that noise source.  If we do not model the shared noise components, our statistical software assumes that all samples are equally variable.  This leads to estimated SEs that are too large for some contrasts, leading to false negative tests, and too small for others, leading to false positive tests.

The simplest example of a shared noise component is when we have technical replicates from the same biological replicate.  Suppose we have 3 individuals and take 2 technical replicates from each.  The 2 replicates from the same individual share the same biological noise component.  For this reason, they are more similar than 2 samples taken from different individuals.  All 6 samples have both biological and technical variation, but the pairs from the same individual share their biological component and so are correlated.

Technical replication is usually done to obtain a more precise estimate for an individual in the population by reducing technical noise.  Some sources of technical noise are discussed in [1].  Since we are usually interested in population parameters, rather than individuals, obtaining a more precise estimate for each individual is seldom as cost-effective as measuring more individuals.  The exceptions to this are: 1) when we are testing our protocols and need to assess the relative sizes of various sources of technical noise, 2) when we are interested in the measurements for an individual, for example for a diagnostic test, and 3) when technical replication is much cheaper than biological replication and the technical noise is large.  An example of the last case would be samples of human brain tissue - since these are very difficult to obtain, any measurement we take should be taken with as much precision as possible.

Another example of a shared noise component comes from blocking in an experiment.  In this case, we expect block to block variability, modeled by the block effect.  The observations in the same block share the same block effect and so are less variable than observations in different blocks. 

Blocking is done so that comparisons among treatments can be made using samples which share the same block component.  Because these samples are similar (in the absence of a treatment effect) we have more precision to assess differences induced by the treatment.

Precision is gained by replication, because we are generally interested in population means.  As we have seen (e.g. [2]) when we average n samples with the same mean and variance \(\sigma^2\), the variance of the sample mean is \(\sigma^2/n\).    To understand what occurs with different noise components, we need to see what happens when we average over samples that do or do not share noise components.  We also need to consider  the effect of shared noise components on the variability of estimated contrasts.

Whenever there are shared noise sources such as blocks or biological entities, these need to be included as factors in the model.  For example, if we have 6 samples from 6 individuals from the same treatment, we might model the data as

\[Y_i=\mu+\epsilon_i\]

We understand that the biological and technical variation are pooled into the "errors" \(\epsilon_i\).

However, if we have 6 samples which are pairs of technical replicates of 3 individuals from the same treatment, we need to include a term for the shared biological variation b:

\[Y_{ij}=\mu+b_i+\eta_{ij}\]

If we are using the same measurement system, then we understand that \(\epsilon_{ij}=b_i+\eta_{ij}\) where \(\eta_{ij}\) expresses the part of the variability in the observations that is due to technical variation, but not biological variation.  We assume that \(\eta_{ij}\) is independent of \(b_i\).  Following the rules of probability, we find that the variance of \(Y_{ij}\) is \(\sigma^2_\epsilon=\sigma^2_b+\sigma^2_\eta\).  Notice that when we have technical replicates, \(Y_{i1}-Y_{i2}=\eta_{i1}-\eta_{i2}\).  So the variance of \(Y_{i1}-Y_{i2}\) is \(2\sigma^2_\eta\), which can be estimated from the data.  If we take the difference of observations from different biological units i and k, then \(Y_{ij}-Y_{ks}=b_i-b_k+\eta_{ij}-\eta_{ks}=\epsilon_{ij}-\epsilon_{ks}\).  So the variance of \(Y_{ij}-Y_{ks}\) is \(2\sigma^2_b+2\sigma^2_\eta=2\sigma^2_\epsilon\).  So, if we have technical replicates, we can estimate \(\sigma^2_b\), \(\sigma^2_\eta\) and \(\sigma^2_\epsilon\).  However, if all the observations come from different biological units, we do not have enough information to separate the technical variation and biological variation -- they are said to be confounded.
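
A tiny simulation sketch may make the shared component concrete; the variance values here are made up:

    # 3 individuals, 2 technical replicates each; sigma_b = 1, sigma_eta = 0.5
    set.seed(1)
    b   <- rnorm(3, 0, 1)                    # biological components b_i
    eta <- rnorm(6, 0, 0.5)                  # technical components eta_ij
    Y   <- 10 + rep(b, each = 2) + eta       # Y_ij = mu + b_i + eta_ij

    # A within-individual difference involves only the eta's ...
    Y[1] - Y[2]
    # ... while a between-individual difference also involves b_i - b_k
    Y[1] - Y[3]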

In some experiments, we want to know the relative sizes of the biological and technical variation.  For these experiments we definitely need both biological and technical replication.  However, in most experiments, we are interested in contrasts of treatment means.

Consider the case without blocking and just 2 treatments, A and B.  We have observations \(Y_{Cij}\) where C is either A or B, i is the biological replicate and j is the technical replicate of sample i.  The contrast \(\mu_A-\mu_B\) is estimated by \(\bar{Y}_A-\bar{Y}_B\).  If the biological replicates are independent, then the variance of the estimated contrast is just the sum of the variance of \(\bar{Y}_A\) and \(\bar{Y}_B\).

If the 12 samples are 6 biological replicates for each treatment, then

\[\bar{Y}_C=\mu_C+\bar{b}_C+\bar{\eta}_C\]

where \(\bar{b}\) and \(\bar{\eta}\) are both averaged over 6 independent values from treatment C.  So \(Var(\bar{Y}_C)=\sigma^2_b/6+\sigma^2_\eta/6=\sigma^2_\epsilon/6\).  The contrast is

\[\bar{Y}_A-\bar{Y}_B=\mu_A-\mu_B+\bar{b}_A-\bar{b}_B+\bar{\eta}_A-\bar{\eta}_B.\]

So, the variance of the contrast will be \(2\sigma^2_\epsilon/6\).  As we can see, we do not require separate estimates of \(\sigma^2_b\) and \(\sigma^2_\eta\) to estimate the variance of the contrast.

If the 12 samples are 3 biological replicates for each treatment with 2 technical replicates then averaging the 6 replicates for the treatment is the same as first averaging the technical replicates (which share the same biological noise) and then averaging the averages to again obtain

\[\bar{Y}_C=\mu_C+\bar{b}_C+\bar{\eta}_C\]

However, now \(\bar{b}\) is averaged over only 3 independent values, although \(\bar{\eta}\) is still averaged over 6 independent values. So \(Var(\bar{Y}_C)=\sigma^2_b/3+\sigma^2_\eta/6\).  The variance of the contrast will be \(2(\sigma^2_b/3+\sigma^2_\eta/6)\), which cannot be estimated without the separate estimates of \(\sigma^2_b\) and \(\sigma^2_\eta\).  As well, the variance of the contrast is larger than it would have been if the 12 observations all came from different biological replicates.

Suppose instead we use the biological unit as blocks, and are able to apply treatments A and B to the two samples from the same biological unit (i.e. paired samples).  Now we have 

\[\bar{Y}_C=\mu_C+\bar{b}_S+\bar{\eta}_C\]

with the averages over the 6 samples from treatment C.  Notice however, that because the same biological units were measured for A and B, the term \(\bar{b}_S\) is shared in both averages.  The contrast is

\[\bar{Y}_A-\bar{Y}_B=\mu_A-\mu_B+\bar{b}_S-\bar{b}_S+\bar{\eta}_A-\bar{\eta}_B=\mu_A-\mu_B+\bar{\eta}_A-\bar{\eta}_B.\] 

which has variance \(2\sigma^2_\eta/6\).  Notice that this is less than the variance of the contrast without blocking (which is why we block).  However, to estimate the variance of the contrast, we need to estimate \(\sigma^2_\eta\).

From the perspective of setting up a model for our statistical software, the exercise above shows that we can only model shared sources of noise, and that for correct statistical inference we must model them.  All of the software require a term in the model for each shared source of noise and assume that there are unshared sources.  The statistical model entered into the software must include terms for all shared sources and must not include terms for unshared sources.

In the examples above:  

  • If there are technical and biological replicates, the model needs a term for the biological replicates (but not the technical replicates).
  • If there are blocks, the model needs a term for the blocks but not for the samples within blocks.

Although many experiments have several levels of replication [3] which require several noise factors, the software we will use for differential analysis (LIMMA) allows only one shared noise factor, which is designated by block.  So, if there are biological and technical replicates, the biological replicates need to be designated as blocks.  However, if there is an actual blocking factor and technical replicates, the technical replicates should be averaged before the data are analyzed.

This requirement of only one shared noise factor is specific to LIMMA because of the way it handles variance moderation.  Software such as SAS (and the lme and lmer functions in R) allows multiple shared noise factors, but does not do variance moderation.  MAANOVA, another Bioconductor package for microarray data, does variance moderation for each shared noise factor, but requires simulation to obtain null distributions for the resulting statistical tests.  I prefer the model fitted by LIMMA and so far, I have been able to handle most situations using LIMMA by using blocking factors and averaging.
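
A sketch of how the single blocking factor is supplied to LIMMA, assuming exprs.mat is the log2 expression matrix, design is the design matrix, and bio.rep is a made-up labelling of which biological replicate each column came from:

    library(limma)

    bio.rep <- rep(1:6, each = 2)    # hypothetical: 6 biological reps, 2 technical reps each

    # Estimate the consensus (shrunken) intraclass correlation within blocks
    corfit <- duplicateCorrelation(exprs.mat, design, block = bio.rep)
    corfit$consensus.correlation

    # Refit the linear model using the block factor and the consensus correlation
    fit <- lmFit(exprs.mat, design, block = bio.rep,
                 correlation = corfit$consensus.correlation)
    fit <- eBayes(fit)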

Intraclass Correlation and Shrinkage

When there are shared sources of variation, the units that share this source are correlated.  A measure of this correlation is called the intraclass correlation.

\[\frac{\sigma^2_b}{\sigma^2_b+\sigma^2_\eta}=\frac{\sigma^2_b}{\sigma^2_\epsilon}=\rho.\]

Notice that if we have an estimate r of \(\rho\) and an estimate of \(\sigma^2_\epsilon\), then this gives us estimates of \(\sigma^2_b\) and \(\sigma^2_\eta\). 
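
In other words (with made-up numbers):

    # Recover the variance components from an intraclass correlation estimate r
    # and an estimate of the total variance sigma^2_epsilon
    r      <- 0.4
    s2.eps <- 0.9
    s2.b   <- r * s2.eps          # shared (biological/block) component
    s2.eta <- (1 - r) * s2.eps    # unshared (technical) component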

Any software that can model shared sources of variation computes either the variance components or the intraclass correlation (or both) for each feature.  One problem with this approach is that there is often not much information for estimating the shared variance.  For example, if there are 2 treatments, with 3 biological replicates per treatment and 2 technical replicates per biological replicate, then there are 12 observations but only 6 biological replicates.  All 12 observations are used to estimate the technical variation but only the 6 biological replicates are used to estimate the biological variation.  (This is not entirely obvious without going through the computations in detail.) As a result, the biological variation may be poorly estimated. 

To estimate the shared variance component, the LIMMA software does an extreme shrinkage of the intraclass correlation.  This is a rather clever approach, because it assumes that each feature has its own variance, but that the ratio of the shared to total variance is similar across features.  An estimate of the intraclass correlation is computed for each feature.  A central summary of the intraclass correlations is computed.  All of the intraclass correlations are replaced by this central value r*. Then the variance components \(\sigma^2_b\) and \(\sigma^2_\eta\) are estimated using r* and an estimate of \(\sigma^2_\epsilon\).  Because we usually have thousands of features, the SE of r* is close to zero, so it is treated as a known (error-free) quantity.  A side effect of this shrinkage is a change of d.f. in the t-tests in which the variance estimates are used.  This approach to estimating the variance components is supported by simulation studies using synthetic data, and generally improves the FDR and FNR of the resulting tests.
The central value r* is computed using what is called the normalizing transformation of the correlations computed from each feature. The normalizing transformation for correlations is \(\text{arctanh}(r)\), Fisher's z-transformation. The average of \(\text{arctanh}(r)\) is taken over all the features, and then transformed back using the inverse transformation, tanh. Essentially, the normalizing transformation makes the sample correlations behave more like a sample from the Normal distribution, so that the sample mean is a good measure of the center.
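
A sketch of the idea with made-up feature-level correlations (LIMMA's actual computation uses a robust average on this scale):

    # Average the correlations on the arctanh (Fisher z) scale, then back-transform
    r.by.feature <- c(0.55, 0.40, 0.62, 0.35, 0.50)
    r.star <- tanh(mean(atanh(r.by.feature)))
    r.star
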
References

[1] Krzywinski, M., &  Altman, N.  (2015) Points of Significance: Sources of variation. Nature Methods, 12(1), 5-6  doi:10.1038/nmeth.3224
[2] Krzywinski, M., & Altman, N. (2013). Points of significance: Importance of being uncertain. Nature Methods, 10(9), 809-810.
[3] Blainey, P. & Krzywinski, M., & Altman, N. (2014). Points of significance: Replication. Nature Methods, 11(8), 879-880  doi:10.1038/nmeth.3091

7.4 - Sample Pooling

We say that samples are pooled when units that might be measured separately are processed together in such a way that the separate measurements can no longer be determined.  For example, a tissue sample might be considered a pool of single cells.  However, more typically when we discuss pooling we mean that biological replicates are processed together at some stage, such as RNA extraction or microarray hybridization.

Typically, the decision to use sample pooling is due to the inability to obtain enough experimental material from a single individual.  In this era of single cell processing, this is seldom called for in well-established protocols, but may be required for protocols which require substantial amounts of starting material.  On the other hand, pooling is similar to averaging, and properly done pooling can be used to improve precision when the processing costs are high relative to the costs of obtaining biological replicates. [1]

Suppose that you are comparing gene expression in wild-type versus mutant plants in the root tip and that you have the resources to measure 3 biological replicates of each type of plant.  If the plants are readily grown and the root tips sectioned, you might start by sectioning the root tips of 30 plants of each genotype.  You could then make pools of tips from 10 plants, using different plants in each pool.  The combined tissue samples could then be processed together and become a biological replicate.  Alternatively, if RNA extractions are inexpensive, you could do individual extractions for each of the 30 plants, but then combine the RNA in equal aliquots to form pools of 10 before labeling.

To keep the variability and the variance estimates correct for subsequent statistical analysis, you need to do your pooling in exactly the same way for each pool.  Even if one of the plants has the capacity to provide a larger sample, you cannot use that plant to replace independent replicates in the pool.  This is because the pools are now your biological replicates and are assumed to be equally variable. If we consider the RNA from a single feature in a pool of 10 plants, it will have about the average amount of RNA; if the bulk of the pool comes from a single plant, the RNA in the pool will be more similar to a single plant than to an average.

A common mistake is to create a pooled sample, and then split it into subsamples.  Subsamples of a pool are technical replicates.  This is the optimal way to design studies to determine the reproducibility of measurements from the measurement platform, because there will be only technical variation.  However, for studies meant to determine differential expression between different treatments, we need an estimate of biological variation.  Sample splitting does not provide a measure of the biological variability - multiple independently pooled samples are required.

References

[1] Biswas, S., et al. Biological Averaging in RNA-Seq. arXiv:1309.0670v2, 2013.

7.5 - Example - The Colon Cancer Data

 When all the samples are independent, differential expression can be tested using a two-sample t-test.  When all the samples are paired, differential expression can be tested using a paired t-test.  But when we have both paired and unpaired samples, use of a linear model that models both the difference in means and the shared noise components allows us to correctly incorporate all of the data into our statistical model.

A Model for the Colon Cancer Data

We have 62 tissue samples coming from 40 patients.  18 patients contributed only a tumor sample, while 22 contributed both tumor and normal samples.

The systematic part of the model is the tissue type - either tumor or normal.  We set up an indicator variable X=0 for normal samples and X=1 for tumor samples.

The shared noise in the samples comes from the subject (patient).  We model this using \(b_i\) the subject "noise".  Every subject has a value of \(b_i\) even if they contributed only one sample. In this study, \(b_i\) is considered to be a block effect because the two samples from the same patient are not technical replicates - they have different values of the "treatment" which is the tissue type.

The final component is the technical noise.  This is always present in our model.  So, for every gene, our model for log2(expression) for sample j from patient i is:

\[Y_{ij}=\beta_0+\beta_1X_{ij}+b_i+\eta_{ij}\]

In a normal sample X = 0, so the gene expression is:

\[Y=\beta_0+b_i+\eta_{ij}\]

And in a tumor sample the gene expression is:

\[Y=\beta_0+\beta_1+b_i+\eta_{ij}\]

So, the difference between the two systematic pieces is \(\beta_1\). This is followed by subject noise and sample noise.

Now we have a model for the systematic tumor versus normal effects and the subject effects. In some cases there are two samples from the same subject, and in some cases there is only one. When we submit the model and the data to the software, the estimates of the coefficients and their SEs will be appropriately adjusted to account for paired and unpaired samples.
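
A hypothetical LIMMA sketch for this analysis, assuming exprs.mat is the (genes x 62) log2 expression matrix, tissue is a factor with levels "normal" and "tumor" for each column, and patient identifies the patient each sample came from (the same steps as the sketch in Section 7.3):

    library(limma)

    design <- model.matrix(~ tissue)             # column 2 is the tumor indicator, beta_1

    corfit <- duplicateCorrelation(exprs.mat, design, block = patient)
    fit <- lmFit(exprs.mat, design, block = patient,
                 correlation = corfit$consensus.correlation)
    fit <- eBayes(fit)
    topTable(fit, coef = 2)                      # moderated tests of beta_1 = 0, gene by gene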

As part of the analysis, the software estimated the intra-class correlation as discussed in Section 7.3. The estimate  is 0.47. What does this mean?  Recall that

\[\rho=\frac{\sigma^2_b}{\sigma^2_b+\sigma^2_\eta}.\]

So, the person to person variability is about the same as the noise between the samples within the same person.  Since the variance of an unpaired difference is \(2\sigma^2_\epsilon=2(\sigma^2_b+\sigma^2_\eta)\) while the variance of a paired difference is \(2\sigma^2_\eta\), the variability in the 2-sample t-test (using only the unpaired samples) should be about double the variability in the paired t-test. 

Now we have done 3 different t-tests on these data - the paired samples only using the paired t-test, the unpaired samples only using the two-sample t-test, and all the data using the linear model.  The histograms of p-values are below, along with a Venn diagram showing the genes declared statistically significant in each case. 

[Figure: histograms of p-values for the three analyses]

Paired samples (left), unpaired samples (right), and all samples combined (below)

The combined analysis "found" 183 genes not "found" by the the paired t-test and 122 genes not "found" by the two-sample t-test, and all the genes "found" by both.   On the other hand, we might be somewhat unhappy by the discrepancies in the results!

[Figure: Venn diagram of genes detected by the three analyses]

Of these 3 analyses, I would select the linear model, which uses all the data.  That means that I would accept the 74 genes found by the combined model but not by the other methods as more likely to be true detections than the 105 genes found by the other methods but not by the combined analysis.


Source URL: https://onlinecourses.science.psu.edu/stat555/node/12