
While most data have multiple sources of noise, we only need to model these sources when some samples share a noise component. Samples that share a noise component are more similar to one another than samples that do not. If we do not model the shared noise components, our statistical software assumes that all samples are equally variable. This leads to estimated SEs that are too large for some contrasts, producing false negative tests, and too small for others, producing false positive tests.

The simplest example of a shared noise component is when we have technical replicates from the same biological replicate.  Suppose we have 3 individuals and take 2 technical replicates from each.  The 2 replicates from the same individual share the same biological noise component.  For this reason, they are more similar than 2 samples taken from different individuals.  All 6 samples have both biological and technical variation, but the pairs from the same individual share their biological component and so are correlated.

Technical replication is usually done to obtain a more precise estimate for an individual in the population by reducing technical noise. Some sources of technical noise are discussed in [1]. Since we are usually interested in population parameters, rather than individuals, obtaining a more precise estimate for each individual is seldom as cost-effective as measuring more individuals. The exceptions to this are: 1) when we are testing our protocols and need to assess the relative sizes of various sources of technical noise; 2) when we are interested in the measurements for an individual, for example for a diagnostic test; and 3) when technical replication is much cheaper than biological replication and the technical noise is large. An example of the last case would be samples of human brain tissue: since these are very difficult to obtain, any measurement we take should be taken with as much precision as possible.

Another example of a shared noise component comes from blocking in an experiment. In this case, we expect block-to-block variability, modeled by the block effect. The observations in the same block share the same block effect and so are less variable than observations in different blocks.

Blocking is done so that comparisons among treatments can be made using samples which share the same block component. Because these samples are similar (in the absence of a treatment effect), we have more precision to assess differences induced by the treatment.

Precision is gained by replication, because we are generally interested in population means. As we have seen (e.g. [2]), when we average n samples with the same mean and variance \(\sigma^2\), the variance of the sample mean is \(\sigma^2/n\). To understand what occurs with different noise components, we need to see what happens when we average over samples that do or do not share noise components. We also need to consider the effect of shared noise components on the variability of estimated contrasts.
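As a quick check, we can simulate this variance reduction in R; the choices \(n=6\) and \(\sigma=2\) below are arbitrary, for illustration only.

    # Variance of a sample mean: each row of the matrix is one sample of n values
    set.seed(1)
    n <- 6; sigma <- 2
    samples <- matrix(rnorm(10000 * n, mean = 0, sd = sigma), ncol = n)
    var(rowMeans(samples))   # close to sigma^2 / n = 4/6 = 0.67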

Whenever there are shared noise sources such as blocks or biological entities, these need to be included as factors in the model. For example, if we have 6 samples from 6 individuals from the same treatment, we might model the data as

\[Y_i=\mu+\epsilon_i\]

We understand that the biological and technical variation are pooled into the "errors" \(\epsilon_i\).

However, if we have 6 samples which are pairs of technical replicates of 3 individuals from the same treatment, we need to include a term \(b_i\) for the shared biological variation:

\[Y_{ij}=\mu+b_i+\eta_{ij}\]

If we are using the same measurement system, then we understand that \(\epsilon_{ij}=b_i+\eta_{ij}\), where \(\eta_{ij}\) expresses the part of the variability in the observations that is due to technical variation but not biological variation. We assume that \(\eta_{ij}\) is independent of \(b_i\). Following the rules of probability, we find that the variance of \(Y_{ij}\) is \(\sigma^2_\epsilon=\sigma^2_b+\sigma^2_\eta\). Notice that when we have technical replicates, \(Y_{i1}-Y_{i2}=\eta_{i1}-\eta_{i2}\), so the variance of \(Y_{i1}-Y_{i2}\) is \(2\sigma^2_\eta\), which can be estimated from the data. If we take the difference of observations from different biological units i and k, then \(Y_{ij}-Y_{ks}=b_i-b_k+\eta_{ij}-\eta_{ks}=\epsilon_{ij}-\epsilon_{ks}\), so the variance of \(Y_{ij}-Y_{ks}\) is \(2\sigma^2_b+2\sigma^2_\eta=2\sigma^2_\epsilon\). Thus, if we have technical replicates, we can estimate \(\sigma^2_b\), \(\sigma^2_\eta\) and \(\sigma^2_\epsilon\). However, if all the observations come from different biological units, we do not have enough information to separate the technical and biological variation -- they are said to be confounded.
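A small simulation illustrates these calculations; the values \(\sigma_b=2\) and \(\sigma_\eta=1\) are arbitrary choices for illustration.

    # Many individuals with 2 technical replicates each (sigma_b = 2, sigma_eta = 1)
    set.seed(1)
    n_ind <- 1000
    b  <- rnorm(n_ind, sd = 2)      # shared biological components
    y1 <- b + rnorm(n_ind, sd = 1)  # technical replicate 1
    y2 <- b + rnorm(n_ind, sd = 1)  # technical replicate 2
    var(y1 - y2) / 2                # estimates sigma_eta^2 = 1 (b cancels in the difference)
    var(y1)                         # estimates sigma_eps^2 = sigma_b^2 + sigma_eta^2 = 5
    var(y1) - var(y1 - y2) / 2      # estimates sigma_b^2 = 4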

In some experiments, we want to know the relative sizes of the biological and technical variation.  For these experiments we definitely need both biological and technical replication.  However, in most experiments, we are interested in contrasts of treatment means.

Consider the case without blocking and just 2 treatments, A and B. We have observations \(Y_{Cij}\), where C is either A or B, i indexes the biological replicate, and j indexes the technical replicate of biological replicate i. The contrast \(\mu_A-\mu_B\) is estimated by \(\bar{Y}_A-\bar{Y}_B\). If the biological replicates are independent, then the variance of the estimated contrast is just the sum of the variances of \(\bar{Y}_A\) and \(\bar{Y}_B\).

If the 12 samples are 6 biological replicates for each treatment, then

\[\bar{Y}_C=\mu_C+\bar{b}_C+\bar{\eta}_C\]

where \(\bar{b}_C\) and \(\bar{\eta}_C\) are each averaged over 6 independent values from treatment C. So \(Var(\bar{Y}_C)=\sigma^2_b/6+\sigma^2_\eta/6=\sigma^2_\epsilon/6\). The contrast is

\[\bar{Y}_A-\bar{Y}_B=\mu_A-\mu_B+\bar{b}_A-\bar{b}_B+\bar{\eta}_A-\bar{\eta}_B.\]

So, the variance of the contrast will be \(2\sigma^2_\epsilon/6\). As we can see, we do not require separate estimates of \(\sigma^2_b\) and \(\sigma^2_\eta\) to estimate the variance of the contrast.

If the 12 samples are 3 biological replicates for each treatment, each with 2 technical replicates, then averaging the 6 samples for treatment C is the same as first averaging the technical replicates (which share the same biological noise) and then averaging those averages, to again obtain

\[\bar{Y}_C=\mu_C+\bar{b}_C+\bar{\eta}_C\]

However, now \(\bar{b}_C\) is averaged over only 3 independent values, although \(\bar{\eta}_C\) is still averaged over 6 independent values. So \(Var(\bar{Y}_C)=\sigma^2_b/3+\sigma^2_\eta/6\). The variance of the contrast will be \(2(\sigma^2_b/3+\sigma^2_\eta/6)\), which cannot be estimated without separate estimates of \(\sigma^2_b\) and \(\sigma^2_\eta\). As well, the variance of the contrast is larger than it would have been if the 12 observations had all come from different biological replicates.
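A short simulation comparing the two 12-sample designs makes the difference concrete; as in the earlier sketch, \(\sigma_b=2\) and \(\sigma_\eta=1\) (so \(\sigma^2_\epsilon=5\)) are arbitrary illustrative values.

    # Var(Ybar_A - Ybar_B) for n_bio biological replicates x n_tech technical replicates
    contrast <- function(n_bio, n_tech) {
      grp <- function() mean(rep(rnorm(n_bio, sd = 2), each = n_tech) +
                               rnorm(n_bio * n_tech, sd = 1))
      grp() - grp()   # difference of two independent group means
    }
    set.seed(1)
    var(replicate(20000, contrast(6, 1)))   # ~ 2 * (5/6)       = 1.67
    var(replicate(20000, contrast(3, 2)))   # ~ 2 * (4/3 + 1/6) = 3.00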

Suppose instead we use the biological units as blocks, and are able to apply treatments A and B to the two samples from the same biological unit (i.e. paired samples). Now we have

\[\bar{Y}_C=\mu_C+\bar{b}_S+\bar{\eta}_C\]

with the averages over the 6 samples from treatment C. Notice, however, that because the same biological units were measured for A and B, the term \(\bar{b}_S\) is shared by both averages. The contrast is

\[\bar{Y}_A-\bar{Y}_B=\mu_A-\mu_B+\bar{b}_S-\bar{b}_S+\bar{\eta}_A-\bar{\eta}_B=\mu_A-\mu_B+\bar{\eta}_A-\bar{\eta}_B,\]

which has variance \(2\sigma^2_\eta/6\). Notice that this is less than the variance of the contrast without blocking (which is why we block). However, to estimate the variance of the contrast, we need to estimate \(\sigma^2_\eta\).
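The same style of sketch for the paired design shows the smaller contrast variance (again with the arbitrary values \(\sigma_b=2\) and \(\sigma_\eta=1\)).

    # Paired design: treatments A and B applied within each of 6 biological units
    paired_contrast <- function() {
      b  <- rnorm(6, sd = 2)       # one shared component per biological unit
      yA <- b + rnorm(6, sd = 1)   # treatment A sample from each unit
      yB <- b + rnorm(6, sd = 1)   # treatment B sample from each unit
      mean(yA) - mean(yB)          # the shared components cancel
    }
    set.seed(1)
    var(replicate(20000, paired_contrast()))   # ~ 2 * (1/6) = 0.33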

From the perspective of setting up a model for our statistical software, the exercise above shows that we can only model shared sources of noise, and that for correct statistical inference we must model them. All of the software packages require a term in the model for each shared source of noise and assume that there are also unshared sources. The statistical model entered into the software must include terms for all shared sources and must not include terms for unshared sources.

In the examples above:  

  • If there are technical and biological replicates, the model needs a term for the biological replicates (but not the technical replicates).
  • If there are blocks, the model needs a term for the blocks but not for the samples within blocks.

Although many experiments have several levels of replication [3], which require several noise factors, the software we will use for differential analysis (LIMMA) allows only one shared noise factor, which is designated by the block argument. So, if there are biological and technical replicates, the biological replicates need to be designated as blocks. However, if there is an actual blocking factor and technical replicates, the technical replicates should be averaged before the data are analyzed.

This requirement of only one shared noise factor is specific to LIMMA because of the way it handles variance moderation. Software such as SAS (and lmer in the lme4 package in R) allows multiple shared noise factors, but does not do variance moderation. MAANOVA, another Bioconductor package for microarray data, does variance moderation for each shared noise factor, but requires simulation to obtain null distributions for the resulting statistical tests. I prefer the model fitted by LIMMA, and so far I have been able to handle most situations using LIMMA with blocking factors and averaging.
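As a minimal sketch of such an analysis in LIMMA, using its duplicateCorrelation function: the objects exprs_mat, treatment, and biorep below are hypothetical placeholders for an expression matrix (features in rows, samples in columns), the treatment factor, and the biological-unit labels.

    # Hypothetical LIMMA analysis with biological replicates designated as blocks
    library(limma)
    design <- model.matrix(~ 0 + treatment)            # treatment: factor of group labels
    corfit <- duplicateCorrelation(exprs_mat, design,  # estimates the consensus
                                   block = biorep)     #   intraclass correlation r*
    fit <- lmFit(exprs_mat, design, block = biorep,
                 correlation = corfit$consensus.correlation)
    fit <- eBayes(fit)                                 # moderated t-tests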

Intraclass Correlation and Shrinkage

The following material is expanded from Chapter 6.2.

When there are shared sources of variation, the units that share this source are correlated.  A measure of this correlation is called the intraclass correlation.

\[\frac{\sigma^2_b}{\sigma^2_b+\sigma^2_\eta}=\frac{\sigma^2_b}{\sigma^2_\epsilon}=\rho.\]

Notice that if we have an estimate r of \(\rho\) and an estimate of \(\sigma^2_\epsilon\), then this gives us estimates of \(\sigma^2_b\) and \(\sigma^2_\eta\), since \(\sigma^2_b=\rho\sigma^2_\epsilon\) and \(\sigma^2_\eta=(1-\rho)\sigma^2_\epsilon\).
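For example, with the hypothetical values r = 0.8 and an estimated \(\sigma^2_\epsilon=5\):

    r      <- 0.8                # hypothetical intraclass correlation estimate
    s2_eps <- 5                  # hypothetical estimate of sigma_eps^2
    s2_b   <- r * s2_eps         # shared component: 4
    s2_eta <- (1 - r) * s2_eps   # unshared component: 1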

Any software that can model shared sources of variation computes either the variance components or the intraclass correlation (or both) for each feature.  One problem with this approach is that there is often not much information for estimating the shared variance.  For example, if there are 2 treatments, with 3 biological replicates per treatment and 2 technical replicates per biological replicate, then there are 12 observations but only 6 biological replicates.  All 12 observations are used to estimate the technical variation but only the 6 biological replicates are used to estimate the biological variation.  (This is not entirely obvious without going through the computations in detail.) As a result, the biological variation may be poorly estimated. 

To estimate the shared variance component, the LIMMA software does an extreme shrinkage of the intraclass correlation. This is a rather clever approach, because it assumes that each feature has its own variance, but that the ratio of the shared to total variance is similar across features. An estimate of the intraclass correlation is computed for each feature. A central summary of these intraclass correlations, r*, is computed, and all of the individual intraclass correlations are replaced by this central value. Then the variance components \(\sigma^2_b\) and \(\sigma^2_\eta\) are estimated using r* and an estimate of \(\sigma^2_\epsilon\). Because we usually have thousands of features, the SE of r* is close to zero, so it is treated as a known (error-free) quantity. A side effect of this shrinkage is a change of d.f. in the t-tests in which the variance estimates are used. This approach to estimating the variance components is supported by simulation studies using synthetic data, and generally improves the FDR and FNR of the resulting tests.
The central value r* is computed using what is called the normalizing transformation of the correlations computed from each feature. The normalizing transformation for correlations is atanh(r), where "atanh" is the inverse hyperbolic tangent, also known as Fisher's z transformation. The average of atanh(r) is taken over all the features, and then transformed back using the hyperbolic tangent. (See e.g. Wikipedia for information about the hyperbolic tangent.)
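A minimal sketch of this computation, where r_feature is a hypothetical vector of per-feature intraclass correlation estimates (LIMMA's duplicateCorrelation carries out this calculation internally, using a trimmed mean of the transformed values):

    z      <- atanh(r_feature)   # Fisher's normalizing (z) transformation
    r_star <- tanh(mean(z))      # back-transform the average to the correlation scale
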
References

[1] Krzywinski, M., & Altman, N. (2015). Points of significance: Sources of variation. Nature Methods, 12(1), 5-6. doi:10.1038/nmeth.3224
[2] Krzywinski, M., & Altman, N. (2013). Points of significance: Importance of being uncertain. Nature Methods, 10(9), 809-810.
[3] Blainey, P., Krzywinski, M., & Altman, N. (2014). Points of significance: Replication. Nature Methods, 11(8), 879-880. doi:10.1038/nmeth.3091