9.3 - Preprocessing and Normalization

Printer-friendly versionPrinter-friendly version

Once we have some understanding of our data, we can prepare the data for statistical analysis.

Assembling the reads into counts for each sample 

In many studies, each lane is a sample, but that is not always the case.  When very high read counts are desirable, for example, if analysis is to be done at the exon level or if very low expressing genes are of interest, then the samples may be split across multiple lanes.  Conversely, with the ultra-high throughput methods now available, samples may be multiplexed.  A common design has both sample splitting and multiplexing - first the samples are barcoded and mixed.  Then the mixed samples are split across several lanes.

Multiplexed samples must be split into separate samples using the unique bar code attached to each read in the sample.  This produces a read count (possibly zero) for each sample from each feature in each lane. If the samples are split across lanes, then the counts for each feature need to be added across lanes for each sample.  (Note that the samples are kept separate - it is only the technical replicates that are totalled.)

Filtering Low Expressing Genes

As we discussed when looking at the analysis of tabular data, since there are only a finite number of tables with each set of margins, there is only a finite set of p-values.  For RNA-seq data, we consider tables of the form:

  treatment 1 treatment 2
feature j \(n_{1j}\) \(n_{2j}\)
not feature j \(N_1 -n_{1j}\) \(N_2-n_{2j}\)

If the \(n_{ij}\) are Poisson then we can use Fisher's exact test to determine whether there are significant differences between the proportion of reads in the two treatments.  Recall that the number of possible tables with the same row and column totals is \(n_{1j}+n_{2j}+1\), so this is the maximum number of different p-values.  If the smallest possible p-value is greater than 0.05, we have no power to detect differences as we will never reject the null.  It turns out that when the library sizes are in the thousands or more, we have no power to detect differences with fewer than 7 reads for feature j (totaled across all samples).  

When we have extra-Poisson variation, Fisher's exact test detects too many false positives.  So, 7 reads total for a feature is a conservative limit on the number required to achieve statistical significance.  Of course, we are also going to apply multiple testing adjustments, so we will usually be rejecting at much smaller p-values than p<0.05.  

Since we cannot possibly detect significant differential expression when the total reads are very low for a feature, we usually remove low expressing features from the study.  I usually filter out features with fewer than 10 reads in total.  Some analysts use 25 as the boundary.  Some analysts use a cut-off based on RPKM - I do not advocate this.  If there are enough samples or very large libraries, even features with low RPKM can have enough reads to do a meaningful analysis.

Removing the genes for which we have zero power to detect differential expression means that we have fewer tests that we need to consider when doing multiple testing adjustments, and so improves the power of the entire study.

Normalization

Because our analysis uses the actual counts, and the library sizes differ, we need to account for library size in our analysis.  The log-linear model does this by using the library size as an offset for each sample.

However, we can have other effects, such as a small number of highly expressing genes, that affects the read counts.  As in microarrays, one of the assumptions is that the distribution of expression should be about the same in every sample.  Normalization should achieve this.

In RNA-seq data, we typically normalize the data by creating a normalization factor.  The product of the normalization factor and the true library size is the effective library size.  If we needed to obtain adjusted counts for some type of visualization, the appropriate adjustment would be the normalization factor times the true count.  However, for statistical analysis we use the effective library size as an offset - we never adjust the counts because we would then obtain the wrong variance estimate.

The simplest normalization method is to compute some summary of the data, pick a central value of the summary, and then compute the ratio of all the summaries to the central value.  That ratio is the normalization factor.  Because of the high skewness of the counts, often we use a quantile of the distribution.  Using the 75th quantile (25% of the counts are higher, 75% lower) often works well.  However, a slightly better method is the TMM method [1] which is available in edgeR.  TMM  appears to work well when we expect that most features do not differentially express.  It attempts to minimize the number of genes that appear to differentially express between any two samples.   Even when I use DESeq2  or  voom I still start by using edgeR to compute the normalization factors using TMM

Once you are at this point, you are ready to do the differential expression analysis.  This is covered in the R lab.

References

[1] Robinson, M.D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25.