
Lesson 9: RNA Seq Data

Key Learning Goals for this Lesson:
  • Understanding RNA-seq data
  • Understanding how count data are modeled
  • Understanding dispersion and dispersion modeling
  • Understanding why and how RNA-seq data are normalized
  • Understanding how to set up differential expression analysis for sequencing data
  • Understanding multiple testing adjustments for count data

Introduction

High throughput sequencing technologies have greatly expanded the tools available for DNA and RNA exploration.  This chapter is focused on assessing differential gene expression by RNA-seq.  However, very similar statistical tools are available for other differential studies using sequencing.

Sequencing starts with an RNA sample from a tissue.  This may be preprocessed to enrich for certain types of RNA, such as RNA with poly-A tails.  Usually it is then converted into cDNA.  Typically there are some quality checks to ensure that there is sufficient RNA in the sample and that the RNA is not degraded.  The cDNA is then fragmented into pieces.  The length of the fragments is under loose control, in the sense that a mean length can be targeted.  In some studies there is further selection to obtain only fragments in a narrow size window.  In others, most sufficiently long fragments can be sequenced.  A sequenced fragment is called a "read".  

The basic sequencing unit is a "lane", which essentially holds one sequencing sample.  A set of lanes which are processed together is often called a "plate".  A single RNA sample may be split across multiple lanes to increase the amount of sequencing done.  This is uncommon in current RNA-seq studies, because each lane can now sequence hundreds of millions of RNA fragments, which is more than sufficient for RNA-seq, but it may be done in studies that need very high read counts.

Nucleotide fragments can be labeled by synthesizing short sequences at the end of each fragment.  These sequences are called barcodes.  The barcode is sequenced along with the rest of the fragment.  By barcoding samples, we can mix different samples in the sequencing lane and then determine which reads belong to which samples using the barcode.  Barcoding and mixing samples before loading the sequencing lane is called multiplexing.  It is frequently used to reduce sequencing costs when the sequencer produces more reads per lane than required by the study.  For example, for RNA-seq studies, the recommendation is to obtain about 25 million reads per sample.  If the sequencer can produce 200 million reads per lane, 8 samples can be multiplexed on the same lane.

Sequencing Technologies

The most current technologies are "single molecule" sequencers which can sequence very long strands of DNA; as is usual with new technologies, these are currently expensive, inaccurate, and not high enough throughput for quantitative studies.  Based on the rapid improvement in their precursors, it is likely that in a few years these shortcomings will be overcome.  Currently, these technologies are mostly used to improve our knowledge of DNA and RNA sequence.

For quantitative studies, such as gene expression, much shorter fragments are sequenced - typically around 250 bases.  Current high throughput technologies typically sequence between 50 and 250 bases per fragment (called the read length), with longer reads being proportionately more expensive.  Paired end sequencing, in which each fragment is sequenced from both ends, was popular for a while as a way to achieve greater read length.  Now that the read length can be quite comparable to the fragment length, paired end sequencing leads to sequencing the center of the fragment twice, and is not cost-effective.

Matching the reads to features

Once the RNA fragments have been sequenced, they need to be identified by matching to the features of interest.  This is called mapping.  For many organisms, there is already a reference genome (all the DNA), transcriptome (all the RNA transcripts) or other type of reference.  If there is no reference, or if you need a reference more specific to the strain you are working with, the reads can be used to create a reference.  Longer and more accurate reads assist in good mapping, and are particularly important when building a reference.  Neither building a reference nor mapping is among the topics of this course.  Typically specialized mapping software is used, and then the mapped reads are matched to genomic regions corresponding to the features of interest.

When I am working with a lab that has no expertise in mapping, I generally ask the sequencing facility to do the mapping.  Among other advantages, this means that I do not need to work with the raw data, which are large.  A typical RNA-seq read file is over 10 Gb.  Typically, to obtain the raw data, my collaborators send a hard drive to the sequencing facility - courier service is faster than uploading and downloading from the computing cloud!

After mapping to the reference, the reads are converted to counts per feature.  Typically a small proportion of the reads do not match anything, and another small proportion match 2 or more locations on the reference (which may be due to similarities within the reference or to read errors).  An expert bioinformatician can often get better results than someone following an "off-the-shelf" script by writing scripts that deal in sensible ways with these unmapped reads.  In any case, the final result is an expression matrix, usually with features in the rows and samples (or lanes, if the samples are split across lanes) in the columns.  These are the data that will go into the differential expression analysis.  You should also keep the information about how many reads did not map and how many mapped to multiple locations - this information is required to interpret your statistical results.

One note of caution!  Often sequencing facilities and labs convert the mapped reads to other units such as counts per million reads (CPM) or reads per kilobase per million mapped reads (RPKM).  These types of data are not suitable for the analyses that we will be doing here.  The error structure of the data depends on the number of reads.  It does not matter whether one gene is very short and at the same expression level produces 1/10 of the reads that a huge gene might; all that matters is how many reads you have actually counted.  This is because, as we will see in the next section, the variance of a count is related to its mean.  If you convert to reads per anything, you will not be able to recover the variance.

Another important piece of information that is lost when converting to reads per anything is the total reads for the sample.  For example, in a cancer study we thought that we had a subpopulation of cells in which only a subset of the genes were expressed.  It turned out that some of the samples produced only a few thousand reads, while others produced tens of millions.  Many of the genes that were expressed at low levels were not detected in the samples with few reads.  This was not evident when we received the data as RPKM, but was very obvious once we saw the total reads for each sample.  The variability in the total reads was due to technical difficulties, not to the cancer tissues.

From Lanes to Samples

Our units of analysis are features and RNA samples.  In many studies, sequencing lanes and samples are not the same.  Mapping identifies the features; we also need to summarize by sample.

In some studies, the RNA samples are split across several lanes.  It turns out that the error structure is preserved if we simply sum up the reads from each sample to obtain the total reads for each feature in the sample.

In some studies, the RNA samples are barcoded and multiplexed so that several samples are sequenced together.  As the reads are mapped to the reference, the barcodes also need to be read so that the reads can be assigned to samples.

When we finish the mapping and sample assignment process, we should have a data matrix of counts.  Typically each row of the matrix is a feature and each column is a sample.  The data have the form of counts \(n_{ij}\), the number of reads mapping to feature j in sample i.
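As a minimal sketch of this assembly step (the matrix laneCounts and the vector sampleID are hypothetical names; any count matrix with lanes in the columns works the same way), the technical-replicate lanes can be totalled with ordinary matrix operations in R:

    # Hypothetical example: laneCounts is a features x lanes matrix of read
    # counts, and sampleID gives the sample that each lane belongs to.
    laneCounts <- matrix(rpois(24, lambda = 50), nrow = 4,
                         dimnames = list(paste0("gene", 1:4), paste0("lane", 1:6)))
    sampleID <- c("A", "A", "B", "B", "C", "C")   # two lanes per sample

    # Sum the lane counts within each sample; the result is a
    # features x samples matrix, still on the count scale.
    sampleCounts <- t(rowsum(t(laneCounts), group = sampleID))
    sampleCounts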

Library Size

The RNA that was sequenced is called the RNA library.  With longer read lengths and more accurate sequencing, most of the reads are now mapped in most organisms.

Library size could mean one of two things: the total number of reads that were sequenced in the run or the total number of mapped reads. We will use the total number of mapped reads as the library size in our analyses.  Normalization of RNA-seq data proceeds by computing an "effective" library size, which is computed from the actual library size and the distribution of the counts.

9.1 - Models for Count Data

Models for Differential Expression in Sequencing Studies

The models that we use for differential expression for RNA-seq data are extensions of the models used for microarray data.  Microarray data are intensities.  After taking log2, they are on a continuous scale and are modeled well (within gene and treatment) by the Normal distribution.  Sequence data are counts.  Because of this they have quite different properties from microarray data.  We need to account for the properties of count data when doing statistical analysis of RNA-seq data.

The assumption is that in unit i, there is some proportion of the reads that come from feature j.  This is called \(\pi_{ij}\).  We want to know whether \(\pi_{ij}\) varies systematically with the treatment (gender, genotype, exposure or some combination of these variables).

When we take a sample from unit i, we will obtain a library for sequencing with \(n_{ij}\) mapped reads from feature j.  The total mapped reads in the library will be \(N_i\) and will vary among units.  \(N_i\) is often referred to as the library size.  Fortunately, we always know \(N_i\) and we can account for it in our analyses.  The estimated proportion of reads from feature j will be \(p_{ij}=\frac{n_{ij}}{N_i}\), which will differ somewhat from \(\pi_{ij}\).  As well, biological replicates from the same treatment will have somewhat different values of \(\pi_{ij}\) due to biological variability.

What we want to do is something similar to a t-test or ANOVA while taking into account that the data are counts.

Sources of Variability in Count Data

Count data have at least three sources of variability.

  • Poisson variability
  • biological variability
  • systematic variability

Poisson variability is the variability that we would have in counts for a single gene if we took multiple technical replicates of a single RNA library and sequenced it many times.  It is due to the fact that each RNA fragment either is or is not captured.  There is a "Law of Small Numbers" which pretty much guarantees that when selecting a small number of fragments of a certain type from a much larger pool, as long as the fragments are selected at random and independently, the number of fragments selected will have a Poisson distribution.  "A much larger pool" can generally be interpreted to mean that \(\pi_{ij}<0.01\) and \(N_i>1000\).

The Poisson distribution has a specific form that depends on \(\pi_{ij}\) and \(N_i\) for each library.  However, what affects our analysis the most is that the mean and variance of the counts for feature j from library i will both be \(N_i\pi_{ij}\).  It is this relationship between the mean count and the variation in the count that is the most important difference between continuous and count data.
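A quick simulation illustrates this mean-variance relationship (a sketch; the values of \(N_i\) and \(\pi_{ij}\) are made up for illustration):

    # Technical replicates of one library: counts for a feature with
    # N = 2e7 mapped reads and pi = 1e-4 of the reads from the feature.
    set.seed(555)
    N <- 2e7; pi_j <- 1e-4
    counts <- rpois(1000, lambda = N * pi_j)

    mean(counts)   # approximately N * pi_j = 2000
    var(counts)    # approximately equal to the mean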

How do we check the Poisson assumption?  If we took one RNA sample, split it across several lanes, and counted the reads for any feature in each lane, the variability that we would expect to see for that feature would be Poisson.  The data we will look at in the lab have samples that were sequenced in two lanes, and we will see that the Poisson assumption fits these data very well.

As in any study in which we hope to assess biological significance, we need biological replicates of units in the same treatment.  Biological variability has to do with the fact that the \(\pi_{ij}\)'s vary among the units, even in the same treatment.  For this reason, we expect "extra Poisson variability" (this is the technical term) for counts coming from different individuals for a given feature, even if we control for the library size.

The figure below is a mock-up of the read counts you might expect from a single feature if \(N_i=N\) for every library.  Each box is a biological replicate and indicates the data you might obtain from a sample split across several lanes.  However, typically you will obtain only one count per feature per sample, which is the asterisk in the box.  Notice that because \(N\pi_{ij}\) is about the same for all i, the boxes are all about the same width.  However, when you accumulate across the biological replicates you get more variability, because each sample is varying around its own \(\pi_{ij}\).  This is the extra-Poisson variability.  The mean proportion of reads from feature j in all possible samples from the treatment population is \(\pi_{\bullet j}\).

[Figure: mock-up of extra-Poisson variability across biological replicates]

In order to model the data and provide appropriate tests of differential expression we will need to model the extra Poisson variability.

Differential expression analysis is meant to assess systematic variability due to the 'treatments' (genotype, exposure, tissue type, etc.).  The goal is to obtain a p-value associated with the systematic variability.  Just as with microarray data, we express differential expression as the ratio of estimated treatment proportions, i.e. fold differences.  We assess the observed fold difference relative to the total variation in the system.

Early genomics papers using sequencing data, including SAGE tag data, often assumed that all the variability was Poisson.  This means that they did not account for all of the sources of variability, and the tests used SEs that were too small, yielding too many false positives.

Models that account for Extra-Poisson Variability

When all of the samples have the same library size (which never happens), there are two very commonly used models for the counts which account for the extra-Poisson variation.  The models assume that there is a probability model for the expected counts \(N_i\pi_{ij}\).

The two models both have \(n_{ij}\) distributed Poisson(\(N_i\pi_{ij}\)) with:

1) \(N_i\pi_{ij}\) ~ Gamma(a,b) which gives you Negative Binomial count data \(n_{ij}\). 

2) \(log(N_i\pi_{ij})\) ~ Normal(\(\mu_j, \sigma_j^2\)), which assumes that the log of \(N_i\pi_{ij}\) is Normal and \(n_{ij}\) is Poisson with mean \(N_i\pi_{ij}=e^{log(N_i\pi_{ij})}\).  This is called the Poisson-LogNormal model for count data.

Most of the popular software for doing differential expression for sequence data use one of these two models with some type of adjustment to account for the fact that the library sizes are not equal.
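To see how the Gamma mixture produces Negative Binomial counts, here is a small simulation (a sketch; the shape and rate parameters are arbitrary):

    # Gamma-Poisson mixture: draw an expected count from a Gamma, then a
    # Poisson count with that mean.  Arbitrary parameters for illustration.
    set.seed(555)
    a <- 5; b <- 0.1                          # Gamma shape and rate
    mu <- rgamma(1e5, shape = a, rate = b)    # N_i pi_ij varying biologically
    n  <- rpois(1e5, lambda = mu)             # observed counts

    mean(n); var(n)    # variance exceeds the mean: extra-Poisson variation
    # Matches a Negative Binomial with the same moments:
    var(rnbinom(1e5, size = a, mu = a / b))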

To the left is the distribution of \(N_i\pi_{ij}\) if it is Gamma(a,b), and to the right is the distribution under the Poisson-LogNormal model.

[Figure: Gamma density versus LogNormal density for \(N_i\pi_{ij}\)]

These densities are practically the same. You can see that it would be very hard to distinguish between them without extremely large sample sizes.  The software we will use assumes that the data are Negative Binomial.

Modeling the Variance of Count Data

Packages for RNA-seq differential expression analysis usually model the variance as a function of the mean expression for the gene, \(\mu_{ij}=N_i\pi_{\bullet j}\).  In the Negative Binomial model, \(Var(n_{ij})=\mu_{ij}(1+\mu_{ij}\phi_j)\), where \(\phi_j\) is the dispersion for gene j.  When the dispersion is 0, the counts have a Poisson distribution.  When \(\phi_j > 0\), the gene has extra-Poisson variation.  In theory, \(\phi_j\) can also be negative, but this is rarely the case.

The packages vary in how they model the dispersion.  Both edgeR and DESeq2 start with a comparison of the estimated values of \(Var(n_{ij})\), estimated from the biological replicates with adjustment for the unequal library sizes, versus \(N_i\hat{\pi}_{\bullet j}\), where \(\hat{\pi}_{\bullet j}\) is the estimate of \(\pi_{\bullet j}\).  The default in edgeR is to plug these values in for the true variance and mean and solve for \(\phi_j\).  These values are then shrunk towards the mean dispersion, quite similarly to the variance shrinkage in LIMMA.  The amount of shrinkage is determined by a parameter called the prior d.f., which controls the relative weight of the mean dispersion versus the genewise dispersion.  The plot below shows the effects of prior d.f. 4, 8 and 100.  The unmoderated dispersions are very spread out, which would greatly expand the y-axis of this plot, and so they are not shown.  The default value of 10 seems to work well in the analyses I have done.

[Figure: moderated dispersions for prior d.f. 4, 8 and 100]

The effect of dispersion moderation of this type is to slightly reduce the power for testing the genes with low dispersion and to greatly increase the power for testing the genes with very high dispersion.  There are usually only a few of these very high dispersion genes, and they often have an unusual expression pattern, with zero reads in many samples but high numbers of reads for biological replicates in the same treatment.  For this reason, I usually maintain a list of the high dispersion genes for a closer look.
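As a sketch of this step in edgeR (the counts matrix and group factor are placeholders; estimateDisp and its prior.df argument are the real edgeR interface):

    library(edgeR)

    # counts: features x samples matrix; group: treatment factor (placeholders)
    y <- DGEList(counts = counts, group = group)
    design <- model.matrix(~ group)

    # Estimate genewise dispersions with moderation; prior.df controls the
    # relative weight of the shared dispersion versus the genewise one.
    y <- estimateDisp(y, design, prior.df = 10)

    # Keep a list of the highest-dispersion genes for a closer look.
    highDisp <- order(y$tagwise.dispersion, decreasing = TRUE)[1:10]
    rownames(y)[highDisp]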

Another option in edgeR, and the default in DESeq2, is to estimate the variance by smoothing the variance for each gene against \(N_i\hat{\pi}_{\bullet j}\) using a method similar to loess.  (Recall that we used loess to normalize 2-channel microarrays.)  Alternatively, regression models can be fitted (the default in edgeR when complicated log-linear models are used for modeling the log(mean) and the variance).  In any case, the individual dispersion estimates are then shrunk toward the point on the curve, instead of toward the overall mean.

edgeR and DESeq2 use very similar models and somewhat similar dispersion shrinkage methods.  They each have their advocates.  The biggest difference is in how they treat the very high dispersions.  These dispersions are shrunk the most in edgeR.  DESeq2 flags these high dispersions as outliers and does not shrink them.  As a result, these genes are often declared as discoveries by edgeR but not by DESeq2.  We will try out edgeR, but this should not be interpreted as advocating for edgeR over DESeq2.

We will also try out voom, which is part of LIMMA.  voom models and shrinks the variance, rather than the dispersion.  voom does not assume that the data are Negative Binomial; instead it uses a method called quasi-likelihood, in which the regression coefficients and variance components can be estimated using weighted least squares.  edgeR and voom both come out of the Smyth lab.  Smyth seems to feel that voom is slightly better than edgeR for differential expression.  Right now, the choice among edgeR, DESeq2 and voom seems to be investigator preference.  However, there is a predecessor of DESeq2 called DESeq, which has lower power and should no longer be used.
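A minimal voom pipeline, as a sketch (these are the standard limma/edgeR calls; counts, group and design are placeholders as above):

    library(limma)
    library(edgeR)

    # TMM normalization factors first, then voom weights, then the usual
    # LIMMA linear-model and empirical-Bayes steps.
    y <- DGEList(counts = counts, group = group)
    y <- calcNormFactors(y)
    v <- voom(y, design)           # log2-CPM values plus precision weights
    fit <- eBayes(lmFit(v, design))
    topTable(fit, coef = 2)        # top genes for the treatment coefficient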

Recently a student in this class tested all 4 packages with a small RNA-seq data set and considered the genes that were found to be significantly differentially expressed.  A postdoctoral fellow in my lab subsequently tried all 4 packages on another data set with many treatments but few replicates.  In both cases, most of the "discovered" differentially expressed genes were found by all or most of the packages.  However, both investigators found that many of the genes declared significantly differentially expressed by edgeR, DESeq or DESeq2, but not by voom, had large outliers in a single sample that appeared to drive the statistical significance.  This is clearly a bad property, and so I am currently recommending voom for small datasets.  When the number of replicates is sufficiently large, it seems likely that all 4 packages will produce similar gene lists, but to date I have not worked with data sets with more than a few replicates per treatment.

Linear models for log2(expression)

voom uses a linear model for the data that is very similar to the usual LIMMA model.  Let \(Y_{ij}=log2(n_{ij})\); then

\[Y_{ij}=\beta_0+\beta_1 x_{i} +\epsilon_{ij} \]

where \(x_i\) is a variable such as an indicator for a treatment group, and \(\beta_0\) and \(\beta_1\) are the intercept and slope (which are gene specific and so should also depend on j).  However, unlike the usual LIMMA model, the variance of the errors \(\epsilon_{ij}\) depends on \(N_i\pi_{\bullet j}\).

edgeR and DESeq2 use a slightly different model, called a log-linear model:

\[n_{ij} \sim \text{Negative Binomial}(\mu_{ij},\phi_j)\]

where \(\mu_{ij}=N_i\pi_{\bullet j}\) and \(log2(\mu_{ij})=log2(N_i)+\beta_0+\beta_1 x_i\).

In this context, \(log2(N_i)\) is called the offset.  Since \(N_i\), the library size, is known, the offset is a known constant that varies for each sample.  This is an important point in the modeling, because for normalization we do not adjust the counts; instead we adjust the offsets using the "effective" library size.

In the linear model, the mean of \(log2(n_{ij})\) is modeled by the linear piece.  In the log-linear model, log2 of the mean of \(n_{ij}\) is modeled by the linear piece.

Except for this small difference, the model is set up in exactly the same way.  For example, when I am doing ANOVA, I usually use the cell means model, with an indicator variable for each treatment and no intercept.
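As a sketch, the cell means model corresponds to a design matrix with one indicator column per treatment and no intercept (standard model.matrix syntax; the group factor is a made-up example):

    # Cell means parameterization: one column per treatment, no intercept.
    group <- factor(c("ctrl", "ctrl", "trt", "trt"))
    design <- model.matrix(~ 0 + group)
    colnames(design) <- levels(group)
    design

    # A contrast then compares treatment means directly, e.g. trt - ctrl,
    # which is a log2 fold change in the log-linear model.
    contrast <- c(-1, 1)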

Power

Recall that for a t-test, our power to detect differential expression depends on the size of the difference in means relative to the SE of the difference.  For count data, when there is no extra-Poisson variation, the mean and variance are the same, \(\mu_{ij}=\sigma^2_{ij}\).  This means that the ratio \(\frac{\mu_{ij}}{\sigma_{ij}}=\sqrt{\mu_{ij}}\), which grows with the mean.  For example, a gene with mean 400 reads has \(\mu/\sigma=20\), while at mean 4 reads the ratio is only 2.  The implication of this is that for a fixed fold change F in mean expression, the power to detect a fold change not equal to 1 increases when the library size increases, as well as when the number of biological replicates increases.  In sequencing studies, we therefore have more power to detect differences in more highly expressed genes.  Some downstream analyses have been developed which take these differences into account.

We generally try to estimate the sample sizes necessary to achieve a desired power for a targeted fold change.  In sequencing studies, we might also consider the so-called sequencing depth, which can be thought of as the number of reads for the feature in the sample.  Sequencing depth can be increased by decreasing the amount of multiplexing or by increasing the number of lanes on which the sample is sequenced.  Since doubling the sequencing depth for a sample costs about the same (for labeling and sequencing) as sequencing another sample, there is a trade-off that must be considered when deciding sample sizes for the study.

9.2 - Checking Data Quality

As in all statistical analyses, it is wise to check your RNA-seq data for data quality by visualization and simple summaries before proceeding with a formal analysis.  We will use the LiverData which you looked at in the RNA-seq R Lab.  It consists of 36 sequencing lanes of read counts.  The data came from a study of gene expression in liver from male and female (M or F) humans (HS), chimpanzees (PT) and rhesus monkeys (RM).  There were 3 biological replicates for each treatment combination, and each replicate was split into 2 sequencing lanes.  20689 features (genes) were recorded.

After reading in the data, I generally check the dimension (which you did in Homework 2) and also have a look at some of the entries.  It is amazing how many errors you can prevent with these two simple steps!  RNA-seq data should be counts (whole numbers).  As we see below, that is what we have.  We also have a lot of zeroes in the data.

[Table: the first few entries of the LiverData count matrix]

For the statistical analysis, we use the counts.  However, for visualization it is convenient to use log2(counts).  To make sure that the zeroes turn up in our visualizations, we need to add 0.5 or 0.25 to every observation.  This places the zeroes at -1 or -2 respectively, but has little effect for counts above 2.

Data Visualization

The histograms below are done for each lane (column) of the matrix, with 0.25 added to the data before taking log2.  (In the data table above, the column names were truncated.  Here we have the full column name, which is run number, lane number, species, gender and biological replicate.)  We can see that for each lane there is a large preponderance of zeroes.

[Figure: histograms of log2(count + 0.25) for each of the 36 lanes]
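A sketch of how such plots can be produced (assuming the count matrix is called LiverData; the object name is hypothetical):

    # Shift by 0.25 before taking log2 so that zero counts land at -2.
    logCounts <- log2(LiverData + 0.25)

    # One histogram per lane (here just the first 6 columns).
    par(mfrow = c(2, 3))
    for (j in 1:6)
      hist(logCounts[, j], main = colnames(logCounts)[j],
           xlab = "log2(count + 0.25)")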

Even after aggregating across samples by taking row sums, there are lots of undetected genes, as seen below.  These need to be removed, since genes which are not expressed at all cannot be differentially expressed.

[Figure: histogram of log2 total counts per gene, aggregated across lanes]

As with microarrays, it can be helpful to plot each sequencing lane against every other sequencing lane.  We have done this below for the 4 lanes of data from 2 of the human males.  Even after removing genes that have zero reads in the entire study, there are lots of genes that have no reads in a given pair of lanes.  These are the black dot at (-2,-2).

[Figure: scatterplot matrix of the 4 lanes from 2 human males]

Notice how tightly the 2 lanes from the same individual follow each other.  These are technical replicates and have only Poisson variation.  Any pair of lanes from different biological replicates have extra-Poisson variation.

A cluster diagram of the samples provides a type of "reality check".  The diagram is built on the correlation of the samples.  We expect the two lanes from the same individual to be tightly clustered.  Since these are liver samples, we might not expect large gender differences.  On the other hand, we might expect species differences (especially as all the data were mapped to the human reference - chimpanzees and rhesus monkeys might have some mismatches which would lower the read counts).

[Figure: cluster dendrogram of the 36 lanes]

The cluster diagram shows a pattern that is about what we might expect.  The 2 lanes from the same individual cluster most tightly, and then the clustering is by species.  The humans cluster more closely to chimpanzees than to rhesus monkeys, as we would expect from the evolutionary history of these 3 species.
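A sketch of this reality check in base R (clustering on correlations; logCounts as in the earlier sketch):

    # Cluster the lanes using 1 - correlation as the distance.
    d <- as.dist(1 - cor(logCounts))
    plot(hclust(d, method = "average"),
         main = "Clustering of lanes by correlation")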

Numerical Summaries

The next thing to look at is library size - the total number of mapped reads in each lane.  As we can see below, this can vary considerably.  (This is from one of the early studies - library sizes are much bigger today.)  A lane that is an order of magnitude smaller than the others might indicate a technical problem, which might require another sequencing run for that lane.

[Figure: library size (total mapped reads) for each lane]

At this point we should also look at the percentage of reads from each sample which were not mapped.  This should usually be about the same for each sample.  If they are very different, then some samples may be degraded or contaminated.  Of course, in the current study, this might be species specific, because the rhesus samples might not map to the human reference genome as well as the human samples.

Typically, even highly expressed features contribute only a small percentage of the reads in each lane, but this is not always the case.  Since the log-linear model shows that we are analyzing the percentage of reads in each sample that come from each feature, features that take up a large percentage of the library artificially depress the percentage that comes from other features.

Below we have the percentage of each lane taken up by the most highly expressed feature.  (It might be a different feature in each lane, but in this case it is not.)  Notice that this one feature (Albumin) is 4-fold up in some lanes compared to others, and that this effect does not appear to be species dependent.

[Figure: percentage of each lane taken up by the most highly expressed feature]

In these samples, the most highly expressed gene is a large percentage of each sample. In fact, the nine most abundant transcripts account for 25% of the reads.  These features do not follow the assumptions needed for Poisson variation and may not follow the same distribution as the other features. 
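These numerical summaries are simple to compute (a sketch; LiverData as before):

    # Library size: total mapped reads per lane.
    libSize <- colSums(LiverData)

    # Percentage of each lane taken up by its most highly expressed feature.
    topPct <- 100 * apply(LiverData, 2, max) / libSize

    # Percentage of all reads taken up by the nine most abundant transcripts.
    topGenes <- order(rowSums(LiverData), decreasing = TRUE)[1:9]
    100 * sum(LiverData[topGenes, ]) / sum(LiverData)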

9.3 - Preprocessing and Normalization

Once we have some understanding of our data, we can prepare the data for statistical analysis.

Assembling the reads into counts for each sample 

In many studies, each lane is a sample, but that is not always the case.  When very high read counts are desirable, for example, if analysis is to be done at the exon level or if very low expressing genes are of interest, then the samples may be split across multiple lanes.  Conversely, with the ultra-high throughput methods now available, samples may be multiplexed.  A common design has both sample splitting and multiplexing - first the samples are barcoded and mixed.  Then the mixed samples are split across several lanes.

Multiplexed samples must be split into separate samples using the unique bar code attached to each read in the sample.  This produces a read count (possibly zero) for each sample from each feature in each lane. If the samples are split across lanes, then the counts for each feature need to be added across lanes for each sample.  (Note that the samples are kept separate - it is only the technical replicates that are totalled.)

Filtering Low Expressing Genes

As we discussed when looking at the analysis of tabular data, since there are only a finite number of tables with each set of margins, there is only a finite set of p-values.  For RNA-seq data, we consider tables of the form:

                   treatment 1        treatment 2
  feature j        \(n_{1j}\)         \(n_{2j}\)
  not feature j    \(N_1-n_{1j}\)     \(N_2-n_{2j}\)

If the \(n_{ij}\) are Poisson, then we can use Fisher's exact test to determine whether there are significant differences between the proportions of reads in the two treatments.  Recall that the number of possible tables with the same row and column totals is \(n_{1j}+n_{2j}+1\), so this is the maximum number of different p-values.  If the smallest possible p-value is greater than 0.05, we have no power to detect differences, as we will never reject the null.  It turns out that when the library sizes are in the thousands or more, we have no power to detect differences with fewer than 7 reads for feature j (totaled across all samples).

When we have extra-Poisson variation, Fisher's exact test detects too many false positives.  So, 7 reads total for a feature is a conservative limit on the number required to achieve statistical significance.  Of course, we are also going to apply multiple testing adjustments, so we will usually be rejecting at much smaller p-values than 0.05.
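This granularity is easy to see directly (a sketch using base R's fisher.test; the library sizes are illustrative):

    # Smallest attainable p-value for a feature with k total reads split
    # between two libraries of (illustrative) sizes N1 and N2.
    N1 <- N2 <- 1e6
    minP <- function(k) {
      p <- sapply(0:k, function(x)
        fisher.test(matrix(c(x, k - x, N1 - x, N2 - (k - x)), 2))$p.value)
      min(p)
    }
    sapply(4:8, minP)   # the minimum attainable p-value shrinks as k grows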

Since we cannot possibly detect significant differential expression when the total reads are very low for a feature, we usually remove low expressing features from the study.  I usually filter out features with fewer than 10 reads in total.  Some analysts use 25 as the boundary.  Some analysts use a cut-off based on RPKM - I do not advocate this.  If there are enough samples or very large libraries, even features with low RPKM can have enough reads for a meaningful analysis.

Removing the genes for which we have zero power to detect differential expression means that we have fewer tests that we need to consider when doing multiple testing adjustments, and so improves the power of the entire study.
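The filtering step itself is one line (a sketch; counts is the count matrix, and the threshold of 10 follows the text):

    # Keep features with at least 10 reads in total across all samples.
    keep <- rowSums(counts) >= 10
    counts <- counts[keep, ]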

Normalization

Because our analysis uses the actual counts, and the library sizes differ, we need to account for library size in our analysis.  The log-linear model does this by using the library size as an offset for each sample.

However, we can have other effects, such as a small number of highly expressed genes, that affect the read counts.  As with microarrays, one of the assumptions is that the distribution of expression should be about the same in every sample.  Normalization should achieve this.

In RNA-seq data, we typically normalize the data by creating a normalization factor.  The product of the normalization factor and the true library size is the effective library size.  If we needed to obtain adjusted counts for some type of visualization, the appropriate adjustment would be the normalization factor times the true count.  However, for statistical analysis we use the effective library size as an offset - we never adjust the counts because we would then obtain the wrong variance estimate.

The simplest normalization method is to compute some summary of the data, pick a central value of the summary, and then compute the ratio of each summary to the central value.  That ratio is the normalization factor.  Because of the high skewness of the counts, we often use a quantile of the distribution.  Using the 75th percentile (25% of the counts are higher, 75% lower) often works well.  However, a slightly better method is the TMM method [1], which is available in edgeR.  TMM appears to work well when we expect that most features are not differentially expressed.  It attempts to minimize the number of genes that appear to be differentially expressed between any two samples.  Even when I use DESeq2 or voom, I still start by using edgeR to compute the normalization factors using TMM.
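A sketch of this computation (calcNormFactors is the real edgeR call; counts and group are placeholders):

    library(edgeR)

    y <- DGEList(counts = counts, group = group)
    y <- calcNormFactors(y, method = "TMM")

    # Effective library size = actual library size x normalization factor;
    # it enters the model as an offset - the counts themselves are untouched.
    y$samples$lib.size * y$samples$norm.factors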

Once you are at this point, you are ready to do the differential expression analysis.  This is covered in the R lab.

References

[1] Robinson, M.D. and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25. 

9.4 - Multiple Testing with Count Data

Multiple Testing

Another problem with RNA-seq data has to do with multiple testing. Multiple testing adjustments all assume that if the null hypothesis is true then the p-values will be uniform. If you recall, we discussed this earlier.

This does not happen when you have count data unless you have many, many samples or very high counts.  Then things seem to average out, based on the central limit theorem, and look more continuous.

Here are the p-values for all of the genes from a study of gene expression in liver and kidney, using Fisher's exact test to compare two samples.  The weird peaks are due to the low expressing genes, which have only a few possible p-values.  The peak at p=1.0 is due to the fact that no matter how many total reads there are for a feature, there is a finite probability of obtaining p=1.0, so the p-values pile up at this point - especially if there are many tests with very few reads and not much differential expression.

[Figure: histogram of Fisher's exact test p-values for all genes]

As discussed in the section on preprocessing, we should filter out (remove) genes with fewer than 10 reads in total.  Below are the p-values after this filtering.  We still have a peak at p=1.0 (and this is really at p=1.0, not the bin from 0.95 to 1.0) for the comparison of two liver samples, in which there is no differential expression.  There is also a small peak at p=1.0 for the comparison of liver and kidney samples, but it is obscured because there is a lot of differential expression.  Note that with continuous data, such as microarray intensity data, there should never be a peak near p=1.0, and p-values should essentially never be exactly 1.0 (although this could occur due to round-off error).

[Figure: p-value histograms after filtering, liver vs. liver and liver vs. kidney]

[Figure: simulated p-values for continuous (microarray-like) data]

Here are some simulated data to show this again.  The first plot shows simulated data which mimic what you might obtain in a microarray study.  The data are continuous and were analysed using a two-sample t-test.  The blue p-values came from features that were simulated to have no difference in means, i.e. they were all null.  The red came from features simulated to have a difference in means.

The blue p-values are not completely flat, although statistical theory states that they should be.  This is stochastic variability - the slight ups and downs will be in different places in different samples, and will flatten out if there are lots of features.  Most of the red p-values are piled up near p = 0, where they should be, as this was a test with high power.  There is a thin layer spread across the remaining bins, because even with high power there is some probability of a large p-value.

[Figure: simulated p-values for count (RNA-seq) data]

This is a similar plot for simulated RNA-seq data.  The blue p-values, from the features with no difference in means, do not look the least bit uniform.  Most of the peak at p=1.0 resulted from features that had two or three reads; the others came from features with slightly more.  The dip near p=1.0 is due to the fact that when there are very few total reads, there are no p-values between 0.9 and 1.0.  As in the microarray data, the features that were simulated to have differential expression mostly have very small p-values.  However, even when the null hypothesis is false, a fair number of the tests give p=1.0.  These mostly come from features with very few reads.
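A sketch of such a simulation for the null features (two Poisson samples per feature, compared with Fisher's exact test; all parameters are made up for illustration):

    # Simulate null count data with low means and look at the p-values.
    set.seed(555)
    mu <- rexp(5000, rate = 1/2)      # low expected counts, many near zero
    x1 <- rpois(5000, mu); x2 <- rpois(5000, mu)
    N1 <- N2 <- 1e6                   # library sizes

    pvals <- mapply(function(a, b)
      fisher.test(matrix(c(a, b, N1 - a, N2 - b), 2))$p.value, x1, x2)

    hist(pvals, breaks = 20)   # spiky, with a large pile-up at p = 1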


The bottom line is that multiple testing has problems if you don't throw out the features with low counts.  One reason is that there is not enough information to reject the null; the other is this weird shape of the histogram of p-values.  Although there are a few papers suggesting different ways to fix the p-value histograms for use with multiple testing adjustments, work with my students strongly suggests that after filtering out the features with very few counts, the standard methods such as BH and Storey's method work fine for FDR adjustment.
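As a final sketch (base R; pvals would be the vector of p-values after filtering out the low-count features):

    # Benjamini-Hochberg FDR adjustment on the remaining p-values.
    fdr <- p.adjust(pvals, method = "BH")
    sum(fdr < 0.05)   # number of discoveries at FDR 0.05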

