# Lesson 9: RNA Seq Data

Printer-friendly version
 Key Learning Goals for this Lesson: Understanding RNA-seq data Understanding how count data are modeled Understanding dispersion and dispersion modeling Understanding why and how RNA-seq data are normalized Understanding how to set up differential expression analysis for sequencing data Understanding multiple testing adjustments for count data

### Introduction

High throughput sequencing technologies have greatly expanded the tools available for DNA and RNA exploration.  This chapter is focused on assessing differential gene expression by RNA-seq.  However, very similar statistical tools are available for other differential studies using sequencing.

Sequencing starts with an RNA sample from a tissue.  This may be preprocessed to enrich for certain types of RNA, such as RNA with poly-A tails.  Usually it is then converted into cDNA.  Typically there are some quality checks to ensure that there is sufficient RNA in the sample and that the RNA is not degraded.  The cDNA is then fragmented into pieces.  The length of the fragments is under loose control, in the sense that a mean length can be targeted.  In some studies there is further selection to obtain only fragments in a narrow size window.  In others, most sufficiently long fragments can be sequenced.  A sequenced fragment is called a "read".

The basic sequencing unit is a "lane" which essentially holds one sequencing sample.  A set of lanes which are processed together is often called a "plate". A single RNA sample may be split across multiple lanes to increase the amount of sequencing done.  This is uncommon in current RNA-seq studies, because each lane can now sequence 100's of millions of RNA fragments, which is more than sufficient for RNA-seq, but it may be done in studies that need very high read counts.

Nucleotide fragments can be labeled by synthesizing short sequences at the end of each fragment.  These sequences are called bar codes.  The bar code is sequenced along with the rest of the fragment.  By barcoding samples, we can mix different samples in the sequencing lane and then determine which reads belong to which samples using the bar code.  Barcoding and mixing samples before loading the sequencing lane is called multiplexing.  It is frequently used to reduce sequencing costs when the sequencer produces more reads per lane then required by the study.  For example, for RNA-seq studies, the recommendation is to obtain about 25 million reads per sample.  If the sequencer can produce 200 million reads per lane, 8 samples can be multiplexed on the same lane.

Sequencing Technologies

The most current technologies are "single molecule" sequencers which can sequence very long strands of DNA; as is usual with new technologies these are currently expensive, inaccurate, and not high enough throughput for quantitative studies.  Based on the rapid improvement in their precursors, it is likely that in a few years these short-comings will be overcome.  Currently, these technologies are mostly used to improve our knowledge of DNA and RNA sequence.

For quantitative studies, such as gene expression, much shorter fragments are sequenced - typically around 250 bases.  Current high throughput technologies typically sequence between 50 and 250 bases per fragment (called the read length) with longer reads being proportionately more expensive.  Paired end sequencing, in which each fragment was sequenced from both ends, was popular for a while to achieve greater read length.  Now that the read length can be quite comparable to the fragment length, paired end sequencing leads to sequencing the center of the read twice, and is not cost-effective.

Once the RNA fragments have been sequenced, they need to be identified by matching to the features of interest. This is called mapping.  For many organisms, there is already a reference genome (all the DNA), transcriptome (all the RNA transcripts) or other type of reference.  If there is no reference, or if you need a reference more specific to the strain you are working with, the reads can be used to create a reference.  Longer and more accurate reads assist in good mapping, and are particularly important when building a reference.  Neither building a reference nor mapping are among the topics of this course.  Typically specialized mapping software is used, and then the mapped reads are matched to genomic regions corresponding to the features of interest.

When I am working with a lab that has no expertise in mapping, I generally ask the sequencing facility to do the mapping.  Among other advantages, this means that I do not need to work with the raw data, which are large.  A typical RNA-seq read file is over 10 Gb.   Typically to obtain the raw data, my collaborators send a hard drive to the sequencing facility - courier service is faster than up and down-loading from the computing Cloud!

After mapping to the reference, the reads are converted to counts per feature.  Typically a small proportion of the reads do not match anything and another small proportion match 2 or more locations on the reference (which may be to similarities in the reference or read errors).  An expert bioinformatician can often get better results than someone following "off-the-shelf" script by writing scripts that deal in sensible ways with these unmapped reads.  In any case, the final result is an expression matrix, usually with features in the rows and samples (or lanes is the samples are split across lanes) in the columns.  These are the data that will go into the differential expression analysis.  You should also keep the information about how many reads did not map and how many mapped to multiple locations - this information is required to interpret your statistical results.

One note of caution! Often sequencing facilities and labs convert the mapped to other units such as counts per million reads (CPM) or counts per kilo bases per gene lane per million (RPKM). These types of data are not suitable for the types of analysis that will be doing here. The error structure of the data depends on on the number of reads. It doesn't matter whether one gene is very, very small and at the same expression level it produces 1/10 of the reads that a huge gene might, all that matters is how many reads you've actually counted. This is because, as we will see in the next section, the variance of a count is related to its mean.  If you convert to reads per anything you will not be able to recover the variance.

Another important piece of information that is lost when converting to reads per anything is the total reads for the sample.  For example, in a cancer study we thought that we had a subpopulation of cells in which only a subset of the genes expressed.  It turned out that some of the samples produced only a few thousand reads, while others produced 10s of millions.  Many of the genes that expressed lowly in the samples were not detected in the samples with few reads.  This was not evident when we received the data as RPKM, but was very obvious once we saw the total reads for each sample.  The variability in the total reads was due to technical difficulties, not due to the cancer tissues.

From Lanes to Samples

Our units of analysis are features and RNA samples.  In many studies, sequencing lanes and samples are not the same.  Mapping identifies the features.  We also need to summarize by sample

In some studies, the RNA samples are split across several lanes.  It turns out that the error structure is preserved if we simply sum up the reads from each sample to obtain the total reads for each feature in the sample.

In some studies, the RNA samples are barcoded and multiplexed so that several samples are sequenced together.  As the reads are mapped to the reference, the bar codes need to be read so that they can also be assigned to samples.

When we finish the mapping and sample assignment process, we should have a data matrix of counts.  Typically each row of the matrix is a feature and each column is a sample.  The data has the form of a count $n_{ij}$ the number of reads mapping to feature i in sample j

Library Size

The RNA that was sequenced is called the RNA library.  With longer read lengths and more accurate sequencing, these days in most organisms, most of the reads are mapped.

Library size could mean one of two things: the total number of reads that were sequenced in the run or the total number of mapped reads. We will use the total number of mapped reads as the library size in our analyses.  Normalization of RNA-seq data proceeds by computing an "effective" library size, which is computed from the actual library size and the distribution of the counts.