1.11 - Microarrays versus Sequencing


One question that frequently arises is whether, with the cost of sequencing falling so rapidly, it is worth learning the analysis of microarray data. Microarrays are clearly not competitive for many applications, such as discovering transcribed but untranslated regions of the genome or finding anything de novo.  A microarray can measure only features that are complementary to the probes printed on the array, which must therefore be determined in advance.  However, I think that microarrays will continue to be used for measuring nucleic acids at least over a short time horizon, especially for genotyping with SNPs.  As well, microarray technology is now being used for other types of "omics" data, such as measuring proteins.

The two main advantages of microarrays are cost and ease of data handling.  Microarrays are now cheap to create and inexpensive on a per-sample basis.  More importantly, a data summary is produced for each probe on the array, rather than for each DNA or RNA fragment in the sample, which creates a tremendous savings in data management - from about 25 million reads per sample to about 50 thousand probe intensities for gene expression.  As well, it is not necessary to map the results, which requires bioinformatics skills.  If you have a limited budget or limited bioinformatics personnel, microarrays will likely give you more data for less cost if you are doing standard gene expression studies or genotyping.

Moreover, from a statistical point of view, microarray data analysis is worth learning because the statistical models behind it are more basic than the models for sequencing data.  The analysis of sequencing data is a generalization of the analysis of microarray data, so if you understand the analysis of microarray data, it is easier to understand how to analyze sequencing data.

Microarray data are intensities from laser reflectance that are treated as continuous data. We usually assume that the logarithm of the intensity is approximately normally distributed, which greatly simplifies statistical analysis. With sequencing data, the data are the number of reads mapped to each feature. Count data have their own particular set of statistical properties that are not shared by intensity data.
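The difference can be seen in a small simulation.  The sketch below (pure Python, with made-up illustrative numbers) draws log-intensities from a normal distribution and read counts from a Poisson distribution: the log-intensities have a constant spread regardless of their mean, while the variance of the counts tracks the mean.

```python
import math
import random
import statistics

random.seed(0)
N = 10_000

# Microarray-style data: log2 intensities are approximately normal.
log_intensity = [random.gauss(8.0, 1.0) for _ in range(N)]
print(f"log2 intensity: mean={statistics.mean(log_intensity):.2f}, "
      f"sd={statistics.stdev(log_intensity):.2f}")

def poisson(lam):
    """Draw one Poisson variate (Knuth's method; fine for moderate lam)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p < L:
            return k
        k += 1

# Sequencing-style data: counts whose variance grows with the mean.
for lam in (5, 50):
    counts = [poisson(lam) for _ in range(N)]
    print(f"counts (mean {lam}): mean={statistics.mean(counts):.1f}, "
          f"variance={statistics.pvariance(counts):.1f}")
```

For the counts, the sample variance comes out close to the sample mean at every expression level, which is exactly the property that intensity data do not share and that motivates different statistical models for sequencing data.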

Sequencing data can be more informative than microarray data because whatever can be sequenced can be found.  So, if you have some way of pulling down part of the RNA or DNA, or anything else, you can find something new.  On the other hand, the data can be difficult to work with.  Even if there is a good reference sequence, mapping is computer-intensive and requires expertise.  For example, in a recent cancer study in which I was involved, routine settings of the mapping software mapped only about 50% of the reads, even though these were human samples, which have the best reference data.  Working with a collaborator who has expertise in this area, we were able to map over 80% of the reads and to discover a number of novel transcript variants.

If there is no reference, a reference can be built using the sequenced samples, but this is even more computer-intensive and requires a different set of skills.  Once the reference has been built, the reads from each sample need to be mapped against it.  Considerable computational resources are required for these efforts, not the least of which is data storage.

Bias

Each of the technologies has its own biases.  The biases in microarray data come from the selection of the probes on the array, which may be incorrect or may match multiple features.  As well, the probes are developed using a reference, which may not have the same genotype as the samples.  Probes bind most tightly to exactly matching complements, so differences in genotype can affect the intensity measure.

Sequencing data start with millions of reads, which need to be mapped to the reference to be allocated to features.  As with microarray data, there may be errors due to sequencing errors and possibly errors in the reference (especially for recently built references).  Even if there are no errors, the sample that you are sequencing is not identical to the reference due to differences in genotype.  As well, there are always some reads that do not appear to match anything (which is always a mystery!).  These days, with a good reference, it is common to have 90% or more of the reads map to features of the reference.

One difficulty in mapping is handling reads that map to multiple features.  Genes come in families with similar sequences - just how similar depends on the evolutionary history of the family.  As well, some organisms (including humans) have many small repeats such as transposable elements.  Mapping software deals with reads that map to multiple features in different ways - e.g. they may not be mapped to any feature, they may be mapped to the first feature they match, or they may be assigned at random among the features they match.  It is important to understand how your reads were mapped when you interpret the results of your analysis.
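These three policies can be sketched in a few lines of code.  The function below is a hypothetical illustration, not the interface of any real aligner, and the gene names are made up to resemble a paralogous pair; it simply shows how each policy resolves (or discards) an ambiguous read.

```python
import random

def assign_multimapped(read_hits, policy="discard", rng=None):
    """read_hits: features a read maps to; return one feature, or None to drop the read."""
    if len(read_hits) == 1:
        return read_hits[0]                 # unique hit: no ambiguity
    if policy == "discard":                 # do not map ambiguous reads at all
        return None
    if policy == "first":                   # keep the first feature matched
        return read_hits[0]
    if policy == "random":                  # allocate at random among candidates
        return (rng or random).choice(read_hits)
    raise ValueError(f"unknown policy: {policy}")

hits = ["GENE_A", "GENE_A_PARALOG"]         # hypothetical gene-family pair
print(assign_multimapped(hits, "discard"))  # None
print(assign_multimapped(hits, "first"))    # GENE_A
```

Note how the choice of policy changes the per-feature counts downstream: "discard" undercounts gene families, "first" biases counts toward whichever family member is listed first, and "random" spreads the ambiguity across the family.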

Another bias introduced during the mapping step is quite unexpected - the mapping software might discard features that have very high mapping rates, under the assumption that they are transposable elements or other features not usually considered interesting for the downstream analyses.  As a result, when mapping very large libraries for RNA-seq analysis, some very highly expressed genes might be discarded.

It can be helpful to visualize where the reads are mapping on the features.  A genome browser is a visualization tool that allows us both to see where the reads fall with respect to feature boundaries and to compare them with characteristics of the genome - perhaps reads from other samples, or GC content, percentage of repeats, etc.  The example below compares gene expression in brain and muscle tissue with the annotation of the exons and isoforms for this gene.  Each line of the graphic is called a track.  The bottom of the image displays the region that is uniquely mappable, so that you can compare where reads do or do not show up with where the exons fall.


RNA-seq data from several tissues versus known exons and "mappability". Notice that most of the reads are in exonic regions (rectangles in the MYH7 lines), but the brain sample has reads in an intron. 

 

If there is no reference genome or transcriptome, reads can be assembled to create 'contigs'.

    reads         CCTGATTCAT        TT--GATAATG          ACGTGTAC
                AGCCT--ATT           TAGAT--ATGG           GTGTACCAT
                   CTGATTCATTA      TTAGATAA            CACGTGT

    contigs     AGCCTGATTCATTA      TTAGATAATGG         CACGTGTACCAT

Here the reads are overlapped by looking at matching pieces. The '-' indicates that the alignment software placed a gap there so that the read would match another string, or it might indicate a position with such a low quality score that the base was removed. These 'contigs' become your reference.

Assembly might be done even if there is a reference, to correct errors in the reference or to investigate genomic or transcriptomic variants. Assembly is improved with longer reads, i.e. the longer the pieces you have, the better your chance of matching things up correctly.
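The overlap-and-merge idea can be sketched as a toy greedy assembler.  Real assemblers use graph-based methods and handle sequencing errors, gaps, and quality scores; this illustration assumes error-free, gap-free reads and simply merges the pair with the largest suffix-prefix overlap until no overlaps remain.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (>= min_len, else 0)."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a.endswith(b[:k]):
            best = k
    return best

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:   # no overlaps left: remaining reads are separate contigs
            break
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Gap-free reads like those in the third column of the example above:
print(greedy_assemble(["ACGTGTAC", "GTGTACCAT", "CACGTGT"]))  # ['CACGTGTACCAT']
```

Even this toy version shows why longer reads help: longer reads produce longer, less ambiguous overlaps, so the greedy merges are more likely to be correct.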

Now that we have so much sequencing capacity in a single run, we often run multiple samples in the same run.  This is called multiplexing.  To identify the nucleotides from the individual samples, a short identification sequence of C, G, A, T called a barcode is added during the sample preparation.  Each fragment from the same sample should have the same barcode, so after sequencing the reads can be sorted into samples.  It is quite common to run 2 to 8 samples in a lane, and currently up to 4⁴ = 256 barcodes can be used with some sequencing technologies.
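Sorting the reads back into samples (demultiplexing) is conceptually simple, as the sketch below shows.  The barcodes and sample names here are made up, and real pipelines also tolerate sequencing errors in the barcode (e.g. allowing one mismatch), which this sketch does not.

```python
from collections import defaultdict

# Hypothetical 4-base barcodes (4^4 = 256 are possible); real kits publish their own lists.
SAMPLE_BARCODES = {"ACGT": "sample_1", "TTAG": "sample_2"}
BARCODE_LEN = 4

def demultiplex(reads):
    """Sort reads into per-sample bins by their leading barcode, stripping the barcode off."""
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = SAMPLE_BARCODES.get(barcode, "unassigned")
        bins[sample].append(insert)
    return dict(bins)

reads = ["ACGTGGCATT", "TTAGCCGTA", "ACGTTTTT", "GGGGAAAA"]
bins = demultiplex(reads)
print(bins["sample_1"])    # ['GGCATT', 'TTTT']
print(bins["unassigned"])  # ['AAAA']
```

Reads whose barcode matches no sample land in an "unassigned" bin; in practice that bin is inspected, since a large unassigned fraction usually signals a sample-preparation or barcode-list problem.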

Multiplexing is particularly convenient if you are trying to assemble a transcriptome.  You can extract RNA from several tissue types that might have different gene expression, barcode them, and run them together, so that you are doing the assembly from a very rich set of RNAs.  You can cover most of the gene expression and at the same time get tissue-specific information.

Massively parallel sequencing is also useful when you have mixed samples such as virus-infected cells or entire microbiomes, such as the intestinal organisms or soil samples.  Everything in the sample can be sequenced.  The mapping programs are then used to assign the reads to their organisms of origin.