STAT 555

Lesson 1: Introduction to Cell Biology

Key Learning Goals for this Lesson:
  • Review cell biology of transcription
  • Review main technologies for measuring nucleic acids
  • Introduce some of the main research questions we will address
  • Introduce some of the methods for targeted measurement

First, let's review the basics so that everybody has the same basic background in biology.

"Genomics" refers to the study of (almost) all the genes in the cell simultaneously.  The meaning of "gene" has changed dramatically over the past 20 years.  Here we use it to mean any biological relevant piece of DNA in an organism regardless of whether it is directly involved in the protein-building mechanism. 

The field of genomics is moving extremely fast. Biology has gone from being 'data poor' to being very 'data rich' practically overnight. Four main technologies are driving this. On the biological side we have microarrays and sequencing; on the informatics side we have sophisticated computational tools and the ability to share data via the Internet.

This course will focus on the statistical analysis of microarray and sequencing data.  Our primary computational tools will be R and Bioconductor.

Scientists using these tools have many different objectives. They might want to characterize an organism: to know which genes are expressed or not expressed, and to understand the pathways through which expression is regulated. Or they might want to understand a particular process, such as the immune response or the development of a seed into a full-grown plant. They may want to understand a disease better. Some are interested in biological variation within a group of organisms, others in how organisms evolve. And we can also use genomics tools to characterize a sample of mixed organisms (metagenomics), such as the micro-organisms living in the human gut (the microbiome).

Here is a basic outline of what we will cover in the class.

In terms of biology we need to ask the questions "what are we measuring?" and, "why are we measuring?"

In terms of technology we will take a look at how we are measuring in order to understand the sources of bias and variance, essentially the noise involved.

In terms of statistics this course will cover differential expression and related "differential" analyses.

Throughout, we will be emphasizing reproducible research: using the same data, could another researcher reproduce our analysis? Using a similar technology and population of organisms, could another researcher reproduce our biological inferences?  For this we will need to understand how to design our studies, do valid statistical analyses and document our results.

1.1 - Cell Biology

The Genome

We will be looking at some tools that are used for measuring DNA and RNA - collectively known as nucleic acids. There are multiple objectives of performing these measurements. These include:

  • characterizing organisms
  • understanding particular processes (e.g. tumor growth)
  • understanding development
  • understanding diseases
  • inferring evolutionary history
  • characterizing a sample of mixed organisms

The genome is the set of all the DNA in the organism. This usually refers only to what is in the nucleus. Most cells have other places where DNA is stored, such as the mitochondria, and plant cells have chloroplasts, which also contain DNA. These other organelles are not always included when people talk about the genome; most often, when scientists are interested in them, they will reference them specifically.

The DNA is a double helix, a sort of twisted ladder where the rungs are the "base pairs". These base pairs consist of two bound nucleotides which are designated C, G, A, T. C binds only to G and A binds only to T. This is why the cell has memory - you can split the DNA apart by breaking the rungs and reassemble it because of these matches. In a diploid population, most cells have two copies of each chromosome, and so of each gene.
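
Because the pairing rules are fixed, the sequence of one strand determines the sequence of the other. As a rough illustration (a hypothetical helper in base R, not part of the course software), the partner strand can be computed by swapping each base for its complement and reversing the result:

    # Complement each base (A<->T, C<->G), then reverse, since the two
    # strands run in opposite directions.
    revcomp <- function(x) {
      complemented <- chartr("ACGT", "TGCA", x)
      paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
    }
    revcomp("GGAAAGAAGA")   # returns "TCTTCTTTCC"

Bioconductor's Biostrings package provides the same operation via reverseComplement().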

The basic model states that every cell in a eukaryotic organism (excepting reproductive tissues) has the same genetic material, stored in a double helix.  Nowadays we know that this is not necessarily true. For example, cancer cells often have mutations that cause them to differ from normal tissue and from each other.  And there are other examples where this might not be the case, such as chimeric embryos.

We are often interested in why cells differ. When the genetic material is identical, many differences depend on what DNA is active. Gene expression studies look at which genes are active and how active. Other studies look at the mechanism of how genes are activated or inactivated - i.e. gene expression regulation.  When the genetic material may not be identical, differences in the DNA itself will be of interest, particularly when looking at cancer versus normal tissue or differences between individuals.  In those cases, we might look for differences in the DNA sequence (genomic variants) or for gene duplications or deletions (copy number variations).

Some of the fundamental problems or questions that scientists ask include:

  • What is the sequence of DNA?
  • Which genes are active, where, when and how are they activated?
  • How do changes in DNA or gene activation affect the organism?

1.2 - What Can We Measure?

These days, given enough time and money, we can measure just about anything about nucleic acids: their sequence, their abundance, what is binding to them and where, how tightly they are coiled, how they interact with cell components, and so on. 

When measuring DNA, we might measure both strands simultaneously, or each of the strands singly.  Although the sequence information is redundant due to complementarity, coding is done from the single strands, and so coding information is strand specific. The two strands of the chromosome are polar, so they have a coding direction. Some genes are on one strand and some genes are on the other, and they can overlap. The genes on sister strands will code in opposite directions.  Between the genes is an area called the intergenic region.  Originally it was thought that the intergenic regions were "junk", but increasingly it has been found that these regions may encode small functional units or be involved in gene expression regulation.  Biologists often measure some of the small functional RNAs, such as silencing RNAs, and are also interested in regulatory regions.

[Figure: double-stranded chromosome]

Transcription is the process of going from the DNA (the storage molecule) to RNA (the biologically active molecule).  The region of the chromosome upstream (before the start) of the genes is called the promoter region. Proteins called transcription factors bind to this region and get transcription started. Transcription factor binding sites (and other protein binding sites) are often of interest.  The set of all transcripts is called the transcriptome.  Gene expression analysis usually involves measurement of the transcriptome or the protein-coding part of the transcriptome.

[Figure: parts of the gene]

In eukaryotes, many genes are made up of short contiguous chunks on the chromosome. The exons are the parts that contain the protein-encoding bases (the codons), and in between them are spacers called introns that carry regulatory mechanisms. The regulatory mechanisms determine which exons are used to create transcripts.  The gene also has start and stop sites to direct where transcription begins and where it ends. The exons can be put together in various ways to create different proteins, often called isoforms. One way to think of the exons is like syllables (in English) or like characters (in Chinese).  They can be combined to make many words.

The set of all exons is called the exome.  The exome usually contains only a few percent of the total bases in the DNA.  Recent technology has enabled biologists to sequence just the exome, particularly when looking for genomic variants.  However, this may miss important parts of the genome, both because exons may not be recognized and because the regulatory regions are excluded.

An important type of gene regulation is called methylation.  It is the addition of a methyl group to what is called a CpG site, a location on the chromosome where C and G nucleotides are adjacent.  Methylation is what is called an epigenetic factor - it is a direct but reversible chemical change in the DNA. Various environmental and evolutionary processes can create methylation, and some of this methylation is passed on to offspring. There are a number of interesting biological processes that seem to be regulated by methylation such as stress reactions. Smoking, drinking and other drug-like substances can change methylation patterns.  The location of methylation sites and their methylation state is another genomic feature of interest to biologists.
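
To make the idea of a CpG site concrete, here is a small sketch in R (the sequence is invented) that locates every position where a C is immediately followed by a G:

    dna <- "ACGTTACGGACGCG"            # a made-up stretch of sequence
    cpg <- gregexpr("CG", dna)[[1]]    # match positions for "CG"
    as.integer(cpg)                    # 2 7 11 13 - the CpG sites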

1.3 - Measurement

Modern technologies such as microarrays and sequencing allow us to measure most features of the DNA and RNA.  For example, using sequencing technologies we can sequence strands of DNA or RNA.  Alternatively, we can use microarrays to bind to predetermined sequences.  We are limited only by our ability to retrieve the nucleic acids of interest from the tissue.  For example, we can sequence pieces of chromosomes or transcripts.  We can enrich the sample for exons, and sequence or bind primarily exons.  We can allow proteins to bind to the DNA and enrich for the sites at which they are attached.  We can find the methylation sites.  Because nucleic acids are fairly simple chemically, the main impediment to what we can measure is finding a way to enrich the samples for the molecules of interest (or fragments of these molecules).

However, no measurement system is perfect.  Although the instrumentation is continually improving, there are many types of error that can be introduced.  Good study designs attempt to minimize these, and to quantify those that remain.

Measurement error - This term is used for the non-reproducible noise introduced during measurement.  It may be due to sample preparation, instrumentation or other problems.

Bias - A statistician refers to bias when the measurement is systematically (and reproducibly) wrong.   For example, some genomic regions are more difficult to retrieve than others, so we might get under-representation of these regions in every sample. For many of our measurements, we need to fragment the DNA - if certain regions are weaker than others and tend to break preferentially, there may be over-representation of those regions.  Bias can occur at many stages in the study - for example, methods for handling the organisms might induce stress reactions, minor differences in how different investigators harvest tissue might induce gene expression differences, and so on.

Contamination - Sample contamination may introduce DNA or RNA from another organism into the data. 

Mapping problems  - Modern sequencing data are often mapped to a reference to determine what is in the sample.  A number of problems can arise.  Since the reference for an organism is developed from a limited number of samples, the current sample may have genomic variants that interfere with the match.  Sample preparation and sequencing can introduce errors that interfere with the match.  Similarities among regions of the DNA can create ambiguities in where the match occurs. 

Platform specific biases - These are biases that are due to  the way that we do the measurement.  For example, using microarrays, we can only detect features that have a match on the array.  Using sequencing, we may not be able to measure repetitive items in the genome or transcriptome.

1.4 - Research Questions of Interest

Disciplines that are now using nucleic acid measurement technologies range from basic biology to medicine to environmental science to biotechnology.  Each of these will use the basic data to answer different questions.  Below are some of the fundamental questions.

Questions about DNA:

  • What is the sequence? (This is basic).
  • How is the sequence organized on the chromosome? What is adjacent to what along the strand of DNA? Are there adjacency effects due to 3-D coiling, i.e. features that are spatially adjacent even though they are not adjacent along the chromosome?
  • What are the molecules that bind to the DNA and where do they bind? How does this vary with genomic variations?
  • We want to know about individual sequence variations in a population, such as changes at single loci (SNPs) and larger mutations such as deletions, insertions, duplications, etc., and whether they are associated with phenotypes such as disease susceptibility, growth and other traits of interest.
  • We want to know how species vary and evolve.  This is of interest not only to evolutionary biologists, but also for understanding phenotypic variation.
  • We want to know about communities of microorganisms (microbiomes), which may be important for reasons such as human health (e.g. the gut and skin microbiomes), environmental health (e.g. healthy and unhealthy mixes of soil micro-organisms) and industrial processes (e.g. clean-up of contaminated sites, biofuel production). We would like to associate the community composition with measures of health (of the organism or of the environment).

Some very popular analyses include looking at genetic markers, which are readily detected genomic variants. These include single nucleotide polymorphisms (SNPs) and microsatellites. A SNP is a change in a single base in the sequence. A microsatellite is sometimes called a small repeat - a sequence of two, three or four nucleotides repeated several times in adjacent locations. SNPs are very easy to measure and vary over time scales of thousands of years.  Microsatellites are very popular for marking disease associations because they are easy to measure and the number of repeats can vary quite a bit over populations in short time periods compared to other kinds of mutations.

Genome Wide Association Studies (GWAS) involve associating these markers on a genome-wide (or at least single chromosome) scale with phenotypes or traits of interest such as disease stage, tendency to become overweight, having resistance to a pathogen, and so on.

In looking at how organisms of the same species differ, and also how organisms on the same evolutionary branch differ, biologists may also look at genomic duplications and rearrangements. Copy number variation occurs when duplicate copies of a gene (or segment of DNA) get added to the chromosome. Alternatively, a copy gets deleted, and then the individual won't have that gene or express that protein.  Studies of copy number variation are quite common when looking at cancer, which often seems to involve large numbers of chromosome defects.  Duplications of genes, segments of DNA or even the entire genome are thought to be involved in evolutionary processes.

[Figure: copy number variation - gene duplication or deletion]

Another type of genomic rearrangement is an inversion: the chromosome is cut and a segment of DNA is reinserted backwards.

[Figure: inversion]

A number of molecules bind to the DNA and regulate transcription in various ways.  Transcription factors are proteins that bind to promoter regions and regulate gene expression.  Small RNAs may also bind to regions of the chromosome and have a role in gene regulation.   Methylation involves changes in the DNA molecule but can be measured in a very similar way.  Molecular markers are used to bind to the molecules of interest or the methylated sites.  The DNA is fragmented.  The fragments bound to the markers are retrieved and the other fragments are washed away, enriching the sample of nucleotides for the bound fragments.   The binding is reversed and the markers are washed away.  What remains are fragments of DNA which can be measured using microarrays or sequencing tools.

Questions about RNA:

  • What RNAs are in the sample (particularly for studies of small RNAs)?
  • What isoforms are in the sample?
  • How much of each transcript is in the sample?
  • Are certain alleles (or maternally or paternally derived alleles) more likely to create transcripts?

Typically questions about RNA involve quantification, as well as identification.  We discuss this more in the next section on gene expression.

1.5 - Gene Expression

The first half of this course will focus on gene expression and its measurement using microarrays and sequencing.

As mentioned earlier, the chromosomes are storage devices made out of DNA (except in a few organisms, which use RNA).  The active molecule is RNA, which is created using the DNA as a template.  Some RNAs have their own function.  Some RNAs, called messenger RNAs (mRNA), are used as templates for protein production.  mRNAs are called coding RNAs.  The functional non-coding RNAs are called ncRNAs.

Transcription is the process by which the DNA template is used to create a complementary RNA.  Many RNAs then go through post-processing steps such as folding for ncRNAs or splicing (removal of introns) for mRNAs.  

Transcription begins with proteins called transcription factors binding to the promoter region upstream of the gene.  There are complex chemical processes involving things like the coiling structure of the DNA and methylation that enhance or dampen the process.  Once transcription starts, the DNA unzips and RNA is matched using one strand of the DNA as a template.  After post-transcriptional processing, relatively stable transcripts are created.  Because of the simple chemical structure of DNA and RNA, and with the use of an important biochemical process called reverse transcription, it is relatively simple to determine the sequence of the DNA and RNA and to quantify the transcripts.  Microarrays and high-throughput (massively parallel, next generation) sequencing are both methods which capture, identify and quantify pieces of DNA.

Here is a very simplified picture of the transcription process.

[Figure: the transcription process]

Transcription starts when a transcription factor finds the promoter region, and this initiates the process of transcribing each base in the DNA to a matching RNA base. The result is called the pre-RNA because it is not in its final form.  Transcription always proceeds from the 5' end of the gene towards the 3' end. (A codon is a set of three bases. For coding RNAs, each codon either signals the start or stop of translation or matches a single amino acid, a basic building block of a protein.)
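
As a small illustration of codons, the following R snippet (with an invented coding sequence) splits a sequence into consecutive base triplets; ATG is the usual start signal and TAA is one of the stop signals:

    cds <- "ATGGCTGAAAGCTAA"                     # invented coding sequence
    starts <- seq(1, nchar(cds), by = 3)
    codons <- substring(cds, starts, starts + 2)
    codons                                       # "ATG" "GCT" "GAA" "AGC" "TAA"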

[Figure: transcription direction]

For coding RNA, the pre-RNA contains both the introns and the exons. In post-transcriptional modification, the introns and possibly some exons are excised and the mRNA is created. The exons can be put together in many different combinations to form different isoforms (also called splice variants). For most coding RNAs, a string of "A" nucleotides, called the "poly-A tail", is attached to the 3' end.

[Figure: gene expression]

A gene that has introns can make several different transcripts based on which exons are used.  Different splice variants of a single gene may encode different proteins or have different regulatory regions that determine how they are used in the cell.  For example, in humans, different isoforms of the haemoglobin protein are expressed during fetal development and in adulthood.  Expression of the wrong isoform at a particular developmental stage or in a particular tissue can lead to disease.

[Figure: splice variants]

The poly-A tail is useful to biologists because it can be used to chemically retrieve transcripts by matching the tail.  Since a lot of RNA fragments not of interest may be in a sample, the ability to retrieve only the interesting transcripts is very important.  On the other hand, use of the poly-A tail to retrieve transcripts means that transcripts without poly-A tails, including some mRNAs, may be systematically missing from the samples.

ncRNAs also come from post-transcriptional processing, which may include dicing the pre-RNA to obtain the shorter functional fragments, and possibly folding or other structural changes.  ncRNAs do not have a poly-A tail.

For mRNA, the next biological step involves the transcript being turned into a protein. This is called translation.  In many cases we would prefer to measure the proteins in the cell, but they are much more chemically complex and hence harder to quantify.  Once they are quantified, however, the statistical methods used for their analysis are very similar to the methods for analysis of nucleic acid data.

Questions that researchers ask about RNA include:

  • What genes are expressing RNA in a tissue?
  • What splice variants are being expressed in a tissue?
  • What differences are there in RNA expression between conditions (e.g. different tissues or under different treatments)?
  • What biological process (e.g. methylation, protein or ncRNA binding) is turning the genes on and off?

Many of these items will be covered in more depth in this course, especially the statistical analyses for these procedures. 

1.6 - Some Important Technology for Characterizing RNA and DNA

A number of important technologies have made it possible to measure nucleic acids.  The most important of these are outlined below.

Polymerase Chain Reaction (PCR)

Measuring DNA or RNA requires several steps to isolate the molecules from the organism.  Once we have a DNA or RNA sample, there are several requirements to obtain good measurements: the molecules must be stable enough to be maintained throughout the measurement process, the molecules must be present in quantities above detection level, and for some measurement devices, molecules must be tagged with a label.  PCR provides all of these requirements.

In the cell, DNA is synthesized during cell division using enzymes which attach to the unzipped DNA and attach matching nucleotides to each base from the 5' to 3' end of the DNA molecule.  Mimicking this in the lab was a very slow process until the development of the polymerase chain reaction. Modern use of PCR started in 1983 with the process devised by Mullis. Advances in this technology have led to more efficient and accurate duplication (amplification).

Basically, the DNA reproduction enzymes are attached to locations on the DNA using primers, short segments of DNA which complement the piece that you want to replicate.  Each cycle of PCR duplicates the segment of DNA, so there is an exponential growth in the quantity of DNA in the sample.  For whole genome work, a large set of primers is used so that every region of the DNA can be duplicated in each cycle.  During PCR, the nucleotides in the newly synthesized DNA can be labeled so that after several cycles of duplication most of the DNA is labeled.

Duplication proceeds from the 5' to the 3' end of the molecule.  Since the binding of the duplication mechanism is not perfect, the length of the molecule that can be accurately and completely duplicated is limited.  A common result of sample amplification is loss of sequence at the ends of the molecule.  This is one of the reasons that shot-gun methods are used - the DNA is fragmented into shorter pieces that can be duplicated.   Over time, reagents and processes have been refined to increase the length of fragments that can be duplicated.

The duplication process is not perfect.  As amplification proceeds, errors may be introduced into the synthesized DNA.  The more amplification is done, the more errors are introduced.  For this reason there is a trade-off between starting from very small samples of DNA (such as the DNA from a single cell or a small region such as a tissue interface) which need to be amplified so that the DNA can be detected by the measurement instrument, and starting from a larger sample that comes from a less homogeneous biological sample.

Because the number of duplications is controlled during the PCR reaction, PCR can be used to quantify the amount of DNA of a specific type in a sample using a method called real-time or quantitative PCR (qPCR).  A primer is used to select a single region of DNA.  At each PCR cycle, the amount of label detected in the sample is recorded.  Since the signal should grow exponentially, it can be extrapolated back to the zero-th cycle to assess the amount of DNA in the original sample - even if the label is below detection limits for the first few cycles.  This method is often used as the gold standard for quantification.  These days it is pretty cheap if a primer is already available.  Note however that it is a "one feature at a time" method, as opposed to the methodologies we use in bioinformatics (although it is usually performed in sets of 96 wells).
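
Here is a sketch of the extrapolation idea with simulated numbers (an illustration, not a lab protocol). If each cycle multiplies the labeled product by an efficiency E, the signal at cycle c is roughly F0 * E^c, so a line fitted to the log signal can be extrapolated back to cycle 0:

    E  <- 1.9        # assumed amplification efficiency (2 = perfect doubling)
    F0 <- 1e-6       # true starting quantity, treated as unknown
    cycles <- 20:30  # cycles where the label is above the detection limit
    signal <- F0 * E^cycles * exp(rnorm(length(cycles), sd = 0.05))

    fit <- lm(log(signal) ~ cycles)  # log-linear fit to the growth phase
    exp(coef(fit)[1])                # extrapolated estimate of F0 at cycle 0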

PCR acts only on DNA.  We want to measure DNA when we are sequencing the genome, looking for genomic variants such as SNPs and microsatellites or looking at DNA modifications such as protein binding and methylation.  However, for gene expression, ncRNA and isoform detection we need to measure the RNA in the cell.

1.7 - Reverse Transcription - PCR

In 1970, two investigators, Temin and Baltimore, independently discovered reverse transcriptase, an enzyme used by retroviruses.  Retroviruses store their genetic information in RNA, but reproduce by reverse transcribing their RNA into DNA, which is then inserted into the chromosome of the infected cell.  When the infected cell replicates, so does the retrovirus.  Temin and Baltimore shared a Nobel prize in 1975 for discovering the enzyme that controls reverse transcription.

Although some new technologies can measure RNA molecules directly, most RNA measurements are taken by extracting RNA from the tissue sample and doing a cycle of reverse transcription PCR (RT-PCR) to synthesize complementary DNA (cDNA). Then ordinary PCR can be applied to the sample.  This is handy for a number of reasons: cDNA is more stable than RNA, and primers can be used to select the RNAs of interest (often mRNA with poly-A tails, precursors of ncRNAs, or ribosomal RNA for metagenomics).  Most importantly, any chemical process or measurement device that can be used with DNA can be used with cDNA, and so with RNA.

[Figure: DNA in the cell and then in the test tube]

 

1.8 - Chromatin Immunoprecipitation

[Figure: the chromatin immunoprecipitation process]

Another important technology is chromatin immunoprecipitation, which is used to measure proteins bound to the DNA.  We would like to know where the binding takes place because it might tell us something about how genes are activated or suppressed (1).

The bonds between the proteins and the DNA are quite unstable so the first step is to stabilize them using a chemical, usually formaldehyde (2). This makes the bonds more permanent so that the proteins remain bound during labeling and other procedures. The proteins of interest are labeled using a target antibody (3). Now we have a complex combination of antibody, protein, the binding agent and the DNA.

The DNA is then fragmented (4).  The fragments including the antibody are captured while the other fragments are washed away.  Ideally  all the remaining fragments include the tag and the bound protein, although in practice some tagged fragments are lost and some untagged fragments remain (5).  As a result, our DNA sample is enriched for tagged fragments, but still has a background of other fragments.

Finally, the bond between the protein and the DNA is broken, releasing the DNA. The DNA can then be measured using standard technologies. (6).

Very similar methods can also be used for other chemical modifications to the DNA such as methylation - tag, fragment, wash away untagged fragments, and then sequence what is left.

1.9 - Microarrays

A microarray is a substrate to which millions of single-stranded (c)DNA probes, complementary to the (c)DNA you wish to detect, are attached. Microarrays come in two flavors: either the substrate is a small plastic or glass slide, like a microscope slide, or it is a plastic bead.  There can be from a few thousand to a million probes, each consisting of thousands of single strands with the same sequence.

[Figure: microarray]

A labeled sample of DNA or cDNA is allowed to hybridize or attach to the probes. Each fragment should bind only to a complementary probe.  After hybridization, the substrate is washed which removes any material that didn't bind to a probe. Each probe on the microarray should be bound to a sample of the targeted complementary DNA.  The quantity of bound (c)DNA is expected to be roughly proportional to the amount of that type of (c)DNA in the labeled sample.

[Figure: microarray]

A dye intensity for each probe is summarized by a scanning microscope which essentially takes a photo of the microarray at the wavelength of the label.  The raw data are the intensity of reflectance of the label at each pixel. This intensity is expected to be proportional to the amount of material in the labeled sample.

The most fundamental data is a digitized photo of the array giving the label intensity.

[Figure: digitized photo of light intensities of a microarray]

This is a digitized and color-enhanced photo of an older array. You can tell that it is older because the spots on this image are not uniform in intensity due to the technology used to create the probes. On the newer microarrays used since 2005 the probes are printed on the array surface using print technology and appear absolutely perfect at this scale.  

Microarrays come in many formats.  The microarray pictured is sometimes called a "spotted" microarray, due to the original printing technology.  Each spot represents a cluster of probes of the same oligonucleotide (fragment of DNA).  Usually the entire cluster is referred to as a probe.  Either one sample (single channel) or two samples with different labels (two channels) are hybridized to this type of microarray.  There may be multiple probes for a single gene, sometimes actual duplicates and sometimes different oligonucleotide sequences  (oligos) from the same gene.  Often several identical microarrays are printed on a single glass or plastic slide, with a barrier around each to keep the samples from mingling.  When an entire experiment can be run on a single slide, uniformity of the hybridization conditions is assured.

Another older format that is still in use is the Affymetrix microarray.  Affymetrix was the first manufacturer to synthesize the probes on the array surface, allowing very accurate probe synthesis and close spacing.  Each gene is represented by one or more "probe sets".  A probe set is a collection of probes made up of 25-mer oligos selected from the DNA sequence.  The older arrays had both "perfect match" probes, which exactly match the reference genome (at least the reference when the array was first designed), and "mismatch" probes, which are paired with the "perfect match" probes but differ in the central nucleotide.  The "mismatch" probes were supposed to assist with background correction and correction for cross-hybridization; however, as evidence mounted that this strategy did not work well, newer Affymetrix microarrays have only "perfect match" probes.  Usually a gene is represented by 9 to 11 "perfect match" probes, with the number of probes per probe set constant in any given microarray.  Probes in the probe set are chosen for uniformity of hybridization and uniqueness.  On many microarrays, most of the probes are selected from the 3' end of the gene.  Cartoons showing the construction of a probe set can be found at the Affymetrix website.  Affymetrix microarrays are still very popular, especially for genotyping using a set of known SNPs.  This is because the "perfect match"/"mismatch" pair is readily replaced by the 2 SNP variants at a locus.

Another microarray technology is the bead array.  We will not be discussing bead arrays in this course, but you can read about them at https://www.ncbi.nlm.nih.gov/probe/docs/techbeadarray/ [1].

The analysis technology for microarrays, from the probe intensity summary of the image pixels through the analysis of gene expression or genomic variants, is very mature.  Different preprocessing methods are required for each type of array to achieve a summary.  For gene expression microarrays, this might include probe location detection, summarizing the pixels in the area defining the probe, background correction, and assembling the probe intensities into gene intensities.

 Here is an example of what the data might look like for a spotted two-channel array. This is an intermediate probe summary produced by the scanner.

[Figure: table of probe intensity data]

The table is a rectangular array. It has row and column numbers, the probe name, X and Y coordinates for the center of each probe, and the diameter of the probe. Then the intensities are reported for each probe. Each pixel of the photo represents an intensity.  Since each probe consists of many pixels, summaries such as the median, mean and standard deviation are given so that you have choices for analysis.  On modern microarrays the median and the mean are practically the same because the spots are accurate and uniform.  However, if you use historical data, they might be quite different.  There are also summaries of the background local to each probe.

 The probes are designed to detect various features by selecting parts of the reference genome or transcriptome to match. For this reason, microarrays are quite flexible. You can use them for gene or exon expression, SNP detection, ChIP, methylation and other features of interest.  

Microarrays have some pros and cons. They are an older technology, so we know in great detail how to process microarrays, both in the lab and statistically. They are relatively cheap.  Data storage is minimal - you can store the outcome of your entire experiment on your keychain flash drive.

However, microarrays are species or genotype specific, although the same array can sometimes be used for closely related species, e.g. humans and chimpanzees. A serious downside is that you have to know in advance what you're looking for, because microarrays require known sequences to be used as probes. You can't capture anything for which you don't have a probe. In addition, genetic variation affects how intensely the complementary DNA hybridizes.

Sequencing technologies by contrast can detect unknown sequences.  However, you will not know what you have found unless you have a reference sequence or have enough sequences to create a de novo reference.  This sounds contradictory, but is not.  For example, since we have sequenced almost all of the human genome, previously unknown protein binding sites, methylation sites, etc can be identified by sequencing and their locations on the genome identified by matching to the reference.  

A strategy that is now commonly used for species with few genomic resources is to do a small amount of transcriptome sequencing and use it to create microarray probes.  Annotation can be challenging, but once highly expressed or differentially expressed probes are identified, lower throughput technologies can be used to identify the genes.

As well as DNA microarrays, there are now microarray technologies for proteins.  Instead of DNA probes, the microarray is printed with "capture probes" which bind to the proteins of interest.  The probe intensity is then measured, yielding a set of intensities that are similar to DNA microarray data.  While preprocessing of the arrays may differ from DNA arrays, the statistical analysis of the processed data is quite similar.

1.10 - Massively Parallel Sequencing

Around 2005, new methods for very fast, accurate DNA sequencing became available. This means that we can also sequence cDNAs. These next-generation technologies can read the genomic sequences of millions of fragments of DNA.  Interestingly enough, you don't need to know anything about the organisms, and a priori sequence information is not required.   The most recent technologies allow sequencing of RNA directly, without conversion to cDNA.  Although these methods are currently not as high throughput as DNA sequencing, technological advances are very rapid in this area, increasing throughput, accuracy and the lengths of the fragments that can be sequenced.

The cost of sequencing has come down considerably, although sample preparation is still somewhat expensive compared to microarray sample prep. However, the biggest problem is transporting and storing the raw sequence data. Sequencing data sets run anywhere from 10 to 15 Gb. Some of these data sets are so large that it is faster and cheaper to load up an external hard drive with terabytes of data and ship it than it would be to rely on downloading this over the Internet.  Long-term storage of the data is also problematic.

The newest technologies will give you 1 - 250 million fragments (reads) per sample. Shorter fragments are cheaper per read, but are also less informative. Most of the technologies allow a variety of read lengths from about 50 to a few hundred nucleotides.  In general, technologies that allow longer reads are more expensive per nucleotide than those that have shorter maximum read length.  

Here are some raw data from Marioni et al., 2008.  The reads were oligos of length 36 (36-mers).  The output is stored in FASTA format or FASTQ (which also includes quality scores).

GGAAAGAAGACCCTGTTGGGATTGACTATAGGCTGG
GGAATTTAAATTTTAAGAGGACACAACAATTTAGCC
GGGCATAATAAGGAGATGAGATGATATCATTTAAG
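
For comparison, a FASTQ record for the first read above might look like the following (the read name and quality line are invented for illustration); the fourth line encodes a quality score for each base:

    @read_1
    GGAAAGAAGACCCTGTTGGGATTGACTATAGGCTGG
    +
    IIIIIIIIIIHHHHHHHHGGGGGGFFFFEEEDDCCB

In R, files in this format can be read with the Bioconductor ShortRead package (readFastq()).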

When these data were collected, read lengths of 50 bases were considered long.  Now, the reads are longer.  100 bases is usually adequate for gene expression in sequenced organisms which have a good reference genome or transcriptome.  Much longer reads are desirable for assembling a genome or transcriptome or for complex genomes which have duplications such as tetraploids.

The strings of characters need to be identified.  When there is a reference genome or transcriptome, this is usually done by mapping - i.e. matching the reads to the reference sequence (or its complement).  When there is no reference, the reads are matched to one another to build a de novo assembly, which is then used as the reference.

Sequencing is very good at the following tasks with DNA:

  •  "de novo" sequencing
  •  metagenomics i.e. sequencing mixed communities of multiple species
  •  resequencing (detecting biological variation in a population)
  •  SNP detection and quantification (detecting biological variation in a population)
  •  protein binding site discovery and quantifying binding
  •  methylation site discovery and quantifying methylation
  • chromatin interaction (detecting regions of chromosomes that interact)
  • and anything else that you can think of that measures a piece of DNA, including creating microarrays.

Sequencing is also very good at the following tasks with RNA:

  •  quantifying gene expression
  •  quantifying exon expression
  •  quantifying non-coding RNA expression
  •  isoform discovery
  •  quantifying isoform expression
  •  microarray probe construction

Fragmenting the (c)DNA can introduce biases.  If you use a certain enzyme, it might preferentially chop up the molecules at certain sites; since the sequencer starts at one end, all the reads from such a fragment contain the same bias. If you use physical force (e.g. sonication) to fragment the molecules, then there may be weak sites which preferentially break. You might think that you have higher expression or higher methylation at these loci, when in fact you see more of them because this is where the molecules break.  This is not the main focus of what this class intends to cover, but you need to be aware of this problem.   The newer technologies that do not fragment the RNA or DNA have fewer biases.

Right now there are two choices for short-read sequencing technology: single end and paired end. In single end sequencing, the sequence of the fragment is determined starting from either the 3' or 5' end of the fragment. Paired end technology sequences from both ends, possibly leaving an unsequenced link in the center.  When read lengths were short, paired end sequencing was a good option to obtain longer, more informative fragments.  Now that read lengths can be quite long, there may be no unsequenced link - in fact, the center of the fragment might be sequenced from both ends.  For this reason, paired end sequencing is becoming less popular.  As they become less expensive, whole molecule sequencing technologies will likely supersede short-read technologies.

Reads are usually not useful unless we can identify them in some way.  Both for mapping and assembly, longer reads are easier to use.  Firstly, with longer reads, the matching software can be instructed to tolerate more mismatches.  Mismatches occur due to heterozygosity, differences between the reference and the samples, PCR errors and sequencing errors.  Secondly, most genomes include repeat elements and gene duplications which create highly similar sequences - longer reads are less likely to match multiple locations on the reference or to create a false assembly by joining fragments from different genomic regions.
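
To see what "tolerating mismatches" means, here is a toy sketch in R that slides a read along a reference and counts mismatches at each position (real mappers use indexed data structures, not this brute-force scan):

    ref  <- "TTAGATAATGGCACGTGTACCAT"   # toy reference
    read <- "GATAATGG"                  # toy read
    n <- nchar(read)
    mm <- sapply(1:(nchar(ref) - n + 1), function(i) {
      window <- substring(ref, i, i + n - 1)
      sum(strsplit(window, "")[[1]] != strsplit(read, "")[[1]])
    })
    which(mm <= 1)   # positions matching with at most one mismatch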

Statistically, for most of the analyses we do, we won't care about any of this. We will be talking about the data AFTER mapping.  However, when you are planning a study, you need to carefully determine the appropriate sequencing length and single/paired end technology to ensure sufficient high quality data for your purposes.

The methods of sequencing differ among the 4 or 5 different types of technologies in use, but they mostly have a similar flavor. The fragments are captured. PCR is used to form a cluster from each fragment, and the sequencer starts reading from the end of the cluster. A chemical process is used to expose and tag the nucleotide at the end of the fragment, and then the entire plate is captured as a photo at the tag frequency. (The raw data are therefore 4 photos at each location on the fragment.)  The exposed base is then stripped off and a new base is exposed.

[Figure: laser process]

(Image used with permission from the American Journal of Clinical Pathology)

If the read length is 100, there will be  100 × 4 photos from this sample.  This is because at each step a fluorescent label is added that is specific to one of G, C, A or T.  Each photo therefore has a bright spot at the currently labelled clusters.  In a high quality "base call" only one of the 4 nucleotides is detected in a cluster in a particular read position.  In the graphic below you will see peaks of fluorescence for each of the different bases for several clusters. The peaks shown below are all high quality, because only one nucleotide is fluorescing at each location.  But recall that each cluster is made up of thousands of identical strands.  As the process proceeds, errors may occur because the reagents are degrading or because the already sequenced nucleotides are not detaching from the strand.  As a result, the clusters may gain signal for several different nucleotides, which would show up as peaks for several nucleotides, instead of one.  The quality score summarizes this.  Usually the quality score decreases for nucleotide locations further down the strand.  When the score is too low we often trim the reads by truncating to a smaller number of nucleotides per read.  

[Figure: sanger sequencing]

A vector of quality scores for all four bases would add considerably to the size of the file. Usually all that is recorded is the most likely nucleotide and the quality score for that nucleotide.  Recall that the typical size of the raw data file for a single sample is about 10 Gb.  If the data for all 4 nucleotides were retained, that would increase to 40 Gb.  It is often less expensive to resequence the sample than to store all the data.
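
Quality scores are usually Phred scores, Q = -10 log10(p), where p is the probability that the base call is wrong, stored as one character per base. Here is a minimal decoding sketch in R, assuming the common Phred+33 ASCII encoding (the quality string is invented):

    qual <- "IIIHHGGFFEDC!"       # invented quality line
    Q <- utf8ToInt(qual) - 33     # Phred scores under Phred+33 encoding
    p_error <- 10^(-Q / 10)       # implied per-base error probabilities
    round(p_error, 4)             # '!' encodes Q = 0, i.e. p = 1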

1.11 - Microarrays versus Sequencing

One question that frequently arises is whether, with the cost of sequencing falling so rapidly, it is worth learning the analysis of microarray data. Microarrays are clearly not competitive for many applications, such as discovering transcribed but untranslated regions of the genome or finding anything de novo.  A microarray can measure only features that are complementary to the probes printed on the array, which are determined in advance.   However, I think that microarrays will continue to be used for measuring nucleic acids at least over the short time horizon, especially for genotyping with SNPs.  As well, microarray technology is now being used for other types of "omics" data, such as measuring proteins.

The two main advantages of microarrays are cost and the ease of handling the data.  Microarrays are now cheap to create and inexpensive on a per sample basis.  More importantly, a data summary is produced for each probe on the array, rather than for each DNA or RNA fragment in the sample, which creates a tremendous savings in data management - from about 25 million reads per sample to about 50 thousand probe intensities for gene expression.  As well, it is not necessary to map the results, which requires bioinformatics skills.  If you have a limited budget or limited bioinformatics personnel, microarrays will likely give you more data for less cost if you are doing standard gene expression studies or genotyping.

Moreover, from a statistical point of view, microarray data analysis is worth learning because the statistical models behind the analysis are more basic than the models for sequence data, and the analysis of sequencing data is a generalization of the analysis of microarray data. So, if you understand the analysis of microarray data, it is easier to understand how to do the analysis for sequencing data.

Microarray data are intensities from laser reflectance that are treated as continuous data. We usually assume that the logarithm of intensity is approximately normally distributed, which greatly simplifies statistical analysis. With sequencing data, the data are the number of reads mapped to each feature. Count data have their own particular set of statistical properties that are not shared by intensity data.
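
A quick simulated contrast (an illustration only, not a recommended analysis): intensities often look roughly normal after a log transform, while read counts are typically overdispersed relative to the Poisson, which is one reason negative binomial models are popular for sequencing data:

    set.seed(1)
    intensity <- rlnorm(1000, meanlog = 8, sdlog = 1)  # log-normal intensities
    hist(log2(intensity))               # roughly bell-shaped on the log scale

    counts <- rnbinom(1000, mu = 20, size = 5)   # overdispersed read counts
    c(mean = mean(counts), var = var(counts))    # variance well above the mean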

Sequencing data can be more informative than microarray data because whatever can be sequenced can be found.  So, if you have some way of pulling down part of the RNA or DNA, or anything else, you can find something new.  On the other hand, the data can be difficult to work with.  Even if there is a good reference sequence, mapping is computer-intensive and requires expertise.  For example, in a recent cancer study in which I was involved, routine settings of the mapping software mapped only about 50% of the reads, even though these were human samples, which have the best reference data.  Working with a collaborator who has expertise in this area, we were able to map over 80% of the reads as well as discovering a number of novel transcript variants.  

If there is no reference, a reference can be built using the sequenced samples, but this is even more computer-intensive and requires a different set of skills.  Once the reference has been built, the reads from each sample need to be mapped against it.  Considerable computational resources are required for these efforts, not the least of which is data storage.

Bias

Each of the technologies has its own biases.  The biases in microarray data come from the selection of the probes on the array, which may be incorrect or may match multiple features.  As well, the probes are developed using a reference, which may not be the same genotype as the samples. Probes bind most tightly to exactly matching complements, so differences in genotype can affect the intensity measure.

Sequencing data start with millions of reads, which need to be mapped to the reference for allocation to features. As with microarray data, there may be errors due to sequencing errors and possibly errors in the reference (especially for recently built references).  Even if there are no errors, the sample that you are sequencing is not identical to the reference due to differences in genotype.  As well, there are always some reads which do not appear to match anything (which is always a mystery!).  These days, with a good reference, it is common to have 90% or more of the reads map to features of the reference.

One difficulty in mapping is handling reads that map to multiple features.  Genes come in families which have similar sequence - just how similar depends on the evolutionary history of the family.  As well, some organisms (including humans) have multiple small repeats such as transposable elements.  Mapping software deals with reads that map to multiple features in different ways - e.g. they may not be mapped to any feature, they may be mapped to the first feature they match, or they may be assigned at random among the features they match.  It is important to understand how your reads were mapped when you want to interpret the results of your analysis.

Another bias introduced during the mapping step is quite unexpected - the mapping software might discard features that have very high mapping rates, under the assumption that they are transposable elements or other features not usually considered interesting for the downstream analyses.  As a result, when mapping very large libraries for RNA-seq analysis, some very highly expressed genes might be discarded.

It can be helpful to visualize where the reads are mapping on the features.  A genome browser is a visualization tool that allows us both to see where the reads are falling with respect to feature boundaries, and to compare with characteristics of the genome - perhaps reads from other samples, or GC content, percentage of repeats, etc.   The example below compares gene expression in brain and muscle tissue with the annotation of the exons and isoforms for this gene.  Each line of the graphic is called a track.  The bottom of the image displays the area that is uniquely mappable so that you can interpret this, comparing where reads show up or do not show up compared to where the exons fall.


[Figure: RNA-seq data from several tissues versus known exons and "mappability". Most of the reads are in exonic regions (rectangles in the MYH7 lines), but the brain sample has reads in an intron.]

 

If there is no reference genome or transcriptome, reads can be assembled to create 'contigs':

    reads           CCTGATTCAT           TT--GATAATG         ACGTGTAC
                   AGCCT--ATT                    TAGAT--ATGG           GTGTACCAT
                           CTGATTCATTA      TTAGATAA           CACGTGT
 
contigs       AGCCTGATTCATTA       TTAGATAATGG    CACGTGTACCAT

Here the reads are overlapped by looking at matching pieces. The - indicates that the matching software placed a gap there so that the read would match with another string, or it might indicate a location with such a low quality score that it was taken out. These 'contigs' become your reference.

Assembly might be done even if there is a reference, to correct errors in the reference or to investigate genomic or transcriptomic variants. Assembly is improved with longer reads: the longer the pieces that you have, the better your chance of being able to match things up correctly.

Now that we have so much sequencing capacity in a single run, we often run multiple samples in the same run.  This is called multiplexing.  To identify the nucleotides from the individual samples, a short identification sequence of C, G, A, T called a barcode is added during the sample preparation.  Each fragment from the same sample should have the same barcode, so after sequencing the reads can be sorted into samples.  It is quite common to run 2 to 8 samples in a lane, and currently up to 4^4 = 256 barcodes can be used with some sequencing technologies.
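
A toy sketch of demultiplexing in R (barcodes and reads invented): each read starts with a 4-base barcode, which is matched to a sample table, stripped off, and used to sort the reads:

    reads <- c("ACGTGGAAAGAAGACC", "TTAGGGAATTTAAATT", "ACGTGGGCATAATAAG")
    barcodes <- c(sample1 = "ACGT", sample2 = "TTAG")

    sample_id <- names(barcodes)[match(substr(reads, 1, 4), barcodes)]
    split(substring(reads, 5), sample_id)   # strip barcodes, group by sample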

Multiplexing is particularly convenient if you are trying to assemble a transcriptome.   You can extract RNA from several tissue types that might have different gene expression, barcode the samples and then run them together, so that you are doing the assembly from a very rich set of RNAs. You can cover most of the gene expression, and at the same time get tissue specific information.

Massively parallel sequencing is also useful when you have mixed samples such as virus-infected cells or entire microbiomes, such as the intestinal organisms or soil samples.  Everything in the sample can be sequenced.  The mapping programs are then used to assign the reads to their organisms of origin.

1.12 - Final Thoughts

The Internet

Although we tend to take the internet for granted these days, the "omics" revolution could not have taken place without it.  The internet has made it possible for scientists to develop and share software and to store massive amounts of data, including the reference sequences, annotation, and the scientific journal articles we all use to help us design studies and interpret the outcomes.  What is more, much of the software and information is freely available, including:

  • documentation
  • documentation tools
  • sequence matching tools
  • statistical analysis tools
  • visualization tools
  • tools to organize the tools!

However, just like other information on the internet, the quality of what you can find is uneven.  It is best to use data and software from trusted sites.

In the US, any "omics" project funded from federal courses such as the National Science Foundation (NSF), National Institutes of Health (NIH), US Dept. of Agriculture (USDA) etc should be freely available via the Internet. Many of the higher quality journals also insist that the data be available either from curated repositories or from the journal website.   However, although the data are available, the quality of the documentation is often very poor.  For example, you might find microarray data with no information about the probes or missing information about the samples.  Sequencing data may lack information about the total library sizes, and so on.   Even if the data are available and documented, it might not be possible to replicate the analyses from the associated papers because the settings for the mapping software, the preprocessing steps, etc may not be recorded.  It seems that documentation of what we have done is the very last thing that seems to take place after our paper is published or contract is fulfilled.

One of the objectives of this course is to understand what makes an experiment replicable, and how to do our analyses in a way that is replicable as well.

The Challenges Ahead

There are a lot of challenges in working with genomics data. The scientific questions are very deep. We have masses of information, but we still don't know how the bits and pieces fit together.  An analogy might be being introduced to a hardware store where you have all of the screws, nails and tools but no instructions and very little information about which pieces should be used together.  There are huge possible impacts on human life in terms of tinkering with disease, crops, embryos, etc.   Both the scientific and ethical issues are enormous and growing, as the tools available to the lab scientists become more and more sophisticated.

One of the challenges for statisticians is trying to keep up with the rapidly changing technologies.  Important distributional characteristics of the data change as the technology changes, and the technology is changing extremely rapidly.  Although our top-level analyses such as clustering and t-tests will continue to be valid, the data cleaning and preprocessing steps change with the technology.  Often the developers of the technology also provide data analysis tools, but these are not always correct or efficient.  In addition, the analysis of the data requires at least some knowledge of the rapidly evolving web-based knowledge repositories, some of which have self-replicating errors. And, partly because we are using so many tools, we don't always know where these errors are creeping in.

There is also the p greater than n problem. In statistics, p is the number of features (genes, exons, SNPs or whatever it is that you are measuring) and n is the number of samples. It is a fact of life that if you have more features than samples, then with your current data you can always achieve perfect classification and perfect prediction, although your classification or prediction rule would likely not work correctly with a new sample.  Even if the data are just noise, when p > n there will be some combination of noise features that would, for instance, divide the tumors from the normals.  In some medical studies we now have quite large sample sizes as well, in the tens of thousands of patients.  However, since we often have millions of features such as SNPs, we might still have p greater than n, and even more massive data to deal with.  In the core scientific disciplines, we are lucky if we have tens of samples.  So we cannot rely on statistics alone - we need to bring other information into our analyses.
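
This is easy to demonstrate by simulation (a minimal sketch): with 10 samples and 10,000 pure-noise features, some feature will separate two arbitrary groups at an apparently impressive significance level:

    set.seed(2)
    n <- 10; p <- 10000
    noise <- matrix(rnorm(n * p), nrow = n)    # pure noise, no real signal
    group <- factor(rep(c("tumor", "normal"), each = n / 2))

    pvals <- apply(noise, 2, function(x) t.test(x ~ group)$p.value)
    min(pvals)   # tiny p-values arise by chance alone when p >> n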

Another challenge is the sheer amount of data that we need to handle. For example, an RNA-seq sample is 25 million or more short sequences, which can be 10 Gb or more.  This means that the data are not readily up- and down-loaded, and are difficult to handle on a laptop or desktop computer. The data are also very noisy, and as we try to understand this noise we find that much of it depends on the measurement technology being used.

The most successful researchers have either trained themselves to be very cross-disciplinary, with knowledge about biology, computer science and statistics, or have assembled cross-disciplinary teams of collaborators.  A barrier to collaboration is the impenetrable jargon coming from biology, computer science and statistics.  Even scientists in the same discipline may not use the jargon in the same way.  Personally, I deal with this by asking lots and lots of questions.  This often helps open dialogs that greatly improve communication.


Source URL: https://onlinecourses.science.psu.edu/stat555/node/2

Links:
[1] https://www.ncbi.nlm.nih.gov/probe/docs/techbeadarray/