STAT 555
Published on STAT 555 (https://onlinecourses.science.psu.edu/stat555)

Home > Lesson 13: ChIP-Seq Data

Lesson 13: ChIP-Seq Data

Key Learning Goals for this Lesson:
  • Understanding the basics of gene regulation through epigenetic modifications.
  • Reviewing chromatin immunoprecipitation as a means of targeting regions of DNA that bind to specific proteins.
  • Understanding the basics of peak calling to define the features for differential binding analysis.
  • Understanding differential binding analysis

Introduction

The basic paradigm in cell biology is that (almost) all the cells in the organism have the same DNA and that cell types differ due to how they are regulated.  Epigenomics is the study of modifications to the DNA that regulate transcription and other cellular functions.  

The objectives of the analysis are to locate the locations of interest and determine how the modifications behave under various biological conditions. Because the exact location of epigenomic modications and protein binding sites is not known for most regulatory processes, designing microarray probe sequences is difficult.  Thus early studies of protein binding using microarrays had only limited success.  Sequencing technology is much better suited to these assays, as the regions of interest can be captured and then sequenced and mapped.  Sophisticated assays which target the regions which are protected from DNAse activity by the bound proteins are possible and also use sequencing technology.   However, these "footprinting" technologies require sophisticated mapping strategies to identify which proteins are bound at each protected site.  In this section we will consider the analysis of one of the simplest assays for epigenomics, which targets regions of the DNA which are bound to specific proteins.  Similar statistical analysis methods can be used to assess other assays such as methylation.

A number of epigenetic features are associated with transcriptional regulation including uncoiling of the chromatin and binding of protein complexes to the DNA.  ChIP targets the proteins bound to the DNA.  There are other assays that target the DNase hypersensitive sites.

transcriptional regulation

Features captured by ChIP-seq, created R. Hardison
Used with permission.

 

ChIP-seq can locate the binding sites of specific proteins which may be part of the transcription machinery or may enhance or block the binding of other proteins required for transcription.

cis regulatory modules

Features of cis-regulatory modules
From Hardison, R.  and Taylor, J.  (2012) "Genomic approaches towards finding cis-regulatory modules in animals [1]", Nature Reviews, 13: 469.
Used with permission.

 

General transcription factors are proteins associated with actively transcribed genes.  There are also sequence specific transcription factors which bind to regulatory sites adjacent to the gene (cis-regulatory sites) and may enhance or silence transcription.  As well, there are proteins that act as insulators, isolating segments of the DNA from activation or silencing.

Epigenetic features play large roles in transcriptional regulation, but understanding these roles requires knowledge of location of the feature relative to the DNA region under regulation. The goal is to determine the locations of these epigenetic features under various conditions. Differential occupancy of sequence specific transcription factors provides insight into gene usage and gene networks.

In the analyses so far, we have been looking at gene expression - the number of transcripts of each gene that can be detected in the tissue.  Often, however, we want a more direct measure of what is happening in the cell.  ChIP-seq and related analyses such as methyl-seq measure targeted regions of the tissue's DNA by enriching a DNA sample for the targeted regions.

chromatin immunoprecipitation process

As discussed in Section 1.8 and illustrated in the figure to the right, ChIP-seq targets regions of the DNA (chromatin) that are bound to specific proteins (1).  The proteins are stabilized by cross-linking them chemically to the DNA (2).  A protein of interest is then tagged with an antibody specific to the protein (3).  The DNA is sheered (4), and the fragments bound to the protein are retrieve by chemically retrieving the antibody (immunoprecipitation) (5).  Finally, the tag and protein are released from the DNA and washed away, leaving behind the DNA that had been bound (6).  This DNA can then be sequenced.  The entire process is called ChIP-seq.  The objective of the assay is to determine the location of the protein binding sites on the DNA, and whether the protein binds to different sites under different experimental conditions.

Typically the remaining DNA sample is enriched for the bound fragments, but still has a background of miscellaneous DNA fragments.  For this reason, we need to do a differential analysis against a sample that was handled similarly except for the antibody binding step. If we are considering several experimental conditions, we may instead do a direct comparison of samples from the different treatments.

As an example, we will be looking at GATA 1 which is a mammalian transcription factor and  ask where it attaches to the chromosome and whether it differs between two experimental conditions.

13.1 - Chromatin Immunoprecipitation

The chromatin in a living cell is constantly undergoing dynamic changes which enhance or inhibit transcription.  Some of these changes are mediated by proteins called transcription factors which bind to the DNA.  By chemically enhancing the binding of the protein to the DNA and then using antibodies that target a specific protein, it is possible to capture the targeted protein and its binding site.  Chromatin immunoprecipitation captures regions of the DNA that are bound the targeted protein.  DNA fragments that are not bound to the protein are washed away, leaving a sample that is enriched for the bound and tagged DNA fragments.  The protein is then released and washed away, leaving a DNA that is enriched in, but not exclusively comprised of, the fragments of interest.  

The fragments that were bound to the proteins are not identical, even if they come from the same region of DNA (in different cells) because the fragments break at different sites.  The bases bound to the protein are protected from fragmentation.  Generally the fragments include the binding site and extend beyond it.  Once the proteins are released from the proteins, they can be sequenced.  Locating the binding sites is done by mapping the sequenced fragments and determining locations on the chromosome which have more mapped reads than expected by chance.  Since fragmentation and capture is not perfectly at random, background reads are not uniformly distributed across the chromosomes, and this must be accounted for when assessing which regions have an excess of mapped reads. 

After enriching the sample for formerly bound fragments, the processing steps include:

  1. Sequence the fragments.
  2. Map reads to the genome.
  3. Determine locations of binding sites (peaks in the distribution of mapped reads).
  4. Remove artifacts due to background reads.
  5. Match peak locations across samples.
  6. Count reads in the peaks.
  7. Differential binding analysis.

As in RNA-seq analysis, a reference is required for the mapping.  Unlike RNA-seq however, the regions of the genome corresponding to features is not known.  The features of interest are the peaks in the distribution of mapped reads.  These may vary from sample to sample and are expected to vary in intensity across treatments.  High intensity peaks are more readily found than low intensity peaks.

Both strands of DNA are bound to the proteins, and so may be captured.  The Genome Browser plot below shows the reads from the two strands in different colors.  

The red and blue instances below are mapped reads from the gata1 ChIP-seq experiment done at Penn State with red indicating one strand and blue indicating the other. 

 

data processing

Calling Peaks

After the positive and negative strand reads are mapped to the reference genome we are ready to determine the location of the binding sites.  These are expected to produce peaks in the numbers of reads piling up at the same spot in a distinctive pattern.  Peak calling inevitably means setting some threshold for the minimum number of reads.  Pile-ups of reads that go above the threshold are called peaks. The threshold is often arbitrary and can be a point of contention in interpreting the data.  If the threshold is set too high, weak binding sites will be missed; if the threshold is set too low, random noise can be mistaken for a peak.  A statistical algorithm is used to "call" the peak locations.

The binding sites are expected to have a distinctive double peak, due to the anti-symmetry between the properties of the two strands. We can see this in the light blue histogram above the plot. The cartoon below illustrates why the double peak occurs.  The fragments containing the binding sites are much longer than the sites themselves.  Because they are sequenced from the 5' end, the actual binding site usually beyond the sequenced part of the read.  As a result, peaks generally present a pattern of pile-ups of reads on the 5' side of the binding site.  However, generally both strands are captured, leading to symmetric pile-ups on both sides of the binding site.  This leads to a symmetric double peak with the binding site in a valley between.

finding binding sites

There are a number of different “peak-calling methods” that can be used. The method that was used to call the peaks in our example GATA1 data is called XSET.  Each mapped read is extended to the average fragment length using the reference. This should extend the reads to include the actual binding site.  Since the reads on the two strands should form symmetric pile-ups, the number of reads at each base should form a symmetric peak as shown below.

finding binding sites

The histogram shows the number of reads at each nucleotide in the sequence.  The "peak width" is the number of bases in the peak area.  The peak intensity is the number of reads in the peak.  

However, it is not always clear where there are true peaks.  For example, there are two potential peaks below.

two histograms

Is the histogram on the right a peak where something is binding with low binding affinity, or is this background? After locating the potential peaks, we still have to decide which ones are true peaks. This is done by thresholding and comparing to a reference sample in which there was no immunoprecitation.

Below is a visualization of reads for the GATA1 enriched sample (upper red track) and an unlabeled sample (lower red track).

removing artifacts

Although the first strong peak in the labeled sample seems large, the unlabeled sample also has a strong signal in that region.  This peak is potentially just background.  There are some other peaks that are fairly unambiguous.  

The most conservative thing you can do with the background is to simply mask the area and assume any reads in this area are due to background. However, you do risk missing pertinent information.  Alternatively, you can subtract out the background.

There are several peak calling programs available, and each will "call" a slightly different set of peaks if the data are at all ambiguous.  It is useful to use the Genome Browser to visualize some of the data to understand the raw read counts and where peaks have been inferred.

13.2 - Differential Binding

Finally, we take a look at differential bindings. Differential binding analysis should be able to identify not only locations where binding signal is completely lost, but also areas showing significant changes in binding strength.  Usually we have several biological replicates and at least two conditions.  Typically the peak caller will not give exactly the same peak boundaries in all samples, so it will be necessary to decide peak boundaries that can be used for all the samples.  For example, in the 4 samples below, the leftmost peak might be expanded to include all the reads in red (erythroblasts) and the double peak in blue (megakaryocyte).  

differential binding analysis

Next, we count the reads in the binding sites derived from all the samples. This is really the same as defining genes, exons, etc in RNA-seq. We then define the “peak region” and count the number of reads that fall into it. This produces a “reads per feature” table with the same features for every sample.

You can then do differential expression analysis using EdgeR or Voom as we did earlier.  The main difference is that there are usually only a few 10s of peaks, rather than 10s of thousands of features.

13.3 - Example in R

In R several packages are available. DiffBind is recommended because it appears to have all the necessary functionality.  Here we show the main output from the analysis of the peaks on chromosome 19.  The details are in the ChIP-seq R Lab.

Required inputs:

  • bam files of mapped reads from ChIP samples
  • bam files of mapped reads from reference samples
  • peak locations from a “peak caller”

DiffBind then does the remaining analyses.

 

Example

Data for Chromosome 19 was prepared from a study of differential binding of GATA1 in 2 cell types.

This not a huge sample. There were 2 erythroblast samples and 3 megakaryocyte samples.

DiffBind reports that you have 5 samples and there were 649 peaks in total that were merged down into 224 consensus peaks that could be used as features in the differential binding analysis.

How many of these peaks were in common in the two cell types?  Just 14.

determing number of peaks

At this point you might ask yourself, should you bother to do a differential binding analysis? The reason for asking this question is that there are only 14 peaks in common.  The other 210 sites appear to be "differential" since they are found in only one cell type.    However, it is actually not so clear that these sites are unbound in the other cells, since the peak caller must distinguish between the background and real peaks.  Both cell types likely have some reads mapped to these regions even though not peaks were called.  For this reason, we use all 224 regions as features.

The next step is to count the reads in each of the 224 peaks and do a differential expression analysis using edgeR, voom or DESeq2.

Usually the library size will be the total number of reads in the bam file (not the total in the peaks).

Notice that we found 224 peaks on chromosome 19.  Multiply this by the number of chromosomes and you may obtain several thousand peaks, so that the number of features is similar to differential expression analysis.  However, unlike differential expression analysis, there is no inexpensive technology that can be used to validate the peak locations and the binding intensity.


Source URL: https://onlinecourses.science.psu.edu/stat555/node/16

Links:
[1] https://www.nature.com/nrg/journal/v13/n7/full/nrg3242.html