12.1 - Finding SNPs Using Sequencing Data

Printer-friendly versionPrinter-friendly version

While the initial draft of the human genome was published in 2000, there have been many subsequent efforts to add to our knowledge of the human genome by assembling genomes for many different individuals.  This type of project is called "resequencing" because the initial sequence can be used as a reference making the analysis of each additional individual's DNA much simpler. In a single genome, genomic variation can be determined only at sites at which the individual is heterozygotic.  Resequencing projects allow researchers to determine the extent of genetic variation in a population.  As well, resequencing parents and off-spring allow researchers to determine how frequently new variants arise in a population.

In human populations, we are often interested in disease phenotypes and the associated genotypes. The genotype might be causative or it might associated with something else that is actually associated with disease - for example, a genotype associated with obesity might be due to its association with a phenotype such as a glucose processing phenotype.  Other uses of SNP analysis is for inferring evolutionary relationships.  Finally, SNP analysis is now being used for breeding of domesticated plants and animals to improve yields, nutritional value etc.

 We are going to focus on finding the association of SNPs with phenotype. However, in this section, we will briefly go over SNP calling using sequencing.

SNP Detection with a Reference Genome

As with other high throughput sequencing analyses, the workflow begins with fragmented DNA from a tissue.  The fragments are sequenced and then mapped to the reference without the requirement of perfect match at each location.  SNP detection begins after mapping. 

Here's an example of reference and mapped reads:

table of snp reads

As you can see in this reference above these SNPs are not necessarily in the middle of the read. However, after the reads are aligned you can detect single nucleotide mismatches.  Here we show reads from a heterozygote.  If all the reads have an A or a T at the SNP location, the individual is a homozygote.  In the case that the individual matches the reference, we only say that a SNP is present at this location if we have determined in advance that there is genetic variation in the population at the location.

One problem is that the best sequencing methods still have error rates of a couple of percentage points. They are about the same frequency as some of the SNPs that we are interested in detecting. SNPs in the human genome are about 1 per 1000 bp and about 1-2% of reads are exact duplicates which may be due to technical issues in sample preparation so that seeing two reads with the same variant is not sufficient to conclude that a SNP is present. Usually we assume that exact replicates are technical errors and remove all but one.  We also have to acknowledge that the reference itself might have an error at some locations (although for the human genome this is less problematic than for other less well-researched genomes).  Finally, alignment problems can cause errors, if a closely related sequence is wrongly mapped to the site.

To account for the various types of error in the data, we only call new SNPs at locations that have multiple reads (usually at least 75 for de novo calls) and if either the vast majority of the reads match each other but not the reference at this site, or if two variants are in about 50/50 ratio.  

The SNP call error rate is about 3% with Sanger sequencing (which is better than high throughput). However, high throughput sequencing is getting more accurate all the time.

Typically, when calling a SNP at a location, we should have at least 2 reads with each variant.  So if a location is covered by 6 reads and there are 2 A's and 4 T's we would consider the individual to be a heterozygote at that locus with 2 alleles, A and T.  However, if there is 1 A and 5 T's, we would assume that the A is a sequence error unless we have information from other samples that suggest that it could be correct - e.g. in family studies we could consider the genotype of relatives at that location.

Because it is so expensive, coverage for whole genome sequencing is generally low.  However,  coverage is usually quite uniform, with about the same number of reads at each location so that the quality of SNP calls is even over the genome.  Exome sequencing projects usually have much higher average coverage, but the current technologies have biases that create very uneven coverage over the exome.  This can be problematic for SNP calling, although the higher average coverage is advantageous.

SNP Detection with No Reference Genome

If you don't have reference genome calling SNPs is much harder. In this case  local reference sequences can be created by assembly - i.e. by overlapping reads that have sequence similarity.  For this purpose longer or paired end-reads are preferred to short single-end reads.   The DIAL (De novo Identification of Alleles) software was developed at Penn State for this purpose [1]. 

References

[1] Ratan A, Zhang Y, Hayes VM, Schuster SC, Miller W. Calling SNPs without a reference sequence. BMC Bioinformatics. 2010;11:130. doi:10.1186/1471-2105-11-130. link