Lesson 12: Single Nucleotide Polymorphisms

Printer-friendly versionPrinter-friendly version
Key Learning Goals for this Lesson:
  • Understanding genomic variation
  • Understanding why SNPs have become a popular genomic variant and marker for genetic analysis
  • Introducing how SNPs are detected.
  • Introducing the concept of genetic marker as opposed to the genetic cause

Single Nucleotide Polymorphisms (SNPs)

Differences in DNA among individuals drive many types of phenotypic differences. There are many types of genomic differences between individuals of the same species or related species.  These could include:

  • insertion and deletions of chunks of DNA,
  • duplications of chunks of DNA particularly duplications of whole genes
  • inversion of chunks of DNA, where a chunk of DNA gets pulled out of a chromosome, reversed and reinserted on the opposite strand,
  • “stuttering” of repetitive regions, often called microsatellites, and
  • common single nucleotide changes, also called single nucleotide polymorphisms or SNPs
  • rare single nucleotide changes, also called single nucleotide variants or SNVs.

Ideally we would like to determine the actual genetic variant that causes a phenotype.  However, with high throughput methods, often we are able only to locate a region on the chromosomes that is associated with the phenotype.  In this case, the variants are used as genetic markers, bounding the regions of the chromosome most likely to be associated with the phenotype.  We can do this due to linkage disequilibrium (LD) which is the propensity of chunks of DNA which are close together on the same chromosome to be inherited together.

Single nucleotide changess are probably the simplest type of genetic variant to study with high throughput methods.  Currently, we distinguish between SNPs which are relatively common (occurring in at least 1% of the population) which are called single nucleotide polymorphisms or SNPs, and rarer variants called SNVs.  

Here is an animated presentation from the National Genome Research Institute presented as part of their DNA Day resources.  Click on the 'Watch' button and a new window will open up with this presentation.  In the new window, click on the button 'What is a SNP?'




SNPs are often used in genomic studies because they are readily detected and measured.  Known SNP locations can readily be used to create microarray probes with both variants on the microarray.  This is a cheap and efficient way of genotyping, especially when the genotypes are used as genetic markers.  SNPs may also be detected during DNA sequencing.  Genotyping of a new sample at known SNP locations is readily done using mapping.  

High throughput sequencing can also be used to detect rare or novel SNVs, by looking for differences between the reads and the locations to which they are mapped.  Since sequencing is inherently noisy, typically high coverage is required to verify new SNV calls, with local coverage in the region of the SNV of about 75 reads per base being preferred.  This can be done with whole genome sequencing.  However, often a much smaller subset of the DNA, the coding exons, are used instead.  The exome sequencing approach can greatly enhance coverage, because the exome is only 1% or 2% of the DNA.  However, the chemistry required to enrich the samples for exomic regions can also introduce biases in coverage even in the exons, and of course cannot detect SNVs in the regulatory regions which may be the actual cause of the phenotype under investigation.  mRNA can also be used to detect SNVs in the coding region, although only highly expressed genes are likely to have sufficiently many reads to make this possible.

For de novo detection of SNVs, we expect that after mapping we would see something like the situation below.  Here we align the reads to the same region because they have identical sequences except at the location of the putative SNV.  Typically it is assumed that there are only 2 variants at a SNV location.  Suppose in our population of individuals, the variants are A and T as in the example below.  Then we expect to see some individuals with only A at the location, some with only T (homozygotes) and some with about 50% of each (heterozygotes), having inherited one of each allele from their parents.  It is possible that only one homozygote type is in the population if the variant is sufficiently deleterious.  

table of snp reads

Depending on where this change is in the chromosome, there are many possible biological effects of single nucleotide changes (SNC). If it happens the middle of the coding region it could change the protein.  However, since many amino acids have 2 or more codes, it might be a synonymous mutation which does not change the protein - biologists are currently investigating in whether synonymous mutations might have other biological effects, such as changing the rate at which the transcript is translated into a protein.  If the change is in the promoter region of a gene or other protein binding site it can change the gene regulation. If it is in other places such as introns or intergenomic regions it might change some of the energetics of what's going on. Much of this we don't understand very well yet.  A SNC might also have an effect only if there is some other sequence difference that goes along with it. Finally, there is a possibility that a SNC has no effect at all.

SNCs are not necessarily more important than other types of variation in the genome. In fact, they may be less important than other kinds of mutations or polymorphisms because they are small. However, they are very easy to measure with the technologies that we have today. Larger mutations may be more difficult to measure, because, for example, it is difficult to map regions in which many adjacent nucleotides do not match.  

The idea of the perfect match and mismatch probe on the early Affymetrix microarrays was ideal for SNP detection.  Instead of perfect match and mismatch, we have 2 probes for each SNP with the same sequence except in the center, where the 2 SNP alleles are printed.  SNP microarrays are heavily used for human genotyping and are also available for many other plants and animals.  Because genotyping by microarray is much less expensive than genotyping by sequencing and the data management is much simpler, the use of microarrays for genotyping with known markers is likely to be used for some time to come.

Understanding the link between genotype and phenotype is an important goal in many studies, such as plant and animal breeding and understanding the genetics of both simple and complex diseases.  As a result, large databases of SNPs are now available for many organisms, including agriculturally important plants and animals.  

The largest SNP and SNV databases are being collected on humans.  There are several companies and/or research institutes with the objective of genotyping hundreds of thousands of individuals to provide data useful for personalized medicine.  A variety of technologies are being employed, but the most use exome sequencing, which provides the capability of de novo SNV discovery at less cost than whole genome sequencing.  Groups collecting these data may offer incentives such as ancestry tracing or some type of summary of possible disease-associated genotypes.  The amount of data that will soon be available is quite staggering!  As well, these data will include relevant phenotypes (such as medical records) and possibly familial information.

This course covers the use of SNP data only briefly.  We do not cover calling SNVs de novo although we will briefly consider the work flow.  We will assume that the SNVs are known in advance and are sufficiently frequent to be classified as SNPs.  The genotype of each sample at each SNP has been determined - either by a microarray or by mapping - a limited number of phenotypic variables, such assick or healthy are recorded.