STAT 555
Published on STAT 555 (https://onlinecourses.science.psu.edu/stat555)

Home > Lesson 12: Single Nucleotide Polymorphisms

Lesson 12: Single Nucleotide Polymorphisms

Key Learning Goals for this Lesson:
  • Understanding genomic variation
  • Understanding why SNPs have become a popular genomic variant and marker for genetic analysis
  • Introducing how SNPs are detected.
  • Introducing the concept of genetic marker as opposed to the genetic cause

Single Nucleotide Polymorphisms (SNPs)

Differences in DNA among individuals drive many types of phenotypic differences. There are many types of genomic differences between individuals of the same species or related species.  These could include:

  • insertion and deletions of chunks of DNA,
  • duplications of chunks of DNA particularly duplications of whole genes
  • inversion of chunks of DNA, where a chunk of DNA gets pulled out of a chromosome, reversed and reinserted on the opposite strand,
  • “stuttering” of repetitive regions, often called microsatellites, and
  • common single nucleotide changes, also called single nucleotide polymorphisms or SNPs
  • rare single nucleotide changes, also called single nucleotide variants or SNVs.

Ideally we would like to determine the actual genetic variant that causes a phenotype.  However, with high throughput methods, often we are able only to locate a region on the chromosomes that is associated with the phenotype.  In this case, the variants are used as genetic markers, bounding the regions of the chromosome most likely to be associated with the phenotype.  We can do this due to linkage disequilibrium (LD) which is the propensity of chunks of DNA which are close together on the same chromosome to be inherited together.

Single nucleotide changess are probably the simplest type of genetic variant to study with high throughput methods.  Currently, we distinguish between SNPs which are relatively common (occurring in at least 1% of the population) which are called single nucleotide polymorphisms or SNPs, and rarer variants called SNVs.  

Here is an animated presentation from the National Genome Research Institute [1]presented as part of their DNA Day resources.  Click on the 'Watch' button and a new window will open up with this presentation.  In the new window, click on the button 'What is a SNP?'

watch! [2]

 

 

SNPs are often used in genomic studies because they are readily detected and measured.  Known SNP locations can readily be used to create microarray probes with both variants on the microarray.  This is a cheap and efficient way of genotyping, especially when the genotypes are used as genetic markers.  SNPs may also be detected during DNA sequencing.  Genotyping of a new sample at known SNP locations is readily done using mapping.  

High throughput sequencing can also be used to detect rare or novel SNVs, by looking for differences between the reads and the locations to which they are mapped.  Since sequencing is inherently noisy, typically high coverage is required to verify new SNV calls, with local coverage in the region of the SNV of about 75 reads per base being preferred.  This can be done with whole genome sequencing.  However, often a much smaller subset of the DNA, the coding exons, are used instead.  The exome sequencing approach can greatly enhance coverage, because the exome is only 1% or 2% of the DNA.  However, the chemistry required to enrich the samples for exomic regions can also introduce biases in coverage even in the exons, and of course cannot detect SNVs in the regulatory regions which may be the actual cause of the phenotype under investigation.  mRNA can also be used to detect SNVs in the coding region, although only highly expressed genes are likely to have sufficiently many reads to make this possible.

For de novo detection of SNVs, we expect that after mapping we would see something like the situation below.  Here we align the reads to the same region because they have identical sequences except at the location of the putative SNV.  Typically it is assumed that there are only 2 variants at a SNV location.  Suppose in our population of individuals, the variants are A and T as in the example below.  Then we expect to see some individuals with only A at the location, some with only T (homozygotes) and some with about 50% of each (heterozygotes), having inherited one of each allele from their parents.  It is possible that only one homozygote type is in the population if the variant is sufficiently deleterious.  

table of snp reads

Depending on where this change is in the chromosome, there are many possible biological effects of single nucleotide changes (SNC). If it happens the middle of the coding region it could change the protein.  However, since many amino acids have 2 or more codes, it might be a synonymous mutation which does not change the protein - biologists are currently investigating in whether synonymous mutations might have other biological effects, such as changing the rate at which the transcript is translated into a protein.  If the change is in the promoter region of a gene or other protein binding site it can change the gene regulation. If it is in other places such as introns or intergenomic regions it might change some of the energetics of what's going on. Much of this we don't understand very well yet.  A SNC might also have an effect only if there is some other sequence difference that goes along with it. Finally, there is a possibility that a SNC has no effect at all.

SNCs are not necessarily more important than other types of variation in the genome. In fact, they may be less important than other kinds of mutations or polymorphisms because they are small. However, they are very easy to measure with the technologies that we have today. Larger mutations may be more difficult to measure, because, for example, it is difficult to map regions in which many adjacent nucleotides do not match.  

The idea of the perfect match and mismatch probe on the early Affymetrix microarrays was ideal for SNP detection.  Instead of perfect match and mismatch, we have 2 probes for each SNP with the same sequence except in the center, where the 2 SNP alleles are printed.  SNP microarrays are heavily used for human genotyping and are also available for many other plants and animals.  Because genotyping by microarray is much less expensive than genotyping by sequencing and the data management is much simpler, the use of microarrays for genotyping with known markers is likely to be used for some time to come.

Understanding the link between genotype and phenotype is an important goal in many studies, such as plant and animal breeding and understanding the genetics of both simple and complex diseases.  As a result, large databases of SNPs are now available for many organisms, including agriculturally important plants and animals.  

The largest SNP and SNV databases are being collected on humans.  There are several companies and/or research institutes with the objective of genotyping hundreds of thousands of individuals to provide data useful for personalized medicine.  A variety of technologies are being employed, but the most use exome sequencing, which provides the capability of de novo SNV discovery at less cost than whole genome sequencing.  Groups collecting these data may offer incentives such as ancestry tracing or some type of summary of possible disease-associated genotypes.  The amount of data that will soon be available is quite staggering!  As well, these data will include relevant phenotypes (such as medical records) and possibly familial information.

This course covers the use of SNP data only briefly.  We do not cover calling SNVs de novo although we will briefly consider the work flow.  We will assume that the SNVs are known in advance and are sufficiently frequent to be classified as SNPs.  The genotype of each sample at each SNP has been determined - either by a microarray or by mapping - a limited number of phenotypic variables, such assick or healthy are recorded.

12.1 - Finding SNPs Using Sequencing Data

While the initial draft of the human genome was published in 2000, there have been many subsequent efforts to add to our knowledge of the human genome by assembling genomes for many different individuals.  This type of project is called "resequencing" because the initial sequence can be used as a reference making the analysis of each additional individual's DNA much simpler. In a single genome, genomic variation can be determined only at sites at which the individual is heterozygotic.  Resequencing projects allow researchers to determine the extent of genetic variation in a population.  As well, resequencing parents and off-spring allow researchers to determine how frequently new variants arise in a population.

In human populations, we are often interested in disease phenotypes and the associated genotypes. The genotype might be causative or it might associated with something else that is actually associated with disease - for example, a genotype associated with obesity might be due to its association with a phenotype such as a glucose processing phenotype.  Other uses of SNP analysis is for inferring evolutionary relationships.  Finally, SNP analysis is now being used for breeding of domesticated plants and animals to improve yields, nutritional value etc.

 We are going to focus on finding the association of SNPs with phenotype. However, in this section, we will briefly go over SNP calling using sequencing.

SNP Detection with a Reference Genome

As with other high throughput sequencing analyses, the workflow begins with fragmented DNA from a tissue.  The fragments are sequenced and then mapped to the reference without the requirement of perfect match at each location.  SNP detection begins after mapping. 

Here's an example of reference and mapped reads:

table of snp reads

As you can see in this reference above these SNPs are not necessarily in the middle of the read. However, after the reads are aligned you can detect single nucleotide mismatches.  Here we show reads from a heterozygote.  If all the reads have an A or a T at the SNP location, the individual is a homozygote.  In the case that the individual matches the reference, we only say that a SNP is present at this location if we have determined in advance that there is genetic variation in the population at the location.

One problem is that the best sequencing methods still have error rates of a couple of percentage points. They are about the same frequency as some of the SNPs that we are interested in detecting. SNPs in the human genome are about 1 per 1000 bp and about 1-2% of reads are exact duplicates which may be due to technical issues in sample preparation so that seeing two reads with the same variant is not sufficient to conclude that a SNP is present. Usually we assume that exact replicates are technical errors and remove all but one.  We also have to acknowledge that the reference itself might have an error at some locations (although for the human genome this is less problematic than for other less well-researched genomes).  Finally, alignment problems can cause errors, if a closely related sequence is wrongly mapped to the site.

To account for the various types of error in the data, we only call new SNPs at locations that have multiple reads (usually at least 75 for de novo calls) and if either the vast majority of the reads match each other but not the reference at this site, or if two variants are in about 50/50 ratio.  

The SNP call error rate is about 3% with Sanger sequencing (which is better than high throughput). However, high throughput sequencing is getting more accurate all the time.

Typically, when calling a SNP at a location, we should have at least 2 reads with each variant.  So if a location is covered by 6 reads and there are 2 A's and 4 T's we would consider the individual to be a heterozygote at that locus with 2 alleles, A and T.  However, if there is 1 A and 5 T's, we would assume that the A is a sequence error unless we have information from other samples that suggest that it could be correct - e.g. in family studies we could consider the genotype of relatives at that location.

Because it is so expensive, coverage for whole genome sequencing is generally low.  However,  coverage is usually quite uniform, with about the same number of reads at each location so that the quality of SNP calls is even over the genome.  Exome sequencing projects usually have much higher average coverage, but the current technologies have biases that create very uneven coverage over the exome.  This can be problematic for SNP calling, although the higher average coverage is advantageous.

SNP Detection with No Reference Genome

If you don't have reference genome calling SNPs is much harder. In this case  local reference sequences can be created by assembly - i.e. by overlapping reads that have sequence similarity.  For this purpose longer or paired end-reads are preferred to short single-end reads.   The DIAL (De novo Identification of Alleles) software was developed at Penn State for this purpose [1]. 

References

[1] Ratan A, Zhang Y, Hayes VM, Schuster SC, Miller W. Calling SNPs without a reference sequence. BMC Bioinformatics. 2010;11:130. doi:10.1186/1471-2105-11-130. link [3]

12.2 - Genotyping with SNP Microarrays

Although sequencing technologies are essential to SNP detection, microarrays will continue to be used for genotyping for at least the near future because they are cheaper to use than sequencing and data management is simpler.  Usually we're using SNPs as markers so we don't really care if we miss a SNP or two.  

There are a number of SNP microarrays commercially available.  The standard human arrays have from 500 thousand to 2 million SNPs on the microarray.  However, when the genes likely to be involved in a disease process are known, it is also possible to create smaller microarrays targeting only SNPs in those genes.  Even if the standard arrays are used, often the stastistical analysis is limited to a smaller preselected set in order to avoid loss of power when doing multiple testing adjustments.  Adjusting for a few hundreds of SNPs in a smaller set of preselected genes is a big power improvement compared to adjusting for 2 million SNPs.

SNP microarrays are designed with a probe for each allele. This may include both sense and anti-sense probes. In principle, genotype calling is straight-forward. Homozygotes should have high intensity on 1 allele (called "2") and low on the other (called "0"). Heterzygotes should have moderate intensity on both alleles (called "1").  On some platforms, only one of the alleles is used as a probe and the hybridization intensity is partitioned to call 0, 1 or 2.

In actuality, the data need to be normalized. The different nucleotides have different binding energies which need to be accounted for. You might also have to make adjustments for "fragment length". Normalization methods used for gene expression are not suitable for this, because of the smaller dynamic range of the arrays (i.e. 0,1 or 2) and because of the known relationship between the calls for the 2 alleles (the only possible combinations are (0,2), (1,1) or (2,0).)

The call and call reliability is usually recorded for downstream use. More reliable calls are obtained from homozygous sites because the difference between the present and absent alleles is obvious. When both alleles are present, we expect an equal signal, but due to noise this is not observed.    

Missing data are frequent on genotyping microarrays.  Occasionally, most of the data are missing for a feature - this usually means that the call reliability is poor due to technical problems.  More frequently, there are sporadic missing features for several SNPs on each array.  This may be due to biases introduced during sample preparation or during hybridization or due to genomic variation such as missing genomic segments in an indiviual.

After you aquire the SNP data as a data matrix of SNPs by samples, there are several things that you might do next including:

  • haplotyping,
  • filtering SNPs [1]
  • determining population structure,
  • Genome-Wide Association Studies

References

[1] GWAS Data Cleaning (see readings)

12.3 - Haplotyping

Suppose two SNPs, A and B are in the same gene. We only observe the individual genotype for each SNP, we don't get to see how they sit together along the chromosome. So, if the person is homozygous for A, (AA), but heterozygous for B, (Bb), then we know that one chromosome  had the major allele for AB and on the other the major for A the minor for B, (Ab). But what if the individual is heterozygous for both, Aa and Bb? Then we don't know which of the four possibilities they have.  In many cases, if there are 2 SNPs on the same gene, certain combinations are more likely to be associated with phenotypes than others.  Even if this is not the case, linkage disequilibrium induces a correlation in the SNP frequencies which will affect our multiple testing adjustments.  In the most severe case, the alleles are always inherited together, so that there is 100% correlation between the tests of association at the 2 sites.  

The haplotype is the set of variants in a single gene that are inherited as a unit from one of the parents (and so are on the same copy of the chromosome).  When the gene is transcribed and then translated, all of these variants will occur together in the resulting protein.

Let's re-examine the website from Genome.gov to see what they have to say about linkage and haplotypes. Click on the 'Watch' button below and a new window will open up with this presentation. In the new window, click on the second button, "What is a haplotype?":  

watch! [4]

 

 

One of the questions of interest is  how to infer the haplotype from the individual SNPs. This is sometimes called phasing. With family data it is easier to do phasing because the offspring have a chromosome from each parent. So, unless there is a recombination event (which moves genetic material between the two copies of the chromosome in the parent while producing the germ cells) the offspring inherit the haplotype from the parents.

The animation Genome.gov states that you can think of a person's haplotype pair as their 'SNP profile'. As a statistician, the haplotype is the way of reducing the very dense set of data with 1 million features into something that has a, hopefully, smaller more meaningful set of features.  As well, we expect the haplotypes to have less linkage disequilibrium than the the SNPs within each haplotype, reducing the problem of correlation of the features.

12.4 - Linkage Disequilibrium

It turns out that linkage disequilibrium is a key to doing phasing if we don't have family data. What is linkage disequilibrium?

In diploid organisms (organisms with 2 copies of each chromosome) the parents have two chromosomes each and the offspring gets one from each parent.

diagram of parent's contributions to offspring

However, processes during meiosis (production of the haploid (one copy of each chromosome) germ cells) crossover events take place in which the two autologous chromosomes exchange material. (Autologous means that these are the "same" chromosomes although there may be differences at the smaller scale.) This is called recombination, because the material in the chromosomes is recombined to form new chromosomes.  In many multicellular organisms, the average is 1 - 2 cross-overs in each germ cell.  This means that while the offspring obtain half of their genetic material from each parent, most inherited chromosomes are composed of segments arising from both of the chromosomes of the parents.  Linkage refers to the probabiity that two segments of DNA are inherited together.  If the segments are on different chromosomes, then the linkage probability is 1/4 (because inheritance will be independent, and the probability of selecting one of the two copies from each parent is 1/2).  If the segments are on the same chromosome, then the probability depends on the recombination rate between the segments (often called the genetic distance).   Although the genetic distance is not the same as the physical distance, thinking of the chromosomes as rigid sticks that get broken and then stuck back together leads to the (accurate) heuristic that segments that are close together have higher linkage than those that are further apart.  However, because the chromosomes have weaker and stronger regions, and because cross-over is mediated by biochemical processes, the genetic distance is not identical to the physical distance between the segments.

Two segments are said to be in linkage equilibrium if they are inherited independently.  Under random mating without selection you would expect SNPs on different chromosomes to be unlinked. Linkage equilibrium also occurs when the segments are sufficiently distant on the same chromosome.  Otherwise, they are said to be in linkage disequilibrium or LD.  LD induces association between the genotypes at nearby genetic variants.  SNPs close together on the same chromosome are linked.

LD makes it possible and effective to haplotype even when family samples are not included.  You can infer what the missing haplotypes are if you have a large sample by using information from the homozygotes to infer the likely pairings for the heterozygotes.

12.5 - Filtering SNPs

The large numbers of markers such as SNPs now available create a large multiple testing problem.  Haplotyping can reduce the number of markers somewhat - if haplotyping has been in done a study, just replace "SNP" with "haplotype" in the discussion to follow.

To reduce the number of markers, filtering is often done to remove uninformative SNPs.  The most obvious filter is to remove SNPs with poor quality data (usually recognized because no variant call can be made for many samples).  Another obvious filter is to remove SNPs with identical calls in all the samples (since these constant genotypes cannot contributed to variation in a phenotype).

Other types of filtering are driven by biological considerations.  The most obvious of these is limiting the analysis to SNPs in predefined genomic regions such as genes or genomic segments already identified as being associated with the phenotype.

One type of filtering is based on biological functioning.  So for example, only SNPs in coding regions may be used.  This is the basis for a sequencing technology called exome sequencing which captures and sequences only the exons.  However, this method can also be used with SNP arrays, either by printing only probes for exonic SNPs or by filtering the probes to use only exonic SNPs in the analysis.

Nucleotides in the exons code for amino acids, the building blocks of proteins.  A set of 3 nucleotides called a codon codes for an amino acid.  Since there are 4 different amino acids, there are 43=64 codons, but there are only 20 (or maybe 22 - estimates vary due to some rare types of) amino acids that are incorporated into proteins.  As a result, many amino acids are encoded by several different codons which are said to be synonymous.  Synonymous SNPs are often filtered because they do not change the protein sequence.

Using only exonic SNPs and excluding synonymous SNPs assumes phenotypic effects due to gene regulation are negligible.  As we tackle more complex diseases, these strategies may change.  

Since SNPs in LD are correlated, when SNPs are used as markers (rather than as the causal agents) it is common to select only a few SNPs from each highly associated region.  This reduces both the number of markers and the association among them.

SNPs on the X-chromosome are often removed from the study because females have a different number than the males, and males cannot be heterozygous for the genes that are not also on the Y-chromosome. However, as we know there a number of sex-linked diseases.  For these diseases, we will likely want to include genotypes on the X and Y chromosome, keeping in mind the differences between the genders as well as the fact that annotation of these chromosomes is poorer than that for the autosomal chromosomes.

Samples are often filtered as well.  Sometimes there are many missing SNPs - this is likely due to a technical problem.  These samples are removed or re-genotyped.

Another type of sample filtering is done based on the X and Y chromosomes.  The most basic filter is to remove female samples that have more than a few non-missing SNPs on the Y chromosome (a small number might be due to poor SNP calls and can be tolerated).  Similarly, a few "heterozygous" calls for the X chromosome for males can be tolerated but many such calls indicate a problem with the sample.  Anomalous SNPs on the X and Y chromosome can be correct - a number of human conditions with anomalous numbers of X and Y chromosomes exist, some of which might not be known to the subject.  However, unless a study specifically targets individuals with anomalous sex chromosomes, it is usually safest to assume that they may have a range of phenotypes which differ from the main population.  Similarly, if detected, individuals with other major chromosomal anomalies such as extra, missing or translocated chromosomes are generally not used for genotyping studies.

12.6 - Hardy-Weinberg Equilibrium

In diploid populations (with 2 copies of each chromosome) if you consider a single genotype (SNP or other) with:

  • A with frequency p
  • a with frequency (1 - p)

Then, under random mating with equal survival probability for all offspring:

  • AA has a frequency of  \(p^2\)
  • Aa has a frequency of  \(2p(1-p)\)
  • aa has a frequency of  \((1-p)^2\)

This is called the Hardy-Weinberg equilibrium.  It is assessed by considering whether the percentage of samples having each genotype are in the correct proportion, which can be tested using a Chi-square goodness of fit test.

Genes can fail to be in Hard-Weinberg equilibrium for a number of reasons.  Firstly mating is seldom completely at random - some mates are more attractive or more fertile and hence contribute more offspring to the population.  Genes associated with these traits will not be in HW equilibrium.  Another cause of non-random mating is proximity - subpopulations which are physically isolated will inbreed (compared to the larger population from which they originated) .  The variant may be in HW equilibrium with respect to the subpopulation, but not with respect to the larger population.  Finally, some genotypes may reduce fitness so that no offspring survive - this occurs, for example in diseases in which one homozygote is highly unfit but the heterozygote has enhanced fitness.  SNPs in LD with a variant that is not in HW equilibrium are also likely to fail to be in HW equilibrium.

If you look at the workflow for SNP analysis researchers often removed genes that are not in HW equilibrium. This might be a controversial step from a statistician's point of view as there may be a possibility of HW disequilibrium due to some type of selection, making these more interesting SNPs rather than SNPs that would be removed.  On the other hand, lack of HW equilibrium suggests some type of population structure that might need to be taken into account.  As an example - suppose that a genetic disease is more prevalent in a north European population compared to other human subpopulations.  The alleles that confer blonde hair (and many other phenotypes) are also more prevalent in the north European population.  If we do not screen for HW equilibrium, many of these alleles will appear to be associated with the disease.

12.7 - Population Structure

In human studies, but also in any study involving 'free living'animals, as opposed to laboratory animal studies, it is important to consider population structure.

Population structure occurs because individuals do not have access to the entire population to choose mates.  In animal populations, and historically in human populations, individuals are more likely to have offspring with others who live nearby.  Thus, even without considering sexual selection or survival of offspring, proximity induces assortative mating which makes certain SNPs more prevalent in each subpopulation simply due to the genotype of the founding population and genetic drift.

If we are looking at the phenotype of blonde hair and we have a mixture of individuals of both European background and non-European background, we will find many genes associated with European ancestry that are not necessarily causal for blonde hair.  Similarly, there are many diseases that are more prevalent in some human populations than others.  To understand the genetic components of these diseases, we need to distinguish between SNPs that are associated with the disease because they are more more (or less) prevalent in the population in which the disease is more prevalent and those that may have a causal link with the disease. 

Take for example, type II (late on-set) diabetes.  While diabetes is a problem in most human populations, it has higher prevalence in some groups, such as Native Americans.  Disentangling the genetic and environmental causes of diabetes is quite difficult and is made more difficult by the fact that the same geographical isolation that created population genetic substructure also exposed the populations to different environments and led to different cultural traditions that might interact with the genetics to induce the disease.

Population substructure can often be determined using SNP data using a method called principal components analysis or PCA.  We will discuss PCA in more detail towards the end of the course.  For now, we just note that it is a means of finding several uncorrelated weighted averages of the features that have the greatest variance.  We usually use the weighted averages to plot the data in 2 or 3 dimensions.  We use all the data to find the weights for the weighted averages.  We then use these weights to compute  2 or 3 different weighted averages for each sample.  

plot of population clustersIn the study shown at the right the following populations were involved:

CEU - Central European (Utah)
HCB – Han Chinese (Beijing)
JPT- Japanese (Tokyo)
YRI - Yoruba (Ibadan, Nigeria)

The SNPs were coded as 0,1 or 2 according to the number of "A" alleles. 2 weighted averages of the SNPs were computed - the first is the one with the maximum variance across the samples.  Once the weights are computed, each individual gets a value (called the first principal component) by computing the weighted average using the SNP values for that individual.  The second set of weights is computed so that the second principal component is uncorrelated with the first, and has maximum variance among the uncorrelated weighted averages.  (This sounds complicated, but it is readily computed using matrix algebra.)

The plot shows each individual in the study plotted against the first 2 principal components and color coded by race.  There is a very clear separation between the African, European and Asian individuals in this sample.  The 2 Asian populations are not well separated - with just 2 principal components, but a 3rd component does separate them.

With this type of strong population structure, any study that includes individuals from all 4 populations is likely to have confounding due to the population structure.  While you could study each population individually, you would likely get more statistical power by considering the subpopulation to be a block, and fit a generalized linear model similar to the model for the RNA-seq data.  However, for SNP data we usually consider 0,1,2 to be ordinal and use ordinal logistic regression to fit a model with subpopulation as block and the phenotype as the predictor.

12.8 - Genetic Markers

In some studies, SNPs or haplotypes are expected to be the causative agents in the phenotype of interest.  For these types of studies, it is important to include all the SNPs (in the genomic region of interest).  Once SNPs are found that are associated with the phenotype, more detailed analysis of the gene and protein networks are done to understand the processes involved.

However, in many studies, SNPs are used as genetic markers - they are not necessarily expected to be causal in the phenotype, but SNPs that are associated with the phenotype are expected to be in LD with causal genomic regions.  When using SNPs as genetic markers, we do not need to have all the SNPs - we need a set of SNPs that are well distributed across the genome.

With quantitative phenotypes such as height, weight, or a measure of insulin resistance,  the analysis attempts to find the region between two markers that is highly associated with the phenotype.  Such a region is called a Quantitative Trait Locus.  

When we use a dense set of markers such as SNPs and we use all of the SNPs available, the analysis is called a Genome-Wide Association Study (GWAS). In a GWAS in humans we usually have at least a half a million features that we are looking at, and we do a test of each feature.

Another analysis that has become popular links gene expression and SNP analysis into eQTL(expression QTL) analysis. In eQTL analysis the phenotype is the expression level of one of more genes.  Potentially we could have half a million markers and 30,000 gene expression values for each sample. We are talking here about very big data sets! There is not much difference between eQTL's and other QTL's in terms of the basic statistical testing. The main differences are how the data are handled and how the data are filtered.  Often for each eQTL, only SNPs which are physically close on the chromosome to the expressing gene, or which are close to genes in a network involving the expressing gene are used for the QTL analysis, which greatly reduces the numbers of tests.

12.9 - Statistical Testing in GWAS

In GWAS studies, usually a test is done for every gene.  Several tests are available.

In the simplest case, we have a categorical phenotype with two categories.  Together with the 3 genotypes, this creates a 2x3 table.  The counts in the table are the numbers of samples in the study with a particular genotype and phenotype combination.  


AA=0
Aa=1
aa=2
Total
healthy
\(N_{11}\)
\(N_{12}\) \(N_{13}\) \(R_{1}\)
disease
\(N_{21}\) \(N_{22}\) \(N_{23}\) \(R_{2}\)
Total        

Assuming that the samples are independent (e.g. they are not related), there is no population structure and no covariates, then Fisher's exact test or a chi-squared test can be done to determine if the phenotype is associated with the genotype.  

Another commonly used test (again for independent samples and no population structure) is the Cochran-Armitage test:

\[T(t)=\sum_{i=1}^{3} t_i(N_{1i}R_{2}-N_{2i}R_{1})\]

The term \(N_{1i}R_{2}-N_{2i}R_{1}\) essentially takes the difference in counts between the rows, after reweighting to equalize the row totals.  (To see this, note that \(\sum_{i=1}^{3} N_{1i}R_{2}=\sum_{i=1}^3 N_{2i}R_{1})\).  The weights \(t_i\) are selected depending on the pattern you want to test for.  E.g. if you hypothesize that the A allele is dominant then the weights are \(t_1=t_2=1, t_3=0\).  If you hypothesize that the effects of A and a are additive, then the weights are \(t_1=1, t_2=2, t_3=3\).  Other patterns are also possible and can be tested using different weights.

When the samples are related, there is population structure or there are environmental covariates, regression models are more flexible than models for tables.  For binary traits as in the table above, we can use logistic regression to formulate the probability of one of the phenotypes (compared to the other) which provides a very flexible framework similar to the linear model.  When the trait is quantitative, ordinary linear models can be used.  The phenotype can be considered categorical (using indicator variables as the predictors) or ordinal (using the 0,1,2 as numerical values.)

The best software I am aware of for GWAS studies is PLINK [5]. Although PLINK is stand-alone software, the authors also provide a link to R called Rplinkseq [6] The authors state: "Rplinkseq is an R package that allows access to PLINK/Seq projects directly from R, so that R's rich set of statistical and visualisation tools can be utilised. "  PLINK can handle haplotyping, filtering and all of the currently popular models for GWAS analysis.  However, data management such as filtering, selecting samples or features, etc. are probably best done in R. 

One problem in GWAS studies is that multiple testing has not been entirely worked out. This is because the multiple testing methods that we know work require independence among the tests.  However, because of LD,if you use a dense set of SNPs, the correlations among the tests can be high. Haplotyping can combine multiple SNPs into a smaller number of more complex genotypes (with possibly more than 2 alleles) which usually improves the analysis by having higher association with the phenotype, having fewer features to compare and reducing LD among features.  In QTL studies, the genotypes are assumed to be markers of the causal loci, rather than being causal themselves.  This takes advantage of LD, as markers more correlated with to the causal regions should have stronger association with the phenotype. Researchers take advantage of the correlation among the p-values and plot the -log10(p-values) against the physical distance on the chromosome in a "Manhattan plot". The x-axis of this plot are the chromosomal positions of each feature within each chromosome, ordered by chromosome number (and usually color coded so that it is easy to see which features are in which chromosome).  The y-axis are transformed p-values.  Since the smallest p-values are of interest, the y-axis is usually -log10(p-value), which emphasizes the small values.  "Real" QTLs are assumed to be indicated by a local peak of small p-values.

References

  • Li, Ruan and Durbin, 2008 Genome Research
  • Ratan, Zhang, Hayes, Schuster and Miller, 2010, BMC Bioinformatics
  • Lin, Carvalho, Cutler et al 2008 Genome Biology
  • Kathiresan, Willer, Peloso et al (2009) Nature Genetics
  • Barrett, Clayton, Concannon et al (2009) Nature Genetics

Source URL: https://onlinecourses.science.psu.edu/stat555/node/18

Links:
[1] https://www.genome.gov/
[2] javascript:popup_window( 'https://www.genome.gov/Pages/Education/DNADay/TeachingTools/MakingSNPsMakeSense.html', 'genome_gov', 874, 315);
[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2851604/
[4] https://www.genome.gov/Pages/Education/DNADay/TeachingTools/MakingSNPsMakeSense.html
[5] https://pngu.mgh.harvard.edu/~purcell/plink/
[6] https://atgu.mgh.harvard.edu/plinkseq/install.shtml#rlib