12.9 - Statistical Testing in GWAS


In a GWAS, a test is usually done for every SNP (or other genotyped marker).  Several tests are available.

In the simplest case, we have a categorical phenotype with two categories.  Together with the 3 genotypes, this creates a 2x3 table.  The counts in the table are the numbers of samples in the study with a particular genotype and phenotype combination.  


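\[
\begin{array}{l|ccc|c}
 & \text{AA}=0 & \text{Aa}=1 & \text{aa}=2 & \text{Total} \\ \hline
\text{healthy} & N_{11} & N_{12} & N_{13} & R_{1} \\
\text{disease} & N_{21} & N_{22} & N_{23} & R_{2} \\ \hline
\text{Total} & & & &
\end{array}
\]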
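As an illustration of how the counts arise, here is a minimal sketch that tabulates per-sample data into the 2x3 layout. The sample list and the function name `genotype_table` are hypothetical, chosen only for this example.

```python
from collections import Counter

# Hypothetical per-sample data: (phenotype, genotype code), where the code
# is 0 for AA, 1 for Aa and 2 for aa, as in the table header.
samples = [
    ("healthy", 0), ("healthy", 0), ("healthy", 1), ("healthy", 2),
    ("disease", 0), ("disease", 1), ("disease", 1), ("disease", 2),
    ("disease", 2),
]

def genotype_table(samples):
    """Return the counts as [[N11, N12, N13], [N21, N22, N23]]."""
    counts = Counter(samples)
    return [[counts[(phenotype, code)] for code in (0, 1, 2)]
            for phenotype in ("healthy", "disease")]

table = genotype_table(samples)
row_totals = [sum(row) for row in table]   # R1 and R2
print(table, row_totals)
```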

If the samples are independent (e.g. they are unrelated individuals) and there are no population structure and no covariates, then Fisher's exact test or a chi-squared test can be used to determine whether the phenotype is associated with the genotype.
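The chi-squared test for a 2x3 table can be sketched in a few lines. This is an illustrative implementation, not the routine used by any particular package: the counts are made up, the function name `chi_squared_2x3` is hypothetical, and every genotype column is assumed to have a positive total. With 2 degrees of freedom the chi-squared upper tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math

def chi_squared_2x3(table):
    """Pearson chi-squared test of independence for a 2x3 count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if phenotype and genotype were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # (2-1)*(3-1) = 2 degrees of freedom: P(X > x) = exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Made-up counts: rows are healthy and disease, columns are AA, Aa, aa.
table = [[60, 30, 10], [40, 40, 20]]
stat, p_value = chi_squared_2x3(table)
print(f"chi-squared = {stat:.3f}, p = {p_value:.4f}")
```

For small expected counts the large-sample approximation is unreliable, which is when Fisher's exact test is the usual choice.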

Another commonly used test (again for independent samples and no population structure) is the Cochran-Armitage test:

\[T(t)=\sum_{i=1}^{3} t_i(N_{1i}R_{2}-N_{2i}R_{1})\]

The term \(N_{1i}R_{2}-N_{2i}R_{1}\) is essentially the difference in counts between the rows for genotype \(i\), after reweighting to equalize the row totals.  (To see this, note that \(\sum_{i=1}^{3} N_{1i}R_{2}=\sum_{i=1}^{3} N_{2i}R_{1}=R_{1}R_{2}\).)  The weights \(t_i\) are selected depending on the pattern you want to test for.  E.g. if you hypothesize that the A allele is dominant, then the weights are \(t_1=t_2=1, t_3=0\).  If you hypothesize that the effects of A and a are additive, then the weights are \(t_1=1, t_2=2, t_3=3\); because the reweighted differences sum to zero, any equally spaced weights, such as the genotype codes 0, 1, 2, give the same value of \(T(t)\).  Other patterns are also possible and can be tested using different weights.
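The statistic \(T(t)\) can be computed directly from the table. The sketch below also standardizes it with the commonly quoted large-sample variance under no association, \(\frac{R_1R_2}{N}\left(N\sum_i t_i^2C_i-(\sum_i t_iC_i)^2\right)\), where \(C_i\) are the column totals and \(N\) is the grand total; that variance formula is not given in the text above and is an assumption of this example, as are the made-up counts and the function name `cochran_armitage`.

```python
import math

def cochran_armitage(table, weights):
    """Return (T, z, two-sided p) for a 2x3 table and weights t_1..t_3."""
    healthy, disease = table
    r1, r2 = sum(healthy), sum(disease)
    n = r1 + r2
    col_totals = [h + d for h, d in zip(healthy, disease)]
    # T(t) = sum_i t_i * (N_1i * R_2 - N_2i * R_1)
    t_stat = sum(t * (h * r2 - d * r1)
                 for t, h, d in zip(weights, healthy, disease))
    s1 = sum(t * c for t, c in zip(weights, col_totals))
    s2 = sum(t * t * c for t, c in zip(weights, col_totals))
    variance = r1 * r2 / n * (n * s2 - s1 * s1)   # assumed variance formula
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return t_stat, z, p_value

table = [[60, 30, 10], [40, 40, 20]]              # made-up counts
additive = cochran_armitage(table, (1, 2, 3))
shifted = cochran_armitage(table, (0, 1, 2))      # same test, shifted weights
dominant = cochran_armitage(table, (1, 1, 0))
print(additive, shifted, dominant)
```

Running it with weights (1, 2, 3) and (0, 1, 2) gives the same \(T\) and the same z, since adding a constant to all weights changes neither the numerator nor the variance.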

When the samples are related, or there is population structure or there are environmental covariates, regression models are more flexible than methods for tables.  For binary traits, as in the table above, logistic regression models the probability of one phenotype (relative to the other) and provides a very flexible framework similar to the linear model.  When the trait is quantitative, ordinary linear models can be used.  In either case the genotype can be treated as categorical (using indicator variables as the predictors) or as ordinal (using the codes 0, 1, 2 as numerical values).
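To make the regression formulation concrete, here is a minimal sketch of a logistic regression with the genotype entered as an ordinal predictor (codes 0, 1, 2), fitted by Newton-Raphson on grouped counts. It is a bare-bones illustration with no covariates, no convergence check and made-up counts; the function names are hypothetical, and in practice a statistics package would be used.

```python
import math

def score_and_information(b0, b1, totals, disease):
    """Gradient and 2x2 information matrix of the binomial log-likelihood."""
    g0 = g1 = i00 = i01 = i11 = 0.0
    for code, (n, d) in enumerate(zip(totals, disease)):   # code = 0, 1, 2
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * code)))      # P(disease | code)
        g0 += d - n * p
        g1 += code * (d - n * p)
        w = n * p * (1.0 - p)
        i00 += w
        i01 += w * code
        i11 += w * code * code
    return (g0, g1), (i00, i01, i11)

def fit_logistic_additive(healthy, disease, n_iter=25):
    """Fit logit P(disease | code) = b0 + b1 * code; return b0, b1, se(b1)."""
    totals = [h + d for h, d in zip(healthy, disease)]
    b0 = b1 = 0.0
    for _ in range(n_iter):
        (g0, g1), (i00, i01, i11) = score_and_information(b0, b1, totals, disease)
        det = i00 * i11 - i01 * i01
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    _, (i00, i01, i11) = score_and_information(b0, b1, totals, disease)
    se_b1 = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    return b0, b1, se_b1

healthy, disease = [60, 30, 10], [40, 40, 20]   # made-up counts
b0, b1, se_b1 = fit_logistic_additive(healthy, disease)
wald_z = b1 / se_b1
print(f"slope = {b1:.3f}, se = {se_b1:.3f}, Wald z = {wald_z:.2f}")
```

Treating the genotype as categorical instead would replace the single slope with two indicator-variable coefficients, one each for two of the three genotypes.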

The best software I am aware of for GWAS is PLINK.  Although PLINK is stand-alone software, the authors also provide a link to R called Rplinkseq.  The authors state: "Rplinkseq is an R package that allows access to PLINK/Seq projects directly from R, so that R's rich set of statistical and visualisation tools can be utilised."  PLINK can handle haplotyping, filtering and all of the currently popular models for GWAS analysis.  However, data management such as filtering and selecting samples or features is probably best done in R.

One problem in GWAS is that multiple testing has not been entirely worked out.  This is because the multiple testing methods that we know work require independence among the tests.  However, because of linkage disequilibrium (LD), if you use a dense set of SNPs, the correlations among the tests can be high.  Haplotyping can combine multiple SNPs into a smaller number of more complex genotypes (possibly with more than 2 alleles).  This usually improves the analysis: the haplotypes tend to have higher association with the phenotype, there are fewer features to compare, and LD among features is reduced.

In QTL studies, the genotypes are assumed to be markers of the causal loci, rather than being causal themselves.  This takes advantage of LD, as markers more strongly correlated with the causal regions should have stronger association with the phenotype.  Researchers take advantage of the resulting correlation among the p-values of neighboring markers and plot -log10(p-value) against physical position on the chromosome in a "Manhattan plot".  The x-axis of this plot is the chromosomal position of each feature, ordered by chromosome number (and usually color coded so that it is easy to see which features are on which chromosome).  The y-axis is -log10(p-value), a transformation that emphasizes the small p-values, which are the ones of interest.  "Real" QTLs are assumed to be indicated by a local cluster of small p-values, which shows up as a peak in the plot.
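The coordinates of a Manhattan plot can be sketched as follows. Chromosomes are laid end to end along the x-axis and each p-value is transformed to -log10(p). The chromosome lengths, positions and p-values below are made up, and the function name `manhattan_coordinates` is hypothetical; the resulting points could be passed to any plotting library, coloring by chromosome.

```python
import math

def manhattan_coordinates(results, chrom_lengths):
    """Map (chromosome, position, p-value) records to plot coordinates.

    results: list of (chromosome number, position in bp, p-value)
    chrom_lengths: dict mapping chromosome number to its length in bp
    Returns a list of (x, y, chromosome) ordered along the genome.
    """
    offsets = {}
    running_total = 0
    for chrom in sorted(chrom_lengths):
        offsets[chrom] = running_total     # where this chromosome starts on x
        running_total += chrom_lengths[chrom]
    points = []
    for chrom, position, p_value in sorted(results):
        x = offsets[chrom] + position
        y = -math.log10(p_value)           # small p-values become tall points
        points.append((x, y, chrom))
    return points

# Made-up example: two short "chromosomes" and three tested features.
chrom_lengths = {1: 1000, 2: 800}
results = [(2, 100, 1e-8), (1, 500, 0.05), (1, 900, 1e-3)]
points = manhattan_coordinates(results, chrom_lengths)
for x, y, chrom in points:
    print(f"chr{chrom}: x = {x}, -log10(p) = {y:.2f}")
```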
