12.2 - Genotyping with SNP Microarrays


Although sequencing technologies are essential to SNP detection, microarrays will continue to be used for genotyping for at least the near future because they are cheaper to use than sequencing and data management is simpler.  Usually we're using SNPs as markers so we don't really care if we miss a SNP or two.  

There are a number of SNP microarrays commercially available.  The standard human arrays have from 500 thousand to 2 million SNPs.  However, when the genes likely to be involved in a disease process are known, it is also possible to create smaller microarrays targeting only SNPs in those genes.  Even if the standard arrays are used, the statistical analysis is often limited to a smaller preselected set in order to avoid loss of power when doing multiple testing adjustments.  Adjusting for a few hundred SNPs in a smaller set of preselected genes gives a big improvement in power compared to adjusting for 2 million SNPs.
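To see how large the difference can be, here is a minimal sketch that assumes a Bonferroni adjustment (the text above does not say which multiple testing adjustment is used) and a hypothetical preselected set of 300 SNPs. The function name and the numbers are illustrative only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-SNP p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

alpha = 0.05  # assumed overall error rate

# All SNPs on a 2 million SNP array versus a hypothetical preselected set of 300.
genome_wide = bonferroni_threshold(alpha, 2_000_000)
candidate = bonferroni_threshold(alpha, 300)

# How many times less stringent the per-SNP cutoff is for the small set.
ratio = candidate / genome_wide

print(f"cutoff with 2 million SNPs: {genome_wide:.2e}")
print(f"cutoff with 300 SNPs:       {candidate:.2e}")
print(f"ratio of cutoffs:           {ratio:.0f}")
```

Under these assumptions, a SNP in the preselected set can be declared significant at a p-value several thousand times larger than would be needed on the full array, which is where the gain in power comes from.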

SNP microarrays are designed with a probe for each allele. This may include both sense and anti-sense probes. In principle, genotype calling is straightforward. Homozygotes should have high intensity on one allele (called "2") and low intensity on the other (called "0"). Heterozygotes should have moderate intensity on both alleles (each called "1").  On some platforms, only one of the alleles is used as a probe and the hybridization intensity is partitioned to call 0, 1 or 2.
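The idealized calling rule for a two-probe platform can be sketched as follows. This is only an illustration of the principle: the function name, the cutoffs on the allele fraction, and the minimum total intensity are hypothetical, and real calling software works on normalized data with more careful models.

```python
def call_genotype(intensity_a, intensity_b, low=0.25, high=0.75, min_total=100.0):
    """Return the (allele A call, allele B call) pair, or None for a no-call.

    The cutoffs low, high and min_total are hypothetical values chosen
    for illustration; they are not taken from any platform.
    """
    total = intensity_a + intensity_b
    if total < min_total:
        return None  # too little signal on both probes: treat as missing
    frac_a = intensity_a / total  # share of the signal coming from allele A
    if frac_a >= high:
        return (2, 0)  # homozygous for allele A
    if frac_a <= low:
        return (0, 2)  # homozygous for allele B
    return (1, 1)      # moderate signal on both alleles: heterozygous

# Made-up intensities for four samples at one SNP.
print(call_genotype(1800.0, 150.0))   # (2, 0)
print(call_genotype(900.0, 1000.0))   # (1, 1)
print(call_genotype(100.0, 2000.0))   # (0, 2)
print(call_genotype(20.0, 30.0))      # None
```

Note that the two calls in each pair always sum to 2, which is the constraint on the calls for the two alleles.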

In actuality, the data need to be normalized. The different nucleotides have different binding energies which need to be accounted for. You might also have to make adjustments for "fragment length". Normalization methods used for gene expression are not suitable for this, because of the smaller dynamic range of the arrays (i.e. 0, 1 or 2) and because of the known relationship between the calls for the 2 alleles (the only possible combinations are (0,2), (1,1) or (2,0)).

The call and its reliability are usually recorded for downstream use. More reliable calls are obtained from homozygous sites because the difference between the present and absent alleles is obvious. When both alleles are present, we expect equal signal from the two probes, but due to noise this is rarely what is observed.

Missing data are frequent on genotyping microarrays.  Occasionally, most of the data are missing for a feature - this usually means that the call reliability is poor due to technical problems.  More frequently, there are sporadic missing features for several SNPs on each array.  This may be due to biases introduced during sample preparation or during hybridization, or due to genomic variation such as missing genomic segments in an individual.
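A simple way to see both kinds of missingness is to compute the fraction of missing calls per SNP and per sample. The sketch below uses a small made-up matrix of SNPs (rows) by samples (columns), with None marking a missing call; the 0.5 cutoff for dropping a SNP is a hypothetical value, not a recommended one.

```python
def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

# Made-up genotype calls: rows are SNPs, columns are samples.
genotypes = [
    [0, 1, 2, None],        # one sporadic missing call
    [None, None, None, 1],  # mostly missing: likely a technical problem
    [2, 2, 1, 0],           # complete
]

# Missingness for each SNP (row) and for each sample (column).
snp_missing = [missing_rate(row) for row in genotypes]
sample_missing = [missing_rate(col) for col in zip(*genotypes)]

# Keep only SNPs whose missing rate is at or below a hypothetical cutoff.
max_snp_missing = 0.5
kept_snps = [i for i, rate in enumerate(snp_missing) if rate <= max_snp_missing]

print(snp_missing)   # [0.25, 0.75, 0.0]
print(kept_snps)     # [0, 2]
```

Here the second SNP is dropped because most of its calls are missing, while the sporadic missing call in the first SNP is tolerated.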

After you acquire the SNP data as a data matrix of SNPs by samples, there are several things that you might do next, including:

  • haplotyping,
  • filtering SNPs [1],
  • determining population structure,
  • Genome-Wide Association Studies (GWAS).

References

[1] GWAS Data Cleaning (see readings)