The large number of markers, such as SNPs, now available creates a large multiple testing problem.  Haplotyping can reduce the number of markers somewhat; if haplotyping has been done in a study, just replace "SNP" with "haplotype" in the discussion that follows.

To reduce the number of markers, filtering is often done to remove uninformative SNPs.  The most obvious filter is to remove SNPs with poor quality data (usually recognized because no variant call can be made for many samples).  Another obvious filter is to remove SNPs with identical calls in all the samples (since these constant genotypes cannot contribute to variation in a phenotype).
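As a rough illustration of these two filters, the sketch below assumes genotypes are stored as a dictionary mapping a SNP identifier to a list of calls coded 0, 1 or 2, with `None` for a missing call.  The function name `filter_snps`, the data layout and the 5% missing-call threshold are illustrative assumptions, not part of any particular software package.

```python
def filter_snps(genotypes, max_missing=0.05):
    """Return the SNP ids that pass the quality and variability filters.

    genotypes: dict mapping SNP id -> list of calls (0, 1, 2, or None for no call).
    max_missing: largest tolerated fraction of missing calls (an assumed cutoff).
    """
    kept = []
    for snp, calls in genotypes.items():
        if not calls:
            continue  # no samples at all: nothing to analyze
        observed = [c for c in calls if c is not None]
        missing_fraction = 1 - len(observed) / len(calls)
        if missing_fraction > max_missing:
            continue  # poor quality: too many samples without a call
        if len(set(observed)) < 2:
            continue  # constant genotype: cannot contribute to phenotypic variation
        kept.append(snp)
    return kept


example = {
    "snp1": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],           # informative
    "snp2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],           # identical calls in all samples
    "snp3": [0, None, None, 1, None, 2, 1, 0, 1, 1],  # 30% of calls missing
}
print(filter_snps(example))  # ['snp1']
```

In practice the missing-call cutoff is a study-specific choice; the value used here is only a placeholder.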

Other types of filtering are driven by biological considerations.  The most obvious of these is limiting the analysis to SNPs in predefined genomic regions such as genes or genomic segments already identified as being associated with the phenotype.

Another type of filtering is based on biological function.  For example, only SNPs in coding regions may be used.  This is the basis for a sequencing technology called exome sequencing, which captures and sequences only the exons.  However, this method can also be used with SNP arrays, either by printing only probes for exonic SNPs or by filtering the probes to use only exonic SNPs in the analysis.

Nucleotides in the exons code for amino acids, the building blocks of proteins.  A set of 3 nucleotides called a codon codes for an amino acid.  Since there are 4 different nucleotides, there are 4^3 = 64 codons, but only 20 amino acids (or up to 22 if rare ones such as selenocysteine and pyrrolysine are counted) are incorporated into proteins.  As a result, many amino acids are encoded by several different codons, which are said to be synonymous.  Synonymous SNPs are often filtered because they do not change the protein sequence.
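The codon arithmetic can be checked directly.  The sketch below builds the standard genetic code as a lookup table (written in the DNA alphabet, with `*` marking a stop codon) and uses it to decide whether two codons are synonymous.  The helper name `is_synonymous` is an illustrative assumption, and the table covers only the 20 standard amino acids.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (last base varies fastest); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(ref_codon, alt_codon):
    """True if both codons translate to the same amino acid (or both are stops)."""
    return CODON_TABLE[ref_codon.upper()] == CODON_TABLE[alt_codon.upper()]


print(len(CODON_TABLE))               # 64 codons
print(len(set(AMINO_ACIDS) - {"*"}))  # 20 standard amino acids
print(is_synonymous("CTT", "CTC"))    # True: both code for leucine
print(is_synonymous("GAT", "GAA"))    # False: aspartate vs. glutamate
```

A filter for synonymous SNPs would apply such a check to the reference and variant codons at each exonic SNP, which requires knowing the reading frame from the gene annotation.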

Using only exonic SNPs and excluding synonymous SNPs assumes phenotypic effects due to gene regulation are negligible.  As we tackle more complex diseases, these strategies may change.  

Since SNPs in LD are correlated, when SNPs are used as markers (rather than as the causal agents) it is common to select only a few SNPs from each highly associated region.  This reduces both the number of markers and the association among them.
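One simple way to do this selection is greedy pruning on the squared correlation between genotype vectors.  The sketch below is only one possible approach, not necessarily the one used in any given study: it assumes genotypes coded 0, 1 or 2 with no missing calls, and the names `r_squared` and `prune_by_ld` and the 0.8 threshold are illustrative assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors of equal length."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant SNP is uncorrelated with everything
    return cov * cov / (var_x * var_y)


def prune_by_ld(genotypes, threshold=0.8):
    """Keep a SNP only if its r^2 with every already-kept SNP is below threshold."""
    kept = []
    for snp, calls in genotypes.items():
        if all(r_squared(calls, genotypes[k]) < threshold for k in kept):
            kept.append(snp)
    return kept


example = {
    "snpA": [0, 1, 2, 1, 0, 2],
    "snpB": [0, 1, 2, 1, 0, 2],  # perfectly correlated with snpA
    "snpC": [2, 0, 1, 0, 2, 1],  # only weakly correlated with snpA
}
print(prune_by_ld(example))  # ['snpA', 'snpC']
```

The result depends on the order in which SNPs are visited, and real pruning is normally restricted to SNPs that are physically close on the chromosome rather than applied to all pairs.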

SNPs on the X-chromosome are often removed from the study because females have two copies while males have only one, and males cannot be heterozygous for the genes that are not also on the Y-chromosome.  However, as we know, there are a number of sex-linked diseases.  For these diseases, we will likely want to include genotypes on the X and Y chromosomes, keeping in mind the differences between the sexes as well as the fact that annotation of these chromosomes is poorer than that of the autosomal chromosomes.

Samples are often filtered as well.  Sometimes a sample has many missing SNP calls; this is likely due to a technical problem.  Such samples are removed or re-genotyped.

Another type of sample filtering is done based on the X and Y chromosomes.  The most basic filter is to remove female samples that have more than a few non-missing SNPs on the Y chromosome (a small number might be due to poor SNP calls and can be tolerated).  Similarly, a few "heterozygous" calls for the X chromosome for males can be tolerated but many such calls indicate a problem with the sample.  Anomalous SNPs on the X and Y chromosome can be correct - a number of human conditions with anomalous numbers of X and Y chromosomes exist, some of which might not be known to the subject.  However, unless a study specifically targets individuals with anomalous sex chromosomes, it is usually safest to assume that they may have a range of phenotypes which differ from the main population.  Similarly, if detected, individuals with other major chromosomal anomalies such as extra, missing or translocated chromosomes are generally not used for genotyping studies.
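The basic sex-chromosome check described above can be sketched as follows.  The sketch assumes each sample has a recorded sex ("F" or "M") and lists of X and Y chromosome calls coded 0, 1 or 2 (1 meaning heterozygous), with `None` for a missing call.  The function name `passes_sex_check` and the tolerance of 2 anomalous calls are illustrative assumptions; what counts as "a few" depends on the number of SNPs on the platform.

```python
def passes_sex_check(sex, x_calls, y_calls, max_y_calls=2, max_x_het=2):
    """Return True if the sex-chromosome calls are consistent with the recorded sex.

    sex: "F" or "M" as recorded for the sample.
    x_calls, y_calls: lists of calls (0, 1, 2, or None for no call).
    max_y_calls: tolerated non-missing Y calls in a female (assumed cutoff).
    max_x_het: tolerated heterozygous X calls in a male (assumed cutoff).
    """
    if sex == "F":
        # Females should have essentially no callable Y chromosome SNPs.
        n_y = sum(c is not None for c in y_calls)
        return n_y <= max_y_calls
    if sex == "M":
        # Males carry one X, so heterozygous X calls should be rare.
        n_het = sum(c == 1 for c in x_calls)
        return n_het <= max_x_het
    raise ValueError(f"unrecognized sex code: {sex!r}")


# A female sample with a single stray Y call is tolerated.
print(passes_sex_check("F", [0, 1, 2, 1], [None, None, 0, None]))  # True
# A male sample with many heterozygous X calls is flagged.
print(passes_sex_check("M", [1, 1, 1, 0, 1], [0, 2, 0, 2]))        # False
```

A flagged sample is not necessarily a genotyping failure, for the reasons given above, so it is worth recording why a sample was removed rather than discarding it silently.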