8.2 - Uses of Tabular Data in Bioinformatics


We will use these types of tests many times throughout this course, in particular for

  • gene set enrichment analysis,
  • RNA-seq differential expression analysis, and
  • SNP frequency analysis.

There are some differences in how these tests are used. However, in all cases many tests are done at once, so multiple testing adjustments must also be applied.
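As one illustration of a multiple testing adjustment, the sketch below implements the Benjamini-Hochberg step-up procedure on a list of p-values. The choice of Benjamini-Hochberg is an assumption made for illustration only; the text above does not name a specific method, and the function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Illustrative sketch only: assumes independent (or positively dependent)
    tests, which does not hold for every application in this section.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values from four hypothetical tests.
example_p = [0.01, 0.04, 0.03, 0.005]
example_adjusted = bh_adjust(example_p)
```

Each adjusted value is compared with the desired false discovery rate in place of the raw p-value.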

In gene set enrichment analysis we ask whether a list of features (usually obtained from a literature search or an upstream analysis such as differential expression analysis) appears to be enriched or depleted for members of some known list (e.g. genes known to be involved in oxidative stress). One margin of the table is whether the feature was or was not selected from a reference set of features. The other margin is whether the feature is or is not on the known list. Often the known lists are in some type of nested arrangement - for example, in gene ontology analysis the ontology terms become more and more specific, so the feature lists for the more specific terms are subsets of the lists for the more general terms. Multiple testing adjustments need to take the nesting into account.
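The 2 by 2 table described above can be sketched as follows. This is a minimal illustration, assuming the selected list, the known list, and the reference set are given as collections of feature identifiers. The function names and the gene identifiers are hypothetical, and the one-sided Fisher's exact p-value is computed directly from the hypergeometric distribution.

```python
from math import comb


def enrichment_table(selected, known, reference):
    """Rows: selected / not selected. Columns: on known list / not on it."""
    reference = set(reference)
    selected = set(selected) & reference  # ignore features outside the reference set
    known = set(known) & reference
    a = len(selected & known)   # selected and on the known list
    b = len(selected - known)   # selected, not on the known list
    c = len(known - selected)   # not selected, on the known list
    d = len(reference) - a - b - c
    return [[a, b], [c, d]]


def fisher_enrichment_p(table):
    """One-sided Fisher's exact p-value for enrichment, P(X >= a)."""
    (a, b), (c, d) = table
    n_selected, n_known, n_total = a + b, a + c, a + b + c + d
    denom = comb(n_total, n_selected)
    upper = min(n_selected, n_known)
    return sum(
        comb(n_known, k) * comb(n_total - n_known, n_selected - k)
        for k in range(a, upper + 1)
    ) / denom


# Hypothetical identifiers, for illustration only.
reference = [f"g{i}" for i in range(1, 21)]
selected = ["g1", "g2", "g3", "g4", "g5"]
known = ["g1", "g2", "g3", "g4", "g10"]
table = enrichment_table(selected, known, reference)
p_enriched = fisher_enrichment_p(table)
```

In practice one such table is built for every known list, which is why the resulting p-values need a multiple testing adjustment.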

In RNA-seq differential expression analysis we look at how many reads from the library are mapped to a specific feature, and how many are not, for each of our treatments. However, just as in microarray differential expression, to draw biological conclusions we need biological replication. Since the library sizes and the proportion of reads from each feature vary from sample to sample, we cannot simply add across the samples to obtain a single table. We will use a method that can be considered either an extension of t-tests or an extension of Fisher's exact test to handle this situation.
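To make the per-sample tables concrete, the sketch below builds the "mapped to the feature / not mapped" counts for each sample and compares the per-sample proportions with the proportion obtained by pooling. The counts and library sizes are made up for illustration, and the function names are hypothetical; this is not the analysis method itself.

```python
def feature_table(feature_counts, library_sizes):
    """One row per sample: [reads mapped to the feature, reads not mapped to it]."""
    return [[k, n - k] for k, n in zip(feature_counts, library_sizes)]


def proportions(table):
    """Per-sample proportion of reads mapped to the feature."""
    return [mapped / (mapped + other) for mapped, other in table]


# Made-up counts for two replicates of one treatment.
counts = [50, 200]
library_sizes = [1_000_000, 2_000_000]

per_sample = feature_table(counts, library_sizes)
per_sample_props = proportions(per_sample)

# Pooling adds the rows; the result is a library-size-weighted average
# that hides the sample-to-sample variation.
pooled_prop = sum(counts) / sum(library_sizes)
```

Here the two replicates differ by a factor of two in the proportion of reads from the feature, and the pooled table carries no trace of that variability.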

In SNP frequency analysis there is a 2 by 3 table for each locus with a genetic variant. The margin with three entries is the number of minor alleles (0, 1, or 2; alternatively, genotypes aa, aA, or AA). The other margin is the treatment. In this case, either Fisher's exact test or the chi-squared test is appropriate for each table. However, the tables are correlated due to linkage disequilibrium between nearby loci. This means that appropriate adjustments must be made for multiple testing with correlated data.
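For a single locus, the chi-squared test on the 2 by 3 table can be sketched as below. The genotype counts are made up for illustration and the function name is hypothetical. The sketch assumes every row and column total is positive, and it uses the fact that a 2 by 3 table has 2 degrees of freedom, for which the chi-squared upper tail probability is exp(-x/2).

```python
from math import exp


def chi2_2x3(table):
    """Pearson chi-squared statistic and p-value for a 2 by 3 table.

    Rows are treatments; columns are the number of minor alleles (0, 1, 2).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Degrees of freedom = (2 - 1) * (3 - 1) = 2, so the tail is exp(-stat / 2).
    return stat, exp(-stat / 2)


# Made-up genotype counts for two treatments at one locus.
snp_table = [[30, 15, 5],
             [20, 20, 10]]
snp_stat, snp_p = chi2_2x3(snp_table)
```

The same calculation would be repeated at every variant locus, giving one p-value per table before any adjustment for the correlation between nearby loci.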