Lesson 11: Gene Set Analysis

Key Learning Goals for this Lesson:

Understanding a feature list
Understanding how to match a feature list to a reference set
Introduction to KEGG and GO databases

Introduction

High throughput results often lead to a long list of “significant” features. This list might be derived from a gene expression study or by looking at protein binding sites, genetic variations, sequence analysis etc.

Gene set enrichment analysis is a method for validating and interpreting the list by matching its elements to reference sets that are relevant to the problem. For example, if you're looking at a gene list from a study of depression, it would be really exciting if many of the significant features were associated with neurotransmitters. Alternatively, it would be interesting to know if features associated with neurotransmitters differed significantly between depressed and normal individuals. Unfortunately, usually the story is not that clear, and gene set analysis is only one of the many clues we use.

If we have only a few features on the list, we could do a literature search to determine what is already known about each feature. Of, if we have the laboratory resources, we could perform functional studies using knockout strains, silencing mechanisms, upregulating mechanisms, etc. Unfortunately, even with 10 genes is very expensive to do functional studies.

When we have larger lists, one approach is to leverage information already available in large web-based databases. For example, if we are doing a study of gene expression associated with depression, we might look at biological pathways diagrams for the known neurotransmitters serotonin, norepinephrine and dopamine. A biological pathway diagram might show a network of proteins that regulate each of these transmitters, or a gene regulatory network involved in the manufacture or activation of these transmitters in the brain. These diagrams are often constructed by painstaking "manual curation" from experimental work combined with literature search. In any case, pathways databases such as KEGG [1] provide a repository of diagrams along with important information such as the names of the features in the diagram. For example a pathway for serotonin is:

Pathway for serotonin

[ ENLARGE [2] ] - [ Original Image [3] ] Image used with permission of KEGG.

Gene set enrichment analysis works by creating a 2-way table of counts, where the "universe" of features is the set from which our list was selected. For example, if we selected all the differentially expressed genes from a microarray experiment, our universe would be all the genes which are printed on the microarray. For an RNA-seq experiment, it might be all the genes detected in the samples. If we compile the list from a literature search, it might be all known genes in the organism. The reference set would be all the features in our universe that are in the pathway.

	features in the diagram	features not in the diagram
features in the list
features not in the list

We then do Fisher's exact test or the chi-squared test to determine if the number of features both on the list and in the diagram are what would be expected if we selected features at random from the universe, or if there is significant association between the list and the diagram. Often we are interested in both over-representation and under-representation. We might also do this test separately for up- and down-regulated genes.

Another question we might ask is, "Do the features in the reference set behave differently from those not in it?" For example, we might take the features in our reference set that are also in the feature universe as one population and the features not in the reference as another population. We could then do a t-test to determine if the mean fold changes are the same in both populations. [1] Another idea is to test if the distribution of fold changes is the same in both populations.[2]

Often we have numerous diagrams or other sets of features to which we will match our list, so we need to consider multiple testing adjustments.

References

[1] Bussemaker, Harmen J., Lucas D. Ward, and Andre Boorsma. "Dissecting complex transcriptional responses using pathway-level scores based on prior information." BMC bioinformatics 8.Suppl 6 (2007): S6.

[2] Subramanian, Aravind, et al. "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles." Proceedings of the National Academy of Sciences of the United States of America 102.43 (2005): 15545-15550.

11.1 - Reference Sets

The feature list comes from our study - it may be the list of differentially expressing features, features with genetic variants, or other feature we have identified. Where do the reference sets come from? There are three main sources of reference sets: search through previous results, pathway diagrams and gene ontologies.

Previous Studies

Sometimes previous studies of the same organism or the same system in a different organism has already indicated a set of interesting features. If it is the same system in a different organism, then your reference set will be homologs of the features in your own study. In a well-studied organism or system there may already be well-curated reference sets. In this situation, typically there is only one reference set and a single test is done.

Pathways and Networks

Over the years, biologists have painstakingly put together biological pathways and networks that show how genes and/or proteins or other cellular processes fit together. These have usually been built manually using data from many sources, although there are now computational tools which can assist in inferring the structure of the network. Network and pathway diagrams may be available from individual papers and/or websites. However, there are also a number of repositories that have collected and annotated the pathways and networks as both diagrams and as machine-readable lists of features involved in the pathway. Most importantly, these repositories impose a uniform annotation on the network which allows the user to search the network and find all the relevant entries.

The network database keeps track of the tags associated with the network diagrams and the list of genes or proteins in the diagrams. These lists are used as reference sets. The reference sets are the genes in the network, or core nodes in the pathway.

The Bioconductor software provides an interface to the most commonly used repositories such as KEGG [1].

You may be interested in only a single pathway or network, several partitions of a single network, or multiple networks. For example, a search of KEGG for the keyword "serotonin" found 15 networks. If gene set analysis is done using multiple reference sets, multiple testing adjustments should be done.

Gene Onotology (GO)

An ontology is a standardized vocabulary for a field. For example, there is an ontology of medical diagnoses so that anyone reading a medical chart understands the diagnoses.

Once high throughput analyses became available to biology, it quickly became evident that a gene ontology was required to simplify searches and standardize annotation. For example, when annotating the genome, the terms "putative gene, "presumed gene" and "possible gene" were all used to describe sequences for which no transcript or protein had been confirmed, but which resembled coding regions. For genes with actual confirmed transcripts, multiple descriptions are available.

The Gene Ontology Consortium [4] is an organization of scientists who develop the standard terms as well as methodology to annotation genomes using these terms, and tools for retrieving information. Three ontologies are maintained for biological processes (BP), cell components (CC) and molecular functions (MF). As well, important information such how the annotation was determined is retained with the annotation. For example, some terms are applied to a gene due to experimental results when the gene is up or down-regulated; other terms are applied due to sequence homology with a gene in another species with known function in the other species.

Many species are represented in the Gene Ontology database. The ontology for an organism is constructed by an organism consortium. When a new genome comes out, for instance, interested scientists may volunteer to create an ontology based on existing ontologies, and then assigning the terms to the features of the new genome.

The ontology's are organized as graphs. Each node is a term. The leaf nodes are more specific instances of the term. For example, a term might be "response to stress" and the leaf nodes might be "response to DNA damage" and "response to heat shock". However, a leaf node might have multiple parents, so the graph is not necessarily a tree. Here's an example of a Go graph. It is only a small piece of the biological process:

GO graph

(from P. Hu, Computational prediction of cancer-gene function [5]. Nature Reviews Cancer 7, 23-34 (January 2007), used with permission.)

The nodes of the graph are not independent. Any gene assigned to a node in the graph is also assigned to its parent nodes. Each node is associated with a set of genes that have this term in their annotation. The gene set at any node is a subset of the gene set at any parent node.

There is also reduced ontology that people use to produce pie charts and bar charts that are often seen in papers. This is called GOSlim. GOSlim cuts the tree at one level, so that each gene is assigned to a unique node.

There are two main ways to use the information in GO for gene set analysis.

Firstly, you can download a small number of categories, usually with GOSlim and then test those categories. Usually you would select categories so that do not overlap too much, so that you can assume the tests are independent. Ordinarily a chi-squared test is performed in each category but you should include multiple testing adjustments.

Another mode of using GO for gene set analysis is to select one of the 3 ontologies and test at every node. There are several software packages available for this - we will use one of the Bioconductor packages. Since the reference set at each node is nested within the set defined by the parent node, this leads to multiple correlated tests. Multiple testing is complicated due to lack of independence among the tests. The software we will use works in two modes - the unconditional mode reports the significant tests at every node. The conditional mode reports only the child node if the statistical significance of the parent can be attributed to enrichment or depletion of the child.

11.2 - Example: Primate Brain Data

The example of gene set analysis uses the primate brain data from Homework 5. We also use these data in a computing lab to illustrate clustering and gene set analysis.

Recall that the data are 3 biological replicates of human and chimpanzee brain. We used 4 distinct regions of each brain. The mRNA was hybridized to a human Affymetrix gene expression microarray, and differential expression analysis was performed using LIMMA. The gene list is the set of 200 genes with the smallest p-value from the overall F-test that all means are equal.

Gene Set Analysis with GO

In principle, our "universe" of gene would be all genes in the human genome. However, not all genes have probes on the microarray, and of the probes on the microarray, not all have an annotation in the GO database. Therefore the universe has to be reduced to genes represented by probes on the microarray which have a GO annotation. As well, not all of the 200 genes selected for our gene list have a GO annotation. The gene list also has to be reduced to a usable set. This reduced the gene universe to 7865 genes and the gene list to 160 genes.

We ran the GOStat package in Bioconductor on this gene list, using the Biological Process ontology for the human genome. We considered only nodes that are enriched for our gene list (i.e. that have more genes than expected from our gene list). We arbitrarily decided on p<0.001 as the cut-off for statistical significance, as the package does not have a formal method for multiple testing adjustment. Using the conditional test, this produced 47 significant nodes. Many of the annotations on this list refer to neural function or growth of neural cells. Here are the first few entries on this list of significant nodes:

##       GOBPID       Pvalue OddsRatio  ExpCount Count Size Term
## 1 GO:0007268 2.899098e-10  4.163969 10.222222    34  500  synaptic transmission
## 2 GO:0007399 9.917320e-10  2.945622 26.496000    58 1296  nervous system development
## 3 GO:0007420 8.058485e-09  4.027778  8.750222    29  428  brain development
## 4 GO:0007417 1.671928e-08  3.552083 11.346667    33  555  central nervous system development
## 5 GO:0007267 3.012535e-08  3.134041 15.394667    39  753  cell-cell signaling
## 6 GO:0050877 1.462947e-07  3.472552  9.976889    29  488  neurological system process

Notice that each of these GO nodes have 2 or 3 times as many genes from our set of 160 as would be expected by chance

If you are working with an organism that does not have a GO database, it is necessary to map genes between your organism and a reference organism that has a GO ontology. Due to genome evolution, the mapping may not be one-to-one - for example the species you are working with might have some gene duplications since divergence from the common ancestor of the reference species. The current versions of the software do not allow duplicates on the gene list or the gene universe. GOStats removes any duplicates on the gene list and in the gene universe to a single copy.

Testing Expression Differences

Another idea for gene set analysis starts from the genes in the reference set, and seeing if they express differently from genes not in the reference set. It is not clear if the overall expression intensity should be used, or whether gene expression should be reduced to a z-score for each gene.

In the GO analysis, the GO node with the smallest p-value was

GOID: GO:0007268

Term:synaptic transmission

Ontology: BP

Definition: The process of communication from a neuron to a target (neuron, muscle, or secretory cell) across a synapse.

Synonym: neurotransmission

Synonym: regulation of synapse

Synonym: signal transmission across a synapse

The coef component of the LIMMA output has the mean expression of each gene in each treatment. These were centred and scaled to form z-scores. We then took the mean z-score for the genes in and not in node GO:0007268. The plot below summarizes the results.

Note that the differences between the curves should be assessed in the vertical direction. This shows that the biggest differences between the two groups of genes is treatments 3 and 7, which are cerebellum in respectively chimpanzee and human, where the genes in the node tend have lower expression than those not in the node.

GO Expression profile plot