A functional map is a way of usefully exploring information from thousands of experimental results, focused on a specific query of interest. This might mean finding data that pertains to a single gene/protein, a group of related (or unrelated) genes, a pathway, process, or set of disease-related genes. Functional maps rely on data integration to summarize genomic data as functional relationship networks. Each network encodes how likely it is for every pair of genes in the genome to interact functionally - possibly a direct interaction, like protein binding, or an indirect functional relationship, like participating in the same cellular process. Functional mapping analyzes portions of these networks related to user-specified groups of genes and biological processes and displays the results as probabilities (for individual genes), functional association p-values (for groups of genes), or graphically (as an interaction network).
The focus of a functional map will sometimes be referred to as the "query" or "subject". This is the primary entity that you're interested in when you generate a functional map: a gene, gene set, process, or disease. "Objects" are the type of entities you're interested in associating with the subject. So for example, if you wanted to predict gene function for ALOX5AP, your "subject" would be a gene, and your "objects" would be biological processes. Similarly, if you want to find out what genetic disorders might be caused by a particular pathway, your "subject" might be "protein processing", and your "objects" would be diseases.
A "context" or "process" refers to a biological process, e.g. a Gene Ontology term or other group of functionally related genes. This is roughly equivalent to a pathway or group of related pathways. HEFalMp's functional maps rely on process-specific data integration as described in Huttenhower et al 2006 and Myers et al 2007. This means that different genomic data will be up- or down-weighted in different biological contexts; microarrays might be very informative for transcriptionally regulated processes, for example, but direct binding data might be more informative for post-translationally regulated processes.
HEFalMp provides three types of scores depending on the subject and object of a map:
Click on it! Every time you see a score in HEFalMp, you can click on it to find out where it came from. Scores always bottom out in individual dataset/evidence scores taken directly from process-specific Bayesian classifiers learned automatically from experimental data. Getting to these experimental results could take multiple clicks, though, since functional mapping relies on several levels of summarization (remember, each predicted association might be based on literally millions of datapoints!) These include:
HEFalMp uses information relating pairs of genes from anything it can get its hands on, primarily microarrays, protein interactions (both physical and genetic), and different types of sequence comparisons (homology, protein domains, transcription factor binding sites, etc.) This totals almost 30 billion data points from over 30,000 experimental conditions! For more information, see our supplemental information and our page about HEFalMp; a quick summary is:
| | Data points | Datasets | Publications | Experimental conditions |
|---|---|---|---|---|
| Interactions (physical and genetic) | 11,244,053 | 14 | >15,000 | >15,000 |
| Sequence comparisons (nucleotide and protein) | 452,199,430 | 7 | 6 | NA |
| Microarrays | 27,248,177,875 | 635 | 417 | 14,671 |
| All data | 27,711,621,358 | 656 | >15,500 | ~30,000 |
Functional maps and their underlying functional relationship networks always analyze pairs of genes; this means that every data point corresponds to a measurement comparing two genes, e.g. microarray correlation, protein interaction, sequence similarity, and so forth. HEFalMp uses naive Bayesian classifiers to balance efficiency and accuracy; to improve performance in the presence of many datasets, we also perform parameter regularization to upweight particularly unique, informative datasets. These classifiers operate on discrete values, however; this is fine for interaction datasets, but it means that we have to bin microarray correlations, sequence comparisons, and other continuous values. Details are found in our supplemental information; in brief:
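As a rough illustration of why continuous values have to be binned, the sketch below discretizes a correlation score and feeds the resulting bin into a naive Bayesian update for one gene pair. Everything specific in it is an assumption made for illustration: the bin edges, the probability tables, and the function names `discretize` and `naive_bayes_posterior` are hypothetical, not HEFalMp's actual bins or learned parameters, and the regularization step mentioned above is not modeled.

```python
from bisect import bisect_right

# Hypothetical bin edges for a correlation-like score (NOT HEFalMp's real bins).
# Three edges give four bins: <-0.5, [-0.5, 0), [0, 0.5), >=0.5.
BIN_EDGES = [-0.5, 0.0, 0.5]


def discretize(value, edges=BIN_EDGES):
    """Map a continuous value to the index of the bin it falls into."""
    return bisect_right(edges, value)


def naive_bayes_posterior(prior, observed_bins, p_given_related, p_given_unrelated):
    """Combine one discrete observation per dataset into a posterior.

    observed_bins[i] is the bin observed in dataset i; the two tables hold,
    per dataset, the probability of each bin given related/unrelated genes.
    """
    related = prior
    unrelated = 1.0 - prior
    for dataset, bin_index in enumerate(observed_bins):
        related *= p_given_related[dataset][bin_index]
        unrelated *= p_given_unrelated[dataset][bin_index]
    return related / (related + unrelated)


# Made-up tables for a single microarray dataset, one probability per bin.
P_RELATED = [[0.10, 0.30, 0.40, 0.20]]
P_UNRELATED = [[0.10, 0.40, 0.46, 0.04]]

high_bin = discretize(0.8)  # a high correlation lands in the top bin
posterior = naive_bayes_posterior(0.01, [high_bin], P_RELATED, P_UNRELATED)
print(high_bin, round(posterior, 3))
```

An interaction dataset would need no `discretize` call at all, since its values (interaction observed or not) are already discrete.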
The evidence scores, which range from -1 (strong negative) through 0 (none) to 1 (strong positive), indicate how much a particular experimental result from a particular dataset influences the probability of two genes' functional relationship in a particular context. What a mouthful! In other words, this score tells you how confident you can be that two genes are related given just one experimental result. This confidence changes from process to process, since the same data might be more informative in one context than another (e.g. high microarray correlation won't tell you much about MAPK signaling). These scores tend to be near zero, since most individual datasets (particularly microarrays) are not sufficient in isolation to confidently relate two genes - but when hundreds of datasets support the same relationship, the confidence adds up!
Mathematically, this score is the fraction of the possible change from prior to posterior incurred by a single dataset's value in a process-specific Bayesian classifier. For example, suppose any two genes in the entire genome have a 1% chance of being related in the process of autophagy. Given no evidence, the prior probability would thus be 0.01. If you have one microarray dataset, and two genes show a high correlation, this might increase the classifier's posterior probability of relationship to 0.05. The most the probability could have risen is from 0.01 to 1, a range of 0.99, so this represents an increase of (0.05-0.01)/0.99 ≈ 4% of the possible range, and that microarray's high correlation evidence receives a score of 0.04. Conversely, if low sequence similarity provides negative evidence and decreases the posterior to 0.005, the most the probability could have fallen is from 0.01 to 0, a range of 0.01, so it would receive an evidence score of (0.005-0.01)/0.01 = -0.5.
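The arithmetic in this example can be written as a small function. This is a minimal sketch of the calculation as described here, not HEFalMp's own code; the name `evidence_score` is made up for illustration.

```python
def evidence_score(prior, posterior):
    """Fraction of the possible change in probability caused by one result.

    Increases are scaled by the room above the prior (1 - prior);
    decreases are scaled by the room below it (prior).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must be between 0 and 1")
    if posterior >= prior:
        return (posterior - prior) / (1.0 - prior)
    return (posterior - prior) / prior


# The two cases from the autophagy example (prior of 0.01).
positive = evidence_score(0.01, 0.05)   # high microarray correlation
negative = evidence_score(0.01, 0.005)  # low sequence similarity
print(f"{positive:.2f}")  # prints 0.04
print(f"{negative:.2f}")  # prints -0.50
```

Scaling increases and decreases separately is what keeps the score between -1 and 1 no matter how small the prior is.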
Functional relationships between individual genes are generally predicted from data integration by taking two genes, looking at all data that relates to them in a bunch of experimental results, and boiling those results down in a principled manner to a single measure of relatedness. The same idea applies to groups of genes, in that their functional association can be measured by boiling down all of the relationships between their constituent genes in a principled manner. This allows functional maps to detail cross-talk among biological processes (e.g. which processes tend to be co-regulated or to carry out related cellular tasks) and to identify links between genetic disorders and potentially causal pathways and processes. The functional association score between two groups of genes G and H consists of four pieces:
These are SVG graphics in which each node represents a gene, each edge a predicted functional relationship, node color the query versus neighbor genes, and edge color the predicted probability of functional relationship. Each graph is the portion of the complete predicted functional relationship network for the current context that is calculated to be most related to the query genes. This is based on the bioPIXIE algorithm with modifications to take into account the sparsity of human gold standards (relative to yeast), the larger size of the human genome and datasets, and the functional association score system as used for functional mapping. The images can be viewed online in Internet Explorer using the Adobe Viewer or in any recent version of Firefox, or downloaded and viewed/edited offline using Inkscape. The Renesis Player might also be an option for SVG viewing, but I don't know anything about it.
Gold bars that look like this mean that there's at least one gene known to be co-annotated between the current subject (i.e. query gene/process/disease) and the highlighted object. This might mean that the query gene is annotated to the target process in GO, associated with the target disease in OMIM, or that two processes or diseases overlap by at least one gene.
A raw functional association score is a ratio of the between, background, within, and baseline measures of average functional relationship for two gene sets. When this ratio is bigger, the two sets are more related - but quantifying "how much more" depends on the size of the two gene sets and on the current biological context. To analyze these scores in a more principled manner, we convert them to p-values by comparing them to bootstrapped null distributions generated by calculating scores for thousands of randomly chosen gene sets over a wide range of sizes in every biological context. This yields a distribution of expected scores (per size of the two gene sets, per context) that is approximately normal with mean one, and comparing the score for a "real" gene set to this background distribution yields a p-value. However, the exact variance of the distribution is dependent on the size of the gene sets being analyzed and on the current context, and we can't randomly generate thousands of gene sets on the fly! Instead, we fit interpolating curves to the standard deviations observed for each context, with parameters fit to the range of sizes of the two gene sets. These are generally asymptotic in both sizes (i.e. the expected variance goes down sharply as either or both gene sets grow large) and fit well with a small number of parameters. These interpolating curves are then used at runtime to rapidly compute a standard deviation (approximating that of the original bootstrapped distribution), and from this we infer a p-value. If you want more details (which seems unlikely to me, but I have to offer), see our supplement.
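The runtime step described above can be sketched roughly as follows. The functional form of the interpolating curve, its parameters `a`, `b`, and `c`, and the choice of a one-sided upper-tail test are all assumptions made for illustration; the actual curves and fitted parameters are in the supplement, not reproduced here.

```python
import math


def interpolated_sd(size_g, size_h, a, b, c):
    """Hypothetical interpolating curve for the null standard deviation.

    It decays toward the floor `a` as either gene set grows, which mimics
    the asymptotic behavior described in the text. The real curve may differ.
    """
    return a + b / math.sqrt(size_g) + c / math.sqrt(size_h)


def upper_tail_p_value(score, sd, mean=1.0):
    """P(null score >= observed score) under a normal null with the given mean."""
    z = (score - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Made-up curve parameters for one biological context.
A, B, C = 0.01, 0.5, 0.5

sd_small = interpolated_sd(25, 100, A, B, C)   # two smallish gene sets
sd_large = interpolated_sd(400, 400, A, B, C)  # two large gene sets

# The same raw score is far more surprising when the null is narrower.
p_small = upper_tail_p_value(1.2, sd_small)
p_large = upper_tail_p_value(1.2, sd_large)
print(sd_small, sd_large, p_small, p_large)
```

Because the curve only approximates the bootstrapped standard deviation, small errors in `sd` translate into noticeable changes in p-values far out in the tail.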
What this means in practice is really just that 0.05 isn't a hard-and-fast significance threshold like it might be in other areas. It's still a good rule-of-thumb significance threshold, but the tails of a normal distribution fall off rapidly enough that small variations in our approximation can generate some wiggle. The p-values are as accurate as we can make them in real-time, but use them as a guide only.
If all else fails, let us know! We're happy to hear about any suggestions, comments, concerns, successes, or failures you have using HEFalMp or functional maps. Thanks for your interest!