Help

  1. Overview
    1. What's a functional map?
    2. What are subjects and objects?
    3. What's a context (or process)?
    4. What do all of these scores mean?
  2. How do I...
    1. Predict gene function?
    2. Find new genes possibly associated with a disease or biological process?
    3. Analyze my own set of genes?
    4. Find out how a score was generated?
  3. Data integration
    1. What experimental data is used in HEFalMp?
    2. How are the experimental results encoded?
    3. What do the dataset/evidence scores mean?
  4. Functional mapping
    1. How are groups of genes analyzed?
    2. What are the graphics generated when a gene set is related to all genes?
    3. What does a gold bar beside a process/disease/gene mean?
    4. How are p-values calculated, and why are they approximate?
  5. Contact us

Overview

What's a functional map?

A functional map is a way of usefully exploring information from thousands of experimental results, focused on a specific query of interest. This might mean finding data that pertains to a single gene/protein, a group of related (or unrelated) genes, a pathway, process, or set of disease-related genes. Functional maps rely on data integration to summarize genomic data as functional relationship networks. Each network encodes how likely it is for every pair of genes in the genome to interact functionally - possibly a direct interaction, like protein binding, or an indirect functional relationship, like participating in the same cellular process. Functional mapping analyzes portions of these networks related to user-specified groups of genes and biological processes and displays the results as probabilities (for individual genes), functional association p-values (for groups of genes), or graphically (as an interaction network).

What are subjects and objects?

The focus of a functional map will sometimes be referred to as the "query" or "subject". This is the primary entity that you're interested in when you generate a functional map: a gene, gene set, process, or disease. "Objects" are the type of entities you're interested in associating with the subject. So for example, if you wanted to predict gene function for ALOX5AP, your "subject" would be a gene, and your "objects" would be biological processes. Similarly, if you want to find out what genetic disorders might be caused by a particular pathway, your "subject" might be "protein processing", and your "objects" would be diseases.

What's a context (or process)?

A "context" or "process" refers to a biological process, e.g. a Gene Ontology term or other group of functionally related genes. This is roughly equivalent to a pathway or group of related pathways. HEFalMp's functional maps rely on process-specific data integration as described in Huttenhower et al. 2006 and Myers et al. 2007. This means that different genomic data will be up- or down-weighted in different biological contexts; microarrays might be very informative for transcriptionally regulated processes, for example, but direct binding data might be more informative for post-translationally regulated processes.

What do all of these scores mean?

HEFalMp provides three types of scores depending on the subject and object of a map:

How do I...

Predict gene function?

  1. Investigate a gene as it relates to biological processes in the context of all biological processes.

Find new genes possibly associated with a disease or biological process?

  1. Investigate a biological process (to find new genes associated with a pathway) or disease (for genes associated with a genetic disorder).
  2. See how it relates to all genes.
  3. You can find relationships in all biological processes or in a specific process of interest.
  4. Predictions in a particular context will generally be more specific and confident. For example, if you're finding genes associated with the cell cycle, you probably want to do so in the context of the mitotic cell cycle or DNA replication. If you're exploring Alzheimer's disease, you might want to use the protein processing or learning and/or memory contexts.

Analyze my own set of genes?

  1. Investigate a set of genes and type your genes of interest into the text entry box.
  2. If you'd like to see a functional relationship network around your gene set, relate it to all genes and select a context of interest or all biological processes.
  3. If you'd like to predict which pathway(s) your genes might be involved in, see how they relate to biological processes in all biological processes.
  4. If you'd like to see whether your genes are associated with any known genetic disorders, see how they relate to diseases in a specific context or in all biological processes.

Find out how a score was generated?

Click on it! Every time you see a score in HEFalMp, you can click on it to find out where it came from. Scores always bottom out in individual dataset/evidence scores taken directly from process-specific Bayesian classifiers learned automatically from experimental data. Getting to these experimental results could take multiple clicks, though, since functional mapping relies on several levels of summarization (remember, each predicted association might be based on literally millions of data points!). These include:

Data integration

What experimental data is used in HEFalMp?

HEFalMp uses information relating pairs of genes from anything it can get its hands on, primarily microarrays, protein interactions (both physical and genetic), and different types of sequence comparisons (homology, protein domains, transcription factor binding sites, etc.). This totals almost 30 billion data points from over 30,000 experimental conditions! For more information, see our supplemental information and our page about HEFalMp; a quick summary is:

Data type                                      Data points     Datasets  Publications  Experimental conditions
Interactions (physical and genetic)            11,244,053      14        >15,000       >15,000
Sequence comparisons (nucleotide and protein)  452,199,430     7         6             NA
Microarrays                                    27,248,177,875  635       417           14,671
All data                                       27,711,621,358  656       >15,500       ~30,000

How are the experimental results encoded?

Functional maps and their underlying functional relationship networks always analyze pairs of genes; this means that every data point corresponds to a measurement comparing two genes, e.g. microarray correlation, protein interaction, sequence similarity, and so forth. HEFalMp uses naive Bayesian classifiers to balance efficiency and accuracy; to improve performance in the presence of many datasets, we also perform parameter regularization to upweight particularly unique, informative datasets. These classifiers operate on discrete values, however; this is fine for interaction datasets, but it means that we have to bin microarray correlations, sequence comparisons, and other continuous values. Details are found in our supplemental information; in brief:

What do the dataset/evidence scores mean?

The evidence scores, which range from -1 (strong negative) through 0 (none) to 1 (strong positive), indicate how much a particular experimental result from a particular dataset influences the probability of two genes' functional relationship in a particular context. What a mouthful! In other words, this score tells you how confident you can be that two genes are related given just one experimental result. This confidence changes from process to process, since the same data might be more informative in one context than another (e.g. high microarray correlation won't tell you much about MAPK signaling). These scores tend to be near zero, since most individual datasets (particularly microarrays) are not sufficient in isolation to confidently relate two genes - but when hundreds of datasets support the same relationship, the confidence adds up!

Mathematically, this score is the fraction of possible change in posterior versus prior incurred by a single dataset's value in a process-specific Bayesian classifier. For example, suppose any two genes in the entire genome have a 1% chance of being related in the process of autophagy. Given no evidence, the prior probability would thus be 0.01. If you have one microarray dataset, and two genes show a high correlation, this might increase the classifier's posterior probability of relationship to 0.05. The posterior could have risen by at most 1-0.01 = 0.99, so this represents an increase of (0.05-0.01)/0.99, or about 4% of the possible range, and that microarray's high correlation evidence receives a score of 0.04. Conversely, if low sequence similarity provides negative evidence and decreases the posterior to 0.005, the posterior could have fallen by at most 0.01, so it would receive an evidence score of (0.005-0.01)/0.01 = -0.5.
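If it helps to see the arithmetic spelled out, here is a minimal Python sketch of the calculation described above. The function name evidence_score is made up for illustration and is not part of HEFalMp; the sketch simply assumes that positive changes are scaled by the room above the prior (1 - prior) and negative changes by the room below it (prior), which is what the two worked examples imply.

```python
def evidence_score(prior: float, posterior: float) -> float:
    """Fraction of the possible change from prior that the posterior realizes.

    Illustrative helper (not HEFalMp code): increases are divided by the
    room above the prior, decreases by the room below it, so the result
    always lies between -1 and 1.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must be between 0 and 1")
    if posterior >= prior:
        return (posterior - prior) / (1.0 - prior)
    return (posterior - prior) / prior


# The two examples from the text, with a prior of 0.01 for autophagy.
print(round(evidence_score(0.01, 0.05), 2))   # high microarray correlation: 0.04
print(round(evidence_score(0.01, 0.005), 2))  # low sequence similarity: -0.5
```

A posterior equal to the prior gives a score of 0, and posteriors of 1 and 0 give the extremes of 1 and -1.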

Functional mapping

How are groups of genes analyzed?

Functional relationships between individual genes are generally predicted from data integration by taking two genes, looking at all data that relates to them in a bunch of experimental results, and boiling those results down in a principled manner to a single measure of relatedness. The same idea applies to groups of genes, in that their functional association can be measured by boiling down all of the relationships between their constituent genes in a principled manner. This allows functional maps to detail cross-talk among biological processes (e.g. which processes tend to be co-regulated or to carry out related cellular tasks) and to identify links between genetic disorders and potentially causal pathways and processes. The functional association score between two groups of genes G and H consists of four pieces:

What are the graphics generated when a gene set is related to all genes?

These are SVG graphics in which each node represents a gene, each edge a predicted functional relationship, node color the query versus neighbor genes, and edge color the predicted probability of functional relationship. Each graph is a portion of the complete predicted functional relationship network for the current context calculated to be most related to the query genes. This is based on the bioPIXIE algorithm with modifications to take into account the sparsity of human gold standards (relative to yeast), the larger size of the human genome and datasets, and the functional association score system as used for functional mapping. The images can be viewed online in Internet Explorer using the Adobe Viewer, any recent version of Firefox, or downloaded and viewed/edited offline using Inkscape. The Renesis Player might also be an option for SVG viewing, but I don't know anything about it.

What does a gold bar beside a process/disease/gene mean?

A gold bar means that there's at least one gene known to be co-annotated between the current subject (i.e. query gene/process/disease) and the highlighted object. This might mean that the query gene is annotated to the target process in GO, associated with the target disease in OMIM, or that two processes or diseases overlap by at least one gene.

How are p-values calculated, and why are they approximate?

A raw functional association score is a ratio of the between, background, within, and baseline measures of average functional relationship for two gene sets. When this ratio is bigger, the two sets are more related - but quantifying "how much more" depends on the size of the two gene sets and on the current biological context. To analyze these scores in a more principled manner, we convert them to p-values by comparing them to bootstrapped null distributions generated by calculating scores for thousands of random gene sets over a wide range of sizes in every biological context. This yields a distribution of expected scores (per size of the two gene sets, per context) that is approximately normal with mean one, and comparing the score for a "real" gene set to this background distribution yields a p-value. However, the exact variance of the distribution is dependent on the size of the gene sets being analyzed and on the current context, and we can't randomly generate thousands of gene sets on the fly! Instead, we fit interpolating curves to the standard deviations observed for each context, with parameters fit to the range of sizes of the two gene sets. These are generally asymptotic in both sizes (i.e. the expected variance goes down sharply as either or both gene sets grow large) and fit well with a small number of parameters. These interpolating curves are then used at runtime to rapidly compute a standard deviation (equivalent to the original bootstrapped distribution), and from this we infer a p-value. If you want more details (which seems unlikely to me, but I have to offer), see our supplement.

What this means in practice is really just that 0.05 isn't a hard-and-fast significance threshold like it might be in other areas. It's still a good rule-of-thumb significance threshold, but the tails of a normal distribution fall off rapidly enough that small variations in our approximation can generate some wiggle. The p-values are as accurate as we can make them in real-time, but use them as a guide only.
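To make the runtime step a little more concrete, here is a rough Python sketch of turning a raw score into an approximate p-value. Everything specific in it is an assumption for illustration: the names interpolated_sd and approx_p_value, the curve shape a + b/size_g + c/size_h, the parameter values, and the choice of a one-sided upper-tail test are all made up here, and the curves HEFalMp actually fits are described in the supplement. The only pieces taken from the text are the normal null distribution with mean one and a standard deviation that shrinks as the gene sets grow.

```python
import math


def interpolated_sd(size_g: int, size_h: int, a: float, b: float, c: float) -> float:
    """Hypothetical interpolating curve for the null standard deviation.

    The form a + b/size_g + c/size_h is an assumption, chosen only because
    it levels off at `a` as either gene set grows large.
    """
    if size_g < 1 or size_h < 1:
        raise ValueError("gene set sizes must be positive")
    return a + b / size_g + c / size_h


def approx_p_value(score: float, sd: float, mean: float = 1.0) -> float:
    """Upper-tail probability of `score` under a normal null (assumed one-sided)."""
    if sd <= 0.0:
        raise ValueError("standard deviation must be positive")
    z = (score - mean) / sd
    # Normal survival function written with the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Made-up curve parameters for a single context, and a made-up raw score.
sd = interpolated_sd(size_g=20, size_h=50, a=0.02, b=1.0, c=1.0)
print(sd)
print(approx_p_value(1.25, sd))
# A slightly different standard deviation moves the tail probability noticeably.
print(approx_p_value(1.25, sd * 1.1))
```

Comparing the last two printed values shows the wiggle mentioned above: a 10% change in the estimated standard deviation is enough to shift a p-value in the tail by a visible amount.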

Contact us

If all else fails, let us know! We're happy to hear about any suggestions, comments, concerns, successes, or failures you have using HEFalMp or functional maps. Thanks for your interest!