Download
- Supplemental Data
- Implementation
Looking to download functional maps? Keep an eye on the bottom of
each page of results: every functional map of any kind is generated
with a Download link at the bottom right. Most functional maps
are provided as tab-delimited text to simplify downstream processing;
graphical interaction networks are provided as
Scalable Vector Graphics (SVG)
files, which can be viewed using the Adobe SVG Viewer, any recent version of
Firefox, or the
excellent open source Inkscape tool.
For a copy of the manuscript, please see:
Supplemental Data
-
Supplemental Figure 1
Process-specific performance of data integration. Each
column in the heatmap represents one of four data integration
schemes (classifiers trained and evaluated for individual processes,
process-specific classifiers trained for specific processes and
reintegrated, a single process-independent classifier trained
globally and evaluated for individual processes, or an
unregularized global classifier), and each scheme is evaluated
using the entire genome or a set of ~6,000 holdout genes. Each
cell in the heatmap represents the AUPRC over the precision
(functional relationships) and recall (genes) in each of the 229
analyzed processes, sorted in order of decreasing number of genes.
This performance evaluates predicted functional relationships using
a gold standard derived from multiple curated functional catalogs
(including the Gene Ontology, KEGG, and others); each process's
performance is evaluated using the subset of this gold standard
associated with the process (see Methods). While there is a clear
correlation between process size and performance (due to the fact
that more training data makes the machine learning task easier),
certain processes are also predicted more accurately than others
independent of size, such as defense response to virus or protein
polymerization. The most specific classifiers overfit somewhat to
the smallest terms, but this effect is mitigated in larger terms or
by reintegration into the process-specific classifier. For medium
and large processes, increased specificity provides increased
performance, with individual process classifiers outperforming the
process-specific classifier, which in turn outperforms the
process-independent classifier. Finally, the unregularized
classifier is outperformed in essentially every process.
-
Supplemental Figure 2
A subset of the functional modules predicted by mining highly-connected clusters from functional relationship networks.
These modules are organized into a partially overlapping hierarchy, similar to that of the Gene Ontology. Each module consists of genes predicted to be related based on multiple informative genomic datasets. Here, a specific module consisting of PIAS3, MITF, and PAX6 generalizes through two main branches into modules enriched for various transcriptional regulation activities in the cell cycle, apoptosis, and intercellular signaling. The most specific module in the hierarchy links the transcriptional regulators PIAS3, MITF, and PAX6 with very strong evidence drawn from multiple direct binding assays in the BioGRID (Stark et al. 2006). This module has two main branches of more general parents in the hierarchy. The first contains several cell growth, death, and differentiation transcriptional modulators, including JUN, NFKB1, and BCL3. The second contains multiple cell cycle-related oncogenes, oncogene activators, and TGF-β family mediators, almost all of which are also transcriptional modulators (Kim et al. 2000). This is likely indicative of two interrelated regulatory programs, the former focused on cell development and differentiation and the latter responding more specifically to extracellular signaling. We have automatically mined and hierarchically organized ~17,000 functional modules of varying specificities from our integrated data.
-
Supplemental Figure 3
Performance of predicted functional modules in recapitulating known biology.
To validate the functional
specificity of the hierarchical modules predicted by our data integration process, we analyzed their ability to recover information from a
held-out portion of our gold standard (as per Figure 1B's evaluation of individual predicted functional relationships). These modules,
a hierarchy of gene clusters drawn from our functional relationship networks (not from usual clustering metrics such as distance
or correlation), were compared with other clustering schemes (k-means with k=5, 10, 50, 100, 500, 1000, and 5000, and hierarchical
clustering cut to produce 5, 10, etc. clusters). In all cases, more specific parameters (modules lower in the hierarchy, larger k,
or smaller diameters) are expected to produce more precise clusters in which fewer genes cocluster. This is reflected in the plot
above, with our predicted functional modules demonstrating a substantial advantage over less structured clusterings (all of which
consume our predicted functional relationship networks as input). The hierarchical structure and discrete sets of related genes that
make up the functional modules can be an advantage over the raw, continuous predicted functional relationship scores for some
applications; these results show that the functional modules do not sacrifice substantial biological accuracy in order to offer these
features.
-
Supplemental Figure 4
Associations between biological processes derived by functional mapping.
As a validation of the system's ability to recover
known biology by means of functional mapping, we focused on the process of cell fate commitment, which is predicted to be associated with a cluster of cell
development and differentiation processes. Arrow width indicates the strength of predicted association, and border thickness indicates the internal
cohesiveness of each process in the integrated genomic data. Many of the predicted associations
with specific processes of cell differentiation and development represent known biology and thus serve as a
validation of the computational method. Several of these associations are driven by proteins known to be involved in multiple processes; for example, the
association with gastrulation involves many shared genes including TGFB2, BMP4, TBX6, and TRIM15. On the other hand, an apparently similar
association with axis specification is driven mainly by genes not yet cataloged as involved in a cell fate decision (e.g. TDGF1, T, and MDFI). The
predicted associations are thus based on a combination of proteins known to participate in multiple processes and on data-driven predicted relationships;
additional novel associations can be explored for other biological processes using the HEFalMp interface.
-
Supplemental Figure 5
Quantification of autophagosome formation in starved cells.
MAP1LC3 is typically diffuse throughout the cytoplasm in non-starved cells. Under normal conditions, starved cells will initiate autophagy, process MAP1LC3 to the MAP1LC3-II isoform, and form punctate autophagosomes to which it is localized. We measured the degree to which this was impaired by luciferase (negative control), ATG5 (positive control), AP3B1, ATP6AP1, BLOC1S1, LAMP2, RAB11A, and VAMP7 siRNA depletions using immunoblotting (see Figure 3) and automated and manual inspection of ten images for each condition (totaling 80 images). While VAMP7 knockdowns showed no effect (see Discussion), siRNA knockdowns of the remaining five genes inhibited normal autophagy. A) Automated image analysis detects a significant decrease in fluorescent GFP-tagged MAP1LC3-II under starvation conditions for the positive control (ATG5) and five validated knockdowns. Bars show standard error of average relative intensity as quantified by CellProfiler (Carpenter et al. 2006) over a collection of 10 images per condition (80 total). This decrease in detectable fluorescence indicates that normal MAP1LC3-II processing (and thus autophagy) is impaired when these five protein levels or the ATG5 positive control are reduced. B) Manual inspection of the number of puncta per cell shows decreased autophagosome formation when autophagy is impaired. Error bars indicate standard error over counts by three independent investigators viewing randomized, unlabeled images. The number of puncta increases when cells are starved under the luciferase control condition, but this increase is substantially impaired in ATG5 (positive control), AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A siRNA conditions.
-
Supplemental Figure 6
Bayesian parameter regularization prevents overconfident
probability estimation in the presence of many datasets. While
naive Bayesian classifiers provide an accurate and efficient way to
integrate hundreds of genomic datasets, they assume complete
independence between all data. Violations of this assumption,
which occur due to shared biological and technical signals between
datasets, become increasingly problematic as the number of
integrated datasets increases. We use Bayesian parameter
regularization to combine each dataset's probability distribution
with a uniform prior, mixing this prior in with weight proportional
to the amount of information shared by each dataset (see
Supplemental Table 6). Intuitively, this results in datasets with
strong, unique signals being upweighted during the integration
process, while groups of datasets sharing most of the same
information will be downweighted. Without regularization, a
low-confidence datum contributed by many datasets can
inappropriately result in a high-confidence prediction of
functional relationship. Regularization downweights such shared
data and results in a more biologically realistic distribution of
low- and high-probability functional relationships.
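The effect described above can be illustrated with a minimal Python sketch. It assumes the mixing takes the linear form (1 - alpha) * P + alpha * uniform with alpha in [0, 1]; the function names, the alpha value of 0.9, the prior of 0.01, and the toy likelihoods are all illustrative assumptions, not the actual HEFalMp or Sleipnir code or parameters.

```python
def regularize(dist, alpha):
    """Mix a discrete distribution with a uniform prior (assumed linear mixing)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    uniform = 1.0 / len(dist)
    return [(1.0 - alpha) * p + alpha * uniform for p in dist]


def naive_bayes_posterior(prior, datasets, observed):
    """Posterior P(FR | data) under the naive independence assumption.

    datasets: list of (P(value | FR), P(value | not FR)) distribution pairs.
    observed: the observed bin index for each dataset.
    """
    odds = prior / (1.0 - prior)
    for (given_fr, given_not), value in zip(datasets, observed):
        odds *= given_fr[value] / given_not[value]
    return odds / (1.0 + odds)


# Toy example: 50 redundant datasets, each contributing the same weak datum.
weak = ([0.4, 0.6], [0.6, 0.4])  # hypothetical P(bin | FR), P(bin | not FR)
raw = [weak] * 50
shrunk = [(regularize(weak[0], 0.9), regularize(weak[1], 0.9))] * 50
observed = [1] * 50  # every dataset reports the weakly supportive bin

p_raw = naive_bayes_posterior(0.01, raw, observed)
p_reg = naive_bayes_posterior(0.01, shrunk, observed)
print(f"unregularized: {p_raw:.4f}, regularized: {p_reg:.4f}")
```

In this toy setting the unregularized posterior approaches one even though each individual datum is weak, while the regularized posterior stays low. A dataset with little shared information would receive a small alpha and keep most of its original signal.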
-
Supplemental Table 1
Datasets integrated for functional mapping.
Short IDs are used for dataset identification in other supplemental data. "Discretization" indicates the number of bins into which the dataset's values were discretized; generally, interaction datasets used 0 and 1 for (un)observed interactions, microarrays were binned by standard deviation from <-1.5 to >4.5, and profile data (TFBSs, domains, etc.) were binned using discretized inner product or Euclidean distance. "Conditions" is the number of microarray conditions in the dataset, "Publications" the approximate number of distinct PubMed IDs associated with the dataset, and "Datapoints" the number of gene pairs for which the dataset contributes information.
-
Supplemental Table 2
Frequency-weighted functional activity scores.
Each score represents the sum of change in probability attributable to a single dataset in each process-specific Bayesian classifier, weighted by the frequency with which that dataset contributes information. This is calculated as the weighted sum over the absolute change in posterior given each possible value from a dataset, weighted by that value's prior probability. Intuitively, this means that a dataset must contain accurate data for many gene pairs to have a high score; microarrays, for example, contain data for most gene pairs but are often noisy. Specific assays collected in databases such as BIND or DIP are much more accurate, but they provide information for a much smaller number of gene pairs. Relative functional activity scores, which can be more informative in many situations, can be calculated by dividing each process-specific score by the appropriate dataset's score in the global (process-independent) network.
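One way to read the calculation above is sketched below in Python: for each discretized value, the absolute change from prior to posterior is weighted by the probability of observing that value. The function names and the toy distributions are assumptions made for illustration; the marginal P(value) is taken here as the "prior probability" of the value.

```python
def activity_score(prior_fr, given_fr, given_not):
    """Frequency-weighted absolute change in posterior for one dataset.

    prior_fr:  prior probability of a functional relationship, P(FR)
    given_fr:  P(value | FR) for each discretized value
    given_not: P(value | not FR) for each discretized value
    """
    score = 0.0
    for p_fr, p_not in zip(given_fr, given_not):
        p_value = p_fr * prior_fr + p_not * (1.0 - prior_fr)  # P(value)
        if p_value == 0.0:
            continue  # value never observed; contributes nothing
        posterior = p_fr * prior_fr / p_value  # P(FR | value)
        score += p_value * abs(posterior - prior_fr)
    return score


def relative_score(process_score, global_score):
    """Process-specific score divided by the same dataset's global score."""
    return process_score / global_score


# Hypothetical datasets: bin 0 is "no data / no interaction", bin 1 is a hit.
noisy_but_frequent = activity_score(0.1, [0.5, 0.5], [0.6, 0.4])
accurate_but_rare = activity_score(0.1, [0.95, 0.05], [0.999, 0.001])
print(noisy_but_frequent, accurate_but_rare)
```

A dataset whose two conditional distributions are identical scores zero, since observing it never moves the posterior away from the prior.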
-
Supplemental Table 3
Process-specific performance of data integration.
Each column in the table represents one of four data integration schemes (classifiers trained and evaluated for individual processes, process-specific classifiers trained for specific processes and reintegrated, a single process-independent classifier trained globally and evaluated for individual processes, or an unregularized global classifier), and each scheme is evaluated using the entire genome or a set of ~6,000 holdout genes. Each cell in the table represents the AUPRC over the precision (functional relationships) and recall (genes) in each of the 229 analyzed processes, sorted in order of decreasing number of genes. This performance evaluates predicted functional relationships using a gold standard derived from multiple curated functional catalogs (including the Gene Ontology, KEGG, and others); each process's performance is evaluated using the subset of this gold standard associated with the process (see Methods). While there is a clear correlation between process size and performance (due to the fact that more training data makes the machine learning task easier), certain processes are also predicted more accurately than others independent of size, such as defense response to virus or protein polymerization. The most specific classifiers overfit somewhat to the smallest terms, but this effect is mitigated in larger terms or by reintegration into the process-specific classifier. For medium and large processes, increased specificity provides increased performance, with individual process classifiers outperforming the process-specific classifier, which in turn outperforms the process-independent classifier. Finally, the unregularized classifier is outperformed in essentially every process.
-
Supplemental Table 4
Non-normalized functional association scores between all genes and 229 biological processes.
These scores represent the gene-specific portion of the functional association between each gene in the genome and each biological process, calculated from that process's specific predictions. Each score is a gene's average probability of functional relationship with genes in the process, divided by its average probability of functional relationship with the entire genome. Scores for genes previously annotated to the process are negative. These scores can easily be interpreted as function predictions, with the third quartile or inner upper fence of the score for genes characterized to a process serving as a confident cutoff for associating additional genes with the process.
-
Supplemental Table 5
Functional modules automatically extracted from predicted functional relationship networks.
Each module represents a cluster (heavy subgraph) in the global process-aware functional relationship network, identified using an algorithm based on the greedy approximation of Charikar 2000. All modules were generated with a minimum initial score (sigma) of 0.95, the indicated final ratio (rho) from 0.5 to 0.01, and the indicated confidence score; the ratio can be considered similar to a depth in the Gene Ontology, where higher ratios indicate "lower", more specific clusters, and the final score indicates the overall cohesiveness of the module. After extraction, all modules with a Jaccard index of at least 0.5 were merged. Parent/child relationships were automatically inferred for any pair of modules P and C such that C's ratio was at least equal to P's and at least two thirds of C's genes were contained in P. While this results in many redundant hierarchical relationships such that A->B->C and A->C, it quickly and automatically captures a great deal of the functional structure derived from integration of many genomic datasets without relying on predetermined catalogs such as the Gene Ontology.
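The merge criterion and the parent/child rule above can be expressed directly. The following Python sketch uses hypothetical module names, ratios, and gene sets; it does not reproduce the heavy-subgraph extraction itself, only the Jaccard index and the hierarchy inference applied afterwards.

```python
def jaccard(a, b):
    """Jaccard index of two gene sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)


def infer_hierarchy(modules):
    """Infer (parent, child) pairs from a dict of name -> (ratio, gene set).

    C is a child of P when C's ratio is at least P's and at least
    two thirds of C's genes are contained in P.
    """
    edges = set()
    for p_name, (p_ratio, p_genes) in modules.items():
        for c_name, (c_ratio, c_genes) in modules.items():
            if p_name == c_name:
                continue
            shared = len(c_genes & p_genes)
            # Integer form of shared / len(c_genes) >= 2/3 avoids rounding.
            if c_ratio >= p_ratio and 3 * shared >= 2 * len(c_genes):
                edges.add((p_name, c_name))
    return edges


# Hypothetical modules: nested gene sets at increasing specificity (ratio).
modules = {
    "A": (0.01, frozenset(f"g{i}" for i in range(1, 21))),
    "B": (0.10, frozenset(f"g{i}" for i in range(1, 9))),
    "C": (0.50, frozenset(f"g{i}" for i in range(1, 4))),
}
edges = infer_hierarchy(modules)
print(sorted(edges))
```

In this example the result contains A->B, B->C, and also A->C, which is the redundancy the text mentions. None of the three pairs reaches a Jaccard index of 0.5, so none would have been merged beforehand.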
-
Supplemental Table 6
Mutual information scores between datasets.
These were approximated using the discretization described in Supplemental Table 1, with missing values normalized by randomization (minimizing the score contribution from simple overlap of present/not present gene pairs). These mutual information values were normalized per-dataset and used to heuristically determine appropriate alpha weights for parameter regularization of the naive Bayesian classifiers.
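For reference, mutual information between two discretized datasets can be computed from bin co-occurrence counts as in the Python sketch below. It assumes both datasets have already been discretized and aligned over the same gene pairs; the randomization used for missing values and the per-dataset normalization described above are not modeled, and the function name is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two aligned discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy bins for four gene pairs in three hypothetical datasets.
d1 = [0, 1, 0, 1]
d2 = [0, 1, 0, 1]  # carries the same information as d1
d3 = [0, 0, 1, 1]  # independent of d1
print(mutual_information(d1, d2), mutual_information(d1, d3))
```

Datasets that share most of their information, like d1 and d2 here, would receive a larger regularization weight than a dataset with a unique signal.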
-
Supplemental Table 7
Bootstrapped distributions used for realtime p-value approximation.
In order to approximate a p-value for functional association scores involving user-provided gene sets of any size in any context, bootstrapped null distributions were precomputed using 62,500 randomly chosen gene sets over a range of sizes. Functional association scores between random gene sets were observed to follow an approximately normal distribution with mean one and standard deviation asymptotic in the size of both gene sets. This allowed the expected standard deviation of functional association score between gene sets of size N and M with N<=M to be modeled using six parameters A1, A2, B, C1, C2, and C3: sigma(N, M) = (A*M + B)/(M + C), A = (A1*N + A2)/(N + 1), C = (C1*N + C2)/(N + C3). This standard deviation allows the conversion of a functional association score into a p-value by subtracting the mean (one), dividing by the standard deviation, and performing a z-test using a Gaussian CDF. These six parameters and the bootstrapped standard deviations are provided below for each process and for the global (process-independent) network.
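The conversion from score to p-value can be sketched as follows in Python. The parameter values are placeholders invented for illustration (the fitted values are those provided in the table), the function names are hypothetical, and the use of a one-sided upper-tail test is an assumption.

```python
from math import erfc, sqrt


def sigma(n, m, a1, a2, b, c1, c2, c3):
    """Modeled standard deviation for gene sets of size n and m (n <= m)."""
    if n > m:
        n, m = m, n  # the model is defined for N <= M
    a = (a1 * n + a2) / (n + 1)
    c = (c1 * n + c2) / (n + c3)
    return (a * m + b) / (m + c)


def p_value(score, n, m, params):
    """Upper-tail z-test of a functional association score against mean one."""
    z = (score - 1.0) / sigma(n, m, *params)
    return 0.5 * erfc(z / sqrt(2.0))  # equals 1 - Phi(z) for a standard normal


# Placeholder parameters (A1, A2, B, C1, C2, C3); not fitted values.
params = (0.1, 0.2, 0.5, 1.0, 2.0, 3.0)
print(sigma(1, 2, *params))
print(p_value(1.5, 5, 10, params))
```

A score equal to the null mean of one yields a p-value of 0.5, and larger scores yield smaller p-values.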
-
Supplemental Text 1
Previous work in heterogeneous data integration.
As mentioned in the main text, related prior work in the area of heterogeneous data integration falls into two categories: methodological precursors involving naive Bayesian classifiers and biological precursors performing data integration for simpler organisms.
Implementation
HEFalMp is implemented using the Sleipnir library for
computational functional genomics as a data-processing back-end; the
BNServer tool provides real-time functional maps, while Hubber,
Funcaeologist, and various other tools were used for data processing,
precomputation, and evaluation.
The HEFalMp web front end is built using Ruby on Rails with a standard MySQL database. If you're interested in
setting up your own functional mapping web site (e.g. for a new
organism or data collection), please contact us; we're happy to provide code and database contents.