Projects
A cell can be viewed as a complex network of inter-relating
proteins, nucleic acids and other bio-molecules. A bio-molecular
network can be viewed as a collection of nodes, representing the
bio-molecules, connected by links, representing relations between the
bio-molecules. We are working on learning/mining valuable
information from bio-molecular networks.
Support Node
Partition
An important problem in computational biology is the
identification of candidate genes which can be considered
representative of the different cellular processes taking place in the
cell as it evolves through
time.
Multiple
and
very
noisy
data
sources contain information about such
processes and should therefore be integrated in order to obtain a
reliable identification of such candidate genes. These data sources
usually appear as a graph or can be naturally mapped to a graph (e.g.,
co-expression graph for microarray data). To find the most
representative nodes from multiple graphs, we adapted Vector
Quantization to Node Quantization (currently we call it Support Node
Partition) by optimizing a submodular function which guarantees that a
greedy algorithm will find a suboptimal solution.
By the toy data shown in the right panel, we demonstrate
the difference between our algorithm and k-means. The three clusters are generated from the normal
distribution with different means
(black cycle) and variances (1.5, 0.5,
and 0.5). They are marked by circle, square, and diamond respectively.
We select three points by our algorithm (red pentagram)
and k-means (yellow hexagram).
Graph Dependency
Graphs
appear
in
many real-world applications from web hyperlinks to social
networks to biological networks. Although several measures have been
proposed to quantify information between two discrete or continuous
variables
(e.g. conditional entropy,
the dependency degree in Rough Set Theory), no proper information
measure has so far been proposed to quantify how much information a
graph contains about another graph. In this project, we introduce Graph
Dependency (GD) as a measure of the amount of information between two
graphs. Moreover, the key observation that a discrete variable can be
thought of as an equivalence relation allows us to establish a
connection between the proposed measure and the aforementioned
traditional measures thus providing a unified algebraic view of
information measures. To demonstrate the general applicability and
usefulness of GD, we applied it to graphs from different fields such as
biology, the World Wide Web and social networks. Here we show that GD
can be used to quantify the amount of gene functional information in
correlation networks and protein-protein interaction networks; the
amount of categorical information in web pages and hyperlink graphs;
the relationship between user trust/distrust and taste similarity.
By the left toy data, we show the graph dependency between
the ''house'' graph and two of its subgraph.
As an example, by the pictures below we show how much information
co-expression graphs contain about the semantic similarity graphs for
the three different categories in the Gene Ontology – Biological
Process, Cellular Component and Molecular Function. GD, Pearson
Correlation Coefficient (PCC) and the Conditional Entropy (CE) have
been calculated separately for the four different experiments in the
Yeast cell cycle dataset from Spellman et al. According to GD,
microarray data contain most information about BP, followed by CC and
MF. This is as expected from existing biological knowledge, since
co-expression data capture the joint collaborative activity of genes
(processes) and the activity of co-expressed genes is more likely to
occur in the same cellular component. At the same time, co-expression
data carry less information about the specific molecular function of a
gene. PCC and CE fail to provide a clear ordering of the three
categories of Gene Ontology. Four different colors represent four
independent experiments: α factor arrest, cdc15, cdc28, and
elutriation. We employed Resnik’s semantic similarity measure to build
the semantic
similarity
graphs.
Development of a
graph-theoretic approach to predict protein
function by integrating large scale heterogeneous data
In recent years, the numerous large scale sequencing projects
have generated enormous amounts of sequence data. This has led to the
identification of thousands of previously unseen genes whose function
awaits to be characterized. A fundamental goal is therefore to identify
the function of uncharacterized genes on a genomic scale. It is
difficult to design functional assays for uncharacterized genes so a
major challenge in bioinformatics is to devise algorithmic methods
that, given a gene, can predict a hypothesis for its function that can
then be validated experimentally. In this project, we focus on
predicting the functions of genes in an organism whose sequences
are just detected and for which no experiment has been done so none of
its genes has a known function. Our methods consists of five steps:
initial guesses for the functions of genes by different models such as
Superfamily, BlastProDom, FPrintScan, Gene3D, HMMTigr and so on,
integration of these guesses, building graphs that are transformed
from STRING, integration of these graphs, and diffusion of integrated
initial guess of functions of genes on the integrated graph. To
evaluate the proposed method, we choose E. Coli as a test organism, in
which some genes are characterized and some experiments have been done.
We only used the sequence information of genes in E. Coli (we do not
use any experiment information), and the known functions of some genes
are only used as gold standard. We followed G. Obosinski et al.'s
setting that subdivide predictions into four groups on the basis of the
number of genes assigned to a GO terms (3 to10, 11 to 30, 31 to 100,
and 101 to 300). The average AUC scores for GO terms in these
four groups are 0.80, 0.75, 0.73 and 0.77 respectively.
Methods for Semantic
similarity
Several measures have been recently proposed for quantifying
the functional similarity between gene products according to
well-structured controlled vocabularies where biological terms are
organized in a tree or in a directed acyclic graph (DAG) structure.
However, existing semantic similarity measures ignore two important
facts. First, when calculating the similarity between two terms, they
disregard the descendants of these terms. While this makes no
difference when the ontology is a tree, we shall show that it has
important
consequences when the ontology is a DAG -- this is the case, for
example, with the Gene Ontology (GO). For example, see the left figure,
the similarity between M and N should be larger than the similarity
between I and L. Second
, existing similarity
measures do not model the inherent uncertainty which comes from the
fact that our current knowledge of the gene annotation and of the gene
ontology structure is incomplete. Here we propose a novel approach
based on downward random walks that can be used to improve any of the
existing similarity measures to exhibit these two properties.
We show that our approach greatly improves three commonly used semantic
similarity measures which are based on the work of Resnik, Lin and
Jiang. We applied these improved measures to the Gene Ontology
annotations of the yeast Saccharomyces cerevisiae, and tested how they
correlate with sequence similarity, mRNA co-expression and
protein-protein interaction data. Our results consistently show that
the use of downward random walks leads to more reliable similarity
measures.
Code: http://www.cs.rhul.ac.uk/home/haixuan/GOSIM.html.