A cell can be viewed as a complex network of inter-relating proteins, nucleic acids and other bio-molecules. A bio-molecular network can be viewed as a collection of nodes, representing the bio-molecules, connected by links, representing relations between the bio-molecules. We are working on learning/mining valuable information from bio-molecular networks.

Support Node Partition

An important problem in computational biology is the identification of candidate genes which can be considered representative of the different cellular processes taking place in the cell as it evolves through abctime. Multiple and very noisy data sources contain information about such processes and should therefore be integrated in order to obtain a reliable identification of such candidate genes. These data sources usually appear as a graph or can be naturally mapped to a graph (e.g., co-expression graph for microarray data). To find the most representative nodes from multiple graphs, we adapted Vector Quantization to Node Quantization (currently we call it Support Node Partition) by optimizing a submodular function which guarantees that a greedy algorithm will find a suboptimal solution.

By the toy data shown in the right panel, we demonstrate the difference between our algorithm and k-means. The three clusters are generated from the normal distribution with different means (black cycle) and variances (1.5, 0.5, and 0.5). They are marked by circle, square, and diamond respectively. We select three points by our algorithm (red pentagram) and k-means (yellow hexagram).

Graph Dependency

Graphs appear in many real-world applications from web hyperlinks to social networks to biological networks. Although several measures have been proposed to quantify information between two discrete or continuous variableshe Graph Dependency between the “house” binary graph and two of its subgraphs.  (e.g. conditional entropy, the dependency degree in Rough Set Theory), no proper information measure has so far been proposed to quantify how much information a graph contains about another graph. In this project, we introduce Graph Dependency (GD) as a measure of the amount of information between two graphs. Moreover, the key observation that a discrete variable can be thought of as an equivalence relation allows us to establish a connection between the proposed measure and the aforementioned traditional measures thus providing a unified algebraic view of information measures. To demonstrate the general applicability and usefulness of GD, we applied it to graphs from different fields such as biology, the World Wide Web and social networks. Here we show that GD can be used to quantify the amount of gene functional information in correlation networks and protein-protein interaction networks; the amount of categorical information in web pages and hyperlink graphs; the relationship between user trust/distrust and taste similarity.

By the left toy data, we show the graph dependency between the ''house'' graph and two of its subgraph.

As an example, by the pictures below we show how much information co-expression graphs contain about the semantic similarity graphs for the three different categories in the Gene Ontology – Biological Process, Cellular Component and Molecular Function.  GD, Pearson Correlation Coefficient (PCC) and the Conditional Entropy (CE) have been calculated separately for the four different experiments in the Yeast cell cycle dataset from Spellman et al. According to GD, microarray data contain most information about BP, followed by CC and MF. This is as expected from existing biological knowledge, since co-expression data capture the joint collaborative activity of genes (processes) and the activity of co-expressed genes is more likely to occur in the same cellular component. At the same time, co-expression data carry less information about the specific molecular function of a gene. PCC and CE fail to provide a clear ordering of the three categories of Gene Ontology.
Four different colors represent four independent experiments: α factor arrest, cdc15, cdc28, and elutriation. We employed Resnik’s semantic similarity measure to build the semantic similarity graphs.

Development of a graph-theoretic approach to predict protein function by integrating large scale heterogeneous data

In recent years, the numerous large scale sequencing projects have generated enormous amounts of sequence data. This has led to the identification of thousands of previously unseen genes whose function awaits to be characterized. A fundamental goal is therefore to identify the function of uncharacterized genes on a genomic scale. It is difficult to design functional assays for uncharacterized genes so a major challenge in bioinformatics is to devise algorithmic methods that, given a gene, can predict a hypothesis for its function that can then be validated experimentally. In this project, we focus on predicting the functions of genes in an organism whose sequences are just detected and for which no experiment has been done so none of its genes has a known function. Our methods consists of five steps: initial guesses for the functions of genes by different models such as Superfamily, BlastProDom, FPrintScan, Gene3D, HMMTigr and so on, integration of these guesses, building graphs that are transformed from STRING, integration of these graphs, and diffusion of integrated initial guess of functions of genes on the integrated graph. To evaluate the proposed method, we choose E. Coli as a test organism, in which some genes are characterized and some experiments have been done. We only used the sequence information of genes in E. Coli (we do not use any experiment information), and the known functions of some genes are only used as gold standard. We followed G. Obosinski et al.'s setting that subdivide predictions into four groups on the basis of the number of genes assigned to a GO terms (3 to10, 11 to 30, 31 to 100, and 101 to 300).  The average AUC scores for GO terms in these four groups are 0.80, 0.75, 0.73 and 0.77 respectively. 

Methods for Semantic similarity

Several measures have been recently proposed for quantifying the functional similarity between gene products according to well-structured controlled vocabularies where biological terms are organized in a tree or in a directed acyclic graph (DAG) structure. However, existing semantic similarity measures ignore two important facts. First, when calculating the similarity between two terms, they disregard the descendants of these terms. While this makes no di fference when the ontology is a tree, we shall show that it has important consequences when the ontology is a DAG -- this is the case, for example, with the Gene Ontology (GO). For example, see the left figure, the similarity between M and N should be larger than the similarity between I and L. Second, existing similarity measures do not model the inherent uncertainty which comes from the fact that our current knowledge of the gene annotation and of the gene ontology structure is incomplete. Here we propose a novel approach based on downward random walks that can be used to improve any of the existing similarity measures to exhibit these two properties.

We show that our approach greatly improves three commonly used semantic similarity measures which are based on the work of Resnik, Lin and Jiang. We applied these improved measures to the Gene Ontology annotations of the yeast Saccharomyces cerevisiae, and tested how they correlate with sequence similarity, mRNA co-expression and protein-protein interaction data. Our results consistently show that the use of downward random walks leads to more reliable similarity measures.