Spectral methods for clustering protein sequences
Clustering protein sequences based on their evolutionary relationship is important for sequence annotation as structural and functional relationships can potentially be inferred. Most of the existing methods are based on simply thresholding a measure related to the distance between sequences. I mapped this problem into that of clustering the nodes of a weighted undirected graph in which each node corresponds to a protein sequence and the weights on the edges correspond to a measure of distance between two sequences. The goal is to partition such a graph into a set of discrete clusters whose members are homologs. Formulating the problem in this way it was possible to analyze the limitations that one might expect from earlier methods and I was able to provide a theoretical explanation of why spectral clustering methods should perform better for this problem. Then, I introduced a spectral method that partitions the graph into clusters by considering the random walk formulation on the graph, and analyzing the perturbations to the stationary distribution of a Markov relaxation process. This is done by looking at the eigenvectors of the Markov transition matrix. The weights on the edges of the graphs that I used are a nonlinear transformation of the pairwise BLAST e-values, and the parameters of such nonlinear transformation were learned. I tested this algorithm on difficult sets of proteins whose relationships are known from the SCOP database [1, 3, 4]. My method correctly identified many of the family/superfamily relationships. I found that the eigenvalue spectrum and in particular, the eigengaps provide useful indication on the number of clusters in the dataset. Using this approach the results obtained were much better than those obtained using other methods on the same datasets. In fact, I consistently observed that, the number of clusters obtained for a given set of proteins was close to the number of superfamilies in that set; there were fewer singletons; and the method correctly grouped most remote homologs. In my experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better (on average, improvements were: 84% over Hierarchical Clustering, 34% over Connected Component Analysis (similar to GeneRAGE), and 72% over TribeMCL. In collaboration with Rajkumar Sashidaran (Mark Gerstein's lab,
 A. Paccanaro, J. A. Casbon, M. A. S. Saqi (2006). Spectral Clustering of Proteins Sequences Nucleic Acids Research 2006 Mar 17;34(5):1571-80 [pubmed]
Sasidharan, M Gerstein, A Paccanaro
(2006) Spectral Clustering of Protein
Sequences Using Sequence-Profile Scores Proceeding of ICNPSC 2006 —
International Conference on Neural Parallel and Scientific
Atlanta, USA [pdf]
Paccanaro, C.Chennubhotla, J. A. Casbon, M. A. S. Saqi (2003). Spectral Clustering of Protein Sequences.
IJCNN 2003 - International Joint Conference on Neural Networks,
Chennubhotla, A. Paccanaro
(2003). Markov Analysis of Protein
Sequence Similarities Lecture Notes in Computer Science series,
278-286, Springer-Verlag. 14 [pdf]
Techniques for integrating different protein-protein interaction experiments
The ultimate goal of functional genomics is to discover the functions of all proteins in a genome. Proteins carry out their molecular functions by interacting with other molecules, mainly other proteins. Thus, protein interactions provide an important clue to the function of proteins. Many interaction datasets, mostly from large-scale experiments are available now. However, the quality of protein interaction experiments varies greatly with the researcher who performs the experiment and with the particular technique used. Therefore, it is important to be able to integrate the results of different experiments into a more reliable measurement. Together with Haiyuan Yu, I developed a system that provided such reliable measurement by combining the results of different experiments through a set of learned weights. Using the hand-curated protein complexes in the MIPS reference database we trained a system to assign a probability that each pairwise interaction is true based on experimental reproducibility and mass spectrometry scores from the relevant purifications. Thus from the raw experimental data we obtained a protein–protein interaction network as an undirected weighted graph in which individual proteins are nodes and the weight of the edge connecting two nodes is the probability that the interaction is correct. I am currently attempting to improve these results by including into the measurement information about the network topology. [This work is done in collaboration with the laboratory of Andrew Emili's lab,
Krogan et al, (2006), Global landscape of protein complexes in the yeast Saccharomyces cerevisiae Nature, 2006, Mar 30,440(7084):637-43 [pubmed]
Methods for denoising large scale protein-protein interaction experiments
Large-scale experiments for the identification of Protein-Protein interactions are known to be error-prone. I have developed two methods for inferring protein-protein interactions which are based purely on topological properties of interaction networks observed experimentally and can correct many of the errors present in large-scale experiments. The basic idea of the first method (joint work with Haiyuan Yu, Harvard University and Valery Trionov, Yale University) derives from the way in which large-scale experiments are carried out and particularly from the matrix model interpretation of their results. In these experiments, one protein (the bait), is used to pull out the set of proteins interacting with it (the preys), i.e. its protein complex, in the form of a list. When such lists differ only in a few elements, it is reasonable to assume that this is because of experimental errors, and the missing elements should therefore be added. Each list can be represented as a fully connected graph in which proteins occupy the nodes. Then the problem of identifying lists that differ in only a few elements is equivalent to finding a clique with a few missing edges, which we called a "defective clique". Therefore the algorithm searches the network for defective cliques (i.e. nearly complete complexes of pairwise interacting proteins) and predicts the interactions that complete them. The second method (which I am currently refining), computes a diffusion distance between each pair of proteins and then infers an interaction when such distance is below a given threshold. Both methods have been shown to have a very good predictive performance, thus allowing the correction of many errors present in large-scale experiments.
H. Yu, A.
Paccanaro, V. Trifonov, M. Gerstein (2006). Predicting
by completing defective cliques. Bioinformatics 2006 Apr
Paccanaro, V. Trifonov, H. Yu, M. Gerstein (2005). Inferring
Protein-Protein Interactions Using
Interaction Network Topologies. — IJCNN 2005 - International Joint
Conference on Neural Networks,
Prediction of protein-protein interactions
information from different genomic features, ranging from co-expression relationships tMachine learning techniques have often been used for genomic data integration. We used Naive Bayes to predict genome-wide protein-protein interactions in yeast by integratingo similarity of phylogenetic profiles. At a certain level of sensitivity the predictions were more accurate than the existing high-throughput experimental dataset. We explored the limits of genomic data integration, assessing the degree to which the predictive power increases with the addition of more features. I am currently improving these results by applying recently developed machine learning algorithms to this problem.
L. J. Lu,
Y. Xia, A. Paccanaro, H. Yu, M. Gerstein
(2005). Assessing the Limits of
Genomic Data Integration for Protein-Protein Interactions Genome
Jul 2005; 15: 945 – 953 [pubmed]
Prediction of gene essentiality from genomic features
Predicting essential genes remains a main goal in directed drug design for antimicrobial or antifungal targets. Currently, most essential gene prediction is performed by homology searches to organisms where essentiality is known and has been experimentally tested. While quick and easy, this system is also simplistic. Together with Michael Seringhaus (Mark Gerstein's Lab, Yale University), we aimed at improving the efficacy of such prediction through the integration of genomic-scale data, and the application of machine learning techniques. We trained a classification system on S. cerevisiae, where the Saccharomyces Genome Deletion project has ascertained essentiality for 95%+ of the genome. For each gene in the organism, we collected a set of genomic features – some derived from sequence information, others from functional genomics experiments. We used these data to learn a system that can predict essential genes in S. cerevisiae. We then applied this system to three recently-sequenced yeast genomes (S. bayanus, S. mikatae, and C. albicans) for which essential genes have not been experimentally identified. We then compared our predictive engine to a simple BLAST homology search, and a subset of our putative essential candidates in S. bayanus and S. mikatae were tested with knockouts in vivo. We were able to demonstrate for the first time that it is possible to learn traits associated with essential genes in yeast species and to use these features in a predictive manner. Our approach therefore shows promise for the identification of drug targets in novel and pathogenic species. We are currently continuing this work by studying the relative importance and effects of the different types of features on the prediction. [This work is done in collaboration with the laboratories of Michael Snyder and Mark Gerstein,
M Seringhaus, A Paccanaro, A Borneman, M Snyder and M Gerstein (2006) Predicting Essential Genes in Fungal Genomes, Genome Research, 2006, 16 (9): 1126-35 [pubmed]
Predicting protein function in E. Coli
We performed an extensive proteomic survey using affinity-tagged E. coli strains and generated comprehensive genomic context inferences to derive a high-confidence compendium for virtually the entire proteome. We used an N-step diffusion method to assign un-annotated genes to specific biological processes.
[This work is done in collaboration with the laboratory of Andrew Emili,
P. Hu, S. C. Janga, M. Babu, J. J. D’iaz-Mej’ia, G. Butland, W. Yang, O. Pogoutse, X. Guo, S. Phanse, P. Wong, S. Chandran, C. Christopoulos, A. Nazarians-Armavil, N. K. Nasseri, G. Musso, M. Ali, N. Nazemof, V. Eroukova, A. Golshani, A. Paccanaro, J. F. Greenblatt, G. Moreno-Hagelsieb, and A. Emili Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins PLoS biology, vol. 7, iss. 4, p. 1000096, 2009].
[this paper was highlighted in Nature Methods 2009; 6, 402-403]
Quantifying environmental adaptation of metabolic pathways in metagenomics
Recently, approaches have been developed to sample the genetic content of heterogeneous environments (metagenomics).
We investigated how these sequences link distinct environmental conditions with specific biological processes.
We focused on how the usage of particular pathways and subnetworks reflects the adaptation of microbial
communities across environments and habitats. We introduced a novel approach to relate multiple, continuously
varying factors defining an environment to the extent of particular microbial pathways present in a geographic site.
Furthermore, we developed a method to identify ensembles of weighted pathways which we called metabolic footprints
that are predictive of their environment and could be used as biosensors.
[This work is done in collaboration with the laboratories of Mark Gerstein and Michael Snyder,
T. A. Gianoulis, J. Raes, P. V. Patel, R. Bjornson, J. O. Korbel, I. Letunic, T. Yamada, A. Paccanaro, L. J. Jensen, M. Snyder, P. Bork, and M. B. Gerstein Quantifying environmental adaptation of metabolic pathways in metagenomics, Proceedings of the National Academy of Sciences, vol. 106, iss. 5, pp. 1374-1379, 2009]. [pubmed]
[this paper was highlighted in Science 2009; 323 (5918) and Nature Genetics 2009; 41 (275)]