GENOME SCALE PREDICTION OF PROTEIN FUNCTIONAL CLASS FROM SEQUENCE USING DATA MINING
Dr Ross D King, Department of Computer Science, The University of Wales, Aberystwyth
Abstract: The ability to predict protein function from sequence is a central research goal of molecular biology. Such a capability would greatly aid the biological interpretation of genomic data and accelerate its medical exploitation. For the genomes sequenced so far, function can typically be assigned to only 40-60% of the Open Reading Frames (ORFs). The new science of functional genomics is dedicated to discovering the function of the remaining ORFs, and to further detailing the function of genes that already have an assigned function. I will present a novel data-mining approach for predicting biological function from sequence. It combines Warmr, an inductive logic programming (ILP) version of the Apriori algorithm, with decision tree rule learning.
The effectiveness of this approach will be demonstrated on the M. tuberculosis and E. coli genomes. Function is predicted for 65% of the M. tuberculosis ORFs with no currently assigned function, and for 24% of those in E. coli, with an estimated accuracy of 60-80% (depending on the level of functional assignment). Biologically interpretable rules are identified that can predict protein function in the absence of identifiable sequence similarity. The possible causal explanations for these rules give insight into the evolutionary history of M. tuberculosis and E. coli.
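As a rough illustration of the pattern-mining step named in the abstract, the sketch below runs a plain propositional Apriori search over a toy set of ORF descriptors and then turns each frequent pattern into a binary feature of the kind a decision tree learner could consume. It is not Warmr, which mines first-order (relational) patterns via ILP; the descriptor names, the support threshold, and the function name `frequent_itemsets` are all hypothetical and chosen only for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search: return {frozenset(items): support count}."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count the support of each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical ORF descriptors (toy data, not taken from either genome).
orfs = [
    {"len=long", "hydrophobic", "homolog_in_bacteria"},
    {"len=long", "hydrophobic"},
    {"len=short", "homolog_in_bacteria"},
    {"len=long", "hydrophobic", "homolog_in_bacteria"},
]

patterns = frequent_itemsets(orfs, min_support=3)

# Each frequent pattern becomes one 0/1 attribute per ORF; a decision tree
# learner would then be trained on these rows against known functional classes.
ordered = sorted(patterns, key=lambda s: (len(s), sorted(s)))
features = [[int(p <= orf) for p in ordered] for orf in orfs]
```

In this toy run the only frequent pattern with more than one descriptor is the pair "len=long" and "hydrophobic", which occurs in three of the four ORFs.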
This seminar was held at the Department of Computer Science, Royal Holloway, University of London on 26 June 2000.