Royal Holloway, University of London

ANALYZING CLUSTER STRUCTURES IN BIOMOLECULAR DATA
Dr Boris Mirkin, German National Cancer Center, Heidelberg, Germany, and DIMACS, Rutgers, The State University of New Jersey, USA

Abstract: Many biomolecular data analysis problems fall within the machine learning/artificial intelligence framework; however, the specifics of biomolecular data and processes make a difference. I will discuss two combinatorial clustering problems and the issues that these specifics raise: (1) aggregating single-gene phylogenetic trees into an evolutionary tree of species, and (2) distinctively describing a protein fold/family with regard to a larger protein class.

(1) Aggregation of phylogenetic trees: the major question so far has been that of finding a biologically motivated measure of difference between trees. Such a measure can be suggested based on the assumption that the mechanism of gene duplication is responsible for the differences. Three different approaches have been developed to model gene duplication events in terms of phylogenetic trees: (a) copying duplication (R. Page 1994); (b) combinatorial tree-mapping (M. Goodman et al. 1979, R. Guigo et al. 1996); and (c) annotating duplication (B. Mirkin et al. 1995). These approaches have been explored mathematically and proven to be, to an extent, equivalent. This makes it possible to exploit each approach in its niche: (a) in recreating a gene duplication history; (b) in tree mapping and counting distortions; and (c) in finding clusters of co-evolving genes. Recently acquired information on whole genomes shows in which direction models of gene evolution should be modified to make them more consistent with the new biomolecular data.
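
To illustrate the tree-mapping idea in (b), here is a minimal sketch, not the cited authors' implementation. It assumes trees are written as nested tuples, that each gene-tree leaf is labelled with the species it was sampled from, and that a gene-tree node counts as a duplication when it maps to the same species-tree node as one of its children. All function names are hypothetical.

```python
def leaf_set(tree):
    """Return the frozenset of leaf labels below a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def species_clusters(species_tree):
    """List the leaf sets of all nodes of the species tree."""
    clusters = [leaf_set(species_tree)]
    if not isinstance(species_tree, str):
        for child in species_tree:
            clusters.extend(species_clusters(child))
    return clusters


def map_to_species(gene_node, clusters):
    """Map a gene-tree node to the smallest species cluster containing its species."""
    species = leaf_set(gene_node)
    return min((c for c in clusters if species <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes that map to the same species node as one of their children."""
    clusters = species_clusters(species_tree)

    def visit(node):
        if isinstance(node, str):
            return 0
        here = map_to_species(node, clusters)
        is_duplication = any(map_to_species(child, clusters) == here for child in node)
        return int(is_duplication) + sum(visit(child) for child in node)

    return visit(gene_tree)


# Illustrative data only: three species and two hypothetical gene trees.
species_tree = (("A", "B"), "C")
congruent_gene_tree = (("A", "B"), "C")
conflicting_gene_tree = (("A", "C"), "B")
print(count_duplications(congruent_gene_tree, species_tree))
print(count_duplications(conflicting_gene_tree, species_tree))
```

Under these assumptions, the number of inferred duplications can serve as a simple measure of how much a gene tree disagrees with a candidate species tree.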

(2) Distinctive description of a protein subgroup: for this problem we rely on the following specifics: (a) two feature space templates should be explored, one based on sequence, the other on spatial structure; (b) the subgroup is to be described versus the whole set, not versus the rest, as is usual; (c) the description must be simple and logical; and (d) the data available may be biased. A description technique has been developed based on properties of the square-error clustering criterion (B. Mirkin 1999) and involving feature transformation and resampling. In some experimental applications, e.g. sequence-based description of TIM-barrel folded proteins, the technique shows relatively good performance.
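
As a rough illustration of point (b), the sketch below scores features for a subgroup against the whole set rather than against the rest. It is an assumption-laden example, not the technique from the talk: each feature is centred at the whole-set mean and scaled by its range, and the score of a feature is the subgroup size times the squared subgroup mean, which is that feature's share of the square-error data scatter accounted for by the subgroup. The function name and the toy data are hypothetical.

```python
def subgroup_feature_scores(rows, in_subgroup):
    """Score each feature by N_S * (subgroup mean of the standardized feature)^2."""
    n = len(rows)
    n_features = len(rows[0])
    members = [row for row, flag in zip(rows, in_subgroup) if flag]
    scores = []
    for v in range(n_features):
        column = [row[v] for row in rows]
        grand_mean = sum(column) / n                # centre at the whole-set mean
        spread = (max(column) - min(column)) or 1.0  # range scaling; 1.0 if constant
        sub_mean = sum((row[v] - grand_mean) / spread for row in members) / len(members)
        scores.append(len(members) * sub_mean ** 2)
    return scores


# Toy data: feature 0 separates the subgroup from the whole set, feature 1 does not.
rows = [[0, 5], [0, 6], [10, 5], [10, 6]]
in_subgroup = [True, True, False, False]
print(subgroup_feature_scores(rows, in_subgroup))
```

Features with the largest scores are the natural candidates for a short description of the subgroup; the feature transformation and resampling steps mentioned above are not covered by this sketch.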

This seminar was held at the Department of Computer Science, Royal Holloway, University of London on 26 April 1999.

Last updated Mon, 15-Dec-2008 14:53 GMT / PS
Department of Computer Science, University of London, Egham, Surrey TW20 0EX
Tel/Fax : +44 (0)1784 443421 /439786