Royal Holloway, University of London

MSc by Research in Computer Science

Bioinformatics: Biological Sequence Analysis Strand

Course Overview

The course consists of three parts:

  • supervisions on sequence analysis, machine learning, background molecular biology, and the types of data available
  • assessed coursework
  • a research project.
It is expected that the supervisions will be tailored to suit the needs and interests of the student. It is envisaged that only a few students will follow this course at any time, so that all teaching will be by individual or small-group supervision.


There is a large and increasing amount of biological sequence data available. Analysing biological sequences requires both biological knowledge and advanced computational techniques. It is therefore not generally taught at undergraduate level.

The course is intended either for students with biological knowledge and an aptitude for computation, or for students with computational knowledge and an interest in molecular biology. The course aims to give the student a sufficient knowledge of current problems and methods in sequence analysis to be able to choose and deliver a good research project in the area.


At the end of the course, the student should have an understanding of

  • the nature and origin of biological sequence data
  • current techniques for modelling, searching, and annotating this data
  • machine learning techniques relevant to sequence analysis.
The student should be able to plan and carry out a significant research project on biological sequence data using the techniques learned.

Provisional Syllabus

  1. Background in Molecular Biology.

    • DNA, RNA, proteins, genetic code, transcription, translation, RNA editing. Structure of genes.
    • Gene expression. Promoters. Examples of genetic regulation and genetic cascades.
    • Structure and evolution of the genome.
    • Brief overview of experimental techniques of molecular biology.

  2. Modelling, analysis, searching, and alignment of biological sequences.

    Topics covered will include: criteria and methods for sequence alignment; hidden Markov models for sequence alignment and characterisation of sequence families; phylogenetic trees; RNA structure analysis and alignment using context-free grammars.

    This section of the course will be based on the book by Durbin et al. cited below, which is a comprehensive recent tutorial text.

  3. Techniques of machine learning.

    • Introduction to neural networks.
    • Introduction to support vector machines and other maximal margin methods.
    • Kernels for sequence comparison.
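
As an illustration of the last topic, one simple kernel for sequence comparison counts the substrings of length k (k-mers) that two sequences share. The following is a minimal sketch of such a "spectrum"-style kernel; the function names and the default k = 3 are illustrative assumptions, not part of the course material.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of sequences a and b."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen, so k-mers that are
    # absent from b contribute nothing to the sum.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


if __name__ == "__main__":
    print(spectrum_kernel("ACGTACGT", "TACGTTAC", k=3))
```

Because the value is an inner product of count vectors, it can be passed to any kernel method, such as a support vector machine, without building the vectors explicitly.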

Provisional Coursework

In addition to question sheets accompanying the supervisions, students will be required to complete two substantial pieces of assessed coursework.

  1. Search public databases for sequences related to given protein sequences. The search should cover both protein databases (for related proteins) and DNA sequence databases (for related pseudogenes). The write-up should give an account of both the results obtained and the computational matching techniques used in the searches.

  2. Given a set of protein sequences, construct an alignment of the sequences, and then use the Baum-Welch algorithm to train an HMM that characterises the alignment.
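
Baum-Welch training repeatedly evaluates how probable the observed sequences are under the current model, using the forward and backward recursions. The following is a minimal sketch of the forward recursion only, not of the full training procedure. The two-state model and all of its probabilities are hypothetical values chosen for illustration.

```python
def forward_likelihood(obs, states, start, trans, emit):
    """Return P(obs | model) for a discrete HMM using the forward recursion."""
    # f[s] holds P(symbols seen so far, current state = s).
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {
            s: emit[s][symbol] * sum(f[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(f.values())


# Hypothetical toy model: illustrative numbers only.
STATES = ("match", "background")
START = {"match": 0.5, "background": 0.5}
TRANS = {
    "match": {"match": 0.9, "background": 0.1},
    "background": {"match": 0.2, "background": 0.8},
}
EMIT = {
    "match": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

if __name__ == "__main__":
    print(forward_likelihood("AACA", STATES, START, TRANS, EMIT))
```

For sequences of realistic length the products underflow, so a practical implementation would rescale the forward variables at each step or work in log space.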

Project Areas

We expect most projects to apply machine learning techniques to biological sequences. Students will make use of the Department's machine learning expertise to find methods for tackling problems in the analysis of biological sequences.

The following are examples of current research areas within which a capable student could conduct a research project:

Identification of Protein Fold Types
There is now a consensus that the three-dimensional (tertiary) structures of most proteins can be classified into a relatively small number of fold types. It is of interest to predict the fold type of a protein from its amino-acid sequence because the number of proteins with known sequences is far larger than the number of proteins with known three-dimensional structures. Machine learning techniques can be used to find classification rules for predicting the fold type from the sequence.

Identification of Sequence Similarities in Co-Regulated Genes
It is now possible to measure the levels of expression (the rate at which a gene is being used) of very many genes simultaneously. There are groups of genes, at different places in the genome, which tend to be expressed together, and may be regulated by the same transcription factor, which binds to a promoter sequence "upstream" of each gene. A possible project would be to search for common promoter sequences among genes with correlated patterns of expression.
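
A very simple starting point for such a search is to list the short substrings (k-mers) that occur in the upstream region of every gene in a co-expressed group. The sketch below assumes exact matches only, and its function name is hypothetical. A realistic project would need methods that tolerate mismatches and compare the result against background frequencies.

```python
def shared_kmers(upstream_seqs, k):
    """Return the set of k-mers that occur in every upstream sequence."""
    kmer_sets = [
        {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for seq in upstream_seqs
    ]
    # With no input sequences there is nothing to intersect.
    return set.intersection(*kmer_sets) if kmer_sets else set()


if __name__ == "__main__":
    # Made-up example sequences, for illustration only.
    print(shared_kmers(["GGTATAAC", "CTATAAGG", "TATAATTT"], 5))
```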

Identification of Exon-Intron Boundaries in Eukaryotic Genes
The coding regions of eukaryotic genes (broadly, non-bacterial genes) are not continuous, but are divided up by insertions of DNA (introns) which are excised from the messenger RNA before translation to protein takes place. Finding exactly where each intron starts and ends is therefore crucial to correct identification and analysis of a gene. Machine learning techniques have been used to identify intron boundaries in real sequences, but there is still some way to go before the technique is entirely accurate.
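
To see why the problem is hard, note that most introns in nuclear genes begin with the dinucleotide GT and end with AG. This canonical rule is background knowledge assumed here and is not stated above. The minimal sketch below lists every candidate intron that satisfies the rule; the function names and the minimum-length parameter are hypothetical.

```python
def candidate_splice_sites(dna):
    """Return (donors, acceptors) for canonical GT and AG dinucleotides.

    donors:    index of the first base of each GT
    acceptors: index just past each AG (an exclusive end position)
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(dna, min_len=4):
    """Return (start, end) pairs such that dna[start:end] is GT...AG."""
    donors, acceptors = candidate_splice_sites(dna)
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    print(candidate_introns("CCGTAAAGCC"))
```

In real genomic sequence the rule alone yields far more candidates than true introns, which is where a learned classifier over the surrounding sequence comes in.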


Victor Solovyev, Chris Watkins, Hugh Shanahan

Main References

  1. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.
  2. P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.

Last updated Fri, 23-Jan-2009 15:11 GMT / CompSci-Webmaster
Department of Computer Science, University of London, Egham, Surrey TW20 0EX
Tel/Fax : +44 (0)1784 443421 /439786