proposal

Pattern discovery for protein classification

David Gilbert, Leverhulme Research Fellow
Department of Computing, City University, London
drg@soi.city.ac.uk

Host: Professor Janet Thornton FRS, Biomolecular Structure and Modelling group, Department of Biochemistry, University College London
Funding body: The Leverhulme Trust
Dates: 1 August 2000 - 31 January 2001
Report: available here

Abstract: We aim to develop a method to automatically extract topological patterns characterising protein families and sub-families in the CATH protein structure classification, and to investigate the use of these patterns to chart evolutionary relationships amongst the proteins and as a classification method in their own right. The expected benefits from such methodology are the fast characterisation of sets of proteins leading to the evaluation of possible evolutionary/functional relationships, and speed up in the insertion of new members into CATH. The methods that we will develop will also be applicable other hierarchical classifications of protein structure, such as SCOP.

Aims: to develop a method to automatically extract topological patterns characterising protein families and sub-families in the CATH protein structure classification at UCL (Orengo et al, 1997), based on the initial approach I have developed., and to investigate the use of these patterns both to chart evolutionary relationships amongst the proteins and as a classification method in their own right.

Objectives: to

develop extended TOPS patterns which are suitable for characterising proteins, and a pattern-matching algorithm where matches are weighted with probabilistic measures,
develop a method to automatically and efficiently extract patterns and rules to characterise a given set of proteins, and specifically to learn from both positive and negative examples.
develop a language for comparing patterns, that will permit the description of the differences and similarities between patterns in terms of motifs which have been inserted or deleted.
develop a graphical representation of the patterns.
derive meaningful associations between the evolutionary relationships between families and their characteristic patterns.
evaluate our pattern discovery method by testing how accurate and how fast it is in determining the correct position in the CATH hierachy for a given protein, and hence assess its usefulness in placing new structures within CATH.

The methods that we will develop will be also applicable other hierarchical classifications of protein structure, such as SCOP.

Background: The CATH hierarchy (Orengeo et al, 1997) is compiled semi-automatically and uses several different attributes to classify proteins, including sequence and structure comparison (using SSAP) at the amino-acid level, secondary structure composition and knowledge about function and evolutionary relationships. The proposed research will attempt to chart protein evolution by relating structural patterns at different levels of similarity across families within the CATH hierarchy. The conjecture to be investigated is that a topological representation of structure will in general be sufficient to generate meaningful phylogenetic information (such as trees) about CATH families.

TOPS diagrams represent protein structure at the level of topology in terms of secondary structure elements, chirality, b -sheets connectivity and form (Flores et al 1994). I have defined a TOPS pattern (motif) language based on such diagrams, and designed a method for fast pattern-diagram matching (Gilbert et al, 1999), and developed a pattern discovery method (Gilbert et al, 2000) based on repeated matching and pattern extension (Brazma et al, 1998). This operates in a time proportional to the size of the set proteins to be characterised, unlike algorithms based on variations of the (Bron and Kerbosch, 1973) maximal clique detection algorithm which is only practical for pairwise comparisons. Initial attempts to use our technique to learn patterns characterising the CATH H-level have proved encouraging, but clearly indicate the need to design a more discriminatory pattern language and discovery method. Although TOPS descriptions are not useful for describing domains with a majority of a -helices, they do describe well domains with a high proportion of b -strands and such structures form 75% of CATH entries; we plan to generate characteristic patterns for families at the H-level, S-level and below in CATH .

Additionally, there is a need to speed up the process whereby newly determined structures are placed within the CATH hierarchy. We will investigate the usefulness of the characteristic topological patterns in giving a fast indication of where a new structure should be placed; such an approach would be in addition (and hence a front-end) to the existing methods used by the CATH group.

Methodology:

Extend TOPS patterns to incorporate stochastic information (e.g. on number and type of inserts, and secondary structure length), disjunctive and negative operators, global constraints (e.g. number of parallel/anti-parallel strands), and super-secondary structure descriptions (sheets, barrels, sandwiches). This is necessary for analysis of sets of proteins that contain "noisy" data (i.e. consist of several weakly related groups of proteins) and for characterisation of sets that contain positive and negative examples.

Develop a graphical representation of these patterns.

Adapt the pattern matching and discovery algorithms to process the extended patterns and also to obtain versions that are as efficient as possible for large sets of TOPS diagrams.

Find characteristic patterns for families at the H-level and below in CATH.

Develop a method to compare these patterns, whose output should indicate differences in terms of insertions and deletions of motifs, and attempt to relate these differences to known evolutionary relationships.

Evaluate the goodness of the pattern discovery method using (i) jack-knife techniques over the CATH hierarchy, (ii) comparison with existing methods in allocating CATH numbers to domains.

Outcome: publications in international journals, a software suite useable at UCL and with a Web interface.

Bibliography

Brazma, I. Jonassen, I Eidhammer, and D. R. Gilbert. Approaches to the automatic discovery of patterns in biosequences. Journal of Computational Biology, 5(2):277-303, 1998.

C. Bron and J. Kerbosch (1973). Algorithm 457: Finding All Cliques of an Uundirected Graph. CACM, 16(9):575-577.

T.P. Flores, D.M. Moss, and J.M. Thornton. An algorithm for automatically generating protein topology cartoons. Protein Engineering, 7(1):31-37, 1994

D. R. Gilbert, D. R. Westhead, N. Nagano, and J. M. Thornton (1999a). Motif-based searching in tops protein topology databases. Bioinformatics, 15(4):317-326.

David Gilbert, David Westhead, Juris Viksna and Janet Thornton, Topology-based protein structure comparison using a pattern discovery technique, Proceedings of the AISB-00 Symposium on AI in Bioinformatics , ISBN 1 902956 12 X, pp 11-17, published by The Society for the Study of Artificial Intelligence and the Simulation of Behaviour, 19-20 April 2000.

David Gilbert, David Westhead, Janet Thornton and Karine Yvon, Une technique déclarative pour filtrer des motifs topologiques de protéines, JFPLC2000: Neuvièmes Journées Francophones de Programmation Logique et Programmation par Contraintes, pp 165-185, ISBN 2-7462-0147-X, (in French) [HTML] [Postscript] July 2000

C. A. Orengo, A. D. Michie, D. T. Jones, M. B. Swindells, and J. M. Thornton (1997). CATH - a hirearchic classification of protein domain structures. Structure, 5(8):1093-1108.