David Gilbert
Leverhulme Research Fellow Report
"Pattern discovery for protein classification"
August 2000 - January 2001
Motivation
The aim of this project was to develop a method to classify proteins according to structural patterns; understanding how different proteins are related should help scientists assign function to protein structures, and eventually benefit the process of developing new drugs. This research falls within the area of bioinformatics, where techniques from computer science are used to solve problems in biology.
Genetic material, normally DNA, contains sequences of nucleotides called "genes" which code for proteins via a process of transcription to RNA and translation to chains of amino-acids which fold up to form proteins. One of the aims of research in bioinformatics is to determine the function of genes, or more precisely the function of the proteins that they encode. Theraputic drugs are often small molecules, called ligands, which interact with proteins, and work by binding to proteins in specific ways. The development of drugs can be a very long and expensive process; bioinformatics can speed up this process by giving an insight into how proteins function and how they interact with ligands and thus narrow the range of potential drug targets.
Protein sequences can be more or less automatically computed from genes (DNA sequences). If the sequences of two genes are quite similar then it is likely that both the proteins derived from both genes have a similar function. Thus we can guess the function of a newly discovered gene if it has sequence similarity to a gene with known function. However, two proteins can have similar structures and functions even though their sequences are very different. Thus another way to determine the function of a gene is to determine the structure of the protein for which it codes, and then compare that with structures of known function. Unfortunately structure determination is a lengthy experimental process, and it is less straightforward to compare structures than it is to compare sequences.
Protein structure descriptions are stored in the Protein Data Bank, and form the basis of several classifications, including CATH at University College London. Proteins are grouped in CATH according to structural information using structural, functional and evolutionary relationships. The aim of my research was to develop a method to automatically extract high-level topological structural patterns characterising groups of proteins (for example families and sub-families in CATH), and to investigate the use of these patterns to chart evolutionary relationships amongst the proteins and as a classification method in their own right. The idea was that use of such high-level descriptions would permit the rapid characterisation of proteins, and could be used to speed the classisification of newly determined structures. These descriptions are called TOPS diagrams (for individual proteins) and patterns (for groups of proteins).
Situation at start of fellowship
I had already developed some basic components of the system by the start of the fellowship. Briefly, these were a method to match TOPS patterns to diagrams, and to automatically discover patterns for 100% of a set of proteins. I also had some data sets of TOPS protein descriptions. What needed to be developed were a method to graphically display patterns and where they matched proteins, and a method to discover patterns which better described sets of proteins with large structural diversity. The motivations were firstly that structural biologists were unable to interpret the raw, non graphical patterns that I had hitherto produced. Secondly my existing method performed very badly on diverse structures, producing patterns with little information content. What was needed to overcome this was to develop a method to automatically divide a group of proteins into sub-groups, find a characteristic pattern for each group and evaluate the goodness of the division and the patterns. Finally, I needed to gather suitable data sets on which to test my programs, and to develop an interface to my system so that it could be used over the Internet.
Work done during the fellowship
Graphical representation of TOPS diagrams and patterns
I developed a method to generate graphical representations of TOPS diagrams using layout information pre-computed by another program (written by David Westhead). I identified GIFs as being the type of graphical files best suited for rapid display on standard Internet browsers, located software to generate these files, and developed a program to generate graphical descriptions in a suitable format. I then developed a method to generate graphical representations of TOPS patterns (there was no precomputed information for these) and how they matched TOPS. Finally I designed web pages to display the patterns and diagrams, and wrote a program to automatically generate these pages.
Improved pattern discovery algorithm
First of all I improved the basic patterns that could be automatically discovered using my existing method. Essentially, I expanded the patterns so that they contained more information, replacing ‘wild cards’ by more specific information. I then evaluated these improved basic patterns on test data sets.
Secondly I developed a clustering methodology to automatically divide a group of proteins into sub-groups, the division being based on structural information, and to discover a pattern for each group. The novelty of this method is that it is not based on hierarchical clustering but directly on pattern discovery. An approach based on hierarchical clustering would have required performing pairwise structural comparison over the group (comparing each member with every other member), using the values generated to do pairwise clustering, identifying significant clusters and then discovering a pattern for each cluster. One disadvantage of such an approach is its computational complexity which is in the order of the square of the number of items in the group due to the pairwise comparison.
My clustering method is based on discovering a pattern for a subset of the initial group, and determining the membership of that subset by the usefulness of the pattern. The process is then repeatedly applied to the remainder of the initial group, partitioning it into further subsets each associated with a characteristic pattern. This meant that I had to extend the pattern language to unions (disjunctions) of simple patterns, since a group of proteins can now be associated with a pattern union. The advantages of this approach are that the complexity is lower than that of pairwise comparison (in the best case being linear in the number of members of the group) and that a pattern is automatically associated with each subset. The technical issues that I had to solve were how to rate the usefulness of a pattern, and when to stop the division into subsets. The usefulness of a pattern is determined by a combination of how well it describes a set of proteins (measured by the ‘compression’ obtained by using the pattern) and how many of the initial set of proteins it matches (measured by the ‘coverage’). The issue of ‘when to stop’ is harder to resolve, and I base this at the moment on experience gained from practical examples.
Compiling and analysing example sets
I put a lot of effort into compiling sets of example protein domains. The requirements were for the domains to be related structurally, have some structural diversity, and for which some functional information is known. The aim was to relate the partitions of the sets resulting from my method to partitions made using other structural or functional based methods. For each group I analysed the sequence identity, structural similarity at the atomic coordinate and topological levels, functional information using enzyme (EC) numbers and reaction descriptions. I then used my method to divide each group into clusters with associated patterns. The results indicated that my method worked well when the target group of proteins was structurally diverse and distinct functions could be associated with each group. The correlation was best with structural classes identified by CATH, and weakest with EC numbers; this was to be expected since in general EC numbers do not correlate well with structure.
I had planned to use some of the example data sets to investigate protein evolution. The idea was to derive evolutionary rules using proteins whose evolutionary relationship has been previously established. Professor Golstein at UCL suggested to determine the evolutionary relationship amongst a set of proteins by constructing a phylogenetic tree for them based on sequence, and then to derive rules for their structural relationships. This did not prove to be feasible since structures are much more conserved during evolution than their amino-acid sequences. Reliable phylogenetic trees can only be done for proteins whose sequence similarity is high (over 70%); the structures for such groups exhibit almost no topological diversity and thus there is not enough information on which to derive topological evolutionary rules. I am now looking at other approaches to derive evolutionary rules for protein topology.
Application of the technique to the CATH classification
I attempted to relate the results generated from my technique to the CATH classification system. First of all I generated patterns for CATH families and evaluated how good these patterns were in terms of how often a protein was incorrectly included in (specificity) or excluded (sensitivity) from a family (specificity) when using the patterns as discrimators. In general most patterns were good. I am still looking in more detail at those that gave poor results; this could arise from problems with my method or from the fact that the CATH classification is not well defined for certain protein families. In fact the CATH researchers asked me to suggest groupings for certain families using my method. I now have a library of TOPS patterns for CATH families, and intend to use them to help with the task of classifying new protein structures.
Secondly I created new groupings for proteins; I am now analysing these, and relating them to the CATH classification.
Evaluating the structure comparison method
I used a data set provided by Andrew Harrison of the CATH group to evaluate my structure comparison method, and to compare the results with those obtained by his method. His method uses a less abstract representation of protein structure, and gives overall better results.
Education in the biological background
As a computer scientist, I lack the appropriate training in the life sciences which is required to carry out effective research in bioinformatics. Accordingly I attended two undergraduate courses at UCL during my fellowship: Protein Structure and Enzymology, and Bioinformatics. I also attended selected lectures on molecular biology, and research seminars run in the Department of Biochemistry. These activities helped me to deepen my knowledge of biochemistry and specifically protein structure and function.
Important research conclusions
Overall, the most important research conclusions derived from my research are
Obstacles, intellectual/practical & how I overcame them
The main intellectual obstacle proved to be derive a meaningful association between evolutionary relationships and characteristic topological patterns. I am making a funding proposal for a much larger project to investigate this question.
During the project I realised that I would have benefited from access to a computing cluster which would have speeded up data testing. However such a powerful computing facilities was not available. In the event, testing just took longer!
Institution support
My home institution, City University, was very supportive; as well as receiving my full salary, I was not required to do any teaching or administration during my fellowship. The host, Professor Thornton at the Department of Biochemistry UCL was extremely generous in providing me with an office shared with just one other person, computing facilities, photocopier and library access.
Output
I am now writing several scientific research papers as a result of this project; as to be expected with project of 6 months, much of the writing up has to be done afterwards.
The results of this fellowship (knowledge, experience and methodology) will directly feed into the project "Patterns, functions and structures on a protein topology database", funded by the BBSRC and EPSRC, to be carried out at City University and joint with the School of Biochemistry, Leeds University.
Some of the ideas generated and knowledge gained during my fellowship have been incorporated in a course on Bioinformatics that I have given since my return to City University.
Activities directly related to the fellowship