City University, UK - University of Bergen, Norway
January 1996
Department of Computer Science, City University, London UK
David Gilbert
Department of Informatics, University of Bergen, Bergen, Norway
Inge Jonassen and Ingvar Eidhammer
The aim of this proposal is to design a constraint-based language for the description of patterns in genetic databases, to design algorithms for recognising and learning such patterns, to implement these algorithms in a constraint-based programming language, and to test the resulting system on a substantial set of genetic databases.
During the last decade molecular biologists have focussed more and more attention on finding patterns in biosequences. There are several reasons for this interest. For instance, if we can find a common pattern present in DNA sequences believed to be related to gene regulation, then finding the same pattern elsewhere in DNA suggests that the respective part of the DNA may also play a role as a regulatory region [Sta82]. Finding common patterns in protein sequences helps in predicting their three-dimensional structure [ea86].
One of the many problems in research related to patterns in biosequences is finding an appropriate language for their description in specific applications, or, in other words, finding the best hypothesis space. Biologists have introduced quite a large number of languages, each of which differs from the others in more or less important ways, and they employ a variety of techniques to discover these patterns in databases [BJEG95]. Up to now, computer scientists have paid relatively little attention to these "bio-pattern" languages.
The intention of the research proposed in this application is to apply the latest techniques in computer science to the design and implementation of a bio-pattern language system. We will exploit constraint logic programming, which is a significant and active area of research in computer science. Constraint logic programming is a development of logic programming, and programs written using constraint programming techniques can be more declarative and elegant (hence more maintainable) than those written in standard imperative languages, without sacrificing efficiency. Constraint programming has been applied to constraint satisfaction problems, constraint satisfaction in AI, concurrent programming, dynamic constraint satisfaction, deductive databases and object orientation.
We intend to develop a formal constraint-based pattern language based on the ideas presented by Brazma and Gilbert [BG95], which are themselves a development of Staden's pattern language [Sta90]. Their language can be considered to be a motif-based pattern language enhanced with constraints over the distance between the motifs. We envisage that the definition of the language will be extended to constraints over strings, along the lines suggested by Walinsky [Wal89]. We further intend to cast the original language of Brazma and Gilbert within the framework of constraint logic programming by the introduction of logical constraints in its definition. We also plan to construct a semantics for the language that includes reasoning over the consistency of language expressions. Finally, we intend to design several algorithms for recognising and learning the patterns, to implement these algorithms in a constraint-based programming language, and to test the resulting system on a substantial set of genetic databases.
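As a rough illustration of what a motif-based pattern with distance constraints involves, the following Python sketch searches a sequence for an ordered list of motifs in which the gap between consecutive motifs must lie within given bounds. It is not the syntax or semantics of the language in [BG95]: the names find_all and match_pattern, the representation of a pattern as a list of motifs plus a list of (min, max) gaps, and the use of exact string matching are all assumptions made for this example only.

```python
from typing import Iterator, List, Tuple


def find_all(seq: str, motif: str) -> List[int]:
    """Return every start index at which motif occurs in seq (overlaps allowed)."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def match_pattern(
    seq: str, motifs: List[str], gaps: List[Tuple[int, int]]
) -> Iterator[Tuple[int, ...]]:
    """Yield the start positions of each way the motifs occur in order in seq.

    gaps[i] is a hypothetical (min, max) constraint on the number of symbols
    between the end of motifs[i] and the start of motifs[i + 1].
    """
    if len(gaps) != len(motifs) - 1:
        raise ValueError("need exactly one gap constraint per adjacent motif pair")

    def extend(i: int, prev_end: int, chosen: List[int]) -> Iterator[Tuple[int, ...]]:
        if i == len(motifs):
            yield tuple(chosen)
            return
        for pos in find_all(seq, motifs[i]):
            if i > 0:
                lo, hi = gaps[i - 1]
                # Distance constraint between the previous motif and this one.
                if not lo <= pos - prev_end <= hi:
                    continue
            yield from extend(i + 1, pos + len(motifs[i]), chosen + [pos])

    yield from extend(0, 0, [])


if __name__ == "__main__":
    sequence = "AATATAAACGTGCAATCC"
    # "TATA" followed, 5 to 10 symbols later, by "CAAT".
    print(list(match_pattern(sequence, ["TATA", "CAAT"], [(5, 10)])))
```

In a constraint logic programming setting the gap bounds would be stated declaratively as constraints over position variables and resolved by the solver, rather than checked by explicit enumeration as in this sketch.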
Our language will be designed with a view to facilitating its implementation in ECLiPSe, and we intend to take advantage of the powerful debugging facilities of this system. ECLiPSe (ECRC Logic Programming System) is one of several constraint programming systems available; it combines the functionalities of several ECRC systems, including Sepia, MegaLog and CHIP, and is widely used both in academia and commercially. Both academic sites have access to the ECLiPSe system.