Home | Research | Publications |
Emma Steele Research
Problem overview: Bioinformatics and gene regulatory networks

Bioinformatics is a rapidly expanding research area focusing on the analysis of biological data using computers. It is often described as an intersection of research in computer science and biology. Many types of biological experiments have now been fully or partially automated, leading to more and more large and complex sets of experimental data. The emphasis of bioinformatics is on developing robust and efficient data analysis techniques and algorithms to extract more knowledge from such datasets.

One important type of biological data is gene expression levels. Simply put, we can say that a gene is ‘expressed’ when it is active during cellular processes. Another way to view gene expression is as the process whereby information within an organism’s DNA genes is translated into its traits or characteristics [1]. The process of gene expression is responsible for differences in cell behaviours and architecture. Analysing gene expression can provide insight into the causes of particular medical conditions where diseased cells behave differently to healthy cells (e.g. cancer).

Gene expression levels can now be measured simultaneously for a whole genome using technology commonly known as DNA microarrays. Each microarray is a slide which can contain thousands of gene samples (see Figure 1). Prior to automated microarray technology, measuring the expression level of just one or two genes was time-consuming and expensive. Now, expression levels for a whole genome (i.e. all genes within an organism’s DNA) can be obtained from just one experiment.
Figure 1: A DNA microarray slide (left) is scanned. Each spot represents a gene, and its expression level is represented by the colour intensity of the spot. The expression profile (right) of a gene consists of successive expression levels taken periodically during an experiment.

Scientists now have access to expression levels for thousands of genes, measured at many points during various cellular processes in many different experiments. Gene expression data is not the only source. For example, interactions between genes have been observed under different conditions, and the functions of genes have been discovered and organised into an ontological structure. Some of these data sources are shown below in Figure 2. Data from different sources is often represented in different ways, and because it is derived from (at least partially) automated experiments it is usually very noisy.
Figure 2: Integrating different data types. Gene expression data can consist of time-series of expression measurements for a set of genes through an experiment. Transcription factor binding site (TFBS) data provides a p-value (confidence value) of whether a pair of genes interact based on physical experimental evidence. The gene ontology assigns function(s) to genes based on current biological knowledge. Interactions between genes can be extracted from scientific papers using text mining techniques. All this data could be used in reconstructing gene networks.

A key goal in bioinformatics is the reconstruction of gene interaction networks. Genes can interact by activating (turning on) and repressing (turning off) the expression of one another. This is known as regulation of gene expression or gene regulation, so these types of networks are also referred to as gene regulatory networks.

The majority of algorithms for reverse-engineering gene networks are based on expression data. Clustering techniques have been popular for extracting groups of co-expressed genes (i.e. genes with similar expression profiles). However, clustering does not reveal the structure of the gene regulation process, that is, how genes interact and depend on one another.

My PhD research concentrates on using Bayesian Networks (BNs) [2] as a basis for modelling gene networks. BNs have become a popular method for modelling gene networks from gene expression data [3-5]. BNs consist of two components. The first is a graphical representation of the network consisting of nodes representing variables (in this case, gene expression values) and arrows indicating dependencies between variables. Figure 3 shows an example of a simple gene network represented using such a network. The second component is a set of conditional probability distributions, one per node, each quantifying the dependency of that node upon its parent (influencing) nodes.
Gene expression is considered to be a stochastic mechanism [6] so it is well-suited to probabilistic modelling.
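The two components can be illustrated with the three-gene network of Figure 3 (genes A and B influencing gene C). The following is only a minimal sketch: it assumes each gene's expression has been discretised to off (0) or on (1), and every probability value is invented for illustration rather than taken from any dataset.

```python
from itertools import product

# Component 1 (structure): each gene mapped to its parent genes.
# A and B have no parents; C depends on both A and B.
parents = {"A": (), "B": (), "C": ("A", "B")}

# Component 2 (parameters): P(gene is on | states of its parents).
# All numbers below are hypothetical.
p_on = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.05, (0, 1): 0.40, (1, 0): 0.50, (1, 1): 0.90},
}

def joint_probability(state):
    """Return P(A, B, C) for a state such as {"A": 1, "B": 0, "C": 1}.

    The joint distribution factorises over the graph:
    P(A, B, C) = P(A) * P(B) * P(C | A, B).
    """
    prob = 1.0
    for gene, gene_parents in parents.items():
        parent_state = tuple(state[p] for p in gene_parents)
        p = p_on[gene][parent_state]
        prob *= p if state[gene] == 1 else 1.0 - p
    return prob

# Sanity check: the probabilities of all 8 joint states sum to 1.
all_states = [dict(zip("ABC", bits)) for bits in product((0, 1), repeat=3)]
total = sum(joint_probability(s) for s in all_states)
```

Learning a BN from expression data amounts to choosing both the `parents` structure and the entries of the probability tables; the sketch fixes both by hand purely to show what is being learnt.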
Figure 3: Bayesian network representing a gene interaction network. Each node represents the expression value of a gene (identified as A, B and C). Arrows (arcs) indicate dependencies between nodes. In this network the expression value of Gene C depends on that of genes A and B. The conditional probability distribution associated with Gene C (not shown) will quantify the extent of dependencies.

Research progress

During the first year of my PhD I focused on researching gene regulatory network discovery using gene expression data. This entailed a thorough literature review and a practical investigation of BN learning from gene expression data, which concerned learning regulatory gene networks for muscular dystrophy. This research used a gene expression dataset containing data from different types of muscular dystrophy as well as healthy cells. Discovery of regulatory structures based on classes (e.g. different disease strains) has not been considered before. It is important because it can help biologists identify the differences between gene regulation in healthy and diseased cells. An additional key finding is that the quality of gene expression data (in terms of dataset size or noise) is often too low to learn reliable network structures. This is often ignored in other research.

My PhD project is focused on improving gene network discovery algorithms by integrating additional data sources to improve reliability. There is a wealth of data available about genes, generated from different types of experiments and analysis. Other data sources (in addition to gene expression data) include observed gene interactions based on physical experimental evidence (transcription factor DNA binding sites), the functional gene ontology, and co-occurrence of gene names in the literature (from text mining methods).
However, different data sources are often represented using non-uniform methods and are difficult to integrate, as shown in Figure 2. Recently, there has been interest in incorporating different data sources into regulatory network algorithms, in order to draw more robust and reliable conclusions from experimental data. Most notably, Bar-Joseph et al. [7] combine binding site data and expression data in a clustering framework to learn regulatory networks, whilst Bernard and Hartemink [8] use binding site data to influence prior probability distributions before BN learning from expression data. However, the solutions presented in the literature are ad hoc and specific to the data types used, and no clear framework has emerged.

My current research focuses on generating gene networks from multiple datasets. This can be defined as two main tasks:
1. Combining multiple gene expression datasets. Utilising multiple expression datasets has the potential to produce more robust regulatory network models with greater confidence, placing less reliance on a single (possibly biased) dataset. However, combining datasets directly through methods such as normalisation remains difficult, as experiments are often conducted on different microarray platforms and in different laboratories, leading to inherent biases in the data. I have developed two novel approaches for learning regulatory network structures from multiple microarray datasets, each based on aggregating high-level features of models generated from the individual datasets. Thus, they do not rely on special normalisation treatments of the data. Meta-analysis Bayesian networks are based on combining statistical confidences attached to network edges, whilst Consensus Bayesian networks identify consistent network features across all datasets. Applying both approaches to multiple datasets from synthetic and real (E. coli and yeast) networks has demonstrated that both can improve on networks learnt from a single dataset or from an aggregated dataset formed using a standard scale normalisation. We have found that the consensus approach is particularly good for filtering out noisy datasets; it can be used to establish persistent regulatory relationships when little is known about the datasets. On the other hand, meta-analysis establishes aggregate statistical confidences attached to regulatory interactions and can be effective with a small number of datasets. A journal publication of this work is currently under review.

2. Combining heterogeneous types of data. At present, my research focuses on combining information from the scientific literature with gene expression datasets.
This work is carried out in collaboration with researchers from the Biosemantics group at Erasmus University, Rotterdam, who can generate association or correlation matrices for a set of genes based on a set of abstracts from the scientific literature. The aim is to incorporate this data into the learning of Bayesian networks. My current research looks at the use of these association matrices as prior influences on the model structure.

Furthermore, I work closely with biologists from the Human Genetics Centre at the University of Leiden (Netherlands), who are interested in gene interactions in muscular dystrophy cells, and with researchers in the Biosemantics group at Erasmus University, Rotterdam (Netherlands), who are experts in text mining for biology. The collaboration is beneficial to both sides. The biologists gain insight into their data using new approaches. We are able to develop and test new algorithmic approaches on their data, and the biologists are able to assist in evaluating our results. Working together drives the research forward more effectively.
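As a rough illustration of the first task above (combining multiple gene expression datasets), the sketch below contrasts the two aggregation ideas on edge-level results. It is not the method used in this research: it assumes each dataset has already produced a p-value for every candidate edge, it uses Fisher's method as a stand-in for "combining statistical confidences", and all edge names, p-values and the 0.05 threshold are hypothetical.

```python
import math

# Hypothetical input: one dict per dataset, mapping a directed edge to the
# p-value obtained for it from that dataset alone (smaller = stronger support).
edge_pvalues = [
    {("A", "C"): 0.01, ("B", "C"): 0.03, ("A", "B"): 0.40},
    {("A", "C"): 0.02, ("B", "C"): 0.20, ("A", "B"): 0.60},
    {("A", "C"): 0.04, ("B", "C"): 0.04, ("A", "B"): 0.04},
]

def consensus_edges(datasets, alpha=0.05):
    """Consensus idea: keep edges that are significant in every dataset."""
    significant = [{e for e, p in d.items() if p < alpha} for d in datasets]
    return set.intersection(*significant)

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose upper tail has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    tail = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail

def meta_analysis_edges(datasets, alpha=0.05):
    """Meta-analysis idea: keep edges whose combined p-value is significant."""
    shared = set.intersection(*(set(d) for d in datasets))
    combined = {e: fisher_combined_pvalue([d[e] for d in datasets]) for e in shared}
    return {e: p for e, p in combined.items() if p < alpha}

consensus = consensus_edges(edge_pvalues)
meta = meta_analysis_edges(edge_pvalues)
```

With these invented numbers, the consensus set contains only the edge from A to C, while the meta-analysis set also retains the edge from B to C, whose support is strong in two datasets but weak in the third. Neither approach requires the raw expression values from different platforms to be normalised onto a common scale.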
References