David Gilbert
drg@cs.city.ac.uk
Michael Schroeder
msch@cs.city.ac.uk
Department of Computing, City University, London, UK
and
Jacques van Helden
jvanheld@ebi.ac.uk
EBI, Cambridge, UK
Biologists often wish to analyse these databases in order to understand the relationship between the items that they contain. The traditional approach is to discover the evolutionary relationship between genes or proteins by analysing their sequence and structure similarities. More recently, similar methods have been applied to discover functional relationships between genes by comparing their transcriptional response to environmental modification. The usual approach to this is: Given a set of n objects and some pairwise comparison, which outputs a similarity or difference measure, (1) perform an all against all pairwise comparison, (2) cluster the items (possibly hierarchically), and (3) display the clusters in some visually meaningful way, normally as rooted (dendrograms) or unrooted trees.
Tree representations were first used in the biological field to represent systematic classifications and their evolutionary interpretation. This has been motivated by the similarity of molecular mechanisms which has let to the widely held view that all organisms diverged from a common ancestor. Specifically, the relationship between any set of species is termed a phylogeny; this relationship can be represented by an evolutionary, or phylogenetic tree. The task of phylogenetics is to infer this tree from observations of living organisms [Durbin et al.1998].
In their simplest form such trees are binary and can be used to approximate general n-ary trees. The leaves of the tree are labelled by the observed data, or taxa (species, biomolecular sequences, protein structures etc) and internal nodes can be labelled by hypothesised or known ancestors.
Each edge of the tree has a certain amount of evolutionary divergence associated with it, defined by the distance between the data items. Thus the distance between nodes (clusters) is biologically meaningful. Although a true phylogenetic tree has a root, i.e. common ancestor for all the species represented at the leaves, some algorithms give no indication about its position and thus return unrooted trees.
Trees can be constructed from pairwise distances by a variety of methods, including UPGMA (unweighted pair group method using arithmetic averages) [Sokal & Michener1958] and parsimony e.g [Fitch1971]. Since one, or in the case of parsimony several, optimal trees can be generated by tree building algorithms, an approach such as the bootstrap method [Feldenstein1985] is commonly used to assess the significance of some phylogenetic feature and thus give some measure of confidence for the tree.
In general each cluster, or node in a tree, can be associated with a classifier or conservation function (pattern) [Brazma et al.1998], a representative member of the cluster, or other information, for example data sources etc.
For example, hierarchically structured databases of protein structures (SCOP [Murzin et al.1995], CATH [Orengo et al.1997], annotate the nodes in the tree at certain levels with representative structures, as well as other (functional or structural) descriptions.
There two main problems associated with trees as a means of visualising relationships. First, there are isomorphisms if the comparison relation is symmetric, and second, the trees do not scale up for large amounts of data.
Given these shortcomings of the classical approach, we have set out to develop a different visualisation technique avoiding the above problems.
In this section we develop a design methodology for distance metrics. A distance metric is defined as follows:
if it is non-negative, i.e. dij >= 0 if i != j and dii = 0
if it is symmetric, i.e. dij = dji, and
if it satisfies the triangle inequality, i.e. dij =< dik + dkj
Often it is desirable to transform a distance table given the distance property is not violated. For example, one may simply want to re-scale a distance table by a factor of 1000. Or one may want to scale-up clusters and but still leave the bigger picture untouched. To this end, one can simply apply a logarithmic function, which increases small distances relatively more than it does large ones.
The following three general operations do not affect the property of being a distance table: Addition of two distance tables, multiplying with a scalar, and applying a monotonic and concave function. Intuitively, a function f is concave iff the image of the function is below its tangents; formally, this means that f''=0. Two examples of monotonic and concave functions are square root and logarithm. As outlined above, the latter is extremely useful to scale-up clusters.
To apply any of the above operations to a distance table, we need methodologies for converting raw data into a meaningful distance table. Although many approaches will do this ad hoc and possibly violating some of the distance properties, there are various functions which will lead to a distance table. A commonly used similarity measure, which forms a distance is the cardinality of symmetric set difference. We use this approach for example to compare hydrogen bonds and chiralities in the TOPS systems [Gilbert et al.1999]. Furthermore, the edit distance of two strings [Levenshtein1965] - as the name suggests - is a distance. However, an exception are asymmetrically defined derived edit distance measures, which penalise mismatches with different weights. A disadvantage of the basic edit distance is that it is not normalised. A relatively close match of two very long strings may be, for example, much higher than a complete mismatch of two small strings. As it turns out, we can normalise edit distance by dividing by the maximal length of the two strings. Interestingly, other candidate normalisations with the minimal length or the sum of the two lengths does not work.
Since we require a mathematical distance, we can guarantee that such points exist. Unfortunately, their dimension may be higher than three, so that we additionally aim to find a three-dimensional solution with least error. Further requirements are a rating of the layout, i.e. how reliable is it, and an anytime-behaviour, which allows us to improve layout with more computational resources available.
The full algorithm works as follows:
1. Compute A as defined above. If A is not positive semidefinite then exit.
2. Compute A=USUT by singular value decomposition with S having the eigenvalues li in decreasing order on its diagonal.
3. X = ( (l1)½u1T, (l2)½u2T (l3)½u3T)T where U = (u1,...,un)
4a.
If
l1 > ... > l
4b. Else X is the solution, i.e. ||xi,xj||2 = dij.
The advantage of this approach is that it is a flexible anytime-algorithm which comes up with an approximation of a solution with any time limitations. However, it is difficult to theoretically evaluate the quality of solutions and to optimally adjust the springs automatically. They are application dependent and thus difficult to find.
Besides VRML's excellent visualisation properties and our intrinsic layout approach, the visualisation benefits even more from extrinsic features. In the case of gene expression data for example, it desirable to interactively colour interesting functional families in the same colour. This feature allows the user to check whether the distance measure chosen reflects the functional properties of the objects.
To maximize the benefits to the user, it is desirable to combine existing visualisation approaches such as dendrograms with novel ideas. In a case study, we created a web-enabled user-interface including dendrograms, our novel 3D visualisation, and text window for detailed information on specific objects. Figure 1 shows a screenshot of our case study, a topology-based structural comparison of 98 proteins (NAD-binding domains, CATH superfamily 3.40.50.300). An on-line version is available here.
The above technique can also be applied to multi-attribute data and the next section discusses some preliminary results.
Classifications of organisms were originally based on morphological and anatomical criteria, but with the advent of molecular biology, the concept expanded to include comparison of protein sequences and structures. With the emergence of functional genomics [Hieter & Boguski1997], a big challenge became to compare genes on the basis of their functional similarities rather than structural features. The development of large-scale gene expression measurement methods [deRisi, Iyer, & Brown1997] has resulted in the need to analyse and visualise multi-dimensional data (i.e. each item in the database is associated with more than one attribute), and a lot of effort is currently being put into the development of dedicated clustering and visualization tools (e.g. [Eisen et al.1998]).
Gene expression data have been collected from different experiments, where each experiment can itself be a time series with several scalar values. In the case of yeast, 6200 genes are considered, and for each of them, the number of expression measurement under different conditions is rapidly growing. Currently, about 60 values per genes are publicly available, and probably many more will soon be published. For higher organisms, the order of magnitude is of approximately 100,000 genes. Eisen's clustering algorithm [Eisen et al.1998] permits the weighting of the different experimental values through a linear transform of the expression measurement vector. The gene expression profiles can thus generate alternative trees, depending on the precise experiments one wants to emphasize. This is an indirect way of mimicking non-hierarchical clustering, which would possibly be more appropriate to analyze such multidimensional data.
We are in the process of adapting our experimental prototype in order to treat gene expression, and other multi-attribute data. The adaptations include the ability to group and weight different attributes using linear transforms, and the use of colour in the VRML display in order to highlight those parts of the 3D cluster whose grouping has been caused by different attributes. Initial results on clustering gene expression data are promising. This ongoing work has already provided us with some insights into the ways in which different visual attributes can be exploited in order to display this highly complex data.
Interactive display is clearly a useful method to aid the comprehension and analysis of large amounts of complex data; developing such approaches is a significant challenge for computer scientists (and cognitive psychologists) working in bioinformatics. However the bottom line is that biologists must find such methods useful.
In this paper, we aimed to provide an alternative visualisation technique to dendrograms. We have developed a design methodology for distance metrics and two layout algorithms. In contrast to dendograms our approach is scalable. As an essential pratical requirement, we have also outlined how to compute confidence values for the layout. Finally, we have described our experimental prototype, which caters for interactive exploration and we have outlined some on-going work on multi-variate data visualisation.