Techniques for pattern matching, pattern discovery and structure comparison: from sequences to protein structure

24/11/01



Table of Contents

Techniques for pattern matching, pattern discovery and structure comparison: from sequences to protein structure

Patterns etc

Some terminology

What is my Seq/Structure related to?

Protein family analysis

Protein comparison & motif discovery

Steps

Classification of string functions

Discrete patterns

Regular expression notation

Biosequences - general

Pattern notation and matching

PROSITE patterns

PROSITE examples

Example property

Example family (zinc finger C2H2)

RNA structural patterns

Possible patterns

Stem loops

Pseudo-knot

Learning

Pattern discovery in biosequences

Approaches to pattern discovery

Pattern driven algorithms

Sequence driven algorithms

Sequence driven approach

Characteristic string function for family F+

Clean / Noisy Data

Classification & conservation problems

Classification problem C1

Characterisation: conservation problem C2

Training and test sets

True positives, true negatives, false positives, false negatives

Some measures

Methodology

Defining string functions via patterns

Why pair-wise comparison? - Some evolutionary relationships revealed by comparing ?-haemoglobins

Other evolutionary issues

Edit distance

Edit distance - simplest

Edit distance - a little more sophisticated

Variations on dynamic programming

Dynamic programming table

Calculating a cell value

Dynamic programming table - record history of moves

Recovering the transcript

Dynamic programming table - minimal traceback moves

One possible transcript

The possibilities?

Multiple alignments

Multiple alignment - methods

Multiple sequence alignment (globins)

Sequence alignments & phylogenetic trees

What can we do with multiple alignments?

PSI-BLAST (position specific iterated BLAST)

Protein structure?

TOPS

Structure modelling with TOPS

Protein structure

History

Topological description

Example - a plait

PDB: Protein Data Bank

Generating TOPS descriptions

Formal definitions

What is a pattern?

Plait motif

Plait formal definition

Pattern matching

Matching algorithm (sub-graph isomorphism over vertex-ordered graphs)

Topological constraints

Beta-sheet connectivities

“Greek key” motif

“Jelly roll” motifs (anti-parallel β-sandwich)

Protein comparison & motif discovery

Topological search with plait motif

Pattern searches on TOPS databases

Approaches to pattern discovery

Topological pattern discovery (pattern extension and repeated matching)

Discovering common patterns and making multiple alignments

Edge product graph / maximal cliques

Comparison based approach

Structure comparison

Comparison: alignment using discovered patterns

Rating patterns

Compression

Specialised case when n=2: using a pattern to compare 2 structures

Topological structure comparison: 2bop

Search using 2bop

Comparing structures - NADP binding domains

Dendrogram from pairwise comparisons & hierarchical clustering

NAD comparisons

Coverage vs Error

Coverage versus error: PDB40

Back to patterns - use in classification: the CATH hierarchy

CATH database hierarchy

Pattern discovery - problems

Outliers!

Restricting the learning set

Extending TOPS patterns to unions

Grouping data by discovered patterns

Grouping by pattern discovery

Group 1

Group 2

Dendrogram of pairwise comparisons annotated by discovered groups

Grouping data by discovered patterns

Learning patterns for CATH homologous superfamilies

Case study: GHKL, an emergent ATPase/kinase superfamily

Common GHKL motif

21 domains, 4 groups, pruneval=10

GHKL binding motif

Summary

Protein comparison & motif discovery

Current TOPS project, joint with Leeds (Biochem)

TOPS comparison server: http://tops.ebi.ac.uk/tops/compare.html

Acknowledgements

Resources / contacts

Author: David Gilbert

Email: drg@soi.city.ac.uk

Home Page: www.soi.city.ac.uk/~drg