Table of Contents
Techniques for pattern matching, pattern discovery and structure comparison: From sequences to protein structure
Patterns etc
Some terminology
What is my Seq/Structure related to?
Protein family analysis
Protein comparison & motif discovery
Steps
Classification of string functions
Discrete patterns
Regular expression notation
Biosequences - general
Pattern notation and matching
PROSITE patterns
PROSITE examples
Example property
Example family (zinc finger c2h2)
RNA structural patterns
Possible patterns
Stem loops
Pseudo-knot
Learning
Pattern discovery in biosequences
Approaches to pattern discovery
Pattern driven algorithms
Sequence driven algorithms
Sequence driven approach
Characteristic string function for family F+
Clean / Noisy Data
Classification & conservation problems
Classification problem C1
Characterisation: conservation problem C2
Training and test sets
True positives, true negatives, false positives, false negatives
Some measures
Methodology
Defining string functions via patterns
Why pair-wise comparison? - Some evolutionary relationships revealed by comparing ?-haemoglobins
Other evolutionary issues
Edit distance
Edit distance - simplest
Edit distance - a little more sophisticated
Variations on dynamic programming
Dynamic programming table
Calculating a cell value
Dynamic programming table - record history of moves
Recovering the transcript
Dynamic programming table - minimal traceback moves
One possible transcript
The possibilities?
Multiple alignments
Multiple alignment - methods
Multiple sequence alignment (globins)
Sequence alignments & phylogenetic trees
What can we do with multiple alignments?
PSI-BLAST (position specific iterated BLAST)
Protein structure?
TOPS
Structure modelling with TOPS
Protein structure
History
Topological description
Example - a plait
PDB: Protein Data Bank
Generating TOPS descriptions
Formal definitions
What is a pattern?
Plait motif
Plait formal definition
Pattern matching
Matching algorithm (sub-graph isomorphism over vertex-ordered graphs)
Topological constraints
Beta-sheet connectivities
“Greek key” motif
“Jelly roll” motifs (anti-parallel β-sandwich)
Protein comparison & motif discovery
Topological search with plait motif
Pattern searches on TOPS databases
Approaches to pattern discovery
Topological pattern discovery (pattern extension and repeated matching)
Discovering common patterns and making multiple alignments
Edge product graph / maximal cliques
Comparison based approach
Structure comparison
Comparison: alignment using discovered patterns
Rating patterns
Compression
Specialised case when n=2: Using a pattern to compare 2 structures
Topological structure comparison: 2bop
Search using 2bop
Comparing structures - NADP binding domains
Dendrogram from pairwise comparisons & hierarchical clustering
NAD comparisons
Coverage vs Error
Coverage versus error: PDB40
Back to patterns - use in classification: the CATH hierarchy
CATH database hierarchy
Pattern discovery - problems
Outliers!
Restricting the learning set
Extending TOPS patterns to unions
Grouping data by discovered patterns
Grouping by pattern discovery
Group 1
Group 2
Dendrogram of pairwise comparisons annotated by discovered groups
Grouping data by discovered patterns
Learning patterns for CATH homologous superfamilies
Case study: GHKL, an emergent ATPase/kinase superfamily
Common GHKL motif
21 domains, 4 groups, pruneval=10
GHKL binding motif
Summary
Protein comparison & motif discovery
Current TOPS project, joint with Leeds (Biochem)
TOPS comparison server: http://tops.ebi.ac.uk/tops/compare.html
Acknowledgements
Resources / contacts