Bioinformatics - Lab 5
Sequence pattern discovery
David Gilbert
The aims of this lab are to
enable you to gain experience with sequence
pattern discovery and pattern searching.
You will need to refer to the lecture on Multiple alignments, patterns and profiles.
Databases and search tools:
- PROSITE is a database of protein families and domains. It consists
of biologically significant sites, patterns and profiles that help to reliably
identify to which known protein family (if any) a new sequence belongs.
- The ScanProsite tool allows users to scan protein sequence(s) (either from
UniProt Knowledgebase (Swiss-Prot/TrEMBL) or PDB or provided by the user) for
the occurrence of patterns, profiles and rules (motifs) stored in the PROSITE
database, or to search protein database(s) for hits by specific motif(s).
Note: many of the programs used in this tutorial can be found at
http://ca.expasy.org/tools/#pattern.
Some pattern discovery programs (taken from
http://bioweb.pasteur.fr/seqanal/motif/intro-uk.html) are:
- PRATT at EBI, http://ca.expasy.org/tools/pratt/, or PrattWWW:
a tool for sequence motif discovery (I. Jonassen). Pratt is also installed locally in
/users/students4/software/public/Bio4/bin/pratt. A quick user guide is
here.
Note that pratt works by repeated pattern extension, generates regular
expression patterns in PROSITE syntax, and can find
several patterns that are descriptive of the input sequences. It
can also refine (improve) the pattern set that it finds.
- MEME (Multiple EM for Motif Elicitation): discovery of motifs (highly
conserved regions) in groups of related DNA or protein sequences
(T. Bailey, C. Elkan, B. Grundy)
- PFTOOLS:
PROFILE tools (P. Bucher)
- SMILE:
Structured Motif Inference and Evaluation (L. Marsan, J. Allali)
- prophet web server: gapped alignment for profiles.
The EMBOSS prophet program can also be run locally.
- consensus:
Identification of consensus patterns in unaligned DNA and protein
sequences: a large-deviation statistical basis for penalizing gaps
(G.Z. Hertz and G.D. Stormo)
(advanced form)
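Because Pratt writes its patterns in PROSITE syntax, it can help to see how such a pattern maps onto an ordinary regular expression. The following is a minimal Python sketch, not part of Pratt or ScanProsite. The function name prosite_to_regex is made up for illustration, and it handles only the common elements: x, [..], {..}, repeat counts such as (3) or (2,4), and the < and > anchors.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    # Drop surrounding whitespace and the optional trailing full stop.
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        if not element:
            continue  # ignore empty elements caused by stray hyphens
        prefix = suffix = ""
        if element.startswith("<"):  # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":  # x means any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {..} means none of these
        # [..] classes and single residues are already valid regex syntax.
        count = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + count + suffix)
    return "".join(parts)


# An example pattern written in PROSITE syntax.
example_regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(example_regex)

# The biochemist's motif, written element by element.
motif_regex = prosite_to_regex("F-M-F-E-G-H-D-T-T-A")
print(bool(re.search(motif_regex, "KKFMFEGHDTTAGG")))
```

A translation like this makes it easy to test a pattern on a handful of sequences with re.search before running a full database scan.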
Cytochrome P450 is a family containing some of the body's more powerful
detoxification enzymes. Over 60 key forms are known, with hundreds of possible
genetic variations, producing wide variation in susceptibility to specific toxins.
More on P450's is here.
A well-known biochemist suspects (by intuition and experience)
that P450 protein sequences are characterised by the
following sequence motif:
FMFEGHDTTA
He has found this motif in the sequences of a
set of Cytochrome P450's that he is studying. These sequences are
here.
The biochemist has asked, however, if there are other motifs which
characterise P450 sequences, and also are discriminatory against other protein
families.
In this realistic exercise, we are going to:
- Confirm whether the
FMFEGHDTTA
motif does indeed identify P450's, by using the motif to scan through the
Prosite motifs database.
- Use any hits that we find to make some other patterns characteristic of
the retrieved set. These patterns will be regular expressions, generated
using pratt.
- Evaluate these generated patterns against the set of retrieved P450
sequences, and also against a set of sequences which are not P450's in order
to ascertain the descriptive and discriminatory powers of the generated
patterns.
-
Check that the sequence motif
FMFEGHDTTA
does indeed match the sequences of the
set of Cytochrome P450's that the biochemist is studying. These sequences are
here.
You can use the EMBOSS
patmatdb
program to search the set above with this pattern, or simply use a Unix
search tool (e.g. grep).
-
Use ScanProsite (select the blue
'Search Swiss-Prot with the PROSITE pattern(s)/profile(s)'
part of the form) to search for
sequences in SwissProt that match this motif.
-
Retrieve these hits in fasta format (you will have to go to the page for each
sequence hit and click on the 'fasta' button), and save them all in one file -- you will need
these several times during this exercise.
-
Check that these sequences are specific to the search motif, by pasting them
into the yellow 'Proteins to be scanned' box on the ScanProsite page, and then
scanning the Prosite database with them.
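The first check in the steps above (does FMFEGHDTTA occur in each sequence?) can also be sketched in a few lines of Python as an alternative to patmatdb or grep. One practical difference from a line-based grep is that FASTA files usually wrap a sequence over several lines, so a motif that straddles a line break is only found once the lines have been joined. The function names and the two toy records below are invented for illustration; they are not real P450 entries.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the identifier.
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(chunks) for n, chunks in records.items()}


def find_motif(records, motif):
    """Return {identifier: 1-based position of the first match}."""
    hits = {}
    for name, sequence in records.items():
        position = sequence.upper().find(motif)
        if position >= 0:
            hits[name] = position + 1
    return hits


# Toy input: the motif in toy_seq_1 is split across two lines on purpose.
example = """>toy_seq_1 made-up record
MKLAFMFEG
HDTTAGGLK
>toy_seq_2 made-up record
MVLSPADKTNVKAAW
"""

hits = find_motif(read_fasta(example), "FMFEGHDTTA")
print(hits)
```

To run this on the saved hits, replace the toy text with the contents of your own FASTA file, for example open("p450.fasta").read(), where p450.fasta is a hypothetical file name.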
We are now going to automatically generate patterns that characterise
the set of P450 protein sequences that we have retrieved, and then test the
goodness of these patterns:
-
Use PRATT as an automated method to
construct regular expressions characterising these sequences.
An example of what PRATT does can be found
here.
-
Search back into SwissProt with ScanProsite
using (some of) these patterns and see what
hits are obtained.
- Do these searches return all of the original sequences?
- What other sequences (if any) are identified by these patterns? Are these
also P450 sequences?
- You can also try IBM's TEIRESIAS program to generate patterns.
-
Take the set of P450 sequences that you have identified with the
FMFEGHDTTA
motif as the set of positive examples.
- Randomly divide it into a training set and a test set. (Set the
ratio of training to test sequences as somewhere between 1:1 and 1:3)
- Use
PRATT to learn a significant pattern for the training set.
- Search the test set (or the entire database if you can't do this) with that
pattern.
You can use the EMBOSS
patmatdb
program to search your test set with some of the patterns learned previously.
As usual, the EMBOSS suite is installed on the DCS system
in /users/students4/software/public/Bio4/bin/.
Note that the output of patmatdb is described in the patmatdb
documentation.
- Create a new test set of negative examples in FASTA format (e.g. select
globins from different organisms from the previous lab).
Here we are assuming that the globins do not contain the motif
FMFEGHDTTA. Is this assumption valid?
You can check this by pasting the sequences
into the yellow 'Proteins to be scanned' box on the ScanProsite page, and then
scanning the Prosite database with them to see what motif(s) match the sequences.
- Now you have two test sets (i.e. a Positive_test generated in 6 and
a Negative_test generated with globins data).
- Use the EMBOSS
patmatdb
program to search the positive and negative test sets with some of the
patterns learned previously with pratt.
- Make sure that you know for each search what the True Positives (TP),
True Negatives (TN), False Positives (FP) and False Negatives (FN) are:
- TP = pattern hits in Positive_test
- FN = pattern misses in Positive_test
- FP = pattern hits in Negative_test
- TN = pattern misses in Negative_test
- Use your results to compute the
various goodness measures for your
pattern. Clearly, if you choose a weak pattern suggested by pratt, then
the goodness measures will be poor. But what is the intuitive
definition of a weak
pattern? Can you tell if a pattern is weak just by looking at the pattern in
isolation from any sets of target sequences? Are short (small) patterns necessarily
weak?
-
Try to make patterns for the globins (i.e. now they are the positive set), and
use the P450s as the negative set. Can you make very discriminatory patterns
for globins?
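The evaluation steps above (a random training/test split, then counting TP, FN, FP and TN for a pattern) can be sketched in Python as follows. This is only an illustration under stated assumptions: the handout does not say which goodness measures are meant, so sensitivity, specificity and precision are assumed here as common choices. The function names are hypothetical, the pattern is given as an ordinary regular expression rather than in PROSITE syntax, and the sequences are short made-up strings, not real P450s or globins.

```python
import random
import re


def split_train_test(names, test_fraction=0.5, seed=0):
    """Randomly split names into (training, test) lists.

    A training:test ratio between 1:1 and 1:3 corresponds to a
    test_fraction between 0.5 and 0.75.
    """
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_counts(regex, positive_test, negative_test):
    """Count hits and misses of a pattern in the two test sets."""
    pattern = re.compile(regex)
    tp = sum(1 for seq in positive_test if pattern.search(seq))
    fp = sum(1 for seq in negative_test if pattern.search(seq))
    return {
        "TP": tp,
        "FN": len(positive_test) - tp,
        "FP": fp,
        "TN": len(negative_test) - fp,
    }


def goodness(counts):
    """Compute three assumed goodness measures from the counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    tp, fn = counts["TP"], counts["FN"]
    fp, tn = counts["FP"], counts["TN"]
    return {
        "sensitivity": ratio(tp, tp + fn),  # fraction of positives found
        "specificity": ratio(tn, tn + fp),  # fraction of negatives rejected
        "precision": ratio(tp, tp + fp),    # fraction of hits that are positives
    }


# Made-up toy sequences standing in for Positive_test and Negative_test.
positive_test = ["AAFMFEGHDTTAKK", "CCFMFEGHDTTA", "GGGGSSSS"]
negative_test = ["MVLSPADKTNVK", "GGHHEAELKPL"]

counts = confusion_counts("FMFEGHDTTA", positive_test, negative_test)
scores = goodness(counts)
print(counts)
print(scores)
```

Swapping the roles of the two lists gives the counts for the final step, where the globins are the positive set and the P450s the negative set.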
Other programs and resources
-
ppsearch:
searches your query sequence for protein motifs. It rapidly compares your query
protein sequence against all patterns stored in the PROSITE
pattern database to help determine
what the function of an uncharacterised protein is. This
tool requires a protein
sequence as input, but DNA/RNA may be translated into a
protein sequence using
transeq
and then queried.
-
The EMOTIF database is a collection of more than 170 000 highly specific and
sensitive protein sequence motifs representing
conserved biochemical properties and biological functions.
These protein motifs are derived from over 7600 sequence alignments in the
BLOCKS+ database (released on June 23, 2000) and all (8244)
protein sequence alignments in the PRINTS database
using the emotif-maker algorithm developed by
Nevill-Manning et al.
Since
the amino acids and the groups of amino acids in these
sequence motifs represent critical positions conserved in evolution, search
algorithms employing the EMOTIF patterns can identify and
classify more widely divergent sequences than methods based on global
sequence similarity. The emotif protein pattern database
is available at
http://motif.stanford.edu/emotif/.
-
emotif maker -- program to make a motif (regular expression with
putative biological function)
from an alignment.
-
emotif scan -- program to scan a regular expression against a
database of sequences.
-
emotif search -- program to match a
sequence against a database of
blocks and
prints.
Blocks are multiply aligned ungapped segments corresponding to
the most highly conserved regions of proteins.
PRINTS is a compendium of protein fingerprints -- a fingerprint is a group of
conserved motifs used to characterise a protein family.
-
PHI-BLAST
(from
http://bioweb.pasteur.fr/seqanal/interfaces/phiblast.html
or from NCBI)
is a version of BLAST that uses a pattern in
this format
(following PROSITE conventions)
to initiate a BLAST search. Try it!