Bioinformatics - Lab 5
Sequence pattern discovery
David Gilbert

The aims of this lab are to enable you to gain experience with sequence pattern discovery and pattern searching. You will need to refer to the lecture on Multiple alignments, patterns and profiles.

Databases and search tools:

PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
The ScanProsite tool allows users to scan protein sequences (from the UniProt Knowledgebase (Swiss-Prot/TrEMBL), from PDB, or provided by the user) for occurrences of the patterns, profiles and rules (motifs) stored in the PROSITE database, or to search protein databases for hits by specific motifs.

Note: many of the programs used in this tutorial can be found at http://ca.expasy.org/tools/#pattern.

Some pattern discovery programs (taken from http://bioweb.pasteur.fr/seqanal/motif/intro-uk.html) are:

PRATT at EBI, http://ca.expasy.org/tools/pratt/, or PrattWWW: a tool for sequence motif discovery (I. Jonassen). Pratt is also installed locally in /users/students4/software/public/Bio4/bin/pratt. A quick user guide is here. Note that pratt works by repeated pattern extension, generates regular expression patterns in a Prosite syntax, and can find several patterns which are descriptive of the input sequences. Moreover it can also refine (improve) the pattern set which it finds.
MEME (Multiple EM for Motif Elicitation): discovery of motifs (highly conserved regions) in groups of related DNA or protein sequences (T. Bailey, C. Elkan, B. Grundy)
PFTOOLS: PROFILE tools (P. Bucher)
SMILE: Structured Motif Inference and Evaluation (L. Marsan, J. Allali)
- simple motifs
- structured motifs, with 2 boxes
prophet web server: gapped alignment for profiles. The EMBOSS program prophet can also be run locally.
consensus: Identification of consensus patterns in unaligned DNA and protein sequences: a large-deviation statistical basis for penalizing gaps (G.Z. Hertz and G.D. Stormo) (advanced form)

Cytochrome P450 is a family of the body's more powerful detox enzymes. Over 60 key forms are known, with hundreds of genetic variations possible, producing a wide variety of susceptibility to specific toxins. More on P450's is here.

A well-known biochemist suspects (by intuition and experience) that P450 protein sequences are characterised by the following sequence motif: FMFEGHDTTA

He has found this motif in the sequences of a set of Cytochrome P450's that he is studying. These sequences are here.
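
As a rough illustration of what it means for a literal motif to occur in a set of sequences, here is a minimal Python sketch that reads FASTA-formatted text and reports where the motif occurs in each sequence. The function names (parse_fasta, find_motif) and the two toy sequences are invented for this example; they are not real P450 entries and are not part of the lab materials.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[header] += line.upper()
    return records


def find_motif(records: Dict[str, str], motif: str) -> Dict[str, List[int]]:
    """Map each header to the 1-based start positions of every exact motif hit."""
    hits: Dict[str, List[int]] = {}
    for header, seq in records.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start + 1)
            start = seq.find(motif, start + 1)
        hits[header] = positions
    return hits


# Toy sequences invented for illustration (assumption, not real data).
example = """>toy_with_motif
MKLAAFMFEGHDTTASGL
>toy_without_motif
MKLAAGGSTPLLKK
""".splitlines()

hits = find_motif(parse_fasta(example), "FMFEGHDTTA")
for name, positions in hits.items():
    print(name, positions)
```

To try this on a real file, pass an open file object to parse_fasta instead of the toy lines. An empty position list means the motif does not occur in that sequence.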

The biochemist has asked, however, if there are other motifs which characterise P450 sequences, and also are discriminatory against other protein families.

In this realistic exercise, we are going to:

Confirm whether the FMFEGHDTTA motif does indeed identify P450's, by using the motif to scan through the Prosite motifs database.
Use any hits that we find to make some other patterns characteristic of the retrieved set. These patterns will be regular expressions, generated using pratt.
Evaluate these generated patterns against the set of retrieved P450 sequences, and also against a set of sequences which are not P450's in order to ascertain the descriptive and discriminatory powers of the generated patterns.
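
Since the patterns are regular expressions written in Prosite syntax, it can help to see how such a pattern maps onto an ordinary regular expression. The sketch below is a simplified translator into Python's re syntax. It assumes only the common elements of the syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends) and does not handle rarer cases such as > inside square brackets. The function name prosite_to_regex and the example pattern are made up for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regular expression syntax."""
    pattern = pattern.strip().rstrip(".")  # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


# A made-up pattern, used only to show the translation.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-{P}")
print(regex)

# The literal motif from this lab is already a valid pattern once written
# with hyphens between residues.
literal = prosite_to_regex("F-M-F-E-G-H-D-T-T-A")
print(literal)

match = re.search(regex, "AACGGCAAALQ")
print(match.group() if match else "no match")
```

Tools such as patmatdb and ScanProsite accept the Prosite syntax directly, so this translation is only needed when testing a pattern with a general-purpose regular expression library.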

Check that the sequence motif FMFEGHDTTA does indeed match the sequences of the set of Cytochrome P450's that the biochemist is studying. These sequences are here. You can use the Emboss patmatdb program to search the set above with this pattern. Or you can just use a Unix search tool (e.g. grep).
Use ScanProsite: select the blue 'Search Swiss-Prot with the PROSITE pattern(s)/profile(s)' part of the form to search for sequence matches to this motif in Swiss-Prot.
Retrieve these hits in fasta format (you will have to go to the page for each sequence hit and click on the 'fasta' button), and save them all in one file -- you will need these several times during this exercise.
Check that these sequences are specific to the search motif, by pasting them into the yellow 'Proteins to be scanned' box on the ScanProsite page, and then scanning the Prosite database with them.
We are now going to automatically generate patterns that characterise the set of P450 protein sequences that we have retrieved, and then test the goodness of these patterns:
Use PRATT as an automated method to construct regular expressions characterising these sequences. An example of what PRATT does can be found here.
Take the set of P450 sequences that you have identified with the FMFEGHDTTA motif as the set of positive examples.
Randomly divide it into a training set and a test set, with the ratio of training to test sequences somewhere between 1:1 and 1:3.
Use PRATT to learn a significant pattern for the training set.
Search the test set (or the entire database if you can't do this) with that pattern.
You can use the Emboss patmatdb program to search your test set with some of the patterns learned previously. As usual, the EMBOSS suite is installed on the DCS system in /users/students4/software/public/Bio4/bin/. Note that the output of patmatdb is described in the patmatdb documentation.
Create a new test set with negative examples (e.g. select globins for different organisms from the past lab) using the FASTA format. Here we are assuming that the globins do not contain the motif FMFEGHDTTA. Is this assumption valid? You can check this by pasting the sequences into the yellow 'Proteins to be scanned' box on the ScanProsite page, and then scanning the Prosite database with them to see what motif(s) match the sequences.
Now you have two test sets (i.e. a Positive_test generated by the training/test split above and a Negative_test generated with globins data).
Use the Emboss patmatdb program to search the positive and negative test sets with some of the patterns learned previously with pratt.
Make sure that you know for each search what the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) are:
- TP = pattern hits in Positive_test
- FN = pattern misses in Positive_test
- FP = pattern hits in Negative_test
- TN = pattern misses in Negative_test
Use your results to compute the various goodness measures for your pattern. Clearly if you choose a weak pattern suggested by pratt, then the goodness measures will be poor. But what is the intuitive definition of a weak pattern? Can you tell if a pattern is weak just by looking at the pattern in isolation from any sets of target sequences? Are short (small) patterns necessarily weak?
Try to make patterns for the globins (i.e. now they are the positive set), and use the P450s as the negative set. Can you make very discriminatory patterns for globins?
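
The split and evaluation steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: a plain regular expression search stands in for patmatdb, and the function names and toy sequences are invented for the example. The lab does not list specific goodness measures, so sensitivity, specificity, precision and accuracy are assumed here as common choices.

```python
import random
import re
from typing import Dict, List, Tuple


def split_train_test(names: List[str], train_fraction: float,
                     seed: int) -> Tuple[List[str], List[str]]:
    """Randomly split names; train_fraction=0.5 gives 1:1, 0.25 gives 1:3."""
    shuffled = names[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def confusion_counts(regex: str, positives: List[str],
                     negatives: List[str]) -> Dict[str, int]:
    """Count TP, FN, FP and TN, one count per sequence (not per hit)."""
    compiled = re.compile(regex)
    tp = sum(1 for seq in positives if compiled.search(seq))
    fp = sum(1 for seq in negatives if compiled.search(seq))
    return {"TP": tp, "FN": len(positives) - tp,
            "FP": fp, "TN": len(negatives) - fp}


def goodness(c: Dict[str, int]) -> Dict[str, float]:
    """Assumed goodness measures computed from the four counts."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")
    total = c["TP"] + c["FN"] + c["FP"] + c["TN"]
    return {
        "sensitivity": ratio(c["TP"], c["TP"] + c["FN"]),
        "specificity": ratio(c["TN"], c["TN"] + c["FP"]),
        "precision": ratio(c["TP"], c["TP"] + c["FP"]),
        "accuracy": ratio(c["TP"] + c["TN"], total),
    }


# Toy data invented for illustration; not real P450 or globin sequences.
names = ["seq1", "seq2", "seq3", "seq4", "seq5", "seq6"]
train, test = split_train_test(names, train_fraction=0.5, seed=1)
print("training:", train, "test:", test)

positive_test = ["AAFMFEGHDTTAKK", "GGFMFEGHDTTALL", "MKLLFMFDGHDTTA"]
negative_test = ["MVLSPADKTN", "MVHLTPEEKS"]
counts = confusion_counts("FMFEGHDTTA", positive_test, negative_test)
scores = goodness(counts)
print(counts)
print(scores)
```

Counting each sequence once, however many times the pattern hits it, keeps the four counts consistent with the TP, FN, FP and TN definitions in the steps above.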

Other programs and resources

ppsearch: rapidly compares your query protein sequence against all patterns stored in the PROSITE pattern database, to help determine the function of an uncharacterised protein. This tool requires a protein sequence as input, but DNA/RNA may be translated into a protein sequence using transeq and then queried.
The EMOTIF database is a collection of more than 170 000 highly specific and sensitive protein sequence motifs representing conserved biochemical properties and biological functions. These protein motifs are derived from over 7600 sequence alignments in the BLOCKS+ database (released on June 23, 2000) and all (8244) protein sequence alignments in the PRINTS database using the emotif-maker algorithm developed by Nevill-Manning et al. Since the amino acids and the groups of amino acids in these sequence motifs represent critical positions conserved in evolution, search algorithms employing the EMOTIF patterns can identify and classify more widely divergent sequences than methods based on global sequence similarity. The emotif protein pattern database is available at http://motif.stanford.edu/emotif/.
- emotif maker -- program to make a motif (regular expression with putative biological function) from an alignment.
- emotif scan -- program to scan a regular expression against a database of sequences.
- emotif search -- program to match a sequence against a database of blocks and prints. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. PRINTS is a compendium of protein fingerprints -- a fingerprint is a group of conserved motifs used to characterise a protein family.
PHI-BLAST (from http://bioweb.pasteur.fr/seqanal/interfaces/phiblast.html or from NCBI) is a version of BLAST that uses a pattern, written following PROSITE conventions, to initiate the search. Try it!
