Bioinformatics - Lab 5
Sequence pattern discovery
David Gilbert

The aims of this lab are to enable you to gain experience with sequence pattern discovery and pattern searching. You will need to refer to the lecture on Multiple alignments, patterns and profiles.

Databases and search tools:

Note: many of the programs used in this tutorial can be found at http://ca.expasy.org/tools/#pattern.

Some pattern discovery programs (taken from http://bioweb.pasteur.fr/seqanal/motif/intro-uk.html) are:


Cytochrome P450 is a family of the body's more powerful detoxification enzymes. Over 60 key forms are known, with hundreds of possible genetic variants, producing wide variation in susceptibility to specific toxins. More on P450s is here.

A well-known biochemist suspects (by intuition and experience) that P450 protein sequences are characterised by the following sequence motif: FMFEGHDTTA

He has found this motif in the sequences of a set of Cytochrome P450s that he is studying. These sequences are here.

The biochemist has asked, however, whether there are other motifs that characterise P450 sequences and that also discriminate against other protein families.

In this realistic exercise, we are going to:

  1. Check that the sequence motif FMFEGHDTTA does indeed match the sequences in the set of Cytochrome P450s that the biochemist is studying. These sequences are here. You can use the EMBOSS patmatdb program to search this set with the pattern, or you can simply use a Unix search tool (e.g. grep).

  2. Use ScanProsite -- select the blue 'Search Swiss-Prot with the PROSITE pattern(s)/profile(s)' part of the form -- to search for sequence matches to this motif in Swiss-Prot.
  3. Retrieve these hits in FASTA format (you will have to go to the page for each sequence hit and click on the 'fasta' button), and save them all in one file -- you will need these several times during this exercise.

  4. Check that these sequences are specific to the search motif, by pasting them into the yellow 'Proteins to be scanned' box on the ScanProsite page, and then scanning the Prosite database with them.
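
The exact-motif check in step 1 can also be sketched in a few lines of Python. This is a minimal illustration, not a replacement for patmatdb: the function names and the toy records below are invented for the example and are not real P450 sequences. It joins the wrapped sequence lines of each FASTA record before searching, which matters because a line-by-line tool such as grep will miss a motif that is split across two lines of the file.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped lines so a motif spanning a line break is still found.
    return {h: "".join(parts).upper() for h, parts in records.items()}


def motif_positions(records, motif):
    """Map each header to the 1-based start positions of the exact motif."""
    hits = {}
    for header, seq in records.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start + 1)
            start = seq.find(motif, start + 1)
        hits[header] = positions
    return hits


# Toy records, invented for illustration only (hypothetical names and sequences).
# In the first record the motif is deliberately split across two lines.
example = """>toy_with_motif
MKLAAFMFEGHDT
TAGGS
>toy_without_motif
MKLAAGGSGGS
"""

for name, positions in motif_positions(parse_fasta(example), "FMFEGHDTTA").items():
    print(name, positions)
```

To run this on your own data, read the contents of the FASTA file you saved in step 3 into a string and pass it to parse_fasta in place of the toy example.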

    We are now going to automatically generate patterns that characterise the set of P450 protein sequences that we have retrieved, and then test how good these patterns are:

  5. Use PRATT as an automated method to construct regular expressions characterising these sequences. An example of what PRATT does can be found here.

  6. Take the set of P450 sequences that you have identified with the FMFEGHDTTA motif as the set of positive examples.

  7. Randomly divide it into a training set and a test set. (Set the ratio of training to test sequences as somewhere between 1:1 and 1:3)

  8. Use PRATT to learn a significant pattern for the training set.

  9. Search the test set (or the entire database if you can't do this) with that pattern.
    You can use the EMBOSS patmatdb program to search your test set with some of the patterns learned previously. As usual, the EMBOSS suite is installed on the DCS system in /users/students4/software/public/Bio4/bin/. Note that the output of patmatdb is described in the patmatdb documentation.

  10. Create a new test set of negative examples in FASTA format (e.g. select globins from different organisms from the previous lab). Here we are assuming that the globins do not contain the motif FMFEGHDTTA. Is this assumption valid? You can check this by pasting the sequences into the yellow 'Proteins to be scanned' box on the ScanProsite page, and then scanning the Prosite database with them to see which motif(s) match the sequences.

  11. Now you have two test sets (i.e. a Positive_test generated in step 7 and a Negative_test generated from the globin data).

  12. Use the EMBOSS patmatdb program to search the positive and negative test sets with some of the patterns learned previously with PRATT.

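  13. Make sure that you know, for each search, what the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) are. For example, a TP is a sequence in the positive test set that the pattern matches, and an FP is a sequence in the negative test set that the pattern matches.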

  14. Use your results to compute the various goodness measures for your pattern. Clearly, if you choose a weak pattern suggested by PRATT, then the goodness measures will be poor. But what is the intuitive definition of a weak pattern? Can you tell if a pattern is weak just by looking at the pattern in isolation from any set of target sequences? Are short (small) patterns necessarily weak?

  15. Try to make patterns for the globins (i.e. now they are the positive set), and use the P450s as the negative set. Can you make very discriminatory patterns for globins?
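
The bookkeeping in steps 7 to 14 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the method used by PRATT or patmatdb: all function names, the example pattern and the toy sequences are invented for illustration. The pattern converter handles only the simplest PROSITE-style syntax, and sensitivity, specificity and positive predictive value are shown as common choices of goodness measure; use whichever measures the lecture defines.

```python
import random
import re


def split_train_test(names, test_fraction=0.5, seed=0):
    """Randomly split names into (training, test) lists.

    A training:test ratio between 1:1 and 1:3 corresponds to a
    test_fraction between 0.5 and 0.75.
    """
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Simplified: handles '-', x, [..], {..}, (n) or (n,m), '<' and '>' only.
    """
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")    # x(2) -> x{2}
    regex = regex.replace("x", ".")                      # x -> any residue
    return regex.replace("<", "^").replace(">", "$")     # terminal anchors


def confusion_counts(regex, positives, negatives):
    """Count TP, FN, FP and TN for one pattern over two lists of sequences."""
    compiled = re.compile(regex)
    tp = sum(1 for seq in positives if compiled.search(seq))
    fp = sum(1 for seq in negatives if compiled.search(seq))
    return {"TP": tp, "FN": len(positives) - tp,
            "FP": fp, "TN": len(negatives) - fp}


def goodness(c):
    """Compute a few common measures; None where the denominator is zero."""
    def ratio(num, den):
        return num / den if den else None
    return {
        "sensitivity": ratio(c["TP"], c["TP"] + c["FN"]),
        "specificity": ratio(c["TN"], c["TN"] + c["FP"]),
        "ppv": ratio(c["TP"], c["TP"] + c["FP"]),
    }


# Toy test sets, invented for illustration only (not real P450s or globins).
positive_test = ["AAFMFEGHDTTAKK", "GGFLFAGHETTSNN", "MKLVVAAGG"]
negative_test = ["MVLSPADKTN", "FAAGSPHKK"]

# Hypothetical pattern written in PROSITE style, as PRATT might report one.
regex = prosite_to_regex("F-x-F-x-G-H-[DE]-T-T.")
counts = confusion_counts(regex, positive_test, negative_test)
measures = goodness(counts)
print(regex, counts, measures)
```

For step 15, swap the roles of the two sets: pass the globins as the positives and the P450s as the negatives.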

Other programs and resources