Bioinformatics - Lab 3
Pairwise sequence comparison
David Gilbert
Last week you were briefly introduced to searching in sequence databases using
the Blast program which performs pairwise sequence comparison, returning a
score, E-value and a pairwise alignment between two sequences. You should
have realised that Blast can perform pairwise sequence comparison very fast.
For example, searching with the 1568-nucleotide or 416-amino-acid sequences
from J02799 over all of the GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,
GSS, environmental samples or phase 0, 1 or 2 HTGS sequences), i.e. over 3.7
million sequences (over 16.5 billion letters), returned results in very
acceptable times! Blast achieves this by not searching for global alignments,
and by using effective heuristics. However, you should be aware that other
approaches are sometimes more effective, e.g. exact (i.e. non-heuristic)
methods and/or those based on global alignments.
This week we are going to use programs from the EMBOSS suite to perform sequence
comparison based on
- global alignments, along the lines of the approach introduced in last week's
Sequence Comparison (1) lecture,
- local alignments, and
- using dot-plots -- an effective way to visually compare two sequences.
Note
You will need to type the following command at the Unix prompt when starting
each session in order to use all of the software in the practicals:
source /usr/local/lab/lab.env
Alternatively, add the line
source /usr/local/lab/lab.env
to your ~/.cshrc
Reminder:
- The EMBOSS suite is installed on the DCS system
in /usr/local/lab/packages.
Note: you will need to add the line
source /usr/local/bin/bioinformaticslab.env
to your ~/.cshrc in order to use some of the software in the
practicals.
- The EMBOSS documentation is visible at
compbio.dcs.gla.ac.uk/courses/Bio4/emboss/
You do not have write permissions in the EMBOSS installation directory! Hence:
YOU SHOULD WORK IN YOUR OWN DIRECTORY, WHERE YOU WILL HAVE DOWNLOADED THE
DATA FILES FOR THE EXERCISES. THE EMBOSS PROGRAMS SHOULD BE EXECUTED FROM
YOUR OWN DIRECTORY, USING THE PATH ABOVE (YOU CAN SET IT IN YOUR .cshrc).
Data:
Some example, presumably related, haemoglobin (beta-chain) amino-acid
sequences are here. You will need to download them and split them up into
separate files in order to use them in this practical.
-
needle is a program which performs global alignment using the
Needleman-Wunsch algorithm.
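To make the idea concrete, the following is a minimal sketch of Needleman-Wunsch global alignment in Python. It is not the needle implementation: it assumes a simple match/mismatch score and a single linear gap penalty (the parameter values are illustrative), whereas needle by default uses a substitution matrix and separate gap-open and gap-extend penalties. The function name needleman_wunsch is made up for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Assumption: simple match/mismatch scores and a linear gap penalty,
    chosen for illustration only.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

Because the alignment is global, every residue of both sequences appears in the output, padded with "-" where a gap was introduced.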
-
Use needle with the default parameters to produce pairwise alignments of the
following sequences:
- Human (normal)
- Gorilla
- Rabbit
- Pig
Store your results, and produce a table recording, for each pair, the score,
sequence identity, sequence similarity and number of gaps.
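As a rough illustration of what the identity and gap columns of such a table measure, the sketch below counts them from a pair of already-aligned strings. The function name alignment_stats and the convention of using "-" for a gap are assumptions made for this example; it also assumes percentages are taken over the alignment length. Sequence similarity is left out because it depends on which substitution matrix is used to decide that two different residues count as similar.

```python
def alignment_stats(aligned_a, aligned_b):
    """Count identities and gap positions in a pairwise alignment.

    Both strings must have the same length; '-' marks a gap (assumed
    convention for this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_a)
    pairs = list(zip(aligned_a, aligned_b))
    identical = sum(1 for x, y in pairs if x == y and x != "-")
    gaps = sum(1 for x, y in pairs if x == "-" or y == "-")
    return {
        "length": length,
        "identity": identical,
        "gaps": gaps,
        # Percentage taken over the alignment length (assumption).
        "identity_pct": 100.0 * identical / length if length else 0.0,
    }


if __name__ == "__main__":
    print(alignment_stats("ACGT", "A-GT"))
```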
-
Try to construct a rudimentary phylogenetic tree of the Human, Gorilla, Rabbit
and Pig, based on the pairwise global alignment scores. Is it similar to the
one given on slide 14 of the Introduction slides?
Hint -- create a 4x4
table to display the scores; you will only need to fill in the top half of the
matrix (why?):
        | Human | Gorilla | Rabbit | Pig |
Human   |   x   |    x    |   x    |  x  |
Gorilla |       |    x    |   x    |  x  |
Rabbit  |       |         |   x    |  x  |
Pig     |       |         |        |  x  |
You can also use the % sequence identity or the % sequence
similarity (repeat the table for each of these measures). Do they give
the same phylogenetic relationships?
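One simple way to turn such a table into a rudimentary tree is to repeatedly join the two closest groups (UPGMA-style average linkage). The sketch below is only an illustration under stated assumptions: it expects distances rather than scores (for example 100 minus the % identity), the function name upgma is made up, and the labels A to D and their distances in the usage example are invented toy values, not results for the haemoglobin sequences.

```python
from itertools import combinations


def upgma(distances):
    """Build a nested-tuple tree from pairwise distances.

    distances: dict mapping (x, y) -> distance, one entry per unordered
    pair (i.e. the upper half of the table).
    """
    d = {}      # symmetric lookup keyed by frozenset({x, y})
    sizes = {}  # number of original items in each cluster
    for (x, y), value in distances.items():
        d[frozenset((x, y))] = value
        sizes[x] = 1
        sizes[y] = 1
    clusters = list(sizes)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        # Distance to the merged cluster is the size-weighted average.
        for z in clusters:
            if z != x and z != y:
                d[frozenset((merged, z))] = (
                    sizes[x] * d[frozenset((x, z))]
                    + sizes[y] * d[frozenset((y, z))]
                ) / (sizes[x] + sizes[y])
        sizes[merged] = sizes[x] + sizes[y]
        clusters = [z for z in clusters if z != x and z != y] + [merged]
    return clusters[0]


if __name__ == "__main__":
    # Invented toy distances between four hypothetical sequences.
    toy = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
           ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
    print(upgma(toy))  # ('D', ('C', ('A', 'B')))
```

The nested tuples show the joining order: the innermost pair was merged first, so it is the most closely related pair under the chosen distance.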
-
Use the skills that you learned in Lab 1 to obtain files with the amino-acid
and nucleotide sequences of human isocitrate dehydrogenase. Use needle to
compare these sequences with those that you have for isocitrate dehydrogenase
from E. coli.
-
Repeat the previous exercises, using this time the
stretcher
program.
What differences are there in
- the score, sequence identity, sequence similarity, number of gaps and
alignments?
- the execution time? (Is there any noticeable difference?)
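stretcher is documented as computing a global alignment with a variant of the dynamic programming algorithm that uses memory linear in the sequence length. The sketch below illustrates only the core of that idea, under simplifying assumptions: it keeps two rows of the score table at a time and returns the score alone, with illustrative match/mismatch/gap values. Recovering the alignment itself in linear space needs an additional divide-and-conquer step that is not shown, and the function name is made up for this example.

```python
def global_score_linear_space(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only two rows of the DP table.

    Assumption: simple match/mismatch scores and a linear gap penalty.
    Memory use grows with len(b) rather than len(a) * len(b).
    """
    # prev[j] = best score for aligning the prefix of a seen so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_score_linear_space("ACGT", "AGT"))  # 2
```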
-
Dot plots are probably the oldest way of comparing sequences.
A dot plot is a visual representation of the similarities between two
sequences. Each axis of a rectangular array represents one of the two sequences
to be compared. A window length is fixed, together with a criterion for when
two sequence windows are deemed to be similar. Whenever a window in one
sequence resembles a window in the other sequence, a dot or short
diagonal is drawn at the corresponding position of the array. Thus, when two
sequences share similarity over their entire length a diagonal line will extend
from one corner of the dot plot to the diagonally opposite corner. If two
sequences only share patches of similarity this will be revealed by diagonal
stretches. [http://www.bioinfo.org.cn/lectures/index-12.html].
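The windowed comparison described above can be sketched in a few lines of Python. This is an illustration only, assuming the similarity criterion is "at least threshold identical positions within the window"; a program such as dotmatcher scores windows with a substitution matrix instead. The function names dot_plot and render are made up for this example.

```python
def dot_plot(seq_a, seq_b, window=3, threshold=3):
    """Return the set of (i, j) window start positions that are similar.

    Assumption: two windows are 'similar' when at least `threshold` of
    their `window` positions are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window)
                          if seq_a[i + k] == seq_b[j + k])
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: rows follow seq_a, columns follow seq_b."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        for i in range(len(seq_a))
    )


if __name__ == "__main__":
    s = "ACGTAC"
    # Comparing a sequence with itself gives the main diagonal.
    print(render(s, s, dot_plot(s, s)))
```

Lowering the threshold relative to the window length admits more dots (more sensitivity, more noise); raising it leaves only the strongest diagonals.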
dotmatcher displays a thresholded dotplot of two sequences.
Dotter is a more interactive dotplot program.
It is installed locally in /users/students4/software/public/Bio4/bin/
The dot-plot programs permit the user to very easily identify conserved
and non-conserved regions between two sequences.
-
Use the dot plot programs
to visually compare
- isocitrate dehydrogenase from human and E. coli.
- haemoglobin from two different species
- any sequence against itself
- the reverse of any sequence against itself (look at Lab 2 for information
about how to use revseq)
- the concatenation of two very different sequences, in different orders.
E.g. concatenate a haemoglobin sequence (H) and an isocitrate dehydrogenase
sequence (I) in the orders HI and IH, and see what
differences there are between using local and global
alignment programs to compare these hybrid (chimera) sequences. N.B.
Should you concatenate the nucleotide or amino-acid versions of the sequences?
-
Try this on the
- nucleotide sequences, and
- amino-acid sequences.
What is the difference?
-
Vary the threshold parameter to see if you can get a better
indication of the matching regions.
-
The water and matcher programs calculate local alignments between two
sequences by searching for regions of local similarity; the resulting
alignments need not include the entire length of the sequences. Local
alignment methods are very useful for scanning databases or in other
circumstances when you wish to find matches between small regions of
sequences, for example between protein domains.
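For comparison with global alignment, here is a minimal sketch of local alignment in the Smith-Waterman style, the algorithm that water is documented as using. It is an illustration under stated assumptions, not the EMBOSS code: scores are simple match/mismatch values with a linear gap penalty (illustrative numbers), and the function name smith_waterman is made up for this example. The key differences from the global case are that cell scores never drop below zero and that the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Assumption: simple match/mismatch scores and a linear gap penalty.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (6, 'ACG', 'ACG')
```

Only the matching region is reported; the unrelated flanking residues, which a global alignment would be forced to include, are left out.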
-
Use these programs to align some of the sequences that you have been working
with today.
Are you able to detect any local alignments?