Bioinformatics - Lab 4
Multiple sequence alignment
David Gilbert
This week we are going to perform multiple sequence
alignments. A popular program to do this is clustalW -- see the 2can
clustalW tutorial for some details.
Last week we used programs to perform sequence comparison based on
global alignments and local alignments. In the second half of this week's
tutorial we will continue on this theme
by familiarising ourselves with different sequence comparison programs,
and the use of different scoring matrices.
Why perform Multiple Sequence Alignment:
(from Wikipedia)
Multiple alignment can be
used to study evolutionary relationships between related proteins. Since the
changes between gene sequences due to evolution are incremental, we can take
homologous genes, i.e. genes with a common evolutionary origin, from a diverse
range of organisms and then compare them by aligning identical or similar
residues. The comparison of these related genes may then be used to study which
regions of genes have been conserved and which are sensitive to mutation over
the years. This is very useful in designing experiments to test and modify the
function of specific proteins, to predict the function and structure of
proteins, and to identify new members of protein families.
We will also show in a later lecture and lab that multiple alignments
can be very useful to represent properties of a set or family of
sequences, and that descriptors based on multiple alignments can be
used as probes to search in sequence databases for sequences which are
related to the probe set.
Multiple Sequence Alignment programs and techniques:
Progressive strategies for multiple alignment: A common approach to multiple
sequence alignment is to align pairs of sequences progressively. First, two
sequences are selected and aligned; this alignment is then used as the basis
for aligning each subsequent sequence.
One of the most popular programs for multiple sequence alignment is known as
ClustalW. It is a general purpose multiple alignment program for DNA or proteins.
It calculates the best match for the selected sequences, and lines them up so
that the similarities and differences can be seen. It also generates a phylogram
and a cladogram
which can be useful for studying the evolutionary relationships between the set
of sequences.
(The difference between a cladogram and a phylogram is that a
phylogram is a branching diagram (tree) in which branch lengths are
proportional to the amount of inferred evolutionary change. A cladogram is a tree
in which the branches are of equal length; thus cladograms show common ancestry, but
do not indicate the amount of evolutionary "time".)
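To make the distinction concrete, here is a minimal sketch that takes a tree written in Newick notation (the plain-text tree format that, as far as I know, ClustalW uses for its tree files) and removes the branch lengths. What remains is the topology only, which is the information a cladogram conveys. The tree string and species names are made up for illustration, and the regular expression assumes simple unquoted labels.

```python
import re


def strip_branch_lengths(newick):
    """Remove ':<number>' branch lengths from a Newick tree string.

    Assumes labels are unquoted and contain no colons.
    """
    return re.sub(r":[0-9.eE+-]+", "", newick)


# Hypothetical phylogram: branch lengths follow each ':'.
phylogram = "((human:0.12,chimp:0.10):0.05,mouse:0.40);"
cladogram = strip_branch_lengths(phylogram)
print(cladogram)
```

Both strings describe the same groupings; only the first says how much change is inferred along each branch.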
- Use ClustalW at the EBI to perform multiple alignment of some sequences.
The easiest supported format to use is that of FASTA:
>SEQUENCE_NAME1 PLUS ANY OTHER COMMENTS
SEQUENCE1 (can be on several lines)
>SEQUENCE_NAME2 PLUS ANY OTHER COMMENTS
SEQUENCE2 (can be on several lines)
...
>SEQUENCE_NAMEn PLUS ANY OTHER COMMENTS
SEQUENCEn (can be on several lines)
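If you want to check a FASTA file before submitting it, the format above is simple enough to read with a few lines of Python. The following is a minimal sketch, not a full parser: the function name `parse_fasta` and the example records are my own, and it assumes every record starts with a `>` line.

```python
def parse_fasta(text):
    """Return a list of (comment, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may be spread over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 first test sequence
ACGT
ACGT
>seq2 second test sequence
TTGA
"""
records = parse_fasta(example)
for comment, sequence in records:
    print(comment, len(sequence))
```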
Note: if you wish, you can use clustalw at the command line; it should
be available to you if you have followed this advice:
You will need to ensure that you have added the line
source /usr/local/bin/bioinformaticslab.env
to your ~/.cshrc in order to use some of the software in the
practicals.
A very quick intro is
here.
An intuitive guide is
- run clustalw without any arguments
- select
- 1. Sequence Input From Disc
- 2. Multiple Alignments
- 1. Do complete multiple alignment now (Slow/Accurate)
- Choose or select output filenames from there
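The menu steps above can also be done non-interactively. The sketch below only builds the command line; it uses the `-INFILE`, `-ALIGN`, `-TYPE` and `-OUTFILE` options as I understand them from the ClustalW command-line help, so check `clustalw -help` on your own installation. The file names `globins.fasta` and `globins.aln` are placeholders.

```python
import shlex


def build_clustalw_command(infile, outfile, seq_type="PROTEIN", executable="clustalw"):
    """Build an argument list for a complete multiple alignment with clustalw."""
    return [
        executable,
        f"-INFILE={infile}",    # sequences to align (e.g. in FASTA format)
        "-ALIGN",               # do a full multiple alignment
        f"-TYPE={seq_type}",    # PROTEIN or DNA
        f"-OUTFILE={outfile}",  # where to write the alignment
    ]


# Placeholder file names; replace them with your own.
command = build_clustalw_command("globins.fasta", "globins.aln")
print(shlex.join(command))
```

To actually run the alignment from Python you could pass the list to `subprocess.run(command, check=True)`, provided clustalw is on your path.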
- You can use some amino-acid sequences: there is a list of globins from different organisms.
I have edited the file to FASTA format
here.
Consensus symbols:
By default, an alignment displays the following symbols denoting the degree
of conservation observed in each column:
"*" means that the residues or nucleotides in that column are identical in all
sequences in the alignment.
":" means that conserved substitutions have been observed, according to the colour table.
"." means that semi-conserved substitutions are observed.
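The following sketch shows how such a consensus line could be computed for a protein alignment. The "strong" and "weak" residue groups are the ones I recall from the ClustalW documentation; treat them as an assumption and check them against the colour table. Leaving a blank for any column that contains a gap is also my assumption, and the three short sequences are invented.

```python
# Residue groups as recalled from the ClustalW documentation (assumption).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = {c.upper() for c in column}
    if "-" in residues:
        return " "  # assumption: columns with gaps get no symbol
    if len(residues) == 1:
        return "*"  # identical in all sequences
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # conserved substitution
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # semi-conserved substitution
    return " "


def consensus_line(aligned):
    """Build the consensus line for a list of equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*aligned))


# Invented example alignment.
aligned = ["MKVLAW", "MRILSD", "MKVMCG"]
for seq in aligned:
    print(seq)
print(consensus_line(aligned))
```

For nucleotide alignments only the "*" case is meaningful in this sketch.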
You should look at:
- The alignment file (best with colours).
- The phylogram and cladogram. (What is the difference between the two?)
- The alignment in the Jalview alignment editor, which can be launched by a button from the
ClustalW results page
(explanation here)
- You can create multiple alignments for your own sequences of interest by searching the NCBI database for (sets of)
sequences, and obtaining them in FASTA format.
- For example, if you are interested in Avian
Influenza, then search on e.g. "h5n1"; for AIDS search on "hiv-2".
- You should then go to either
the nucleotide or the protein sequence database, select the sequences of interest, and then:
- Display FASTA
- Send to text
- Save the sequences in a file
- Hint: edit the FASTA comment lines to
- remove any spaces (clustalw will only display the comment up to the first space)
- remove any other characters in the comments that you don't want to appear in the
output
- avoid having any identical comment lines in your file, by adding a unique identifier to
each such comment line if needed
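If you have many sequences, the comment-line editing described in the hint can be automated. This is a minimal sketch under my own assumptions: it replaces spaces and any character other than letters, digits, "." and "-" with underscores, and it prefixes every comment with a running number so that no two comments are identical. The function name and the example comment lines are hypothetical.

```python
import re


def clean_headers(headers):
    """Return FASTA comment lines without spaces, each with a unique numeric prefix."""
    cleaned = []
    for number, header in enumerate(headers, start=1):
        text = header.lstrip(">").strip()
        # Replace every run of unwanted characters (including spaces) by one underscore.
        text = re.sub(r"[^A-Za-z0-9.-]+", "_", text).strip("_")
        cleaned.append(f">{number}_{text}")
    return cleaned


# Hypothetical comment lines that would otherwise be identical.
headers = [">A/chicken/Hebei/2005 (H5N1)", ">A/chicken/Hebei/2005 (H5N1)"]
for line in clean_headers(headers):
    print(line)
```

Numbering every comment is simpler than detecting duplicates first, at the cost of slightly longer names in the output.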
I have collected some H5N1 sequences from varied locations and dates - they are available
here. I have tried to preserve the
location, date and species of each entry. If you are interested in the relationships
between the locations, you may need to look up e.g. PR China province names
on a map in order to see where many of the sequences come from -- e.g.
Hebei.
I have added a unique number as a
prefix to the fasta comment in order to distinguish between comment lines which have been
made identical in my comment-editing process.
- Alternatively, you can take one sequence of interest (e.g. a globin) and use
Blast to identify similar sequences. You can then retrieve some of
these which are of interest (e.g. from different organisms) in FASTA format, and create a
multiple alignment using clustalW.
- (from the 2can clustalW
tutorial)
We will now consider aligning several tropomyosin nucleotide sequences, represented by the accession
numbers below. These will exhibit more sequence diversity than the globins which you
tried above.
BF056441 BE8487196 BF022813 BF452255 BG089808 BG147728 BI817778 AF186109
AF186110 AF310722 AF362886 AF362887 AF087679 SSAJ803 SSAJ804
The sequences for these nucleotides are available
here.
- Try out some other multiple alignment programs on your favourite data sets:
- T-coffee
("This program is more accurate than ClustalW for sequences with less than 30% identity,
but it is slower...")
- muscle
(email server only)
Reminder -- if you are using the EMBOSS suite:
- the EMBOSS documentation is visible at
compbio.dcs.gla.ac.uk/courses/Bio4/emboss/.
- the EMBOSS suite is installed on the DCS system
in /users/students4/software/public/Bio4/bin/
- You do not have
write permissions in that directory! hence:
YOU SHOULD WORK IN YOUR OWN DIRECTORY, WHERE YOU WILL HAVE DOWNLOADED THE
DATA FILES FOR THE EXERCISES. THE EMBOSS PROGRAMS SHOULD BE EXECUTED FROM
YOUR OWN DIRECTORY, USING THE PATH ABOVE (YOU CAN SET IT IN YOUR .cshrc).
- (Exercise from Lab 3.)
The EMBOSS suite's water and matcher programs calculate the
local alignment between two sequences by searching for
regions of local similarity; the alignment need not include the
entire length of the sequences. Local alignment methods are very useful for
scanning databases, or in other circumstances when you wish to find matches
between small regions of sequences, for example between protein domains.
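To illustrate what "local" means here, the sketch below computes the best local alignment score with the Smith and Waterman recurrence, the algorithm that water is documented as using. It is a simplification: the scoring values (match +2, mismatch -1, gap -2) are arbitrary choices of mine, it uses a simple per-position gap penalty rather than the gap open/extend penalties the EMBOSS programs offer, and it returns only the score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap        # gap in b
            left = current[j - 1] + gap   # gap in a
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            current[j] = max(0, diagonal, up, left)
            best = max(best, current[j])
        previous = current
    return best


# Two invented sequences that share only the short region ACGT.
print(smith_waterman_score("GGGACGTCCC", "TTACGTAA"))
```

The shared region scores the same whether or not the unrelated flanking residues are present, which is why local methods can find a matching domain inside otherwise dissimilar sequences.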
- Use these programs to align some of the sequences that you have been working
with today.
Are you able to detect any local alignments?
- Here are some sequence search programs from the
EBI tools page.
Try some of the following; do they give the
same/similar results for the same probe sequence? What about the
execution time?
Note the difference between the detailed results you get for the alignments.
There is extensive on-line help about homology and similarity at
http://www.ebi.ac.uk/help/homology_frame.html
Fasta3 | Sequence similarity and homology searching against nucleotide and protein database using Fasta3
WU-Blast2 | Washington University blast2 (blast 2.0 with gaps)
NCBI-Blast2 | NCBI blast2 (blastall) program
MPsrch | Edinburgh University's new implementation of the Smith and Waterman algorithm
Scanps2.3 | Version 2.3 of Scanps. Fast implementation of the true Smith & Waterman algorithm for protein database searches
- Repeat some searches using different matrices. There is
some explanation on matrices here.
- Repeat the search using BLAST with different expected thresholds
(EXP.THR). You should be able to restrict or extend the list of
matches that you get.
There is a lot of information on BLAST at
http://www.ncbi.nlm.nih.gov/BLAST/.
- You can perform searches selectively on completed genomes and proteomes using
the Proteomes & Genomes
Fasta3 server. Try it! (The biologists may be able to help with the names of
the organisms...).