Bioinformatics - Lab 4
Multiple sequence alignment
David Gilbert

This week we are going to perform multiple sequence alignments. A popular program to do this is clustalW -- see the 2can clustalW tutorial for some details.

Last week we used programs to perform sequence comparison based on global alignments and local alignments. In the second half of this week's tutorial we will continue on this theme by familiarising ourselves with different sequence comparison programs, and the use of different scoring matrices.

Why perform Multiple Sequence Alignment:
(from Wikipedia)
Multiple alignment can be used to study evolutionary relationships between related proteins. Since the changes between gene sequences due to evolution are incremental, we can take homologous genes, i.e. genes with a common evolutionary origin, from a diverse range of organisms and then compare them by aligning identical or similar residues. The comparison of these related genes may then be used to study which regions of genes have been conserved and which are sensitive to mutation over the years. This is very useful in designing experiments to test and modify the function of specific proteins, to predict the function and structure of proteins, and to identify new members of protein families.

We will also show in a later lecture and lab that multiple alignments can be very useful to represent properties of a set or family of sequences, and that descriptors based on multiple alignments can be used as probes to search in sequence databases for sequences which are related to the probe set.

Multiple Sequence Alignment programs and techniques:

Progressive strategies for multiple alignment: A common approach for multiple sequence alignment is to progressively align pairs of sequences. First, two sequences are selected and aligned together; this alignment is then used as the basis for aligning each subsequent sequence in turn.
One of the most popular programs for multiple sequence alignment is known as ClustalW. It is a general purpose multiple alignment program for DNA or proteins. It calculates the best match for the selected sequences, and lines them up so that the similarities and differences can be seen. It also generates a phylogram and a cladogram which can be useful for studying the evolutionary relationships between the set of sequences. (The difference between a cladogram and a phylogenetic tree is that a phylogenetic tree is a branching diagram (tree) in which branch lengths are proportional to the amount of inferred evolutionary change. A cladogram is a tree where the branches are of equal length, thus cladograms show common ancestry, but do not indicate the amount of evolutionary "time".)
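The progressive strategy described above can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the scoring values (`match`, `mismatch`, `gap`) and the function names `global_score` and `joining_order` are made up for illustration, and it only decides the order in which sequences would be added. ClustalW itself builds a guide tree from the pairwise distances and aligns along that tree, which this sketch does not attempt.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def joining_order(seqs):
    """Order in which a simple progressive scheme would add the sequences.

    seqs maps a sequence name to its residues. The best-scoring pair comes
    first; each later step adds the sequence most similar to one already chosen.
    """
    names = list(seqs)
    if len(names) < 2:
        return names
    pairs = list(combinations(names, 2))
    scores = {frozenset(p): global_score(seqs[p[0]], seqs[p[1]]) for p in pairs}
    order = list(max(pairs, key=lambda p: scores[frozenset(p)]))
    remaining = [n for n in names if n not in order]
    while remaining:
        nxt = max(
            remaining,
            key=lambda n: max(scores[frozenset((n, m))] for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    # Made-up toy sequences, not taken from the lab data sets.
    toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTTTTT"}
    print(joining_order(toy))
```

The two most similar sequences are paired first because errors made early in a progressive alignment are carried through to every later step.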

Use ClustalW at the EBI to perform a multiple alignment of some sequences. The easiest supported format to use is FASTA:
```
>SEQUENCE_NAME1 PLUS ANY OTHER COMMENTS
SEQUENCE1 (can be on several lines)
>SEQUENCE_NAME2 PLUS ANY OTHER COMMENTS
SEQUENCE2 (can be on several lines)
... 
>SEQUENCE_NAMEn PLUS ANY OTHER COMMENTS
SEQUENCEn (can be on several lines)
```
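If you want to check a FASTA file before submitting it, the layout shown above is simple enough to read with a few lines of Python. The following is a minimal sketch, not a full parser: the function name `parse_fasta` is hypothetical, and it assumes that every record starts with a `>` line and that blank lines can be ignored.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (comment, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be spread over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Made-up example records.
    example = ">seq1 first example\nACGT\nACGT\n>seq2\nTTGA\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```

Counting the records and their lengths in this way is a quick check that a file you have edited by hand is still well formed.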
Note: if you wish, you can use clustalw at the command line; it should be available to you if you have followed this advice:
Note: You will need to ensure that you have added the line
source /usr/local/bin/bioinformaticslab.env
to your ~/.cshrc in order to use some of the software in the practicals. A very quick intro is here. An intuitive guide is:
- run clustalw without any arguments
- select the following menu options in turn:
  1. "1. Sequence Input From Disc"
  2. "2. Multiple Alignments"
  3. "1. Do complete multiple alignment now (Slow/Accurate)"
  4. Choose or select output filenames from there
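As an alternative to the menu-driven steps above, clustalw can also be run non-interactively. The sketch below builds such a command from Python. It assumes that the program is on your path under the name `clustalw` and that your installed version accepts the usual `-INFILE`, `-ALIGN` and `-OUTFILE` options; the file names are hypothetical, so check the program's own help if in doubt.

```python
import shutil
import subprocess


def clustalw_command(infile, outfile, program="clustalw"):
    """Build the argument list for a complete multiple alignment."""
    return [program, f"-INFILE={infile}", "-ALIGN", f"-OUTFILE={outfile}"]


def run_clustalw(infile, outfile, program="clustalw"):
    """Run clustalw, failing early if the program cannot be found."""
    if shutil.which(program) is None:
        raise FileNotFoundError(f"{program} is not on your PATH")
    subprocess.run(clustalw_command(infile, outfile, program), check=True)


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own.
    print(" ".join(clustalw_command("globins.fasta", "globins.aln")))
```

Printing the command before running it lets you try the same line by hand in the shell first.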
1. You can use some amino-acid sequences: there is a list of globins from different organisms, which I have edited into FASTA format here.
  Consensus symbols:
  An alignment will display by default the following symbols denoting the degree of conservation observed in each column:
  "*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.
  ":" means that conserved substitutions have been observed, according to the colour table.
  "." means that semi-conserved substitutions are observed.
  You should look at
  - The alignment file (best with colours).
  - The phylogram and cladogram. (What is the difference between the two?).
  - The alignment in the Jalview alignment editor which can be launched by a button from the ClustalW results page (explanation here)
2. You can create multiple alignments for your own sequences of interest by searching the NCBI database for (sets of) sequences, and obtain them in fasta format.
  - For example, if you are interested in Avian Influenza, then search on e.g. "h5n1"; for AIDS search on "hiv-2".
  - You should then go to either the nucleotide or protein sequence databases and select the sequences of interest
  - Display FASTA
  - Send to text
  - Save the sequences in a file
  - Hint: edit the fasta comment lines to
    - remove any spaces (clustalw will only display the comment up to the first space)
    - remove any other characters in the comments that you don't want to appear in the output
    - avoid having any identical comment lines in your file by adding a unique identifier to each such comment line if needed
    I have collected some H5N1 sequences from varied locations and dates - they are available here. I have tried to preserve the location, date and species of each entry. If you are interested in the relationships between the locations, you may need to look up e.g. PR China province names on a map in order to see where many of the sequences come from -- e.g. Hebei.
    I have added a unique number as a prefix to the fasta comment in order to distinguish between comment lines which have been made identical in my comment-editing process.
3. Alternatively, you can take one sequence of interest (e.g. a globin) and use Blast to identify similar sequences. You can then retrieve some of these which are of interest (e.g. from different organisms) in fasta format, and create a multiple alignment using clustalW.
4. (from the 2can clustalW tutorial)
  We will now consider aligning several tropomyosin nucleotide sequences, represented by the accession numbers below. These will exhibit more sequence diversity than the globins which you tried above.
```
BF056441 BE8487196 BF022813 BF452255 BG089808 BG147728 BI817778 AF186109 
AF186110 AF310722  AF362886 AF362887 AF087679 SSAJ803 SSAJ804 
```
  The nucleotide sequences for these accession numbers are available here.
5. Try out some other multiple alignment programs on your favourite data sets:
  - T-coffee ("This program is more accurate than ClustalW for sequences with less than 30% identity, but it is slower...")
  - muscle (email server only)
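The hints in item 2 about editing FASTA comment lines (no spaces, no unwanted characters, no identical comments) can be automated. The following is a rough sketch rather than the procedure actually used to prepare the H5N1 file: the function name `clean_headers` and the set of characters that are kept are assumptions, and the example comment lines are made up.

```python
import re


def clean_headers(headers):
    """Return FASTA comment lines that are safe to pass to clustalw."""
    cleaned = []
    for number, header in enumerate(headers, start=1):
        text = header.lstrip(">").strip()
        # Replace runs of whitespace, since only the text up to the
        # first space is displayed in the output.
        text = re.sub(r"\s+", "_", text)
        # Drop any character outside an assumed "safe" set.
        text = re.sub(r"[^A-Za-z0-9_./-]", "", text)
        # A numeric prefix keeps otherwise identical comments distinct.
        cleaned.append(f">{number}_{text}")
    return cleaned


if __name__ == "__main__":
    # Made-up comment lines for illustration.
    for line in clean_headers([">duck Hebei 2005 (H5N1)", ">duck Hebei 2005 (H5N1)"]):
        print(line)
```

Adding the prefix to every comment, not only to the duplicates, keeps the numbering in step with the order of the sequences in the file.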
Reminder -- if you are using the EMBOSS suite:
- the EMBOSS documentation is visible at compbio.dcs.gla.ac.uk/courses/Bio4/emboss/.
- the EMBOSS suite is installed on the DCS system in /users/students4/software/public/Bio4/bin/
- You do not have write permissions in that directory! hence: YOU SHOULD WORK IN YOUR OWN DIRECTORY, WHERE YOU WILL HAVE DOWNLOADED THE DATA FILES FOR THE EXERCISES. THE EMBOSS PROGRAMS SHOULD BE EXECUTED FROM YOUR OWN DIRECTORY, USING THE PATH ABOVE (YOU CAN SET IT IN YOUR .cshrc).
(Exercise from Lab 3.)
The EMBOSS suite's water and matcher programs calculate a local alignment between two sequences by searching for regions of local similarity; the alignment need not include the entire length of either sequence. Local alignment methods are very useful for scanning databases, or in other circumstances when you wish to find matches between small regions of sequences, for example between protein domains.
- Use these programs to align some of the sequences that you have been working with today. Are you able to detect any local alignments?
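To see what a local alignment program is doing, it can help to look at a small Smith-Waterman sketch in Python. This is not how water or matcher are implemented: EMBOSS water scores residues with a substitution matrix and uses separate gap opening and extension penalties, whereas the simple `match`, `mismatch` and `gap` values here are assumptions chosen for illustration, as are the function name and the example sequences.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (best score, aligned part of a, aligned part of b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # Made-up sequences that share one short region.
    print(local_alignment("AAAAGATTACACCCC", "TTGATTACATT"))
```

Only the shared region is reported; the unrelated flanking residues are left out, which is exactly the difference from the global alignments of last week.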

Here are some sequence search programs from the EBI tools page. Try some of the following; do they give the same/similar results for the same probe sequence? What about the execution time? Note the difference between the detailed results you get for the alignments. There is extensive on-line help about homology and similarity at http://www.ebi.ac.uk/help/homology_frame.html

Fasta3	Sequence similarity and homology searching against nucleotide and protein database using Fasta3
WU-Blast2	Washington University blast2 (blast 2.0 with gaps)
NCBI-Blast2	NCBI blast2 (blastall) program
MPsrch	Edinburgh University's new implementation of the Smith and Waterman algorithm
Scanps2.3	Version 2.3 of Scanps. Fast implementation of the true Smith & Waterman algorithm for protein database searches

Repeat some searches using different matrices. There is some explanation on matrices here.
Repeat the search using BLAST with different expected thresholds (EXP.THR). You should be able to restrict or extend the list of matches that you get. There is a lot of information on BLAST at http://www.ncbi.nlm.nih.gov/BLAST/.
You can perform searches selectively on completed genomes and proteomes using the Proteomes & Genomes Fasta3 server. Try it! (The biologists may be able to help with the names of the organisms...).
