Bioinformatics - Lab 3
Pairwise sequence comparison
David Gilbert

Last week you were briefly introduced to searching sequence databases using the Blast program, which performs pairwise sequence comparison and returns a score, an E-value and a pairwise alignment between two sequences. You should have realised that Blast can perform pairwise sequence comparison very fast. For example, searching with the 1568-nucleotide or 416-amino-acid sequences from J02799 over all of the GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, environmental samples or phase 0, 1 or 2 HTGS sequences), i.e. over 3.7 million sequences (over 16.5 billion letters), returned results in very acceptable times! Blast achieves this by not searching for global alignments, and by using effective heuristics. However, you should be aware that other approaches are sometimes more effective, e.g. exact (i.e. non-heuristic) methods and/or those based on global alignments.

This week we are going to use programs from the EMBOSS suite to perform sequence comparison based on

- global alignments, along the lines of the approach introduced in last week's Sequence Comparison (1) lecture,
- local alignments, and
- dot-plots -- an effective way to visually compare two sequences.

Note

At the start of each session, you will need to type the following command at the unix prompt in order to use all of the software in the practicals:
source /usr/local/lab/lab.env

Alternatively, add the line
source /usr/local/lab/lab.env
to your ~/.cshrc

Reminder:

The EMBOSS suite is installed on the DCS system in /usr/local/lab/packages, but you do not have write permissions in that directory. Hence: YOU SHOULD WORK IN YOUR OWN DIRECTORY, WHERE YOU WILL HAVE DOWNLOADED THE DATA FILES FOR THE EXERCISES. THE EMBOSS PROGRAMS SHOULD BE EXECUTED FROM YOUR OWN DIRECTORY, USING THE PATH ABOVE (YOU CAN SET IT IN YOUR .cshrc).

Note: you will need to ensure that you have the line
source /usr/local/bin/bioinformaticslab.env
in your ~/.cshrc in order to use some of the software in the practicals.

The EMBOSS documentation is visible at compbio.dcs.gla.ac.uk/courses/Bio4/emboss/

Data: Some example, presumably related, haemoglobin (beta-chain) amino-acid sequences are here. You will need to download them and split them up into separate files in order to use them in this practical.

needle is a program which performs global alignment using the Needleman-Wunsch algorithm.
- Use needle with the default parameters to produce pairwise alignments of the haemoglobin sequences from:
  - Human (normal)
  - Gorilla
  - Rabbit
  - Pig
  Store your results, and produce a table recording the pairwise score, sequence identity, sequence similarity and number of gaps.
- Try to construct a rudimentary phylogenetic tree of the Human, Gorilla, Rabbit and Pig, based on the pairwise global alignment scores. Is it similar to the one given on slide 14 of the Introduction slides?
  Hint -- create a 4x4 table to display the scores; you will only need to fill in the top half of the matrix (why?):
  
            Human   Gorilla   Rabbit   Pig
  Human     x       x         x        x
  Gorilla           x         x        x
  Rabbit                      x        x
  Pig                                  x
You can also use the sequence identity or the % sequence similarity (repeat the table for each of these methods). Do they give the same phylogenetic relationships?
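Once the table is filled in, one simple way to turn it into a rudimentary tree is to repeatedly join the two most similar groups. The sketch below is only an illustration of that idea (greedy average-linkage clustering, without branch lengths); it is not necessarily the method used to produce the tree on the lecture slides. The function name `build_tree`, the labels A to D and all of the numbers are invented placeholders, not real needle results: substitute your own species names and scores.

```python
from itertools import combinations


def build_tree(scores):
    """Greedy average-linkage clustering on pairwise similarity scores.

    `scores` maps frozenset({name1, name2}) -> similarity (higher = more alike).
    Returns a nested tuple showing the order in which groups were joined.
    """
    names = sorted({name for pair in scores for name in pair})
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def similarity(c1, c2):
        # Average score over every pair with one member in each cluster.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(scores[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find and join the two most similar clusters.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Placeholder values for four hypothetical sequences A-D (NOT real results).
example_scores = {
    frozenset({"A", "B"}): 90,
    frozenset({"A", "C"}): 70,
    frozenset({"A", "D"}): 60,
    frozenset({"B", "C"}): 72,
    frozenset({"B", "D"}): 62,
    frozenset({"C", "D"}): 65,
}

tree = build_tree(example_scores)
print(tree)
```

The same function can be run on a table of sequence identities or % similarities, which makes it easy to check whether the three measures give the same grouping.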
Use the skills that you learned in Lab 1 to obtain files with the amino-acid and nucleotide sequences of human isocitrate dehydrogenase. Use needle to compare the sequences with those that you have for isocitrate dehydrogenase from E. coli.
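To see what a global alignment program is computing, it can help to look at a stripped-down version of the Needleman-Wunsch dynamic programme. The sketch below is a teaching aid under simplifying assumptions, not a reimplementation of needle: it uses a flat match/mismatch score and a single per-position gap penalty (the values 1, -1 and -2 are arbitrary choices), whereas needle uses a substitution matrix and separate gap-open and gap-extension penalties. The function names are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


def alignment_stats(top, bottom):
    """Return (identical positions, gap positions, alignment length)."""
    identical = sum(x == y for x, y in zip(top, bottom))
    gaps = top.count("-") + bottom.count("-")
    return identical, gaps, len(top)


result = needleman_wunsch("ACGT", "ACT")
print(result)
print(alignment_stats(result[1], result[2]))
```

Because every residue of both sequences must appear in the alignment, two unrelated sequences still get a full-length (and usually gap-heavy) alignment; keep this in mind when interpreting the human versus E. coli comparison.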
Repeat the previous exercises, this time using the stretcher program. What differences are there in
- the score, sequence identity, sequence similarity, number of gaps and alignments
- the execution time. (Is there any noticeable difference?)
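One reason two global alignment programs can differ in resource use is how much of the dynamic-programming table they keep in memory. The EMBOSS documentation describes stretcher as using a linear-space modification of the classic algorithm. The sketch below is not stretcher's algorithm (which also recovers the alignment itself); it only shows, under the same simplified scoring assumptions as a textbook Needleman-Wunsch (arbitrary values 1, -1, -2), that the optimal global score can be computed while holding just two rows of the table. The function name is hypothetical.

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score using memory proportional to len(b)."""
    # Row 0 of the table: aligning the empty prefix of `a` against prefixes of `b`.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # column 0: prefix of `a` against nothing
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(previous[j - 1] + sub,  # diagonal move
                               previous[j] + gap,      # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current  # the older row is no longer needed
    return previous[-1]


print(global_score_two_rows("ACGT", "ACT"))
```

For sequences of a few hundred residues, as in this lab, either approach fits comfortably in memory, so any difference you observe is likely to be small.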
Dot plots are probably the oldest way of comparing sequences. A dot plot is a visual representation of the similarities between two sequences. Each axis of a rectangular array represents one of the two sequences to be compared. A window length is fixed, together with a criterion for when two sequence windows are deemed to be similar. Whenever a window in one sequence resembles a window in the other sequence, a dot or short diagonal is drawn at the corresponding position of the array. Thus, when two sequences share similarity over their entire length, a diagonal line will extend from one corner of the dot plot to the diagonally opposite corner. If two sequences only share patches of similarity, this will be revealed by diagonal stretches. [http://www.bioinfo.org.cn/lectures/index-12.html].
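The window-and-criterion idea described above can be sketched in a few lines. This is a minimal illustration, assuming the simplest possible criterion (the number of identical positions in the two windows must reach a threshold); EMBOSS programs such as dotmatcher score each window with a substitution matrix instead, so the numbers are not comparable. The function names `dot_plot` and `render` are hypothetical.

```python
def dot_plot(seq_x, seq_y, window=3, threshold=3):
    """Return the set of (i, j) window start positions deemed similar."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            # Count identical positions between the two windows.
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(dots, len_x, len_y, window):
    """Draw the plot as text: seq_x runs across, seq_y runs down."""
    rows = []
    for j in range(len_y - window + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for i in range(len_x - window + 1)))
    return "\n".join(rows)


seq = "ACGTACGT"
dots = dot_plot(seq, seq, window=4, threshold=4)
print(render(dots, len(seq), len(seq), window=4))
```

Comparing a sequence with itself always gives the full main diagonal; the extra dots off the diagonal in this toy example come from the repeated ACGT. Lowering the threshold adds more dots (and more noise), which is the trade-off you explore when varying the threshold parameter.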
- Use the dot plot programs to visually compare
  - isocitrate dehydrogenase from human and E. coli.
  - haemoglobin from two different species
  - any sequence against itself
  - the reverse of any sequence against itself (look at Lab 2 for information about how to use revseq)
  - the concatenation of two very different sequences, in different orders. E.g. concatenate a haemoglobin sequence (H) and an isocitrate dehydrogenase sequence (I) in the orders HI and IH, and see what differences there are between using local and global alignment programs to compare these hybrid (chimera) sequences. N.B. Should you concatenate the nucleotide or amino-acid versions of the sequences?
  - Try this on the
    - nucleotide sequences, and
    - amino-acid sequences.
    What is the difference?
  - Vary the threshold parameter to see if you can get a better indication of the matching regions.
The water and matcher programs calculate the local alignment between two sequences by searching for regions of local similarity; the resulting alignment need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases, or in other circumstances when you wish to find matches between small regions of sequences, for example between protein domains.
- Use these programs to align some of the sequences that you have been working with today. Are you able to detect any local alignments?
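For comparison with global alignment, the sketch below shows the core of a Smith-Waterman style local alignment, the algorithm that water is documented as using. It is a simplified illustration rather than a reimplementation of water or matcher: it uses flat match/mismatch scores and a single per-position gap penalty (the values 2, -1 and -2 are arbitrary choices), and the function name and toy strings are invented for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Scores are never allowed to drop below zero, so an alignment
            # can start afresh anywhere in either sequence.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Toy strings: a shared region embedded in unrelated flanking residues.
print(smith_waterman("PPPHELLOPPP", "GGHELLOGG"))
```

Only the shared region is reported and the unrelated flanks are ignored, which is why local methods are well suited to the chimera (HI versus IH) comparison above.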
