Bioinformatics - Lab 2
Basic sequence analysis
David Gilbert

On-line resources about bioinformatics

This exercise:

Introduces you to software of the EMBOSS suite.
Introduces you to the conceptual decoding of a DNA sequence into protein, which also illustrates a sequence comparison algorithm of low complexity.
Familiarizes you with the idea of DNA reading frames.
Introduces you to some sequence comparison and alignment techniques, and how to interpret the results.

Note

At the start of each session, type the following command at the unix prompt:
source /usr/local/lab/lab.env
This is needed in order to use all of the software in the practicals.

Alternatively, add the line
source /usr/local/lab/lab.env
to your ~/.cshrc

EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology user community. The software automatically copes with data in a variety of formats and allows transparent retrieval of sequence data from the web. EMBOSS integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages. [More at emboss.sourceforge.net/what].

The EMBOSS suite is installed on the DCS system in /usr/local/lab/packages.
Note: you will need to ensure that you have added the line
source /usr/local/bin/bioinformaticslab.env
to your ~/.cshrc in order to use some of the software in the practicals.
The EMBOSS documentation is visible at compbio.dcs.gla.ac.uk/courses/Bio4/emboss/

Find the E. coli isocitrate dehydrogenase sequence file J02799 which you retrieved last week. If you do not have the sequence, then retrieve it again using the instructions in Lab 1, or you can find it here.
Identify the nucleotide and amino-acid sequences within the file. Show that the nucleotide (DNA) sequence does indeed code for the amino-acid sequence by identifying the exact nucleotides in the sequence which correspond to each residue in the amino-acid sequence. Take care with this! You should make no assumptions regarding where the reading frame starts, i.e. the codons may not start from the first base in the nucleotide sequence, but from the second or the third!
Hints:
- The genetic code can be obtained from here or from molbio.info.nih.gov/molbio/gcode.html. Amino-acid codes are here or at e.g. www.hgu.mrc.ac.uk/Softdata/Misc/aacode.htm.
- You may need to remove the comment line, integers, white space and even line feeds from the nucleotide sequence.
- A quick and dirty way to find the start of the open reading frame [i.e. the place in the nucleotide sequence corresponding to the first amino-acid] could be to use some character search facilities within unix (e.g. `grep') or an editor (emacs, vi). We know from the genetic code that ATG=M (methionine) can act as a start codon; you may however need to search for a more specific (i.e. longer) sequence than just 'M' -- try at least the first 6 amino-acids of the sequence, if not the whole sequence. Similarly you will need to find out where the end of the sequence matches by using several amino-acids from the end of the sequence. Remember that there are 3 codons that code for `stop', and that these codons are not translated into an amino-acid. You might consider using regular expression matching. E.g. Lysine (AAA or AAG) is represented by AA[AG] as a unix regex; note that a character class takes no comma. How will you represent Leucine?
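The cleaning and searching hints above can be sketched in Python. This is only an illustration under stated assumptions: the function names (`clean_sequence`, `peptide_to_regex`, `find_coding_start`) are hypothetical, the comment and header lines are assumed to have been removed already, and the toy input is not the real sequence file. Each amino-acid is expanded into an alternation of all its codons rather than a compact character class.

```python
import re

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def clean_sequence(raw):
    """Drop digits, white space and line feeds; return upper-case DNA.

    Assumes comment/header lines were removed beforehand, since they
    contain letters that would otherwise be mistaken for bases.
    """
    return re.sub(r"[^ACGTU]", "", raw.upper()).replace("U", "T")


def peptide_to_regex(peptide):
    """Build a regex matching any DNA string that encodes the peptide."""
    parts = []
    for residue in peptide.upper():
        codons = [c for c, aa in CODON_TABLE.items() if aa == residue]
        if not codons:
            raise ValueError(f"unknown amino-acid code: {residue!r}")
        parts.append("(?:" + "|".join(codons) + ")")
    return "".join(parts)


def find_coding_start(dna, peptide):
    """Return the 1-based position of the first match, or None."""
    match = re.search(peptide_to_regex(peptide), dna)
    return match.start() + 1 if match else None


# Toy input in a database-like layout (position numbers plus blocks of ten).
raw = "1 ccatggtgca cctgactcct\n21 gaggag"
dna = clean_sequence(raw)
print(dna)
print(find_coding_start(dna, "MVHLTP"))
```

The same approach works for the end of the sequence: search with the last few amino-acids and add the length of the match to find the final coding nucleotide.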
Another problem with the translation is that the nucleotide region of interest may actually be on the opposite strand to that given in the database. How can we obtain the sequence for the complementary DNA strand? If you look at the slides from last week (e.g. slide 4) you will see that one strand is the reverse complement of the other:
```
5'   C-A-T-G-T-C-C-A   3'
3'   G-T-A-C-A-G-G-T   5'
```
Make sure that you understand what complement means in the context of molecular genetics -- it is effectively given by the base-pairing rules for DNA.
Use the revseq EMBOSS tool in order to
- reverse
- complement
- reverse and complement
the J02799 nucleotide sequence.
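To check your understanding of the three operations, here is a minimal Python sketch applied to the eight-base example strand from the diagram above. It is not how revseq itself is implemented; the function names are hypothetical and only the four unambiguous bases are handled.

```python
# Base-pairing rules: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Read the same strand backwards."""
    return seq[::-1]


def complement(seq):
    """Swap each base for its pairing partner, keeping the order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """The opposite strand, written in its own 5' to 3' direction."""
    return complement(seq)[::-1]


top = "CATGTCCA"                  # the 5'->3' strand in the diagram
print(reverse(top))               # ACCTGTAC
print(complement(top))            # GTACAGGT (the lower strand, read 3'->5')
print(reverse_complement(top))    # TGGACATG (the lower strand, read 5'->3')
```

Only the reverse complement is a biologically meaningful sequence for the other strand, because sequences are conventionally written 5' to 3'.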
How many translations are possible for one sequence? Hint: 3 possible starts for the first codon in the forward direction, and...
We will not try to do all the possible translations ourselves!
Use the transeq EMBOSS tool in order to translate the J02799 nucleic acid sequence to the possible corresponding amino-acid peptide sequences.
Note that it can translate in any one of the three forward or three reverse-sense frames, or in all three forward or all three reverse frames, or in all six frames.
Hint: use the -frame flag in transeq.
Reading frame: One of the three possible ways to read a nucleotide sequence in DNA or RNA as a series of nonoverlapping triplets, depending upon whether reading starts with the first, second or third base in the sequence. For example: with the nucleotide sequence TGCTGCTGC, the three possible reading frames are: TGC TGC TGC; and GCT GCT; and CTG CTG. (from http://www.medterms.com).
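The definition can be made concrete with a short Python sketch that splits the example sequence TGCTGCTGC into its three forward frames and then translates all six frames. The labels F1 to F3 and R1 to R3 are an assumption made for this illustration: R1 to R3 are simply frames 1 to 3 of the reverse complement, and may not correspond to the way transeq numbers its reverse frames.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reading_frames(seq):
    """Return the complete codons of forward frames 1, 2 and 3."""
    return {
        frame: [seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3)]
        for frame in (1, 2, 3)
    }


def six_frame_translation(seq):
    """Translate three frames of seq and three of its reverse complement."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    result = {}
    for label, strand in (("F", seq), ("R", rev)):
        for frame, codons in reading_frames(strand).items():
            # Codons with ambiguous bases are shown as X.
            result[f"{label}{frame}"] = "".join(
                CODON_TABLE.get(codon, "X") for codon in codons
            )
    return result


for frame, codons in reading_frames("TGCTGCTGC").items():
    print(frame, " ".join(codons))

print(six_frame_translation("TGCTGCTGC"))
```

Stop codons appear as `*` in the output, so a frame that is broken up by many `*` characters is unlikely to be the one that encodes the protein.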
The problem highlighted by the above transeq exercise is that of deciding which reading-frame to use for the translation. Fortunately in the case of J02799 there is only one of the 6 that produces a sensible ORF (Open Reading Frame).
Use the plotorf EMBOSS tool in order to produce an output that will enable you to identify the longest ORF and the associated best reading frame to use. Use this to identify the positions in the DNA sequence of the first and last nucleotides used to generate the amino-acid sequence.
Does this result concur with the result that you generated in the first exercise in this lab?
What (approximate) rules could you come up with to decide which frame gives the best translation of an ORF?
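One possible rule is "choose the frame containing the longest stretch from an ATG to the next in-frame stop codon". The Python sketch below implements that rule only; it is an assumption about what counts as an ORF, not a description of how plotorf works. The function name `longest_orf` and the toy sequence are hypothetical, and positions reported for the "-" strand are counted along the reverse complement.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))
STOPS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_orf(dna):
    """Return the longest ATG-to-stop ORF over all six frames, or None."""
    dna = dna.upper()
    best = None
    for strand, seq in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for offset in range(3):
            start = None  # 0-based index of the current candidate ATG
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None:
                    if codon == "ATG":
                        start = i
                elif codon in STOPS:
                    length = i - start  # coding bases, stop codon excluded
                    if best is None or length > best["length"]:
                        best = {
                            "strand": strand,
                            "frame": offset + 1,
                            "start": start + 1,  # 1-based first coding base
                            "end": i,            # 1-based last coding base
                            "length": length,
                            "protein": "".join(
                                CODON_TABLE.get(seq[j:j + 3], "X")
                                for j in range(start, i, 3)
                            ),
                        }
                    start = None
    return best


toy = "CCATGGTGCACCTGACTCCTGAGGAGTAAGG"
print(longest_orf(toy))
```

Real genes complicate this rule: the first ATG is not always the true start, and a long ORF on the wrong strand can arise by chance, so treat the longest ORF as a strong hint rather than proof.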
We now will have a look at retrieving sequences which are similar to a `probe' sequence of interest, in this case J02799. Why are we interested in this? - because of
- evolution -- due to mutations, similar sequences may occur in different organisms
- human experimentation - workers may have mutated a sequence and want to record this
We will use the Blast sequence comparison program in order to retrieve sequences which are similar to that of a given `probe' sequence, in this case J02799.
Run the blast search using http://www.ebi.ac.uk/blastall/ (or http://www.ncbi.nlm.nih.gov/BLAST/ if this doesn't work, but it seems faster...).
Store your results (as html pages) -- you will need them for a later lab!
Select the appropriate program and database. Use blastn on the embl database for nucleotides and blastp on swissprot for amino-acids.
Try to find out what sequences are similar to that of J02799.
The H5N1 avian (bird) flu virus is very virulent and a danger not only to birds but also to humans. One of its attributes is its high mutation rate.
Retrieve a sequence from e.g. Genbank for any one of the components of the H5N1 virus, and use Blast to search for similar sequences. What are the dates and locations of the determination of these similar sequences?

Here are three versions of a sequence of unknown origin and unknown function:

mRNA

     acauuugcuu cugacacaac uguguucacu agcaaccuca aacagacacc auggugcacc
     ugacuccuga ggagaagucu gcgguuacug cccugugggg caaggugaac guggaugaag
     uuggugguga ggcccugggc aggcugcugg uggucuaccc uuggacccag agguucuuug
     aguccuuugg ggaucugucc acuccugaug caguuauggg caacccuaag gugaaggcuc
     auggcaagaa agugcucggu gccuuuagug auggccuggc ucaccuggac aaccucaagg
     gcaccuuugc cacacugagu gagcugcacu gugacaagcu gcacguggau ccugagaacu
     ucaggcuccu gggcaacgug cuggucugug ugcuggccca ucacuuuggc aaagaauuca
     ccccaccagu gcaggcugcc uaucagaaag ugguggcugg uguggcuaau gcccuggccc
     acaaguauca cuaagcucgc uuucuugcug uccaauuucu auuaaagguu ccuuuguucc
     cuaaguccaa cuacuaaacu gggggauauu augaagggcc uugagcaucu ggauucugcc
     uaauaaaaaa cauuuauuuu cauugc

cDNA

     acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc
     tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag
     ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg
     agtcctttgg ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc
     atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg
     gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact
     tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca
     ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc
     acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc
     ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc
     taataaaaaa catttatttt cattgc

Amino-acid translation

MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFES
FGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENF
RLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
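The relationship between the three versions can be checked with a short Python sketch. It uses only the first two lines of the mRNA listing above, and it makes the simplifying assumption that translation begins at the first ATG, which is not guaranteed for every gene. The function names are hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

# First two lines of the mRNA version, copied from the listing.
MRNA_LINES = """
     acauuugcuu cugacacaac uguguucacu agcaaccuca aacagacacc auggugcacc
     ugacuccuga ggagaagucu gcgguuacug cccugugggg caaggugaac guggaugaag
"""


def mrna_to_cdna(text):
    """Strip white space, upper-case, and write U as T."""
    return "".join(text.split()).upper().replace("U", "T")


def translate_from_first_atg(cdna):
    """Return (1-based start position, peptide up to the first stop)."""
    start = cdna.find("ATG")
    if start < 0:
        raise ValueError("no ATG found")
    peptide = []
    for i in range(start, len(cdna) - 2, 3):
        amino = CODON_TABLE[cdna[i:i + 3]]
        if amino == "*":
            break
        peptide.append(amino)
    return start + 1, "".join(peptide)


cdna = mrna_to_cdna(MRNA_LINES)
position, peptide = translate_from_first_atg(cdna)
print(position, peptide)
```

The peptide printed for this fragment should agree with the beginning of the amino-acid translation given above; the bases before the ATG are not translated.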

Use each one to perform a blast search. Have a look at the hits. Make an intelligent guess about the meaning of the Score, E-value and Identities (for this last item, you will need to scroll down in your results to the detailed lines beginning with ">" symbols).

What organism is this sequence from?
What is the function of the protein that this gene encodes?
What other sequences are `quite' similar to this query sequence? What organisms are they from, and what do they do?
Do all the searches produce the `same' results?
Carefully remove the first 21 nucleotides from the nucleotide sequence and the corresponding amino-acids from the amino-acid sequence. Does this make a difference to the results?
Carefully remove some nucleotides / amino-acids from the end of the probe sequences and try to run blast again.
- Do you get the same hits? What happens to the scores and identities?
- How many nucleotides / amino-acids can you remove before you no longer hit the same sequences as before?
- What happens if you chop out nucleotides / amino-acids from the middle of the sequences?
- What would happen to the translation from nucleotides to amino-acids if you had removed
  - 1
  - 2
  - 3
  nucleotides from the middle of the nucleotide sequence?
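You can explore this last question without running blast. The Python sketch below deletes 1, 2 or 3 bases from a short coding fragment (the first ten codons after the ATG in the cDNA above) and translates the result. The deletion position, immediately after the third codon, is an arbitrary choice for illustration, and the function names are hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def translate(seq):
    """Translate complete codons from the first base; '*' marks a stop."""
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3)
    )


def delete_bases(seq, position, count):
    """Remove `count` bases starting at 0-based index `position`."""
    return seq[:position] + seq[position + count:]


cds = "ATGGTGCACCTGACTCCTGAGGAGAAGTCT"
print("original ", translate(cds))
for count in (1, 2, 3):
    print(f"delete {count} ", translate(delete_bases(cds, 9, count)))
```

Compare the three outputs with the original peptide, and note where any `*` appears: a real ribosome would stop at that point, so everything after it would be lost.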
