Bioinformatics - Lab 2
Basic sequence analysis
David Gilbert


Note

At the start of each session, type the following command at the Unix prompt:
source /usr/local/lab/lab.env
This is needed in order to use all of the software in the practicals.

Alternatively, add the line
source /usr/local/lab/lab.env
to your ~/.cshrc file.


EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology user community. The software automatically copes with data in a variety of formats and allows transparent retrieval of sequence data from the web. EMBOSS integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages. [More at emboss.sourceforge.net/what].

  1. Find the E. coli isocitrate dehydrogenase sequence file J02799 which you retrieved last week. If you do not have the sequence, then retrieve it again using the instructions in Lab 1, or you can find it here.

  2. Identify the nucleotide and amino-acid sequences within the file. Show that the nucleotide (DNA) sequence does indeed code for the amino-acid sequence by identifying the exact nucleotides in the sequence which correspond to each residue in the amino-acid sequence. Take care with this! You should make no presumptions regarding where the reading frame starts, i.e. the codons may not start from the first base in the nucleotide sequence, but from the second or the third!
    Hints:

  3. Another problem with the translation is that the nucleotide region of interest may actually be on the opposite strand from the one given in the database. How can we obtain the sequence for the complementary DNA strand? If you look at the slides from last week (e.g. slide 4) you will see that one strand is the reverse complement of the other:
    5'   C-A-T-G-T-C-C-A   3'
    3'   G-T-A-C-A-G-G-T   5'
    
    Make sure that you understand what complement means in the context of molecular genetics -- it is effectively given by the base-pairing rules for DNA.
    Use the revseq EMBOSS tool in order to reverse and complement the J02799 nucleotide sequence.
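
    The base-pairing rule can also be written out in a few lines of Python. This is only a minimal sketch to make the idea concrete, not how revseq itself is implemented: the function names are made up for this illustration, and it assumes the sequence contains only the letters A, C, G and T (no ambiguity codes).

```python
# Base-pairing rules for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


def reverse_complement(seq):
    """Return the opposite strand written in its own 5' to 3' direction."""
    return complement(seq)[::-1]


top = "CATGTCCA"  # the 5' -> 3' strand from the example above
print(complement(top))          # opposite strand read 3' -> 5'
print(reverse_complement(top))  # opposite strand read 5' -> 3'
```

    Applied to the strand shown above, complement gives the lower strand as drawn (read 3' to 5'), and reverse_complement gives that same strand read in the conventional 5' to 3' direction.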

  4. How many translations are possible for one sequence? Hint: 3 possible starts for the first codon in the forward direction, and...
    We will not try to do all the possible translations ourselves!
    Use the transeq EMBOSS tool in order to translate the J02799 nucleic acid sequence to the possible corresponding amino-acid peptide sequences.
    Note that it can translate in any one of the three forward or three reverse sense frames, or in all three forward or all three reverse frames, or in all six frames.

    Hint: use the -frame flag in transeq.

    Reading frame: One of the three possible ways to read a nucleotide sequence in DNA or RNA as a series of nonoverlapping triplets, depending upon whether reading starts with the first, second or third base in the sequence. For example: with the nucleotide sequence TGCTGCTGC, the three possible reading frames are: TGC TGC TGC; and GCT GCT; and CTG CTG. (from http://www.medterms.com).
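
    To see what the reading frames look like in practice, the short Python sketch below splits the example sequence TGCTGCTGC into codons and translates each frame on both strands. It assumes the standard genetic code and a sequence made up only of A, C, G and T. The helper names and the frame labels (+1 to +3, -1 to -3) are conventions chosen for this illustration and may not match the way transeq numbers its reverse frames.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the opposite strand written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codons(seq, offset):
    """Split seq into non-overlapping triplets starting at offset (0, 1 or 2).

    Trailing bases that do not fill a complete codon are dropped.
    """
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def translate(seq, offset):
    """Translate one reading frame into one-letter amino-acid codes."""
    return "".join(CODON_TABLE[c] for c in codons(seq, offset))


def six_frames(seq):
    """Return a dict mapping a frame label to its translation."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


example = "TGCTGCTGC"
for offset in range(3):
    print(offset + 1, " ".join(codons(example, offset)))
for label, peptide in sorted(six_frames(example).items()):
    print(label, peptide)
```

    The three forward frames printed for TGCTGCTGC are the same triplets listed in the definition above; the three reverse frames are obtained in exactly the same way from the reverse complement.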

  5. The problem highlighted by the above transeq exercise is that of deciding which reading-frame to use for the translation. Fortunately in the case of J02799 there is only one of the 6 that produces a sensible ORF (Open Reading Frame).
    Use the plotorf EMBOSS tool in order to produce an output that will enable you to identify the longest ORF and the associated best reading frame to use. Use this to identify the positions in the DNA sequence of the first and last nucleotides used to generate the amino-acid sequence.
    Does this result concur with the result that you generated in the first exercise in this lab?

    What (approximate) rules could you come up with to decide which frame gives the best translation of an ORF?
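
    As one possible starting point for such a rule, the Python sketch below scans all six frames and reports the longest stretch that begins with ATG and ends at the first in-frame stop codon. It is a deliberately simple illustration under stated assumptions (standard start and stop codons, only A, C, G and T in the sequence, hypothetical function names) and is not meant to reproduce what plotorf does. The toy input sequence is invented for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the opposite strand written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_frame(seq, offset):
    """Yield (start, end) for each ATG...stop ORF in one reading frame.

    Positions are 1-based on the strand being scanned; end is the last
    base of the stop codon. An ATG with no following stop is ignored.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield (start + 1, i + 3)
            start = None


def longest_orf(seq):
    """Return (length, strand, frame, start, end) of the longest ORF, or None."""
    seq = seq.upper()
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset):
                length = end - start + 1
                if best is None or length > best[0]:
                    best = (length, strand, offset + 1, start, end)
    return best


toy = "CCATGAAATTTGGGTAACC"  # invented sequence: ATG AAA TTT GGG TAA in frame 3
print(longest_orf(toy))
```

    Note that for an ORF found on the reverse strand the reported positions count along the reverse complement, so they would need converting before being compared with positions in the database entry.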


  6. We now will have a look at retrieving sequences which are similar to a `probe' sequence of interest, in this case J02799. Why are we interested in this? - because of

    We will use the Blast sequence comparison program in order to retrieve sequences which are similar to that of a given `probe' sequence. In this case

    Run the Blast search using http://www.ebi.ac.uk/blastall/ (or http://www.ncbi.nlm.nih.gov/BLAST/ if this doesn't work, but it seems faster...).
    Store your results (as html pages): you will need them for a later lab!
    Select the appropriate program and database. Use blastn on the embl database for nucleotides and blastp on swissprot for amino-acids.

    In this case try to find out what sequences are similar to that of J02799.

  7. The H5N1 avian (bird) flu virus is very virulent and a danger not only to birds but also to humans. One of its attributes is its high mutation rate.

    Retrieve a sequence from e.g. Genbank for any one of the components of the H5N1 virus, and use Blast to search for similar sequences. What are the dates and locations of the determination of these similar sequences?

  8. Here are three versions of a sequence of unknown origin and unknown function:

    Use each one to perform a blast search. Have a look at the hits. Make an intelligent guess about the meaning of the Score, E-value and Identities (for this last item, you will need to scroll down in your results to the detailed lines beginning with ">" symbols).