-
Find the E. coli isocitrate
dehydrogenase sequence file J02799, which you retrieved last week. If you do not have
the sequence, then retrieve it again using the instructions in
Lab 1, or you can find it
here.
-
Identify the nucleotide and amino-acid sequences within the file. Show that
the nucleotide (DNA) sequence does indeed code for the amino-acid sequence by
identifying the exact nucleotides in the sequence which correspond to each
residue in the
amino-acid sequence. Take care with this! You should make no assumptions
regarding where the reading frame starts, i.e. the codons may not
start from the first base in the nucleotide sequence, but from the second or
the third!
Hints:
-
The genetic code can be obtained from
here or from
molbio.info.nih.gov/molbio/gcode.html.
Amino-acid codes are
here or at e.g.
www.hgu.mrc.ac.uk/Softdata/Misc/aacode.htm.
-
You may need to remove the comment line, integers, white space and even line
feeds from the nucleotide sequence.
-
A quick and dirty way to find the start of the open reading frame [i.e. the
place in the nucleotide sequence corresponding to the first amino-acid] could
be to use some character search facilities within unix (e.g. `grep') or an
editor (emacs, vi). We know from the genetic code that ATG=M (methionine) can act
as a start codon; you may however need to search for a more specific (i.e.
longer) sequence than just 'M' -- try at least the first 6 amino-acids of the
sequence, if not the whole sequence. Similarly you will need to find out
where the end of the sequence matches by using several amino-acids from the
end of the sequence. Remember that there are 3 codons that code for `stop',
and that a stop codon is not translated into an amino-acid. You might consider using regular expression matching.
E.g. Lysine (AAA or AAG) is represented by AA[AG] as a unix regex. How will you represent
Leucine?
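The cleaning and searching hints above can also be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the expected solution: the helper names (clean_sequence, peptide_to_regex, find_peptide) are made up for this sketch, the comment line is assumed to be a FASTA-style line starting with ">", and the standard genetic code is assumed.

```python
import re
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def clean_sequence(raw):
    """Drop '>' comment lines (an assumption), then keep only A, C, G, T."""
    lines = [ln for ln in raw.splitlines() if not ln.startswith(">")]
    return re.sub(r"[^ACGT]", "", "".join(lines).upper())


def peptide_to_regex(peptide):
    """Build a regex matching every DNA string that encodes the peptide."""
    parts = []
    for aa in peptide:
        codons = sorted(c for c, a in CODON_TABLE.items() if a == aa)
        parts.append("(?:" + "|".join(codons) + ")")
    return "".join(parts)


def find_peptide(dna, peptide):
    """Return the 1-based position of the first base encoding the peptide."""
    match = re.search(peptide_to_regex(peptide), dna)
    return match.start() + 1 if match else None


raw = ">demo fragment\n1 acagacacc atggtgcacc tgactcct 27\n"
dna = clean_sequence(raw)
print(dna)                          # ACAGACACCATGGTGCACCTGACTCCT
print(find_peptide(dna, "MVHLTP"))  # 10
```

Searching with the first six residues rather than a single 'M' is what makes the match specific; the same function can be called with the last few residues to locate the end of the coding region.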
-
Another problem with the translation is that the nucleotide region of interest
may actually be on the opposite strand to the one given in the database. How can we
obtain the sequence for the complementary DNA strand? If you look at the
slides from last week (e.g. slide 4) you will see that one strand is the
reverse and complement of the other:
5' C-A-T-G-T-C-C-A 3'
3' G-T-A-C-A-G-G-T 5'
Make sure that you understand what complement means in the context
of molecular genetics
-- it is effectively given by the base-pairing rules for DNA.
Use the revseq
EMBOSS tool in order to
- reverse
- complement
- reverse and complement
the J02799 nucleotide sequence.
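To check your understanding of the three operations before running revseq, here is a minimal Python sketch applied to the 8-base example strand shown above. The function names are invented for this illustration, and this is not how revseq itself is implemented.

```python
# Base-pairing rules for DNA: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Read the same strand backwards."""
    return seq[::-1]


def complement(seq):
    """Swap each base for its pairing partner, keeping the order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Give the opposite strand, written in its own 5' to 3' direction."""
    return complement(seq)[::-1]


top = "CATGTCCA"
print(complement(top))          # GTACAGGT
print(reverse(top))             # ACCTGTAC
print(reverse_complement(top))  # TGGACATG
```

Note that the plain complement, GTACAGGT, is the lower strand of the diagram read from its 3' end; only the reverse complement gives that strand in the conventional 5' to 3' direction.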
-
How many translations are possible for one sequence? Hint: 3
possible starts for the first codon in the forward direction, and...
We will not try to do all the possible translations ourselves!
Use the transeq EMBOSS tool in order to translate the J02799
nucleic acid sequence to the possible corresponding amino-acid peptide sequences.
Note that
it can translate in any one of the three forward or three reverse sense frames, or in
all three forward or all three reverse frames, or in all six frames.
Hint: use the -frame flag in transeq.
Reading frame: One of the three possible ways to read a nucleotide sequence in
DNA or RNA as a series of nonoverlapping triplets, depending upon whether
reading starts with the first, second or third base in the sequence. For
example: with the nucleotide sequence TGCTGCTGC, the three possible reading
frames are: TGC TGC TGC; and GCT GCT; and CTG CTG.
(from http://www.medterms.com).
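The reading-frame definition can be tried out on the TGCTGCTGC example with a short Python sketch. It assumes the standard genetic code; the function names are made up for this illustration, and the negative frame numbers here are simply offsets into the reverse complement, which may not match the numbering convention that transeq uses.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, offset=0):
    """Translate from a 0-based offset, ignoring a trailing partial codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    # 'X' stands in for any codon containing a non-ACGT character.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return {frame: peptide} for frames 1, 2, 3 and -1, -2, -3."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for k in range(3):
        frames[k + 1] = translate(dna, k)
        frames[-(k + 1)] = translate(rev, k)
    return frames


for frame, peptide in sorted(six_frames("TGCTGCTGC").items()):
    print(frame, peptide)
```

For this input the forward frames give CCC, AA and LL, matching the triplets TGC TGC TGC, GCT GCT and CTG CTG listed in the definition.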
-
The problem highlighted by the above transeq exercise is that of deciding
which reading-frame to use for the translation. Fortunately in the case of
J02799 there is only
one of the 6 that produces a sensible ORF (Open Reading Frame).
Use the plotorf
EMBOSS tool in order to produce an output that will enable you to identify the
longest ORF and the associated best reading frame to use. Use this to
identify the positions in the DNA sequence of the first and last
nucleotides used to generate the amino-acid sequence.
Does this result concur with the result that you generated in the first
exercise in this lab?
What (approximate) rules could you come up with to decide which frame gives
the best translation of an ORF?
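One candidate rule is "take the longest stretch that runs from an ATG to the next in-frame stop codon". The Python sketch below implements just that rule so that you can compare it with what plotorf shows. It is a simplification under stated assumptions (standard genetic code, ATG as the only start codon), the function name is invented here, and it is not plotorf's algorithm.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_orf(dna):
    """Return (length_nt, strand, start, end) of the longest ATG...stop ORF.

    start and end are 1-based positions on the strand that was scanned
    ('-' means the reverse complement); end is the last base of the stop codon.
    """
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    best = (0, None, None, None)
    for strand, seq in (("+", dna), ("-", rev)):
        for offset in range(3):
            start = None  # index of the first ATG since the last stop
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    length = i + 3 - start
                    if length > best[0]:
                        best = (length, strand, start + 1, i + 3)
                    start = None
    return best


print(longest_orf("CCATGAAATTTTAGGG"))  # (12, '+', 3, 14)
```

In the toy input the ORF ATG AAA TTT TAG sits in the third forward frame, so the sketch reports bases 3 to 14 on the given strand.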
-
We now will have a look at retrieving sequences which are similar to a `probe'
sequence of interest, in this case J02799.
Why are we interested in this? - because of
- evolution -- due to mutations, similar sequences may occur in different organisms
- human experimentation - workers may have mutated a sequence and want to
record this
We will use the
Blast sequence comparison
program in order to retrieve sequences which are similar to
a given `probe' sequence. In this case, run the
blast search using:
http://www.ebi.ac.uk/blastall/
(or
http://www.ncbi.nlm.nih.gov/BLAST/
if this doesn't work, but it seems faster...).
Store your results (as html pages) -- you will need them for a later
lab!
Select the appropriate
program and database. Use blastn on the embl
database for nucleotides
and blastp on swissprot for amino-acids.
In this case try to find out what sequences are similar to that of
J02799.
-
The H5N1 avian (bird) flu virus is very virulent and a danger not only to birds but
also to humans. One of its attributes is its high mutation rate.
Retrieve a sequence from e.g. Genbank for any one of the components of the H5N1
virus, and use Blast to search for similar sequences. What are the dates and
locations of the determination of these similar sequences?
-
Here are three versions of a sequence of unknown origin and unknown function:
- mRNA
acauuugcuu cugacacaac uguguucacu agcaaccuca aacagacacc auggugcacc
ugacuccuga ggagaagucu gcgguuacug cccugugggg caaggugaac guggaugaag
uuggugguga ggcccugggc aggcugcugg uggucuaccc uuggacccag agguucuuug
aguccuuugg ggaucugucc acuccugaug caguuauggg caacccuaag gugaaggcuc
auggcaagaa agugcucggu gccuuuagug auggccuggc ucaccuggac aaccucaagg
gcaccuuugc cacacugagu gagcugcacu gugacaagcu gcacguggau ccugagaacu
ucaggcuccu gggcaacgug cuggucugug ugcuggccca ucacuuuggc aaagaauuca
ccccaccagu gcaggcugcc uaucagaaag ugguggcugg uguggcuaau gcccuggccc
acaaguauca cuaagcucgc uuucuugcug uccaauuucu auuaaagguu ccuuuguucc
cuaaguccaa cuacuaaacu gggggauauu augaagggcc uugagcaucu ggauucugcc
uaauaaaaaa cauuuauuuu cauugc
- cDNA
acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc
tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag
ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg
agtcctttgg ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc
atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg
gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact
tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca
ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc
acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc
ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc
taataaaaaa catttatttt cattgc
- Amino-acid translation
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFES
FGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENF
RLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Use each one to perform a blast search.
Have a look at the hits. Make an intelligent guess about the meaning of
the Score, E-value and Identities (for this last
item, you will need
to scroll down in your results to the detailed lines beginning with
">" symbols).
- What organism is this sequence from?
- What is the function of the protein that this gene encodes?
- What other sequences are `quite' similar to this query sequence? What
organisms are they from, and what do they do?
- Do all the searches produce the `same' results?
- Carefully remove the first 21 nucleotides from the nucleotide
sequence and the corresponding amino-acids from the amino-acid sequence. Does this make a difference to the results?
-
Carefully remove some nucleotides / amino-acids from the end of the
probe sequences and try to run blast again.
- Do you get the same hits? What happens to the scores and identities?
- How many nucleotides / amino-acids can you remove before you no longer hit
the same sequences as before?
- What happens if you chop out nucleotides / amino-acids from the
middle of the sequences?
- What would happen to the translation from nucleotides to amino-acids if
you had removed
nucleotides from the middle of the nucleotide sequence?
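Once you have made your own prediction for the last question, you can test it with a small Python sketch. It uses the first 30 coding bases of the cDNA given above (starting at the atg) and assumes the standard genetic code; the translate helper is written just for this illustration.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first base, ignoring a trailing partial codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


fragment = "atggtgcacctgactcctgaggagaagtct"  # codons 1-10 of the cDNA above
one_removed = fragment[:9] + fragment[10:]    # delete one base of codon 4
three_removed = fragment[:9] + fragment[12:]  # delete the whole of codon 4

print(translate(fragment))       # MVHLTPEEKS
print(translate(one_removed))    # MVH*LLRRS
print(translate(three_removed))  # MVHTPEEKS
```

Compare the two deletions: removing a whole codon drops one residue and leaves the rest intact, whereas removing a single base changes every codon downstream of the cut.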