Bioinformatics - Lab 1
Databases, genetic information and molecules
David Gilbert

Notes

You will need to ensure that you have added the line
source /usr/local/lab/lab.env
to your ~/.cshrc in order to use all of the software in the practicals.
Alternatively, you can type the command at the unix prompt when starting each session:-
source /usr/local/lab/lab.env

You may wish to have a look at some on-line resources about bioinformatics.

Aim

The aim of this introductory tutorial is to introduce you to the two most important bioinformatics databases available on the Web: the GenBank database of gene and genome sequences, and the Protein Data Bank (PDB), a database of biological macromolecular structures.

During this exercise you will learn how to

search for a genomic sequence, given the biochemical name of a particular protein,
retrieve scientific papers about that protein,
retrieve a file describing the structure of the protein, and
explore the structure in a 3-D viewer.

This lab exercise was originally designed by David Leader.

The Entrez Facility at NCBI

The US National Center for Biotechnology Information (NCBI) hosts the GenBank database of gene and genome sequences. Its web site links these to other databases and resources, including the National Library of Medicine, and provides a suite of powerful programs for searching GenBank (BLAST). The site is professionally designed and maintained, access is free, and the staff even respond to technical queries. It is the best place to start looking at bioinformatics data.

⇒ Open your web browser and connect to the url: http://www.ncbi.nlm.nih.gov
Although we are primarily using only one facility on the site (and could have accessed it directly) it is worthwhile taking a moment to look at the home page to get an idea of the scope of the site.

⇒ Click on Entrez
You will be taken to a recently redesigned page for Entrez (French for 'enter!', pronounced 'awntray') from which you can search across several databases at the same time.

For this exercise we are interested in an enzyme (a protein catalyst) on which a colleague in IBLS used to work. Its name is isocitrate dehydrogenase (so called because it catalyses the oxidation - dehydrogenation - of a small molecule, isocitrate, related to the acid found in citrus fruits) and it has a role in generating chemical energy from foodstuffs. We are going to search for the gene for the enzyme from the gut bacterium, Escherichia coli. We shall compare the bacterial isocitrate dehydrogenase with the corresponding mammalian enzyme in a later session.

⇒ Type isocitrate dehydrogenase into the search field and press Go. This should return hits recorded for several of the databases. We are interested in the Nucleotide (DNA and RNA), Genome (whole chromosomes of an organism) and Structure (three-dimensional structures of proteins) databases, so the hits suggest that we are in business.

Getting the Nucleotide Sequence

⇒ Click on the Nucleotide hits: You will be taken to the first of over a hundred pages of search results. Clearly we need to trim this down to manageable proportions. The first thing to do is:

⇒ Click on the word Limits. You will be taken to a page where you can select from a range of options to limit your search.

⇒ Choose Title from the pull-down menu and press Go. You will be returned to a results page to find that the total number of hits has been reduced (somewhat). However, we wish to confine our search to sequences from Escherichia coli (abbreviated E. coli). Therefore:

⇒ Add "and E coli" by typing this in the search field after "isocitrate dehydrogenase"

⇒ Click Go
You will be returned to a results page with only two hits, the second of which is the one we want.

⇒ Click on the accession number, J02799.
You will be transferred to a web page in standard GenBank format, with documentation (including links on the web page) followed by the DNA sequence. We shall return to the links presently, but to ensure that the file is usable in programs once we have downloaded it, we must be careful that it contains only relevant text. The recommended way of doing this is as follows.

⇒ Select the Text option in the pull-down menu which is headed by 'Send To'. This will generate a page in plain text format, without links.
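
The search and plain-text retrieval steps above can also be scripted. The following is only a minimal sketch, assuming NCBI's E-utilities web service (which the exercise itself does not use); the helper names esearch_url and efetch_genbank_url are made up for illustration. It builds the request URLs without contacting the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (an assumption: the lab itself
# uses the Entrez web pages, not this programmatic interface).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore"):
    """Build a search URL; 'nuccore' is the nucleotide database."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def efetch_genbank_url(accession, db="nuccore"):
    """Build a URL that asks for one record as a plain-text GenBank file."""
    params = {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


# The same restriction as in the exercise: title words plus the organism.
search = esearch_url("isocitrate dehydrogenase[Title] AND Escherichia coli[Organism]")
record = efetch_genbank_url("J02799")
print(search)
print(record)
```

Pasting the second URL into a browser should give plain text comparable to the 'Send To' Text page, although the exact layout may differ.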

⇒ Select Save from the browser File menu and save under a name such as J02799.gbk
One point to note is that, although the file contains only ASCII text characters, it is in Unix format. This is because the ASCII character used to indicate the end of a line differs on the Unix, PC and Mac platforms (line feed on Unix, carriage return followed by line feed on the PC, and carriage return alone on older Macs). As we shall use the file on Unix it is safer not to open it in Windows and risk introducing PC line endings. Instead view it on the web page and note the following.
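
If you are unsure which line endings a saved file contains, a short script can check. This is a sketch only; the function names detect_line_endings and to_unix are made up for illustration, and the file is read as raw bytes so that nothing is converted on the way in.

```python
def detect_line_endings(data: bytes) -> str:
    """Report which end-of-line convention a block of bytes uses."""
    crlf = data.count(b"\r\n")            # PC: carriage return + line feed
    cr = data.count(b"\r") - crlf         # older Mac: carriage return alone
    lf = data.count(b"\n") - crlf         # Unix: line feed alone
    counts = {"windows (CRLF)": crlf, "unix (LF)": lf, "classic mac (CR)": cr}
    present = [name for name, n in counts.items() if n > 0]
    if not present:
        return "none"
    if len(present) > 1:
        return "mixed"
    return present[0]


def to_unix(data: bytes) -> bytes:
    """Convert PC and older Mac line endings to Unix line feeds."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(detect_line_endings(b"LOCUS\r\nORIGIN\r\n"))
print(to_unix(b"LOCUS\r\nORIGIN\r\n"))
```

For a real file you would pass in the result of open("J02799.gbk", "rb").read().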

⇒ The first line indicates that the length of the sequence is 1568 bp (base pairs) and the sequence is of DNA. The sequence itself is displayed after the line starting with the word ORIGIN, a feature of all GenBank files which you can rely on for parsing them.
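
As an illustration of relying on the LOCUS and ORIGIN lines when parsing, here is a minimal sketch. The record in the example is made up and much shorter than J02799, and the function name parse_genbank is hypothetical; a real parser would need to handle many more fields.

```python
def parse_genbank(text):
    """Return (length, molecule type, sequence) from GenBank flat-file text."""
    length = None
    molecule = None
    parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            bp = fields.index("bp")          # the length precedes the word 'bp'
            length = int(fields[bp - 1])
            molecule = fields[bp + 1]        # e.g. DNA
        elif line.startswith("ORIGIN"):
            in_origin = True                 # sequence lines follow
        elif line.startswith("//"):
            in_origin = False                # end of record
        elif in_origin:
            # Drop the position numbers and spaces, keep only the bases.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return length, molecule, "".join(parts)


# A made-up record for illustration only, not real data.
example = """LOCUS       DEMO0001                  24 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ORIGIN
        1 atggctagct agctagctag ctaa
//
"""
length, molecule, sequence = parse_genbank(example)
print(length, molecule, sequence)
```

Checking that len(sequence) equals the length given on the LOCUS line is a simple way to confirm that the file was downloaded intact.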

⇒ If you look in the indented headings following FEATURES you will see CDS. This stands for coding sequence, and indicates the part of the gene that is decoded into protein: bases 291 to 1541. The fact that not all the genetic information specifies the amino acids that constitute proteins will be discussed in the lectures. It has practical implications in analysing DNA sequences with bioinformatics tools, as we shall do in the next session.
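
GenBank positions count from 1 and include both end points, whereas many programming languages count from 0, so extracting the CDS needs a little care. The sketch below uses a placeholder string of the right length rather than the real sequence; the function name extract_region is made up for illustration.

```python
def extract_region(sequence, start, end):
    """Return bases start..end, counting from 1 and including both ends."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region lies outside the sequence")
    return sequence[start - 1:end]


cds_start, cds_end = 291, 1541
cds_length = cds_end - cds_start + 1      # 1251 bases
codons = cds_length // 3                  # 417 triplets, including the stop

# Placeholder of 1568 characters standing in for the real J02799 sequence.
placeholder = "n" * 1568
cds = extract_region(placeholder, cds_start, cds_end)
print(len(cds), codons)
```

The CDS length is an exact multiple of three, as expected for a region read in triplets.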

Getting the Scientific Paper

Experimental scientists will often be interested in the published work related to sequences that they download from databases. The NCBI web site provides links to abstracts of papers in the National Library of Medicine (NLM), and often to pdf versions of the originals. The NLM, of course, also contains abstracts of papers not describing nucleic acid and protein sequences, and these would be accessed through the NCBI PubMed facility.

Use your browser back button to return to the version of the sequence with html links.

⇒ Click on the PUBMED reference number (3112144). This will bring you to a page with a link to the original paper, which you can follow to the journal's web site, where the paper can be downloaded as a pdf file.

Getting the Structural Data for the Protein

The GenBank file, J02799, represents the isocitrate dehydrogenase (ICDH) protein as a string of characters. However, proteins are biological molecules, and it will be instructive to explore this aspect briefly in this exercise. The three-dimensional structures of many proteins have been determined experimentally, and the data deposited in flat text files specifying the x, y and z co-ordinates of their atoms in space. The database holding these files is the Protein Data Bank (formerly at Brookhaven), and the file format (which can be read by most molecular graphics programs) is called the PDB format. NCBI now holds a subset of the PDB in its MMDB (Molecular Modelling Data Base) but also has links to the Protein Data Bank.
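
To make the idea of a flat text file of co-ordinates concrete, the following sketch reads one atom line using the fixed column positions of the PDB format. The sample line is made up (it is not taken from 5ICD) and the function name parse_atom_record is hypothetical.

```python
def parse_atom_record(line):
    """Parse one ATOM or HETATM line using fixed PDB column positions."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "record": line[0:6].strip(),
        "atom": line[12:16].strip(),          # columns 13-16
        "residue": line[17:20].strip(),       # columns 18-20
        "chain": line[21],                    # column 22
        "residue_number": int(line[22:26]),   # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
    }


# A made-up atom line, assembled piece by piece to show the column layout.
sample = (
    "ATOM  "                  # columns 1-6: record type
    + f"{1:>5}" + " "         # columns 7-11: atom serial number; 12 blank
    + " N  " + " "            # columns 13-16: atom name; 17 blank
    + "MET" + " " + "A"       # columns 18-20: residue; 21 blank; 22 chain
    + f"{1:>4}" + "    "      # columns 23-26: residue number; 27-30 blank
    + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"  # columns 31-54: x, y, z
)
atom = parse_atom_record(sample)
print(atom["atom"], atom["residue"], atom["x"], atom["y"], atom["z"])
```

Because the fields are defined by column position rather than by spaces, splitting lines on whitespace is not reliable for this format.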

Return to the main Entrez page where you may need to repeat the cross-database search for isocitrate dehydrogenase as the page is generated on the fly and is only transitory.

⇒ Click on the Structure icon. This will bring you to two pages of ICDH structures. We shall choose one in which the protein contains bound isocitrate - the reacting molecule (substrate) for the catalysis.

⇒ On one of the pages (probably the first) click on 5ICD. This will bring you to the entry for this structure on the MMDB page. Using the RasMol Chime plug-in you could view the protein here. However, we shall download the file.

⇒ Click on PDB: 5ICD (above). This will bring us to a structure summary page; click on 'PDB' to go to the page for this structure on the PDB site at http://www.rcsb.org/pdb/

⇒ Click on the Download File icon next to the '5icd' name.
(Alternatively, select the 'Structure' tab and then from the download options select 'PDB format' and 'no compression', i.e. not '.gz'.) Save the file to disc as 5ICD.pdb (use the right mouse button, since clicking will display the file using the RasMol plug-in). Keep it somewhere in your filespace for subsequent examination in RasMol.

Using RasMol to view the protein structure

As already remarked, GenBank files represent proteins as linear strings from a 20-character alphabet. Although these strings embody information, like DNA, this information is expressed through the three-dimensional structure of the protein molecule and the properties this possesses. We shall use the free cross-platform application program, RasMol, to illustrate this.

⇒ Start RasMol by typing rasmol at the unix prompt.
You will be presented with a black graphic screen. Although RasMol is a GUI application, much of its power is through commands typed into a text window, which is initially minimized.

Have a look at the RasMol Reference Card. (If you are really interested, there is an on-line manual but you are unlikely to need this...). If you like RasMol and want to use it on your own machine, you can obtain it for Linux and Windows from e.g. http://www.bernstein-plus-sons.com/software/rasmol/
Alternatively, you can use the 'fink' program:-
fink install rasmol

⇒ Maximize the command-line window.

⇒ Optionally type "set background white" in the command window.

⇒ Load the file, 5ICD.pdb, after selecting Open in the File menu. The protein will appear as the wire-frame representation. This is useful for a biologist to study individual amino acids, but we cannot relate it easily to our linear string.

⇒ Type "restrict protein"

⇒ Select Backbone from the Display menu. The protein appears with the start visible at the bottom. However the end is difficult to locate.

⇒ Select Group from the Colours menu. You can now trace the chain along the colour gradient from blue to red, rotating the molecule as necessary.

⇒ Select Spacefill from the Display menu. You can now see how the protein is a solid object, rather than full of space. (This view makes it difficult to see inside the protein, though.)

⇒ You can
rotate the view of the protein by using the left mouse button or the right mouse button + shift key.
move the view using the right mouse button
zoom using the shift key + left mouse button

Now let us look at isocitrate, the reacting molecule (substrate) bound to the protein.

⇒ Select Wireframe from the Display menu.

⇒ Select CPK from the Colours menu.

⇒ Type select ligand.

⇒ Select Spacefill from the Display menu.

You can now see the isocitrate, apparently inside the protein.

⇒ Type colour blue

⇒ Type select protein

⇒ Select Spacefill from the Display menu. By rotating the molecule you should be able to see that the (blue) isocitrate is sitting in a pocket in the protein. This is where the catalysis occurs.
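
The distinction RasMol draws between the protein and the bound ligand is visible in the PDB file itself. The sketch below is a rough approximation, assuming that protein atoms appear on ATOM lines and that bound molecules and water appear on HETATM lines, with water named HOH. It is not RasMol's own definition of these sets; the function names and the demonstration lines are made up.

```python
def classify_atoms(pdb_lines):
    """Count atoms as protein, ligand or water from the record type."""
    counts = {"protein": 0, "ligand": 0, "water": 0}
    for line in pdb_lines:
        record = line[0:6].strip()
        residue = line[17:20].strip()     # residue name, columns 18-20
        if record == "ATOM":
            counts["protein"] += 1
        elif record == "HETATM":
            counts["water" if residue == "HOH" else "ligand"] += 1
    return counts


def demo_line(record, residue):
    """Build a made-up, truncated line with the residue name in place."""
    return f"{record:<6}" + " " * 11 + residue


lines = [
    demo_line("ATOM", "MET"),
    demo_line("ATOM", "GLY"),
    demo_line("HETATM", "LIG"),   # 'LIG' is a placeholder residue name
    demo_line("HETATM", "HOH"),
    "END",
]
counts = classify_atoms(lines)
print(counts)
```

To try it on the downloaded structure, pass an open file instead of the list, for example classify_atoms(open("5ICD.pdb")).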

There is one final point to make, of relevance to the sequence comparisons we shall be performing in the following weeks. Changes to the protein in the region in which the isocitrate binds (resulting from mutations in the DNA) are likely to alter the way the molecule binds, perhaps preventing it from doing so. However, changes in other parts of the protein may have little or no effect. We shall see this reflected in conserved and non-conserved regions found when comparing proteins with similar functions in different organisms.
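
As a small preview of what conserved and non-conserved positions look like, the sketch below marks the positions at which two aligned sequences agree. The two sequences are made up for illustration and are not taken from any isocitrate dehydrogenase; the function name conserved_positions is hypothetical.

```python
def conserved_positions(seq_a, seq_b):
    """Mark identical positions in two aligned sequences with '*'."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "*" if a == b and a != "-" else " "   # '-' is an alignment gap
        for a, b in zip(seq_a, seq_b)
    )


# Two made-up aligned protein fragments.
first = "MKTAY"
second = "MRTAF"
marks = conserved_positions(first, second)
print(first)
print(second)
print(marks)
```

Runs of '*' correspond to conserved regions; the real comparisons in later sessions use alignment programs rather than a position-by-position check like this.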
