Bioinformatics, Coursework
D. Gilbert

Issue date: 31/01/2008
Due date: 06/03/2008 -- hard copy to be deposited in the locked box in the Department by 5pm.

THIS COURSEWORK COUNTS FOR 30% OF THE TOTAL MARKS FOR THIS MODULE.

Given this data file in the form of
'>' unique_6_character_identfier SPACE Group_id SPACE string

1. Give a dynamic programming algorithm to compute the comparison score and alignments for two strings over an alphabet of {e,E,h,H} .

2. Implement a program based on your algorithm in any programming language of your choice and give runs for some pairs of strings drawn from the data file.

Note that
(i) uppercase and lowercase letters are distinct.
(ii) these strings can be entirely 'inverted': For example hHeHhE matches HhEhHe
You can use the following parameters: Gap score (penalty) = -1, match score = 1, mismatch penalty = -1

3. Design and implement an algorithm to compute the multiple alignment(s) for a set of n strings based on your pairwise algorithm in (1) above. Give runs of your program with input strings from some of the groups in the data file. The output should be one (or more) patterns and a rating of each pattern in terms of compression. You are permitted to discover just one pattern for a set of examples but bonus points will be given if your program can discover more than one best pattern if these exist.

Evaluate the pattern(s) that your program produces on at least one group from the data file. You may chose to evaluate your patterns using one or more of: compression, sensitivity, specificity, positive predictive value, correlation coefficient or the F-measure, which are defined for example here.

4. M-level students and optionally H level students: Optimise your parameters for your pairwise alignment program in (2) above. You can use as a target the Group_id in the data file; strings from the same group should be more similar to each other than strings in other groups.
Hints:
a. You may use pairwise alignments to generate scores and then sort hits with the intention that members of the correct group have the highest scores, or multiple alignments to generate motifs which should be discriminatory over groups. Note that you can decrease specificity by considering only a prefix of the Group_id.
b. It is thought that E and e are more similar to each other than to H or h, and that H and h are much more similar to each other than to E or e. Can you take this into account in the scoring table to improve the discriminatory power of your algorithms?


Notes:

Your submission should comprise
(1) a hard-copy submission indicating your level (H or M) and including

H-level students

M-level students

(2) soft-copy, submitted by email to drg@dcs.gla.ac.uk [2 mark: code correctly executes] PENALTY FOR NO ELECTRONIC SUBMISSION = -6 marks