THIS COURSEWORK COUNTS FOR 30% OF THE TOTAL MARKS FOR THIS MODULE.
Given
this
data file in the form of
'>' unique_6_character_identfier SPACE Group_id SPACE string
1. Give a dynamic programming algorithm to compute the comparison score and alignments for two strings over an alphabet of {e,E,h,H} .
2. Implement a program based on your algorithm in any programming language of your choice and give runs for some pairs of strings drawn from the data file.
Note that
(i)
uppercase and lowercase letters are distinct.
(ii) these strings can be entirely 'inverted':
For example hHeHhE matches HhEhHe
You can use the following
parameters:
Gap score (penalty) = -1, match score = 1, mismatch penalty = -1
3. Design and implement an algorithm to compute the multiple alignment(s) for a set of n strings based on your pairwise algorithm in (1) above. Give runs of your program with input strings from some of the groups in the data file. The output should be one (or more) patterns and a rating of each pattern in terms of compression. You are permitted to discover just one pattern for a set of examples but bonus points will be given if your program can discover more than one best pattern if these exist.
Evaluate the pattern(s) that your program produces on at least one group from the data file. You may chose to evaluate your patterns using one or more of: compression, sensitivity, specificity, positive predictive value, correlation coefficient or the F-measure, which are defined for example here.
4. M-level students and optionally H level students:
Optimise your parameters for your pairwise alignment program in (2) above.
You can use as a target the Group_id in the data file; strings from the
same group should be more similar to each other than strings in other
groups.
Hints:
a. You may use pairwise alignments to generate scores and then sort hits with
the intention that members of the correct group have the highest scores, or multiple alignments
to generate motifs which should be discriminatory over groups. Note that you
can decrease specificity by considering only a prefix of the Group_id.
b. It is thought that E and e are more similar to each other than to H or h,
and that H and h are much more similar to each other than to E or e. Can you
take this into account in the scoring table to improve the discriminatory
power of your algorithms?
Notes:
Your submission should comprise
(1) a hard-copy submission indicating your level (H or M) and
including
H-level students
M-level students