Download this notebook and write your solutions inside it.
Add the notebook with your solutions to your allocated GitHub repo. (Remember to commit and push.)
To solve the tasks, finish the code skeletons provided. The code must be runnable to get full points. If you're having trouble, try to explain in your own words what you were trying to do. Proper understanding is half of the points.
Each task is independent but subtasks are linked.
To pass the homework, you need 3 / 6 points.
Remember to run the tests for each function you'll be completing. And feel free to add some of your own. By testing code in small modular parts (e.g. functions), you can trust these parts in "downstream" (more complex) applications.
The terms "cost", "distance", and "penalty" are used interchangeably since in this context they represent equivalent ways of looking at the same concepts. It also helps illustrate that these algorithms are quite general and can be applied to many different problems.
Always restart the notebook (Kernel > Restart Kernel...) and rerun the notebook from the top to double-check (i.e. reproduce) your solutions.
Remember that python starts indexing vectors at position 0
. Other languages in which these algorithms are implemented such as MatLab and R start at 1
.
Complete the Needleman–Wunsch implementation.
Fill in the "blanks" in the code below (replace the # WRITE CONDITION HERE) with the appropriate logical checks such that this distance function gives the correct distance between two letters.
Remember to run the code chunk so the notebook starts using your function.
# imports
import numpy as np
from itertools import permutations
def dist_global(char1, char2):
#############################
# Function that computes the distance between two characters in a sequence,
# used for global alignment.
#
# Input arguments:
# char1, char2 = Values of 2 characters, either nucleobases or gaps,
# as 1-character strings.
#
# E.g. char1 = 'A', char2 = 'G'
# char1 = 'A', char2 = '-'
#
# Returns:
# The distance (or cost of change) between char1 and char2,
# using the DIST_... values
#############################
DIST_MATCH = 0
DIST_GAP = 8
DIST_TRANSITION = 2
DIST_TRANSVERSION = 4
if (char1 == char2):
return DIST_MATCH
if ():# WRITE CONDITION HERE
return DIST_GAP
if ():# WRITE CONDITION HERE
return DIST_TRANSITION
if ():# WRITE CONDITION HERE
return DIST_TRANSITION
else:
return DIST_TRANSVERSION
Check your code with the test below, assuming you use the same values as in the lecture. If you want to experiment with different distance values, then naturally the test will have to be adapted.
print(dist_global('A', 'G') == 2)
print(dist_global('A', '-') == 8)
Here you are provided with skeleton code to compute the global alignment between two sequences using the Needleman–Wunsch algorithm algorithm. The same notation is used as in the course.
def global_alignment_matrix(seq1, seq2):
##############################################################
# Computes the global alignment matrix D between
# sequences seq1 and seq2
#
# Input:
# seq1, seq2 = nucleobase sequences as strings
#
# E.g. seq1 = 'TATACCTGAAGGGCCT', seq2 = 'TATACGAGACCGTTT'
#
# Returns:
# D = global alignment matrix
###############################################################
# Convert to character vectors to easily work letter by letter
# E.g. seq1 = "ABC" becomes ["A", "B", "C"] and seq1[1] == "B"
seq1 = list(seq1)
seq2 = list(seq2)
D = np.zeros([len(seq1)+1, len(seq2)+1])
for i in range(1, len(seq1) + 1):
D[i][0] = ### WRITE CODE HERE ###
for j in range(1, len(seq2) + 1):
D[0][j] = ### WRITE CODE HERE ###
for i in range(1, len(seq1) + 1):
for j in range(1, len(seq2) + 1):
dist_from_upper_left = dist_global(seq1[i - 1], seq2[j - 1])
dist_from_above = dist_global(seq1[i - 1], "-")
dist_from_left = dist_global("-", seq2[j - 1])
D[i][j] = min(
D[i - 1][j - 1] + ### WRITE CODE HERE ###,
D[i - 1][j] + ### WRITE CODE HERE ###,
D[i][j - 1] + ### WRITE CODE HERE ###
)
return D
# This one's complete. Just a convenience function to fetch the overall value.
def global_alignment_value(global_alignment_matrix):
# Input:
# global_alignment_matrix = a computed global alignment matrix
#
# Returns:
# The lower-right corner of the matrix, since we know that holds
# the global alignment value
return(global_alignment_matrix[-1,-1])
Check your code with these. The last 2 lines compare expected and actual values.
seq1 = "TACGTCAGC"
seq2 = "TATGTCATGC"
expected_distance = 10
expected_dist_matrix = np.array([
[ 0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80],
[ 8, 0, 8, 16, 24, 32, 40, 48, 56, 64, 72],
[16, 8, 0, 8, 16, 24, 32, 40, 48, 56, 64],
[24, 16, 8, 2, 10, 18, 24, 32, 40, 48, 56],
[32, 24, 16, 10, 2, 10, 18, 26, 34, 40, 48],
[40, 32, 24, 16, 10, 2, 10, 18, 26, 34, 42],
[48, 40, 32, 24, 18, 10, 2, 10, 18, 26, 34],
[56, 48, 40, 32, 26, 18, 10, 2, 10, 18, 26],
[64, 56, 48, 40, 32, 26, 18, 10, 6, 10, 18],
[72, 64, 56, 48, 40, 34, 26, 18, 12, 10, 10]
])
D = global_alignment_matrix(seq1, seq2)
print(np.all(np.equal(D, expected_dist_matrix)))
print(global_alignment_value(D) == expected_distance)
Complete the Smith-Waterman implementation
Finish the provided code and check yourself with the tests provided.
def dist_local(char1, char2):
##########################################
# Function that computes the distance between two characters in a sequence,
# used for local alignment.
#
# Input arguments:
# char1, char2 = Values of 2 characters, either nucleobases or gaps,
# as 1-character strings.
#
# E.g. char1 = 'A', char2 = 'G'
# char1 = 'A', char2 = '-'
#
# Returns:
# The distance (or cost of change) between char1 and char2,
# using the DIST_... values
##########################################
DIST_MATCH = 2
DIST_GAP = -6
DIST_MISMATCH = -4
if # WRITE CONDITION HERE
return DIST_MATCH
if # WRITE CONDITION HERE
return DIST_GAP
else
return DIST_MISMATCH
print(dist_local('A', '-') + dist_local('A', 'T') == -10)
print(dist_local('A', 'C') + dist_local('G', 'T') == -8)
Finish the provided code.
Since it is a variation of the global alignment algorithm, the Smith-Waterman algorithm has a similar structure.
A big difference is that one takes the maximum of different possibilities, since once keeps track of an optimum score.
def local_alignment_matrix(seq1, seq2):
#######################################################
# Computes the local alignment matrix D between
# sequences seq1 and seq2
#
# Input:
# seq1, seq2 = nucleobase sequences as strings
#
# E.g. seq1 = 'TATACCTGA', seq2 = 'TATACGAGACCGTTT'
#
# Returns:
# D = local alignment matrix
########################################################
# Convert to character vectors to easily work letter by letter
# E.g. seq1 = "ABC" becomes c("A", "B", "C") and seq1[2] == "B"
seq1 = list(seq1)
seq2 = list(seq2)
#D = [[0] * (len(seq2) + 1) for _ in range(len(seq1) + 1)]
D = np.zeros([(len(seq1) + 1), (len(seq2) + 1)])
for i in range(1, len(seq1) + 1):
for j in range(1, len(seq2) + 1):
dist_from_upper_left = ### WRITE CODE HERE ###
dist_from_above = ### WRITE CODE HERE ###
dist_from_left = ### WRITE CODE HERE ###
D[i, j] = max(
[ ### WRITE CODE HERE ###,
### WRITE CODE HERE ###,
### WRITE CODE HERE ###,
0] # make sure to have a non-negative value since this is a distance
)
return D
def local_alignment_value(alignment_matrix):
return max(map(max, alignment_matrix))
Check your code with these. The last 2 lines compare expected and actual values.
seq1 = "GGTATGCTGGCGCTA"
seq2 = "TATATGCGGCGTTT"
expected_distance = 12
expected_dist_matrix = np.array([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 0, 2, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 2, 0, 2, 4, 0, 2, 0, 0, 0],
[ 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 4, 2, 2],
[ 0, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 2, 0, 6, 0, 6, 0, 0, 0, 0, 0, 0, 2, 2, 2],
[ 0, 0, 0, 0, 2, 0, 8, 2, 2, 2, 0, 2, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 2, 10, 4, 0, 4, 0, 0, 0, 0],
[ 0, 2, 0, 2, 0, 2, 0, 4, 6, 0, 0, 0, 2, 2, 2],
[ 0, 0, 0, 0, 0, 0, 4, 0, 6, 8, 2, 2, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 2, 0, 2, 8, 4, 4, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 10, 4, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 2, 0, 6, 2, 4, 12, 6, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 4, 6, 8, 2, 0],
[ 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 8, 10, 4],
[ 0, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 2, 4, 6]])
D = local_alignment_matrix(seq1, seq2)
print(np.all(np.equal(D, expected_dist_matrix)))
print(local_alignment_value(D) == expected_distance)
Finish the code below to compute the shortest common superstring (SCS) using the greedy algorithm.
To assemble reads, we need a function to compute the length of the overlap of two given sequences
For example:
seq1: abc123
seq2: 123xyz
len_overlap('abc123', '123xyz') == 3
The skeleton for this function is given below. Finish the code and use the tests below to check yourself.
Note that this time we're not converting the strings seq1
and seq2
to character vectors, because R has some functions which we can use, but these work on strings (e.g. we use "hello"
instead of ["h", "e", "l", "l", "o"]
as before).
Read about these functions: startswith()
, endswith()
and think about how to use them to solve this tasks. Also, recall len()
counts the number of characters in a string or list. len("hello") == 5
and nchar(["hello"]) == 1
.
You are of course free to expand the function if you feel constrained by the given "blanks". You can even rewrite it entirely as you wish, just please explain what you're doing.
! Careful !
The python while(CONDITION)
loop command simply repeats until CONDITION is no longer true. If there is a mistake in the code and the logical value of CONDITION never changes, the code will run forever. If you see your code running for very long (like when running the tests), press the stop button at the top of the notebook session.
def len_overlap(seq1, seq2):
##############################################
# Find the longest overlap between a suffix of seq1 and a prefix of seq2
# and return its length.
# E.g. len_overlap('abc123', '123xyz') == 3
##############################################
# We take shorter and shorter suffixes of seq1
# and test whether they overlap with a prefix of seq2.
# We start at index 1 in seq1 (i.e. thus including seq1 entirely)
overlap_start_index = 0
while overlap_start_index < ## WRITE CODE HERE ##:
current_prefix = ## WRITE CODE HERE ##
if ## WRITE CODE HERE ##:
return len(seq1) - overlap_start_index
else:
overlap_start_index += 1
return 0 # no overlap
You can add your own of course. Think of edge cases and negative examples.
print(len_overlap('abc123', '123xyz') == 3)
print(len_overlap('abc1234', '1234xyz') == 4)
print(len_overlap('abc124', '123xyz') == 0)
## WRITE TEST HERE ##
## WRITE TEST HERE ##
## WRITE TEST HERE ##
## WRITE TEST HERE ##
This approach requires generating all permutations of a vector of strings. To make our lives easier, we're going to use the permutations()
function from the itertools
package. So first install this package:
To see how it works, run the chunk below:
items = ["I", "prefer", "Python"]
test_permutations = list(permutations(items))
print(test_permutations)
Remember there are n!
(n factorial) permutations of n
elements, so we should refrain from long vectors (e.g. factorial(10)
is already 3628800
).
def greedy_shortest_common_superstring(strings):
#############################################################
# Takes a list of strings and computes
# an approximation of the shortest common supperstring (SCS),
# using a greedy algorithm.
#
# Input:
# strings = list of strings e.g. c('ABCD', 'CDBC', 'BCDA')
#
# Returns:
# shortest_superstring = the approximation of the SCS
# of the input list of strings
# E.g. 'ABCDBCDA'
###############################################################
# Start off with no SCS (empty string)
shortest_superstring = ""
all_permutations_of_the_strings = permutations(strings)
# Consider all permutations of the input list of stings
for current_permutation in all_permutations_of_the_strings:
# For the current permutation,
# start growing a superstring using the first string in the permutation
superstring_candidate = current_permutation[0]
# Go over all strings in this permutation and overlap consecutive strings i and i + 1
# So e.g. i: ABB i+1: BBA -> ABBA
for i in range(len(strings) - 1):
# determine length of the overlap ...
overlap_length = len_overlap(#### WRITE CODE HERE ####)
# ... and use this to get the non-overlapping part of string i + 1
non_overlapping_portion = #### WRITE CODE HERE ####
# The non-overlapping part of string i + 1 is then pasted
# to the end of the superstring we are growing for this permutation
superstring_candidate += non_overlapping_portion
# If this is the first permutation we're considering,
# of it turns out that the superstring candidate we grew for this permutation of input strings
# is shorter than our current SCS candidate,
# make the current superstring candidate the new SCS
if shortest_superstring == "" or len(superstring_candidate) < #### WRITE CODE HERE #### :
shortest_superstring = superstring_candidate
return shortest_superstring
The second case will be a bit slow, because of the many permutations that are considered. You can add your own tests of course. Think of edge cases.
Note that there may be multiple SCS and the procedure will choose only the first one, and the order is determined by how the permutations are generated. So if you add your own test, check the output of permutations(my_test_vector)
to see the order in which each SCS candidate is constructed.
print(greedy_shortest_common_superstring(c('ABCD', 'CDBC', 'BCDA')) == 'ABCDBCDA')
print(greedy_shortest_common_superstring(c('BAA', 'AAB', 'BBA', 'ABA', 'ABB', 'BBB', 'AAA', 'BAB') == 'BAAABABBBA'))