Instructions

  • Download this notebook and write your solutions inside it.

  • Add the notebook with your solutions to your allocated GitHub repo. (Remember to commit and push.)

  • To solve the tasks, finish the code skeletons provided. The code must be runnable to get full points. If you're having trouble, try to explain in your own words what you were trying to do. Proper understanding is half of the points.

  • Each task is independent but subtasks are linked.

  • To pass the homework, you need 3 / 6 points.

  • Remember to run the tests for each function you'll be completing. And feel free to add some of your own. By testing code in small modular parts (e.g. functions), you can trust these parts in "downstream" (more complex) applications.

The terms "cost", "distance", and "penalty" are used interchangeably since in this context they represent equivalent ways of looking at the same concepts. It also helps illustrate that these algorithms are quite general and can be applied to many different problems.

Advice

  • Always restart the notebook (Kernel > Restart Kernel...) and rerun the notebook from the top to double-check (i.e. reproduce) your solutions.

  • Remember that python starts indexing vectors at position 0. Other languages in which these algorithms are implemented such as MatLab and R start at 1.

Task 1: Global Alignment

Complete the Needleman–Wunsch implementation.

Part a: The distance between two letters [1 pt]

Fill in the "blanks" in the code below (replace the # WRITE CONDITION HERE) with the appropriate logical checks such that this distance function gives the correct distance between two letters.

Remember to run the code chunk so the notebook starts using your function.

# imports
import numpy as np
from itertools import permutations
def dist_global(char1, char2):
    #############################
    # Function that computes the distance between two characters in a sequence,
    # used for global alignment.
    # 
    # Input arguments:
    #    char1, char2 = Values of 2 characters, either nucleobases or gaps, 
    #                   as 1-character strings.
    # 
    #    E.g. char1 = 'A', char2 = 'G'
    #         char1 = 'A', char2 = '-'
    # 
    # Returns:
    #    The distance (or cost of change) between char1 and char2, 
    #    using the DIST_... values
    #############################

    DIST_MATCH  = 0
    DIST_GAP = 8
    DIST_TRANSITION = 2
    DIST_TRANSVERSION = 4

    if (char1 == char2):
        return DIST_MATCH 

    if ():# WRITE CONDITION HERE
        return DIST_GAP 

    if ():# WRITE CONDITION HERE
        return DIST_TRANSITION 

    if  ():# WRITE CONDITION HERE
        return DIST_TRANSITION

    else:
        return DIST_TRANSVERSION

Tests

Check your code with the test below, assuming you use the same values as in the lecture. If you want to experiment with different distance values, then naturally the test will have to be adapted.

print(dist_global('A', 'G') == 2)
print(dist_global('A', '-') == 8)

Part b: Computing the global alignment [1 pt]

Here you are provided with skeleton code to compute the global alignment between two sequences using the Needleman–Wunsch algorithm algorithm. The same notation is used as in the course.

def global_alignment_matrix(seq1, seq2):
    ##############################################################
    # Computes the global alignment matrix D between 
    # sequences seq1 and seq2
    # 
    # Input:
    #     seq1, seq2 = nucleobase sequences as strings
    # 
    #     E.g. seq1 = 'TATACCTGAAGGGCCT', seq2 = 'TATACGAGACCGTTT'
    # 
    # Returns:
    #     D = global alignment matrix
    ###############################################################
    
    # Convert to character vectors to easily work letter by letter
    # E.g. seq1 = "ABC" becomes ["A", "B", "C"] and seq1[1] == "B"
    seq1 = list(seq1)
    seq2 = list(seq2)

    D = np.zeros([len(seq1)+1, len(seq2)+1])

    for i in range(1, len(seq1) + 1):
        D[i][0] = ### WRITE CODE HERE ###

    for j in range(1, len(seq2) + 1):
        D[0][j] = ### WRITE CODE HERE ###

    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            dist_from_upper_left = dist_global(seq1[i - 1], seq2[j - 1])
            dist_from_above = dist_global(seq1[i - 1], "-")
            dist_from_left = dist_global("-", seq2[j - 1])

            D[i][j] = min(
                D[i - 1][j - 1] + ### WRITE CODE HERE ###,
                D[i - 1][j] + ### WRITE CODE HERE ###,
                D[i][j - 1] + ### WRITE CODE HERE ###
            )

    return D

# This one's complete. Just a convenience function to fetch the overall value.
def global_alignment_value(global_alignment_matrix):
    # Input:
    # global_alignment_matrix = a computed global alignment matrix
    # 
    # Returns:
    # The lower-right corner of the matrix, since we know that holds
    # the global alignment value
    return(global_alignment_matrix[-1,-1])

Tests

Check your code with these. The last 2 lines compare expected and actual values.

seq1 = "TACGTCAGC"
seq2 = "TATGTCATGC"

expected_distance = 10

expected_dist_matrix = np.array([
        [ 0,  8, 16, 24, 32, 40, 48, 56, 64, 72, 80],
        [ 8,  0,  8, 16, 24, 32, 40, 48, 56, 64, 72],
        [16,  8,  0,  8, 16, 24, 32, 40, 48, 56, 64],
        [24, 16,  8,  2, 10, 18, 24, 32, 40, 48, 56],
        [32, 24, 16, 10,  2, 10, 18, 26, 34, 40, 48],
        [40, 32, 24, 16, 10,  2, 10, 18, 26, 34, 42],
        [48, 40, 32, 24, 18, 10,  2, 10, 18, 26, 34],
        [56, 48, 40, 32, 26, 18, 10,  2, 10, 18, 26],
        [64, 56, 48, 40, 32, 26, 18, 10,  6, 10, 18],
        [72, 64, 56, 48, 40, 34, 26, 18, 12, 10, 10]
        ])


D = global_alignment_matrix(seq1, seq2)

print(np.all(np.equal(D, expected_dist_matrix)))
print(global_alignment_value(D) == expected_distance)

Task 2: Local Alignment

Complete the Smith-Waterman implementation

Part a: The distance between two letters [1 pt]

Finish the provided code and check yourself with the tests provided.

def dist_local(char1, char2): 
    ##########################################
    # Function that computes the distance between two characters in a sequence,
    # used for local alignment.
    # 
    # Input arguments:
    #     char1, char2 = Values of 2 characters, either nucleobases or gaps,
    #                    as 1-character strings.
    # 
    #     E.g. char1 = 'A', char2 = 'G'
    #          char1 = 'A', char2 = '-'
    # 
    # Returns:
    #     The distance (or cost of change) between char1 and char2, 
    #     using the DIST_... values
    ##########################################

    DIST_MATCH = 2
    DIST_GAP = -6
    DIST_MISMATCH = -4
    
    if # WRITE CONDITION HERE
        return DIST_MATCH 
    
    if # WRITE CONDITION HERE
        return DIST_GAP
    
    else
        return DIST_MISMATCH

Tests

print(dist_local('A', '-') + dist_local('A', 'T') == -10)
print(dist_local('A', 'C') + dist_local('G', 'T') == -8)

Part b: Computing the local alignment [1 pt]

Finish the provided code.

Since it is a variation of the global alignment algorithm, the Smith-Waterman algorithm has a similar structure.

A big difference is that one takes the maximum of different possibilities, since once keeps track of an optimum score.

def local_alignment_matrix(seq1, seq2):
    #######################################################
    # Computes the local alignment matrix D between 
    # sequences seq1 and seq2
    # 
    # Input:
    #    seq1, seq2 = nucleobase sequences as strings
    # 
    #    E.g. seq1 = 'TATACCTGA', seq2 = 'TATACGAGACCGTTT'
    # 
    # Returns:
    #    D = local alignment matrix
    ########################################################
    # Convert to character vectors to easily work letter by letter
    # E.g. seq1 = "ABC" becomes c("A", "B", "C") and seq1[2] == "B"
    seq1 = list(seq1)
    seq2 = list(seq2)

    #D = [[0] * (len(seq2) + 1) for _ in range(len(seq1) + 1)]
    D = np.zeros([(len(seq1) + 1),  (len(seq2) + 1)])
    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            dist_from_upper_left =  ### WRITE CODE HERE ###
            dist_from_above =  ### WRITE CODE HERE ###
            dist_from_left =  ### WRITE CODE HERE ###

            D[i, j] = max(
                [ ### WRITE CODE HERE ###,
                 ### WRITE CODE HERE ###,
                 ### WRITE CODE HERE ###,
                0]  # make sure to have a non-negative value since this is a distance
            )

    return D


def local_alignment_value(alignment_matrix):
    return max(map(max, alignment_matrix))

Tests

Check your code with these. The last 2 lines compare expected and actual values.

seq1 = "GGTATGCTGGCGCTA"
seq2 = "TATATGCGGCGTTT"

expected_distance = 12

expected_dist_matrix = np.array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
                                   [ 0,  0,  0,  0,  0,  0,  2,  0,  2,  2,  0,  2,  0,  0,  0],
                                   [ 0,  0,  0,  0,  0,  0,  2,  0,  2,  4,  0,  2,  0,  0,  0],
                                   [ 0,  2,  0,  2,  0,  2,  0,  0,  0,  0,  0,  0,  4,  2,  2],
                                   [ 0,  0,  4,  0,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
                                   [ 0,  2,  0,  6,  0,  6,  0,  0,  0,  0,  0,  0,  2,  2,  2],
                                   [ 0,  0,  0,  0,  2,  0,  8,  2,  2,  2,  0,  2,  0,  0,  0],
                                   [ 0,  0,  0,  0,  0,  0,  2, 10,  4,  0,  4,  0,  0,  0,  0],
                                   [ 0,  2,  0,  2,  0,  2,  0,  4,  6,  0,  0,  0,  2,  2,  2],
                                   [ 0,  0,  0,  0,  0,  0,  4,  0,  6,  8,  2,  2,  0,  0,  0],
                                   [ 0,  0,  0,  0,  0,  0,  2,  0,  2,  8,  4,  4,  0,  0,  0],
                                   [ 0,  0,  0,  0,  0,  0,  0,  4,  0,  2, 10,  4,  0,  0,  0],
                                   [ 0,  0,  0,  0,  0,  0,  2,  0,  6,  2,  4, 12,  6,  0,  0],
                                   [ 0,  0,  0,  0,  0,  0,  0,  4,  0,  2,  4,  6,  8,  2,  0],
                                   [ 0,  2,  0,  2,  0,  2,  0,  0,  0,  0,  0,  0,  8, 10,  4],
                                   [ 0,  0,  4,  0,  4,  0,  0,  0,  0,  0,  0,  0,  2,  4,  6]])


D = local_alignment_matrix(seq1, seq2)


print(np.all(np.equal(D, expected_dist_matrix)))
print(local_alignment_value(D) == expected_distance)

Task 3: Assembly

Finish the code below to compute the shortest common superstring (SCS) using the greedy algorithm.

Part a: Max overlap of two sequences [1 pt]

To assemble reads, we need a function to compute the length of the overlap of two given sequences

For example:

seq1: abc123
seq2:    123xyz

len_overlap('abc123', '123xyz') == 3

The skeleton for this function is given below. Finish the code and use the tests below to check yourself.

Note that this time we're not converting the strings seq1 and seq2 to character vectors, because R has some functions which we can use, but these work on strings (e.g. we use "hello" instead of ["h", "e", "l", "l", "o"] as before).

Read about these functions: startswith(), endswith() and think about how to use them to solve this tasks. Also, recall len() counts the number of characters in a string or list. len("hello") == 5 and nchar(["hello"]) == 1.

You are of course free to expand the function if you feel constrained by the given "blanks". You can even rewrite it entirely as you wish, just please explain what you're doing.

! Careful !

The python while(CONDITION) loop command simply repeats until CONDITION is no longer true. If there is a mistake in the code and the logical value of CONDITION never changes, the code will run forever. If you see your code running for very long (like when running the tests), press the stop button at the top of the notebook session.

def len_overlap(seq1, seq2):
    ##############################################    
    # Find the longest overlap between a suffix of seq1 and a prefix of seq2
    # and return its length.
    # E.g. len_overlap('abc123', '123xyz') == 3
    ##############################################

    # We take shorter and shorter suffixes of seq1
    # and test whether they overlap with a prefix of seq2.
    # We start at index 1 in seq1 (i.e. thus including seq1 entirely)
    overlap_start_index = 0
    while overlap_start_index < ## WRITE CODE HERE ##:
        current_prefix = ## WRITE CODE HERE ##
        if ## WRITE CODE HERE ##:
            return len(seq1) - overlap_start_index
        else:
            overlap_start_index += 1

    return 0 # no overlap

Tests

You can add your own of course. Think of edge cases and negative examples.

print(len_overlap('abc123', '123xyz') == 3)
print(len_overlap('abc1234', '1234xyz') == 4)
print(len_overlap('abc124', '123xyz') == 0)


## WRITE TEST HERE ##
## WRITE TEST HERE ##
## WRITE TEST HERE ##
## WRITE TEST HERE ##

Part b: Greedy Shortest Common Superstring [1 pt]

This approach requires generating all permutations of a vector of strings. To make our lives easier, we're going to use the permutations() function from the itertools package. So first install this package:

To see how it works, run the chunk below:

items = ["I", "prefer", "Python"]

test_permutations = list(permutations(items))

print(test_permutations)

Remember there are n! (n factorial) permutations of n elements, so we should refrain from long vectors (e.g. factorial(10) is already 3628800).

def greedy_shortest_common_superstring(strings):
    #############################################################    
    # Takes a list of strings and computes
    # an approximation of the shortest common supperstring (SCS),
    # using a greedy algorithm.
    # 
    # Input:
    #       strings = list of strings e.g. c('ABCD', 'CDBC', 'BCDA')
    # 
    # Returns:
    #       shortest_superstring = the approximation of the SCS 
    #                              of the input list of strings
    # E.g. 'ABCDBCDA'
    ###############################################################
    
    # Start off with no SCS (empty string)
    shortest_superstring = ""
    all_permutations_of_the_strings = permutations(strings)

    # Consider all permutations of the input list of stings
    for current_permutation in all_permutations_of_the_strings:
        # For the current permutation,
        # start growing a superstring using the first string in the permutation
        superstring_candidate = current_permutation[0]
        # Go over all strings in this permutation and overlap consecutive strings i and i + 1
        # So e.g. i: ABB   i+1: BBA  ->  ABBA
        for i in range(len(strings) - 1):
            # determine length of the overlap ...
            overlap_length = len_overlap(#### WRITE CODE HERE ####)
                
            # ... and use this to get the non-overlapping part of string i + 1
            non_overlapping_portion = #### WRITE CODE HERE ####
            
            # The non-overlapping part of string i + 1 is then pasted 
            # to the end of the superstring we are growing for this permutation
            superstring_candidate += non_overlapping_portion

        # If this is the first permutation we're considering,
        # of it turns out that the superstring candidate we grew for this permutation of input strings
        # is shorter than our current SCS candidate, 
        # make the current superstring candidate the new SCS
        if shortest_superstring == "" or len(superstring_candidate) < #### WRITE CODE HERE #### :
            shortest_superstring = superstring_candidate

    return shortest_superstring

Tests

The second case will be a bit slow, because of the many permutations that are considered. You can add your own tests of course. Think of edge cases.

Note that there may be multiple SCS and the procedure will choose only the first one, and the order is determined by how the permutations are generated. So if you add your own test, check the output of permutations(my_test_vector) to see the order in which each SCS candidate is constructed.

print(greedy_shortest_common_superstring(c('ABCD', 'CDBC', 'BCDA')) == 'ABCDBCDA')
print(greedy_shortest_common_superstring(c('BAA', 'AAB', 'BBA', 'ABA', 'ABB', 'BBB', 'AAA', 'BAB') == 'BAAABABBBA'))