1 Aim

This notebook goes over some basic concepts in data structures and algorithms. You should go through it and try to answer the questions before the “Introduction to Algorithms” tutorial. We will use time during the tutorial to clarify any confusions and do the exercises.

For a proper overview of such topics, there are plenty of free online courses on Coursera, Youtube, etc. Classic books in the area are:

though note that these were written for computer science undergraduates and quickly get technical.

2 Preliminaries

We’ll be using the Python programming language to discuss some algorithmic concepts. Python is perhaps not the best choice for introducing such topics, since it’s a high-level language, abstracting over many of the low-level operations. Still, many of the concepts can be discussed at multiple levels, and when writing your own code it’s good to be aware of the time and memory cost of your routines.

Click here to download this notebook so you can run it and write your own code.

3 Basic Concepts

The whole point of programming is to automate tasks and let computers do the “heavy lifting” of processing. Regardless of application area (bioinformatics, computer games, video streaming), the basic building blocks of programming are essentially the same and revolve around storing and processing data. These two aspects are in a constant interplay and are tackled by data structures and algorithms, respectively. In fact, structure reflects function and no single data structure will best support all types of processing.

4 Basic Data Structures

4.1 Atomic values and operations

Like their mathematical counterparts, variables are labels assigned to various values. These values may be atomic (e.g. a number) or structured collections of atomic values (e.g. a vector of numbers):

protein_count = 2345                    # atomic value
organism = "yeast"                      # string
temperatures = [35.2, 34, 35.7, 34.5]  # vector of floats (real values)

# len(protein_count)  # would raise a TypeError: an atomic value has no length
len(organism)
## 5
len(temperatures)
## 4

(Note: The len of organism is 5 since Python treats a string as a sequence of characters. Every time we perform some string manipulation operation (e.g. pattern matching), the string is processed character by character, as sketched below.)
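For illustration, here is a minimal sketch of that character-by-character processing (count_char is just a toy helper introduced here; the built-in organism.count() does the same scan for you):

def count_char(text, target):
    count = 0
    # One comparison per character: the whole string gets scanned
    for character in text:
        if character == target:
            count += 1
    return count

print(count_char(organism, "e"))
## 1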

The main thing to note here from an algorithmic point of view is:

We can measure a program’s running time in terms of atomic operations, i.e. getting and setting atomic values, as well as arithmetic operations with these.

# 1 atomic operation: assigning (setting) a value
protein_count = 2345                          

# 3 atomic operations: 
# - getting the protein_count value from RAM into the CPU
# - adding an atomic value 
# - then setting the resulting adjusted_protein_count value from the CPU into RAM
adjusted_protein_count = protein_count + 100 
print(adjusted_protein_count)
## 2445

This is our unit of time. Of course the exact number of milliseconds will differ between computers, but this intrinsic unit allows us to reason about how the cost of programs scales with the amount of data they’re fed.

And because the focus is how things scale, it turns out that we can also think about the second operation above as effectively atomic, because constant multiplication factors can be ignored. For now, don’t worry about that or the whole CPU-RAM dynamic mentioned there.
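If you want to see this scaling on your own machine, here is a rough sketch (no output shown, since the exact timings are machine-dependent): summing N numbers in a loop should take roughly 10 times longer every time N grows 10-fold.

import time

for n in [10**5, 10**6, 10**7]:
    values = list(range(n))
    start = time.perf_counter()
    total = 0
    # N additions and N loop steps: the time should grow roughly linearly with N
    for value in values:
        total += value
    print(n, round(time.perf_counter() - start, 4))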

The same reasoning in terms of atomic and compound values applies to memory cost, though nowadays you’re more likely to bump into long running times than memory limitations.

4.2 Vectors

Also called arrays (or, in Python, lists), these are linear collections of atomic values of the same type.

temperatures = [35.2, 34, 35.7, 34.5]

Let’s say the length of a vector is N.

One requires N atomic operations to fully process a vector

The time cost (or “time complexity” as it’s formally known) of a program is defined as a function of the size of the input. So if our input is N numbers collected in a vector, the time cost is time_cost(N) = k * N

As I hinted two examples ago, k can theoretically be ignored. This has more to do with the fact that as N grows to larger and larger values, some programs will have a time cost of e.g. k * N and others of e.g. k * N * N (N squared). When comparing such programs, the constants really don’t matter and it’s the power of N that “dominates” the time cost. But of course, in practice, if you can reduce the k from 10 to 1, and N is the number of days, that does indeed make a difference.
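A quick numeric illustration of that domination: even with a generous constant of k = 10 on the linear program and no constant at all on the quadratic one, the k * N cost is soon dwarfed by N * N.

k = 10
for n in [10, 100, 1000, 10000]:
    # N, cost of k * N, cost of N * N
    print(n, k * n, n * n)
## 10 100 100
## 100 1000 10000
## 1000 10000 1000000
## 10000 100000 100000000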

Back to vectors: they are meant for storing and fetching values by position. As in, assuming you “know” which element you want, you can retrieve it in one atomic operation (or “constant time”).

print(temperatures[3])  # 1 atomic operation: getting the value
## 34.5

This is important to understand. If you want element number 1000 from a vector, the computer directly gives you that element (in 1 atomic operation), it doesn’t cross the entire list to get there.

Now, the problem is if you don’t know which element you want, say if you’re looking for the largest value. If you’re just given a vector and know nothing about it, you really have no choice but to go element-by-element to find the largest one. Thus, computing max(vector) has a time cost of N.
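To see why, here is a from-scratch sketch of what max() has to do (find_max is just a toy name for illustration; in practice you would use the built-in):

def find_max(vector):
    largest = vector[0]
    # Every element must be visited: N comparisons in total
    for value in vector:
        if value > largest:
            largest = value
    return largest

print(find_max(temperatures))
## 35.7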

One often thinks of program time costs in terms of worst-case scenarios, since these cases will inform how many resources must be guaranteed for the program.

Many real-world problems are intrinsically difficult, meaning the optimal solution requires very large time costs. The figure below gives a few examples. O() basically means worst case. As you can see, for some of these the cost explodes even for relatively small amounts of data (N).

4.3 Hash Maps

Also known as “dictionaries” in Python, “hash tables”, or “associative arrays” (in the theory), this data type holds a collection of elements, similar to a vector, except that elements are indexed by keys (unique IDs) and are not stored in any particular order. These keys can be of any (hashable) data type, though frequently strings are used. Indeed, vectors can be considered a particular case of hash map, where the keys are successive integers.

They are often used for “reverse lookup” scenarios. So for instance, if you want to look up the coordinates of a gene, you can use a hash map indexed by gene name to fetch the associated coordinate values. With hash maps, this is an atomic operation (i.e. the list of keys is not scanned one by one until the matching one is found).

gene_pos = {} 

# We insert elements one by one using gene names as string keys
# The syntax is
# hashmap[key] = value

gene_pos["gene_a"] = [1, 1000]
gene_pos["gene_b"] = [10001, 2010]
gene_pos["gene_c"] = [2011, 2999]

To retrieve an element from this hash map using a gene name (the key), simply refer to it like this

print(gene_pos["gene_b"])
## [10001, 2010]

To get a list of keys, use the dictionary method keys():

gene_pos.keys()
## dict_keys(['gene_a', 'gene_b', 'gene_c'])

If a key is not present, a KeyError will be raised. A safe way to check for a key is the in operator (the get() method, which returns None for missing keys, is another option):


query_gene = "gene_a"

if query_gene not in gene_pos:
    print(f"{query_gene} not present")
else:
    print(gene_pos[query_gene]) 
## [1, 1000]

Why is this a useful data structure? Because

Entries in hash maps are accessed in one atomic operation

4.3.1 Question 1

How would you perform this look up using vectors? That is, given vectors gene_start = [100, 200, ...] and gene_name = ["gene_a", "gene_b", ...], how would you fetch the start position of a query gene?

Click to reveal solution To locate a specific gene, you would first search the vector gene_name for the query gene to obtain its index, then fetch the start position by indexing the vector gene_start at that same position. If gene_name is sorted, the search can be done with binary search in O(log n) time; if it is unsorted, the search takes O(n). Alternatively, if you had used a hash map with gene names as keys and start positions as values, the lookup of a single gene would have a time complexity of O(1).
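A minimal sketch of the vector-based (linear scan) version, using toy vectors that mirror the gene_pos data above (list.index() performs the element-by-element search for us):

gene_name = ["gene_a", "gene_b", "gene_c"]
gene_start = [1, 10001, 2011]

query_gene = "gene_b"

# Linear scan: index() walks gene_name until it finds the query (time cost N)
position_in_vector = gene_name.index(query_gene)
# Fetching by position is then a single atomic operation
print(gene_start[position_in_vector])
## 10001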

4.4 Sets

In many languages, a “set” data structure is simply a “bag” of values (like a vector, but without any ordering). The great advantage is that they can tell you in a single atomic operation if a value is present or not in the set.

Testing set membership takes a single atomic operation

Which is great if you’re testing whether a million values are part of a set containing 10 million known values.

gene_id_set = {"AB1234", "AB1235", "AB1236"}

query_gene = "XY1234"

if query_gene not in gene_id_set:
    print("Gene not present")
else:
    gene_id_set.add(query_gene) 
## Gene not present

To “remove” a value from the set, you can use the discard() method:

gene_id_set.discard("AB1235")
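If you want to convince yourself of the speed difference claimed above, here is a rough sketch comparing membership tests against a plain list versus a set built from the same values (sizes are kept modest so it runs quickly; no output is shown because the timings are machine-dependent):

import time

known_ids = [f"ID{number}" for number in range(10000)]
known_id_set = set(known_ids)
query_ids = [f"ID{number}" for number in range(0, 20000, 2)]

start = time.perf_counter()
# Each membership test scans the list element by element
hits_in_list = sum(1 for query in query_ids if query in known_ids)
print("list lookups:", round(time.perf_counter() - start, 3), "seconds")

start = time.perf_counter()
# Each membership test is a single atomic operation on the set
hits_in_set = sum(1 for query in query_ids if query in known_id_set)
print("set lookups:", round(time.perf_counter() - start, 3), "seconds")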

5 Control Flow

Programs usually consist of collections of tasks that are applied repetitively to multiple input values, or are applied conditionally based on some logic. Technically, this is called control flow and the basic patterns one must know are loops and logical execution conditions. These are likely familiar to any beginner programmer but here we will focus more on algorithmic aspects.

5.1 Looping

The most typical pattern is to apply some sort of processing to many elements in a loop or list comprehension.

CELSIUS_ZERO_IN_KELVIN = 273.15

# initialise
temperatures_kelvin = [0.0 for _ in temperatures]
# loop
for i in range(len(temperatures)):
    temperatures_kelvin[i] = CELSIUS_ZERO_IN_KELVIN + temperatures[i]


# List comprehension
temperatures_kelvin = [CELSIUS_ZERO_IN_KELVIN + temperature for temperature in temperatures]


temperatures_kelvin
## [308.34999999999997, 307.15, 308.84999999999997, 307.65]

Before we move on, please note that we do not change the input data and instead create a new vector for the transformed values. This is a good practice that you should follow, primarily for reproducibility reasons.

Here we applied a simple transformation to all elements of a unidimensional vector. Suppose we have a list of lists or a matrix, and we need to change all elements, say get log values of protein counts across samples.

import numpy as np
N_PROTEINS = 3
N_SAMPLES = 4

protein_counts_across_samples = np.array(
    [[100, 121, 10, 15],
     [1000, 1337, 175, 165],
     [100, 121, 10, 15]])

protein_log_counts_across_samples = np.zeros((N_PROTEINS, N_SAMPLES))

for protein_num in range(N_PROTEINS):
    for sample_num in range(N_SAMPLES):
        protein_log_counts_across_samples[protein_num, sample_num] = np.log(protein_counts_across_samples[protein_num, sample_num])


print(protein_log_counts_across_samples)
## [[4.60517019 4.79579055 2.30258509 2.7080502 ]
##  [6.90775528 7.19818358 5.16478597 5.10594547]
##  [4.60517019 4.79579055 2.30258509 2.7080502 ]]

5.1.1 Question 2

What is the time cost of the matrix transformation above? Try to phrase your answer as a formula, not actual numbers.

Click to reveal solution The size of the matrix grows proportionally to the product of the number of proteins (N_PROTEINS) and the number of samples (N_SAMPLES). Consequently, any operation or transformation applied to all elements within the matrix will exhibit the same growth pattern. Therefore, the time complexity of such transformations is O(N_PROTEINS * N_SAMPLES).


Now, the above is only for illustration. Since numpy supports vectorized operations, you don’t need to do any of that in actual code. You’d simply write:

#import numpy as np
temperatures = np.array(temperatures)
temperatures_kelvin = temperatures + CELSIUS_ZERO_IN_KELVIN
protein_log_counts_across_samples = np.log(protein_counts_across_samples)
print(protein_log_counts_across_samples)
## [[4.60517019 4.79579055 2.30258509 2.7080502 ]
##  [6.90775528 7.19818358 5.16478597 5.10594547]
##  [4.60517019 4.79579055 2.30258509 2.7080502 ]]

In fact, for loops in Python tend to be slower than these vectorized operations (i.e. operations/functions that can be applied to the vectors as a whole). Though keep in mind that “under the hood”, it’s still for loops! (Just in C or Fortran and hence much faster.)

for loops still need to be used in some cases, but there are better patterns to use.

5.2 Functional Approach

As with any technological process or pipeline, certain functionality can be made generic and reused. In python, procedures can be encapsulated in functions.

For example, let’s have a function that truncates any negative value to 0 and doubles any positive value. Why not.

def amplify_positive(value):
    # In Python, a function gives back whatever its return statement specifies
    # (a function without a return statement returns None)

    if value < 0:
        return 0
    else:
        return 2 * value

Python is a multi-paradigm language that plays well with functions applied to various data structures. Quite often in statistical, mathematical, and bioinformatics programming, one works with vectors as the unit of calculation, and the focus is more on the functions that are applied to these vectors. Functional programming is a very deep subject area, but it helps to be aware of some common patterns.

Here we will look at three common functional operations:

  • map aka apply (the former name is the most common and the one used in the theory)
  • filter
  • reduce aka summarize aka combine aka fold

5.2.1 Map

This basically means applying a function to each element in a vector. In Python this operation is provided by the built-in map() function.

Because our example function has a logical test that is defined per atomic value, it cannot simply be fed an entire vector (like we did for the log function). The code below will raise an error (a TypeError) and not produce what we want.

test_values = [-3, -2, -1, 0, 1, 2, 3] 
amplify_positive(test_values)

So we must tell Python how to map the function onto a vector (or “apply” it). You are encouraged to read the documentation for map() and try it out. The basic syntax is

# General form: map(function, vector)
test_values = [-3, -2, -1, 0, 1, 2, 3]
new_values = map(amplify_positive, test_values)

print(new_values)
## <map object at 0x7f9cadf4db80>
print(list(new_values))
## [0, 0, 0, 0, 2, 4, 6]

Fun fact: the dplyr pipe operator %>% in R is a type of map operation, just written as a “connector” that has the meaning data %>% function == function(data).

If you can (easily) express a procedure as a function, the functional approach is strongly encouraged instead of using for loops. So don’t write

amplified_values = [0 for _ in test_values]
for i in range(len(test_values)):
    if (test_values[i] < 0):
        amplified_values[i] = 0
    else:
        amplified_values[i] = 2 * test_values[i]
    

Of course, “under the hood”

Mapping a function onto a vector still has a time cost of N (the length of the vector).

5.2.2 Filter

This pattern uses a logical condition function to choose only those elements from a vector that satisfy the function’s condition. With a numpy array, this can be done in 2 steps: 1) map the condition function onto the vector to produce a boolean (True/False) vector 2) use numpy’s boolean indexing to select only those elements that are marked with True

In a single command, this would be vector[list(map(condition, vector))]

We’ll use a non-trivial toy example, where we’re given a vector of query values and wish to keep only “interesting” values above a threshold.

baseline_values = [i for i in range(8)]
interesting_values = [10, 100, 1337, 1e6]

def is_interesting(value):
    if value > max(baseline_values):
        for element in interesting_values:
            if element == value:
                # Match found: the function can immediately return True
                return True
        # If we went through all interesting_values without a match
        return False
    # The value was not above the baseline
    return False
   

query_values = np.array([ i for i in range(2000)])

print(query_values[list(map(is_interesting, query_values))])
## [  10  100 1337]

5.2.2.1 Question 3

What’s the time cost of the example filter function? Try to express it as a formula.

Click to reveal solution The computation depends on the sizes of query_values, baseline_values and interesting_values. The max() function is linear for an unsorted vector, and the loop over interesting_values is linear too, so each call to is_interesting costs on the order of |baseline_values| + |interesting_values| operations. Applied to every query value, the filtering has a time complexity of O(|query_values| * (|baseline_values| + |interesting_values|)).

5.2.2.2 Question 4

How can you speed up this example filtering? Hint: There are 2 ways and they involve baseline_values and interesting_values

Click to reveal solution We can first recognise that baseline_values is not needed, since none of the interesting_values lie below the baseline. In general, we could exclude any interesting value contained in baseline_values with the help of sets:

baseline_values = set(baseline_values)
interesting_values = set(interesting_values)
interesting_values = interesting_values - interesting_values.intersection(baseline_values)

With interesting_values expressed as a set, we can also speed up the comparison itself by querying directly whether the value is in the set. The new filter function thus becomes

def is_interesting(value):
    if value in interesting_values:
        return True
    else:
        return False

This new filter function has a time complexity of O(|query_values|).

5.2.3 Reduce

This means applying on a vector a function that somehow combines the elements into a single (or at least lower-dimensional) value.

The classic example is sum(), which returns a single number, given a vector as argument. Pretty straightforward, but good to be aware of it formally when designing your programs.
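Besides built-ins like sum(), Python exposes this pattern directly as functools.reduce(), which folds a two-argument function over a vector. A small sketch:

from functools import reduce
from operator import add, mul

values = [1, 2, 3, 4]

print(reduce(add, values))   # same as sum(values)
## 10
print(reduce(mul, values))   # product of all elements
## 24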

5.2.3.1 Question 5

What’s the time cost of reduce functions? Try to express it as a formula.

Click to reveal solution The time complexity of the reduce function is typically linear, O(n), where n is the number of elements in the iterable being reduced. This is because reduce iterates over each element of the iterable exactly once, performing the specified binary operation on pairs of elements until only a single result remains.

5.2.4 Map-Reduce

This is a frequent combination in data science, and it is baked into libraries like pandas (and, in R, dplyr and the wider tidyverse ecosystem). In tidy data frames, columns are variables and rows are observations. Many operations performed on them can be expressed using map-reduce patterns.

For example, let’s take the Iris dataset, and compute the average petal area. The data frame looks like this

import pandas as pd

iris = pd.read_csv("iris.csv")

We compute the areas (very simplistically), then the average of all areas:


# Assuming 'iris' is a DataFrame with columns 'petal.length' and 'petal.width'
# You can replace it with your actual DataFrame and column names

iris['petal area (cm^2)'] = iris['petal.length'] * iris['petal.width']
result = iris['petal area (cm^2)'].mean()

print(result)
## 5.794066666666667

5.2.4.1 Question 6

What is the time cost of the above calculation? Try to express it as a formula.

Click to reveal solution The time complexity is proportional to the number of rows in the data frame, as we need to perform one multiplication per row (and then one addition per row for the mean). Thus O(n).

Using such patterns often makes code easier to understand and faster to write, and may even lead to faster execution if the system is optimized for functional or vector-based programming.

We will not cover parallelized code, but another huge advantage is that this sort of pattern is very easy to run on multiple CPUs, leading to manifold speed improvements. Indeed, many big data, machine learning, and bioinformatics frameworks automatically do that for you, provided you structure your code this way.
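As a small taste of why the map pattern parallelizes so easily, here is a sketch using Python’s standard multiprocessing module, reusing the amplify_positive function from earlier. The only real change from the sequential version is who performs the map. (Depending on your operating system and whether you run this from a script or a notebook, multiprocessing may require the mapped function to live in an importable module, hence the __main__ guard.)

from multiprocessing import Pool

test_values = [-3, -2, -1, 0, 1, 2, 3]

if __name__ == "__main__":
    # Each worker process applies amplify_positive to a chunk of the input independently
    with Pool(processes=2) as pool:
        new_values = pool.map(amplify_positive, test_values)
    print(new_values)   # expected: [0, 0, 0, 0, 2, 4, 6]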

6 Testing Algorithms

We’ve looked at different data structures and ways to control the flow of program execution. These are the building blocks of algorithms. Besides time cost, another major concern when considering algorithms is correctness. Even if one is developing a probabilistic or approximate algorithm, it’s important to verify that results are within acceptable bounds.

Many fundamental algorithms have rigorous mathematical proofs of correctness that take into account all possible inputs. In everyday practice, it’s very difficult to guarantee that any non-trivial algorithm produces the results you want (provided, of course, you even explicitly know what you want). This is complicated by the fact that the implementation of an algorithm using a certain language, library, OS, etc. can hide a plethora of bugs.

Testing correctness of programs is a vast field, however

Using a few good practices will get you a long way toward correct and reproducible code.

6.1 Toward correctness: Clean, modular code

Here are just two of these good programming practices. It’s very easy to pick up bad habits and hard to lose them so make sure to practice these principles from the beginning :)

6.1.1 (C1) Give meaningful names to variables and functions

Remember the rant about giving variables meaningful names? The pragmatic side of it is that you need to communicate what your code does. People (even future you) will evaluate code to see if it makes sense and does what it’s supposed to. The flipside is that people won’t trust your code if it looks like spaghetti.

6.1.2 (C2) Don’t repeat yourself

It’s very tempting to copy-paste code to do the same thing multiple times. We’ve all done it. It’s also a very bad idea. It creates a “code swamp” that hides the main flow of the code but, more importantly, it’s highly error-prone, especially when making changes. And changes will be made. There’s no such thing as 100% single-use code.

So instead of something like this:

# Illustration only: read_csv and mean stand in for e.g. pandas.read_csv and a suitable summary function
data1 = read_csv("my_data_file_1.csv")
avg1 = mean(data1)
data2 = read_csv("my_data_file_2.csv")
avg2 = mean(data2)
data2 = read_csv("my_data_file_3.csv")  # this copy-paste-change error is on purpose, but it happens
avg3 = mean(data2)

Generalize:

filenames = ["my_data_file_1.csv", "my_data_file_2.csv", "my_data_file_3.csv"]
averages = list(map(mean, map(read_csv, filenames)))

Now you can just extend the list of file names, or use glob.glob() to get all the file names in a directory. And if you ever want to summarize the values differently, you just replace mean in one place.
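For example, a quick sketch (assuming the data files sit in the current working directory, and reusing the placeholder read_csv and mean from above):

import glob

# Every file matching the pattern in the current directory
filenames = sorted(glob.glob("my_data_file_*.csv"))
averages = list(map(mean, map(read_csv, filenames)))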

If the benefit of the general form is not obvious, imagine the procedure for each file is like 20 lines long.

Of course, you need not (or can not) think up the best code structure beforehand, but do take steps to avoid bad patterns.

6.1.3 Toward correctness: Testing

The most obvious way to test code is to feed it input and see if it produces expected output. It’s very difficult or even impossible to cover all cases, especially if one is not aware of all corner cases of more complex code. But this can be alleviated by writing your programs in a modular way and testing each “module” separately. This is easier if one follows principle (C2) above.

A good practice is to actually prepare test cases before writing the code. This helps thinking a bit deeper about the problem being solved and potential gotchas.

This can be as simple as writing something like:

def my_procedure(x):
    return x**2 * 10   # note: ** is exponentiation in Python, ^ is bitwise XOR

# Tests
test_inputs = [1, 2, 3, -1, -2]
expected_outputs = [10, 40, 90, 10, 40]

actual_outputs = map( my_procedure, test_inputs)
print(expected_outputs == list(actual_outputs))
## True

There are also testing frameworks such as pytest which offer more functionality for writing tests, but as you can see, it’s super easy to write large batteries of tests to assess whether you have the solution you think you have.
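For instance, a minimal pytest-style version of the test above might look like the sketch below (pytest discovers functions whose names start with test_ and runs them for you; the file name is just an example):

# test_my_procedure.py  (run with: pytest test_my_procedure.py)
def my_procedure(x):
    return x**2 * 10

def test_my_procedure_on_known_values():
    test_inputs = [1, 2, 3, -1, -2]
    expected_outputs = [10, 40, 90, 10, 40]
    assert list(map(my_procedure, test_inputs)) == expected_outputs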

7 Exercise: ID Frequency

  1. Write a function that counts the number of times an ID appears in a vector. E.g. in ["abc", "xyz", "abc", "def"], “abc” appears 2 times, the rest 1 time.

  2. Describe its time cost using a formula. Aim for a low time cost when building your solution.

A solution written in everyday Python is given below, for you to test against. Please write your function using only base Python (“from scratch”), using the data structures and methods discussed earlier.

Run the code below to generate 10 thousand random IDs consisting of 2 letters and a digit, then write your function and test it against the solution written below.

import random
import string

# Set seed for reproducibility
random.seed(42)

# Generate random strings
random_strings = [''.join(random.choices(string.ascii_uppercase, k=2) + random.choices(string.digits, k=1)) for _ in range(10000)]

# Print the first 10 random strings
print(random_strings[:10])
## ['QA2', 'FT6', 'XC4', 'AF5', 'AF6', 'OF5', 'VA8', 'SI1', 'YI0', 'CW6']
# Your function here 
import pandas as pd

# Assuming 'random_strings' is a list of strings
# You can replace it with your actual list

# Create a DataFrame
df = pd.DataFrame({'id': random_strings})

# Group by 'id' and count occurrences
answer = df.groupby('id').size().reset_index(name='count')

# Print the result
print(answer)
##        id  count
## 0     AA1      1
## 1     AA3      1
## 2     AA4      2
## 3     AA5      1
## 4     AA6      2
## ...   ...    ...
## 5213  ZZ3      1
## 5214  ZZ4      2
## 5215  ZZ7      2
## 5216  ZZ8      3
## 5217  ZZ9      2
## 
## [5218 rows x 2 columns]