This notebook goes over some basic concepts in data structures and algorithms. You should go through it and try to answer the questions before the “Introduction to Algorithms” tutorial. We will use time during the tutorial to clarify any confusions and do the exercises.
For a proper overview of such topics, there are plenty of free online courses on Coursera, Youtube, etc. Classic books in the area are:
though note that these were written for computer science undergraduates and quickly get technical.
We’ll be using the Python programming language to discuss some algorithmic concepts. Python is perhaps not the best choice for introducing such topics, since it’s a high-level language that abstracts over many of the low-level operations. Still, many of the concepts can be discussed at multiple levels, and when writing your own code it’s good to be aware of the time and memory cost of your routines.
Click here to download this notebook so you can run it and write your own code.
The whole point of programming is to automate tasks and let computers do the “heavy lifting” of processing. Regardless of application area (bioinformatics, computer games, video streaming), the basic building blocks of programming are essentially the same and revolve around storing and processing data. These two aspects are in a constant interplay and are tackled by data structures and algorithms, respectively. In fact, structure reflects function and no single data structure will best support all types of processing.
Like their mathematical counterparts, variables are labels assigned to various values. These values may be atomic (e.g. a number) or structured collections of atomic values (e.g. a vector of numbers):
protein_count = 2345 # atomic value
organism = "yeast" # string
temperatures = [35.2, 34, 35.7, 34.5] # vector of floats (real values)
#len(protein_count)
len(organism)
## 5
len(temperatures)
## 4
(Note: the len of organism is 5 because Python treats a string as a sequence of single characters. Every time we perform some string manipulation operation (e.g. pattern matching), the string is processed character by character.)
The main thing to note here from an algorithmic point of view is:
We can measure a program’s running time in terms of atomic operations, i.e. getting and setting atomic values, as well as arithmetic operations with these.
# 1 atomic operation: assigning (setting) a value
protein_count = 2345
# 3 atomic operations:
# - getting the protein_count value from RAM into the CPU
# - adding an atomic value
# - then setting the resulting adjusted_protein_count value from the CPU into RAM
adjusted_protein_count = protein_count + 100
print(adjusted_protein_count)
## 2445
This is our unit of time. Of course the exact number of milliseconds will differ between computers, but this intrinsic unit allows us to reason about how the cost of programs scales with the amount of data they’re fed.
And because the focus is how things scale, it turns out that we can also think about the second operation above as effectively atomic, because constant multiplication factors can be ignored. For now, don’t worry about that or the whole CPU-RAM dynamic mentioned there.
The same reasoning in terms of atomic and compound values applies to memory cost, though nowadays you’re more likely to bump into long running times than into memory limitations.
Also called arrays or lists in other languages, these are linear collections of atomic values of the same type.
temperatures = [35.2, 34, 35.7, 34.5]
Let’s say the length of a vector is N.
One requires N atomic operations to fully process a vector
The time cost (or “time complexity” as it’s formally known) of a program is defined as a function of the size of the input. So if our input is N numbers collected in a vector, the time cost is

time_cost(N) = k * N
As I hinted two examples ago, k can be theoretically ignored. This has more to do with the fact that as N grows to larger and larger values, some programs will have a time cost of e.g. k * N and others of e.g. k * N * N (N squared). When comparing such programs, the constants really don’t matter and it’s the power of N that “dominates” the time cost. But of course, in practice, if you can reduce the k from 10 to 1, and N is the number of days, that does indeed make a difference.
Back to vectors: they are meant for storing and fetching values by position. As in, assuming you “know” which element you want, you can retrieve it in one atomic operation (or “constant time”).
print(temperatures[3]) # 1 atomic operation: getting the value
## 34.5
This is important to understand. If you want element number 1000 from a vector, the computer directly gives you that element (in 1 atomic operation), it doesn’t cross the entire list to get there.
Now, the problem is if you don’t know which element you want, say if you’re looking for the largest value. If you’re just given a vector and know nothing about it, you really have no choice but to go element-by-element to find the largest one. Thus, computing max(vector) has a time cost of N.
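To make this concrete, here is a minimal sketch of that element-by-element scan (a hypothetical find_max helper, for illustration only; in real code you would just call the built-in max):

def find_max(vector):
    # Start by assuming the first element is the largest seen so far
    largest = vector[0]
    # Compare against every remaining element: roughly N atomic operations in total
    for value in vector[1:]:
        if value > largest:
            largest = value
    return largest

print(find_max(temperatures))  # 35.7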
One often thinks of program time costs in terms of worst-case scenarios, since these cases will inform how many resources must be guaranteed for the program.
Many real-world problems are intrinsically difficult, meaning the optimal solution requires very large time costs. The figure below gives a few examples. The O() notation basically describes how the worst-case cost grows with the size of the input. As you can see, for some of these the cost explodes even with relatively few data points (N).
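To get a feel for how quickly such costs grow, the small sketch below (purely illustrative, not tied to any particular algorithm) prints a linear, a quadratic and an exponential cost for a handful of values of N:

# Purely illustrative: how fast different cost functions grow with N
for n in [10, 20, 30, 40]:
    print(f"N = {n:>2}: linear = {n:>4}, quadratic = {n * n:>6}, exponential = {2 ** n:>15}")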
Also known as “dictionaries” in Python, “hash tables”, or “associative arrays” (in theory), this data type holds a collection of elements, similar to a vector, except that elements are indexed by keys (unique IDs) and are not stored in any particular order. These keys can be of any data type, though frequently strings are used. Indeed, vectors can be considered a particular case of hash maps, where the keys are successive integers.
They are often used for “reverse lookup” scenarios. So for instance, if you want to look up the coordinates of a gene, you can use a hash map indexed by gene name to fetch the associated coordinates. With hash maps, this is an atomic operation (i.e. the list of keys is not searched one by one until the matching one is found).
gene_pos = {}
# We insert elements one by one using gene names as string keys
# The syntax is:
# hashmap[key] = value
gene_pos["gene_a"] = [1, 1000]
gene_pos["gene_b"] = [10001, 2010]
gene_pos["gene_c"] = [2011, 2999]
To retrieve an element from this hash map using a gene name (the key), simply refer to it like this:
print(gene_pos["gene_b"])
## [10001, 2010]
To get a list of keys, use the dictionary method keys():
gene_pos.keys()
## dict_keys(['gene_a', 'gene_b', 'gene_c'])
If a key is not present, a KeyError will be raised. The proper way to check for a key is with the in operator (or the get() method, which returns None for missing keys):
query_gene = "gene_a"
if query_gene not in gene_pos:
    print(f"{query_gene} not present")
else:
    print(gene_pos[query_gene])
## [1, 1000]
Why is this a useful data structure? Because
Entries in hash maps are accessed in one atomic operation
How would you perform this lookup using vectors? That is, given vectors gene_start = [100, 200, ...] and gene_name = ["gene_a", "gene_b", ...], how would you fetch the start position of a query gene?
In many languages, a “set” data structure is simply a “bag” of values (like a vector, but without any ordering). The great advantage is that they can tell you in a single atomic operation whether a value is present in the set or not.
Testing set membership takes a single atomic operation
Which is great if you’re testing whether a million values are part of a set containing 10 million known values.
gene_id_set = {"AB1234", "AB1235", "AB1236"}
query_gene = "XY1234"
if query_gene not in gene_id_set:
    print("Gene not present")
    gene_id_set.add(query_gene)  # add the newly seen gene to the set
else:
    print(f"{query_gene} already present")
## Gene not present
To remove a value from the set, you can use the discard() method:
gene_id_set.discard("AB1235")
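As a small sketch of the “many queries against a large set of known values” scenario mentioned above (the ID format and the numbers are made up for illustration):

# Build a large set of known IDs and a list of query IDs (made-up values)
known_ids = {f"ID{i}" for i in range(100_000)}
query_ids = [f"ID{i}" for i in range(0, 200_000, 2)]

# Each membership test is a single atomic operation, so the total cost grows
# with the number of queries, not with the size of the set
n_found = sum(1 for query_id in query_ids if query_id in known_ids)
print(n_found)  # 50000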
Programs usually consist of collections of tasks that are applied repetitively to multiple input values, or are applied conditionally based on some logic. Technically, this is called control flow and the basic patterns one must know are loops and logical execution conditions. These are likely familiar to any beginner programmer but here we will focus more on algorithmic aspects.
The most typical pattern is to apply some sort of processing to many elements in a loop or list comprehension.
CELSIUS_ZERO_IN_KELVIN = 273.15
#initialise
temperatures_kelvin = [0.0 for _ in temperatures]
# loop
for i in range(len(temperatures)):
    temperatures_kelvin[i] = CELSIUS_ZERO_IN_KELVIN + temperatures[i]

# List comprehension
temperatures_kelvin = [ CELSIUS_ZERO_IN_KELVIN + temperature for temperature in temperatures]
temperatures_kelvin
## [308.34999999999997, 307.15, 308.84999999999997, 307.65]
Before we move on, please note that we do not change the input data and instead create a new vector for the transformed values. This is a good practice that you should follow, primarily for reproducibility reasons.
Here we applied a simple transformation to all elements of a unidimensional vector. Suppose we have a list of lists or a matrix, and we need to change all elements, say get log values of protein counts across samples.
import numpy as np
N_PROTEINS = 3
N_SAMPLES = 4
protein_counts_across_samples = np.array(
[[100, 121, 10, 15],
[1000, 1337, 175, 165],
[100, 121, 10, 15]])
protein_log_counts_across_samples = np.zeros((N_PROTEINS, N_SAMPLES))
for protein_num in range(N_PROTEINS):
for sample_num in range(N_SAMPLES):
protein_log_counts_across_samples[protein_num, sample_num] = np.log(protein_counts_across_samples[protein_num, sample_num])
print(protein_log_counts_across_samples)
## [[4.60517019 4.79579055 2.30258509 2.7080502 ]
## [6.90775528 7.19818358 5.16478597 5.10594547]
## [4.60517019 4.79579055 2.30258509 2.7080502 ]]
What is the time cost of the matrix transformation above? Try to phrase your answer as a formula, not actual numbers.
Now, the above is only for illustration. Since numpy supports vectorized operations, you don’t need to do any of that in actual code. You’d simply write:
#import numpy as np
temperatures = np.array(temperatures)
temperatures_kelvin = temperatures + CELSIUS_ZERO_IN_KELVIN
protein_log_counts_across_samples = np.log(protein_counts_across_samples)
print(protein_log_counts_across_samples)
## [[4.60517019 4.79579055 2.30258509 2.7080502 ]
## [6.90775528 7.19818358 5.16478597 5.10594547]
## [4.60517019 4.79579055 2.30258509 2.7080502 ]]
In fact, for loops in Python tend to be slower than these vectorized operations (i.e. operations/functions that can be applied to the vectors as a whole). Though keep in mind that “under the hood” it’s still for loops, just written in C or Fortran and hence much faster.

for loops still need to be used in some cases (one example is sketched below), but there are often better patterns to use.
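One such case is a computation in which each value depends on the previous one, so the iterations cannot easily be vectorized; a small sketch (the smoothing weights 0.8 and 0.2 are made up for illustration):

# Exponential smoothing: each smoothed value depends on the previous smoothed value
smoothed = [temperatures[0]]
for temperature in temperatures[1:]:
    smoothed.append(0.8 * smoothed[-1] + 0.2 * temperature)
print(smoothed)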
As with any technological process or pipeline, certain functionality can be made generic and reused. In python, procedures can be encapsulated in functions.
For example, let’s have a function that truncates any negative value to 0 and doubles any positive value. Why not.
def amplify_positive(value):
    # In Python, a function gives a value back via an explicit return statement
    if value < 0:
        return 0
    else:
        return 2 * value
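For example, calling it on a couple of atomic values:

print(amplify_positive(-5))  # negative values are truncated to 0
print(amplify_positive(3))   # positive values are doubled, giving 6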
Python is a procedural language, but it plays well with a functional style applied to its various data structures. Quite often in statistical, mathematical, and bioinformatics programming, one works with vectors as the unit of calculation, and the focus is more on the functions that are applied to these vectors. Functional programming is a very deep subject area, but it helps to be aware of some common patterns.
Here we will look at a few common functional operations:
This basically means applying a function to each element in a vector. In base Python this operation is called map (R users may know the equivalent apply family of functions).
Because our example function has a logical test that is defined per atomic value, it cannot simply be fed an entire vector (like we did for the log function). The code below will raise an error rather than produce what we want.
test_values = [-3, -2, -1, 0, 1, 2, 3]
amplify_positive(test_values)  # TypeError: '<' not supported between instances of 'list' and 'int'
So we must tell Python how to map the function onto a vector (or “apply” it element by element). You are encouraged to read the documentation of map and try it out. The basic syntax is

map(function, vector)
test_values = [-3, -2, -1, 0, 1, 2, 3]
new_values = map( amplify_positive, test_values)
print(new_values)
## <map object at 0x7f9cadf4db80>
print(list(new_values))
## [0, 0, 0, 0, 2, 4, 6]
Fun fact: the dplyr pipe operator %>% in R is a type of map operation, just written as a “connector” that has the meaning data %>% function == function(data).
If you can (easily) express a procedure as a function, the functional approach is strongly encouraged instead of using for loops.
So don’t write
amplified_values = [0 for _ in test_values]
for i in range(len(test_values)):
    if test_values[i] < 0:
        amplified_values[i] = 0
    else:
        amplified_values[i] = 2 * test_values[i]
Of course, “under the hood”
Mapping a function onto a vector still has a time cost of N (the length of the vector).
This pattern uses a logical condition function to choose only those elements of a vector that satisfy the condition. With numpy arrays, this can be done in 2 steps: 1) map the condition function onto the vector to produce a logical (True/False) vector; 2) use logical (boolean) subsetting of the vector to select only those elements that are marked with True. (Python also has a built-in filter(condition, vector) function that does this in one go.)

In a single command, this would be

vector[list(map(condition, vector))]
We’ll use a non-trivial toy example, where we’re given a vector of query values and wish to keep only “interesting” values above a threshold.
baseline_values = [i for i in range(8)]
interesting_values = [10, 100, 1337, 1e6]

def is_interesting(value):
    if value > max(baseline_values):
        for element in interesting_values:
            if element == value:
                # Match found: the function can immediately return True
                return True
        # If we went through all interesting_values without a match
        return False
    # The value was not above the baseline
    return False
query_values = np.array([ i for i in range(2000)])
print(query_values[list(map(is_interesting, query_values))])
## [ 10 100 1337]
What’s the time cost of the example filter function? Try to express it as a formula.
How can you speed up this example filtering? Hint: there are 2 ways and they involve baseline_values and interesting_values.
baseline_values = set(baseline_values)
interesting_values = set(interesting_values)
interesting_values = interesting_values - interesting_values.intersection(baseline_values)
With interesting_values expressed as a set, we can also speed up the comparison itself by querying directly whether the value is in the set. The new filter function thus becomes
def is_interesting(value):
    if value in interesting_values:
        return True
    else:
        return False
This new filter function does a constant amount of work per value, so filtering the whole query has a time cost proportional to the number of query values, i.e. O(len(query_values)).
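Putting the two speed-ups together, a sketch of the faster filtering step could look like this (the names interesting_set and is_interesting_fast are introduced here just for illustration):

# Convert the collections to sets once, up front
baseline_set = set(baseline_values)
interesting_set = set(interesting_values) - baseline_set

def is_interesting_fast(value):
    # A single set-membership test per value: constant time
    return value in interesting_set

fast_results = [value for value in query_values if is_interesting_fast(value)]
print(fast_results)  # keeps 10, 100 and 1337, as before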
This means applying on a vector a function that somehow combines the elements into a single (or at least lower-dimensional) value.
The classic example is sum(), which returns a single number given a vector as argument. Pretty straightforward, but good to be aware of it formally when designing your programs.
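Besides built-ins like sum(), min() and max(), Python’s functools.reduce lets you write your own reduce operations; a minimal sketch that re-implements the sum of the temperatures vector:

from functools import reduce

# Repeatedly combine an accumulated value with the next element (here: by addition),
# collapsing the whole vector into a single number
total = reduce(lambda accumulated, value: accumulated + value, temperatures)
print(total)  # same result as sum(temperatures)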
What’s the time cost of reduce functions? Try to express it as a formula.
This is a frequent combination in data science, and it is baked into pandas as well as R’s dplyr package and the overall tidyverse ecosystem. In tidy data frames, columns are variables and rows are observations. Many of the operations performed on such data frames can be expressed using map-reduce patterns.
For example, let’s take the Iris dataset and compute the average petal area. The data frame looks like this:
import pandas as pd
iris = pd.read_csv("iris.csv")
We compute the areas (very simplistically), then the average of all areas:
# Assuming 'iris' is a DataFrame with columns 'petal.length' and 'petal.width'
# (replace these with your actual DataFrame and column names if they differ)
iris['petal area (cm^2)'] = iris['petal.length'] * iris['petal.width']
result = iris['petal area (cm^2)'].mean()
print(result)
## 5.794066666666667
What is the time cost of the above calculation? Try to express it as a formula.
Using such patterns often makes code easier to understand and faster to write, and may even lead to faster execution if the system is optimized for functional or vector-based programming.
We will not cover parallelized code, but another huge advantage is that this sort of pattern is very easy to run on multiple CPUs, leading to manifold speed improvements. Indeed, many big data, machine learning, and bioinformatics frameworks automatically do that for you, provided you structure the code this way.
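To give a flavour only, here is a rough sketch of a parallel map using Python’s built-in multiprocessing module (the mapped function must be defined at the top level, and on some platforms the __main__ guard below is required, so this is best run as a standalone script):

from multiprocessing import Pool

CELSIUS_ZERO_IN_KELVIN = 273.15

def to_kelvin(celsius):
    return celsius + CELSIUS_ZERO_IN_KELVIN

if __name__ == "__main__":
    temperatures_celsius = [35.2, 34, 35.7, 34.5]
    # Spread the map over (up to) 4 worker processes
    with Pool(processes=4) as pool:
        temperatures_kelvin = pool.map(to_kelvin, temperatures_celsius)
    print(temperatures_kelvin)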
We’ve looked at different data structures and ways to control the flow of program execution. These are the building blocks of algorithms. Besides time cost, another major concern when considering algorithms is correctness. Even if one is developing a probabilistic or approximate algorithm, it’s important to verify that results are within acceptable bounds.
Many fundamental algorithms have rigorous mathematical proofs of correctness that take into account all possible inputs. In everyday practice, it’s very difficult to guarantee that any non-trivial algorithm produces the results you want (provided, of course, you even explicitly know what you want). This is complicated by the fact that the implementation of an algorithm using a certain language, library, OS, etc. can hide a plethora of bugs.
Testing correctness of programs is a vast field, however
Using a few good practices will get you a long way toward correct and reproducible code.
Here are just two of these good programming practices. It’s very easy to pick up bad habits and hard to lose them so make sure to practice these principles from the beginning :)
Remember the rant about giving variables meaningful names? The pragmatic side of it is that you need to communicate what your code does. People (even future you) will evaluate code to see if it makes sense and does what it’s supposed to. The flipside is that people won’t trust your code if it looks like spaghetti.
It’s very tempting to copy-paste code to do the same thing multiple times. We’ve all done it. It’s also a very bad idea. It creates a “code swamp” that hides the main flow of the code but, more importantly, it’s highly error-prone, especially when making changes. And changes will be made. There’s no such thing as 100% single-use code.
So instead of something like this:
data1 = read_csv("my_data_file_1.csv")
avg1 = mean(data1)
data2 = read_csv("my_data_file_2.csv")
avg2 = mean(data2)
data2 = read_csv("my_data_file_3.csv") # this copy-paste-change error is on purpose, but it happens
avg3 = mean(data2)
Generalize:
filenames = ["my_data_file_1.csv", "my_data_file_2.csv", "my_data_file_3.csv"]
averages = map(mean,
               map(read_csv, filenames))
Now you can just extend the list of file names, or use glob.glob() to get all the file names in a directory. And if you ever want to summarize the values differently, you just replace mean in one place.
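For instance (assuming the data files sit in the current working directory):

import glob

# Collect all matching file names instead of typing them out by hand
filenames = sorted(glob.glob("my_data_file_*.csv"))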
If the benefit of the general form is not obvious, imagine the procedure for each file is like 20 lines long.
Of course, you need not (or can not) think up the best code structure beforehand, but do take steps to avoid bad patterns.
The most obvious way to test code is to feed it input and see if it produces expected output. It’s very difficult or even impossible to cover all cases, especially if one is not aware of all corner cases of more complex code. But this can be alleviated by writing your programs in a modular way and testing each “module” separately. This is easier if one follows principle (C2) above.
A good practice is to actually prepare test cases before writing the code. This helps thinking a bit deeper about the problem being solved and potential gotchas.
This can be as simple as writing something like:
def my_procedure(x):
    return x**2 * 10  # note: exponentiation in Python is **, not ^ (which is bitwise XOR)

# Tests
test_inputs = [1, 2, 3, -1, -2]
expected_outputs = [10, 40, 90, 10, 40]
actual_outputs = map(my_procedure, test_inputs)
print(expected_outputs == list(actual_outputs))
## True
There are also packages such as pytest (and the built-in unittest module) which offer more functionality for writing tests, but as you can see, it’s super easy to write large batteries of tests to assess whether you have the solution you think you have.
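For example, a test written for pytest could look roughly like this (my_module and the file name test_my_procedure.py are hypothetical; you would run it with the pytest command):

# test_my_procedure.py
from my_module import my_procedure  # hypothetical module containing the function under test

def test_my_procedure_on_small_inputs():
    test_inputs = [1, 2, 3, -1, -2]
    expected_outputs = [10, 40, 90, 10, 40]
    # pytest reports a clear failure message if this assertion does not hold
    assert [my_procedure(x) for x in test_inputs] == expected_outputs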
Write a function that counts the number of times an ID appears in a vector. E.g. in ["abc", "xyz", "abc", "def"], “abc” appears 2 times, the rest 1 time.
Describe its time cost using a formula. Aim for a low time cost when building your solution.
A solution written in everyday Python (using pandas) is given below, for you to test against. Please write your function using only base Python (“from scratch”), using the data structures and methods discussed earlier.
Run the code below to generate 10 thousand random IDs consisting of 2 letters and a digit, then write your function and test it against the solution written below.
import random
import string
# Set seed for reproducibility
random.seed(42)
# Generate random strings
random_strings = [''.join(random.choices(string.ascii_uppercase, k=2) + random.choices(string.digits, k=1)) for _ in range(10000)]
# Print the first 10 random strings
print(random_strings[:10])
## ['QA2', 'FT6', 'XC4', 'AF5', 'AF6', 'OF5', 'VA8', 'SI1', 'YI0', 'CW6']
# Your function here
import pandas as pd
# Assuming 'random_strings' is a list of strings
# You can replace it with your actual list
# Create a DataFrame
df = pd.DataFrame({'id': random_strings})
# Group by 'id' and count occurrences
answer = df.groupby('id').size().reset_index(name='count')
# Print the result
print(answer)
## id count
## 0 AA1 1
## 1 AA3 1
## 2 AA4 2
## 3 AA5 1
## 4 AA6 2
## ... ... ...
## 5213 ZZ3 1
## 5214 ZZ4 2
## 5215 ZZ7 2
## 5216 ZZ8 3
## 5217 ZZ9 2
##
## [5218 rows x 2 columns]