Homework 2: Intro to Python programming for data analysis¶

Last updated: 24.01.2025

Instructions¶

Create a Jupyter Notebook and write the Python code that solves the tasks below. Work on Vera, since that's where the data files are located. This homework is due on Wednesday, February 5, 2025. Please submit the Jupyter Notebook file on Canvas.

To pass the homework you need 5 points. The tasks are meant to be rather independent (and a few have subtasks).

To get full points, please:

[1 pt] Describe what the code is doing. Even if you find solutions online, it's important that you understand what's going on.
[1 pt] Make sure your notebook can be run from top to bottom (i.e. is reproducible). To check, restart the Python 3 kernel to get a fresh session and re-run all cells (in Jupyter Lab: "Kernel" → "Restart Kernel and Run All Cells..."). This is crucial because as you work, various variables will accumulate in the session and this can "sabotage" your code.

Bonus point: Write code that is understandable! I won't mind not choosing the best possible names for variables but try to inform the reader what's going on by naming things in a clear way. So use something like reference_genome_df rather than simply df or x. This is also a service to future you!

Advice¶

Remember to always inspect both the input data, as well as intermediate results
We try to point to relevant sources of information, but using the Internet and biological databases is a skill in and of itself, so if the contents of certain biological files are confusing, try to find information about the formats. Most databases have these described somewhere.

Before you begin, a preliminary module loading section is provided below. These are all necessary to run the example codes provided to you throughout this homework. However, you are more than welcome to use other modules to solve the tasks, if you wish.

If you would like to download this notebook, you can do so here. If you are on the server, you can use wget https://bengtssonpalme.github.io/MPBIO-BBT045-2025/python_for_data_analysis/homework_python_intro.ipynb.

In order to complete the homework, you will need to copy the following files from your unix_tutorial/ directory:

S288C_reference_sequence_R64-5-1_20240529.fsa
saccharomyces_cerevisiae_R64-5-1_20240529.gff
Y55_JRIF00000000_SGD_cds.fsa

Once you are ready to start the Homework, launch JupyterLab the same way you did for the Tutorial, with the custom Python-Intro-HW.sh script in the ~/portal/jupyter/ directory selected as your Runtime on the Vera OnDemand portal.

In [ ]:

# load modules
import pandas as pd # enables manipulation of dataframes in Python
from Bio import SeqIO # enables manipulation of sequence files
from Bio.SeqUtils import GC # calculate GC content with Biopython

Task 1: How many genes are on yeast S288C chromosome II? [1 pt]¶

Let's redo task 1 in homework 1 using Python.

Read in your version of the file saccharomyces_cerevisiae_R64-5-1_20240529.gff that you downloaded for the first homework as a table (hint: use the pd.read_csv() function). Note that the first column is the chromosome, not the sequence ID as in other GFF files.

Count the number of genes on chromosome II using dataframe filtering and manipulation functions like you've seen in the tutorial. Please have your code produce a numerical result.

If you need help, try taking a look at the Pandas cheat sheet or searching for answers on StackOverflow (or other coding help sites). Being able to combine key search terms in a search engine in order to find the information you're looking for is a very important skill to have.

In [ ]:

Task 2: Compute average gene length by chromosome in the S288C yeast strain [1 pts]¶

Using the same dataframe you used in the Task 1, calculate the average gene length per chromosome. Remember to inspect your data, to see what you're working with. Please use the Pandas dataframe manipulation methods we've covered in the tutorial, as much as you can.

In [ ]:

Task 3: Extract gene IDs from the S288C yeast GFF file and look for some genes [2 pts]¶

Using the saccharomyces_cerevisiae_R64-5-1_20240529.gff GFF file again, you may have noticed the attributes column contains a lot of information, some of which is quite important, like the gene ID.

Here's a function that extracts the gene ID from that column:

In [ ]:

def extract_gene_id_from_attributes_s288c(df_name): 
	# this function requires the argument of the GFF dataframe name be given to it
	# it retrieves the IDs in a Pandas series from the attributes column of the GFF file
	# first save the dataframe to a temporary version to work with
	tmp_df = df_name.copy()
	# extract content of attributes column starting from either 'Name' or 'ID'
	# ref: https://stackoverflow.com/questions/60063716/pandas-dataframe-extract-string-between-two-strings-and-include-the-first-deli
	# ref: https://stackoverflow.com/questions/42320506/regex-way-to-say-either-a-semicolon-or-the-end-of-the-string
	tmp_series = tmp_df["attributes"].str.extract('((ID|Name).*(?:$))')
	# split the string into a list based on semicolon placement
	tmp_series[0] = tmp_series[0].str.split(";")
	# extract the name or ID value to a new column
	# ref: https://stackoverflow.com/questions/37125174/accessing-every-1st-element-of-pandas-dataframe-column-containing-lists
	tmp_series[2] = tmp_series[0].str[0]
	# split the string at the =
	tmp_series[2] = tmp_series[2].str.split("=")
	# extract the name/ID to a new column
	tmp_series[3] = tmp_series[2].str[1]
	# retain only the final column wiht the name/ID values
	tmp_series = tmp_series[3]
	# the series containing the gene IDs will be returned
	return tmp_series

a. Using the table manipulation functions, create a new column gene_id in the GFF table. You should find clues for how to do this in the last exercise of the tutorial. [1 pt]

In [ ]:

b. Write code (using the table manipulation functions) that 1) checks whether the following gene IDs are present in the table and 2) only prints the chromosomes on which these are found. [1 pt]

Hint: For 1), check out the isin() function though it's not the only way.
(Obs: only 0.5 points if you have a separate copy of the code for each gene ID :) )

YJR117W
YKL156C-A
YOL110W
YJR104C
YJL111W
YXYZZY
YNOPE

In [ ]:

Task 4: Find top 5 longest genes in the S288C yeast [1 pt]¶

Compute gene lengths as before, but this time simply get the highest 5 values. There are several ways to do this using the basic table manipulation functions you've seen in the tutorial. If you're having trouble with those, maybe there's something else you could use from the Pandas cheat sheet or searching for answers on StackOverflow.

In [ ]:

Task 5: Calculations per gene in the Y55 strain [2 pts]¶

The yeast Y55 strain FASTA: Y55_JRIF00000000_SGD_cds.fsa should also already be downloaded in HW1 so you can just copy it to the directory for HW2. If you are doing these exercises out of order, note that the file can be obtained from here: http://sgd-archive.yeastgenome.org/sequence/strains/Y55/Y55_SGD_2015_JRIF00000000/archive/Y55_JRIF00000000_SGD_cds.fsa.gz.

As a demo of Biopython, let's compute the GC content per gene. To do this you first need to load in the .fsa file with the Biopython's SeqIO module. Write that small bit of code to load the .fsa file, since you'll need it, plus the code below to solve this task.

To compute the GC contents you can use a function called GC() from the Biopython SeqUtils module, which you can load like so: from Bio.SeqUtils import GC

You can run the command directly on any given sequence in a FASTA file like so:

gc_content = GC(sequence)

You can use SeqIO in combination with the code snippet above to calculate the GC content for each gene, by running the code in a for loop, as shown below:

In [ ]:

%%script false --no-raise-error

# solving the GC content
# ref: https://github.com/vappiah/Python-Bioinformatics-Hacks/blob/main/Notebooks/GC_content.ipynb
# ref: https://biopython.org/wiki/SeqIO
# ref: https://bioinformatics.stackexchange.com/questions/307/changing-the-record-id-in-a-fasta-file-using-biopython

input_fasta = "Y55_JRIF00000000_SGD_cds.fsa"

# as before, let's use a counter to prevent printing all items
counter = 0

for record in SeqIO.parse(input_fasta, "fasta"):
    # use SeqIO to open the FASTA file
    if counter <=10: 
    # print onlyfirst few entries
        seq_gc_content = GC(record.seq)
        # print the results
        print("For sequence " + record.description + ", the GC content is " + str(round(seq_gc_content, 3)) + ".")
        # now increase the counter value by 1
        counter += 1

But what happens if we want to compute something else not supported by Biopython? In that case, the simplest solution may be to convert the FASTA file's contents into a Pandas dataframe.

So let's convert it to a dataframe instead using the DataFrame() function, setting the columns as gene_info and sequence:

In [ ]:

%%script false --no-raise-error

# create a Pandas dataframe from the FASTA file
# ref: https://stackoverflow.com/questions/19436789/biopython-seqio-to-pandas-dataframe
# ref: https://stackoverflow.com/questions/69696077/i-want-to-parse-sequences-and-sequence-ids-from-a-fasta-file-and-assign-them-to

# create two empty lists for the sequence IDs and sequences
id_list = []
sequence_list = []

with open(input_fasta) as fasta_file: 
    # open the file
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        # parse the contents of the FASTA file
        # and append the relevant information to the lists
        id_list.append(seq_record.description)
        sequence_list.append(str(seq_record.seq))

# generate a new Pandas dataframe from the two lists
# ref: https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/
genes_y55_df = pd.DataFrame(list(zip(id_list, sequence_list)), columns = ['gene_info', 'sequence'])
genes_y55_df

Here's a function to extract the gene ID from the gene_info fields:

In [ ]:

# function to extract the gene IDs from the gene_info columns

def extract_gene_id_from_names_y55(gene_info_col):
    # define the function name & set internal argument
    return gene_info_col.split("|")[4]
    # split the string into a list using pipes ("|") as the delimiters
    # then take the 5th element in the list (Pythonic index 4)
    # return that value to outside of the function

a. Use the function to add a gene_id column. Think about how you might apply a function to a dataframe. (Note that the code here should look different from the code you used to generate the gene_id column in Task 3.) [1 pt]

Hint for cleaning up: you can delete a column from a Pandas dataframe by dropping it: pd.DataFrame.drop

In [ ]:

b. Rare codons are biologically significant (here's an example, if you're curious). In yeast, the rarest codon (besides stop codons) is CGG (mRNA) [source]. So, using dataframe manipulation functions on the resulting dataframe from a., create a column to count the occurrence of GCC (DNA) per gene.

In [ ]:

Task 6: Search for a motif in all genes [1 pt]¶

Filter the S288C_reference_sequence_R64-5-1_20240529.fsa file (which, if you haven't already, you can download from here: https://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz) for those genes that contain the sequence motif "CCACACCACACCCAC".

Then convert the hits into a new dataframe (hint: Python & Pandas often don't like operations on variables that are copies of each other, so consider using the .copy() function).

Which chromosome has the most hits? (Hint: try adapting the code above to create a new list to add to the Pandas dataframe, which contains information on the chromosome IDs of each FASTA record.)

Write the code solution using dataframe manipulation functions. Your code should produce only the chromosome number with the highest motif count (i.e. not just printing the table of all chromosomes). Remember that the column "group" means "chromosome" since the input FASTA file simply has full chromosomes as entries, not individual genes ("group_name" is irrelevant here).

In [ ]: