Homework 2: Intro to Python programming for data analysis¶

Last updated: 11.01.2024

Instructions¶

Create an Jupyter Notebook and write the Python code that solves the tasks below. Then add that Jupyter Notebook to your homework repo (make sure you actually push to GitHub). Try to work on the server. That's where the data files are located.

To pass the homework you need 5 points. The tasks are meant to be rather independent (and a few have subtasks).

To get full points, please:

[1 pt] Describe what the code is doing. Even if you find solutions online, it's important that you understand what's going on.
[1 pt] Make sure your notebook can be run from top to bottom (i.e. is reproducible). To check, restart the Python 3 kernel to get a fresh session and re-run all cells (in Jupyter Lab: "Kernel" → "Restart Kernel and Run All Cells..."). This is crucial because as you work, various variables will accumulate in the session and this can "sabotage" your code.

Bonus point: Write code that is understandable! I won't mind not choosing the best possible names for variables but try to inform the reader what's going on by naming things in a clear way. So use something like reference_genome_df rather than simply df or x. This is also a service to future you!

Advice¶

Remember to always inspect both the input data, as well as intermediate results
We try to point to relevant sources of information, but using the Internet and biological databases is a skill in itself, so if the contents of certain biological files are confusing, try to find information about the formats. Most databases have these described somewhere.

Before you begin, a preliminary module loading section is provided below. These are all necessary to run the example codes provided to you throughout this homework. However, you are more than welcome to use other modules to solve the tasks, if you wish.

If you would like to download this notebook, you can do so here.

In [1]:

# load modules
import pandas as pd # enables manipulation of dataframes in Python
from Bio import SeqIO # enables manipulation of sequence files
from Bio.SeqUtils import GC # calculate GC content with Biopython

Task 1: How many genes are on yeast S288C chromosome II? [1 pt]¶

Let's redo task 1 in homework 1 using Python.

Read in your version of the file saccharomyces_cerevisiae_R64-4-1_20230830.gff that you downloaded for the first homework as a table (hint: use the pd.read_csv() function). Note that the first column is the chromosome, not the sequence ID as in other GFF files.

Count the number of genes on chromosome II using dataframe filtering and manipulation functions like you've seen in the tutorial. Please have your code produce a numerical result.

If you need help, try taking a look at the Pandas cheat sheet or searching for answers on StackOverflow (or other coding help sites). Being able to combine key search terms in a search engine in order to find the information you're looking for is a very important skill to have.

Task 2: Compute average gene length by chromosome in the S288C yeast strain [1 pts]¶

Using the same dataframe you used in the Task 1, calculate the average gene length per chromosome. Remember to inspect your data, to see what you're working with. Please use the Pandas dataframe manipulation methods we've covered in the tutorial, as much as you can.

Task 3: Extract gene IDs from the S288C yeast GFF file and look for some genes [2 pts]¶

Using the S288C GFF file again, you may have noticed the attributes column contains a lot of information, some of which is quite important, like the gene ID.

Here's a function that extracts the gene ID from that column:

In [2]:

def extract_gene_id_from_attributes_s288c(df_name): 
    # this function requires the argument of the GFF dataframe name be given to it
    # it retrieves the IDs in a Pandas series from the attributes column of the GFF file
    # first save the dataframe to a temporary version to work with
    tmp_df = df_name.copy()
    # split the contents of the attributes column into a list using the semicolons as delimiters
    tmp_series = tmp_df.attributes.str.split(";")
    # overwrite the contents of the attributes column into lists instead of strings 
    tmp_df["attributes"] = tmp_series
    # extract the first element of each list into a new series
    tmp_series = tmp_df.attributes.str[0]
    # split that item into a new list and save only the second element to the series
    tmp_series = tmp_series.str.split("=").str[1]
    # the series containing the gene IDs will be returned
    return tmp_series

a. Using the table manipulation functions, create a new column gene_id in the GFF table. You should find clues for how to do this in the last exercise of the tutorial. [1 pt]

b. Write code (using the table manipulation functions) that 1) checks whether the following gene IDs are present in the table and 2) only prints the chromosomes on which these are found. [1 pt]

Hint: For 1), check out the isin() function though it's not the only way.
(Obs: only 0.5 points if you have a separate copy of the code for each gene ID :) )

YJR117W
YKL156C-A
YOL110W
YJR104C
YJL111W
YXYZZY
YNOPE

Task 4: Find top 5 longest genes in the S288C yeast [1 pt]¶

Compute gene lengths as before, but this time simply get the highest 5 values. There are several ways to do this using the basic table manipulation functions you've seen in the tutorial. If you're having trouble with those, maybe there's something else you could use from the Pandas cheat sheet or searching for answers on StackOverflow.

Task 5: Calculations per gene in the Y55 strain [2 pts]¶

The yeast Y55 strain FASTA: Y55_JRIF00000000_SGD_cds.fsa should also already be downloaded in HW1 so you can just copy it to the directory for HW2. If you are doing these exercises out of order, note that the file can be obtained from here: http://sgd-archive.yeastgenome.org/sequence/strains/Y55/Y55_SGD_2015_JRIF00000000/archive/Y55_JRIF00000000_SGD_cds.fsa.gz.

As a demo of Biopython, let's compute the GC content per gene. To do this you first need to load in the .fsa file with the Biopython's SeqIO module. Write that small bit of code to load the .fsa file, since you'll need it, plus the code below to solve this task.

To compute the GC contents you can use a function called GC() from the Biopython SeqUtils module, which you can load like so: from Bio.SeqUtils import GC

You can run the command directly on any given sequence in a FASTA file like so:

gc_content = GC(sequence)

You can use SeqIO in combination with the code snippet above to calculate the GC content for each gene, by running the code in a for loop, as shown below:

In [3]:

%%script false --no-raise-error

# solving the GC content
# ref: https://github.com/vappiah/Python-Bioinformatics-Hacks/blob/main/Notebooks/GC_content.ipynb
# ref: https://biopython.org/wiki/SeqIO
# ref: https://bioinformatics.stackexchange.com/questions/307/changing-the-record-id-in-a-fasta-file-using-biopython

input_fasta = "Y55_JRIF00000000_SGD_cds.fsa"

# as before, let's use a counter to prevent printing all items
counter = 0

for record in SeqIO.parse(input_fasta, "fasta"):
    # use SeqIO to open the FASTA file
    if counter <=10: 
    # print onlyfirst few entries
        seq_gc_content = GC(record.seq)
        # print the results
        print("For sequence " + record.description + ", the GC content is " + str(round(seq_gc_content, 3)) + ".")
        # now increase the counter value by 1
        counter += 1

But what happens if we want to compute something else not supported by Biopython? In that case, the simplest solution may be to convert the FASTA file's contents into a Pandas dataframe.

So let's convert it to a dataframe instead using the DataFrame() function, setting the columns as gene_info and sequence:

In [4]:

%%script false --no-raise-error

# create a Pandas dataframe from the FASTA file
# ref: https://stackoverflow.com/questions/19436789/biopython-seqio-to-pandas-dataframe
# ref: https://stackoverflow.com/questions/69696077/i-want-to-parse-sequences-and-sequence-ids-from-a-fasta-file-and-assign-them-to

# create two empty lists for the sequence IDs and sequences
id_list = []
sequence_list = []

with open(input_fasta) as fasta_file: 
    # open the file
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):
        # parse the contents of the FASTA file
        # and append the relevant information to the lists
        id_list.append(seq_record.description)
        sequence_list.append(str(seq_record.seq))

# generate a new Pandas dataframe from the two lists
# ref: https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/
genes_y55_df = pd.DataFrame(list(zip(id_list, sequence_list)), columns = ['gene_info', 'sequence'])
genes_y55_df

Here's a function to extract the gene ID from the gene_info fields:

In [5]:

# function to extract the gene IDs from the gene_info columns

def extract_gene_id_from_names_y55(gene_info_col):
    # define the function name & set internal argument
    return gene_info_col.split("|")[3]
    # split the string into a list using pipes ("|") as the delimiters
    # then take the 4th element in the list (Pythonic index 3)
    # return that value to outside of the function

a. Use the function to add a gene_id column. Think about how you might apply a function to a dataframe. (Note that the code here should look different from the code you used to generate the gene_id column in Task 3.) [1 pt] Hint for cleaning up: you can delete a column from a Pandas dataframe by dropping it: pd.DataFrame.drop

b. Rare codons are biologically significant (here's an example, if you're curious). In yeast, the rarest codon (besides stop codons) is CGG (mRNA) [source]. So, using dataframe manipulation functions on the resulting dataframe from a., create a column to count the occurrence of GCC (DNA) per gene.

Task 6: Search for a motif in all genes [1 pt]¶

Filter the S288C_reference_sequence_R64-4-1_20230830.fsa file (which, if you haven't already, you can download from here: https://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz) for those genes that contain the sequence motif "CCACACCACACCCAC".

Then convert the hits into a new dataframe (hint: Python & Pandas often don't like operations on variables that are copies of each other, so consider using the .copy() function).

Which chromosome has the most hits? (Hint: try adapting the code above to create a new list to add to the Pandas dataframe, which contains information on the chromosome IDs of each FASTA record.)

Write the code solution using dataframe manipulation functions. Your code should produce only the chromosome number with the highest motif count (i.e. not just printing the table of all chromosomes). Remember that the column "group" means "chromosome" since the input FASTA file simply has full chromosomes as entries, not individual genes ("group_name" is irrelevant here).