
Homework 1: Introduction to Unix for bioinformatics

Notes:

Tasks:

  1. How many protein-coding genes are on chromosome II in saccharomyces_cerevisiae_R64-5-1_20240529.gff?
    • Here we’re interested in protein-coding genes, which are recorded with the feature type “gene” (the third column of the GFF file). You can ignore other feature types such as “tRNA_gene”.
  2. Calculate the GC-content in yeast strain S288C. The fasta file is S288C_reference_sequence_R64-5-1_20240529.fsa.
    • The GC-content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either G or C.
    • You can calculate the percentage by hand or using Unix tools like expr or bc. The important thing is to get the base counts.
    • Note that sequence files may contain characters like “N” (any nucleotide, i.e. an unknown base) or lowercase letters. Ignore the “N”s and convert the lowercase letters to uppercase.
  3. Download and decompress the ORFs of another yeast strain (Y55) from: http://sgd-archive.yeastgenome.org/sequence/strains/Y55/Y55_SGD_2015_JRIF00000000/archive/Y55_JRIF00000000_SGD_cds.fsa.gz

    Then:

    1. Compare the GC content of the Y55 strain with the GC content of the S288C strain you calculated before.
      Which strain has the higher GC content?

    2. Compare the number of ORFs in the Y55 strain (all the entries in the Y55_JRIF00000000_SGD_cds.fsa file, since it only contains ORFs) with the number of ORFs in the S288C reference genome (you should have downloaded the file orf_coding_all_R64-5-1_20240529.fasta during the tutorial).

    3. Count the common ORFs between the Y55 and the S288C strains.

      • You can download the file http://sgd-archive.yeastgenome.org/sequence/strains/Y55/Y55_SGD_2015_JRIF00000000/Y55.README to see what information the headers of the Y55 ORFs contain and how they are formatted.
      • Hint 1: You will need to extract the ORF names for both strains. The ORF name is usually the first field in the FASTA header, but it’s always good to double-check!
      • Hint 2: As you can see in the Y55.README file, the ORF names in the Y55_JRIF00000000_SGD_cds.fsa file contain the strain name. It might be a good idea to remove it.
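
The tasks above can be approached in many ways; the following is only a minimal sketch using standard Unix tools, not a reference solution. The function names (count_genes, gc_content, count_entries, orf_names, count_common) are hypothetical helpers made up for illustration. The sketch assumes that the GFF file is tab-separated with the chromosome name in column 1 and the feature type in column 3, and that the strain tag to strip from the Y55 ORF names is passed in by you after checking the Y55.README (the "_Y55" value mentioned in the comments is an assumption, not a confirmed format).

```shell
# Hypothetical helper functions sketching each task; names are illustrative.

# Task 1: count "gene" features on one chromosome of a GFF file.
# Assumes tab-separated columns with the sequence ID (e.g. "chrII") in
# column 1 and the feature type in column 3.
count_genes() {
    awk -F'\t' -v chrom="$2" \
        '$1 == chrom && $3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Task 2: GC content (percent) of a FASTA file.
# Header lines are skipped, lowercase is folded to uppercase, and only
# A, C, G and T are counted, so "N" and any other symbols are ignored.
gc_content() {
    gc=$(grep -v '^>' "$1" | tr '[:lower:]' '[:upper:]' | tr -cd 'GC' | wc -c)
    total=$(grep -v '^>' "$1" | tr '[:lower:]' '[:upper:]' | tr -cd 'ACGT' | wc -c)
    awk -v gc="$gc" -v total="$total" 'BEGIN { printf "%.2f\n", 100 * gc / total }'
}

# Task 3.2: number of entries (header lines) in a FASTA file.
count_entries() {
    grep -c '^>' "$1"
}

# Task 3.3, step 1: list the unique ORF names in a FASTA file.
# The name is taken as the first whitespace-separated field of each header.
# $2 is an optional literal string to remove from each name, for example an
# assumed strain tag such as "_Y55" (check the README for the real format).
orf_names() {
    grep '^>' "$1" | awk -v strip="${2:-}" '{
        name = substr($1, 2)                    # drop the leading ">"
        if (strip != "") {
            i = index(name, strip)              # first literal occurrence
            if (i > 0)
                name = substr(name, 1, i - 1) substr(name, i + length(strip))
        }
        print name
    }' | sort -u
}

# Task 3.3, step 2: count names present in both files.
# Each list is already unique, so a name printed by "uniq -d" occurs in both.
# $3 is the string to strip from the names of the first file only.
count_common() {
    { orf_names "$1" "${3:-}"; orf_names "$2" ""; } | sort | uniq -d |
        awk 'END { print NR }'
}
```

With the functions loaded in your shell, task 1 could be attempted as: count_genes saccharomyces_cerevisiae_R64-5-1_20240529.gff chrII, provided chromosome II is really labelled chrII in the first column of your file. Counting with awk rather than wc -l avoids the leading spaces that some wc implementations print, which makes the result easier to reuse in expr or bc.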