README - Project Container

Author: Vi Varga

Last Modified: 19.02.2024

Introduction

This README.ipynb file is a brief guide to using the container that has been prepared for BBT045 students to use for their projects. A README.md file containing all of the markdown-formatted text will also be provided.

The bbt045-projects.sif container has been created so that students can run their projects without having to install their own software on the Vera cluster. Please contact the teaching staff (especially Vi) if you would like access to a program that is not included in the container, or if something malfunctions.

If you do not have it already, you can download this information in Jupyter Notebook format from here.

Installed software

The full list of programs installed in the bbt045-projects.sif container can be found in the bbt045-projects.yml and conda_environment_args_proj.def files included in the same directory as the container (/cephyr/NOBACKUP/groups/bbt045_2024/ProjectSoftware/). Below is a list of the most important software:

  • FastQC
  • TrimGalore!
  • Trimmomatic
  • MetaCompass
  • SPAdes (including metaSPAdes)
  • Prokka
  • CD-HIT
  • MetaPhlAn2
  • Bowtie2
  • Python
    • Biopython
    • Jupyter
    • matplotlib, seaborn
    • numpy, pandas
    • scipy
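
To confirm which of these tools are visible from inside a running notebook, a quick check along the following lines may help. This is a minimal sketch, not part of the course material: the executable names (for example trim_galore, cd-hit, metaphlan2.py) and the import name Bio for Biopython are assumptions based on how the conda packages are usually named, not taken from the container definition. MetaCompass is left out because it is called by its full path rather than found on the PATH.

```python
import importlib.util
import shutil

# Assumed executable names; adjust if the container uses different ones.
TOOLS = ["fastqc", "trim_galore", "trimmomatic", "spades.py", "metaspades.py",
         "prokka", "cd-hit", "metaphlan2.py", "bowtie2"]

# Assumed import names for the Python packages listed above.
PACKAGES = ["Bio", "matplotlib", "seaborn", "numpy", "pandas", "scipy"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


def find_packages(names):
    """Map each top-level package name to True if it can be imported, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name:15s} {path or 'NOT FOUND'}")
    for name, ok in find_packages(PACKAGES).items():
        print(f"{name:15s} {'ok' if ok else 'NOT FOUND'}")
```

A "NOT FOUND" line does not necessarily mean the program is missing; it may simply be installed under a different executable name than the one guessed here.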

Using the container

To use the bbt045-projects.sif container, please use the run_jupyter_proj.sh script found in the same directory as the container, and modify the time requirement and ID as you have done for the run_jupyter.sh script before. Alternatively, you can continue using your copy of the run_jupyter.sh script, and simply change the path to the container to read:

container=/cephyr/NOBACKUP/groups/bbt045_2024/ProjectSoftware/bbt045-projects.sif

Of the programs mentioned above, all but MetaCompass have been installed using conda. All programs installed via conda can be run directly from within your Jupyter Notebook, like so:

In [ ]:
! metaspades.py -h
SPAdes genome assembler v3.15.5 [metaSPAdes mode]

Usage: spades.py [options] -o <output_dir>

Basic options:
  -o <output_dir>             directory to store all the resulting files (required)
  --iontorrent                this flag is required for IonTorrent data
  --test                      runs SPAdes on toy dataset
  -h, --help                  prints this usage message
  -v, --version               prints version

Input data:
  --12 <filename>             file with interlaced forward and reverse paired-end reads
  -1 <filename>               file with forward paired-end reads
  -2 <filename>               file with reverse paired-end reads
  -s <filename>               file with unpaired reads
  --merged <filename>         file with merged forward and reverse paired-end reads
  --pe-12 <#> <filename>      file with interlaced reads for paired-end library number <#>.
                              Older deprecated syntax is -pe<#>-12 <filename>
  --pe-1 <#> <filename>       file with forward reads for paired-end library number <#>.
                              Older deprecated syntax is -pe<#>-1 <filename>
  --pe-2 <#> <filename>       file with reverse reads for paired-end library number <#>.
                              Older deprecated syntax is -pe<#>-2 <filename>
  --pe-s <#> <filename>       file with unpaired reads for paired-end library number <#>.
                              Older deprecated syntax is -pe<#>-s <filename>
  --pe-m <#> <filename>       file with merged reads for paired-end library number <#>.
                              Older deprecated syntax is -pe<#>-m <filename>
  --pe-or <#> <or>            orientation of reads for paired-end library number <#> 
                              (<or> = fr, rf, ff).
                              Older deprecated syntax is -pe<#>-<or>
  --s <#> <filename>          file with unpaired reads for single reads library number <#>.
                              Older deprecated syntax is --s<#> <filename>
  --pacbio <filename>         file with PacBio reads
  --nanopore <filename>       file with Nanopore reads

Pipeline options:
  --only-error-correction     runs only read error correction (without assembling)
  --only-assembler            runs only assembling (without read error correction)
  --checkpoints <last or all>
                              save intermediate check-points ('last', 'all')
  --continue                  continue run from the last available check-point (only -o should be specified)
  --restart-from <cp>         restart run with updated options and from the specified check-point
                              ('ec', 'as', 'k<int>', 'mc', 'last')
  --disable-gzip-output       forces error correction not to compress the corrected reads
  --disable-rr                disables repeat resolution stage of assembling

Advanced options:
  --dataset <filename>        file with dataset description in YAML format
  -t <int>, --threads <int>   number of threads. [default: 16]
  -m <int>, --memory <int>    RAM limit for SPAdes in Gb (terminates if exceeded). [default: 250]
  --tmp-dir <dirname>         directory for temporary files. [default: <output_dir>/tmp]
  -k <int> [<int> ...]        list of k-mer sizes (must be odd and less than 128)
                              [default: 'auto']
  --phred-offset <33 or 64>   PHRED quality offset in the input reads (33 or 64),
                              [default: auto-detect]
  --custom-hmms <dirname>     directory with custom hmms that replace default ones,
                              [default: None]
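
Building on the help text above, a paired-end metaSPAdes run could be assembled from a Python cell roughly as follows. This is a sketch under stated assumptions: the file names are hypothetical placeholders, and the thread and memory values are examples only and should match what your job has actually been allocated, since the defaults listed above (16 threads, 250 Gb) may exceed it. Only flags that appear in the help output (-1, -2, -o, -t, -m) are used.

```python
import shlex
import subprocess


def metaspades_command(forward, reverse, outdir, threads=8, memory_gb=32):
    """Build a metaSPAdes command list using flags from the help text above."""
    return [
        "metaspades.py",
        "-1", str(forward),    # forward paired-end reads
        "-2", str(reverse),    # reverse paired-end reads
        "-o", str(outdir),     # output directory (required)
        "-t", str(threads),    # number of threads
        "-m", str(memory_gb),  # RAM limit in Gb
    ]


# Hypothetical file names, for illustration only.
cmd = metaspades_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "metaspades_out")

# Print the command as it would be typed in a shell, to check it before running.
print(shlex.join(cmd))

# Uncomment to actually run the assembly (this can take a long time):
# subprocess.run(cmd, check=True)
```

Building the command as a list and printing it first makes it easy to inspect before committing to a long-running job; the same command can equally be run with the ! prefix shown earlier.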

MetaCompass does not have a conda package available, so it has been installed in the container from source. In order to use it, you must call the program using the full path to the executable, like so:

In [ ]:
! /opt/MetaCompass-2.0-beta/go_metacompass.py -h
MetaCompass metagenome assembler version 2.0.0 by Victoria Cepeda (vcepeda@cs.umd.edu)

usage: go_metacompass.py [-h] [-c [CONFIG]] [-1 [FORWARD]] [-2 [REVERSE]]
                         [-U [UNPAIRED]] [-r [REF]] [-s [REFSEL]]
                         [-p [PICKREF]] [-m [MINCOV]] [-g [MINCTGLEN]]
                         [-l [READLEN]] [-b] -o [OUTDIR] [-k] [-t [THREADS]]
                         -y [MEMORY] [--Force] [--unlock] [--nolock]
                         [--verbose] [--reason] [--dryrun]

snakemake and metacompass params

options:
  -h, --help            show this help message and exit

required:
  -c [CONFIG], --config [CONFIG]
                        config (json) file, set read length etc
  -1 [FORWARD], --forward [FORWARD]
                        Provide comma separated list of forward paired-end
                        reads
  -2 [REVERSE], --reverse [REVERSE]
                        Provide comma separated list of reverse paired-end
                        reads
  -U [UNPAIRED], --unpaired [UNPAIRED]
                        Provide comma separated list of unpaired reads
                        (r1.fq,r2.fq,r3.fq)

metacompass:
  -r [REF], --ref [REF]
                        reference genomes
  -s [REFSEL], --refsel [REFSEL]
                        reference selection [tax/all]
  -p [PICKREF], --pickref [PICKREF]
                        depth or breadth
  -m [MINCOV], --mincov [MINCOV]
                        min coverage to assemble
  -g [MINCTGLEN], --minctglen [MINCTGLEN]
                        min contig length
  -l [READLEN], --readlen [READLEN]
                        max read length

output:
  -b, --clobber         clobber output directory (if exists?)
  -o [OUTDIR], --outdir [OUTDIR]
                        output directory? (cwd default)
  -k, --keepoutput      keep all output generated (default is to delete all
                        but final fasta files)

performance:
  -t [THREADS], --threads [THREADS]
                        num threads
  -y [MEMORY], --memory [MEMORY]
                        memory

snakemake:
  --Force               force snakemake to rerun
  --unlock              unlock snakemake locks
  --nolock              remove stale locks
  --verbose             verbose
  --reason              reason
  --dryrun              dryrun
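
As with metaSPAdes, the MetaCompass call can be put together from a Python cell. The sketch below uses only the full path given above and flags listed in the help output (-1, -2, -o, -y, -r, -t). The file names are hypothetical, and the memory value is a placeholder: the help text does not state the unit for -y, so check the MetaCompass documentation before choosing one.

```python
import shlex

# Full path to the executable inside the container, as given above.
METACOMPASS = "/opt/MetaCompass-2.0-beta/go_metacompass.py"


def metacompass_command(forward, reverse, outdir, memory, reference=None, threads=None):
    """Build a MetaCompass command list using flags from the help text above."""
    cmd = [
        METACOMPASS,
        "-1", str(forward),  # forward paired-end reads (comma-separated list)
        "-2", str(reverse),  # reverse paired-end reads (comma-separated list)
        "-o", str(outdir),   # output directory
        "-y", str(memory),   # memory (unit not stated in the help text)
    ]
    if reference is not None:
        cmd += ["-r", str(reference)]  # reference genomes
    if threads is not None:
        cmd += ["-t", str(threads)]    # number of threads
    return cmd


# Hypothetical inputs, for illustration only.
cmd = metacompass_command("sample_R1.fastq", "sample_R2.fastq", "metacompass_out",
                          memory=32, reference="references.fasta", threads=8)
print(shlex.join(cmd))
```

The printed line can be pasted after a ! in a notebook cell, or passed to subprocess.run(cmd, check=True) once the inputs have been checked.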

And of course, you can run Python code directly from within the code cells of your Jupyter Notebook, like so:

In [ ]:
print("Jupyter Notebook cells are Python cells by default.")
Jupyter Notebook cells are Python cells by default.