Basic Unix operations for bioinformatics
0.1. How to connect to remote accounts
See here.
0.2. Keep track of what you’re doing
Create a text file called unix_exercises_FIRSTNAME_LASTNAME.txt
on your own computer or directly on the server (replacing first and second name with yours)
and keep it open in a text editor (not Word).
In there, paste the commands you use for the tasks below, e.g.
# Contents of current directory
<command goes here>
Output:
file1 file2 ...
Use whatever format you like, but try to be consistent.
Why
It’s a good idea to save the list of commands you use, along with comments about what you were doing and what results you got. This will help trace your steps. The terminal does keep a history, but it’s limited and you may want to reproduce results on a different computer.
Later on, you can rerun commands by pasting them into the terminal. Even later on, you can change this file into a command script (similar to a program in R, MATLAB, etc) that runs all the commands inside.
Use whatever text editor you prefer. If you’re not sure, some good ones are Atom, Sublime, or Notepad++, and gedit (on Linux). You can even use Notepad or TextEdit, but these are a pain for any non-trivial editing.
Avoid Word though. It insists on replacing some characters with visually similar versions that are not understood by the Unix commands.
If you want to work directly on the server, you can connect from a second terminal window.
In one terminal window you can keep an open nano
, where you write down your solutions.
And in the other you would test out your commands.
Alternatively, you can write everyting on your laptop, then later paste the text into nano on the server.
Keep in mind that the homework should be handed in from the server however (more on this in the Git tutorial).
0.3. Quick note on command syntax for beginners
Keep in mind that spelling and spacing is crucially important. Generally, commands have the structure
command -option --option-with-long-name ARGUMENTS
Note the spaces separating the different elements. Generally it’s not important how many spaces there are.
Compare this with how you would write a command in a language like R or MATLAB:
command(option, option_with_long_name, ARGUMENTS)
- Arguments are the things on which you issue the command on (e.g. files, directories). E.g.
fix all_problems.txt
(not a real command :) ) - Options or “flags” are parameters that control the command’s behaviour.
Multiple options can usually be written together if they don’t require
values. E.g.
command -abcd -e 3
(-e
is not chained since it takes the numerical value). Going back to our comparison with other languages, something like this would look likecommand(a, b, c, d, e = 3)
in e.g. R or Python. - Long options names are often alternatives for one-letter options, to ease remembering/readability. Some options only have long names (usually the infrequently used).
You can get information on the various options by running command --help
(will print info in the terminal) or man command
(will load the manual page - move up/down with arrow keys, exit by pressing q
)
To issue a previous command, press the up arrow key to scroll back through the command history until you reach it. Press down to get back to the “present”
Some words of caution
- Unix environments (including macOS) are case sensitive, meaning
command
andCommand
are different things. Most commands have lower case names. - Careful not to separate an option from its
-
sign. The command will understand it got 2 separate options. - Also be careful that you actually use a minus sign (like you have on the keyboard). Some text processors like Word replace minus with a hyphen (a longer horizontal bar). While they look the same, programs ultimately only care about the numerical encoding of these symbols, so they will complain.
0.4. Managing directories
As you may have noticed there is no specific directories other then your home directory when you first log on to your server account. You thus need to create the directories that you want to structure your projects in.
You can create a new directory with the command mkdir [Directory-name]
If you for example is in your home directory and want to create the Documents directory you would run the command mkdir Documents
If you at some point make a mistake or for other reasons want to rename either a file or a directory you can do that with the command mv [Origin-File] [Target-File]. Thus, if you want to rename File1.txt to File2.txt you would type `mv File1.txt File2.txt
The mv command can also be used to move files or directories to new locations, mv [Original-location] [Target-location].
For instance, if file1.txt is located in your home directory and you want to move it to your newly created directory Documents
You would use the command mv file1.txt Documents/file1.txt
If you want to delete files or directories you can use the command rm [File]. If you want to delete a directory you need to include the option -r. the -r option will make rm delete all the files and other directories recursively in the specified directory.
Note There is no “bin” directory that deleted files will go to as a first deletion like in windows and MacOS when using rm.
1. Managing files
Note Here, we’ll use ALL_CAPS words to mean placeholders for the actual things to write.
- Start off by fetching a copy of the data we’ll be working with.
Run these commands one at a time (i.e. a line at a time).
Just paste these at the command prompt (the $
sign).
Everything following #
on a line is a comment to the user and has no effect.
You don’t need to actually run those.
# Download a reference yeast genome as compressed archive to the current directory
wget https://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz
# Decompress the archive (will create a new diretory)
tar -xzvf S288C_reference_genome_Current_Release.tgz
This will download and extract the data to a directory called
S288C_reference_genome_R64-2-1_20150113
-
List the contents of the current directory. Command:
ls
-
List the files in the newly created directory. Command:
ls DIRECTORY
. Try also using thels -l DIRECTORY
. It will give a detailed vertical list of directory contents. -
Rename the
S288C_reference_genome_R64-2-1_20150113
todata
. Command:mv SOURCE DESTINATION
One “moves” the directory to the same place but with a new name. -
See where you are on the system by running
pwd
(“print working directory”). The tilde~
you see next to the$
sign is shorthand for your home directory. -
Move your location to the
data
directory withcd data
(Hint: as you start typing the name, press[Tab]
to make the terminal complete the name for you; if multiple files share the first letters, only the common first part will be suggested and you need to continue typing). Confirm where you are by runningpwd
. -
Inspect the contents of file
rna_coding_R64-2-1_20150113.fasta
. It’s quite large, so you should useless FILENAME
to view it. Press the arrow keys to move up/down andq
to exit. -
Move back outside of
data
withcd ..
and make a directory calledresults
-
Create an empty file inside it called
counts.txt
. One quick (and safe) way istouch results/counts.txt
This command normally just changes the time the file was last accessed (“touched”) but will crate an empty file if it doesn’t exist. -
Actually, let’s timestamp the
results
directory for posterity. Rename it toresults_2020_01_28
2. Analyzing files
-
Count the lines in
orf_coding_all_R64-2-1_20150113.fasta
andorf_trans_all_R64-2-1_20150113.fasta
(Hint: thewc
command, runman wc
) -
Count how many genes are in
orf_coding_all_R64-2-1_20150113.fasta
. Note that this counting relies on the structure of the file only (i.e. no biological interpretation of the data is needed). Usegrep
and/orwc
, seeman grep
andman wc
(or Google).grep
is a standard (and very powerful) text search tool. The general usage isgrep "TEXT_PATTERN" FILE
. Note! Be very careful to use the quotation marks, to make suregrep
interprets the pattern as text only, not as another command. -
Save the number of genes in
results_2020_01_28/counts.txt
Use eithernano
to manually edit (see section below) or use the redirect symbol>
to save the output of the previous command to the file, likecount_genes_command > results_2019_01_22/counts.txt
If you do the latter, you’ll no longer also get output to the screen (this is on purpose). -
Count how many
tRNA
genes are inrna_coding_R64-2-1_20150113.fasta
(Hint:grep
) -
Save the results of command 4 to
counts.txt
by appending to the file using the>>
(double>
) redirect symbol, similar to before. Make sure to not replace existing content. Or usenano
-
Inspect that
counts.txt
has the content you expect. It’s a small file so runcat results_2020_01_28/counts.txt
. -
Make a copy of
counts.txt
calledcounts.txt.orig
(it should also be placed in theresults_2020_01_28
directory)
How to use the nano
editor
nano
is a very simple text editor, available on most Unix distributions.
To open a file with it simply run nano the_file.txt
.
If the file does not exist, an empty one will be created.
The editor will run “full screen”, hiding the command line.
- To save, press
Ctrl-o
, then[Enter]
. - To exit, press
Ctrl-x
3. Working with text columns
We’ll be using another data set with gene expression data.
Move back to your home directory (cd
with no argument).
Make a new directory for this exercise and change location to it.
# gene experssion data (fold change)
wget https://web.archive.org/web/20130211035221/http://www.cbs.dtu.dk/courses/27619/ex1.dat
# annotation
wget https://web.archive.org/web/20170706124217/http://www.cbs.dtu.dk/courses/27619/ex1.acc
The idea is to merge experimental result ex1.dat
with the annotations from
ex1.acc
Do this with the paste command:
paste ex1.acc ex1.dat
What did the command do?
Inspect both files and compare with the output you got from paste
4. Counting yeast-human orthologs
First, move back to your home directory (cd
with no argument), then
make a new directory for this exercise and change location to it.
Then download and decompress the EggNOG list of orthologous groups across Eukaryotes,
as well as a table of taxonomic IDs:
wget http://eggnog5.embl.de/download/eggnog_5.0/per_tax_level/2759/2759_members.tsv.gz
gunzip 2759_members.tsv.gz
wget http://eggnog5.embl.de/download/eggnog_5.0/e5.taxid_info.tsv
The file 2759_members.tsv
contains 6 columns (you can view it with less -S
).
Each row is an orthologous group.
The 2nd column is the group ID and the 5th column contains the list of protein IDs contained in that group.
The format of each protein ID is Taxon_ID.Sequence_ID
(note the dot).
We want to find all the groups that contain genes in both humans and yeast. There are multiple ways to do this. One way is to separately search for yeast and then human and intersect the results.
The TaxonID for H. sapiens == 9606
and for Saccharomyces cerevisiae == 4932
.
We find this out by grepping in the taxonomic information table e5.taxid_info.tsv
for “cerevisiae” and “sapiens”.
For each of the two organisms, the process consists of:
- filtering in the lines that contain proteins found in the organism
- take that result and extract the orthologous group IDs for each of those lines
- sort this list (relevant later)
- save it to a file
To achieve this micropipeline, we shall chain a few commands using
the pipe (|
) redirector, which simply takes the output of one command and feeds
it as input to the next.
The structure is filter "TaxonID." | cut out column 2 | sort > RESULTS_FILE
.
The actual commands for this are:
grep "4932." 2759_members.tsv | cut -f 2 | sort > groups_scerevisiae.txt
grep "9606." 2759_members.tsv | cut -f 2 | sort > groups_hsapiens.txt
Quickly inspect what a result file looks like: less -S groups_hsapiens.txt
(the -S
option disables word wrap - press left/right arrow keys to scroll horizontally).
Question Why did we use the dot in the filtering?
Now time to produce the intersection of these two lists, since we want the groups that contain both human and yeast.
The comm
command returns:
- the elements only present in list 1
- the elements only present in list 2
- the common part of two sorted lists
We only care about the size of the intersectin, so we run
comm -12 groups_hsapiens.txt groups_scerevisiae.txt | wc -l
where -12
inhibits the first 2 results and we pipe to wc -l
(only counts lines).
Bonus One can get fancier, avoid intermediate files, and do all of the above in one go:
comm -12 <(grep "4932." 2759_members.tsv | cut -f 2 | sort) <(grep "9606." 2759_members.tsv | cut -f 2 | sort) | wc -l
Here we’re feeding comm
with two ad-hoc data streams as if they were files.
The pattern here is <(command that yields text output)
which replaces the result files.
5. Sequence Processing
Produce the complementary RNA sequence of a DNA string
For replacing single characters, one could use the tr
(translate) tool like:
tr "A" "C" < FILE.fasta | less
This will replace all occruences of “A” with “C” in FILE.fasta and shows
contents on screen with less
.
We add the <
before the file name since tr
expects the actual contents, not
the file name. The <
redirector feeds the content as a stream to tr
.
Normally, tr
will print results to screen. So we redirect that output and
pipe it (|
) to less
which will only show you a screen at a time. (exit by pressing q
)
While useful for one-shot substitutions, this tool is a bit limited.
If we were to use multiple tr
runs for each nucleotide, given the complementary nature
of the pairs, we would revert previous substitutions. We can’t do it in one go with tr
either.
We’re instead going to use a standard Unix tool called sed
(stream editor).
Like the name implies, it takes either a file or the output of another command
and processes the text according to various rules.
These rules can be very complex (it’s a full programming language),
but here’s a simple use for it:
sed 'y/ACGT/UGCA/' FILE > FILE_RNA
The sed command (between quotes) substitutes (/y
)
each letter in the first list (ACGT
)
with each corresponding letter in the second (UGCA
),
individually (so A
-> U
, C
-> G
, etc.).
This command is applied on FILE and, since sed
outputs to screen by default,
we redirect the result to a new file.
We can also make sed skip FASTA header lines with a slightly more complex (and less understandable) command:
sed '/^>/!y/ACGT/UGCA/' FILE.fasta > FILE_RNA.fasta
The first part (/^>/!
) selects locations in the file. The exclamation mark
makes it a negative, in the sense “please de-select these lines and don’t process them”.
The hat ^
means beginning of line, so the inner part simply says lines starting with >
which is how the FASTA header is marked.
Task Try this out on
S288C_reference_genome_R64-2-1_20150113/orf_coding_all_R64-2-1_20150113.fasta
6. Bonus: Useful Commands
These are not required for the homework but it’s highly advised to be aware of them. You can read about them online or by reading their documentation.
man
(manual) gives the documentation for a given command (e.g.man ls
)-
most commands accept a
--help
and/or-h
flag that gives brief instructions on using it scp
andrsync
are commands that allow you to transfer files between computers. The second one is recommended (more flexible and generally has better performance)pushd
andpopd
are handy alternatives tocd
. They manage a list of visited locations:pushd DIR
takes you to that directory andpopd
(no argument) brings you back to your original location. You can usepushd
however many times andpopd
will always be a “step back”find
is the command to use if searching for files by name. The syntax is a bit different than other tools:find DIR -name PATTERN
-
ls -larth
is a useful way to use the list command:-l
gives a detailed list,-a
shows all files, even hidden ones,-h
shows file sizes in human-readable format (i.e. ‘KB’, ‘MB’ instead of bytes),-t
sorts by file modification time,-r
is reveresed order, so that the most recent files are at the bottom, closest to your prompt (useful for long file lists) -
column
allows displaying tablular files in nice, aligned columns, regardless of how columns are separated in the file (i.e. by,
,;
, tab). It’s however slower than using processing commands likecut
,sed
etc. so it’s recommended for small files or onhead
/tail
of files [Ctrl]-R
enters reverse history search mode. Just start typing something close to a command you remember using before. The search will fuzzy match in the command history. Press[Ctrl]-R
again to go to the next match (previously in time). Arrow keys allow you to edit the command and[Enter]
to run it, like usual.
It’s also recommended to start using a more advanced text editor, either vi
or emacs
, and get reasonably comfortable with it. These have a steeper learning curve than nano
but have high productivity gains, as they allow for advanced text processing commands and scripting. There are plenty of online tutorials for these, so try both out and see what makes more sense for you. vi
is more likely to be installed on any given system, but its modal way of working (view mode/edit mode) can be confusing.
You can use these quick command references:
- http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/
- https://www.git-tower.com/blog/command-line-cheat-sheet/
7. Homework
The homework is posted here
Recommended Reading
- The Unix Shell from Software Carpentry Foundation
- The Philosophy of UNIX Tools - some motivation for command line work
- Perl and Unix for Bioinformaticians course from DTU (initial inspiration for this exercise)
Chalmers University of Technology 2019