
Basic Linux introduction

Curitiba - 2019

Main objectives

1 - Connect to a server and work in bash in a command-line window.
2 - Understand the Linux file system.
3 - Basic useful commands: move around the file system; copy, rename and move files.
4 - First steps, using the basic commands. Retrieving and exploring data.
5 - Redirecting the output.
6 - Obtaining data from GEO - SRA.
7 - Frequently used commands and filters.

1- Connection
To open a terminal on a remote machine we use the ssh protocol.
If you are already on a Linux machine, type:

> ssh usuario@servidor.edu.uy

Example:

> ssh yourname1@10.41.48.11

For example, with the user name danielapavoni1.

The machine will respond by asking for the user's password.

On Windows you need an ssh application, such as SSHclient or PuTTY, or the Ubuntu
version available for Windows 10.

2. Files, File System, Moving across the directory tree (using the TERMINAL)

STRUCTURE OF DIRECTORIES (FOLDERS) AND FILES

In Unix, files are organized in directories (equivalent to Windows folders). This organization,
called the "file system", has a tree structure.
In Linux each file has permissions, an owner and a name. We can see this information with the
command ls (list) and its -l option.

File NAMES
There are two ways to refer to a file by name: the short (relative) form and the "absolute" form.
The first is simply the name. The second gives the location of the file in addition to its name.
For example, the file named hello.txt located in the rnaseq directory, which in turn is inside the
/home directory, has the absolute name:
/home/rnaseq/hello.txt

Note that there may be several files named hello.txt in different directories. For example:

/home/falvarez/hello.txt

These are two different files, although both have the same short name. We can only
refer to a file by its short name when we are in the same directory where the file is located. Also
keep in mind that names (and commands) are case sensitive, that is to say, uppercase
is different from lowercase (Hello.Txt is NOT the same as hello.txt).

The permissions are read (r), write (w) and execute (x). Only programs and scripts
should have execute permission.
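
As a minimal sketch of how permissions look in practice, the commands below create a throwaway file in a temporary directory and list it with ls -l before and after adding execute permission. The file name script.sh is made up for illustration, and chmod (a standard command for changing permissions) and mktemp are not covered elsewhere in this handout.

```shell
# Work in a throwaway directory so nothing in your home is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a tiny script; "script.sh" is just an example name.
echo 'echo hello' > script.sh

# Known starting point: the owner can read and write, everyone else can read.
chmod 644 script.sh
ls -l script.sh          # the first column shows -rw-r--r--

# Add execute permission for the owner only.
chmod u+x script.sh
ls -l script.sh          # the first column now starts with -rwx

# The first ten characters of the ls -l line are the permission string.
perms=$(ls -l script.sh | cut -c1-10)
echo "$perms"
```

The permission string reads as file type, then three r/w/x groups for the owner, the group and everybody else.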

3 Basic useful commands

ls      Lists the contents of the current directory
mkdir   Makes a new directory
mv      Moves (or renames) a file
cp      Copies a file
rm      Removes a file
cat     Concatenates files
more    Displays the contents of a file one page at a time
head    Displays the first ten lines of a file
tail    Displays the last ten lines of a file
cd      Changes current working directory
pwd     Prints working directory
find    Finds files matching an expression
grep    Searches a file for patterns
wc      Counts the lines, words, characters, and bytes in a file
kill    Stops a process
jobs    Lists the jobs started from the current shell
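
A few of these commands can be tried safely on a toy file. The sketch below is only an illustration: the file names are invented, and seq (which prints a run of numbers) and mktemp are standard tools that are not part of the list above.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

# Create a 12-line file: the numbers 1 to 12, one per line.
seq 1 12 > numbers.txt

head numbers.txt           # lines 1 to 10
tail -n 3 numbers.txt      # the last three lines
wc -l numbers.txt          # the line count followed by the file name

cp numbers.txt copy.txt    # copy the file
mv copy.txt renamed.txt    # rename (move) the copy
grep 1 renamed.txt         # lines that contain the character 1
rm renamed.txt             # delete the renamed copy

# Keep a few results in variables so they can be inspected.
n_lines=$(wc -l < numbers.txt)
n_with_1=$(grep -c 1 numbers.txt)
remaining=$(ls)
echo "$n_lines lines, $n_with_1 lines containing 1, files left: $remaining"
```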

The dot ".", double dot ".." and the forward slash "/"

(.) The dot refers to the current directory (where we are).
For example, hello.txt and ./hello.txt are the same thing.

(..) The double dot refers to the parent of the current directory (the directory directly above
the one we are in).
For example, if we are in /media2/course, the double dot refers to /media2.

The slash / is used to separate directories in an absolute name; the slash on its own indicates
the root directory.
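
The three symbols can be seen at work in the following sketch. It builds a small media2/course tree inside a temporary directory (so the real absolute path will start with the temporary location rather than with /media2) and reaches the same file by its short, relative and absolute names.

```shell
# Build <temporary dir>/media2/course with one file in it.
base=$(mktemp -d)
mkdir -p "$base/media2/course"
echo "hi" > "$base/media2/course/hello.txt"

cd "$base/media2/course"
cat hello.txt              # short name
cat ./hello.txt            # same file: "." is the current directory
here=$(pwd)

cd ..                      # ".." is the parent directory
parent=$(pwd)              # this path ends in /media2
cat course/hello.txt       # relative name, as seen from the parent
cat "$here/hello.txt"      # absolute name, works from any directory

echo "$parent"
```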

4 First steps

Copy the hello.txt file into your own directory:

cp /home/rnaseq/hello.txt ./
Or
cp ../rnaseq/hello.txt ./

List the contents of your current working directory:

ls

To list all files in your home directory including those whose names begin with a dot, type:

ls -a
As you can see, ls -a lists files that are normally hidden.

We will now make a subdirectory in your home directory to hold the files you will be creating
and using in the course of this tutorial.
To make a subdirectory called rawdata in your current working directory type:

mkdir rawdata

The command cd directory means: change the current working directory to 'directory'. The
current working directory may be thought of as the directory you are in, i.e. your current position
in the file-system tree. To change to the directory you have just made, type:

cd rawdata

Type ls to see the contents (the directory should be empty).

(..) means the parent of the current directory, so typing

cd ..

will take you one directory up the hierarchy (back to your home directory).

Try it now, then return to the rawdata folder.


PATH: Pathnames enable you to work out where you are in relation to the whole file system.
For example, to find out the absolute pathname of where you are, use the Print Working Directory
(pwd) command:

pwd

Retrieving data from the Internet


We will work with the common budding yeast Saccharomyces cerevisiae. We will need the
reference assembled genome available at the NCBI: https://www.ncbi.nlm.nih.gov/

Go to this address and search for Saccharomyces cerevisiae. Then, in the genome results,
click on the Genome link. The genome web page will appear, giving you access to download
the files to your own computer. But since we want to download the sequences onto the server, we
need to use the link together with the command wget. So copy the link (right mouse button) and
paste it at the prompt after the command:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz

Download the genome file and the gff file using wget and the link to each file.

Exploring data
1. Uncompress the file using gunzip
gunzip GCF_000146045.2_R64_genomic.fna.gz
ls

Uncompress the gff file

2. Visualize the file

more GCF_000146045.2_R64_genomic.fna

3. List the fasta names of the genome file

grep ">" GCF_000146045.2_R64_genomic.fna

4. How many chromosomes does the file have?

grep -c ">" GCF_000146045.2_R64_genomic.fna

5. How many lines does the file have?

Hint: look at the Basic useful commands list

6. Keep only the ID of the chromosomes in the fasta genome file. The fasta names
include the ID and the description.

cut -f1 GCF_000146045.2_R64_genomic.fna | grep ">"

cut -f1 -d" " GCF_000146045.2_R64_genomic.fna | grep ">"
Use the command man cut to find out what the -d flag does.
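
The difference between the two cut commands is easier to see on a tiny file. The sketch below uses a made-up two-sequence fasta file (toy.fna is an invented name, not the yeast genome) and assumes, as is the case for the headers shown here, that the ID and the description are separated by a space.

```shell
# Work in a throwaway directory with a toy fasta file.
workdir=$(mktemp -d)
cd "$workdir"
printf '>chrI first example chromosome\nACGTACGT\n>chrII second example chromosome\nGGCCTTAA\n' > toy.fna

# Without -d, cut splits on TAB. These headers contain no TAB,
# so field 1 is the whole line and nothing is removed.
cut -f1 toy.fna | grep ">"

# With -d" " the delimiter is a space, so field 1 is only the ID.
ids=$(cut -f1 -d" " toy.fna | grep ">")
echo "$ids"
```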

To write the output (redirect it) to a new file:

cut -f1 -d" " GCF_000146045.2_R64_genomic.fna > new_S288c_genome

Use more or less to visualize the new file.


5 REDIRECTING THE OUTPUT

The system takes its input from the standard input. By default this is the keyboard, but we can
change it to be a file. The standard output device, called stdout, is where the system output is
sent. By default stdout is the screen, but it can also be redirected to a file with the >
operator.

The pipe symbol "|" lets us chain commands, so that the output of the first
command becomes the input of the second. As an example, we can count the lines of a file using
a pipe and redirect the result to a new file:

cat GCF_000146045.2_R64_genomic.fna | wc -l > num_lines_of_Scergenome
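
The same operators can be tried on a small file first. In the sketch below all file names are invented; the >> operator (append instead of overwrite) and the < operator (read standard input from a file) are standard shell features added here for illustration.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir"

printf 'gamma\nalpha\nbeta\n' > words.txt   # ">" creates or overwrites the file
echo 'delta' >> words.txt                   # ">>" appends a line to the file

sort < words.txt                            # "<" feeds the file to standard input

# Pipe plus redirection: count the lines and store the result in a file.
cat words.txt | wc -l > num_lines.txt
cat num_lines.txt

# Several pipes can be chained: the alphabetically first word goes into a file.
sort words.txt | head -n 1 > first_word.txt

n=$(cat num_lines.txt)
first=$(cat first_word.txt)
```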

7. Now visualize the gff file

Can you identify the header?
Can you identify the different columns?

8. Count how many annotations there are for each feature:

a) Remove the header
grep -v "#" GCF_000146045.2_R64_genomic.gff | wc

b) Keep only the third column
cut -f3 GCF_000146045.2_R64_genomic.gff

c) Finally, sort and use the command uniq -c (putting it all together)
grep -v "#" GCF_000146045.2_R64_genomic.gff | cut -f3 | sort | uniq -c
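
To see why sort has to come before uniq -c, the pipeline can be run on a made-up gff file with three annotations. The file toy.gff and its contents are invented for illustration; only the tab-separated layout with the feature type in column 3 follows the real file.

```shell
# Work in a throwaway directory with a toy gff file (fields separated by TAB).
workdir=$(mktemp -d)
cd "$workdir"
{
  printf '##gff-version 3\n'
  printf 'chrI\ttoy\tgene\t1\t100\t.\t+\t.\tID=gene1\n'
  printf 'chrI\ttoy\tmRNA\t1\t100\t.\t+\t.\tID=rna1\n'
  printf 'chrI\ttoy\tgene\t200\t300\t.\t-\t.\tID=gene2\n'
} > toy.gff

# uniq -c only merges identical lines that are next to each other,
# so the feature names are sorted first.
counts=$(grep -v "#" toy.gff | cut -f3 | sort | uniq -c)
echo "$counts"
```

Each output line is a count followed by a feature name.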

6 Obtaining data from GEO - SRA


NCBI’s GEO (Gene Expression Omnibus) and SRA (Sequence Read Archive) are international
public repositories that store and distribute expression data obtained with microarrays, massive
sequencing and other high-throughput techniques for producing functional genomics data.
Starting from a GEO or SRA identifier, it is possible to find information about the target
experiment, as well as related identifiers that allow the sequences to be downloaded.
This is done with tools from NCBI's SRAtoolkit.
Choose one of the samples and download the sequences in sra format. Then produce the
corresponding fastq files (i.e. for paired-end Illumina data there is one sra file and
two associated fastq files, one per read of the sequence pair). These steps are carried out
with the SRAtoolkit tools prefetch and/or fastq-dump. See the example below:
prefetch -v SRR453569
fastq-dump --split-files --gzip /home/natalia/ncbi/public/sra/SRR453569.sra
fastq-dump --split-files --gzip SRR453570 #alternative
fastq-dump -X 5 --split-files --gzip SRR453571 #download first 5 reads
ls -lrt
cd ..
Sed is a stream editor: it edits text as it passes through. It is used to perform various
text-editing operations, such as replacing, inserting or deleting patterns.
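
A minimal sketch of these three operations on a toy fasta file follows. The file name toy.fna and the prefix Scer_ are invented for illustration; the sed expressions themselves use only basic syntax.

```shell
# Work in a throwaway directory with a toy fasta file.
workdir=$(mktemp -d)
cd "$workdir"
printf '>chrI first\nACGT\n>chrII second\nGGCC\n' > toy.fna

# Replace: s/old/new/ substitutes the first match on each line.
renamed=$(sed 's/chr/chromosome_/' toy.fna | grep ">")
echo "$renamed"

# Delete: /pattern/d removes every line matching the pattern (here, the headers).
seqs=$(sed '/^>/d' toy.fna)
echo "$seqs"

# Insert: a substitution can also add text, here a prefix after each ">".
prefixed=$(sed 's/^>/>Scer_/' toy.fna | head -n 1)
echo "$prefixed"
```

sed prints the edited text to the standard output and leaves toy.fna unchanged, so the result has to be redirected to keep it.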

AWK is a programming language designed for text processing.

With the awk command we can extract columns (fields) from a file. It is also possible to keep
only the lines in which a field meets a certain condition.
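
Both uses can be sketched on a small tab-separated table. The file toy.tsv and its four columns (sequence, feature, start, end) are made up for this example.

```shell
# Work in a throwaway directory with a toy tab-separated table.
workdir=$(mktemp -d)
cd "$workdir"
printf 'chrI\tgene\t1\t100\nchrI\tmRNA\t1\t100\nchrII\tgene\t50\t400\n' > toy.tsv

# Extract columns: -F sets the field separator, $1 and $2 are the first two fields.
cols=$(awk -F'\t' '{print $1, $2}' toy.tsv)
echo "$cols"

# Filter on a condition: keep lines whose second field is "gene"
# and print the sequence name with the feature length (end - start + 1).
genes=$(awk -F'\t' '$2 == "gene" {print $1, $4 - $3 + 1}' toy.tsv)
echo "$genes"
```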

Visualize the fastq file retrieved

7. Exercise

Determine how many RNA-seq reads belong to each gene of S. cerevisiae.

1. Retrieve the transcripts from S. cerevisiae using wget

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_rna.fna.gz

2. Uncompress the file.

3. Generate an index to blast against these transcripts
$ makeblastdb -in GCF_000146045.2_R64_rna.fna -dbtype nucl

4. Copy a SRR file into your folder

cp /home/rnaseq/raw-data/SRR453566_2.fastq ./
5. From one of the SRR files available on the server, generate a file containing only 100,000
reads
Hint: use the command head
6. Convert the 100,000-read fastq file into a fasta file
7. Blast the fasta read file against the S. cerevisiae transcripts
$ blastn -query fastareadfile -db S.cerevisiaetranscripts -outfmt '6 std qlen slen' -out 100k_vs_ScerTranscript
8. Calculate how many reads map to each transcript in the blast output.
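
One possible way to approach steps 5 and 6 and the last step is sketched below on toy data; it is not the only solution. It assumes the usual fastq layout of four lines per read and the tabular blast layout in which column 1 is the read and column 2 is the transcript. All file names, read names and transcript names (txA, txB) are invented.

```shell
# Work in a throwaway directory with a toy fastq file of three reads.
workdir=$(mktemp -d)
cd "$workdir"
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n@read3\nTTAA\n+\nIIII\n' > toy.fastq

# Step 5 idea: the first N reads are the first 4*N lines (here N = 2).
head -n 8 toy.fastq > subset.fastq

# Step 6 idea: keep line 1 of each record (turning "@" into ">")
# and line 2 (the sequence); drop the "+" line and the quality line.
awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print}' subset.fastq > subset.fasta
fasta=$(cat subset.fasta)
echo "$fasta"

# Last step idea, on a made-up two-column table standing in for the blast output.
# read1 hits txA twice, so read/transcript pairs are made unique before counting.
printf 'read1\ttxA\nread1\ttxA\nread2\ttxA\nread3\ttxB\n' > toy_blast.tsv
counts=$(cut -f1,2 toy_blast.tsv | sort -u | cut -f2 | sort | uniq -c)
echo "$counts"
```

A read that hits several different transcripts is counted once for each of them in this sketch; whether that is desirable depends on how the hits are to be interpreted.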
