Basic Linux Introduction

Curitiba - 2019
Main objectives
1 - Connect to a server and work in bash in a command line window.
2 - Understand the Linux file system.
3 - Basic useful commands: moving around the file system; copying, renaming and moving files.
4 - First steps using the basic commands. Retrieving and exploring data.
5 - Redirecting the output.
6 - Obtaining data from GEO - SRA.
7 - Frequently used commands and filters.
1- Connection
To open a terminal on a remote machine we use the ssh protocol.
If you are already on a Linux machine, the general form is ssh user@server:
> ssh usuario@servidor.edu.uy
Example:
> ssh yourname1@10.41.48.11
For example, with the user name danielapavoni1.
The machine will respond by asking for the user's password.
On Windows you should use an ssh application, such as SSHclient or PuTTY, or the Ubuntu
version available in Windows 10.
2. Files, File System, Moving across the directory tree (using the TERMINAL)
STRUCTURE OF DIRECTORIES (FOLDERS) AND FILES
In Unix, files are organized in directories (equivalent to Windows folders). This organization,
called the "file system", has a tree structure.
In Linux, every file has permissions, an owner and a name. We access this information with the
command ls (list) and its -l option.
File NAMES
There are two ways to refer to a file by name: the short form and the "absolute" form.
The short form is simply the name of the file. The absolute form gives, in addition to the
(short or relative) name, the location of the file.
For example, the file named hello.txt located in the rnaseq directory, which in turn is inside the
/home directory, has the absolute name:
/home/rnaseq/hello.txt
Note that there may be several files named hello.txt in different directories. For example:
/home/falvarez/hello.txt
These are two different files, although both have the same short name. We can only
refer to a file by its short name when we are in the directory where the file is located. Also
keep in mind that names (and commands) are case sensitive, that is to say, uppercase
is different from lowercase (Hello.Txt is NOT the same as hello.txt).
The permissions are read (r), write (w) and execute (x). Among regular files, only programs
and scripts should have execute permission.
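As a small illustration of the permissions, owner and name described above, the sketch below creates a throwaway file and inspects it with ls -l. The scratch directory and the file name notes.txt are made up for this example, and chmod (a command not covered in this text) is used only to make the result predictable.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
echo "hello" > "$workdir/notes.txt"

# Owner may read and write; everyone else may only read.
chmod 644 "$workdir/notes.txt"

# Long listing: permissions, link count, owner, group, size, date, name.
ls -l "$workdir/notes.txt"

# Keep only the permission column (the first space-separated field).
perms=$(ls -l "$workdir/notes.txt" | cut -d" " -f1)
echo "$perms"
```

The permission string starts with a dash for a regular file (d for a directory), followed by three rwx groups: owner, group and all other users.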
3 Basic useful commands
ls      Lists the contents of the current directory
mkdir   Makes a new directory
mv      Moves a file
cp      Copies a file
rm      Removes a file
cat     Concatenates files
more    Displays the contents of a file one page at a time
head    Displays the first ten lines of a file
tail    Displays the last ten lines of a file
cd      Changes current working directory
pwd     Prints working directory
find    Finds files matching an expression
grep    Searches a file for patterns
wc      Counts the lines, words, characters, and bytes in a file
kill    Stops a process
jobs    Lists the processes that are running
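To get a feeling for some of these commands before using them on real data, here is a minimal sketch that runs a few of them on a small text file. The scratch directory and the file names are invented for the example.

```shell
workdir=$(mktemp -d)

# Build a 12-line text file; printf repeats its format for each argument.
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$workdir/sample.txt"

head "$workdir/sample.txt"                            # first ten lines
tail -n 2 "$workdir/sample.txt"                       # last two lines only
n_lines=$(cat "$workdir/sample.txt" | wc -l)          # number of lines
n_match=$(grep -c "line 1" "$workdir/sample.txt")     # lines containing "line 1"

cp "$workdir/sample.txt" "$workdir/copy.txt"          # copy a file
mv "$workdir/copy.txt" "$workdir/renamed.txt"         # mv also renames
rm "$workdir/sample.txt"                              # remove the original
ls "$workdir"                                         # only renamed.txt is left
```

Note that grep counts every line containing the pattern, so "line 1" also matches line 10, line 11 and line 12.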
The dot ".", double dot ".." and the forward slash "/"
(.) The dot refers to the current directory (where we are).
For example, hello.txt and ./hello.txt are the same thing.
(..) The double dot refers to the parent of the current directory (the directory one level above
the one we are in).
For example, if we are in /media2/course, the double dot refers to /media2.
The slash (/) separates directories in an absolute name; on its own, it indicates the root
directory.
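The following sketch shows the three notations in action. It assumes a hypothetical layout, media2/course, built under a scratch directory so that it can be run anywhere.

```shell
# Hypothetical layout for the example: <scratch>/media2/course/hello.txt
top=$(mktemp -d)
mkdir -p "$top/media2/course"
echo "hi" > "$top/media2/course/hello.txt"

cd "$top/media2/course"
cat hello.txt                         # short name
cat ./hello.txt                       # same file: "." is the current directory
cat "$top/media2/course/hello.txt"    # same file, absolute name

cd ..                                 # move to the parent directory
parent=$(pwd)                         # now ends in /media2
cat course/hello.txt                  # reach the file through the subdirectory

cd /                                  # the slash alone is the root directory
root=$(pwd)
```

All four cat commands print the same content, because they all point to the same file.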
4 First steps
Copy the hello.txt file into your own directory:
cp /home/rnaseq/hello.txt ./
Or
cp ../rnaseq/hello.txt ./
List the contents of your current working directory:
ls
To list all files in your home directory including those whose names begin with a dot, type:
ls -a
As you can see, ls -a lists files that are normally hidden.
We will now make a subdirectory in your home directory to hold the files you will be creating
and using in the course of this tutorial.
To make a subdirectory called rawdata in your current working directory type:
mkdir rawdata
The command cd directory means change the current working directory to ’directory’. The
current working directory may be thought of as the directory you are in, i.e. your current position
in the file-system tree. To change to the directory you have just made, type:
cd rawdata
Type ls to see the contents (which should be empty)
(..) means the parent of the current directory, so typing
cd ..
will take you one directory up the hierarchy (back to your home directory).
Try it now, then return to the rawdata folder.

PATH: Pathnames enable you to work out where you are in relation to the whole file system.
For example, to find out the absolute pathname of where you are, use the print working directory
(pwd) command:
pwd
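The steps of this section can be rehearsed safely with the sketch below. A scratch directory stands in for your home directory, and the two file names are invented. The command touch, which simply creates empty files, is not part of this tutorial.

```shell
# A scratch directory plays the role of the home directory in this sketch.
home_dir=$(mktemp -d)
cd "$home_dir"

touch .hidden_settings visible.txt    # one hidden file, one normal file
ls                                    # shows only visible.txt
ls -a                                 # also shows ., .. and .hidden_settings

mkdir rawdata
cd rawdata
inside=$(pwd)                         # absolute pathname ending in /rawdata
cd ..
back=$(pwd)                           # the directory we started from
cd rawdata                            # return to rawdata
```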
Retrieving data from the Internet

We will work with the common budding yeast Saccharomyces cerevisiae. We will need the
assembled reference genome available at the NCBI: https://www.ncbi.nlm.nih.gov/
Go to this address and search for Saccharomyces cerevisiae. Then, in the genome results,
click on the Genome link. The genome web page will appear, from which you could download
the files to your own computer. However, since we want to download the sequences to the server,
we need the link to the file and the command wget. So, copy the link (with the right mouse
button) and paste it at the prompt after the command:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
Download the genome file and the gff file using wget and the link to each file.
Exploring data
1. Uncompress the file using gunzip:
gunzip GCF_000146045.2_R64_genomic.fna.gz
ls
Uncompress the gff file in the same way.
2. Visualize the file:
more GCF_000146045.2_R64_genomic.fna
3. List the fasta names of the genome file:
grep ">" GCF_000146045.2_R64_genomic.fna
4. How many chromosomes does the file have?
grep -c ">" GCF_000146045.2_R64_genomic.fna
5. How many lines does the file have?
Hint: look at the basic Linux commands.
6. Keep only the IDs of the chromosomes in the fasta genome file. Each fasta name
includes the ID and the description.
cut -f1 GCF_000146045.2_R64_genomic.fna | grep ">"
cut -f1 -d" " GCF_000146045.2_R64_genomic.fna | grep ">"
The first command splits on tabs (the default delimiter), so the names come out unchanged.
Use the command man cut to find out what the -d flag does.
To redirect the output to a new file:
cut -f1 -d" " GCF_000146045.2_R64_genomic.fna > new_S288c_genome
Use more or less to visualize the new file.
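The same commands can be tried on a tiny fasta file, where the expected answers are easy to check by eye. This is only a sketch: the file toy.fna and its two records are made up.

```shell
workdir=$(mktemp -d)
fasta="$workdir/toy.fna"

# Two made-up records; each name line has an ID, a space, then a description.
printf '%s\n' \
  '>seq1 toy chromosome I' \
  'ACGTACGT' \
  'ACGT' \
  '>seq2 toy chromosome II' \
  'GGCC' > "$fasta"

grep ">" "$fasta"                     # name lines only
n_records=$(grep -c ">" "$fasta")     # number of sequences
n_lines=$(cat "$fasta" | wc -l)       # total number of lines

# Without -d, cut splits on tabs; there are none, so names are unchanged.
cut -f1 "$fasta" | grep ">"
# With -d" " the first space-separated field is the ID alone.
ids=$(cut -f1 -d" " "$fasta" | grep ">")
echo "$ids"
```

Here grep -c reports 2 sequences, and the last command prints only >seq1 and >seq2.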

5 REDIRECTING THE OUTPUT
The system takes its input from the standard input (stdin). By default this is the keyboard, but
we can change it to be a file. The standard output (stdout) is where the system sends its output.
By default stdout is the screen, but it can also be redirected to a file with the >
operator.
The pipe symbol "|" allows us to chain commands, so that the output of the first
command is the input of the second. As an example, we can count the lines of a file using a pipe
and redirect this information to a new file:
cat GCF_000146045.2_R64_genomic.fna | wc -l > num_lines_of_Scergenome
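A minimal sketch of these ideas on a made-up file, fruits.txt, follows. Besides > and the pipe, it uses two operators not named above: < (take standard input from a file) and >> (append to a file instead of overwriting it).

```shell
workdir=$(mktemp -d)
cd "$workdir"

printf 'banana\napple\ncherry\n' > fruits.txt    # > creates or overwrites
echo "date" >> fruits.txt                        # >> appends a line

# Input redirection: sort reads fruits.txt as its standard input.
sort < fruits.txt > sorted.txt

# Pipe: the output of cat becomes the input of wc -l; the count is saved.
cat fruits.txt | wc -l > num_lines

count=$(cat num_lines)
first=$(head -n 1 sorted.txt)
```

Be careful with >: if the target file already exists, its previous content is lost.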
7. Now visualize the gff file.
Can you identify the header?
Can you identify the different columns?
8. Count how many annotations there are for each feature:
a) Remove the header (the lines starting with #):
grep -v "^#" GCF_000146045.2_R64_genomic.gff | wc
b) Keep only the third column:
cut -f3 GCF_000146045.2_R64_genomic.gff
c) Finally, sort and use the command uniq -c (putting everything together):
grep -v "^#" GCF_000146045.2_R64_genomic.gff | cut -f3 | sort | uniq -c
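To see why each step of this pipeline is needed, it helps to run it on a file small enough to check by hand. The sketch below uses an invented three-record annotation, toy.gff. The pattern "^#" matches only lines that start with #.

```shell
workdir=$(mktemp -d)
gff="$workdir/toy.gff"

# Made-up annotation: two header lines, then tab-separated records.
printf '##gff-version 3\n#!toy header line\n' > "$gff"
printf 'chrI\ttoy\tgene\t1\t100\t.\t+\t.\tID=g1\n' >> "$gff"
printf 'chrI\ttoy\tmRNA\t1\t100\t.\t+\t.\tID=m1\n' >> "$gff"
printf 'chrI\ttoy\tgene\t200\t300\t.\t-\t.\tID=g2\n' >> "$gff"

# Drop header lines, keep column 3, group equal values, count each group.
counts=$(grep -v "^#" "$gff" | cut -f3 | sort | uniq -c)
echo "$counts"

# Cross-check one feature directly with grep -c.
n_gene=$(grep -v "^#" "$gff" | cut -f3 | grep -c "^gene$")
```

uniq only merges identical lines that are adjacent, which is why sort must come before it.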
6 Obtaining data from GEO - SRA

NCBI’s GEO (Gene Expression Omnibus) and SRA (Sequence Read Archive) are international
public repositories that store and distribute expression data obtained with microarrays,
high-throughput sequencing and other high-throughput techniques that produce functional genomics data.
Starting from a GEO or SRA identifier, it is possible to find information about the target
experiment, as well as related identifiers that allow the sequences to be downloaded.
This is done with different tools from NCBI’s SRAtoolkit.
Choose one of the samples and download the sequences in the sra format. Then produce the
corresponding fastq files (for paired-end Illumina data there is one sra file and
two associated fastq files, one per read of the pair). These steps are carried out
with the NCBI SRAtoolkit tools prefetch and/or fastq-dump. See the example below:
prefetch -v SRR453569
fastq-dump --split-files --gzip /home/natalia/ncbi/public/sra/SRR453569.sra
fastq-dump --split-files --gzip SRR453570 #alternative
fastq-dump -X 5 --split-files --gzip SRR453571 #download first 5 reads
ls -lrt
cd ..
sed is a stream editor (it edits the text as it passes through). It is used to perform various
text editing functions, such as replacing, inserting or deleting patterns.
AWK is a programming language designed for text processing.
With the awk command we can extract columns (fields) from a file. It is also possible to select
the lines of a file in which a field meets a certain condition.
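As a minimal sketch of both tools, the example below applies one sed replacement and two awk filters to a small tab-separated table. The file toy_table.txt and its contents are made up for illustration.

```shell
workdir=$(mktemp -d)
table="$workdir/toy_table.txt"

# Made-up table: name, feature type, length (tab-separated).
printf 'g1\tgene\t900\ng2\ttRNA\t72\ng3\tgene\t1500\n' > "$table"

# sed: replace the first occurrence of "gene" on each line with "CDS".
replaced=$(sed 's/gene/CDS/' "$table")
echo "$replaced"

# awk: print column 1 for the rows whose third column is larger than 100.
long_names=$(awk -F'\t' '$3 > 100 {print $1}' "$table")
echo "$long_names"

# awk: print columns 1 and 3 for the rows whose second column is "tRNA".
trna=$(awk -F'\t' '$2 == "tRNA" {print $1, $3}' "$table")
echo "$trna"
```

In awk, -F sets the field separator, $1, $2, ... are the fields of the current line, and the expression before the braces is the condition a line must meet.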
Visualize the retrieved fastq file.
7. Exercise
Determine how many RNA-seq reads belong to each gene of S. cerevisiae.
1. Retrieve the transcripts from S. cerevisiae using wget:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_rna.fna.gz
2. Uncompress the file.
3. Generate an index to blast against these transcripts:
makeblastdb -in GCF_000146045.2_R64_rna.fna -dbtype nucl
4. Copy an SRR file into your folder:
cp /home/rnaseq/raw-data/SRR453566_2.fastq ./
5. From one of the SRR files available on the server, generate a file containing only 100,000
reads.
Hint: use the command head
6. Convert the 100,000-read fastq file into a fasta file.
7. Blast the fasta read file against the S. cerevisiae transcripts (replace fastareadfile and
S.cerevisiaetranscripts with the names of your own read file and database):
blastn -query fastareadfile -db S.cerevisiaetranscripts -outfmt '6 std qlen slen' -out 100k_vs_ScerTranscript
8. Calculate how many reads map to each transcript in the blast output.
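One possible approach to the text-processing steps of this exercise is sketched below on toy data. It is not the only solution. The fastq file, the read names and the two-column hits table are invented; a real table produced with -outfmt '6 std qlen slen' has more columns, but the read is still in column 1 and the transcript in column 2.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Made-up fastq with three reads; every read occupies exactly four lines.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' > toy.fastq

# First 2 reads = first 2 * 4 lines (100,000 reads would be 400000 lines).
head -n 8 toy.fastq > subset.fastq

# fastq to fasta: line 1 of each group of four becomes ">name",
# line 2 is the sequence; lines 3 and 4 are dropped.
awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print}' subset.fastq > subset.fasta
n_fasta=$(grep -c ">" subset.fasta)

# Made-up tabular hits: column 1 is the read, column 2 the transcript.
printf 'r1\ttxA\nr2\ttxA\nr3\ttxB\n' > toy_hits.tsv

# Reads per transcript: take column 2, sort, and count the repeats.
per_transcript=$(cut -f2 toy_hits.tsv | sort | uniq -c)
echo "$per_transcript"
```

With real BLAST output a read can have several hits, so it would be counted more than once unless duplicate read-transcript pairs are removed first (for example with sort -u on the first two columns).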
