
NEXT-GENERATION

SEQUENCING AND
BIOINFORMATICS
Moore's law: the number of transistors in a dense integrated
circuit doubles every two years
Moore's law describes and predicts the pace of
improvement of one of the fastest-improving
technologies: computers
In the last 15 years the pace of improvement of DNA
sequencing technologies has been much faster than that
of computers
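
As a rough worked example, a doubling every two years corresponds to a growth factor of 2^(years/2). The sketch below computes what Moore's law would predict over the 15-year window mentioned above; the function name `moore_growth_factor` is illustrative and not part of the slides.

```python
def moore_growth_factor(years: float, doubling_time: float = 2.0) -> float:
    """Fold-improvement predicted by a fixed doubling time (in years)."""
    return 2.0 ** (years / doubling_time)


if __name__ == "__main__":
    factor = moore_growth_factor(15)
    # 15 years at one doubling every 2 years is 7.5 doublings
    print(f"Moore's law over 15 years: about {factor:.0f}-fold")
```

Any technology that improved by much more than this factor over the same period has outpaced Moore's law.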
Frederick Sanger
Nobel prize in chemistry in 1958 for sequencing insulin (and
proteins in general)

Nobel prize in chemistry in 1980 for sequencing nucleic acids

One of only four people to win two Nobel prizes in science (with Marie Curie, John Bardeen and, since 2022, K. Barry Sharpless)


SANGER SEQUENCING
The most modern Sanger sequencers allow parallelization of
up to 96 samples at once

Before sequencing, a PCR and purification step is necessary; if you do not know the sequence in advance, you also need to perform a cloning step

OUTPUT: 1000 bases per run (96000 if you parallelize)


NEXT-GEN
SEQUENCING TECHNOLOGIES

• Roche/454 FLX

• Applied Biosystems SOLiD System

• Illumina/Solexa sequencing by synthesis

• Ion Torrent
NEXT-GENERATION DNA SEQUENCING
MAIN CHARACTERISTICS
EXTREME MINIATURIZATION

Reactions are carried out in tiny volumes (microliters down to picoliters) thanks to specific technological advances

This in turn allows

MASSIVE PARALLELIZATION

Thousands to millions of reactions are performed in parallel, reducing the costs and increasing the output volume by orders of magnitude
454 pyrosequencing
SAMPLE PREPARATION

Nebulization of genomic DNA into fragments of 400-1000 base pairs

Ligation of fragments to two adapters (type A and type B)

Selection of single-stranded fragments with both adapters


EMULSION PCR

Fragments are mixed with agarose beads, 28 microns in diameter, bearing oligos complementary to the adapters
Isolation of each bead-fragment complex in an individual micelle of a water-in-oil emulsion
Emulsion PCR yields about 1 million copies of the amplified fragment
on the surface of each bead
SAMPLE LOAD

Each bead is placed in a well of a picotiter slide (7x7 cm fiber optic slide); each slide has several million wells, 44 microns in diameter

Multiple enzymes and reagents are added in the form of even smaller beads
PYROSEQUENCING REACTION

A single nucleotide species is added in each cycle

Nucleotide incorporation → light generation


Rothberg Nat. Biotechnol. 2008
ROCHE/454 FLX Pyrosequencer

1 EMULSION PCR takes the place of thousands of cloning experiments

1 SEQUENCING RUN takes the place of thousands of SANGER sequencing runs

EXTREME MINIATURIZATION

MASSIVE PARALLELIZATION
ROCHE/454 GSFLX+

BASE CALLING ACCURACY: 99.9% or more (lower in the final part of the reads)

OUTPUT:
Generates reads up to 1,000 nucleotides long

Generates about 500,000-1,000,000 reads

For a total output of 700 megabases per run (8 hours)
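
To put "thousands of Sanger runs" into numbers, the short sketch below divides the 700 megabases of one 454 run by the 96,000 bases of a fully parallelized Sanger run. Both figures come from the slides; the constant and function names are illustrative.

```python
SANGER_BASES_PER_RUN = 96 * 1_000   # 96 samples x ~1,000 bases each (slide figure)
FLX_BASES_PER_RUN = 700_000_000     # 700 megabases per 454 run (slide figure)


def equivalent_sanger_runs(ngs_bases: int,
                           sanger_bases: int = SANGER_BASES_PER_RUN) -> float:
    """How many Sanger runs would be needed to produce ngs_bases bases."""
    return ngs_bases / sanger_bases


if __name__ == "__main__":
    runs = equivalent_sanger_runs(FLX_BASES_PER_RUN)
    print(f"One 454 run is worth roughly {runs:,.0f} parallelized Sanger runs")
```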
454 MAIN ISSUE
Homopolymers: stretches of one single nucleotide species

Intrinsic problem of the technology

Multiple identical nucleotides are incorporated in a single cycle

They generate more light, but discrimination becomes increasingly difficult as the stretch gets longer
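
The homopolymer problem can be illustrated with a toy model of the flows. This is a simplified sketch, not the instrument's base-calling algorithm: it assumes an ideal signal exactly proportional to the number of bases incorporated, and a cyclic flow order of T, A, C, G (an assumption made for the example). The names `flowgram` and `relative_step` are illustrative.

```python
def flowgram(read: str, flow_order: str = "TACG", n_flows: int = 16):
    """Return (nucleotide, bases incorporated) for each flow of a toy run.

    `read` is the sequence being synthesized. One nucleotide species is
    offered per flow; all consecutive matching bases are incorporated at once.
    """
    pos = 0
    flows = []
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))
        if pos == len(read):
            break
    return flows


def relative_step(n: int) -> float:
    """Relative signal increase between a homopolymer of n and n + 1 bases."""
    return 1.0 / n


if __name__ == "__main__":
    # The four T's are read in a single flow, with four times the signal
    print(flowgram("TTTTACG"))
    for n in (1, 3, 7):
        print(f"{n} -> {n + 1} bases: signal grows by {relative_step(n):.0%}")
```

In this idealized model, telling 1 base from 2 means detecting a doubling of the signal, while telling 7 from 8 means detecting an increase of about 14%, which is easily hidden by measurement noise.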
454 MAIN ISSUE

This problem can affect the downstream bioinformatic analysis

KNOW YOUR MACHINE!


ILLUMINA/SOLEXA

Currently the market leader


Very low cost per base, proven technology

sequencing by synthesis
ILLUMINA/SOLEXA

1. DNA fragmentation and ligation to 2 types of adapters
2. Templates are bound on the surface of a flow microcell
3. "Bridge" amplification using primers complementary to the adapters that are
bound to the substrate at high density → production of
clusters of up to 1,000,000 template copies "in situ"
that generate a sufficient signal to be detected
ILLUMINA/SOLEXA

4. Addition of fluorescent nucleotides blocked at 3'-OH
5. Fluorescence detection
6. Removal of the fluorophore and of the 3' block
7. Repeat steps 4-6
ILLUMINA/SOLEXA

• Four different fluorophores → no issues with homopolymers

• Shorter reads
Blocking the incorporation of multiple nucleotides is one of
the foundations of the Illumina method
In each cycle the blocking is imperfect: a small percentage
of the copies in a cluster incorporate two nucleotides,
giving noise instead of good signal
This out-of-phase fraction accumulates cycle after cycle; when it reaches a threshold, the signal is lost
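
The link between imperfect blocking and short reads can be sketched with a simple decay model. It assumes that in every cycle each copy independently falls out of phase with a fixed probability and never recovers; the 0.5% slip rate and the 50% threshold are hypothetical values chosen only for illustration, not instrument specifications.

```python
import math


def in_phase_fraction(cycle: int, slip_rate: float) -> float:
    """Fraction of copies in a cluster still in phase after `cycle` cycles."""
    return (1.0 - slip_rate) ** cycle


def max_usable_cycles(slip_rate: float, min_in_phase: float) -> int:
    """Largest cycle count for which the in-phase fraction stays >= min_in_phase."""
    return math.floor(math.log(min_in_phase) / math.log(1.0 - slip_rate))


if __name__ == "__main__":
    slip_rate = 0.005      # hypothetical: 0.5% of copies slip per cycle
    threshold = 0.5        # hypothetical: signal considered lost below 50% in phase
    n = max_usable_cycles(slip_rate, threshold)
    print(f"Usable cycles (read length) under these assumptions: {n}")
    print(f"In-phase fraction at that cycle: {in_phase_fraction(n, slip_rate):.3f}")
```

Because the in-phase fraction shrinks geometrically, even a very small per-cycle error puts a hard limit on read length.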

KNOW YOUR MACHINE


ILLUMINA/SOLEXA
• DIFFERENT INSTRUMENTS (Benchtop ones)
ILLUMINA/SOLEXA
• DIFFERENT INSTRUMENTS (high yield ones)
ION TORRENT

The smallest sequencer, fast and economical

An instrument: $60,000
A run: ~$1,000 (high scalability)

Output: up to 10 Gb, with reads up to 600 bp long

Very quick: a run lasts about 3 hours
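
Using only the figures quoted in these slides (700 Mb in 8 hours for the 454 GS FLX+, up to 10 Gb in 3 hours and about $1,000 per run for Ion Torrent), a back-of-the-envelope comparison could look like the sketch below. The function names are illustrative, and the numbers are upper bounds taken from the slides rather than measured values.

```python
def megabases_per_hour(output_mb: float, run_hours: float) -> float:
    """Average sequencing output per hour of run time."""
    return output_mb / run_hours


def cost_per_gigabase(run_cost_usd: float, output_gb: float) -> float:
    """Reagent cost per gigabase for one run."""
    return run_cost_usd / output_gb


if __name__ == "__main__":
    platforms = {
        "Roche/454 GS FLX+": (700, 8),      # (output in Mb, run time in hours)
        "Ion Torrent": (10_000, 3),
    }
    for name, (output_mb, hours) in platforms.items():
        print(f"{name}: {megabases_per_hour(output_mb, hours):,.1f} Mb per hour")
    print(f"Ion Torrent: about ${cost_per_gigabase(1_000, 10):,.0f} per Gb")
```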


ION TORRENT

In many respects similar to 454

DNA is amplified on microbeads and inserted into wells

Then subjected to cycles of incorporation of a single type of nucleotide

Support for basic analyses without bioinformatic knowledge
ION TORRENT

Does not detect light, but the H+ ions released during sequencing: like a camera chip that detects protons instead of photons

The sequencing is performed on a semiconductor chip, which detects the release of protons

Potential for rapid technological development, taking advantage of the electronics industry
ION TORRENT
All nucleotides release H+, so cycles of incorporation of
one nucleotide type at a time are required (A, C, G, T)

Same issue as 454: homopolymers


THIRD GENERATION
SEQUENCING TECHNOLOGIES

• Pacific Biosciences

• Oxford Nanopore
THIRD GENERATION
SEQUENCING TECHNOLOGIES
REAL TIME SEQUENCING

The idea is to bypass the amplification step

Advantage

This makes it possible to avoid DNA fragmentation and to obtain LONGER reads, FASTER
Pacific Biosciences PACBIO
Launched in 2009 (third-generation?)
Real-Time sequencing technology
The idea is to directly observe the DNA polymerization
while it is performed by DNA polymerase
Single Molecule Real Time (SMRT) sequencing

Recently the third machine was released: the PacBio Sequel, which costs around 350,000 dollars
Zero-mode waveguide (ZMW)
Highly sensitive detection system
Nanophotonic structure with 50 nm diameter wells

A laser illuminates from below, but its wavelength is too long for the light to propagate through the well

Same principle as microwave oven doors
Zero-mode waveguide (ZMW)

The light penetrates only 20-30 nm

This makes it possible to detect only what happens at the
bottom of the well, reducing background noise and
giving high sensitivity and temporal resolution

The latest PacBio instrument has around 1,000,000 wells
Phi29 polymerase
Phage polymerase
Highly processive, up to 70,000 nt
High fidelity, up to 100 times higher than Taq polymerase
Modified to be slower
The polymerase is linked to the bottom of the wells

Only about 1/3 of the wells get a single polymerase, and thus can perform the sequencing
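
One common way to rationalize the "one third" figure is random loading. Assuming (this is not stated in the slides) that polymerases land in wells independently, the number of polymerases per well follows a Poisson distribution, and the fraction of wells with exactly one polymerase can never exceed about 37%. A minimal sketch, with illustrative function names:

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k polymerases in a well, given a mean of lam per well."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


if __name__ == "__main__":
    for lam in (0.5, 1.0, 2.0):
        empty = poisson_pmf(0, lam)
        single = poisson_pmf(1, lam)
        multiple = 1.0 - empty - single
        print(f"mean {lam}: empty {empty:.1%}, single {single:.1%}, multiple {multiple:.1%}")
```

Under this assumption the best case is a mean of one polymerase per well, where the single-occupancy fraction equals 1/e: loading less leaves more wells empty, loading more produces more wells with several polymerases.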
PacBio sequencing
A single-stranded DNA molecule is bound to the polymerase
The 4 nucleotide species are added, tagged with 4 different
fluorophores

The nucleotide is incorporated and the fluorophore is cut

The free fluorophore generates a flash of light, which is detected by a fluorescence microscope
Characteristics

The sequencing is continuous, washing is not necessary → much faster
PacBio yields sequences of several thousand
nucleotides (up to 20,000)

Third generation sequencing

A new revolution → especially for bioinformatics
PacBio ISSUES
Current issues are:

The cost (10x more expensive than Illumina)

The read quality: single molecule sequencing means every mistake is recorded, and cannot be averaged out by the presence of thousands of parallel reactions

However, these errors are random and can be overcome, for example by reading the same region several times and taking a consensus
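
One way random errors can be overcome is to read the same position several times and take a majority vote. The sketch below computes how often the majority is wrong, under deliberately simple assumptions: errors are independent between reads, and every wrong read shows the same wrong base (a worst case). The 15% per-read error rate is a hypothetical value used only for illustration.

```python
from math import comb


def majority_error(per_read_error: float, n_reads: int) -> float:
    """Probability that more than half of n_reads are wrong at one position."""
    p = per_read_error
    k_min = n_reads // 2 + 1
    return sum(
        comb(n_reads, k) * p ** k * (1.0 - p) ** (n_reads - k)
        for k in range(k_min, n_reads + 1)
    )


if __name__ == "__main__":
    error = 0.15  # hypothetical per-read error rate
    for n in (1, 5, 15):  # odd numbers of reads, so that ties cannot occur
        print(f"{n:2d} reads: consensus error {majority_error(error, n):.2e}")
```

This only works because the errors are random: a systematic error, repeated in every read, would survive the vote.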
Oxford Nanopore

10–20 Gb of DNA sequences

$1,000 for the base kit (2 runs, all reagents included)

Library preparation can take <10 minutes

Read lengths of tens of thousands of bases, or even more

Can sequence genomic DNA, cDNA or even RNA directly!


Oxford Nanopore

Ions pass through the pore


This generates measurable current.
When a molecule passes through, the current signal is disturbed

Each nucleotide produces a different specific current perturbation


Oxford Nanopore

A smaller protein is added on top of the pore, to unzip the DNA and have one strand pass through the pore

Nucleic acids pass through the pore as single strands

An adapter molecule slows down the flow in order to allow a clear recognition of each base
Oxford Nanopore

Even smaller devices (in the future)

Reads up to 1 million bases!


A revolution from the point of view
of downstream analysis

http://flxlexblog.wordpress.com/2013/10/01/developments-in-next-generation-sequencing-october-2013-edition/
NEXT-GEN IS TRENDY

It is the new thing

It is powerful and cheap
It has uses in any biological system (from viruses to human genetics)
It is useful to answer a number of questions (de novo assembly, mapping, transcriptomics)
NEXT-GEN IS TRENDY
So everyone wants to use it

You just extract your DNA/RNA and send it to a sequencing company

And then, who will do the analysis?
NEXT-GEN WORKFLOW
1. What is the goal?

2. Choose the right experimental setup

3. Choose the right sequencing technology

4. Data Analysis
What is your goal?
What exactly is the problem you want to address?
Evaluate approaches used in the past
Consider new approaches
Consider future problems

NO WAY BACK!
CHOOSE THE RIGHT TECHNOLOGY

De novo sequencing: 454, PacBio

Draft sequencing: Illumina, Ion Torrent

Microbial communities: 454, Illumina

Transcriptomics: Illumina, Ion Torrent


DATA ANALYSIS

A basic next-gen experiment generates gigabytes of information

This is HIGH-THROUGHPUT!
HIGH-THROUGHPUT TECHNOLOGIES

Technologies that generate too much data to be handled without computer assistance

Modelling

Shotgun proteomics

Network analysis

Structural biology

Machine learning
HIGH-THROUGHPUT TECHNOLOGIES

Next-generation sequencing
BIOINFORMATICS
Bioinformatics is the development and use of computer
methods for the analysis of biological data

Bioinformatics becomes absolutely necessary with the increase of data load
BIOINFORMATICS

Most bioinformatics is run on Linux


SO WHAT IS UNIX?

Unix is a family of multitasking, multiuser computer
operating systems that derive from the original AT&T
Unix, developed in the 1970s at the Bell Labs research
center by Ken Thompson, Dennis Ritchie, and others.
UNIX Advantages

Full multitasking with protected memory

Very efficient virtual memory

Access controls and security

A rich set of small commands that do specific tasks well

Ability to combine commands to accomplish complicated tasks

A powerfully unified file system

Available on a wide variety of machines

Optimized for program development


UNIX Disadvantages

The command line interface is user-hostile

Commands often have cryptic names and give very little response to tell the user what they are doing

To use Unix well, you need to understand some of the main design features

The richness of utilities (over 400 standard ones) often overwhelms novices

Documentation often feels underwhelming and short on examples

Expensive
UNIX → LINUX

Linux is a UNIX-like family of Operating Systems (OSs)


Each "member" of the family has
different characteristics and comes
with different software and
graphical environments

Broadly, each distribution (a.k.a. distro) is "tuned" for a specific
task, a specific kind of user, or a specific kind of
device

Most Unix advantages, plus it is FREE and user-friendly


Linux Distros
for beginners:
Mint and Ubuntu, #1 and #2 most popular distributions

for a specific task:
e.g. BioLinux (bioinformatics), Scientific Linux (science in
general) and Ubuntu Studio (multimedia)

for a specific platform:
e.g. Mythbuntu (home theater PCs), Yellow Dog Linux (Apple
machines), OpenWrt (routers)
LINUX FOR BIOINFORMATICS

Why Linux?
Free and runs on most hardware
fully customizable
more efficient and stable

Why Linux for bioinformatics?


Supports multiple users in a controlled manner
Optimized for writing and executing scripts/commands
Features for handling massive amounts of files
Adopted by the scientific community

It requires more work than other operating systems


LINUX – OPEN SOURCE

Why Linux? free and open software

Open-source software (OSS) is computer software with its
source code made available with a license in which the
copyright holder provides the rights to study, change,
and distribute the software to anyone and for any
purpose

Open-source software may be developed in a collaborative public manner
LINUX

Why Linux? fully customizable

From the small details to the core functions


LINUX

Why Linux? more efficient and stable

Linux servers are widely used → for example by Microsoft and Apple

As a bioinformatician, if you want to interact with your server
quickly and well, you may find it easier if you use the same
language
Is Linux the only way to do bioinformatics?

ABSOLUTELY NOT

However, its characteristics make it optimal for most bioinformatic tasks

Supports multiple users in a controlled manner
Optimized for writing and executing scripts/commands
Features for handling massive amounts of files
Adopted by the scientific community

Many bioinformaticians use a Mac laptop to interact with a
Linux server (Mac OS X is Unix-based)
Many Linux distros are as friendly as Windows

You get to browse your files visually
Internet browsers
Text processors
Skype
Even videogames

… and many things Windows does not give you

Give Ubuntu a try


https://www.ubuntu.com/download/desktop/try-ubuntu-before-you-install
Interacting with
computers

Usually we interact with our computer through a


Graphical User Interface (GUI)
We have folders, icons, etc...

This is very friendly for us, but very far from the
‘machine language’
language

Using a language that is closer to ‘machine


language’ gives us more power
languages

The closer a programming language is to
machine language, the more difficult and
the more powerful it is
Command Line Interface
(CLI)
Also called the TERMINAL

The CLI is a more powerful way to interact with computers

Faster and automatable

BUT

More complex than GUI


Command Line Interface
(CLI)
To use the CLI we need to write in a language that
is closer to machine language (but still very very
far from it)…

… instead of clicking with the mouse

So to start we need to know some basics

For today, let's just look at the folder structure


Folder structure
The files in our computers are located in folders,
with a tree-like structure
To navigate with the GUI, we can click on folders

To navigate with the CLI, we need to issue a basic command
Change Directory: cd
If I want to go from
Documents to
Folder1:

cd Folder1

If I want to go from Documents to Subfolder1B-1:


cd Folder1/Subfolder1B/Subfolder1B-1

If I want to go from Subfolder1A back up to Folder1:


cd ..
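
The same three moves can be reproduced from a script. The sketch below is only an illustration: it builds a throwaway copy of the folder tree implied by the commands above inside a temporary directory, then uses Python's `os.chdir`, the programmatic counterpart of `cd`, to walk through it. The function name `demo_navigation` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def demo_navigation() -> list[str]:
    """Replay the three cd examples and return the folder reached each time."""
    visited = []
    start = Path.cwd()
    with tempfile.TemporaryDirectory() as tmp:
        documents = Path(tmp).resolve() / "Documents"
        # Recreate the example tree: Folder1 contains Subfolder1A and Subfolder1B
        (documents / "Folder1" / "Subfolder1A").mkdir(parents=True)
        (documents / "Folder1" / "Subfolder1B" / "Subfolder1B-1").mkdir(parents=True)
        try:
            os.chdir(documents)
            os.chdir("Folder1")                            # cd Folder1
            visited.append(Path.cwd().name)

            os.chdir(documents)
            os.chdir("Folder1/Subfolder1B/Subfolder1B-1")  # cd with a longer path
            visited.append(Path.cwd().name)

            os.chdir(documents / "Folder1" / "Subfolder1A")
            os.chdir("..")                                 # cd ..
            visited.append(Path.cwd().name)
        finally:
            os.chdir(start)  # leave the temporary tree before it is deleted
    return visited


if __name__ == "__main__":
    print(demo_navigation())
```

Paths without a leading slash are relative to the current folder, which is why the same command can succeed or fail depending on where you are in the tree.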
