RNA-Seq Alignment

HISAT2
Jelena Nadj
Seven Bridges Genomics
March 3rd, 2016

Introduction

Design

EM algorithm

Implementation

HISAT2 overview

Highly efficient toolkit for mapping next-generation sequencing
reads (both DNA and RNA) against the general human population

Uses an indexing scheme based on the BWT and on GCSA (an extension of
the BWT to graphs)

The authors claim HISAT is the fastest system currently available

Despite the large number of indexes, requires only 4.3 GB of memory

Supports genomes of any size, including those larger than 4 billion bases

HISAT2 toolkit info

Contributors: Daehwan Kim, Ben Langmead, Joe Paggi, Geo Pertea & Steven Salzberg
(TopHat2 developers)

Homepage

GitHub

Kim D, Langmead B and Salzberg SL. HISAT: a fast spliced aligner with low memory
requirements. Nature Methods 2015

GPLv3 license

Latest version released on Nov 19th, 2015

Runs on Linux, Windows and Mac OS X

HISAT2 modules

HISAT2 Build (indexer)

Based on GCSA (an extension of BWT for a graph)

One global GFM index that represents the general population

Large set of small GFM indexes (each index representing 56 Kbp, with
55,000 indexes needed to cover the human population)

Takes a list of reference files (required) and, optionally, exons,
splice sites and SNPs
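
The inputs listed above can be sketched as a command line. The snippet below only assembles an argument list for `hisat2-build` and does not run the tool. The file names are placeholders, and the `--ss`, `--exon` and `--snp` flags are assumed to match the HISAT2 manual for your installed version.

```python
from typing import List, Optional, Sequence


def hisat2_build_cmd(
    references: Sequence[str],
    index_base: str,
    splice_sites: Optional[str] = None,
    exons: Optional[str] = None,
    snps: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not run) a hisat2-build argument list."""
    if not references:
        raise ValueError("at least one reference FASTA file is required")
    cmd = ["hisat2-build"]
    # Optional annotation inputs, each expected in HISAT2's own text format.
    if splice_sites:
        cmd += ["--ss", splice_sites]
    if exons:
        cmd += ["--exon", exons]
    if snps:
        cmd += ["--snp", snps]
    # Reference files are passed as a single comma-separated argument,
    # followed by the base name of the index files to write.
    cmd += [",".join(references), index_base]
    return cmd


# Placeholder file names, for illustration only.
print(" ".join(hisat2_build_cmd(["chr1.fa", "chr2.fa"], "genome", snps="genome.snp")))
```

The list form can be handed to `subprocess.run` unchanged, which avoids shell quoting problems with file names.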

HISAT2 modules

HISAT2 (aligner)

Besides the reads, the only required input is the set of indexes
produced by the HISAT2 Build indexer

Optionally takes splice site, exon and SNP information (in the HISAT2
format), plus new splice site info derived from a previous iteration of
HISAT2

Returns a single SAM file, plus optional metrics files
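
A minimal sketch of how the aligner inputs and outputs map onto a command line. It only builds the argument list for `hisat2`. The file names are placeholders, and the flags (`-x`, `-1`, `-2`, `-U`, `-S`, `--known-splicesite-infile`, `--novel-splicesite-outfile`) are assumed to match the HISAT2 manual for your installed version.

```python
from typing import List, Optional


def hisat2_align_cmd(
    index_base: str,
    reads_1: str,
    reads_2: Optional[str] = None,
    out_sam: str = "aligned.sam",
    known_splice_sites: Optional[str] = None,
    novel_splice_sites_out: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not run) a hisat2 argument list."""
    cmd = ["hisat2", "-x", index_base]
    if reads_2:
        # Paired-end: one file per mate.
        cmd += ["-1", reads_1, "-2", reads_2]
    else:
        # Single-end (unpaired) reads.
        cmd += ["-U", reads_1]
    if known_splice_sites:
        cmd += ["--known-splicesite-infile", known_splice_sites]
    if novel_splice_sites_out:
        # Splice sites found in this run, reusable in a later iteration.
        cmd += ["--novel-splicesite-outfile", novel_splice_sites_out]
    cmd += ["-S", out_sam]
    return cmd


# Placeholder file names, for illustration only.
print(" ".join(hisat2_align_cmd("genome", "r_1.fq", "r_2.fq", out_sam="sample.sam")))
```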

HISAT2 Index

In contrast to most other aligners, HISAT2 employs two different types of indexes:

A global GFM index that represents the entire genome

Numerous small GFM (FM) indexes for regions that collectively cover the
genome; each index represents 56,000 bp, so ~55,000 indexes are needed
to cover the human genome

Optimizations to minimize the memory requirements allow storing the whole
human genome index in 4 GB of space, or about 6.2 GB with 12M SNPs (and
indels)
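
The ~55,000 figure follows from simple division. The check below assumes a human genome length of about 3.1 billion bp and non-overlapping regions. Both are simplifying assumptions made here, not statements from the slides.

```python
import math

GENOME_BP = 3_100_000_000  # assumed approximate human genome length
LOCAL_INDEX_BP = 56_000    # region covered by one local index (from the slide)


def local_index_count(genome_bp: int, region_bp: int) -> int:
    """Number of fixed-size, non-overlapping regions needed to tile a genome."""
    return math.ceil(genome_bp / region_bp)


print(local_index_count(GENOME_BP, LOCAL_INDEX_BP))  # 55358
```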

HISAT2 Index

HISAT2 first tries to identify the positions from which the read may have
originated (on the whole genome)

This is done by first using the global index, which gives a small set of candidates

The remainder of the read is then aligned by selecting a local index

Mates are aligned separately, then the alignments are combined

Searching a global FM index of the human genome suffers from many cache
'misses'; a local index is much smaller and fits in the cache
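
The two-level strategy above can be illustrated with a toy example. This is not HISAT2's implementation: a hash table of k-mers stands in for the global FM index, a bounded substring search stands in for the local index, and the seed length `K`, the window size `WINDOW` and the helper names are all hypothetical. Window-boundary cases are ignored.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

K = 8         # seed length for the toy global index (assumption)
WINDOW = 200  # toy "local index" size; the slides quote 56,000 bp


def build_global_index(genome: str, k: int = K) -> Dict[str, List[int]]:
    """Map every k-mer to its start positions (stand-in for a global FM index)."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def align_spliced(
    read: str, genome: str, index: Dict[str, List[int]], anchor_len: int
) -> Optional[Tuple[int, int]]:
    """Return (start of left part, start of right part), or None if not found."""
    left, right = read[:anchor_len], read[anchor_len:]
    # Global step: look up the first K bases of the long part of the read.
    for pos in index.get(left[:K], []):
        # Verify the whole left part by direct comparison with the genome.
        if genome[pos:pos + len(left)] != left:
            continue
        # Local step: search for the rest only inside the window holding the anchor.
        window_end = (pos // WINDOW) * WINDOW + WINDOW
        hit = genome.find(right, pos + len(left), window_end)
        if hit != -1:
            return pos, hit
    return None


# Toy genome: exon (16 bp) + intron (30 bp of T) + exon (12 bp), padded with A.
exon1, exon2 = "ACGTTGCAGGCTAACG", "GGATCCGATTAC"
genome = "A" * 20 + exon1 + "T" * 30 + exon2 + "A" * 20
print(align_spliced(exon1 + exon2, genome, build_global_index(genome), len(exon1)))
```

Restricting the second search to one small window is what keeps the candidate set small for the short part of the read.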

HISAT2 Alignment Extension

Once the location of a read is known, the read sequence is compared
directly with the genomic sequence

Requires the entire genomic sequence to be loaded into memory (682 MB
for human)

Smart combining can dramatically reduce the use of relatively slow
operations such as global/local search
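
Extension by direct comparison can be sketched as a simple loop. The function name, the mismatch budget and the rightward-only direction are assumptions made for illustration. HISAT2's own extension logic is not reproduced here.

```python
def extend_right(
    read: str, genome: str, read_pos: int, genome_pos: int, max_mismatches: int = 2
) -> int:
    """Compare read and genome base by base, starting at the given offsets.

    Returns the number of read bases covered before the mismatch budget
    is exceeded or either sequence ends.
    """
    mismatches = 0
    n = 0
    while read_pos + n < len(read) and genome_pos + n < len(genome):
        if read[read_pos + n] != genome[genome_pos + n]:
            mismatches += 1
            if mismatches > max_mismatches:
                break  # the base that exceeded the budget is not counted
        n += 1
    return n


print(extend_right("ACGTACGT", "TTACGTACGTTT", 0, 2))
```

No index lookup is involved, which is why this step is cheap compared with a global or local search.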

RNA-Seq Read Types

Junction reads with intermediate, 8-15 bp anchors

An 8-bp sequence is expected to occur ~48,000 times in the human genome!

In HISAT, each local index covers 64,000 bp; over 90% of annotated
human introns are completely contained in one of these indexes

After mapping the longer part, HISAT can usually align the remaining
small anchor within a single local index
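
The expected number of occurrences of a short anchor can be estimated with a back-of-the-envelope model. The sketch assumes uniformly random bases and a genome of about 3.1 billion bp. Both are simplifications, so the result only approximates the ~48,000 quoted above.

```python
def expected_occurrences(k: int, genome_bp: float = 3.1e9) -> float:
    """Expected hits of one specific k-mer in a random genome of genome_bp bases."""
    return genome_bp / 4 ** k


for k in (4, 8, 15):
    print(k, round(expected_occurrences(k)))
```

Under the same model, a 56,000 bp or 64,000 bp local index is expected to contain about one copy of a given 8-mer, which is why anchors of this size can be placed locally but not globally.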

Junction reads with short, 1-7 bp anchors

Shorter anchors are expected to occur even more often in the genome

Best option: use known splice sites

After mapping the longer part, HISAT can usually align the remaining
small anchor within a single local index

a) global + extension

b) global + extension -> local

c) global + extension -> local + extension

For short-anchored reads (1-7 bp), even the local index finds too many
potential locations

The best solution in these cases is to use known splice sites

Achieves the sensitivity of 2-pass algorithms

Also used for reads spanning more than 2 exons

Pseudogenes

Nonfunctional copies of genes

Reads originating from the genes fit the pseudogene copies almost
perfectly

2.7% of annotated human genes have pseudogene copies

Speed

Sensitivity

RNA-seq data analysis

qualitative vs. quantitative

alignment, assembly, relative abundance, differential expression

RNA-seq quantification vs. microarrays or qRT-PCR

technical or biological replicates


Source: http://hdl.handle.net/2345/3145

RNA-seq data (1) (variance and ambiguity)

sampling variance (biological replicates)

technical or biological variance (both technical and biological replicates)

alternative splicing

mapping ambiguity (multiple mapping)

Source: http://dx.doi.org/10.13070/mm.en.3.203

RNA-seq data (2) (biases)

fragment length distribution (small fragments -> more ambiguity)

positional and sequence-specific bias (due to priming or fragmentation)

sequencing errors (error model: mismatches & indels)

Source: doi:10.1186/gb-2011-12-3-r22

RNA-seq data (3) (normalization)

within-sample normalization (transcript length + sequencing depth)

between-sample normalization (coverage + control set)
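
Within-sample normalization by transcript length and sequencing depth is commonly expressed as FPKM or TPM. The slide's own formula did not survive extraction, so the two definitions below are the standard ones, used here as an assumption about what was shown. The counts and lengths are made-up illustration values.

```python
from typing import List, Sequence


def fpkm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = 1e6 / sum(rates)
    return [r * scale for r in rates]


counts = [100, 200, 700]      # made-up fragment counts per transcript
lengths = [1000, 2000, 1000]  # made-up transcript lengths in bp
print(fpkm(counts, lengths))
print(tpm(counts, lengths))
```

The first two transcripts get the same value under both measures: the second has twice the fragments but is twice as long.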

Statistical background (1)

type of problem -> counting fragments (Poisson, binomial, multinomial)

Likelihood vs. probability

If probability is a function of the data given some params, then
likelihood is a function of those params given the data

Maximum Likelihood Estimation (MLE)
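
A minimal sketch of maximum likelihood estimation for a counting problem, assuming a binomial model in which k of n fragments come from a transcript of interest. The numbers are made up. The estimate is found numerically on a grid and compared with the analytic answer k / n.

```python
import math


def binom_log_likelihood(p: float, k: int, n: int) -> float:
    """Log-likelihood of success probability p given k successes in n trials."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)


def mle_grid(k: int, n: int, steps: int = 1000) -> float:
    """Numerical MLE: the grid point in (0, 1) with the highest log-likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: binom_log_likelihood(p, k, n))


k, n = 30, 100  # made-up data: 30 of 100 fragments hit the transcript
print("numerical:", mle_grid(k, n))
print("analytic: ", k / n)
```

The same function is read two ways: with p fixed it gives probabilities of different data, and with the data fixed it is the likelihood of different values of p.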

Statistical background (2)

MLE: analytically or numerically

Frequentist: data -> model fit -> validate model params (MLE)

Bayesian: prior model + likelihood (given data) -> posterior model

Bayesian interpretation closer to algorithmic approach (treats
additional data, hidden params)
cruncher notebook

MLE example (RNA)

Adapted from: Lior Pachter 2011, arXiv:1104.3889v2

Expectation-Maximization (EM) algorithm

Source: http://research.microsoft.com/en-us/um/people/cmbishop/prml/

Source: http://artint.info/html/ArtInt_255.html

RNA example EM
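
A toy batch EM for transcript abundances with multi-mapping reads. This is a sketch of the general technique, not eXpress's online algorithm. It assumes all transcripts have equal effective length and that every compatible alignment is equally likely a priori. The read data are made up.

```python
from typing import List, Sequence


def em_abundance(
    compat: Sequence[Sequence[int]], n_transcripts: int, n_iter: int = 200
) -> List[float]:
    """Estimate relative abundances from read-to-transcript compatibilities.

    compat[i] lists the transcript indices that read i aligns to.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform starting point
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: split each read among its transcripts in proportion to theta.
        for transcripts in compat:
            z = sum(theta[t] for t in transcripts)
            for t in transcripts:
                counts[t] += theta[t] / z
        # M-step: new abundances are the normalized expected counts.
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta


# Made-up data: 6 reads unique to transcript 0, 2 unique to transcript 1,
# and 4 reads that map to both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
print(em_abundance(reads, n_transcripts=2))
```

The ambiguous reads end up split 3:1, the same ratio as the uniquely mapped reads, so the estimates converge to 0.75 and 0.25.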

EM flavours

Source: http://cs.stanford.edu/~pliang/papers/online-naacl2009.pdf p. 3

eXpress EM features (1)

Supplementary material (left: impact of algorithm flavour, right: forgetting param)

eXpress EM features (2)

priors (error model, fragment length, abundances)

Kullback-Leibler divergence at 10^-6

generative model

eXpress CLI options

$ express [options]* <target_seqs.fa> <aligned_reads.bam>


(some) Standard options:

--frag-len-mean / --frag-len-stddev

--haplotype-file

--output-align-prob / --output-align-samp

--fr-stranded / --rf-stranded / --f-stranded

(some) Advanced options:

--forget-param

--calc-covar

--expr-alpha <float>

--aux-param-file

https://igor.sbgenomics.com/u/dusan_randjelovic/express-1-5-1-demo/apps/#express/
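
A minimal sketch of assembling an `express` call from the options listed on this slide. It only builds the argument list and does not run the tool. The function and parameter names are hypothetical, the file names are placeholders, and only flags that appear on the slide are used.

```python
from typing import List, Optional


def express_cmd(
    target_fasta: str,
    alignments_bam: str,
    frag_len_mean: Optional[int] = None,
    frag_len_stddev: Optional[int] = None,
    strandedness: Optional[str] = None,  # "fr", "rf" or "f"
    calc_covar: bool = False,
) -> List[str]:
    """Assemble (but do not run) an express argument list."""
    cmd = ["express"]
    if frag_len_mean is not None:
        cmd += ["--frag-len-mean", str(frag_len_mean)]
    if frag_len_stddev is not None:
        cmd += ["--frag-len-stddev", str(frag_len_stddev)]
    if strandedness is not None:
        if strandedness not in ("fr", "rf", "f"):
            raise ValueError("strandedness must be 'fr', 'rf' or 'f'")
        # Taking a single value means at most one strandedness flag is emitted.
        cmd.append(f"--{strandedness}-stranded")
    if calc_covar:
        cmd.append("--calc-covar")
    # Positional arguments come last: target sequences, then alignments.
    cmd += [target_fasta, alignments_bam]
    return cmd


# Placeholder file names, for illustration only.
print(" ".join(express_cmd("transcripts.fa", "hits.bam", frag_len_mean=200, strandedness="fr")))
```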

eXpress auxiliary params

stops learning at 5M reads

eXpress output

results.xprs

params.xprs

varcov.xprs

augmented alignments (BAMs)

original BAM with probabilities for each alignment

new BAM with sampled alignments for each fragment

eXpress common runtime errors

mutually exclusive params:

--output-align-prob & --output-align-samp

strandedness options

additional rounds of EM (both batch & online)

Input file contains no valid alignments (for paired-end reads with
single-end stranded options)
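
Conflicts between mutually exclusive options can be caught before a job is submitted. The pre-flight check below is a hypothetical helper, not part of eXpress. It covers only the flags named in these slides, so the additional-rounds-of-EM options are left out because their flag names are not given here.

```python
from typing import Iterable, List

# Groups of flags of which at most one may be given (from the slides).
EXCLUSIVE_GROUPS = [
    {"--output-align-prob", "--output-align-samp"},
    {"--fr-stranded", "--rf-stranded", "--f-stranded"},
]


def conflicting_options(args: Iterable[str]) -> List[List[str]]:
    """Return every group of mutually exclusive flags used together in args."""
    given = set(args)
    return [
        sorted(group & given)
        for group in EXCLUSIVE_GROUPS
        if len(group & given) > 1
    ]


print(conflicting_options(["--output-align-prob", "--output-align-samp", "--calc-covar"]))
```

An empty list means no conflict was detected among the flags this helper knows about.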

eXpress benchmark

Call to action

Benchmark eXpress (w/ alignment) vs. RSEM on SBG (@Uros)

+ vs. Cufflinks on SBG

+ vs. Kallisto/Sailfish/Salmon (on SBG?!)
