HISAT2

RNA-Seq Alignment
HISAT2
Jelena Nadj
Seven Bridges Genomics
March 3rd, 2016
Introduction
Design
EM algorithm
Implementation
HISAT2 overview
Highly efficient toolkit for mapping next-generation sequencing reads (both DNA and RNA) against the general human population
Uses an indexing scheme based on the BWT and GCSA (an extension of the BWT to graphs)
Authors claim HISAT is the fastest system currently available
Despite the large number of indexes, requires only 4.3GB of memory
Supports genomes of any size, including those larger than 4 billion bases
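To illustrate the BWT-based indexing mentioned above, here is a minimal sketch of a Burrows-Wheeler transform with FM-index backward search on a plain string. It is a toy for intuition only, not HISAT2's graph index: the function names are made up for this example, and a real index stores sampled occurrence counts instead of rescanning the string.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy implementation)."""
    s = text + "$"  # "$" is a sentinel that sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern using FM-index backward search."""
    # first[c] = number of characters in the text that sort before c
    first = {}
    total = 0
    for c in sorted(set(bwt_str)):
        first[c] = total
        total += bwt_str.count(c)

    def occ(c, i):
        # Occurrences of c in bwt_str[:i]; recomputed here for simplicity
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):  # the pattern is consumed right to left
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Usage: count_occurrences(bwt("ACGTACGTGA"), "ACG") counts how often "ACG" occurs in the toy reference without scanning the reference itself.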
HISAT2 toolkit info
Contributors: Daehwan Kim, Ben Langmead, Joe Paggi, Geo Pertea & Steven Salzberg (TopHat2 developers)
Homepage
Github
Kim D, Langmead B and Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 2015
GPLv3 license
Latest version released on Nov 19th, 2015
Runs on Linux, Windows and Mac OS X
HISAT2 modules
HISAT2 Build (indexer)
Based on GCSA (an extension of BWT for a graph)
One global GFM index that represents the general population
Large set of small GFM indexes (each index representing 56 Kbp, with 55,000 indexes needed to cover the human population)
Takes a list of reference files (required) and, optionally, exons, splice sites and SNPs
HISAT2 modules
HISAT2 (aligner)
Apart from the reads, the only required input is the set of indexes produced by the HISAT2 Build indexer
Optionally takes splice site, exon and SNP information (in the HISAT2 format), as well as new splice site information derived from a previous iteration of HISAT2
Returns a single SAM file, plus optional metrics files
HISAT2 Index
In contrast to most other aligners, HISAT2 employs two different types of indexes:
A global GFM index that represents the entire genome
Numerous small GFM (FM) indexes for regions that collectively cover the genome; each index represents 56,000 bp, so ~55,000 indexes are needed to cover the human genome
Optimizations to minimize memory requirements allow storing the whole human genome index in about 4 GB of space, or about 6.2 GB with 12M SNPs (and indels)
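The ~55,000 figure follows from simple division. The sketch below assumes a human genome length of 3.1 Gbp (an assumption, not a number from the slides) and non-overlapping local windows, which is a simplification.

```python
import math

GENOME_LENGTH_BP = 3_100_000_000  # assumed human genome size, not from the slides
LOCAL_INDEX_SPAN_BP = 56_000      # span of one local index, as stated on the slide


def local_index_count(genome_bp, span_bp):
    """Number of non-overlapping windows of span_bp needed to cover genome_bp."""
    return math.ceil(genome_bp / span_bp)


def local_index_for(position_bp, span_bp):
    """Zero-based number of the window that contains a genomic position."""
    return position_bp // span_bp
```

Under these assumptions local_index_count(GENOME_LENGTH_BP, LOCAL_INDEX_SPAN_BP) lands close to the ~55,000 quoted above.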
HISAT2 Index
HISAT2 first tries to identify the positions from which the read may have originated (on the whole genome)
This is done by first using the global index, which gives a small set of candidates
The remainder of the read is then aligned by selecting a local index
Mates are aligned separately, then the alignments are combined
Search through a global FM index of the human genome suffers from many cache misses; a local index is much smaller and fits in the cache
HISAT2 Alignment Extension
After the location of a read is known, the read sequence is compared directly with the genomic sequence
Requires the entire genomic sequence to be loaded into memory (682 MB for human)
Smart combining can dramatically reduce the use of relatively slow operations such as global/local search
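The extension step can be pictured as a direct base-by-base comparison once a candidate position is known. The sketch below is a simplified stand-in (mismatches only, no indels or splicing); the function name and the mismatch limit are assumptions made for the example.

```python
def extend_alignment(genome, read, pos, max_mismatches=2):
    """Compare read against genome starting at pos.

    Returns the mismatch count, or None if the read runs off the reference
    or exceeds max_mismatches.
    """
    if pos < 0 or pos + len(read) > len(genome):
        return None
    mismatches = 0
    for offset, base in enumerate(read):
        if genome[pos + offset] != base:
            mismatches += 1
            if mismatches > max_mismatches:
                return None  # give up early, this candidate is too poor
    return mismatches
```

This needs random access to the reference sequence, which is why the whole genome has to be held in memory.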
RNA-Seq Read Types
Junction reads with intermediate, 8-15 bp anchors
An 8-bp sequence is expected to occur ~48,000 times in the human genome!
In HISAT, each local index covers 64,000 bp; over 90% of annotated human introns are completely contained in one of these indexes
After mapping the longer part, HISAT can usually align the remaining small anchor within a single local index
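The anchor numbers above can be checked with a back-of-the-envelope estimate: under a uniform random-sequence model, a k-bp anchor is expected to occur about (region length) / 4^k times. The genome length of 3.1 Gbp is an assumption, and the uniform model ignores repeats and base composition.

```python
GENOME_LENGTH_BP = 3_100_000_000  # assumed human genome size, not from the slides
LOCAL_INDEX_SPAN_BP = 64_000      # HISAT local index span, as stated on the slide


def expected_random_hits(anchor_len, region_bp):
    """Expected occurrences of one fixed k-mer in a random region of region_bp bases."""
    return region_bp / 4 ** anchor_len


genome_hits = expected_random_hits(8, GENOME_LENGTH_BP)    # tens of thousands
local_hits = expected_random_hits(8, LOCAL_INDEX_SPAN_BP)  # below one
```

The same 8-bp anchor that is hopelessly ambiguous genome-wide is expected to occur less than once inside a single local index, which is why restricting the search to a local index works.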
Junction reads with short, 1-7 bp anchors
Shorter anchors are expected to occur even more often in the genome
Best option: use known splice sites
After mapping the longer part, HISAT can usually align the remaining small anchor within a single local index
a) global + extension
b) global + extension -> local
c) global + extension -> local + extension
For short-anchored reads (1-7 bp), even the local index finds too many potential locations
The best solution in these cases is to use known splice sites
Achieves the sensitivity of 2-pass algorithms
Also used for reads spanning more than 2 exons
Pseudogenes
Nonfunctional copies of genes
Reads originating from the genes also fit the pseudogene copies almost perfectly
2.7% of annotated human genes have pseudogene copies
HISAT2 vs Others: Speed
HISAT2 vs Others: Sensitivity
RNA-seq data analysis
qualitative vs. quantitative
alignment, assembly, relative abundance, diff expression
RNA-seq quantification vs. microarrays or qRT-PCR
technical or biological replicates

Source: http://hdl.handle.net/2345/3145
RNA-seq data (1) (variance and ambiguity)
sampling variance (biological replicates)
technical or biological variance (both technical and biological rep.)
alternative splicing: mapping ambiguity (multiple mapping)
Source: http://dx.doi.org/10.13070/mm.en.3.203
RNA-seq data (2) (biases)
fragment length distribution (small fragments -> more ambiguity)
positional and sequence-specific (due to priming or fragmentation)
sequencing errors (error model - mismatches & indels)
Source: doi:10.1186/gb-2011-12-3-r22
RNA-seq data (3) (normalization)
within-sample normalization (transcript length + sequencing depth)
between-sample normalization (coverage + control set)
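As an illustration of within-sample normalization by transcript length and sequencing depth, the sketch below computes FPKM and TPM from raw fragment counts. That the slide had these particular measures in mind is an assumption; the function and argument names are made up for the example.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_millions = sum(counts.values()) / 1e6
    return {
        t: counts[t] / (lengths_bp[t] / 1e3) / total_millions
        for t in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {t: counts[t] / lengths_bp[t] for t in counts}
    scale = sum(rates.values())
    return {t: 1e6 * r / scale for t, r in rates.items()}
```

With equal counts on a 1 kb and a 2 kb transcript, both measures report the shorter transcript as twice as abundant, which is the point of the length correction.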
Statistical background (1)
type of problem -> counting fragments (Poisson, binomial, multinomial)
Likelihood vs. probability
If probability is a function of the data given some params, then likelihood is a function of those params given the data
Maximum Likelihood Estimation (MLE)
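A small sketch of the likelihood idea for fragment counting, assuming a multinomial model in which every fragment comes from exactly one transcript: the counts are fixed data, the transcript proportions are the parameters, and the MLE is simply the observed proportions. Names are illustrative.

```python
import math


def log_likelihood(counts, probs):
    """Multinomial log-likelihood of probs given counts (constant term dropped)."""
    return sum(n * math.log(probs[t]) for t, n in counts.items() if n > 0)


def mle(counts):
    """Closed-form maximum likelihood estimate: observed proportions."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}
```

Evaluating log_likelihood with the same counts but different probs shows the "function of the params given the data" view; no other choice of probs scores higher than mle(counts).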
Statistical background (2)
MLE: analytically or numerically
Frequentist: data -> model fit -> validate model params (MLE)
Bayesian: prior model + likelihood (given data) -> posterior model
Bayesian interpretation closer to algorithmic approach (treats additional data, hidden params)
cruncher notebook
MLE example (RNA)
Adapted from: Lior Pachter 2011, arxiv: 1104.3889v2
Expectation-Maximization (EM) algorithm
Source: http://research.microsoft.com/en-us/um/people/cmbishop/prml/
Source: http://artint.info/html/ArtInt_255.html
RNA example EM
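A toy batch EM for transcript abundances with multi-mapping reads, as a sketch of the general technique and not of eXpress itself. It assumes each read is summarized only by the set of transcripts it is compatible with, and it ignores transcript length, fragment length, bias and error models; all names are made up for the example.

```python
def em_abundances(read_compat, transcripts, n_iter=200):
    """Estimate the fraction of reads coming from each transcript.

    read_compat: list of sets, one per read, holding compatible transcript names.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start
    for _ in range(n_iter):
        # E-step: split every read among its transcripts in proportion to theta
        expected = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            if not compat:
                continue  # unalignable reads carry no information here
            norm = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / norm
        # M-step: renormalize the expected counts into proportions
        total = sum(expected.values())
        theta = {t: expected[t] / total for t in transcripts}
    return theta
```

Usage: with 6 reads unique to A, 2 unique to B and 4 compatible with both, the ambiguous reads end up assigned mostly to A, because the unique reads already point to A being more abundant.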
EM flavours
Source: http://cs.stanford.edu/~pliang/papers/online-naacl2009.pdf p. 3
eXpress EM features (1)
Supplementary material (left: impact of algorithm flavour, right: forgetting param)
eXpress EM features (2)
priors
(error model, fragment length, abundances)
Kullback-Leibler divergence at 10^-6
generative model
eXpress CLI options
$ express [options]* <target_seqs.fa> <aligned_reads.bam>

(some) Standard options:
(some) Advanced options:
--frag-len-mean / --frag-len-stddev
--forget-param
--haplotype-file
--calc-covar
--output-align-prob / --output-align-samp
--expr-alpha <float>
--fr-stranded / --rf-stranded / --f-stranded
--aux-param-file
https://igor.sbgenomics.com/u/dusan_randjelovic/express-1-5-1-demo/apps/#express/
eXpress auxiliary params
stops learning at 5M reads
eXpress output
results.xprs
params.xprs
varcov.xprs
augmented alignments (BAMs)
original BAM with probabilities for each alignment
new BAM with sampled alignments for each fragment
eXpress common runtime errors
mutually exclusive params: --output-align-prob & --output-align-samp
strandedness options
additional rounds of EM (both batch & online)
"Input file contains no valid alignments" (for paired-end reads with single-end stranded options)
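One way to catch the first of these errors before launching a job is to validate the option list up front. The helper below is a hypothetical wrapper written for this example; only the flag names and the argument order (options, then target FASTA, then BAM) come from the slides.

```python
# Pairs of flags that the slides list as mutually exclusive
MUTUALLY_EXCLUSIVE = [("--output-align-prob", "--output-align-samp")]


def build_express_command(target_fasta, alignments_bam, options=()):
    """Return an argv list for eXpress, rejecting conflicting options."""
    options = list(options)
    for first, second in MUTUALLY_EXCLUSIVE:
        if first in options and second in options:
            raise ValueError(f"{first} and {second} cannot be used together")
    # Order follows: express [options]* <target_seqs.fa> <aligned_reads.bam>
    return ["express", *options, target_fasta, alignments_bam]
```

The returned list can be passed to subprocess.run; the check itself runs without eXpress installed.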
eXpress benchmark
Call to action
Benchmark eXpress (w/ alignment) vs. RSEM on SBG (@Uros)
+ vs. Cufflinks on SBG
+ vs. Kallisto/Sailfish/Salmon (on SBG?!)
