Overview of mapping & processing
[Workflow figure: DNAseq and RNAseq branches, each using a different mapper]
Reference genome
Problem: GATK tools ALL depend on a reference genome. Mapping to a reference is NOT OPTIONAL.
Solution: Create a reference by assembling a representative sample.
BUT be careful if your populations are very divergent. Consider creating a hybrid reference?
Note on draft genomes (especially bacteria)
[Figure: sample compared with a draft reference; labels: truncated region, duplicated region]
[Figure: enormous pile of short reads from NGS; labels: Easy, Harder]
Mapping produces a SAM alignment
summarizing position, quality, and structure for a given sequence
[Figure: annotated SAM record, including mate information]
See also:
SAM format spec: http://samtools.github.io/hts-specs/SAMv1.pdf
Explain SAM flags: http://broadinstitute.github.io/picard/explain-flags.html
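The FLAG column of a SAM record packs several yes/no properties (including the mate information) into one integer. A minimal Python sketch of decoding it is shown here; the bit values follow the FLAG table in the SAM format spec, while the function name `explain_flag` and the wording of the descriptions are illustrative choices, not part of any tool.

```python
# Bit meanings follow the FLAG table in the SAM format specification.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the descriptions of every bit set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


if __name__ == "__main__":
    # 99 = 1 + 2 + 32 + 64
    for description in explain_flag(99):
        print(description)
```

The 0x100 bit is the one that the BWA -M flag sets on extra alignment hits, and 0x400 is the bit set when duplicates are marked.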
For DNAseq: map reads using BWA
Burrows-Wheeler Aligner
mem algorithm for 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences
Use the -M flag to flag extra alignment hits as secondary (for downstream compatibility)
A quick side note about sorting and read groups
The information for this: [figure] is actually stored as a text file with one line per read, which from far away looks like this: [figure]
but the GATK wants reads to be sorted by starting position, like this: [figure]
And while we're at it, let's add read group information if it isn't already there, so the GATK will know what read belongs to what sample (and library)
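The two ideas on this slide, sorting by starting position and tagging each read with a read group, can be sketched in a few lines of Python. This is only an illustration on simplified records: the function names, the tuple layout, and the example values are assumptions, and a real pipeline would use Picard rather than code like this. A complete read group also needs a matching @RG line in the SAM header, which the sketch does not write.

```python
def coordinate_sort(records, contig_order):
    """Sort (read_name, contig, start) tuples by contig order, then start position.

    contig_order is the list of contig names in the order the reference lists them.
    """
    rank = {contig: i for i, contig in enumerate(contig_order)}
    return sorted(records, key=lambda r: (rank[r[1]], r[2]))


def add_read_group(sam_line, rg_id):
    """Append an RG:Z: tag to one SAM record line unless it already has one."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 fields are mandatory; optional tags start at index 11.
    if any(field.startswith("RG:Z:") for field in fields[11:]):
        return "\t".join(fields)
    return "\t".join(fields + ["RG:Z:" + rg_id])


if __name__ == "__main__":
    reads = [("r3", "chr2", 5), ("r1", "chr1", 200), ("r2", "chr1", 10)]
    for read in coordinate_sort(reads, ["chr1", "chr2"]):
        print(read)
```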
[Workflow figure; labels: Reference genome, via SamToFastq, Raw mapped SAM, MergeBamAlignment]
Use BWA MEM algorithm with -M flag and also -R to add read group info
Optionally run Picard CleanSam and FixMateInformation and set SO=coordinate
If you forgot to feed BWA the read group info, add it with Picard AddOrReplaceReadGroups
Will not clip overhanging bases for short inserts!
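As a rough illustration of the mapping step, the following Python sketch assembles a `bwa mem` argument list with the -M and -R options named above. The helper names, file names, and read group values are hypothetical; the @RG string uses the literal `\t` separators that `bwa mem -R` accepts. The sketch only builds the command and does not run it.

```python
def read_group_header(rg_id, sample, library, platform="ILLUMINA"):
    """Build the @RG line for bwa mem -R, with tabs written as literal backslash-t."""
    return "@RG\\tID:{}\\tSM:{}\\tLB:{}\\tPL:{}".format(rg_id, sample, library, platform)


def bwa_mem_command(reference, fastq1, fastq2, rg_line):
    """Return the argument list for a paired-end bwa mem run with -M and -R."""
    return ["bwa", "mem", "-M", "-R", rg_line, reference, fastq1, fastq2]


if __name__ == "__main__":
    # All identifiers and file names below are placeholders.
    rg = read_group_header("rg1", "sampleA", "libA")
    print(" ".join(bwa_mem_command("reference.fasta", "reads_1.fastq", "reads_2.fastq", rg)))
```

Since bwa writes SAM to standard output, a list like this would typically be handed to `subprocess.run` with `stdout` pointed at the raw mapped SAM file.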
[Workflow figure: FASTQ + Reference genome -> BWA MEM -> Raw mapped SAM -> CleanSam + FixMate -> Mapped, cleaned, sorted SAM]
We now have properly mapped and sorted reads
MARKING DUPLICATES
Why mark duplicates?
Duplicates are sets of read pairs that have the same unclipped alignment start and unclipped alignment end
They're suspected to be non-independent measurements of a sequence
  Sampled from the exact same template of DNA
  Violates assumptions of variant calling
What's more, errors in sample/library prep will get propagated to all the duplicates
  Just picking the "best" copy mitigates the effects of errors
[Figure: reads mapped to the reference, with duplicates marked; legend: symbol = sequencing error propagated in duplicates]
How do we identify duplicate reads?
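The definition given earlier (same unclipped alignment start and end) can be sketched in Python as follows. This is a simplified illustration, not Picard's implementation: it treats single reads rather than read pairs, the record layout and the `score` field used to pick the "best" copy are assumptions, and all names are hypothetical.

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference
CLIPS = set("SH")             # soft and hard clips


def unclipped_span(pos, cigar):
    """Return (unclipped_start, unclipped_end) for a 1-based POS and a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    lead = 0
    for n, op in ops:
        if op not in CLIPS:
            break
        lead += n
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trail += n
    return pos - lead, pos + ref_len - 1 + trail


def mark_duplicates(reads):
    """Return the names of reads that share an unclipped span with a better-scoring read."""
    groups = defaultdict(list)
    for read in reads:
        start, end = unclipped_span(read["pos"], read["cigar"])
        groups[(read["contig"], read["strand"], start, end)].append(read)
    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["score"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates
```

Using unclipped coordinates matters because two copies of the same template can be soft-clipped differently by the aligner and so report different POS values.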
[Workflow figure: DNAseq and RNAseq branches, each using a different mapper]
RNAseq reads mapped across splice junctions need special handling
For RNAseq: map reads using STAR aligner
        exon_1             exon_2
REF     GATTC--------------AATTATT
        GATTC              AATTATT
SplitNCigarReads
Splits reads with Ns in the CIGAR string
Keeps grouping information per exon
Trims overhangs
For now, need to use -U ALLOW_N_CIGAR_READS
    java -jar GenomeAnalysisTK.jar \
        -T SplitNCigarReads \
        -R human.fasta \
        -I original.bam \
        -o SplitNCigar.bam \
        -U ALLOW_N_CIGAR_READS \
        -rf ReassignOneMappingQuality \
        -RMQF 255 -RMQT 60
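To make the splitting idea concrete, here is a small Python sketch that cuts a spliced alignment at each N operation in its CIGAR string and reports where each exon segment starts on the reference. It is an illustration only and not how GATK implements SplitNCigarReads: it omits the clipping of the removed parts and the overhang trimming, and the function name and example CIGAR are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MD=X")  # non-N operations that advance along the reference


def split_at_introns(pos, cigar):
    """Split a spliced alignment at N operations.

    Returns one (start, cigar) pair per exon segment, where start is the
    1-based reference position of that segment.
    """
    segments = []
    current = []
    start = pos
    cursor = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the segment before the intron and skip over the gap.
            segments.append((start, "".join(current)))
            current = []
            cursor += length
            start = cursor
        else:
            current.append("{}{}".format(length, op))
            if op in REF_CONSUMING:
                cursor += length
    segments.append((start, "".join(current)))
    return segments


if __name__ == "__main__":
    # A read spanning two exons separated by a 14-base gap (made-up example).
    print(split_at_introns(1, "5M14N7M"))
```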
We are here in the Best Practices workflow
Next Step: Indel Realignment talks
Further reading
http://www.broadinstitute.org/gatk/guide/best-practices
http://broadinstitute.github.io/picard/
http://bio-bwa.sourceforge.net/bwa.shtml
http://bioinformatics.oxfordjournals.org/content/29/1/15