Mapping and marking duplicates

From raw reads to GATK-ready reads


We are here in the Best Practices workflow
Mapping and Marking Duplicates
Key differences for mapping & processing RNAseq

Different mapper

Extra processing step

DNAseq RNAseq
Overview of mapping & processing

Reference genome

Enormous pile of short reads from NGS

Key processing steps:

Map reads to the reference & clean up
Mark PCR duplicates
(RNAseq only) Process reads that span splice junctions

Non-humans: no reference genome for your organism?

Problem :
GATK tools ALL depend on a reference genome
Mapping to reference is NOT OPTIONAL

Solution:
Create a reference by assembling a representative sample

BUT be careful if your populations are very divergent
Consider creating a hybrid reference?

Note on draft genomes (especially bacteria)

Draft genomes with many contigs will be very slow because GATK is not designed to handle this.

Concatenate contigs into supercontigs or throw out very small contigs that are unlikely to be informative.

Also consider throwing out or masking repeated sequences, transposons etc.
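
As a rough illustration of the "throw out very small contigs" advice, the sketch below drops contigs shorter than a chosen length from FASTA text. The function names, the toy sequences, and the 10 bp cutoff are all hypothetical; a sensible cutoff depends on your assembly.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (contig_name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_small_contigs(records: Iterable[Tuple[str, str]],
                       min_length: int) -> List[Tuple[str, str]]:
    """Keep contigs whose sequence is at least min_length bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Toy draft assembly (made-up sequences, for illustration only).
draft = """>contig1
ACGTACGTACGT
>contig2
ACG
>contig3
GGGGCCCCAA
""".splitlines()

kept = drop_small_contigs(read_fasta(draft), min_length=10)
print([name for name, _ in kept])
```

Concatenating contigs into supercontigs is a separate step that would also need spacer sequence and coordinate bookkeeping, which this sketch does not attempt.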

MAPPING
Ideally, we'd just align the sample genome to the reference genome

[Figure: the sample genome aligned to Regions 1, 2 and 3 of the reference, showing a truncated region, a duplicated region, and local variants (SNP/indel)]

But we don't have the whole sample in one piece.

We have a pile of reads that need to be mapped individually

Mapping is complicated by mismatches (true mutations or sequencing errors), indels, duplicated regions etc.

[Figure: an enormous pile of short reads from NGS mapped to Regions 1, 2 and 3 (easy) versus to Region 1 and the duplicated Regions 2A and 2B (harder)]
Mapping produces a SAM alignment
summarizing position, quality, and structure for a given sequence

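Fields labeled in the example record: POS (alignment start), MAPQ (quality), CIGAR (structure), SEQ (sequence), and mate information

read1 99 ref 2 30 3M1D2M1I1M = 14 20 CATCTAG *

In this record POS is 2, MAPQ is 30, CIGAR is 3M1D2M1I1M, SEQ is CATCTAG, and the mate information is "= 14 20".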

See also:
SAM format spec: http://samtools.github.io/hts-specs/SAMv1.pdf
Explain SAM flags: http://broadinstitute.github.io/picard/explain-flags.html
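
To make the field layout concrete, here is a minimal sketch that parses the example record from the slide into the 11 mandatory SAM columns and expands its CIGAR string. The class and function names are illustrative, not part of any library. Real SAM lines are tab-delimited; the sketch splits on whitespace only because the slide shows the record with spaces.

```python
import re
from typing import List, NamedTuple, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapped position
    mapq: int
    cigar: List[Tuple[int, str]]
    rnext: str    # mate information starts here
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '3M1D2M' into [(3, 'M'), (1, 'D'), (2, 'M')]."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def parse_sam_line(line: str) -> SamRecord:
    f = line.split()
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     parse_cigar(f[5]), f[6], int(f[7]), int(f[8]),
                     f[9], f[10])


def reference_end(rec: SamRecord) -> int:
    """Last reference position covered: M, D, N, = and X consume reference."""
    return rec.pos + sum(n for n, op in rec.cigar if op in "MDN=X") - 1


def query_length(rec: SamRecord) -> int:
    """Read bases described by the CIGAR: M, I, S, = and X consume the read."""
    return sum(n for n, op in rec.cigar if op in "MIS=X")


rec = parse_sam_line("read1 99 ref 2 30 3M1D2M1I1M = 14 20 CATCTAG *")
print(rec.pos, rec.mapq, rec.cigar, reference_end(rec), query_length(rec))
```

The CIGAR accounts for 7 read bases, matching the 7-base SEQ, and the alignment covers reference positions 2 through 8.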
For DNAseq: map reads using BWA

The BWA software package by Heng Li & Richard Durbin

http://bio-bwa.sourceforge.net/bwa.shtml

Burrows-Wheeler Aligner
mem algorithm for 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences
Use the -M flag to flag extra alignment hits as secondary (for downstream compatibility)
A quick side note about sorting and read groups

The information for this alignment is actually stored as a text file with one line per read.

The reads are in no particular order, but the GATK wants reads to be sorted by starting position.

So we need to explicitly sort the SAM file

And while we're at it, let's add read group information if it isn't already there, so the GATK will know what read belongs to what sample (and library)

-> All of this is done in the Picard pipeline by MergeBamAlignment

but if you are starting from FASTQ you will have to first create an unaligned BAM using FastqToSam
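
What "sorted by starting position" means can be sketched in a few lines: order reads by the reference's contig order, then by start position, with unmapped reads last. This is only a conceptual illustration with made-up read names and coordinates; on real data the sorting is done by the tools named above, not by hand.

```python
from typing import List, Sequence, Tuple

Read = Tuple[str, str, int]  # (read name, contig, 1-based start position)


def coordinate_sort(reads: Sequence[Read],
                    contig_order: Sequence[str]) -> List[Read]:
    """Sort reads by contig (in reference order), then by start position.

    Reads on a contig not in contig_order (e.g. '*' for unmapped)
    are placed at the end.
    """
    rank = {name: i for i, name in enumerate(contig_order)}
    last = len(rank)
    return sorted(reads, key=lambda r: (rank.get(r[1], last), r[2]))


# Hypothetical reads in the arbitrary order the mapper emitted them.
reads = [
    ("readC", "chr2", 50),
    ("readA", "chr1", 900),
    ("readD", "*", 0),
    ("readB", "chr1", 12),
]

sorted_reads = coordinate_sort(reads, contig_order=["chr1", "chr2"])
print([name for name, _, _ in sorted_reads])
```

Note that contigs are ordered as they appear in the reference, not alphabetically, which is why the contig list is passed in explicitly.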
Mapping as implemented in Broad production pipeline
Use BWA MEM algorithm with -M flag
Despite being best in class, BWA output has some artifacts that we like to clean up
Alignments clipping end of reference, missing mates, reads spilling into adapters...
Combine output with original unmapped SAM using MergeBamAlignment
Fixes up various mapping issues to remove erroneous mapping artifacts
Sorts reads by mapping position + adds read group info as well
Adds mate CIGAR information

[Figure: Unmapped SAM -> (via SamToFastq) -> BWA MEM with reference genome -> Raw mapped SAM -> MergeBamAlignment, which also takes the unmapped SAM -> Mapped, cleaned, sorted SAM]
Generic recommendation for mapping starting from FASTQ:

Use BWA MEM algorithm with -M flag and also -R to add read group info
Optionally run Picard CleanSam and FixMateInformation and set SO=coordinate
If you forgot to feed BWA the read group info, add it with Picard AddOrReplaceReadGroups
Will not clip overhanging bases for short inserts!
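
The first recommendation can be sketched as a small helper that assembles the `bwa mem` argument list with `-M` and `-R`. The helper names, file names, and read group values below are hypothetical placeholders; only the `-M` and `-R` flags come from the slide. The read group string uses a literal backslash-t between fields, which BWA converts to tabs.

```python
import shlex
from typing import List


def read_group_header(rg_id: str, sample: str, library: str,
                      platform: str = "ILLUMINA") -> str:
    """Build an @RG header line with literal '\\t' separators for bwa -R."""
    return "@RG\\tID:{}\\tSM:{}\\tLB:{}\\tPL:{}".format(
        rg_id, sample, library, platform)


def bwa_mem_command(reference: str, fastq1: str, fastq2: str,
                    read_group: str) -> List[str]:
    """Argument list for paired-end mapping; bwa writes SAM to stdout."""
    return ["bwa", "mem", "-M", "-R", read_group, reference, fastq1, fastq2]


# Placeholder inputs, not real files.
cmd = bwa_mem_command(
    "reference.fasta", "sample_R1.fastq", "sample_R2.fastq",
    read_group_header("rg1", "sample1", "lib1"),
)

# Print a shell-quoted version for inspection; the list form could be
# passed to subprocess.run with stdout redirected to a SAM file.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list rather than a single string avoids quoting problems with the read group value.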

[Figure: FASTQ -> BWA MEM with reference genome -> Raw mapped SAM -> CleanSam + FixMate -> Mapped, cleaned, sorted SAM]
We now have properly mapped and sorted reads

MARKING DUPLICATES
Why mark duplicates?

Duplicates are sets of read pairs that have the same unclipped alignment start and unclipped alignment end
They're suspected to be non-independent measurements of a sequence
Sampled from the exact same template of DNA
Violates assumptions of variant calling
What's more, errors in sample/library prep will get propagated to all the duplicates
Just picking the "best" copy mitigates the effects of errors
[Figure: mapped reads against the reference before and after Mark duplicates, with a sequencing error propagated in duplicates]
How do we identify duplicate reads?

Dupes come from the same input DNA template so reads will have same start position on reference!
Where was the first base that was sequenced?
For paired-end (PE) reads, same start for both ends

Identify duplicate sets, then choose representative read based on base quality scores and other criteria
But there's a catch (or two)

BWA sometimes clips bases from the ends of the alignment (when the alignment there is poor)
Fragments mapped to the reverse strand are specified by their 3' position, instead of 5'
Need to use SAM flags + CIGAR string to determine the unclipped 5' end
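
A minimal sketch of that calculation, assuming only the SAM conventions (flag bit 0x10 marks a reverse-strand alignment, POS is the leftmost mapped base, and S/H are clipping operators). The function name is illustrative and this is not the actual duplicate-marking implementation.

```python
import re
from typing import Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REVERSE_FLAG = 0x10          # SAM flag bit: read mapped to the reverse strand
REF_CONSUMING = set("MDN=X")  # operators that advance along the reference
CLIPS = set("SH")             # soft and hard clips


def unclipped_five_prime(flag: int, pos: int, cigar: str) -> Tuple[str, int]:
    """Return (strand, unclipped 5' reference position) for one alignment."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if flag & REVERSE_FLAG:
        # Reverse strand: the 5' end is the rightmost base, so walk to the
        # alignment end and add back any clipping at the right-hand side.
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        trailing = 0
        for n, op in reversed(ops):
            if op not in CLIPS:
                break
            trailing += n
        return "-", pos + ref_len - 1 + trailing
    # Forward strand: the 5' end is the leftmost base, so subtract any
    # clipping at the left-hand side from POS.
    leading = 0
    for n, op in ops:
        if op not in CLIPS:
            break
        leading += n
    return "+", pos - leading


# Hypothetical alignments: a forward read with two soft-clipped bases,
# and a reverse read with two hard-clipped bases at its right-hand end.
print(unclipped_five_prime(0, 3, "2S5M"))
print(unclipped_five_prime(16, 1, "5M2H"))
```

With these inputs the forward read resolves to position 1 even though its POS is 3, and the reverse read resolves to position 7 even though its aligned bases end at position 5.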
Identify duplicates using orientation + unclipped 5' position

Pos  1 2 3 4 5 6 7 8 9
Ref  T A G C C G A T C
r1   T A G C C G A
r2   T A G C C G A
r3   T A C CAG A
r4   T A G C C H H
r5   T A G C C G A T C
r6   S S G C C G A
r7       G C C G A

Blue maps to forward strand
Orange maps to reverse strand
Grey bases are clipped
Underlined is the expected 5' start of the read, given the mapping

So... what are the duplicate sets?
r1, r3, r5, r6 (start at position 1)
r2, r4 (start at position 7)
r7 (starts at position 3)
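
The grouping in this example can be sketched as follows. The strand and 5' position of each read are taken from the answers on the slide (the strand is inferred from them, since the colors are not available here). The base quality lists are hypothetical, and picking the read with the highest summed base quality is just one simple criterion, not necessarily the exact rule the tool applies.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (read name, strand, unclipped 5' position, base qualities)
Read = Tuple[str, str, int, List[int]]

reads: List[Read] = [
    ("r1", "+", 1, [30] * 7),
    ("r2", "-", 7, [30] * 7),
    ("r3", "+", 1, [20] * 7),
    ("r4", "-", 7, [30] * 5),
    ("r5", "+", 1, [30] * 9),
    ("r6", "+", 1, [30] * 5),
    ("r7", "+", 3, [30] * 5),
]


def duplicate_sets(reads: List[Read]) -> Dict[Tuple[str, int], List[Read]]:
    """Group reads that share orientation and unclipped 5' position."""
    groups: Dict[Tuple[str, int], List[Read]] = defaultdict(list)
    for read in reads:
        _, strand, five_prime, _ = read
        groups[(strand, five_prime)].append(read)
    return dict(groups)


def pick_representative(members: List[Read]) -> str:
    """Choose the read with the highest summed base quality."""
    return max(members, key=lambda r: sum(r[3]))[0]


sets = duplicate_sets(reads)
for key, members in sorted(sets.items()):
    names = [m[0] for m in members]
    print(key, names, "keep:", pick_representative(members))
```

Every read in a set other than the chosen representative would be flagged as a duplicate rather than deleted, so downstream tools can still see it if needed.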
So now we have mapped, sorted, and deduped reads

[Figure: the same reads displayed twice, once showing duplicate reads and once hiding duplicate reads]


We are here in the Best Practices workflow
Next Step: Indel Realignment

WAIT, WE'RE NOT DONE! RNASEQ
Key differences for mapping & processing RNAseq

Different mapper

Extra processing step

DNAseq RNAseq
RNAseq reads mapped across splice junctions need special handling
For RNAseq: map reads using STAR aligner

Highest sensitivity for both SNPs and indels among all programs tested

2-pass approach described in Pär G Engström et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods, 2013 (see Suppl. text p. 43 for detailed protocol)
First pass identifies splice junctions (SJ)
Use the SJ to guide the second round of alignment

STAR by Dobin et al., 2012 http://bioinformatics.oxfordjournals.org/content/29/1/15


New GATK tool called SplitNCigarReads

[Figure: a read spanning exon_1 and exon_2, shown against REF as GATTC--------------AATTATT, is split into GATTC and AATTATT]

SplitNCigarReads
splits reads with Ns in the CIGAR string
keeps grouping information per exon
trims overhangs
For now, need to use -U ALLOW_N_CIGAR_READS

Use ReassignOneMappingQuality read filter to reassign mapping qualities from 255 (unusable by GATK) to 60
TOOL TIPS
SplitNCigarReads

Splits reads with Ns in CIGAR


java -jar GenomeAnalysisTK.jar \
    -T SplitNCigarReads \
    -R human.fasta \
    -I original.bam \
    -o SplitNCigar.bam \
    -U ALLOW_N_CIGAR_READS \
    -rf ReassignOneMappingQuality \
    -RMQF 255 -RMQT 60
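
To show what splitting at N operators means, here is a simplified sketch that cuts one spliced alignment into per-exon segments and remaps a mapping quality of 255 to 60. It is a conceptual illustration only: the function names are made up, the start position and the intron length in the example CIGAR are hypothetical, and it does not trim overhangs or reproduce how the GATK tool actually writes its output records.

```python
import re
from typing import List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

Segment = Tuple[int, str, str]  # (start position, CIGAR, bases)


def split_n_cigar(pos: int, cigar: str, seq: str) -> List[Segment]:
    """Split an alignment into one segment per exon at each N operator."""
    segments: List[Segment] = []
    seg_ops: List[str] = []
    seg_start = pos          # reference start of the current segment
    ref_cursor = pos         # next reference position to be consumed
    query_cursor = 0         # next read base to be consumed
    seg_query_start = 0      # first read base of the current segment
    for length, op in ((int(n), o) for n, o in CIGAR_RE.findall(cigar)):
        if op == "N":
            # Close the current segment and skip over the intron.
            if seg_ops:
                segments.append((seg_start, "".join(seg_ops),
                                 seq[seg_query_start:query_cursor]))
            ref_cursor += length
            seg_ops, seg_start = [], ref_cursor
            seg_query_start = query_cursor
            continue
        seg_ops.append("{}{}".format(length, op))
        if op in "MIS=X":    # operators that consume read bases
            query_cursor += length
        if op in "MD=X":     # operators that consume reference bases
            ref_cursor += length
    if seg_ops:
        segments.append((seg_start, "".join(seg_ops),
                         seq[seg_query_start:query_cursor]))
    return segments


def reassign_mapq(mapq: int, from_value: int = 255, to_value: int = 60) -> int:
    """Replace one specific mapping quality value, leaving others unchanged."""
    return to_value if mapq == from_value else mapq


# The read from the slide, with a hypothetical start position and a
# hypothetical 14-base intron between the two exons.
parts = split_n_cigar(100, "5M14N7M", "GATTCAATTATT")
print(parts, reassign_mapq(255))
```

Each resulting segment is a plain, unspliced alignment, which is the form that tools not aware of introns can work with.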

We are here in the Best Practices workflow
Next Step: Indel Realignment
Further reading

http://www.broadinstitute.org/gatk/guide/best-practices
http://broadinstitute.github.io/picard/
http://bio-bwa.sourceforge.net/bwa.shtml
http://bioinformatics.oxfordjournals.org/content/29/1/15
