GATKwr8 B 1 Mapping and Processing

Mapping and marking duplicates
From raw reads to GATK-ready reads

We are here in the Best Practices workflow
Mapping and Marking Duplicates
Key differences for mapping & processing RNAseq
Different mapper
Extra processing step
DNAseq RNAseq
Overview of mapping & processing
Reference genome
Enormous pile of short reads from NGS
Key processing steps:
Map reads to the reference & clean up
Mark PCR duplicates
(RNAseq only) Process reads that span splice junctions

Non-humans: no reference genome for your organism?
Problem :
GATK tools ALL depend on a reference genome
Mapping to reference is NOT OPTIONAL
Solution:
Create a reference by assembling a representative sample
BUT be careful if your populations are very divergent
Consider creating a hybrid reference?

Note on draft genomes (especially bacteria)
Draft genomes with many contigs will be very slow because GATK is not designed to handle this.
Concatenate contigs into supercontigs or throw out very small contigs that are unlikely to be informative.
Also consider throwing out or masking repeated sequences, transposons etc.
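The contig clean-up suggested above could be sketched roughly as follows. This is a minimal illustration, not a GATK or Picard tool: the function name, the `min_length` cutoff, and the run of N bases used as a spacer between contigs are all assumptions chosen for the example.

```python
def filter_and_concatenate(contigs, min_length=1000, spacer_length=100):
    """Drop short contigs and join the rest into one supercontig.

    contigs: dict mapping contig name -> sequence string.
    min_length and spacer_length are illustrative defaults (assumptions).
    Returns the supercontig sequence and the 1-based start offset of each
    kept contig, so coordinates can be translated back later.
    """
    kept = {name: seq for name, seq in contigs.items() if len(seq) >= min_length}
    spacer = "N" * spacer_length
    offsets = {}
    pieces = []
    position = 1
    for name, seq in kept.items():
        offsets[name] = position
        pieces.append(seq)
        # The next contig starts after this sequence plus one spacer.
        position += len(seq) + spacer_length
    return spacer.join(pieces), offsets


supercontig, offsets = filter_and_concatenate(
    {"c1": "AAAAA", "c2": "CC", "c3": "GGGG"}, min_length=3, spacer_length=2
)
print(supercontig)
print(offsets)
```

Keeping the offset table is the important design choice here: variant positions called on the supercontig can then be mapped back to the original contig names.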

MAPPING
Ideally, we'd just align the sample genome to the reference genome
[Figure: sample aligned to reference across Region 1, Region 2 and Region 3, showing a truncated region, a duplicated region, and local variants (SNP/indel)]
But we don't have the whole sample in one piece.
We have a pile of reads that need to be mapped individually
Mapping is complicated by mismatches (true mutations or sequencing errors), indels, duplicated regions etc.
[Figure: an enormous pile of short reads from NGS mapped to the reference. Easy: Region 1, Region 2, Region 3. Harder: Region 1, Region 2A, Region 2B]
Mapping produces a SAM alignment
summarizing position, quality, and structure for a given sequence
read1 99 ref 2 30 3M1D2M1I1M = 14 20 CATCTAG *
Labeled fields: POS (alignment start) = 2; MAPQ (quality) = 30; CIGAR (structure) = 3M1D2M1I1M; mate information = "= 14 20"; SEQ (sequence) = CATCTAG
See also:
SAM format spec: http://samtools.github.io/hts-specs/SAMv1.pdf
Explain SAM flags: http://broadinstitute.github.io/picard/explain-flags.html
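To make the example record concrete, here is a small sketch that splits it into named fields, decodes the FLAG bits, and derives the reference span from the CIGAR. The helper names are hypothetical; the field order and flag bit meanings follow the SAM format spec. Real SAM files are tab-separated, while the slide's record is space-separated, so the sketch splits on any whitespace.

```python
import re

# Flag bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair", 0x100: "secondary",
    0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Split the 11 mandatory SAM fields into a dict (hypothetical helper)."""
    names = ["qname", "flag", "rname", "pos", "mapq", "cigar",
             "rnext", "pnext", "tlen", "seq", "qual"]
    record = dict(zip(names, line.split()[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def cigar_lengths(cigar):
    """Return (bases consumed on the reference, bases consumed on the read)."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    read_len = sum(n for n, op in ops if op in "MIS=X")
    return ref_len, read_len


rec = parse_sam_line("read1 99 ref 2 30 3M1D2M1I1M = 14 20 CATCTAG *")
ref_len, read_len = cigar_lengths(rec["cigar"])
print(decode_flag(rec["flag"]))
print("alignment spans", rec["pos"], "to", rec["pos"] + ref_len - 1)
```

For this record the read-consuming operations add up to 7 bases, which matches the length of SEQ (CATCTAG).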
For DNAseq: map reads using BWA
The BWA software package by Heng Li & Richard Durbin
http://bio-bwa.sourceforge.net/bwa.shtml
Burrows-Wheeler Aligner
mem algorithm for 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences
Use -M flag to flag extra alignment hits as secondary (for downstream compatibility)
A quick side note about sorting and read groups
The information for this: [figure]
is actually stored as a text file with one line per read, which from far away looks like this: [figure]
The reads are in no particular order
but the GATK wants reads to be sorted by starting position like this: [figure]
So we need to explicitly sort the SAM file
And while we're at it, let's add read group information if it isn't already there, so the GATK will know what read belongs to what sample (and library)
-> All of this is done in the Picard pipeline by MergeBamAlignment
but if you are starting from FASTQ you will have to first create an unaligned BAM using FastqToSam
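As a rough picture of what "sorted by starting position" means, the sketch below orders records by contig (in the order the reference lists them) and then by alignment start. It is only an illustration with hypothetical names and a dict-based record layout; in practice the sorting is done by the Picard tools named above, which also handle unmapped reads and header updates that this sketch ignores.

```python
def coordinate_sort(records, contig_order):
    """Sort mapped reads by reference contig, then by alignment start.

    records: list of dicts with "rname" (contig) and "pos" (1-based start).
    contig_order: contig names in reference order (assumed to come from
    the reference sequence dictionary).
    """
    rank = {name: index for index, name in enumerate(contig_order)}
    return sorted(records, key=lambda r: (rank[r["rname"]], r["pos"]))


reads = [
    {"qname": "readC", "rname": "chr2", "pos": 15},
    {"qname": "readA", "rname": "chr1", "pos": 300},
    {"qname": "readB", "rname": "chr1", "pos": 42},
]
for read in coordinate_sort(reads, ["chr1", "chr2"]):
    print(read["rname"], read["pos"], read["qname"])
```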
Mapping as implemented in Broad production pipeline
Use BWA MEM algorithm with -M flag
Despite being best in class, BWA output has some artifacts that we like to clean up
Alignments clipping end of reference, missing mates, reads spilling into adapters...
Combine output with original unmapped SAM using MergeBamAlignment
Fixes up various mapping issues to remove erroneous mapping artifacts
Sorts reads by mapping position + adds read group info as well
Adds mate CIGAR information
[Figure: workflow diagram. FASTQ is converted via FastqToSam to an unmapped SAM; BWA MEM maps the reads against the reference genome to give a raw mapped SAM; MergeBamAlignment combines the raw mapped SAM with the unmapped SAM to give a mapped, cleaned, sorted SAM]
Generic recommendation for mapping starting from FASTQ:
Use BWA MEM algorithm with -M flag and also -R to add read group info
Optionally run Picard CleanSam and FixMateInformation and set SO=coordinate
If you forgot to feed BWA the read group info, add it with Picard AddOrReplaceReadGroups
Will not clip overhanging bases for short inserts!
[Figure: workflow diagram. FASTQ and reference genome go into BWA MEM to give a raw mapped SAM; CleanSam + FixMate turn it into a mapped, cleaned, sorted SAM]
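The BWA step of this recommendation could be assembled as in the sketch below. It only builds the argument list and does not run anything. The `-M` and `-R` options are the ones named above; the helper names, file names, and read group values (ID, SM, LB, PL) are placeholders invented for the example. BWA expects the tab separators in the `-R` string to be written as a literal backslash followed by `t`.

```python
def read_group_header(rg_id, sample, library, platform="ILLUMINA"):
    """Build the @RG string for bwa mem -R.

    The fields are joined with a literal backslash-t, which bwa converts
    to real tabs when it writes the SAM header.
    """
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}", f"PL:{platform}"]
    return r"\t".join(fields)


def bwa_mem_command(reference, fastq1, fastq2, read_group):
    """Return the bwa mem argument list with -M and -R set (not executed)."""
    return ["bwa", "mem", "-M", "-R", read_group, reference, fastq1, fastq2]


# Placeholder names; substitute your own files and sample metadata.
rg = read_group_header("group1", "sample1", "lib1")
command = bwa_mem_command("reference.fasta", "reads_1.fastq", "reads_2.fastq", rg)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` with standard output redirected to a SAM file, since bwa mem writes alignments to standard output.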
We now have properly mapped and sorted reads

MARKING DUPLICATES
Why mark duplicates?
Duplicates are sets of read pairs that have the same unclipped alignment start and unclipped alignment end
They're suspected to be non-independent measurements of a sequence
Sampled from the exact same template of DNA
Violates assumptions of variant calling
What's more, errors in sample/library prep will get propagated to all the duplicates
Just pick the "best" copy: mitigates the effects of errors
[Figure: mapped reads against the reference, before and after marking duplicates; marker = sequencing error propagated in duplicates]
How do we identify duplicate reads?
Dupes come from the same input DNA template so reads will have same start position on reference!
Where was the first base that was sequenced?
For paired-end (PE) reads, same start for both ends
Identify duplicate sets, then choose representative read based on base quality scores and other criteria
But there's a catch (or two)
BWA sometimes clips bases from the ends of the alignment (when the alignment there is poor)
Fragments mapped to the reverse strand are specified by their 3' position, instead of 5'
Need to use SAM flags + CIGAR string to determine the unclipped 5' end
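A minimal sketch of that calculation, assuming the usual SAM conventions (POS is the leftmost aligned base, flag bit 0x10 marks reverse-strand reads, and both soft clips S and hard clips H count as clipped bases). The function name is hypothetical and this is not Picard's implementation, only an illustration of the idea.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, flag):
    """Return (strand, unclipped 5' position) for one mapped read.

    Forward reads: the 5' end is the leftmost base, so subtract any
    leading clipped bases from POS.
    Reverse reads: the 5' end is the rightmost base, so add the reference
    span and any trailing clipped bases to POS.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not flag & 0x10:
        leading = 0
        for length, op in ops:
            if op not in "SH":
                break
            leading += length
        return "+", pos - leading
    ref_span = sum(length for length, op in ops if op in "MDN=X")
    trailing = 0
    for length, op in reversed(ops):
        if op not in "SH":
            break
        trailing += length
    return "-", pos + ref_span - 1 + trailing


print(unclipped_five_prime(3, "2S5M", 0))    # forward read with two soft-clipped bases
print(unclipped_five_prime(1, "5M2H", 16))   # reverse read with two hard-clipped bases
```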
Identify duplicates using orientation + unclipped 5' position
Pos 1 2 3 4 5 6 7 8 9
Ref T A G C C G A T C
r1  T A G C C G A
r2  T A G C C G A
r3  T A C CAG A
r4  T A G C C H H
r5  T A G C C G A T C
r6  S S G C C G A
r7      G C C G A
Legend: blue maps to forward strand; orange maps to reverse strand; grey bases are clipped; underlined is the expected 5' start of the read, given the mapping
So what are the duplicate sets?
r1, r3, r5, r6 (start at position 1)
r2, r4 (start at position 7)
r7 (starts at position 3)
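The grouping in this example can be reproduced with a short sketch that keys each read on strand plus unclipped 5' position. The POS, CIGAR, and strand values below are a reading of the slide rather than data from a real file: the strand colors are lost in this transcript, so r2 and r4 are assumed to be the reverse-strand reads (consistent with the stated answer), and the CIGAR for r3 is a guess at its indel layout that does not affect its 5' position.

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def five_prime_key(pos, cigar, reverse):
    """Return (strand, unclipped 5' position) used to group duplicates."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if reverse:
        ops = ops[::-1]
    clipped = 0
    for length, op in ops:  # clips adjacent to the 5' end
        if op not in "SH":
            break
        clipped += length
    if not reverse:
        return "+", pos - clipped
    ref_span = sum(length for length, op in ops if op in "MDN=X")
    return "-", pos + ref_span - 1 + clipped


# name: (POS, CIGAR, maps to reverse strand) -- assumed from the slide
READS = {
    "r1": (1, "7M", False),
    "r2": (1, "7M", True),
    "r3": (1, "2M1D2M1I2M", False),  # assumed CIGAR
    "r4": (1, "5M2H", True),
    "r5": (1, "9M", False),
    "r6": (3, "2S5M", False),
    "r7": (3, "5M", False),
}


def duplicate_sets(reads):
    """Group read names by strand and unclipped 5' position."""
    groups = defaultdict(list)
    for name, (pos, cigar, reverse) in reads.items():
        groups[five_prime_key(pos, cigar, reverse)].append(name)
    return dict(groups)


for key, names in duplicate_sets(READS).items():
    print(key, names)
```

Choosing the representative read within each set (by base quality and other criteria) is left out, since the slide gives no quality values for these reads.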
So now we have mapped, sorted, and deduped reads
[Figure: the same reads displayed twice, once showing duplicate reads and once hiding duplicate reads]

Next Step: Indel Realignment

WAIT, WE'RE NOT DONE! RNASEQ
Key differences for mapping & processing RNAseq
Different mapper
Extra processing step
DNAseq RNAseq
RNAseq reads mapped across splice junctions need special handling
For RNAseq: map reads using STAR aligner
Highest sensitivity for both SNPs and indels among all programs tested
2-pass approach described in Pär G Engström et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods, 2013 (see Suppl. text p. 43 for detailed protocol)
First pass identifies splice junctions (SJ)
Use the SJ to guide the second round of alignment
STAR by Dobin et al., 2012 http://bioinformatics.oxfordjournals.org/content/29/1/15
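One way the 2-pass idea might be laid out is sketched below as four STAR invocations: index, align, re-index with the junctions found in the first pass, align again. The sketch only builds argument lists and runs nothing. Directory names, file names, the thread count, and the overhang value are placeholders (assumptions), and the cited supplementary protocol remains the authority on the exact procedure.

```python
def star_two_pass_commands(reference_fasta, fastq1, fastq2, overhang=75, threads=4):
    """Return the four STAR command lines of a 2-pass run (not executed).

    overhang and threads are illustrative defaults, not recommendations.
    """
    common = ["--runThreadN", str(threads)]
    index1 = ["STAR", "--runMode", "genomeGenerate", "--genomeDir", "genome_pass1",
              "--genomeFastaFiles", reference_fasta] + common
    align1 = ["STAR", "--genomeDir", "genome_pass1", "--readFilesIn", fastq1, fastq2,
              "--outFileNamePrefix", "pass1_"] + common
    # The first pass writes its splice junctions to <prefix>SJ.out.tab,
    # which guides the second index build.
    index2 = ["STAR", "--runMode", "genomeGenerate", "--genomeDir", "genome_pass2",
              "--genomeFastaFiles", reference_fasta,
              "--sjdbFileChrStartEnd", "pass1_SJ.out.tab",
              "--sjdbOverhang", str(overhang)] + common
    align2 = ["STAR", "--genomeDir", "genome_pass2", "--readFilesIn", fastq1, fastq2,
              "--outFileNamePrefix", "pass2_"] + common
    return [index1, align1, index2, align2]


for command in star_two_pass_commands("reference.fasta", "rna_1.fastq", "rna_2.fastq"):
    print(" ".join(command))
```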

New GATK tool called SplitNCigarReads
[Figure: a read spanning exon_1 and exon_2 on REF, GATTC--------------AATTATT, is split into GATTC and AATTATT]
SplitNCigarReads
splits reads with Ns in the CIGAR string
keeps grouping information per exon
trims overhangs
For now, need to use -U ALLOW_N_CIGAR_READS
Use ReassignOneMappingQuality read filter to reassign mapping qualities from 255 (unusable by GATK) to 60
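The two operations described here can be illustrated with a simplified sketch: cutting a read at each N in its CIGAR, and replacing one mapping quality value with another. This is not the GATK implementation (it does no overhang trimming and keeps no grouping information), and the function names are hypothetical. The example read follows the figure, where 14 dashes separate GATTC from AATTATT, giving the CIGAR 5M14N7M; the start position of 1 is an assumption.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_at_n(pos, cigar, seq):
    """Split one alignment at each N operation.

    Returns a list of (start position, CIGAR, sequence) tuples, one per
    contiguous segment between skipped regions.
    """
    segments = []
    seg_ops, seg_start, seg_read_start = [], pos, 0
    ref_pos, read_pos = pos, 0
    for length, op in ((int(n), o) for n, o in CIGAR_OP.findall(cigar)):
        if op == "N":
            if seg_ops:
                segments.append((seg_start, "".join(seg_ops), seq[seg_read_start:read_pos]))
            ref_pos += length  # skip the intron on the reference
            seg_ops, seg_start, seg_read_start = [], ref_pos, read_pos
            continue
        seg_ops.append(f"{length}{op}")
        if op in "MD=X":
            ref_pos += length
        if op in "MIS=X":
            read_pos += length
    if seg_ops:
        segments.append((seg_start, "".join(seg_ops), seq[seg_read_start:read_pos]))
    return segments


def reassign_mapq(mapq, from_value=255, to_value=60):
    """Replace one specific MAPQ value and leave all others unchanged."""
    return to_value if mapq == from_value else mapq


print(split_at_n(1, "5M14N7M", "GATTCAATTATT"))
print(reassign_mapq(255), reassign_mapq(30))
```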
TOOL TIPS
SplitNCigarReads
Splits reads with Ns in CIGAR

java -jar GenomeAnalysisTK.jar \
    -T SplitNCigarReads \
    -R human.fasta \
    -I original.bam \
    -o SplitNCigar.bam \
    -U ALLOW_N_CIGAR_READS \
    -rf ReassignOneMappingQuality \
    -RMQF 255 -RMQT 60

Next Step: Indel Realignment
Further reading

http://www.broadinstitute.org/gatk/guide/best-practices
http://broadinstitute.github.io/picard/
http://bio-bwa.sourceforge.net/bwa.shtml
http://bioinformatics.oxfordjournals.org/content/29/1/15
