
Data Basics

Preprocessing and SNP calling


Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis
Functional Human Variation Group
Simon Rasmussen
Next Generation Sequencing analysis
DTU Bioinformatics

36626 - Next Generation Sequencing Analysis


Generalized NGS analysis
Workflow: Question -> Raw reads -> Pre-processing -> Assembly (Alignment / de novo) -> Application specific (Variant calling, count matrix, ...) -> Compare samples / methods -> Answer?
Figure annotations: Data size; Sample prep & Sequencing; SNPs, genes, regions; Main data reductive steps
What is sequence data?
Sequences are stored in FASTA files
Header:
>gi|218693476|ref|NC_011748.1| Escherichia coli 55989 chromosome, complete genome
Sequence:
GTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGT
GTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAA
ATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACG
CATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAA
ACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCAT
GCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTG
GAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGG
TGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTT
TGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTC
GATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCA

Genome sizes: E. coli ~ 4.5-6 Mbases, Human ~ 3.2 Gbases
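The FASTA layout described above (one header line starting with ">", followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch rather than a replacement for an established parser; the function name `read_fasta` is an assumption made for this example, and the record is the first two sequence lines of the E. coli entry shown on the slide.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Illustrative input: the header and first two sequence lines from the slide.
example = io.StringIO(
    ">gi|218693476|ref|NC_011748.1| Escherichia coli 55989 chromosome, complete genome\n"
    "GTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGT\n"
    "GTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAA\n"
)
fasta_records = list(read_fasta(example))
for name, seq in fasta_records:
    print(name.split()[0], len(seq))
```

The sequence lines are joined into one string per record, so the line wrapping used in the file does not matter to downstream code.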

Then what is NGS data?
FASTQ

@ILLUMINA-C90280_0030_FC:5:1:2675:1090#NNNNNN/1
ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC
+
BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?

Line 1: Header
Line 2: Sequence
Line 4: Qualities (prob. that base call is wrong)

Millions to billions of these
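Because every FASTQ record spans four lines, a file can be streamed record by record without loading millions of reads into memory. The sketch below assumes exactly four lines per record and no blank lines; the function name `read_fastq` is hypothetical, and the input is the single read shown on the slide.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (header, sequence, qualities) from a four-line-per-record FASTQ handle."""
    while True:
        block = [line.rstrip("\n") for line in islice(handle, 4)]
        if not block:
            return  # end of file
        if len(block) != 4 or not block[0].startswith("@") or not block[2].startswith("+"):
            raise ValueError("malformed FASTQ record")
        header, seq, _, qual = block
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual

# Illustrative input: the read from the slide.
example = io.StringIO(
    "@ILLUMINA-C90280_0030_FC:5:1:2675:1090#NNNNNN/1\n"
    "ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC\n"
    "+\n"
    "BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?\n"
)
fastq_records = list(read_fastq(example))
for name, seq, qual in fastq_records:
    print(name, len(seq), len(qual))
```

The length check reflects the format itself: each base has exactly one quality character.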



Quality score encoding
• Quality score is the combination of these two (Illumina):
  • Quality predictor values of clusters:
    • Intensity profiles and signal-to-noise ratios
  • Quality model/table:
    • Pre-calculated combinations of the above
    • Depends on machine, chemistry, software
A closer look at the qualities
@ILLUMINA-C90280_0030_FC:5:1:2675:1090#NNNNNN/1
ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC
+
BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?

Line 1: Header
Line 2: Sequence
Line 4: Qualities (prob. that base call is wrong)

• One character encodes a number, using the ASCII table (0-127)
• The number is on the Phred scale: Q = -10 * log10(P)
• This number (Q) can be converted back to P: P = 10^(-Q/10)
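The two formulas above translate directly into code. The following is a small sketch of the conversion in both directions; the function names `phred_to_prob` and `prob_to_phred` are chosen for this example only.

```python
import math

def phred_to_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

# Print the error probability for a few common Phred scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_prob(q))
```

Every step of 10 on the Phred scale corresponds to a tenfold change in the error probability.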

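Phred scale
@ILLUMINA-C90280_0030_FC:5:1:2675:1090#NNNNNN/1
ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC
+
BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?

ASCII values of the first three quality characters (B, A, A): 66 65 65
Read directly as Q, 66 corresponds to P = 10^(-6.6), about 2.5e-7

Q     Prob
10    0.1
20    0.01
30    0.001
40    0.0001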

Phred-scaled probabilities
• Base qualities, read mapping qualities, variant qualities, ...
• Straightforward, except for when they are used in reads!
• Offset: Sanger = 33 (“Phred+33”), Illumina = 64 (“Phred+64”)
@ILLUMINA-C90280_0030_FC:5:1:2675:1090#NNNNNN/1
ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC
+
BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?

Scores for the first three quality characters (B, A, A) and the approximate error probability of the first base:
Raw ASCII value:       66 65 65   ~2.5e-7
Sanger (Phred+33):     33 32 32   ~0.0005
Illumina (Phred+64):    2  1  1   ~0.6
HUGE difference! (Exercise today)

Q     Prob
10    0.1
20    0.01
30    0.001
40    0.0001
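The effect of the offset can be checked by decoding the same quality string twice. The sketch below is illustrative only: `decode_qualities` and `offset_is_plausible` are names invented for this example, and the plausibility check tests nothing more than whether any decoded score would be negative.

```python
def decode_qualities(qual, offset):
    """Convert a FASTQ quality string to a list of integer scores for a given offset."""
    return [ord(ch) - offset for ch in qual]

def offset_is_plausible(qual, offset):
    """An offset cannot be right if it produces negative scores."""
    return all(q >= 0 for q in decode_qualities(qual, offset))

qual = "BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?"

# First three characters (B, A, A) under each offset.
print(decode_qualities(qual, 33)[:3])
print(decode_qualities(qual, 64)[:3])

# Check the whole string, not only the first three characters.
print(offset_is_plausible(qual, 33), offset_is_plausible(qual, 64))
```

For this particular read, several characters (for example "+" and "3") lie below ASCII 64, so decoding the full string as Phred+64 would give negative scores. A check on one read is only a hint; a real file should be judged on many reads.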
Sanger vs. Illumina vs. Solexa
• 454, Ion Torrent, PacBio, Nanopore, Sanger: “Sanger” encoding
• Illumina reads: “Illumina” or “Sanger” encoding. New reads are all “Sanger”
• Solexa data (Solexa was bought by Illumina): Solexa encoding
• All data from SRA/ENA: “Sanger”
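Older reads in “Illumina” (Phred+64) encoding can be brought to “Sanger” (Phred+33) encoding by shifting every quality character by 31. The sketch below shows the idea; the function name and the input string are hypothetical, and it does not handle Solexa scores, which are defined on a different scale and need more than a shift.

```python
def illumina64_to_sanger33(qual):
    """Re-encode a Phred+64 quality string as Phred+33."""
    converted = []
    for ch in qual:
        q = ord(ch) - 64  # decode with the Illumina offset
        if q < 0:
            raise ValueError(f"character {ch!r} is not valid in Phred+64 data")
        converted.append(chr(q + 33))  # encode with the Sanger offset
    return "".join(converted)

old_qual = "hhhgfB"  # hypothetical Phred+64 string: Q = 40, 40, 40, 39, 38, 2
new_qual = illumina64_to_sanger33(old_qual)
print(old_qual, "->", new_qual)
```

Established tools offer this conversion as an option, so in practice a hand-written converter is mainly useful for understanding what happens to the characters.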


Read types
Fragment DNA, then sequence as:
• Single end
• Paired end, insert size: 200-800 bp
• Mate pair, insert size: 2 kb - 40 kb (~5 kb)
Protocol/technology dependent


Read orientation
• Single end: Forward
• Paired end (Illumina): Forward - Reverse
• Mate pair (Illumina): Reverse - Forward
• Different for other technologies!
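The Forward - Reverse layout of Illumina paired-end reads can be illustrated with a toy fragment: the first read is taken from the start of the fragment, the second from the opposite strand at the other end. This is a simplified sketch without sequencing errors or adapters; the function names and the fragment (borrowed from the read shown earlier) are assumptions for this example.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def simulate_paired_end(fragment, read_length):
    """Return (read1, read2) in Forward - Reverse orientation."""
    read1 = fragment[:read_length]                      # forward strand, left end
    read2 = reverse_complement(fragment)[:read_length]  # reverse strand, right end
    return read1, read2

fragment = "ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC"  # hypothetical 35 bp fragment
read1, read2 = simulate_paired_end(fragment, 10)
print(read1)
print(read2)
```

Reverse-complementing the second read recovers the right end of the fragment, which is why aligners expect the two reads of a pair to map to opposite strands, pointing towards each other.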


Special applications
• Single end reads:
  • Sometimes the only possibility (small DNA fragments / ancient DNA)
• Paired end reads:
  • More precise mapping/alignment/variation calls
  • Medium/Large indels (insertion/deletion)
  • Structural variations
  • Scaffolding in de novo assembly
• Mate pairs (and Long reads):
  • Scaffolding in de novo assembly
  • Structural variations
Question

• What does it mean to have paired end reads?

• Discuss with your neighbor for 2-3 minutes, then we discuss together



Paired end reads x2

Illumina Paired End sequencing video



Exercise

http://www.cbs.dtu.dk/courses/27626/Exercises/Data_basics_exercise.php
