Data Basics
Preprocessing and SNP calling

Natasja S. Ehlers, PhD student
Center for Biological Sequence Analysis
Functional Human Variation Group
Simon Rasmussen
Next Generation Sequencing analysis
DTU Bioinformatics
36626 - Next Generation Sequencing Analysis

Generalized NGS analysis
[Figure: workflow diagram]
Question -> Sample prep & Sequencing -> Raw reads -> Pre-processing -> Assembly: Alignment / de novo -> Application specific: Variant calling, count matrix, ... -> Compare samples / methods -> Answer?
Figure labels: Data size; Main data reductive steps; SNPs, genes, regions
What is sequence data?
Sequences are stored in FASTA files
Header:
>gi|218693476|ref|NC_011748.1| Escherichia coli 55989 chromosome, complete genome
Sequence:
GTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGT
GTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAA
ATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACG
CATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAA
ACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCAT
GCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTG
GAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGG
TGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTT
TGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTC
GATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCA
E. coli ~ 4.5 - 6 Mbases; Human ~ 3.2 Gbases
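
To make the header/sequence layout concrete, here is a minimal sketch of reading FASTA text in Python. The function name `parse_fasta` and the two-line `example` are illustrative assumptions, not part of the course material; real projects usually rely on an established library instead.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by header text (without the leading '>')."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # A sequence may span many lines; join them into one string.
            records[header] += line
    return records


# First two sequence lines of the E. coli record shown above.
example = [
    ">gi|218693476|ref|NC_011748.1| Escherichia coli 55989 chromosome, complete genome",
    "GTAAGTATTTTTCAGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGT",
    "GTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAA",
]
genome = parse_fasta(example)
for name, seq in genome.items():
    print(name.split()[0], len(seq))
```

Storing everything in a dictionary is fine for a bacterial genome; for very large inputs a streaming approach would be a more careful choice.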

Then what is NGS data?
FASTQ
Header:
@ILLUMINA-C90280_0030_FC:5:1:2675:1090#NNNNNN/1
Sequence:
ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC
+
Qualities (prob. that base call is wrong):
BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?

Millions to billions of these
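
Because each FASTQ record is four lines (header, sequence, `+`, qualities), a file can be read record by record without loading it all into memory. The sketch below assumes the simple four-line layout shown in the slide; `FastqRecord` and `parse_fastq` are hypothetical names chosen for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    qualities: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines: header, sequence, '+', qualities."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qualities = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(qualities):
            raise ValueError("sequence and quality strings differ in length")
        yield FastqRecord(header[1:], sequence, qualities)


fastq_example = [
    "@ILLUMINA-C90280_0030_FC:5:1:2675:1090#NNNNNN/1\n",
    "ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC\n",
    "+\n",
    "BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?\n",
]
fastq_records = list(parse_fastq(fastq_example))
print(fastq_records[0].header, len(fastq_records[0].sequence))
```

Using a generator keeps memory use flat, which matters when a run produces millions to billions of reads.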

Quality score encoding
• Quality score is the combination of these two (Illumina):
  • Quality predictor values of clusters:
    • Intensity profiles and signal-to-noise ratios
  • Quality model/table:
    • Pre-calculated combinations of the above
    • Depend on machine, chemistry, software
A closer look at the qualities
One character encodes a number, using the ASCII table (0-255)
The number is on the Phred scale: Q = -10 * log10(P)
This number (Q) can be converted back to P: P = 10^(-Q/10)
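
The two formulas are each other's inverse, which a few lines of Python can confirm. The function names below are illustrative only.

```python
import math


def phred_to_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P): Phred-scaled quality for error probability P."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_prob(q))
```

Every step of 10 in Q corresponds to a tenfold drop in the error probability.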
Phred scale
ATTCCCGGCCTTTTTCCAGGCCTGCCTGCTCGAGC
+
BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?
ASCII values of the first three quality characters (B, A, A): 66 65 65 ~1e-6

Q ~ Prob
10 ~ 0.1
20 ~ 0.01
30 ~ 0.001
40 ~ 0.0001

Phred-scaled probabilities
• Base qualities, read mapping qualities, variant qualities, ...
• Straightforward, except when they are stored in reads, where the ASCII offset must be known!
• Offset: Sanger = 33 (“Phred+33”), Illumina = 64 (“Phred+64”)
+
BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?
First three quality characters (B, A, A) under each interpretation:
Phred (raw ASCII value): 66 65 65 ~1e-6
Sanger (ASCII - 33): 33 32 32 ~0.001
Illumina (ASCII - 64): 2 1 1 ~1
HUGE difference! (Exercise today)
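
The comparison above can be reproduced by subtracting the offset from each character's ASCII value. This is a minimal sketch; `decode_qualities` is a hypothetical helper name, and only the first three characters of the example quality string are decoded, as in the slide.

```python
from typing import List


def decode_qualities(quality_string: str, offset: int) -> List[int]:
    """Turn quality characters into Phred scores by subtracting the offset."""
    return [ord(char) - offset for char in quality_string]


first_three = "BAA"  # start of the quality string shown above
for name, offset in (("Sanger", 33), ("Illumina", 64)):
    scores = decode_qualities(first_three, offset)
    probs = [10 ** (-q / 10) for q in scores]
    print(name, scores, [f"{p:.4f}" for p in probs])
```

The same characters give error probabilities near 0.0005 under offset 33 and above 0.6 under offset 64, which is why the encoding must be known before any quality filtering.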
Sanger vs. Illumina vs. Solexa
• 454, Ion Torrent, PacBio, Nanopore, Sanger: “Sanger” encoding
• Illumina reads: “Illumina” or “Sanger” encoding. New reads are all “Sanger”
• Solexa data: Solexa encoding (Solexa was bought by Illumina)
• All data from SRA/ENA: “Sanger”
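
When the encoding of a file is unknown, a common heuristic is to look at the lowest quality character that occurs. The sketch below assumes the usual character ranges (Phred+33 starts at ASCII 33, Solexa+64 at ASCII 59, Phred+64 at ASCII 64); `guess_encoding` is a hypothetical name, and the result is only a guess, since a Phred+33 file containing nothing but very high qualities looks the same as a Phred+64 file.

```python
from typing import Iterable


def guess_encoding(quality_strings: Iterable[str]) -> str:
    """Guess the quality encoding from the lowest character observed."""
    lowest = min(ord(char) for qual in quality_strings for char in qual)
    if lowest < 59:
        # Characters below ';' cannot occur in either +64 encoding.
        return "Phred+33 (Sanger)"
    if lowest < 64:
        # ';' to '?' would be negative scores, which only Solexa allows.
        return "Solexa+64"
    return "Phred+64 (Illumina), or Phred+33 with only high qualities"


slide_qualities = "BAAAGECEE<EEDFEDF3DBDBB=A+==>9>>88?"
print(guess_encoding([slide_qualities]))
```

The guess gets more reliable the more reads are inspected, because low-quality bases are then more likely to appear.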
Read types
[Figure: fragment DNA sequenced as single end, paired end or mate pair reads]
Fragment DNA:
Single end
Paired end: Ins: 200-800 bp
Mate pair: Ins: 2kb - 40kb (~5kb)
Protocol/technology dependent

Read orientation
Single end: Forward
Paired end (Illumina): Forward - Reverse
Mate pair (Illumina): Reverse - Forward
Different for other technologies!
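
The forward - reverse orientation of Illumina paired end reads can be illustrated on a toy fragment: read 1 is taken from the start of the fragment, read 2 from the other end on the opposite strand. This is a simplified sketch with made-up names (`reverse_complement`, `paired_end_reads`) and a made-up 20 bp fragment; it ignores sequencing errors and adapters.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> Tuple[str, str]:
    """Read 1 from the fragment start, read 2 from the far end, reversed."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    return read1, read2


toy_fragment = "AAAACCCCGGGGTTTTAACC"
read1, read2 = paired_end_reads(toy_fragment, 4)
print(read1, read2)
```

Reverse-complementing read 2 recovers the last bases of the fragment, which is how a mapper can place both reads on the same reference strand and estimate the insert size between them.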

Special applications
• Single end reads:
  • Sometimes the only possibility (small DNA fragments / ancient DNA)
• Paired end reads:
  • More precise mapping/alignment/variation calls
  • Medium/large indels (insertions/deletions)
  • Structural variations
  • Scaffolding in de novo assembly
• Mate pairs (and long reads):
  • Scaffolding in de novo assembly
  • Structural variations
Question
• What does it mean to have paired end reads?
• Discuss with your neighbor for 2-3 minutes, then we discuss together

Paired end reads x2
Illumina Paired End sequencing video

Exercise
http://www.cbs.dtu.dk/courses/27626/Exercises/Data_basics_exercise.php
