
Computational Biology

Introduction, Basic Biology

Nives Skunca
Slides prepared by Dr. Christophe Dessimoz
19/21 September 2012

This week
Course introduction
Basic Biology

[Diagram relating Reality (Nature), Catalogue and Model. Labels: observation (obs.); perturbation (perturb.), e.g. "take it apart" (in vitro) or "recreate life" (synthetic biology); Model f(x) (formulate/select); prediction; Validate on real data; Validate / Estimate by simulation]

[Image: Georg Dionysius Ehret's illustration of Linnaeus's sexual system of plant classification, 1736]

Learning Outcomes

Understand basic concepts of molecular biology

Understand and apply fundamental models, algorithms, data structures, and computational techniques to answer biological questions

Wide range of topics, but special focus on biological sequences and their evolutionary context.

Topics
Molecular Genetics
Gene Evolution
Genome Evolution
Mass Spectrometry
Codon Bias

Modeling
Dynamic programming
Markov models
Least squares
Maximum Likelihood
Optimization
Heuristics
Simulation

Organization

Lecture

Wed 13-14 (CAB G52), Fri 13-15 (ML F34)


Prof. Gonnet will hold the lectures

Exercises:

Thu 14-16 (CAB H56), starting this week


If you do not have a nethz account, ask
Stefan Zoller as soon as possible.

Teaching Assistants

Stefan Zoller
Nives Skunca

Date            Topic                                                             Lecturer
Sept. 19/21     Course Introduction; Basic Molecular Biology                      NS
Sept. 26/28     Markov models/String Alignment I                                  GHG
Oct. 3/5        String Alignment II (indels, estimating distances)                GHG
Oct. 10/12      Substitution Matrices                                             GHG
Oct. 17/19      Approximate Alignment Methods; Statistics of Pairwise Alignments  GHG
Oct. 24/26      Phylogeny I                                                       GHG
Oct. 31/Nov. 2  Phylogeny II                                                      GHG
Nov. 7/9        Phylogeny III                                                     GHG
Nov. 14/16      Multiple Sequence Alignments                                      AS
Nov. 21/23      Synthetic Evolution; Evaluation of Estimators                     DD/GHG
Nov. 28/30      Current research; Mass profiling                                  Guests/GHG
Dec. 5/7        Orthology/Lateral Gene Transfer                                   NS
Dec. 12/14      Codon bias                                                        SZ
Dec. 19/21      Genome Rearrangements                                             GHG

Course Grade & Credits

Participation in the exercises is strongly encouraged, but not mandatory

Written Exam

During winter session
3 hours
The only support materials allowed are 2 A4 pages (4 sides), personally handwritten.

Course Homepage
http://www.cbrg.ethz.ch/education/CompBiol

Course details
Schedule
Slides
Exercises

Darwin

Interpreted language based on Maple

Available for download for Mac and Linux (http://www.cbrg.ethz.ch/darwin)

Environment for bioinformatics: sequence management, mathematics, alignments, trees, drawing, etc.

Biorecipes
www.biorecipes.com

A collection of real problems with coded solutions in the Darwin language

Darwin input in green
Darwin output in red

Other materials

Slides can be downloaded from the course homepage.

Additional notes and references will be made available as well.

Basic Biology
Slides of this part are largely
based on material from
Dr. Gina Cannarozzi

Basic Principles

Universality of life on earth: water, carbon-based biochemistry; genetic material; genetic code (largely) universal. This points to a common origin!
Life is compartmentalized: cells are the fundamental units of structure, function, and organization
Self-replicating
Capable of Darwinian evolution

[Image: Cryptomonadales; scale bar 10 µm. Source: Encyclopedia of Life (eol.org)]

So what is life?
Living organisms undergo metabolism,
maintain homeostasis, possess a capacity
to grow, respond to stimuli, reproduce
and, through natural selection, adapt to
their environment in successive
generations.

What about endospores? viruses? mules? priests? prions? computer viruses?

In biology, there are exceptions to almost every rule.

Inside a Cell
Prokaryote

http://www.osovo.com/diagram/prokaryoticcelldiagram.htm

~2 µm

Eukaryote

http://www.biologycorner.com/resources/cell.gif

10-30 µm

Relevant components

Ribosomes translate mRNA into proteins.
Mitochondria (eukaryotes) have their own DNA as a result of the early inclusion of proteobacteria into a eukaryotic cell.
Chloroplasts (plants, protists) have their own DNA as a result of the early inclusion of cyanobacteria into a eukaryotic cell.
Plasmids (bacteria) are short pieces of circular DNA, present in multiple copies; nonessential; can be transferred between bacteria.

Genome
[Figure labels: chromosome, chromatin, histone]

Genome: all the genetic material of an organism.

The genome consists of genes and non-coding regions.

Genes consist of regulatory regions, introns, exons, and untranslated regions.

http://www.scfbio-iitd.res.in/tutorial/geneorganization.html

Escherichia coli

1 circular chromosome
1 plasmid (multiple copies)
~4.6 million base pairs
~3.9 million coding bases (85%)
4132 protein-coding genes
172 RNA genes (tRNA, rRNA, etc.)
578 pseudogenes

Homo sapiens

23 chromosome pairs
~3 billion base pairs
~50 million coding bases (1.5%)
~21,000 protein-coding genes
~294,000 exons
~60,000 different transcripts
~6,000 pseudogenes
~4,800 RNA genes
~2,900 RNA pseudogenes

DNA
Deoxyribonucleic acid

Double helix
Backbones: phosphate and deoxyribose, directed (5' → 3'), antiparallel

Connection: 4 bases Adenine, Thymine, Cytosine, Guanine. A-T and C-G are paired by hydrogen bonds (relatively weak)

Dimensions labelled in the figure: 34 Å (3.4 nm) per helical turn; 3.3 Å (0.33 nm) between adjacent base pairs
Wikipedia

DNA Bases
PuRines
PYrimidines
C-G: 3 H-bonds
A-T: 2 H-bonds
Wikipedia
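The pairing rules above (A-T, C-G, antiparallel strands) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `reverse_complement` and `hydrogen_bonds` are made up for this example and are not part of Darwin or of the course material.

```python
# Watson-Crick pairing as stated on the slides: A-T (2 H-bonds), C-G (3 H-bonds).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def reverse_complement(strand: str) -> str:
    """Return the antiparallel partner strand, written 5' -> 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds holding a strand to its complement."""
    return sum(H_BONDS[base] for base in strand.upper())


if __name__ == "__main__":
    seq = "ATGCGC"
    print(reverse_complement(seq))  # GCGCAT
    print(hydrogen_bonds(seq))      # 2 + 2 + 3 + 3 + 3 + 3 = 16
```

Because C-G pairs carry three hydrogen bonds and A-T pairs only two, a GC-rich stretch is held together by more bonds than an AT-rich stretch of the same length.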

Hydrogen Bond

X-H···Y, where X and Y are electronegative atoms (typically N, O, F)

Responsible for the high boiling point of water (each H2O can have up to 4 H-bonds)

Central dogma of
molecular biology

Wikipedia

DNA Replication

Wikipedia

Polymerase can only add bases in the 5' → 3' direction (the template DNA is read 3' → 5')

Movie time!
Replication visualized:

http://www.wehi.edu.au/education/wehitv/molecular_visualisations_of_dna/

End of day 1

RNA

Single stranded (can form structure)
Uracil instead of Thymine

mRNA: messenger RNA, for translation
rRNA: component of the ribosome
tRNA: each is specific for one amino acid; binds selectively to the matching codon at the ribosome
microRNA: short RNAs (~22 nt) which regulate gene expression

http://www.pdb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/pdb15_2.html

Transcription

Transcription factors bind to promoter sites in the 5' regulatory region.

RNA polymerase binds to the complex.

Working together, they open the DNA double helix.

Genes can be on either strand, but the direction of the growing mRNA sequence is always 5' → 3'.
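As a rough sketch of what transcription produces, the following Python snippet builds the mRNA either from the coding strand (same sequence, U in place of T) or from the template strand (read 3' → 5' so that the mRNA grows 5' → 3'). The function names are hypothetical, and promoters, transcription factors and termination are deliberately ignored.

```python
def transcribe(coding_strand: str) -> str:
    """mRNA equals the coding strand (5' -> 3') with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def transcribe_template(template_strand: str) -> str:
    """Build the mRNA from the template strand.

    The template is given 5' -> 3'; it is read backwards (3' -> 5')
    so that the mRNA is assembled in the 5' -> 3' direction.
    """
    pair = {"A": "U", "T": "A", "C": "G", "G": "C"}
    return "".join(pair[base] for base in reversed(template_strand.upper()))


if __name__ == "__main__":
    coding = "ATGGCC"
    template = "GGCCAT"  # reverse complement of the coding strand
    print(transcribe(coding))             # AUGGCC
    print(transcribe_template(template))  # AUGGCC
```

Both routes give the same mRNA, which is why a gene can sit on either strand while the transcript is always written 5' → 3'.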

Roger Kornberg
Nobel Prize Chemistry 2006

The chain shown in grey is RNA polymerase, with the portion that clamps on the DNA shaded in yellow. The DNA helix being unwound and transcribed by RNA polymerase is shown in green and blue, and the growing RNA strand is shown in red.
http://med.stanford.edu/featured_topics/nobel/kornberg/release.html

Post-transcriptional
modifications (Eukaryotes)

5' cap
Poly-A tail
Splicing (removal of introns)

Research questions: Where are the introns? Where are the coding sequences? Where are the start and stop of transcription? Where are the binding sites for the transcription factors that control when transcription takes place?

Alternative Splicing

Humans: >50% of genes have splice variants.

Dscam gene in D. melanogaster: 95 alternative exons can express 38,016 different mRNAs through alternative splicing.
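The Dscam numbers can be checked with a short calculation. The cluster sizes below (12, 48, 33 and 2 mutually exclusive variants for exons 4, 6, 9 and 17) come from the published description of the gene, not from the slide, so treat them as an assumption of this sketch: one variant is chosen from each cluster, so the counts multiply.

```python
from math import prod

# Assumed sizes of the four clusters of mutually exclusive exons in Dscam
# (values from the literature, not stated on the slide).
DSCAM_CLUSTERS = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

total_alternative_exons = sum(DSCAM_CLUSTERS.values())
possible_mrnas = prod(DSCAM_CLUSTERS.values())

if __name__ == "__main__":
    print(total_alternative_exons)  # 95
    print(possible_mrnas)           # 38016
```

The sum reproduces the 95 alternative exons and the product reproduces the 38,016 mRNAs quoted above.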

Translation

Wikimedia
Commons

The Genetic Code
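A minimal Python sketch of translation with the standard genetic code follows. The 64-letter string lists the amino acids for all codons in the conventional T, C, A, G order; `translate` is a hypothetical helper that starts at the first AUG and stops at the first stop codon, ignoring everything else the ribosome does.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first start codon up to the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGGCCAAAUGGUAAGG"))  # MAKW
```

The example reads AUG GCC AAA UGG UAA, that is Met, Ala, Lys, Trp, stop.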

Proteins

Participate in most (all?) cellular processes

Encoded in DNA

Made of 20 amino acids (+ occasionally a cofactor, such as metal ion, heme, ATP, etc.)

Alberts et al., Essential cell biology: an introduction to the molecular biology of the cell, Garland 1996

Functions of Proteins

...

Amino Acids

Only sidechains differ (red)

Sidechains have diverse chemical properties (charge, size, pH, hydrophobicity, ...)

Wikimedia Commons

Peptide Bond

G. Cannarozzi

Proteins have a 3D structure

Wikimedia Commons

Biological sequences
How are they identified?
Where are they stored?

Next Generation Sequencing

Unidentified protein
extracted from gel

Proteomics
MDISTLTASEEIE
MEIDAEEIEIMAT
IDLAEDLISLFM
DDMFSSIDLESI
NFEIFNSSDIDSI
NIDLESIEEIEIMF
EEIEIMATIFNSS
DIDIMMDIMMD
SINFEIFNSSDIDI
MMDATIDLAED
LISLFMDDMFSS
IDLESINFEIFNSS

Split into fragments of 5-10 amino acids

. . . AEDLISLFMDDM . . .

Determine mass using MS (Mass Spectrometry)

Determine amino acid sequence and compare with sequence database

Sequence
Database
Jiang Long, Science Creative
Quarterly Image Bank

Protein Identified
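The mass-matching step of this workflow can be sketched in Python. The residue masses are standard monoisotopic values in daltons, quoted here from memory and rounded, so check them against a reference table before relying on them. `peptide_mass` and `match_by_mass` are hypothetical helpers, and real identification software works on fragment spectra rather than on a single mass.

```python
# Monoisotopic residue masses in daltons (assumed standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added for the free peptide ends


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide.upper()) + WATER


def match_by_mass(observed_mass, candidates, tolerance=0.01):
    """Return the candidate peptides whose mass lies within the tolerance."""
    return [p for p in candidates
            if abs(peptide_mass(p) - observed_mass) <= tolerance]


if __name__ == "__main__":
    observed = peptide_mass("AEDLISLFMDDM")  # stand-in for a measured mass
    database = ["AEDLISLFMDDM", "MDISTLTASEEIE"]
    print(match_by_mass(observed, database))
```

Leucine and isoleucine have identical masses, which is one reason why mass alone cannot always pin down a sequence and fragment-level information is needed.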

Growth of sequence databases

[Plot: number of sequences (x 10^7, axis from 0 to 2.0) against year (2000 to 2012) for Protein Data Bank, UniProtKB/SwissProt and UniProtKB/TrEMBL]

Getting Sequences

Ensembl
...

e.g. GenBank File


Evolution

Darwinian Evolution

Start from an initial population
Repeat:
    reproduce and mutate randomly
    natural selection: fittest individuals survive and have descendants, which selects good mutations
    sometimes: a branching occurs (e.g. speciation, duplication)
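The mutate-and-select loop above can be mimicked with a toy simulation in Python. Everything here is an assumption made for illustration: genomes are bit strings, fitness is simply the number of 1 bits, and selection keeps the fittest individuals among parents and offspring. Branching events are left out.

```python
import random


def evolve(pop_size=20, genome_len=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Return the best fitness seen in each generation of a toy population."""
    rng = random.Random(seed)
    fitness = sum  # toy fitness: number of 1 bits in the genome
    population = [[0] * genome_len for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # Reproduce and mutate randomly: each individual leaves one copy
        # in which every bit flips with probability mutation_rate.
        offspring = [
            [bit ^ (rng.random() < mutation_rate) for bit in individual]
            for individual in population
        ]
        # Natural selection: only the fittest pop_size individuals survive.
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:pop_size]
        best_per_generation.append(fitness(population[0]))
    return best_per_generation


if __name__ == "__main__":
    history = evolve()
    print(history[0], history[-1])  # best fitness at start and at end
```

Because parents compete alongside their offspring, the best fitness can never decrease from one generation to the next in this sketch; real populations offer no such guarantee.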

Not only the good characters survive

Genetic drift (random sampling)
    Population bottleneck
    Founder effect

Genetic hitchhiking (neutral or mildly deleterious alleles linked to a positively selected gene)
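Genetic drift as random sampling can be illustrated with a small Wright-Fisher-style simulation in Python. This is a sketch under simplifying assumptions that are not on the slide: a constant population size, one locus with two neutral alleles, and non-overlapping generations. The function name `wright_fisher` is chosen for this example.

```python
import random


def wright_fisher(pop_size=20, start_freq=0.5, generations=200, seed=1):
    """Track the frequency of a neutral allele under pure random sampling."""
    rng = random.Random(seed)
    count = round(pop_size * start_freq)
    trajectory = [count / pop_size]
    for _ in range(generations):
        p = count / pop_size
        # Each gene copy of the next generation is drawn at random from
        # the current one, so the allele is picked with probability p.
        count = sum(rng.random() < p for _ in range(pop_size))
        trajectory.append(count / pop_size)
    return trajectory


if __name__ == "__main__":
    print(wright_fisher(pop_size=20)[-1])    # small population
    print(wright_fisher(pop_size=2000)[-1])  # large population
```

No selection is involved, yet the frequency wanders and, once it hits 0 or 1, stays there. Smaller populations, as after a bottleneck or a founder event, tend to reach fixation or loss sooner.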

Species Evolution
Diane Dodd's fruit fly experiment

Speciation: the evolutionary process by which new species arise

Can occur through geographic isolation or barriers, entry into a new niche, or animal husbandry
http://evolution.berkeley.edu/evolibrary/article/_0_0/evo_45

Genome Rearrangements
e.g. Human vs. Dog

Krzywinski et al. Circos: an information aesthetic for comparative genomics. Genome Research (2009) vol. 19 (9) pp. 1639-45

Example: recombination
among E. coli strains

Mau et al. Genome Biology 2006 7:R44

Whole genome duplications

Gene Evolution

Point mutations

Kunkel, 2004, The Journal of Biological Chemistry

Point mutations
Purines

Pyrimidines
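Point mutations are commonly split into transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse). The slide text only names the two base classes, so this terminology is the standard one rather than a quotation. A minimal Python sketch, with hypothetical function names and assuming two aligned sequences of equal length:

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Classify a single-base change as transition or transversion."""
    if a == b:
        return "identical"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1: str, seq2: str) -> Counter:
    """Tally substitution types between two aligned, equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return Counter(classify_substitution(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(count_substitutions("ACGT", "GCTT"))
    # A->G is a transition, G->T a transversion, two positions are identical
```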

Insertion/deletion

Lateral Gene
Transfer

Wikipedia

http://www.scq.ubc.ca/attack-of-the-superbugs-antibiotic-resistance/

Recombination

Gene Evolution

Mutation (base substitution)


Insertion/Deletion
Transposition (horizontal transfer)
Recombination
Gene loss or gene duplication
Splicing pattern mutations

Evolutionary Distances
How can we quantify the amount of evolution
between two subjects?

Time since divergence
Number of common traits.
Edit distance (minimum # of elementary operations to transform one object into the other)
...

Desirable properties
distance estimable without knowing history
metric properties (e.g. triangle inequality)
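Edit distance is a measure that needs no knowledge of the history and satisfies the metric properties. The sketch below uses the classic dynamic-programming recurrence with unit costs for substitution, insertion and deletion (the Levenshtein distance); the choice of unit costs is an assumption of this example, not something the slide specifies.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions from a to b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # delete char_a
                curr[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),   # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))        # 1 (delete the C)
    print(edit_distance("kitten", "sitting"))  # 3
```

With unit costs the distance is symmetric and obeys the triangle inequality, so it is a metric in the sense listed above.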

Markovian Evolution
Markov Model: every site evolves independently, the probability of mutation depends only on the present state (no memory), and the probabilities of mutation are expressed by a transition matrix.

$$
M_1 =
\begin{array}{c|cccc}
  & A     & C     & G     & T     \\ \hline
A & 0.900 & 0.033 & 0.033 & 0.033 \\
C & 0.033 & 0.900 & 0.033 & 0.033 \\
G & 0.033 & 0.033 & 0.900 & 0.033 \\
T & 0.033 & 0.033 & 0.033 & 0.900
\end{array}
$$

After one unit of evolution, the probability that an A mutates into a C is given by the corresponding entry in the matrix:

$$
p(A \to C \mid d = 1) = M_1[A \to C] = 0.033
$$
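Because the model has no memory, the probabilities after d units of evolution are given by the d-th power of M1. That is a standard property of Markov chains rather than something stated on this slide. The Python sketch below uses 0.1/3 for the off-diagonal entries (the slide shows the rounded value 0.033) so that each row sums to exactly 1; the helper names are illustrative.

```python
STATES = "ACGT"
OFF_DIAGONAL = 0.1 / 3  # shown as 0.033 on the slide

# M1[i][j] = probability that base i becomes base j after one unit of evolution.
M1 = [[0.9 if i == j else OFF_DIAGONAL for j in range(4)] for i in range(4)]


def mat_mul(x, y):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_matrix(d: int):
    """Return M1 raised to the d-th power (d units of evolution)."""
    result = [[float(i == j) for j in range(4)] for i in range(4)]  # identity
    for _ in range(d):
        result = mat_mul(result, M1)
    return result


def mutation_probability(a: str, b: str, d: int) -> float:
    """p(a -> b | d): probability of observing b where a stood d units ago."""
    return transition_matrix(d)[STATES.index(a)][STATES.index(b)]


if __name__ == "__main__":
    print(round(mutation_probability("A", "C", 1), 3))  # 0.033
    print(round(mutation_probability("A", "C", 2), 4))  # 0.0622
```

For d = 2 the value is the sum over the four possible intermediate bases: 2 x 0.9 x (0.1/3) + 2 x (0.1/3)^2, about 0.0622.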

http://gi.cebitec.uni-bielefeld.de/people/boecker/bilder/tree_of_life_new.gif

Augustin Augier,
Arbre Botanique
(1801)

Lamarck, Philosophie Zoologique, 1809

Darwin, Notebook B, 1837

Edward Hitchcock, Elementary Geology, 1840

Haeckel, The Evolution of Man, 1879

rRNA was used by Woese (1987) to group early life forms into three kingdoms

[Figure: large phylogenetic tree of life. The tips are labelled with five-letter species codes (e.g. HUMAN, PANTR, BOVIN, CAEEL); the labels did not survive text extraction. Legend:]

Eukaryota
Archaea
Bacteria
Planctomycetes
Fusobacteria
candidate division TG1
Dictyoglomi
Verrucomicrobia
Aquificae
Acidobacteria
Deinococcus-Thermus
Thermotogae
Chloroflexi
Chlamydiae
Chlorobi
Bacteroidetes
Spirochaetes
Tenericutes
Cyanobacteria
Clostridia
Bacilli
Lactobacillales
Actinobacteria
Proteobacteria

[Figure: second tree, tip labels likewise lost in extraction, annotated "You are here."]
