Presentation Pune

CARL
Probabilistic
Cellular Automata Model for
Identification of CpG island
in DNA string
Soumyabrata Ghosh
carl
Cellular Automata Research Lab Group

Shiladitya Munshi | Nirmalya S. Maiti | Soumyabrata Ghosh
Salt Lake City | Kolkata | www.carltig.res.in

Agendas
• Background
• CpG Island Identification by Hidden Markov Model & its

drawbacks
• Probabilistic Cellular Automata (PrCA) & its Rule Vector

Graph
• PrCA Model for CpG island identification
• Filter Design to reduce false positive predictions

carl
• Experimental results
Background
• Identification of CpG island in DNA string

• Higher concentration of C-G dinucleotides
• Probable “start” regions of genes
• De facto standard: Hidden Markov Model

• Based on Markov chain with transition & emission probabilities
• Maximum probable path prediction by Viterbi Algorithm
• Disadvantages of HMM:
• States are assumed to be independent of each other
• Local Maxima
• Over training
carl
• False positive prediction

Hidden Markov Model (HMM)
• Based on Markov Chain
• Probability of a state depends only on previous state
• Transition Probability Matrix
• Transition between underlying states (A+, T+ , G+, C+ for
CpG island and A-, T-, G-, C- for non-CpG)
• Emission Probability Matrix
• Emission of expressed symbol (A, T, G, C) with respect to
underlying state
• Viterbi Algorithm
– Initializes the probability calculations

by taking the product of the initial
Transition probabilities with the
carl
associated emission probabilities.
– Determines the most probable path through the underlying

states of the HMM
Our Approach
• Design of Probabilistic CA based model
• Encoding of nucleotides by 3 bits binary string
• Derivation of Probabilistic Rules from Transition Probability

Matrix
• Model Traversal by Viterbi algorithm
• Feature Extraction from encoded DNA string

carl
• Filter design based on Feature Extraction to reduce false

positive predictions
Probabilistic Cellular Automata (PrCA)
• Cellular Automata: discrete dynamical system. Space, time, and
the states of the system are discrete.
• The next state function of ith cell depends on current state of (i-1)th,
ith, (i+1)th cell. The next state function is represented by CA Rule.
• Three Neighborhood Periodic Boundary Probabilistic CA
An n cell CA with Rule Vector
• Rule Mean Term (RMT): Minterm of a 3 variable Boolean function

carl
for a 3-neighborhood CA cell denoted as T= {T(m)}, (m=0 to 7)
• In Probabilistic CA, for a given RMT, the next state value is 0 or 1

with probability less than 1
Probabilistic CA Rules
Deterministic & Probabilistic Rule 232
carl
Rule Vector Graph (RVG) of PrCA
• Efficient data-structure for characterization of CA evolution
• RVG of n cell CA consists of n subgraphs
• The output of ith subgraph serves as input node of (i+1)th

subgraph
• Nodes contain possible RMTs derived from RMTs of

previous level
• Root Node & Sink Node contain all possible RMTs for
periodic boundary CA
carl
• An edge explicitly denotes probability of 1 whereas the

probability of 0 is implicit
RVG Structure of Probabilistic Periodic Boundary 3 Neighborhood CA
Probabilistic Rule Table

carl
RVG Traversal
• 2n possible number of paths for n cell PrCA
• Forward RVG Traversal
• Most probable next state identification
• Backward RVG Traversal
• Most probable transition identification
• Forward/Backward Traversal
• Most probable path determination through the RVG
carl
Pr CA Model for CpG island Identification
• 3 cell CA can generate 8 states
• Each of 8 states represents one of underlying states (A+,G+, T+,

C+,A-, T- G-, C-)
• One RVG represent single underlying state transition
• Sequence of N such transitions can be modeled by a network of N

RVGS
• Most probable path determination by N sequential traversal of the

RVG using Viterbi Algorithm
carl
carl RVG Network of CA Model & Model Traversal
Synthesis of Probabilistic Rules
• Determination from Transition Probability Matrix by
permuting binary encoding
• Lowest conflict encoding is taken
Probabilistic Rules applied on three bits of binary encoding of underlying states

carl
Three bit encoding for 8 under lying states

Filter Design
• Two Binary slices derived from DNA sequence by using common two bit
encoding of nucleotides: C(00), A(01), G(10), T(11)
• Feature Extraction from binary string of traing DNA sequences based on

experimentally known CpG/non-CpG differentiation
– Class separation using Bhattacharyya distance
– Window size 200 bits (lowest difference between 2 classes)
– K-means Clustering to produce representative Datapoints
carl
• Analysis of feature vectors of testing sequence to assign 4th bit encoding

– Using Mahanalobis Distance Measure
– 1 for window belongs to CpG
– 0 for window belongs to non-CpG
carl Experimental Results
Prediction of CpG Islands on AF111167 gene sequence. The model

and filter are trained by CPGISLE dataset
Inferences of experimental results
• HMM prediction:
» 4 False Positive Predictions
» 2 False Negative Predictions
• Pr CA Model (with Filter):

» 1 False Positive Prediction
» 2 False Negative Predictions
carl
Concluding Remarks
• Probabilistic CA based Model for DNA sequence analysis
• Improvement of CpG island prediction by reducing False

Positive predictions
• Analysis of binary encoded DNA string may provide better

insight of local interaction of nucleotides
carl
Thanks !
carl
Cellular Automata Research Lab Group

Shiladitya Munshi | Nirmalya S. Maiti | Soumyabrata Ghosh
Salt Lake City | Kolkata | www.carltig.res.in

Presentation Pune

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Presentation Pune

Caricato da

Copyright:

Formati disponibili

CARL

Cellular Automata Research Lab Group

Salt Lake City | Kolkata | www.carltig.res.in

• CpG Island Identification by Hidden Markov Model & its

• Probabilistic Cellular Automata (PrCA) & its Rule Vector

• PrCA Model for CpG island identification

• Filter Design to reduce false positive predictions

• Identification of CpG island in DNA string

• De facto standard: Hidden Markov Model

• False positive prediction

– Initializes the probability calculations

associated emission probabilities.

– Determines the most probable path through the underlying

• Design of Probabilistic CA based model

• Encoding of nucleotides by 3 bits binary string

• Derivation of Probabilistic Rules from Transition Probability

• Model Traversal by Viterbi algorithm

• Feature Extraction from encoded DNA string

• Filter design based on Feature Extraction to reduce false

• Three Neighborhood Periodic Boundary Probabilistic CA

An n cell CA with Rule Vector

• Rule Mean Term (RMT): Minterm of a 3 variable Boolean function

for a 3-neighborhood CA cell denoted as T= {T(m)}, (m=0 to 7)

• In Probabilistic CA, for a given RMT, the next state value is 0 or 1

• Efficient data-structure for characterization of CA evolution

• RVG of n cell CA consists of n subgraphs

• The output of ith subgraph serves as input node of (i+1)th

• Nodes contain possible RMTs derived from RMTs of

• An edge explicitly denotes probability of 1 whereas the

Probabilistic Rule Table

• 3 cell CA can generate 8 states

• Each of 8 states represents one of underlying states (A+,G+, T+,

• One RVG represent single underlying state transition

• Sequence of N such transitions can be modeled by a network of N

• Most probable path determination by N sequential traversal of the

Probabilistic Rules applied on three bits of binary encoding of underlying states

Three bit encoding for 8 under lying states

• Feature Extraction from binary string of traing DNA sequences based on

• Analysis of feature vectors of testing sequence to assign 4th bit encoding

Prediction of CpG Islands on AF111167 gene sequence. The model

• Pr CA Model (with Filter):

• Probabilistic CA based Model for DNA sequence analysis

• Improvement of CpG island prediction by reducing False

• Analysis of binary encoded DNA string may provide better

Cellular Automata Research Lab Group

Salt Lake City | Kolkata | www.carltig.res.in

Potrebbero piacerti anche