Sei sulla pagina 1di 18

CARL

Probabilistic
Cellular Automata Model for
Identification of CpG island
in DNA string

Soumyabrata Ghosh
carl

Cellular Automata Research Lab Group


Shiladitya Munshi | Nirmalya S. Maiti | Soumyabrata Ghosh

Salt Lake City | Kolkata | www.carltig.res.in


Agendas

• Background

• CpG Island Identification by Hidden Markov Model & its


drawbacks

• Probabilistic Cellular Automata (PrCA) & its Rule Vector


Graph

• PrCA Model for CpG island identification

• Filter Design to reduce false positive predictions


carl

• Experimental results
Background

• Identification of CpG island in DNA string


• Higher concentration of C-G dinucleotides
• Probable “start” regions of genes

• De facto standard: Hidden Markov Model


• Based on Markov chain with transition & emission probabilities
• Maximum probable path prediction by Viterbi Algorithm

• Disadvantages of HMM:
• States are assumed to be independent of each other
• Local Maxima
• Over training
carl

• False positive prediction


Hidden Markov Model (HMM)
• Based on Markov Chain
• Probability of a state depends only on previous state
• Transition Probability Matrix
• Transition between underlying states (A+, T+ , G+, C+ for
CpG island and A-, T-, G-, C- for non-CpG)
• Emission Probability Matrix
• Emission of expressed symbol (A, T, G, C) with respect to
underlying state

• Viterbi Algorithm

– Initializes the probability calculations


by taking the product of the initial
Transition probabilities with the
carl

associated emission probabilities.

– Determines the most probable path through the underlying


states of the HMM
Our Approach

• Design of Probabilistic CA based model

• Encoding of nucleotides by 3 bits binary string

• Derivation of Probabilistic Rules from Transition Probability


Matrix

• Model Traversal by Viterbi algorithm

• Feature Extraction from encoded DNA string


carl

• Filter design based on Feature Extraction to reduce false


positive predictions
Probabilistic Cellular Automata (PrCA)
• Cellular Automata: discrete dynamical system. Space, time, and
the states of the system are discrete.

• The next state function of ith cell depends on current state of (i-1)th,
ith, (i+1)th cell. The next state function is represented by CA Rule.

• Three Neighborhood Periodic Boundary Probabilistic CA

An n cell CA with Rule Vector

• Rule Mean Term (RMT): Minterm of a 3 variable Boolean function


carl

for a 3-neighborhood CA cell denoted as T= {T(m)}, (m=0 to 7)

• In Probabilistic CA, for a given RMT, the next state value is 0 or 1


with probability less than 1
Probabilistic CA Rules
Deterministic & Probabilistic Rule 232
carl
Rule Vector Graph (RVG) of PrCA

• Efficient data-structure for characterization of CA evolution

• RVG of n cell CA consists of n subgraphs

• The output of ith subgraph serves as input node of (i+1)th


subgraph

• Nodes contain possible RMTs derived from RMTs of


previous level

• Root Node & Sink Node contain all possible RMTs for
periodic boundary CA
carl

• An edge explicitly denotes probability of 1 whereas the


probability of 0 is implicit
RVG Structure of Probabilistic Periodic Boundary 3 Neighborhood CA

Probabilistic Rule Table


carl
RVG Traversal
• 2n possible number of paths for n cell PrCA
• Forward RVG Traversal
• Most probable next state identification
• Backward RVG Traversal
• Most probable transition identification
• Forward/Backward Traversal
• Most probable path determination through the RVG
carl
Pr CA Model for CpG island Identification

• 3 cell CA can generate 8 states

• Each of 8 states represents one of underlying states (A+,G+, T+,


C+,A-, T- G-, C-)

• One RVG represent single underlying state transition

• Sequence of N such transitions can be modeled by a network of N


RVGS

• Most probable path determination by N sequential traversal of the


RVG using Viterbi Algorithm
carl
carl RVG Network of CA Model & Model Traversal
Synthesis of Probabilistic Rules
• Determination from Transition Probability Matrix by
permuting binary encoding
• Lowest conflict encoding is taken

Probabilistic Rules applied on three bits of binary encoding of underlying states


carl

Three bit encoding for 8 under lying states


Filter Design
• Two Binary slices derived from DNA sequence by using common two bit
encoding of nucleotides: C(00), A(01), G(10), T(11)

• Feature Extraction from binary string of traing DNA sequences based on


experimentally known CpG/non-CpG differentiation
– Class separation using Bhattacharyya distance
– Window size 200 bits (lowest difference between 2 classes)
– K-means Clustering to produce representative Datapoints
carl

• Analysis of feature vectors of testing sequence to assign 4th bit encoding


– Using Mahanalobis Distance Measure
– 1 for window belongs to CpG
– 0 for window belongs to non-CpG
carl Experimental Results

Prediction of CpG Islands on AF111167 gene sequence. The model


and filter are trained by CPGISLE dataset
Inferences of experimental results

• HMM prediction:
» 4 False Positive Predictions
» 2 False Negative Predictions

• Pr CA Model (with Filter):


» 1 False Positive Prediction
» 2 False Negative Predictions
carl
Concluding Remarks

• Probabilistic CA based Model for DNA sequence analysis

• Improvement of CpG island prediction by reducing False


Positive predictions

• Analysis of binary encoded DNA string may provide better


insight of local interaction of nucleotides
carl
Thanks !
carl

Cellular Automata Research Lab Group


Shiladitya Munshi | Nirmalya S. Maiti | Soumyabrata Ghosh

Salt Lake City | Kolkata | www.carltig.res.in

Potrebbero piacerti anche