Härtelstraße 16-18, D-04107 Leipzig, Germany, b Interdisciplinary Center for Clinical
Research and Division of Hematology/Oncology, Inselstraße 22, D-04103 Leipzig,
Germany, c Interdisciplinary Center for Bioinformatics, University of Leipzig,
Härtelstraße 16-18, D-04107 Leipzig, Germany, d Max-Planck-Institute for
Mathematics in the Sciences, Inselstr. 22, D-04103 Leipzig, Germany.
ABSTRACT
Motivation: The identification of the topology and function of
gene regulation networks remains a challenge. A frequently
used strategy is to reconstruct gene regulatory networks from
time series of gene expression levels from data pooled from
cell populations. However, this strategy causes problems if the
gene expression in different cells of the population is not synchronous, as is expected to be the case in the transcription
factor network that controls lineage commitment in haematopoietic stem cells. Here, a promising alternative may be to
measure the gene expression levels in single cells individually.
The inference of a network requires knowledge of the gene
expression levels at successive time points, at least before
and after a network transition. However, due to experimental
limitations a complete determination of the precursor state is
not possible.
Results: We investigate a strategy for the inference of gene
regulatory networks from incomplete expression data based
on dynamic Bayesian networks that permits prediction of
the number of experiments necessary for network inference
depending on noise in the data, prior knowledge, limited
attainability of initial states and other inference parameters.
Our inference strategy combines a gradual Partial Learning
strategy, based only on true experimental observations, for
the network topology with expectation maximization for the
network parameters. We illustrate our strategy by extensive
computer simulations in a high-dimensional parameter space
on the network inference in simulated single-cell-based experiments during haematopoietic stem cell commitment. We find,
for example, that the feasibility of network inference increases
significantly with the experimental ability to force the system
to
1 INTRODUCTION
Regulatory networks are frequently inferred from time series
of gene expression levels obtained from micro-array experiments in which the gene products of a large number of cells
from a population are pooled at successive time points (e.g.
[34, 3, 37]), such that the experimentally observed expression
state represents an average over the subpopulation. This strategy may cause serious problems if the expression patterns in
different cells of the population are non-synchronous, such
that the average values and time development of the gene
expression levels do not represent the true behavior within an
individual cell (compare also [3]). This situation is relevant
to haematopoietic stem cells undergoing lineage commitment
[8]. Haematopoietic stem cells are the progenitors for all of
the different types of human blood cells, which develop from
the stem cells by progressive specialization or differentiation. The choice of a specific cell lineage is believed to be
determined by interactions between a relatively small number
of lineage-associated transcription factors and their corresponding genes [8, 20].
For cases such as these, experimental procedures have been
developed that permit the determination of gene expression
levels in single cells individually [5]. They involve (i) the
generation of cDNA sequences corresponding to mRNA of
expressed genes by reverse transcription and (ii) the amplification of the cDNAs by polymerase chain reaction (PCR).
submitted
The major drawback of the procedure for modelling purposes is that the cell is destroyed during mRNA extraction, so
that the full gene expression pattern within one cell can be
measured only once. On the other hand, in order to uniquely
infer the topology and rules of the underlying gene regulation network the previous network state within the cell, i.e.,
the network state prior to the transition that leads to the network
state at the moment of cell destruction, must also be known
to a large extent. One way to gain knowledge of two successive network states is to up- or down-regulate the expression
level of individual genes experimentally by introduction of
recombinant genes or specific inhibitory molecules into the
cell [17, 25] followed by a complete monitoring of the network state after a transition has occurred. Unfortunately,
the number of gene expression levels that can be adjusted
simultaneously in this way is currently limited in practice
(typically to just one or two genes), hence the knowledge of
the previous (manipulated) network state is largely incomplete. This makes the experimental strategy feasible only for
relatively small gene regulation networks, as is assumed to
be the case for haematopoietic lineage commitment, or for
sub-networks of larger networks.
In this paper we use in-silico simulations to study the efficiency and reliability of reverse engineering of gene regulation networks from transition data which are largely incomplete.
As a guide for the network parameters, we use the example of haematopoietic lineage commitment (see Fig. 1). The
main motivation for our simulation study was to estimate the
approximate number of experiments necessary for a reliable
inference of small gene regulation networks from experiments on state transitions. For this purpose we generated
artificial expression data from Boolean networks (BoolN,
[22]) by computer simulations, used reverse engineering strategies to infer networks from the data, and compared the
inferred networks to those that we originally used to generate the data. Similar procedures have proved useful for
reverse engineering methods [24, 1, 36, 32], as well as, for
example, in sequence alignment (e.g. [11], [23]). The use of
BoolNs is undoubtedly an oversimplification in many biological situations [10], but it is noteworthy that despite their
simplicity and shortcomings, Boolean networks have been
successfully used to model the gene regulatory network in
a number of biological systems such as Drosophila melanogaster [2]. Furthermore, Liang et al. [24] have presented a
reverse engineering algorithm (REVEAL) that allows the inference of a unique BoolN from a time series in which all states
can be measured. Using REVEAL the full network (for arbitrary Boolean rules) can be constructed from gene expression
data only if the number of known gene expression levels from
the precursor network state is larger than or equal to the number of input states of the gene that has the largest number
of input states (Missal and Drasdo, unpubl.). In the situations studied in our simulations, this condition is not fulfilled
[Figure 1 (image not reproduced). Panel (a): network diagram with nodes Fli1, Myb, PU.1, GATA1, NFE2, MafK, EKLF, Elf1, SCL, GATA2 and C/EBPa, connected by influences labelled with signs and the numbers 1, 2, 3. Panel (b): DBN with start structure G(0) and nodes c/ebpa, pu.1, gata1, gata2, scl and elf1 at times 0, t-1 and t.]
Fig. 1. (a) A gene regulation model for control of lineage commitment in haemopoietic stem cells. Labels are explained in the text. (b) DBN
modelling the process of gene regulation involving influences with label 1 in (a). The random variables X[t] = {c/ebpa[t], pu.1[t],
gata1[t], gata2[t], scl[t], elf1[t]} indicate the concentrations of the corresponding transcription factors at time t. G(0) is the start structure
specifying the conditional independence assumptions over the initial states X[0]. If the X[0] are randomly distributed, no correlations are observed in G(0)
and hence no arcs are given. G is the transition structure representing conditional independence assumptions between state
transitions. We assume for simplicity that gene regulation is a Markovian and static process, hence the transition structure models only the
dependencies between two succeeding time steps t - 1 and t.
This was achieved by investigating the Boolean functions of 1000 randomly created Boolean networks, according to Fig. 1a.
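As a rough illustration of how randomly created Boolean networks can be generated and advanced by one transition, the following minimal Python sketch may help. It is not the authors' generator: the function names, the uniform choice of inputs, and the bound `max_inputs` are assumptions made for illustration, and the sketch does not enforce the topology of Fig. 1a.

```python
import itertools
import random

def random_boolean_network(n_genes, max_inputs, rng):
    """Return one (parents, truth_table) pair per gene (hypothetical helper)."""
    network = []
    for _ in range(n_genes):
        k = rng.randint(1, max_inputs)
        parents = tuple(sorted(rng.sample(range(n_genes), k)))
        # One random output bit for each of the 2**k parent configurations.
        table = {cfg: rng.randint(0, 1)
                 for cfg in itertools.product((0, 1), repeat=k)}
        network.append((parents, table))
    return network

def step(network, state):
    """Synchronous update: every gene reads its parents in the old state."""
    return tuple(table[tuple(state[p] for p in parents)]
                 for parents, table in network)

rng = random.Random(0)
net = random_boolean_network(n_genes=6, max_inputs=3, rng=rng)
state = tuple(rng.randint(0, 1) for _ in range(6))
print(state, "->", step(net, state))
```

Note that a randomly drawn truth table may ignore some of its nominal inputs, so a study of the kind described here would presumably filter or construct the functions more carefully.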
are given, they include on average $2^N \left(1 - (1 - (1/2)^N)^M\right)$ unique network states, so that for N = 6 and M = 64 at least
400 redundant network states must be sampled to
observe more than 63 different initial states. In line with
experimental feasibility, we assume that either n = 1 or
n = 2 elements can be jointly manipulated. That is, for our inference strategies we assume that the states of either one or two
of the N elements are known before the transition. The network
state that evolves after one transition according to the Boolean network rules is assumed to be completely known, since
it can be completely determined experimentally.
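The expected number of distinct states among M states drawn uniformly and with replacement from the 2^N possible network states can be checked numerically. The sketch below assumes exactly this uniform sampling model; the function name is hypothetical.

```python
def expected_unique_states(n_genes, n_samples):
    """Expected number of distinct states among n_samples uniform draws
    (with replacement) from the 2**n_genes possible network states."""
    n_states = 2 ** n_genes
    return n_states * (1.0 - (1.0 - 1.0 / n_states) ** n_samples)

for m in (64, 400):
    print(m, round(expected_unique_states(6, m), 2))
```

Under this model, 64 draws for N = 6 cover on average only about 41 of the 64 states, while 400 draws give an expected coverage above 63.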
Reverse engineering method: The general modelling
scheme of the reverse engineering approach PL to infer the
topology of the DBN comprises the following steps:
(i) Initial determination of model parameters and topology
of the DBN. {See lines 1-2 Alg. 1.}
(ii) Optimization of structure
(a) Generation of a set of successors of the DBN by
inserting edges. {See line 7 Alg. 1.}
(b) Scoring all successors locally. A local score value evaluates to what extent element X_i[t] depends on its
putative parents Pa(X_i[t]). {See lines 8-10 Alg. 1.}
(c) Continue with the best scoring successors, i.e. hill
climbing. {See line 11 Alg. 1.}
(iii) Optimization of parameters
(a) Given an optimal topology of the DBN, the parameters θ_{i j_i k_i}
are estimated by expectation maximization (EM) [9, 13].
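Step (ii)(b) can be illustrated with a small sketch of a local score based on empirical mutual information. This is only an assumption-laden illustration, not the PL algorithm itself: the data layout (a dict of manipulated, hence known, genes before the transition plus the full state after it), the function names, and the single greedy step are hypothetical, and significance testing and prior knowledge are left out.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) of a list of (x, y) pairs."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def local_score(experiments, target, parents):
    """Score how strongly gene `target` after the transition depends on
    `parents` before it, using only experiments in which all of those
    parents were set (and are therefore known)."""
    pairs = [(tuple(known[p] for p in parents), after[target])
             for known, after in experiments
             if all(p in known for p in parents)]
    return mutual_information(pairs)

def best_single_parent(experiments, target, n_genes):
    """One greedy step: pick the single candidate parent with the top score."""
    return max(range(n_genes),
               key=lambda p: local_score(experiments, target, (p,)))

# Toy data (made up): gene 0 after the transition is NOT gene 1 before it.
experiments = []
for value in (0, 1):
    for rep in range(10):
        experiments.append(({1: value}, (1 - value, 0, 0)))  # gene 1 was set
        experiments.append(({2: value}, (rep % 2, 0, 0)))    # gene 2 was set
print(best_single_parent(experiments, target=0, n_genes=3))
```

In this toy data set gene 1 carries one full bit of information about gene 0, gene 2 carries none, and the pair (1, 2) cannot be scored at all because no experiment fixes both genes, which mirrors the limitation imposed by small n.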
Algorithm 1 : PL
1: Initialize Ginit with empty structure and Θinit with randomly chosen parameters
2: Set current optimal model to Gopt := Ginit and Θopt := Θinit
3: Optimization of structure:
4: Let n be the maximal number of elements X_i[t-1]
5-15: [steps not recoverable from the extracted text]
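$$\prod_{t=1}^{T} \dots \qquad \prod_{t=0}^{T} \prod_{i=1}^{N} \dots$$

$$\sum_{k_i=1}^{r_i} \sum_{\{j_i[t]\}} \dots \log_2 \dots$$

$$LR := \frac{P(D \mid H_0)}{P(D \mid H_1)} = \dots \prod_{k_i=1}^{r_i} \prod_{\{j_i[t]\}} \dots$$

$$\frac{P(H_0 \mid D)}{P(H_1 \mid D)} = \frac{P(D \mid H_0)\,P(H_0)}{P(D \mid H_1)\,P(H_1)} \qquad (4)$$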
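The posterior odds of Eq. (4) combine the likelihood ratio with prior knowledge through Bayes' rule. A minimal numerical sketch follows, assuming that H0 (independence) and H1 (dependence) are the only two hypotheses, so that P(H1) = 1 - P(H0); the function name and the example numbers are made up.

```python
def posterior_odds(likelihood_h0, likelihood_h1, prior_h0):
    """Posterior odds P(H0|D) / P(H1|D) for two exhaustive hypotheses."""
    prior_h1 = 1.0 - prior_h0  # assumption: H0 and H1 are the only options
    return (likelihood_h0 * prior_h0) / (likelihood_h1 * prior_h1)

# Made-up likelihoods: the data favour dependence (H1) four to one.
print(posterior_odds(0.02, 0.08, prior_h0=0.5))  # flat prior
print(posterior_odds(0.02, 0.08, prior_h0=0.8))  # prior favouring H0
```

With a flat prior the posterior odds equal the likelihood ratio, whereas a prior that favours independence by the same factor of four cancels the evidence in the data.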
For this step we used the EM implementation of the LibB toolkit [14].
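[Equations (5) and (6) are not recoverable from the extracted text. The surviving fragments are a sum over x[1], ..., x[T], sums with upper limits T, N and r_i, the distributions P(X[1], ..., X[T]) and Q(X[1], ..., X[T]), and log_2.]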
[Figure 2 (plots not reproduced). Panels (a)-(d). Panel titles: Sensitivity ncommit=1, PPV ncommit=1. Vertical axes: fk, Sk, PPVk; horizontal axis: M (logarithmic, 10 to 1000).]
Fig. 2. The parameters of plots (a)-(c) are = 0.01, n = 1, noise and no redundancy. (a) fk vs. log(M). For M = 100, for example, we have
performed 100 in-silico experiments in which the 1st gene was fixed randomly to either 0 or 1, 100 experiments in which the 2nd gene was
fixed, etc. The dotted lines denote inference including prior knowledge, the full lines inference without prior knowledge, each for k = 1 (circles),
k = 2 (squares), k = 3 (diamonds), k = 4 (triangles up), k = 5 (triangles down). Each point represents an average over 1000 networks. fk
shows a threshold behavior as a function of log(M) and saturates at sufficiently large values of M. The larger k is, the larger is the value
of M at which saturation is observed. (b) Sk vs. log(M). For large M, Sk saturates at Sksat = 1 for all k. The saturation occurs earlier
with decreasing values of k. Prior knowledge results in an earlier saturation. (c) PPVk vs. log(M). The PPV is already large at M = 10 and
converges to PPVksat = 1 for all k. For large k the convergence is much faster (at small M) than the saturation of fk and Sk. That is, those
parents that are found by the DBN are indeed also parents in the original BoolN, i.e., they are true positives. (d) Condensed information on
PPVk. The parameters are = 0.01, n = 2, noise and no redundancy. The bars denote the 25%, 50% and 75% quantiles. For D = 0.01
and no prior knowledge the PPVk rapidly converges to 1 for all k. However, in case of prior knowledge and too large D, the PPVk
converges to 0.4. A smaller value of D is required to restrict the fraction of false positive parents (inset in (d)).
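For readers who want to reproduce quantities of this kind, sensitivity and positive predictive value can be computed from the true and inferred parent sets as sketched below. The standard definitions TP/(TP+FN) and TP/(TP+FP) are assumed; the function name and the pooling over all genes (rather than grouping by the number of inputs k) are simplifications made for illustration, and the fidelity fk is not covered.

```python
def sensitivity_and_ppv(true_parents, inferred_parents):
    """Both arguments map each gene to the set of its parent genes.
    Genes that appear only in `inferred_parents` are ignored."""
    tp = fp = fn = 0
    for gene, truth in true_parents.items():
        found = inferred_parents.get(gene, set())
        tp += len(truth & found)   # parents correctly recovered
        fp += len(found - truth)   # inferred parents that are not real
        fn += len(truth - found)   # real parents that were missed
    sens = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sens, ppv

# Made-up example with one missed parent and one false positive.
truth = {0: {1, 2}, 1: {0}}
inferred = {0: {1}, 1: {0, 2}}
print(sensitivity_and_ppv(truth, inferred))
```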
[Figure 3 (plots not reproduced). Panels (a) and (b): Mth (logarithmic axis) vs. k.]
Fig. 3. (a) Mth vs. k in case of noise and no redundancy in simulation data. Mth, defined as the number of experiments at which fk =
0.9 fksat, increases approximately exponentially with k. This permits extrapolation of Mth to larger k. All data points are calculated without
redundancy and as averages of 1000 network samples. (b) Mth vs. k in case of noise and redundancy in simulation data. For an explanation see
text. Note that in case of n = 1, = 0.01 the fidelity does not converge to 1. In section 3.2 we chose = 0.001, because then the fidelity
saturates again at 1 for each k while still retaining a high sensitivity at smaller M.
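An approximately exponential growth of Mth with k can be fitted and extrapolated by linear regression on log(Mth). The sketch below assumes the form Mth = A exp(b k), where the rate name `b` is hypothetical, and uses synthetic values that are not taken from the paper.

```python
import math

def fit_exponential(ks, m_th):
    """Least-squares fit of log(M_th) = log(A) + b * k; returns (A, b)."""
    logs = [math.log(m) for m in m_th]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_l = sum(logs) / n
    b = (sum((k - mean_k) * (l - mean_l) for k, l in zip(ks, logs))
         / sum((k - mean_k) ** 2 for k in ks))
    a = math.exp(mean_l - b * mean_k)
    return a, b

# Synthetic, made-up values used only to exercise the fit.
ks = [1, 2, 3, 4, 5]
m_th = [20.0 * math.exp(0.9 * k) for k in ks]
a, b = fit_exponential(ks, m_th)
print(round(a, 3), round(b, 3))
print("extrapolated M_th at k = 6:", round(a * math.exp(b * 6)))
```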
[Figure 4 (plots not reproduced). Panel (a): Hamming distance (logarithmic axis) vs. M; panel (b): Hrel vs. M. Legend: n=1, =0.01, no prior; n=1, =0.01, prior; n=2, =0.01, D=1e-05, no prior; n=2, =0.01, D=1e-05, prior.]
Fig. 4. Hamming distance and relative entropy in case of noise and no redundancy. (a) 95% confidence intervals of mean of Hamming
distance between inferred and true network. Vertical line is shown at M = 250, horizontal line at a Hamming distance of 0.01 (see text).
(b) 95% confidence intervals of mean of relative entropy. The means were estimated from 1000 DBNs. Hrel is calculated for each inferred
DBN from a sample of 1000 transition vectors which are generated from the original BoolN.
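The two quality measures of Fig. 4 can be illustrated with a short sketch. The exact definitions used for the figure are not reproduced here, so the code assumes the standard relative entropy (Kullback-Leibler divergence, base 2) from a reference distribution p to a model distribution q, and a Hamming distance normalized by sequence length; all function names are hypothetical.

```python
import math
from collections import Counter

def relative_entropy(p, q):
    """Sum over x of p(x) * log2(p(x) / q(x)), in bits.
    p and q map outcomes to probabilities; q must be non-zero wherever p is."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def empirical_distribution(samples):
    """Relative frequencies of hashable samples, e.g. transition vectors."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

def normalized_hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

p = empirical_distribution(["a", "a", "b", "b"])
q = {"a": 0.25, "b": 0.75}
print(relative_entropy(p, q))
print(normalized_hamming((0, 1, 1, 0), (0, 1, 0, 0)))
```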
3.2
In a true experiment the haematopoietic stem cells are assumed to be in a periodic attractor with either two or more
states. For this purpose we generated networks with either
2 or 3 attractor states. The number of network states that
can be attained by a perturbation of either one (n = 1) or at
most two (n = 2) elements of the attractor states is very limited and usually smaller than 2^N. According to experimental
feasibility we have chosen n = 2 and the parameters denoted
in Fig. 5. Again we assume that we are able to completely
monitor the network state that follows the perturbation.
The simulated fidelity curves look very different from the
case in which an arbitrary initial state could be chosen (Fig.
5). The fidelity increases almost immediately for all k but
does not converge to one. The reason is that the limited number of initial states is frequently and repeatedly offered to the
DBN so that it learns quickly the topology inherent in the
transitions which start from these initial states. On the other
hand, if the attractors have too few states, then not enough
states can be attained to ensure a complete inference. Both fk
and Sk depend significantly on the number of attractor states.
For prior knowledge the saturation values were the same but
Mth was reduced by 20% (not shown).
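How strongly a short attractor limits the attainable initial states can be seen in a small enumeration. The sketch below uses a hypothetical toy update rule (not the network of Fig. 1) and counts the states reachable by changing at most n genes of any attractor state; setting up to n genes to arbitrary values yields the same set as flipping up to n genes.

```python
import itertools

def find_attractor(step, state):
    """Iterate a deterministic update until a state repeats; return the cycle."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state)
    return trajectory[seen[state]:]

def reachable_by_perturbation(attractor, n):
    """All states obtained by flipping at most n genes of an attractor state."""
    reachable = set()
    n_genes = len(attractor[0])
    for state in attractor:
        for r in range(n + 1):
            for genes in itertools.combinations(range(n_genes), r):
                flipped = list(state)
                for g in genes:
                    flipped[g] ^= 1
                reachable.add(tuple(flipped))
    return reachable

def toy_step(s):
    """Hypothetical 6-gene update rule, chosen only for illustration."""
    return (s[1], s[0], s[2] & s[3], s[3], s[4] | s[5], s[5])

attractor = find_attractor(toy_step, (0, 1, 0, 0, 0, 0))
reachable = reachable_by_perturbation(attractor, n=2)
print(len(attractor), len(reachable), 2 ** 6)
```

For this toy rule the attractor has two states, and with n = 2 only half of the 64 network states can serve as known initial states.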
4 DISCUSSION
We have simulated gene network inference for small networks in the absence of extensive knowledge concerning the
transition states. In order to infer the network topology we
used a partial learning strategy which identifies the input
of gene elements by assessing the amount of information
transmitted to a gene from each gene or from groups of
genes. The network topology was represented by a dynamic Bayesian network and the joint probabilities for a given
topology were used to calculate the mutual information score.
Finally, expectation maximization was used to optimize the
inference parameters. This inference strategy permits the
inclusion of topological prior knowledge by considering the
likelihood ratio (ratio of the likelihood of independent elements to the likelihood of dependent elements) in the same
way as the mutual information score. Our studies were guided by a hypothesized core network for haematopoietic stem
cell commitment. For this network, we found that each element transmits information individually, so that in principle
one influenceable gene is sufficient to determine the whole
network topology. We quantified the degree of knowledge
on the network topology by the fraction of correctly learned
inputs (which we call fidelity). We found that prior knowledge, a larger number of influenceable initial states, or a
larger number of accessible initial states, all decrease the
number of experiments necessary to obtain the same fidelity. Redundancy increases the number of experiments as
well as the requirement for a larger statistical significance of
the resulting topology. Similar tendencies are found for the
sensitivities. However, the positive predictive value behaves
counter-intuitively in that, if the chosen significance value is
not sufficiently small, then the PPV saturates at values
below 1 in the case of prior knowledge for a small number of
input genes, and in the presence of noise. Prior knowledge
increases the tendency to keep false positive elements in case
the prior knowledge does not meet the situation found in the
data.
In a real experimental situation networks may often be in
attractors prior to experimental perturbation, which may largely limit the number of states accessible by a perturbation of
only a small number of elements. For this case, we found a
completely different shape of fidelity curves and a saturation
[Figure 5 (plots not reproduced). Panels: (a) fk vs. M; (b) Sk vs. M. M is shown on a logarithmic axis from 40 to 1000.]
Fig. 5. Parameter settings are n = 2, = 0.001, D = 10^-5, no prior knowledge, redundancy and noise. (a) Fidelity vs. M for attractors
with 2 and 3 states. The full lines denote inference on networks with 2 attractor states and the dashed lines with 3 attractor states, each
for k = 1 (circles), k = 2 (squares), k = 3 (diamonds), k = 4 (triangles up) and k = 5 (triangles down). The saturation value for fk is
close to 1 in case of at least 3 attractor states. (b) Corresponding sensitivity curves. PPVk was almost 1 in all cases.
value often far below that found when all network states are
in principle accessible. However, both the parameter learning
(which we assessed by calculating the relative entropy) and the
accessibility of states improve significantly with the number
of gene states that can be experimentally modified.
We also assessed these strategies for networks of N = 10
genes without any pre-conditions on the network topology.
The only difference from the results reported here was that the
fidelities saturated at values that were usually smaller than
one. However, the fidelity can still be described by the functional form denoted in equation (6), and Mth can still be fitted
to the form Mth A exp(k). For a given number n of
genes that can be manipulated jointly the saturation value is
given by the ratio of those canalizing functions in which n
genes transmit information on the output to the total number of canalizing functions. Here, increasing n significantly
improves network inference.
Further steps could lead in several directions. Firstly, our
strategy may be extended to infer subnetworks of larger networks, as demonstrated in Ref. [31] for gene networks in
which each state is fully experimentally accessible. Secondly,
our strategy may be generalized to continuous expression
levels based on DBNs that permit continuous state
functions [28].
Acknowledgments: Useful discussions with D. Hasenclever and M.
Löffler are gratefully acknowledged. This work was partly supported by the Interdisciplinary Center for Clinical Research, University
of Leipzig (Project N02) and the grant BIZ-6 1/1 from the Deutsche
Forschungsgemeinschaft.
REFERENCES
[1] T. Akutsu, S. Kuhara, O. Maruyama, and S. Miyano. Identification of gene regulatory networks by strategic gene disruptions
and gene overexpressions. Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA98), pages 695-702, 1998.
[2] R. Albert and H. G. Othmer. The topology of the regulatory
interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J. Theoretical Biol.,
223:1-18, 2003.
[3] Z. Bar-Joseph. Analyzing time series gene expression data.
Bioinformatics, 20(16):2493-2503, 2004.
[4] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L.
Wild. A Bayesian approach to reconstructing genetic regulatory
networks with hidden factors. Bioinformatics, 21(3):349-356, 2005.
[5] G. Brady, F. Billia, J. Knox, T. Hoang, I. R. Kirsch, E. B. Voura,
R. G. Hawley, R. Cumming, M. Buchwald, and K. Siminovitch.
Analysis of gene expression in a complex differentiation hierarchy by global amplification of cDNA from single cells. Current
Biology, 5(8):909-922, 1995.
[6] L. J. Burke and A. Baniahmad. Co-repressors 2000. FASEB
Journal, 14(13):1876-1888, 2000.
[7] R. C. Conant. Extended dependency analysis of large systems
part I: Dynamic analysis. Int. J. General Systems, 14:97-123, 1988.
[8] M. A. Cross and T. Enver. The lineage commitment of haemopoietic progenitor cells. Current Opinion in Genetics and
Development, 7:609-613, 1997.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal
of the Royal Statistical Society, 39:1-38, 1977.
[10] P. D'haeseleer, S. Liang, and R. Somogyi. Genetic network
inference: From co-expression clustering to reverse engineering. Bioinformatics, 16(8):707-726, 2000.
[11] D. Drasdo, T. Hwa, and M. Lässig. Scaling laws and similarity
detection in sequence alignment with gaps. J. Comp. Biol.,
7:115-141, 2000.