Abstract: Probabilistic finite-state machines are used today in a variety of areas in pattern recognition, or in fields to which pattern recognition is linked: computational linguistics, machine learning, time series analysis, circuit testing, computational biology, speech recognition and machine translation are some of them. In part I of this paper we survey these generative objects and study their definitions and properties. In part II, we will study the relation of probabilistic finite-state automata with other well-known devices that generate strings, such as hidden Markov models and n-grams, and provide theorems, algorithms and properties that represent the current state of the art for these objects.
Index Terms: Automata (F.1.1.a), Classes defined by grammars or automata (F.4.3.b), Machine learning (I.2.6.g), Language acquisition (I.2.6.h), Language models (I.2.7.c), Language parsing and understanding (I.2.7.d), Machine translation (I.2.7.f), Speech recognition and synthesis (I.2.7.g), Structural Pattern Recognition (I.5.1.f), Syntactic Pattern Recognition (I.5.1.g).
I. INTRODUCTION

Probabilistic finite-state machines such as probabilistic finite-state automata (PFA) [1], hidden Markov models (HMMs) [2], [3], stochastic regular grammars [4], Markov chains [5], n-grams [3], [6], probabilistic suffix trees [7], deterministic stochastic or probabilistic automata (DPFA) [4], and weighted automata [8] are some names of syntactic objects which during the past years have been used to model and generate distributions over sets of possible infinite cardinality of strings, sequences, words, phrases, but also terms and trees.
Dr. Vidal and Dr. Casacuberta are with the Dto. de Sistemas Informáticos y Computación and the Instituto Tecnológico de Informática, Universitat Politècnica de València, Spain. Dr. de la Higuera and Dr. Thollard are with EURISE and the Université Jean Monnet, France. Dr. Carrasco is with the Dto. de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Spain.
Their successes in a wide range of fields, from computational linguistics [8] to pattern recognition [9]–[12], including language modeling in speech recognition [2], [3], [13], bioinformatics [14]–[17], music modeling [18], machine translation [8], [19]–[26], circuit testing [27] and time series analysis [28], make these objects very valuable indeed. But as more and more researchers have entered this field, definitions and notations have varied, and not enough energy has been spent to reach a common language. For the outsider, the choice of the best-fitting syntactic object to describe the sort of distribution she/he is working on will seldom depend on anything else than the usual knowledge in the sub-field or on her/his own background. There have been a number of survey papers dedicated to one or another of these models during the past thirty years [8], [29]–[33], but it is not always obvious from reading these papers how the models inter-relate and where the difficulties lie. These difficulties have been theoretically analyzed in the computational learning theory literature [34]–[38]; alas, these results, highly technical, seem not to have reached the adequate communities. A possible exception is the very recent article by Dupont et al. [39]. Furthermore, more and more folk theorems appear: HMMs might be equivalent to PFA; parsing a string in the non-deterministic case by taking the best derivation (instead of summing up over the possible derivations) could be a good approximation; determinism might not (as in common language theory) modify the expressive power of PFA. Some of these results are true, others are not. And even in the case of the true folk theorems, most researchers would not know why they hold. The complexity of the objects themselves, and moreover of the underlying theories (for instance probabilities, matrix calculus, rational power series), makes many of the usual results depend on some exterior theory: for example, consider the question (studied in section IV-C) of knowing if the mean of two regular deterministic distributions is also regular deterministic. If this were so, we could merge distributions using DPFA. But this is false, and a proof can be given using results on rational series. We argue that such a proof (albeit simpler than the one we propose) offers little insight for people working in the field. Knowing how to construct the counter-example is of much more use: it helps, for instance, to build hard cases that can be used for other problems, or to identify a sub-class of distributions where the
counter-example will not hold.

The above example gives the spirit in which the paper is written. It aims to provide an up-to-date theory of PFA, but also a survey where the objects themselves give the answers to the questions that naturally arise. Another preliminary question is that of justifying our interest in PFA to describe distributions, rather than some other device, among which the most popular may be the HMMs. Our choice of centering the survey on PFA instead of HMMs obeys at least three reasons:
- Formal language theory appears to be today a widespread background knowledge for researchers and engineers in computer science. Adding probabilities to well-known objects such as automata permits us to build on our intuitions and experiences. On the other hand, HMMs are directly issued from probability theory, and this parentage also affects the way the theory is constructed. PFA are built to deal with the problem of probabilizing a structured space by adding probabilities to structure, whereas HMMs might rather be considered as devices that structure probabilistic spaces by adding structure to probabilities. Neither choice is fundamentally better, but if concerned with a task where one wishes to use probabilistic devices in order to grasp the structure of the data, the first one seems more appropriate.
- As we will prove in the second part of our paper [40], PFA can represent the same class of distributions as those modeled by the HMMs defined there. Furthermore, they can do so in at most as much space, and the common algorithms are at least as simple.
- A third point is that, as PFA are finite-state automata with weights that verify some constraints, if the underlying automaton is deterministic we have a deterministic probabilistic finite-state automaton (DPFA). In formal language theory there is a key difference between deterministic and non-deterministic finite-state machines, which extends to the probabilistic case: DPFA are very much favored because parsing with them is simpler, and also because they admit a minimal object, which in turn makes the equivalence problem tractable. A probabilistic deterministic machine thus also exists, and we will study it with special attention. Even if these machines are not as powerful as their non-deterministic counterparts, they play an important role in a number of applications.
Our first objective will be to establish correct definitions for the different sorts of probabilistic automata; this will be done in section II. The probabilistic automata we consider in this paper are generative processes. It should be noted that, in the line of [41], probabilistic acceptors have also been studied. Even a problem as simple as parsing can be unsettling: we provide in section III all required equations and algorithms dealing with parsing. The goal of that section is to study the relationship between PFA and the strings they generate [42], [43]. Section IV is devoted to studying the intrinsic properties of PFA. Minimality issues are discussed in section IV-A. In section IV-B we will prove that there are distributions that cannot be represented by DPFA, whereas they can be represented by PFA. Topology over regular distributions will be thoroughly studied in section V. On the one hand, entropy-based measures such as the Kullback-Leibler divergence or the perplexity can arguably measure the quality of a model. On the other hand, alternative mathematical distances [16], [17], [44] can be used. Some of them can effectively be computed over the representations of the distributions, at least when these are deterministic. Part II [40] of the paper will be devoted to the comparison with other types of models, learning issues, and the presentation of some extensions of the probabilistic automata. In order to make the manuscript more readable, the proofs of the propositions and theorems are left to the corresponding appendices. As all surveys, this one is incomplete. In our case, completeness is particularly difficult to achieve due to the enormous and increasing number of very different fields where these objects have been used. In advance, we would like to apologize to all those whose work on the subject we have not recalled.

II. DEFINITIONS

Probabilistic finite-state automata are chosen as key syntactic representations of the distributions for a number of reasons:
- Formal language theory is today part of the widespread background knowledge of researchers and engineers in computer science.
- PFA can represent the same distributions as those modeled by some HMMs.
- PFA admit a deterministic version for which most natural problems become tractable. Even though non-deterministic PFA are not equivalent to their deterministic counterparts, these (DPFA) have been studied by a number of authors because of their particular properties.
- In practice, PFA can be used to implement other finite-state models.
There is a variety of definitions regarding PFA in the literature. The ones we choose to give here are sufficiently general to cover most cases where the intended distribution is over the set of all strings (and not just the set of strings of some special length). The cases that do not fit in this definition will be analyzed in the second part of our paper [40]. In the general definition of such automata the probabilities are real numbers but, as they are intended for practical use, the probabilities are rather represented as rational numbers. Also, rational probabilities are needed when discussing computational properties involving the concept of size of an automaton. A different line was successfully followed in [8], where the probabilities are just a special case of abstract weights: the algebra over which the weights are computed then allows one to deal with all cases, whether computable or not. We now give the formal definitions of probabilistic automata we are going to use in the rest of the paper.

A. Stochastic languages

Let Σ be a finite alphabet and Σ* the set of all strings that can be built from Σ, including the empty string, denoted by λ. A language is a subset of Σ*. By convention, symbols in Σ will be denoted by letters from the beginning of the alphabet (a, b, c, ...) and strings in Σ* by letters from the end (..., x, y, z). The length of a string x is written |x|; the set of all strings of length n (resp. of length at most n) will be denoted Σ^n (resp. Σ^{≤n}). A substring of x from position i to position j will be denoted x_i ... x_j.

A stochastic language D is a probability distribution over Σ*. The probability of a string x ∈ Σ* under the distribution D is denoted Pr_D(x), with Σ_{x∈Σ*} Pr_D(x) = 1. If the distribution is modeled by some syntactic machine A, the probability of x according to the distribution defined by A is denoted Pr_A(x), and simplified to Pr(x) when no ambiguity is possible. If L is a language over Σ and D a distribution over Σ*, we write Pr_D(L) = Σ_{x∈L} Pr_D(x).

A sample S is a multi-set of strings: as they are usually built through sampling, one string may appear more than once. The size |S| of the sample is the number of strings it contains, and ||S|| denotes the total sum of the lengths of its strings; note that neither of these measures of size corresponds to the actual number of bits needed to encode a sample. The empirical finite-support distribution associated with S is denoted D_S: Pr_{D_S}(x) = f(x)/|S|, where f(x) is the number of occurrences of x in S, and Pr_{D_S}(x) = 0 if x does not appear in S.
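As a small illustration of the last definition, the following sketch builds the empirical finite-support distribution D_S of a multi-set of strings (the helper name and encoding are illustrative, not from the paper).

```python
from collections import Counter

def empirical_distribution(sample):
    """Empirical finite-support distribution D_S of a multi-set of strings.

    Returns a dict mapping each distinct string x to f(x) / |S|, where f(x)
    is the number of occurrences of x in the sample S.
    """
    counts = Counter(sample)
    size = len(sample)                           # |S|: number of strings in the multi-set
    return {x: c / size for x, c in counts.items()}

sample = ["ab", "ab", "b", "aab"]                # a multi-set S with |S| = 4 and ||S|| = 8
d_s = empirical_distribution(sample)
assert abs(sum(d_s.values()) - 1.0) < 1e-12      # D_S is a (finite-support) distribution
print(d_s)                                       # {'ab': 0.5, 'b': 0.25, 'aab': 0.25}
```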
B. Probabilistic automata

We present in this section formal definitions about probabilistic automata. These are directly inspired by a number of works in machine learning and pattern recognition, including [1], [4], [13], [45]–[47].

Definition 1: A PFA is a tuple A = ⟨Q, Σ, δ, I, F, P⟩, where:
- Q is a finite set of states;
- Σ is the alphabet;
- δ ⊆ Q × Σ × Q is a set of transitions;
- I : Q → [0, 1] (initial-state probabilities);
- F : Q → [0, 1] (final-state probabilities);
- P : δ → [0, 1] (transition probabilities).

I, F and P are such that

Σ_{q∈Q} I(q) = 1,

and, for every state q ∈ Q,

F(q) + Σ_{a∈Σ, q'∈Q} P(q, a, q') = 1.

For the sake of notation simplification, P(q, a, q') is taken to be zero for every (q, a, q') not in δ, so that P can be considered as total. Similarly, for all q ∈ Q, I(q) and F(q) are zero whenever they are not explicitly given.
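The constraints of Definition 1 are easy to check on a concrete encoding. The sketch below uses a dictionary representation that is only an illustration (the paper prescribes no data structure); the same toy encoding is reused in the later sketches.

```python
def is_pfa(init, final, trans, tol=1e-12):
    """Check the constraints of Definition 1 on a candidate PFA.

    init:  dict q -> I(q)
    final: dict q -> F(q)
    trans: dict q -> list of (symbol, q_next, probability)   # transitions leaving q
    """
    states = set(init) | set(final) | set(trans)
    # Initial-state probabilities must sum to 1.
    if abs(sum(init.get(q, 0.0) for q in states) - 1.0) > tol:
        return False
    # At every state, F(q) plus the outgoing transition probabilities must sum to 1.
    for q in states:
        out = sum(p for (_, _, p) in trans.get(q, []))
        if abs(final.get(q, 0.0) + out - 1.0) > tol:
            return False
    return True

# A small two-state PFA over {a, b} (a toy example, not the PFA of Fig. 1).
init  = {0: 1.0, 1: 0.0}
final = {0: 0.2, 1: 0.5}
trans = {0: [("a", 0, 0.5), ("b", 1, 0.3)],
         1: [("a", 1, 0.5)]}
print(is_pfa(init, final, trans))   # True
```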
As will be seen in the next section, the above definition corresponds to models which are generative in nature. This is in contrast with the standard definition of automata in conventional (non-probabilistic) formal language theory, where strings are generated by grammars while the automata are the accepting devices. It is not difficult to prove that the definition adopted in this article is equivalent to the definition of stochastic regular grammar [9], [32]. From a probabilistic point of view, the process of (randomly) accepting a given string is essentially different from the process of generating a (random) string. Probabilistic acceptors are defined in [9], [41], but they have only seldom been considered in syntactic pattern recognition or in (probabilistic) formal language theory.

Typically, PFA are represented as directed labeled graphs. Figure 1 shows such a graph for a PFA with four states and a four-symbol alphabet; each edge is labeled with a symbol and the corresponding transition probability (e.g., a (1/2), b (2/5), d (1/16)).

Fig. 1. Graphical representation of a PFA.
A particular case of PFA arises when the underlying graph is acyclic. This type of model is known as an acyclic probabilistic finite-state automaton (APFA) [48]. On the other hand, a more general model is defined in the next subsection.
C. PFA with λ-transitions (λ-PFA)

A λ-PFA is a PFA in which the set of transitions is extended to δ ⊆ Q × (Σ ∪ {λ}) × Q; that is, transitions labeled with the empty string λ are also allowed, and the normalization constraint at each state now also includes the outgoing λ-transitions.

λ-PFA appear as natural objects when combining distributions. They are nevertheless not more powerful than PFA, in the sense that they generate the same distributions (see section IV-C). λ-PFA introduce specific problems, in particular when sequences of transitions labeled with λ are considered; in section III-C some of these problems are analyzed.

D. Deterministic probabilistic finite-state automata (DPFA)

Even though determinism (as we shall show later) restricts the class of distributions that can be generated, we introduce deterministic probabilistic finite-state automata because of the following reasons:
- Parsing is easier, as only one path has to be followed.
- Some intractable problems (finding the most probable string, comparing two distributions) become tractable.
- There are a number of positive learning results for DPFA that do not hold for PFA.

Definition 4: A PFA A = ⟨Q, Σ, δ, I, F, P⟩ is a DPFA if:
- there is a single initial state q0 ∈ Q, i.e. I(q0) = 1; and
- for every q ∈ Q and every a ∈ Σ, there is at most one state q' such that (q, a, q') ∈ δ.

In a DPFA, a transition (q, a, q') is thus completely defined by the state q and the symbol a. A particular case of DPFA is the probabilistic prefix tree automaton (PPTA), in which the underlying graph is a tree rooted at the initial state q0.
E. Size of a PFA

If PFA are to be implemented, then we are concerned with two issues. On the one hand, all probabilities have to be encoded, and thus the range of the functions I, F and P should be the rational numbers in [0, 1] rather than the reals. On the other hand, we must be able to give the size of a PFA (or DPFA, λ-PFA): the complexity of the algorithms should be polynomially linked with the number of bits needed to encode the PFA in a reasonable way. It follows that in the case of DPFA a correct measure of the size is the sum of the number of states, the size of the alphabet and the number of bits needed to encode all the non-null probabilities in the automaton. In the case of PFA or λ-PFA, because of non-determinism, the number of transitions in the automaton should also appear.
F. Distributions modeled by PFA

PFA are stochastic machines that may not generate a probability space but a sub-probability space over the set Σ* of finite-length strings. Given a PFA (or λ-PFA) A, the generation of a string proceeds as follows.

Generation: an initial state q0 is chosen according to the distribution I and taken as the current state. Then, at each step, let q be the current state: with probability F(q) the process stops; otherwise a transition (q, a, q') is chosen with probability P(q, a, q'), the symbol a is output and q' becomes the current state.

In some cases this process may never end; i.e., it may generate strings of unbounded length (see section II-G). Only those PFA for which the process terminates with probability one, and which therefore define a probability distribution over Σ*, have much interest; the conditions which guarantee this will be discussed in section II-G.

A sequence of transitions θ = (q0, x1, q1), (q1, x2, q2), ..., (q_{k−1}, x_k, q_k), followed by a stop in q_k, is called a path for the string x = x1 x2 ... x_k in A; the subscripts and states in the sequences of transitions will be omitted if not needed. The probability of generating such a path is:

Pr_A(θ) = I(q0) · ( ∏_{j=1}^{k} P(q_{j−1}, x_j, q_j) ) · F(q_k).    (1)
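A minimal sketch of this generation process, in the toy dictionary encoding introduced earlier; it assumes a consistent PFA (section II-G), since otherwise the random walk may never stop.

```python
import random

def generate(init, final, trans, rng=random.Random(0)):
    """Generate one string by a random walk through a PFA.

    At each state q the walk stops with probability F(q); otherwise one of the
    outgoing transitions (a, q', P(q, a, q')) is drawn and the symbol a is emitted.
    """
    # Choose the initial state according to I.
    states, weights = zip(*init.items())
    q = rng.choices(states, weights=weights)[0]
    out = []
    while True:
        if rng.random() < final.get(q, 0.0):      # stop with probability F(q)
            return "".join(out)
        choices = trans.get(q, [])
        total = sum(p for (_, _, p) in choices)   # equals 1 - F(q) for a well-formed PFA
        r = rng.random() * total
        for a, q_next, p in choices:
            r -= p
            if r <= 0:
                out.append(a)
                q = q_next
                break

init  = {0: 1.0, 1: 0.0}
final = {0: 0.2, 1: 0.5}
trans = {0: [("a", 0, 0.5), ("b", 1, 0.3)], 1: [("a", 1, 0.5)]}
print([generate(init, final, trans) for _ in range(5)])
```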
A path for x is valid if its probability is strictly greater than zero; the set of valid paths for x in A will be denoted as Θ_A(x). In unambiguous contexts, Θ_A(x) will be extended in section III to also mean the set of sub-paths that generate a substring of x; these sub-paths will be allowed to start or end in states with null initial or final probabilities, respectively.

The probability that A generates x is then:

Pr_A(x) = Σ_{θ∈Θ_A(x)} Pr_A(θ).    (2)

For the PFA of figure 1 there is only one valid path for the example string considered there; therefore, the probability of that string is simply the probability of its unique path. In general, a PFA for which some string has more than one valid path is called ambiguous.

If Σ_{x∈Σ*} Pr_A(x) = 1, then Pr_A defines a distribution D on Σ*; in this case we say that the distribution D can be generated by A.
The definition of DPFA directly yields:

Proposition 1: No DPFA is ambiguous.

We conclude this section by defining classes of string distributions on the basis of the corresponding generating automata.

Definition 6: A distribution is regular if it can be generated by some PFA.

An alternative definition could be used: a regular distribution is a probability distribution on a regular language. However, we do not adopt this definition because it would present the following problem: there would exist regular distributions which could not be generated by any PFA. This result can be easily derived from [32].

Definition 7: A distribution is regular deterministic if it can be generated by some DPFA.

Definition 8: Two PFA are equivalent if they generate the same distribution.

From the definitions of PFA and DPFA the following hierarchy follows:

Proposition 2: A regular deterministic distribution is also a regular distribution.

The reverse of this proposition is not always true (see proposition 10, section IV). It is interesting to note that APFA and PPTA only generate distributions on finite sets of strings. Moreover, given any finite sample S, a PPTA that generates the associated empirical distribution D_S can be constructed.
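The construction mentioned in the last sentence can be sketched as follows; this is an illustrative implementation in the same toy encoding, with states identified with the prefixes of the sample strings and probabilities given by relative frequencies.

```python
from collections import Counter

def build_ppta(sample):
    """Probabilistic prefix tree automaton generating the empirical distribution of `sample`.

    Returns (init, final, trans) in the toy encoding used in the other sketches.
    """
    prefix_count = Counter()                 # number of sample strings having each prefix
    string_count = Counter(sample)           # number of occurrences of each full string
    for x in sample:
        for i in range(len(x) + 1):
            prefix_count[x[:i]] += 1
    init = {"": 1.0}                         # the root (empty prefix) is the single initial state
    final, trans = {}, {}
    for prefix, n in prefix_count.items():
        final[prefix] = string_count.get(prefix, 0) / n
        trans[prefix] = []
        for child in prefix_count:
            if len(child) == len(prefix) + 1 and child.startswith(prefix):
                trans[prefix].append((child[-1], child, prefix_count[child] / n))
    return init, final, trans

init, final, trans = build_ppta(["ab", "ab", "b", "aab"])
print(final["ab"])    # 1.0: every sample string with prefix "ab" is exactly "ab"
```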
G. Consistency of PFA

The question of consistency is: do the probabilities provided by an automaton according to equation (2) sum up to 1? In early papers in the field the question was supposed to be simple [31] or, on the contrary, complex when concerned with stochastic context-free grammars; in that setting the consistency can be checked by analyzing the behavior of the underlying probability matrix [32], [49]. The conditions needed for a PFA to be consistent are established as follows [39]:

Definition 9: A state of a PFA is useful if it appears in at least one valid path of Θ_A(x) for some string x.

Proposition 3: A PFA is consistent if all its states are useful.
Note that the condition of proposition 3 is sufficient but not necessary: a non-useful state is harmless if it is inaccessible, i.e., if no string can reach it with probability greater than zero.

Once the syntactic models and the corresponding string distributions have been defined, we discuss in the next section how to compute the probability of a given string in the distribution modeled by a given probabilistic automaton.

III. PARSING ISSUES
We understand parsing as the computation of equation (2). In the case of DPFA, the algorithms are simpler than for non-deterministic PFA: the time needed to compute the probability of a string x grows linearly with |x|, and this computational cost does not depend on the number of states, since at each step the only possible next state is determined by the current state and the current symbol. In the non-deterministic case a string may be generated through many different paths, but equation (2) can still be computed efficiently by using dynamic programming. Another problem related with parsing is the computation of the probability of a substring in a PFA [46].

For a string x = x1 ... x_n, the probability given by equation (2) can be obtained by introducing the function α(i, q), defined as the probability of generating the prefix x1 ... x_i and reaching state q. It satisfies

α(0, q) = I(q),    α(i, q') = Σ_{q∈Q} α(i−1, q) · P(q, x_i, q')  for 1 ≤ i ≤ n,    (3)

so that Pr_A(x) = Σ_{q∈Q} α(n, q) · F(q). Equation (3) can be calculated with the following algorithm.

Algorithm 3.1 (Forward algorithm): compute α(i, q) for i = 0, 1, ..., n and all q ∈ Q using equation (3); then return Σ_{q∈Q} α(n, q) · F(q).

The computational cost of equation (2) by this algorithm (and that of the analogous backward computation) is in O(|x| · |Q|²).
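A sketch of Algorithm 3.1 in the dictionary encoding used in the earlier sketches; α is kept as a map from states to the probability of the prefix read so far.

```python
def string_probability(x, init, final, trans):
    """Forward algorithm: Pr_A(x) for a PFA given by (init, final, trans)."""
    states = set(init) | set(final) | set(trans)
    # alpha[q]: probability of having generated the prefix read so far and being in q.
    alpha = {q: init.get(q, 0.0) for q in states}
    for symbol in x:
        new_alpha = {q: 0.0 for q in states}
        for q, a in alpha.items():
            if a == 0.0:
                continue
            for sym, q_next, p in trans.get(q, []):
                if sym == symbol:
                    new_alpha[q_next] += a * p
        alpha = new_alpha
    # Terminate: weight each state by its final probability F(q).
    return sum(alpha[q] * final.get(q, 0.0) for q in states)

init  = {0: 1.0, 1: 0.0}
final = {0: 0.2, 1: 0.5}
trans = {0: [("a", 0, 0.5), ("b", 1, 0.3)], 1: [("a", 1, 0.5)]}
print(string_probability("ab", init, final, trans))   # 0.5 * 0.3 * 0.5 = 0.075
```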
The forward algorithm adds up the probabilities of all the valid paths that deal with x. Symmetrically, Pr_A(x) can be computed backwards by introducing a function β(i, q), defined as the probability of generating the suffix x_{i+1} ... x_n from state q. Both computations yield the same value:

Proposition 4: Pr_A(x) = Σ_{q∈Q} α(n, q) · F(q) = Σ_{q∈Q} I(q) · β(0, q).

However, it can also be interesting to search only for the path of highest probability,

θ̂ = argmax_{θ∈Θ_A(x)} Pr_A(θ),

and to use Pr_A(θ̂) in place of Pr_A(x). The relation between this maximum and the probability given by equation (2) has been studied in [51] and [52]. When good models are used in practice, the probability given by equation (2) is often mainly distributed among a few paths close to the optimal one; in that case, the probability of the optimal path is an adequate approximation to the probability given by equation (2). Moreover, useful information can be attached to the states, and in many cases the problem is to search for
the information that is in the optimal path. This path is also useful for an efficient estimation of the parameters of the model from a training sample (see section III-A of part II). The computation of θ̂ and of its probability can be efficiently performed by defining a function γ(i, q), analogous to α(i, q), in which the sum of equation (3) is replaced by a maximization: γ(i, q) is the probability of the most probable sub-path generating the prefix x1 ... x_i and reaching state q. An algorithmic solution is given by the following scheme.

Algorithm 3.3 (Viterbi algorithm): compute γ(0, q) = I(q) and γ(i, q') = max_{q∈Q} γ(i−1, q) · P(q, x_i, q') for i = 1, ..., n; then Pr_A(θ̂) = max_{q∈Q} γ(n, q) · F(q), and the optimal path itself is recovered by tracing back the states that achieve the maxima.

Several equivalent formulations of this computation are possible (proposition 6), but the implementation of the last ones may lead to numerical precision problems, which can be easily circumvented in the implementation of the first one by using logarithms.
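A sketch of the Viterbi recursion in the same toy encoding, with back-pointers so that the optimal path can be recovered; switching to log-probabilities would avoid the precision issues just mentioned.

```python
def best_path(x, init, final, trans):
    """Viterbi algorithm: probability of the most probable path for x, and that path."""
    states = set(init) | set(final) | set(trans)
    # gamma[q]: probability of the best sub-path generating the prefix so far and ending in q.
    gamma = {q: init.get(q, 0.0) for q in states}
    back = []                                    # back-pointers, one dict per symbol of x
    for symbol in x:
        new_gamma, ptr = {q: 0.0 for q in states}, {}
        for q, g in gamma.items():
            if g == 0.0:
                continue
            for sym, q_next, p in trans.get(q, []):
                if sym == symbol and g * p > new_gamma[q_next]:
                    new_gamma[q_next] = g * p
                    ptr[q_next] = q
        gamma, back = new_gamma, back + [ptr]
    # Best final state, weighted by F(q), then backtrack.
    q_best = max(states, key=lambda q: gamma[q] * final.get(q, 0.0))
    prob = gamma[q_best] * final.get(q_best, 0.0)
    path, q = [q_best], q_best
    for ptr in reversed(back):
        q = ptr.get(q, q)
        path.append(q)
    return prob, list(reversed(path))

init  = {0: 1.0, 1: 0.0}
final = {0: 0.2, 1: 0.5}
trans = {0: [("a", 0, 0.5), ("b", 1, 0.3)], 1: [("a", 1, 0.5)]}
print(best_path("ab", init, final, trans))       # (0.075, [0, 0, 1])
```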
C. Parsing with λ-PFA

Given a λ-PFA A and a string x, the definitions of Θ_A(x) and of equation (2) still apply, but a path may now traverse, between two consecutive symbols of x (as well as before the first symbol and after the last one), an arbitrarily long sequence of transitions labeled with λ. The forward, backward and Viterbi recursions must therefore be modified so that, at each position of x, they also sum (or maximize) over all possible sequences of λ-transitions. This leads to equations analogous to (3) in which the λ-transition probabilities intervene; from these equations, and taking into account the existence of all possible sequences of λ-transitions, the probability of x generated by a λ-PFA is obtained (proposition 7). Analogous results can be obtained for the backward and Viterbi algorithms. The algorithms and propositions presented in the last subsections can also be derived from some results of the theory of discrete stochastic processes [53].
D. The most probable string problem

In the previous problems a string is given and one wishes to compute its probability or that of a generating path. Other related and interesting problems are the most probable string and the most probable constrained string in a PFA. The first is the search for the string of highest probability in the distribution:

x̂ = argmax_{x∈Σ*} Pr_A(x).

The second is the search for a string of length at most n with highest probability:

x̂ = argmax_{x∈Σ^{≤n}} Pr_A(x).

Proposition 8: The computation of the most probable string and the computation of the most probable constrained string in a PFA are NP-hard problems.

The formal proof of proposition 8 can be found in [43]. It should be noted that the problem is NP-hard, but its membership to NP is an open question. A similar question is proved to be undecidable for PFA acceptors, which are different from the PFA covered in this paper [54]. However, the problem of searching for the string associated with the most probable derivation in a PFA, that is, given a PFA A, computing

argmax_{x∈Σ*} max_{θ∈Θ_A(x)} Pr_A(θ),

is polynomial [43].
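For very small bounds n, the constrained problem can of course be solved by exhaustive enumeration. The sketch below (which repeats the forward function so as to stay self-contained) makes the exponential cost explicit; it only illustrates the problem statement and does not contradict Proposition 8.

```python
from itertools import product

def string_probability(x, init, final, trans):
    """Forward algorithm (as in the earlier sketch): Pr_A(x)."""
    states = set(init) | set(final) | set(trans)
    alpha = {q: init.get(q, 0.0) for q in states}
    for symbol in x:
        new_alpha = {q: 0.0 for q in states}
        for q, a in alpha.items():
            for sym, q_next, p in trans.get(q, []):
                if sym == symbol:
                    new_alpha[q_next] += a * p
        alpha = new_alpha
    return sum(alpha[q] * final.get(q, 0.0) for q in states)

def most_probable_bounded_string(alphabet, n, init, final, trans):
    """Brute-force argmax over all strings of length at most n.

    The number of candidates grows as |Sigma|^n, so this is only feasible
    for tiny n; no polynomial algorithm is expected (Proposition 8).
    """
    best, best_p = "", string_probability("", init, final, trans)
    for length in range(1, n + 1):
        for tup in product(alphabet, repeat=length):
            x = "".join(tup)
            p = string_probability(x, init, final, trans)
            if p > best_p:
                best, best_p = x, p
    return best, best_p

init  = {0: 1.0, 1: 0.0}
final = {0: 0.2, 1: 0.5}
trans = {0: [("a", 0, 0.5), ("b", 1, 0.3)], 1: [("a", 1, 0.5)]}
print(most_probable_bounded_string("ab", 4, init, final, trans))
```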
IV. PROPERTIES

We now turn to study the properties of these models. What are the normal forms? Are there minimal forms, and can equivalence be decided? Such questions are standard in formal language theory [55], but can lead to unexpected results in the probabilistic case.
A. A minimal form for DPFA

A central question that arises when considering any finite devices is that of being able to decide the equivalence between two such devices. This is important for learning, as it is known [56] that a class is hard to learn if the equivalence problem is not tractable. In terms of probabilistic automata, the question one wishes to answer is: given two PFA (resp. two DPFA), are they equivalent? In the case of probabilistic objects a more natural question may be: given two PFA (resp. two DPFA), is the distance between the distributions they generate at most ε?

While the second question requires a concept of distance, and will therefore be discussed in section V, part of the first question can be answered here.

A Nerode theorem for DPFA: for any DPFA there exists an equivalent minimal canonical DPFA. The construction of a minimal DPFA can be found in [57]. This construction is based on the definition of an equivalence relation between strings on one hand and between states on the other; extending the Nerode relation over the states of the automaton, two states are related whenever the distributions of the strings generated starting from each of them coincide. This relation has finite index, and from it the minimal canonical DPFA can be constructed by merging equivalent states; it is unique up to a state-isomorphism. This can be done in polynomial time. In [8] an efficient algorithm that does this is given, and cases where even non-deterministic PFA can be put into canonical form (for instance if they are acyclic) are also studied there.
This enables us therefore to test the equivalence between two DPFA: minimize each and compare. If the corresponding minimal DPFA are isomorphic (a simple relabeling of the states through their minimum prefixes is enough to test this), then the initial DPFA are equivalent. In the non-deterministic case, Tzeng [45] proposes an algorithm that directly tests whether two PFA are equivalent, but no result concerning a minimal PFA is known.

B. Equivalence of PFA and DPFA

One expects to find standard automata results when dealing with regular stochastic languages; for instance, that determinism does not imply a loss of expressive power. We prove here that this is not true. The result is mostly known and sometimes proved elsewhere (for instance in [8], [39]), but the construction of the counter-example is of use: it informs us that the mean of two deterministic regular distributions may not be regular deterministic. We first define the mean of two deterministic regular distributions and argue that this mean distribution may not be regular deterministic.
Given two distributions D1 and D2 over Σ*, we denote by (D1 + D2)/2 the distribution D such that, for every x ∈ Σ*, Pr_D(x) = (Pr_{D1}(x) + Pr_{D2}(x))/2.

Proposition 9: There exist two regular deterministic distributions whose mean is not regular deterministic.

Proposition 10: There exist distributions that can be generated by PFA but not by DPFA.

The proof is a simple consequence of proposition 9: take the PFA of figure 2 as a counter-example (see appendix A).

Fig. 2. A PFA generating a distribution that no DPFA can generate; its transitions carry probabilities such as a (1/2) and a (1/3), and its states final probabilities 1/2 and 2/3.
C. λ-PFA and PFA

As announced in section II-C, λ-transitions do not add expressive power: given a λ-PFA A representing a distribution D, an equivalent PFA B representing the same distribution can be constructed, with a number of states and a size at most polynomial in those of A, and the construction can be carried out in polynomial time [58]. Figure 3 illustrates the transformation on a small example.

Fig. 3. Elimination of λ-transitions: a λ-PFA and an equivalent PFA (initial probabilities such as I(0)=1 and I(1)=0, and transition probabilities such as b (1/2), b (5/8), b (5/16) and a (1/4), label the graphs).

V. COMBINING DISTRIBUTIONS: AUTOMATA PRODUCT
There can be many simple ways of combining non-deterministic PFA. But, because of the special interest DPFA represent, it would be of singular use to have some means of modifying deterministic regular distributions, and of combining them. We give two results in this section, one relating to the product of two automata (the co-emission probability), and the second to the computation of the weight of a language inside a distribution. From proposition 9 we know that the mean of two regular deterministic distributions may not be regular deterministic; thus combining two DPFA has to be done in a different way. We can compute the product automaton as follows.

Let A = ⟨Q_A, Σ, δ_A, q_A, F_A, P_A⟩ and B = ⟨Q_B, Σ, δ_B, q_B, F_B, P_B⟩ be two DPFA. Their product is the automaton whose states are the pairs (q, q') ∈ Q_A × Q_B, whose initial state is (q_A, q_B), and which contains a transition ((q, q'), a, (r, r')) with weight P_A(q, a, r) · P_B(q', a, r') whenever (q, a, r) ∈ δ_A and (q', a, r') ∈ δ_B; the final weight of a state (q, q') is F_A(q) · F_B(q').
The weight that this product assigns to a string x is exactly Pr_A(x) · Pr_B(x), and summing over all strings gives the co-emission probability of A and B:

coem(A, B) = Σ_{x∈Σ*} Pr_A(x) · Pr_B(x).

This quantity is of use when computing the distance between two distributions, but it is also of interest as it measures the interactions between two distributions. In [16] it is proved that this is computable for APFA; intractability results for more complicated architectures are proved in [17].
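A sketch of an iterative approximation of the co-emission of two automata in the same toy encoding: the product states are explored breadth-first and the contributions of longer and longer strings are accumulated until the remaining mass is negligible. An exact computation would instead solve the linear system associated with the product automaton.

```python
def coemission(pfa_a, pfa_b, tol=1e-12, max_len=10_000):
    """Iterative approximation of coem(A, B) = sum over x of Pr_A(x) * Pr_B(x).

    Each PFA is a triple (init, final, trans) in the toy encoding used above.
    level[(qa, qb)] holds the joint probability of both automata having generated
    the same prefix of the current length and being in states (qa, qb).
    """
    (ia, fa, ta), (ib, fb, tb) = pfa_a, pfa_b
    level = {(qa, qb): pa * pb
             for qa, pa in ia.items() for qb, pb in ib.items() if pa * pb > 0}
    total = 0.0
    for _ in range(max_len):
        # Both automata stop here: add the probability of emitting exactly this prefix.
        total += sum(v * fa.get(qa, 0.0) * fb.get(qb, 0.0) for (qa, qb), v in level.items())
        nxt = {}
        for (qa, qb), v in level.items():
            for sym_a, ra, pa in ta.get(qa, []):
                for sym_b, rb, pb in tb.get(qb, []):
                    if sym_a == sym_b:
                        nxt[(ra, rb)] = nxt.get((ra, rb), 0.0) + v * pa * pb
        if sum(nxt.values()) < tol:              # remaining mass is negligible
            break
        level = nxt
    return total

a = ({0: 1.0}, {0: 0.5}, {0: [("a", 0, 0.5)]})   # Pr_A(a^n) = 0.5^(n+1)
print(coemission(a, a))                           # converges to 1/3
```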
The same sort of techniques allows us to consider the product automaton obtained by taking a deterministic finite-state automaton (DFA) representing a regular language L and a DPFA representing a distribution D. We will not define DFA formally here, but recommend that the reader refer to the usual textbooks [55] if needed. The construction parallels the previous one: states are pairs formed by a state of the DFA and a state of the DPFA, a transition of the product labeled by a ∈ Σ exists whenever the corresponding transitions exist in both automata, and its weight is the probability of the DPFA transition. The final weight of a pair is given by:

F((q, q')) = F_D(q') if q is an accepting state of the DFA, and F((q, q')) = 0 otherwise.    (20)
Note that the construction does not necessarily yield a consistent DPFA: at every state the sum of probabilities might be less than 1. The construction is nevertheless of interest and yields the following result: it gives a direct method of computing the weight Pr_D(L) of a regular language L under a regular deterministic distribution D, with a complexity which is linear in the product of the sizes of the two automata. It should be noted that this problem had already been solved for special cases of the language L.

VI. COMPARING DISTRIBUTIONS: SIMILARITY MEASURES
Defining similarity measures between distributions is the most natural way of comparing them. Even if the question of exact equivalence (discussed in section IV-A) is of interest, in practical cases we wish to know whether the distributions are close or not. In tasks involving the learning of PFA or DPFA, one wants to measure the quality of the result or of the learning process. When learning takes place from a training sample, measuring how far the learned automaton is from a (test) sample can also be done by comparing distributions, as a sample can be encoded as a PPTA. There are two families of distance measures: those that are true distances, and those that measure a cross-entropy. We study both types.
A. Mathematical distances

All the definitions hereafter are seen as definitions of distances between distributions over Σ*. In doing so they implicitly define distances between automata, but also between automata and samples, since a sample can be identified with the empirical distribution it defines.

For each k ≥ 1, the distance associated with the norm L_k is

d_k(D, D') = ( Σ_{x∈Σ*} | Pr_D(x) − Pr_{D'}(x) |^k )^{1/k},

and, for the norm L_∞,

d_∞(D, D') = max_{x∈Σ*} | Pr_D(x) − Pr_{D'}(x) |.
When concerned with very small probabilities, such as those that may arise when an infinite number of strings have non-null probability, it may be more useful to use logarithms of probabilities. In this way two strings with very small probabilities may influence the distance when their relative probabilities are very different: the logarithmic distance,

d_log(D, D') = max_{x∈Σ*} | log Pr_D(x) − log Pr_{D'}(x) |,

depends only on the ratio between the probabilities that the two distributions assign to each string, so a pair of very small probabilities can contribute as much as a pair of large ones. It should be noticed that the logarithmic distance is infinite when the probability of a string is null in one distribution and strictly positive in the other one.

B. Entropy-based measures

In what follows, log is used to represent base-2 logarithms. The most widely used measure in this family is the Kullback-Leibler divergence:

d_KL(D, D') = Σ_{x∈Σ*} Pr_D(x) · log ( Pr_D(x) / Pr_{D'}(x) ),

with the usual convention that 0 · log 0 = 0; as for the logarithmic distance, the divergence is infinite when some string has a null probability in D' but not in D. The divergence can be rewritten as

d_KL(D, D') = Σ_{x∈Σ*} Pr_D(x) · log Pr_D(x) − Σ_{x∈Σ*} Pr_D(x) · log Pr_{D'}(x).    (21)

From this writing, we can see that the Kullback-Leibler divergence has some of the logarithmic distance properties.
C. Some properties

Obviously, each d_k and d_∞ defined above is a true distance: it is non-negative, symmetric, null only when the two distributions coincide, and it satisfies the triangle inequality. The Kullback-Leibler divergence behaves differently:
1) d_KL(D, D') ≥ 0, and d_KL(D, D') = 0 if and only if D = D';
2) it is not symmetric: in general, d_KL(D, D') ≠ d_KL(D', D);
3) it does not satisfy the triangle inequality.

From the information theory interpretation [59], if strings are distributed according to D, the first term of equation (21) corresponds (up to its sign) to the number of bits needed to encode them with a code based on D itself, and the second one to the number of bits used when the code is based on D'; the divergence therefore measures the cost (in number of bits of encoding) one must pay when estimating D by means of D'.
D. Computing distances

In some places it is interesting either to compare a theoretical distribution with the empirical one, or to compare different distributions with respect to an empirical one. For the first purpose we can use propositions 13 and 14 (the second one following [44]; the proofs are given in the appendix), which bound the distance between a distribution and the empirical distribution of a sample drawn according to it in terms of the size of the sample. In case one wants to learn or estimate distributions, this kind of result is commonly used to compare the different learning algorithms: a sample of the target distribution is built, and a distance between the learned distribution and the sample is computed.

In applications such as language modeling [60] or statistical clustering [61], [62], a distance can be estimated as the empirical cross-entropy of the model on a sample S. Let set(S) denote the set which contains the unique elements of the sample S (removing the repetitions). The corresponding empirical cross-entropy of a distribution D with respect to S can then be written as:

Ĥ(S, D) = − (1/|S|) Σ_{x∈set(S)} f(x) · log Pr_D(x),    (22)

where f(x) is the number of occurrences of x in the multi-set S.
Another measure often used in the language modeling community is the perplexity of a given model D with respect to a sample S. Using the same notation, we have:

PP(S, D) = 2^{Ĥ(S,D)}    (23)
         = ( ∏_{x∈set(S)} Pr_D(x)^{f(x)} )^{−1/|S|}.    (24)

In practice, rather than the entropy (or perplexity) per string given by the previous equations, the entropy (or the perplexity) per symbol is often preferred [3]. It can be obtained approximately by replacing |S| with ||S|| in the equations above.

The properties of the perplexity can be summarized as follows:
- Equation (22) says that the cross-entropy measures the average number of bits needed to encode each string of the sample S using the model D.
- From equation (23), the perplexity measures the corresponding average number of choices entailed by this coding.
- From equation (24), the perplexity can be seen as the inverse of the geometric mean of the probabilities of the sample strings according to the model.

On the other hand, in practical work, the following properties must be carefully taken into account:
- Perplexity and entropy diverge as soon as one of the probabilities according to D is zero. In practice, this implies that the perplexity can only be used if D is smoothed; i.e., if it provides a non-null probability for every string of Σ*.
- Obviously, perplexity and entropy only make sense if (the smoothed version of) D is really a probabilistic model; i.e., if the probabilities of all possible strings sum up to one.
- The perplexity can compare models only using the same sample S.
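A sketch of equations (22)-(24): the empirical cross-entropy and perplexity of a model on a sample, given any function returning Pr_D(x). The toy model used in the demonstration is an assumption chosen only so that no sample string gets a null probability; in practice the forward computation of section III would be plugged in.

```python
import math
from collections import Counter

def cross_entropy_and_perplexity(sample, prob):
    """Empirical cross-entropy (22) and perplexity (23)-(24) of a model on a sample.

    `prob` maps a string x to the model probability Pr_D(x); it must be non-null
    on every sample string (i.e. the model must be smoothed).
    """
    counts = Counter(sample)                       # f(x) for each distinct x
    size = len(sample)                             # |S|
    h = -sum(f * math.log2(prob(x)) for x, f in counts.items()) / size   # equation (22)
    return h, 2.0 ** h                             # perplexity, equation (23)

# Toy "model": a geometric-like distribution over {a, b}* (an assumption for the demo).
def toy_prob(x):
    return 0.5 * (0.25 ** len(x))                  # sums to 1 over all strings of {a, b}*

sample = ["ab", "ab", "b", "aab"]
h, pp = cross_entropy_and_perplexity(sample, toy_prob)
# Equation (24): the perplexity equals the inverse geometric mean of the probabilities.
geo_mean = math.prod(toy_prob(x) for x in sample) ** (1.0 / len(sample))
print(h, pp, 1.0 / geo_mean)                       # pp and 1/geo_mean coincide
```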
VII. CONCLUSION

We have provided in this first part a number of results centered on the probabilistic automata and the distributions themselves. It remains to study the relationships between these models and other important models that can be found in the literature. Also, the important tasks of approximating, learning or identifying these models, all of them central problems in structural pattern recognition, need to be explored. All this will be done in part II of this paper [40].

ACKNOWLEDGMENTS

The authors wish to thank the anonymous reviewers for their careful reading and in-depth criticisms and suggestions.
APPENDIX

Proofs of the propositions. In particular, the proof of proposition 10 proceeds by contradiction: assuming that some DPFA generates the distribution of the PFA of figure 2, a string followed by a cycle of states is considered, and the resulting equations lead to an impossibility (a relation whose right-hand side is odd while its left-hand side is even, and another that simplifies to 3 = 13, which is also a contradiction). The proof of proposition 13 relies on the product construction matching definition (20) of section V and on the fact that, when the objects involved are given by DPFA, the corresponding quantities can be computed in polynomial time.

REFERENCES
[1] A. Paz, Introduction to Probabilistic Automata. New York, NY: Academic Press, 1971.
[2] L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, pp. 257–286, 1989.
[3] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, Massachusetts: The MIT Press, 1998.
[4] R. Carrasco and J. Oncina, Learning stochastic regular grammars by means of a state merging method, ser. Lecture Notes in Computer Science, R. C. Carrasco and J. Oncina, Eds., no. 862. Berlin, Heidelberg: Springer Verlag, 1994, pp. 139–150.
[5] L. Saul and F. Pereira, Aggregate and mixed-order Markov models for statistical language processing, in Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, C. Cardie and R. Weischedel, Eds. Somerset, New Jersey: Association for Computational Linguistics, 1997, pp. 81–89.
[6] H. Ney, S. Martin, and F. Wessel, Corpus-Based Statistical Methods in Speech and Language Processing, S. Young and G. Bloothooft, Eds. Kluwer Academic Publishers, 1997, ch. Statistical Language Modeling Using Leaving-One-Out, pp. 174–207.
[7] D. Ron, Y. Singer, and N. Tishby, Learning probabilistic automata with variable memory length, in Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory. New Brunswick, New Jersey: ACM Press, 1994, pp. 35–46.
[8] M. Mohri, Finite-state transducers in language and speech processing, Computational Linguistics, vol. 23, no. 3, pp. 269–311, 1997.
[9] K. S. Fu, Syntactic Pattern Recognition and Applications. Prentice Hall, 1982.
[10] L. Miclet, Structural Methods in Pattern Recognition. Springer-Verlag, 1987.
[11] S. Lucas, E. Vidal, A. Amari, S. Hanlon, and J. C. Amengual, A comparison of syntactic and statistical techniques for off-line OCR, ser. Lecture Notes in Computer Science, R. C. Carrasco and J. Oncina, Eds., no. 862. Berlin, Heidelberg: Springer Verlag, 1994, pp. 168–179.
[12] D. Ron, Y. Singer, and N. Tishby, On the learnability and usage of acyclic probabilistic finite automata, in Proceedings of COLT 1995, 1995, pp. 31–40.
[13] H. Ney, Stochastic grammars and pattern recognition, in Proceedings of the NATO Advanced Study Institute, P. Laface and R. D. Mori, Eds. Springer-Verlag, 1992, pp. 313–344.
[14] N. Abe and H. Mamitsuka, Predicting protein secondary structure using stochastic tree grammars, Machine Learning, vol. 29, pp. 275–301, 1997.
[15] Y. Sakakibara, M. Brown, R. Hughley, I. Mian, K. Sjolander, R. Underwood, and D. Haussler, Stochastic context-free grammars for tRNA modeling, Nuclear Acids Res., vol. 22, pp. 5112–5120, 1994.
[16] R. B. Lyngsø, C. N. S. Pedersen, and H. Nielsen, Metrics and similarity measures for hidden Markov models, in Proceedings of ISMB'99, 1999.
[17] R. B. Lyngsø and C. N. S. Pedersen, Complexity of comparing hidden Markov models, in Proceedings of ISAAC'01, 2001.
[18] P. Cruz and E. Vidal, Learning regular grammars to model musical style: Comparing different coding schemes, ser. Lecture Notes in Computer Science, V. Honavar and G. Slutski, Eds., no. 1433. Berlin, Heidelberg: Springer-Verlag, 1998, pp. 211–222.
[19] M. G. Thomason, Regular stochastic syntax-directed translations, Computer Science Department, University of Tennessee, Knoxville, Tech. Rep. CS-76-17, 1976.
[20] M. Mohri, F. Pereira, and M. Riley, The design principles of a weighted finite-state transducer library, Theoretical Computer Science, vol. 231, pp. 17–32, 2000.
[21] H. Alshawi, S. Bangalore, and S. Douglas, Learning dependency translation models as collections of finite state head transducers, Computational Linguistics, vol. 26, 2000.
[22] H. Alshawi, S. Bangalore, and S. Douglas, Head transducer models for speech translation and their automatic acquisition from bilingual data, Machine Translation, 2000.
[23] J. C. Amengual, J. M. Benedí, F. Casacuberta, A. Castaño, A. Castellanos, V. M. Jiménez, D. Llorens, A. Marzal, M. Pastor, F. Prat, E. Vidal, and J. M. Vilar, The EUTRANS-I speech translation system, Machine Translation Journal, vol. 15, no. 1-2, pp. 75–103, 2000.
[24] S. Bangalore and G. Riccardi, Stochastic finite-state models for spoken language machine translation, in Proceedings of the Workshop on Embedded Machine Translation Systems, NAACL, Seattle, USA, May 2000, pp. 52–59.
[25] S. Bangalore and G. Riccardi, A finite-state approach to machine translation, in Proceedings of the North American ACL 2001, Pittsburgh, USA, May 2001.
[26] F. Casacuberta, H. Ney, F. J. Och, E. Vidal, J. M. Vilar, S. Barrachina, I. García-Varea, D. Llorens, C. Martínez, S. Molau, F. Nevado, M. Pastor, D. Picó, A. Sanchis, and C. Tillmann, Some approaches to statistical and finite-state speech-to-speech translation, Computer Speech and Language, 2003.
[27] L. Bréhélin, O. Gascuel, and G. Caraux, Hidden Markov models with patterns to learn boolean vector sequences and application to the built-in self-test for integrated circuits, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 997–1008, 2001.
[28] Y. Bengio, V.-P. Lauzon, and R. Ducharme, Experiments on the application of IOHMMs to model financial returns series, IEEE Transactions on Neural Networks, vol. 12, no. 1, pp. 113–123, 2001.
[29] K. S. Fu, Syntactic Methods in Pattern Recognition. New York: Academic Press, 1974.
[30] J. J. Paradaens, A general definition of stochastic automata, Computing, vol. 13, pp. 93–105, 1974.
[31] K. S. Fu and T. L. Booth, Grammatical inference: Introduction and survey, parts I and II, IEEE Transactions on Systems, Man and Cybernetics, vol. 5, pp. 59–72 and 409–423, 1975.
[32] C. S. Wetherell, Probabilistic languages: A review and some open questions, Computing Surveys, vol. 12, no. 4, 1980.
[33] F. Casacuberta, Some relations among stochastic finite state networks used in automatic speech recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 691–695, 1990.
[34] D. Angluin, Identifying languages from stochastic examples, Yale University, Tech. Rep. YALEU/DCS/RR-614, March 1988.
[35] M. Kearns and L. Valiant, Cryptographic limitations on learning boolean formulae and finite automata, in 21st ACM Symposium on Theory of Computing, 1989, pp. 433–444.
[36] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. E. Schapire, and L. Sellie, On the learnability of discrete distributions, in Proc. of the 25th Annual ACM Symposium on Theory of Computing, 1994, pp. 273–282.
[37] M. Kearns and U. Vazirani, An Introduction to Computational Learning Theory. MIT Press, 1994.
[38] N. Abe and M. Warmuth, On the computational complexity of approximating distributions by probabilistic automata, in Proceedings of the Third Workshop on Computational Learning Theory. Morgan Kaufmann, 1998, pp. 52–66.
[39] P. Dupont, F. Denis, and Y. Esposito, Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms, Pattern Recognition, 2004, to appear.
[40] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. C. Carrasco, Probabilistic finite-state automata, part II, IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Issue on Syntactic and Structural Pattern Recognition, 2004.
[41] M. O. Rabin, Probabilistic automata, Information and Control, vol. 6, no. 3, pp. 230–245, 1963.
[42] G. D. Forney, The Viterbi algorithm, in Proceedings of the IEEE, vol. 3, 1973, pp. 268–278.
[43] F. Casacuberta and C. de la Higuera, Computational complexity of problems on probabilistic grammars and transducers, ser. Lecture Notes in Computer Science, A. de Oliveira, Ed., vol. 1891. Berlin, Heidelberg: Springer-Verlag, 2000, pp. 15–24.
[44] R. C. Carrasco, Accurate computation of the relative entropy between stochastic regular grammars, RAIRO (Theoretical Informatics and Applications), vol. 31, no. 5, pp. 437–444, 1997.
[45] W.-G. Tzeng, A polynomial-time algorithm for the equivalence of probabilistic automata, SIAM Journal on Computing, vol. 21, no. 2, pp. 216–227, 1992.
[46] A. Fred, Computation of substring probabilities in stochastic grammars, in Grammatical Inference: Algorithms and Applications, ser. Lecture Notes in Computer Science, A. de Oliveira, Ed. Berlin, Heidelberg: Springer-Verlag, 2000, vol. 1891, pp. 103–114.
[47] M. Young-Lai and F. W. Tompa, Stochastic grammatical inference of text database structure, Machine Learning, vol. 40, no. 2, pp. 111–137, 2000.
[48] D. Ron and R. Rubinfeld, Learning fallible deterministic finite automata, Machine Learning, vol. 18, pp. 149–185, 1995.
[49] C. Cook and A. Rosenfeld, Some experiments in grammatical inference, NATO ASI on Computer Oriented Learning Processes, Bonas, France, 1974, pp. 157–171.
[50] K. Knill and S. Young, Corpus-Based Statistical Methods in Speech and Language Processing, S. Young and G. Bloothooft, Eds. Kluwer Academic Publishers, 1997, ch. Hidden Markov Models in Speech and Language Processing, pp. 27–68.
[51] N. Merhav and Y. Ephraim, Hidden Markov modeling using a dominant state sequence with application to speech recognition, Computer Speech and Language, vol. 5, pp. 327–339, 1991.
[52] N. Merhav and Y. Ephraim, Maximum likelihood hidden Markov modeling using a dominant sequence of states, IEEE Transactions on Signal Processing, vol. 39, no. 9, pp. 2111–2115, 1991.
[53] R. G. Gallager, Discrete Stochastic Processes. Kluwer Academic Publishers, 1996.
[54] V. D. Blondel and V. Canterini, Undecidable problems for probabilistic automata of fixed dimension, Theory of Computing Systems, vol. 36, no. 3, pp. 231–245, 2003.
[55] M. H. Harrison, Introduction to Formal Language Theory. Addison-Wesley Publishing Company, Inc., 1978.
[56] C. de la Higuera, Characteristic sets for polynomial grammatical inference, Machine Learning, vol. 27, pp. 125–138, 1997.
[57] R. Carrasco and J. Oncina, Learning deterministic regular grammars from stochastic samples in polynomial time, RAIRO (Theoretical Informatics and Applications), vol. 33, no. 1, pp. 1–20, 1999.
[58] C. de la Higuera, Why λ-transitions are not necessary in probabilistic finite automata, EURISE, University of Saint-Etienne, Tech. Rep. 0301, 2003.
[59] T. Cover and J. Thomas, Elements of Information Theory. Wiley Interscience Publication, 1991.
[60] J. Goodman, A bit of progress in language modeling, Microsoft Research, Tech. Rep., 2001.
[61] R. Kneser and H. Ney, Improved clustering techniques for class-based language modelling, in European Conference on Speech Communication and Technology, Berlin, 1993, pp. 973–976.
[62] P. Brown, V. Della Pietra, P. de Souza, J. Lai, and R. Mercer, Class-based N-gram models of natural language, Computational Linguistics, vol. 18, no. 4, pp. 467–479, 1992.
[63] R. C. Carrasco and J. Oncina, Eds., Grammatical Inference and Applications, ICGI-94, ser. Lecture Notes in Computer Science, no. 862. Berlin, Heidelberg: Springer Verlag, 1994.
[64] A. de Oliveira, Ed., Grammatical Inference: Algorithms and Applications, ICGI 2000, ser. Lecture Notes in Computer Science, vol. 1891. Berlin, Heidelberg: Springer-Verlag, 2000.