Sei sulla pagina 1di 69

MANUSCRIPT 1

Bayesian Filtering: From Kalman Filters to


Particle Filters, and Beyond
ZHE CHEN

Abstract— In this self-contained survey/review paper, we system- IV Bayesian Optimal Filtering 9


atically investigate the roots of Bayesian filtering as well as its rich IV-A Optimal Filtering . . . . . . . . . . . . . . . . . . . . . 10
leaves in the literature. Stochastic filtering theory is briefly reviewed IV-B Kalman Filtering . . . . . . . . . . . . . . . . . . . . . 11
with emphasis on nonlinear and non-Gaussian filtering. Following IV-C Optimum Nonlinear Filtering . . . . . . . . . . . . . . 13
the Bayesian statistics, different Bayesian filtering techniques are de-
IV-C.1 Finite-dimensional Filters . . . . . . . . . . . . 13
veloped given different scenarios. Under linear quadratic Gaussian
circumstance, the celebrated Kalman filter can be derived within the
Bayesian framework. Optimal/suboptimal nonlinear filtering tech- V Numerical Approximation Methods 14
niques are extensively investigated. In particular, we focus our at- V-A Gaussian/Laplace Approximation . . . . . . . . . . . . 14
tention on the Bayesian filtering approach based on sequential Monte V-B Iterative Quadrature . . . . . . . . . . . . . . . . . . . 14
Carlo sampling, the so-called particle filters. Many variants of the V-C Mulitgrid Method and Point-Mass Approximation . . 14
particle filter as well as their features (strengths and weaknesses) are V-D Moment Approximation . . . . . . . . . . . . . . . . . 15
discussed. Related theoretical and practical issues are addressed in V-E Gaussian Sum Approximation . . . . . . . . . . . . . . 16
detail. In addition, some other (new) directions on Bayesian filtering V-F Deterministic Sampling Approximation . . . . . . . . . 16
are also explored.
V-G Monte Carlo Sampling Approximation . . . . . . . . . 17
Index Terms— Stochastic filtering, Bayesian filtering, V-G.1 Importance Sampling . . . . . . . . . . . . . . 18
Bayesian inference, particle filter, sequential Monte Carlo, V-G.2 Rejection Sampling . . . . . . . . . . . . . . . . 19
sequential state estimation, Monte Carlo methods.
V-G.3 Sequential Importance Sampling . . . . . . . . 19
V-G.4 Sampling-Importance Resampling . . . . . . . 20
V-G.5 Stratified Sampling . . . . . . . . . . . . . . . . 21
“The probability of any event is the ratio between the
V-G.6 Markov Chain Monte Carlo . . . . . . . . . . . 22
value at which an expectation depending on the happening
of the event ought to be computed, and the value of the V-G.7 Hybrid Monte Carlo . . . . . . . . . . . . . . . 23
thing expected upon its happening.” V-G.8 Quasi-Monte Carlo . . . . . . . . . . . . . . . . 24
— Thomas Bayes (1702-1761), [29]
VI Sequential Monte Carlo Estimation: Particle Filters 25
“Statistics is the art of never having to say you’re
wrong. Variance is what any two statisticians are at.” VI-A Sequential Importance Sampling (SIS) Filter . . . . . 26
— C. J. Bradfield VI-B Bootstrap/SIR filter . . . . . . . . . . . . . . . . . . . 26
VI-C Improved SIS/SIR Filters . . . . . . . . . . . . . . . . 27
Contents VI-D Auxiliary Particle Filter . . . . . . . . . . . . . . . . . 28
VI-E Rejection Particle Filter . . . . . . . . . . . . . . . . . 29
I Introduction 2 VI-F Rao-Blackwellization . . . . . . . . . . . . . . . . . . . 30
I-A Stochastic Filtering Theory . . . . . . . . . . . . . . . 2 VI-GKernel Smoothing and Regularization . . . . . . . . . 31
I-B Bayesian Theory and Bayesian Filtering . . . . . . . . 2 VI-H Data Augmentation . . . . . . . . . . . . . . . . . . . 32
I-C Monte Carlo Methods and Monte Carlo Filtering . . . 2 VI-H.1 Data Augmentation is an Iterative Kernel
I-D Outline of Paper . . . . . . . . . . . . . . . . . . . . . 3 Smoothing Process . . . . . . . . . . . . . . . . 32
VI-H.2 Data Augmentation as a Bayesian Sampling
II Mathematical Preliminaries and Problem Formula- Method . . . . . . . . . . . . . . . . . . . . . . 33
tion 4 VI-I MCMC Particle Filter . . . . . . . . . . . . . . . . . . 33
II-A Preliminaries . . . . . . . . . . . . . . . . . . . . . . . 4 VI-J Mixture Kalman Filters . . . . . . . . . . . . . . . . . 34
II-B Notations . . . . . . . . . . . . . . . . . . . . . . . . . 4 VI-KMixture Particle Filters . . . . . . . . . . . . . . . . . 34
II-C Stochastic Filtering Problem . . . . . . . . . . . . . . 4 VI-L Other Monte Carlo Filters . . . . . . . . . . . . . . . . 35
II-D Nonlinear Stochastic Filtering Is an Ill-posed Inverse VI-MChoices of Proposal Distribution . . . . . . . . . . . . 35
Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 5 VI-M.1Prior Distribution . . . . . . . . . . . . . . . . 35
II-D.1 Inverse Problem . . . . . . . . . . . . . . . . . 5 VI-M.2Annealed Prior Distribution . . . . . . . . . . . 36
II-D.2 Differential Operator and Integral Equation . . 6 VI-M.3Likelihood . . . . . . . . . . . . . . . . . . . . . 36
II-D.3 Relations to Other Problems . . . . . . . . . . 7 VI-M.4Bridging Density and Partitioned Sampling . . 37
II-E Stochastic Differential Equations and Filtering . . . . 7 VI-M.5Gradient-Based Transition Density . . . . . . . 38
VI-M.6EKF as Proposal Distribution . . . . . . . . . . 38
III Bayesian Statistics and Bayesian Estimation 8 VI-M.7Unscented Particle Filter . . . . . . . . . . . . 38
III-ABayesian Statistics . . . . . . . . . . . . . . . . . . . . 8 VI-N Bayesian Smoothing . . . . . . . . . . . . . . . . . . . 38
III-B Recursive Bayesian Estimation . . . . . . . . . . . . . 9 VI-N.1 Fixed-point smoothing . . . . . . . . . . . . . . 38
VI-N.2 Fixed-lag smoothing . . . . . . . . . . . . . . . 39
The work is supported by the Natural Sciences and Engineering VI-N.3 Fixed-interval smoothing . . . . . . . . . . . . 39
Research Council of Canada. Z. Chen was also partially supported VI-OLikelihood Estimate . . . . . . . . . . . . . . . . . . . 40
by Clifton W. Sherman Scholarship.
VI-P Theoretical and Practical Issues . . . . . . . . . . . . . 40
The author is with the Communications Research Laboratory,
McMaster University, Hamilton, Ontario, Canada L8S 4K1, e- VI-P.1 Convergence and Asymptotic Results . . . . . 40
mail: zhechen@soma.crl.mcmaster.ca, Tel: (905)525-9140 x27282, VI-P.2 Bias-Variance . . . . . . . . . . . . . . . . . . . 41
Fax:(905)521-2922. VI-P.3 Robustness . . . . . . . . . . . . . . . . . . . . 43
VI-P.4 Adaptive Procedure . . . . . . . . . . . . . . . 46
MANUSCRIPT 2

VI-P.5 Evaluation and Implementation . . . . . . . . . 46 its line have been proposed and developed to overcome its
limitation.
VIIOther Forms of Bayesian Filtering and Inference 47
VII-AConjugate Analysis Approach . . . . . . . . . . . . . . 47 B. Bayesian Theory and Bayesian Filtering
VII-BDifferential Geometrical Approach . . . . . . . . . . . 47
VII-CInteracting Multiple Models . . . . . . . . . . . . . . . 48 Bayesian theory2 was originally discovered by the British
VII-DBayesian Kernel Approaches . . . . . . . . . . . . . . . 48 researcher Thomas Bayes in a posthumous publication in
VII-EDynamic Bayesian Networks . . . . . . . . . . . . . . . 48 1763 [29]. The well-known Bayes theorem describes the
VIIISelected Applications 49
fundamental probability law governing the process of log-
VIII-ATarget Tracking . . . . . . . . . . . . . . . . . . . . . . 49 ical inference. However, Bayesian theory has not gained
VIII-BComputer Vision and Robotics . . . . . . . . . . . . . 49 its deserved attention in the early days until its modern
VIII-CDigital Communications . . . . . . . . . . . . . . . . . 49 form was rediscovered by the French mathematician Pierre-
VIII-DSpeech Enhancement and Speech Recognition . . . . . 50 Simon de Laplace in Théorie analytique des probailités.3
VIII-EMachine Learning . . . . . . . . . . . . . . . . . . . . . 50 Bayesian inference [38], [388], [375], devoted to applying
VIII-FOthers . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Bayesian statistics to statistical inference, has become one
VIII-GAn Illustrative Example: Robot-Arm Problem . . . . . 50
of the important branches in statistics, and has been ap-
IX Discussion and Critique 51 plied successfully in statistical decision, detection and es-
IX-A Parameter Estimation . . . . . . . . . . . . . . . . . . 51 timation, pattern recognition, and machine learning. In
IX-B Joint Estimation and Dual Estimation . . . . . . . . . 51 particular, the November 19 issue of 1999 Science mag-
IX-C Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 azine has given the Bayesian research boom a four-page
IX-DLocalization Methods . . . . . . . . . . . . . . . . . . 52
special attention [320]. In many scenarios, the solutions
IX-E Dimensionality Reduction and Projection . . . . . . . 53
IX-F Unanswered Questions . . . . . . . . . . . . . . . . . . 53 gained through Bayesian inference are viewed as “optimal”.
Not surprisingly, Bayesian theory was also studied in the
X Summary and Concluding Remarks 55 filtering literature. One of the first exploration of itera-
tive Bayesian estimation is found in Ho and Lee’ paper
I. Introduction [212], in which they specified the principle and procedure
of Bayesian filtering. Sprangins [426] discussed the itera-
T HE contents of this paper contain three major scien-
tific areas: stochastic filtering theory, Bayesian theory,
and Monte Carlo methods. All of them are closely discussed
tive application of Bayes rule to sequential parameter esti-
mation and called it as “Bayesian learning”. Lin and Yau
[301] and Chien an Fu [92] discussed Bayesian approach
around the subject of our interest: Bayesian filtering. In
to optimization of adaptive systems. Bucy [62] and Bucy
the course of explaining this long story, some relevant the-
and Senne [63] also explored the point-mass approximation
ories are briefly reviewed for the purpose of providing the
method in the Bayesian filtering framework.
reader a complete picture. Mathematical preliminaries and
background materials are also provided in detail for the C. Monte Carlo Methods and Monte Carlo Filtering
self-containing purpose.
The early idea of Monte Carlo4 can be traced back to
A. Stochastic Filtering Theory the problem of Buffon’s needle when Buffon attempted
in 1777 to estimate π (see e.g., [419]). But the modern
Stochastic filtering theory was first established in the formulation of Monte Carlo methods started from 1940s
early 1940s due to the pioneering work by Norbert Wiener in physics [330], [329], [393] and later in 1950s to statis-
[487], [488] and Andrey N. Kolmogorov [264], [265], and it tics [198]. During the World War II, John von Neumann,
culminated in 1960 for the publication of classic Kalman Stanislaw Ulam, Niick Metropolis, and others initialized
filter (KF) [250] (and subsequent Kalman-Bucy filter in the Monte Carlo method in Los Alamos Laboratory. von
1961 [249]), 1 though many credits should be also due to Neumann also used Monte Carlo method to calculate the
some earlier work by Bode and Shannon [46], Zadeh and elements of an inverse matrix, in which they redefined the
Ragazzini [502], [503], Swerling [434], Levinson [297], and “Russian roulette” and “splitting” methods [472]. In recent
others. Without any exaggeration, it seems fair to say decades, Monte Carlo techniques have been rediscovered in-
that the Kalman filter (and its numerous variants) have dependently in statistics, physics, and engineering. Many
dominated the adaptive filter theory for decades in signal new Monte Carlo methodologies (e.g. Bayesian bootstrap,
processing and control areas. Nowadays, Kalman filters hybrid Monte Carlo, quasi Monte Carlo) have been reju-
have been applied in the various engineering and scientific venated and developed. Roughly speaking, Monte Carlo
areas, including communications, machine learning, neu-
roscience, economics, finance, political science, and many 2 A generalized Bayesian theory is the so-called Quasi-Bayesian the-

others. Bearing in mind that Kalman filter is limited by its ory (e.g. [100]) that is built on the convex set of probability distribu-
tions and a relaxed set of aximoms about preferences, which we don’t
assumptions, numerous nonlinear filtering methods along discuss in this paper.
3 An interesting history of Thomas Bayes and its famous essay is
1 Another important event in 1960 is the publication of the cele- found in [110].
brated least-mean-squares (LMS) algorithm [485]. However, the LMS 4 The method is named after the city in the Monaco principality,
filter is not discussed in this paper, the reader can refer to [486], [205], because of a roulette, a simple random number generator. The name
[207], [247] for more information. was first suggested by Stanislaw Ulam.
MANUSCRIPT 3

technique is a kind of stochastic sampling approach aim- tial geometry approach, variational method, or conjugate
ing to tackle the complex systems which are analytically method. Some potential future directions, will be consid-
intractable. The power of Monte Carlo methods is that ering combining these methods with Monte Carlo sampling
they can attack the difficult numerical integration prob- techniques, as we will discuss in the paper. The attention
lems. In recent years, sequential Monte Carlo approaches of this paper, however, is still on the Monte Carlo methods
have attracted more and more attention to the researchers and particularly sequential Monte Carlo estimation.
from different areas, with many successful applications in
statistics (see e.g. the March special issue of 2001 Annals D. Outline of Paper
of the Institute of Statistical Mathematics), sig-
In this paper, we present a comprehensive review of
nal processing (see e.g., the February special issue of 2002
stochastic filtering theory from Bayesian perspective. [It
IEEE Transactions on Signal Processing), machine
happens to be almost three decades after the 1974 publica-
learning, econometrics, automatic control, tracking, com-
tion of Prof. Thomas Kailath’s illuminating review paper
munications, biology, and many others (e.g., see [141] and
“A view of three decades of linear filtering theory” [244],
the references therein). One of the attractive merits of se-
we take this opportunity to dedicate this paper to him who
quential Monte Carlo approaches lies in the fact that they
has greatly contributed to the literature in stochastic filter-
allow on-line estimation by combining the powerful Monte
ing theory.] With the tool of Bayesian statistics, it turns
Carlo sampling methods with Bayesian inference, at an ex-
out that the celebrated Kalman filter is a special case of
pense of reasonable computational cost. In particular, the
Bayesian filtering under the LQG (linear, quadratic, Gaus-
sequential Monte Carlo approach has been used in parame-
sian) circumstance, a fact that was first observed by Ho
ter estimation and state estimation, for the latter of which
and Lee [212]; particle filters are also essentially rooted
it is sometimes called particle filter.5 The basic idea of
in Bayesian statistics, in the spirit of recursive Bayesian
particle filter is to use a number of independent random
estimation. To our interest, the attention will be given to
variables called particles,6 sampled directly from the state
the nonlinear, non-Gaussian and non-stationary situations
space, to represent the posterior probability, and update
where we mostly encounter in the real world. Generally for
the posterior by involving the new observations; the “par-
nonlinear filtering, no exact solution can be obtained, or the
ticle system” is properly located, weighted, and propagated
solution is infinite-dimensional,8 hence various numerical
recursively according to the Bayesian rule. In retrospect,
approximation methods come in to address the intractabil-
the earliest idea of Monte Carlo method used in statisti-
ity. In particular, we focus our attention on sequential
cal inference is found in [200], [201], and later in [5], [6],
Monte Carlo method which allows on-line estimation in a
[506], [433], [258], but the formal establishment of particle
Bayesian perspective. The historic root and remarks of
filter seems fair to be due to Gordon, Salmond and Smith
Monte Carlo filtering are traced. Other Bayesian filtering
[193], who introduced certain novel resampling technique
approaches other than Monte Carlo framework are also re-
to the formulation. Almost in the meantime, a number
viewed. Besides, we extend our discussion from Bayesian
of statisticians also independently rediscovered and devel-
filtering to Bayesian inference, in the latter of which the
oped the sampling-importance-resampling (SIR) idea [414],
well-known hidden Markov model (HMM) (a.k.a. HMM
[266], [303], which was originally proposed by Rubin [395],
filter), dynamic Bayesian networks (DBN) and Bayesian
[397] in a non-dynamic framework.7 The rediscovery and
kernel machines are also briefly discussed.
renaissance of particle filters in the mid-1990s (e.g. [259],
Nowadays Bayesian filtering has become such a broad
[222], [229], [304], [307], [143], [40]) after a long dominant
topic involving many scientific areas that a comprehen-
period, partially thanks to the ever increasing computing
sive survey and detailed treatment seems crucial to cater
power. Recently, a lot of work has been done to improve
the ever growing demands of understanding this important
the performance of particle filters [69], [189], [428], [345],
field for many novices, though it is noticed by the author
[456], [458], [357]. Also, many doctoral theses were devoted
that in the literature there exist a number of excellent tuto-
to Monte Carlo filtering and inference from different per-
rial papers on particle filters and Monte Carlo filters [143],
spectives [191], [142], [162], [118], [221], [228], [35], [97],
[144], [19], [438], [443], as well as relevant edited volumes
[365], [467], [86].
[141] and books [185], [173], [306], [82]. Unfortunately, as
It is noted that particle filter is not the only leaf in the
observed in our comprehensive bibliographies, a lot of pa-
Bayesian filtering tree, in the sense that Bayesian filtering
pers were written by statisticians or physicists with some
can be also tackled with other techniques, such as differen-
special terminologies, which might be unfamiliar to many
5 Many other terminologies also exist in the literature, e.g., SIS fil- engineers. Besides, the papers were written with different
ter, SIR filter, bootstrap filter, sequential imputation, or CONDEN- nomenclatures for different purposes (e.g. the convergence
SATION algorithm (see [224] for many others), though they are ad- and asymptotic results are rarely cared in engineering but
dressed differently in different areas. In this paper, we treat them as
different variants within the generic Monte Carlo filter family. Monte are important for the statisticians). The author, thus, felt
Carlo filters are not all sequential Monte Carlo estimation. obligated to write a tutorial paper on this emerging and
6 The particle filter is called normal if it produces i.i.d. samples;
promising area for the readership of engineers, and to in-
sometimes it is deliberately to introduce negative correlations among
the particles for the sake of variance reduction. troduce the reader many techniques developed in statistics
7 The earliest idea of multiple imputation due to Rubin was pub-
lished in 1978 [394]. 8 Or the sufficient statistics is infinite-dimensional.
MANUSCRIPT 4

and physics. For this purpose again, for a variety of particle P(x)
filter algorithms, the basic ideas instead of mathematical 1

derivations are emphasized. The further details and exper-


imental results are indicated in the references. Due to the
dual tutorial/review nature of current paper, only few sim-
ple examples and simulation are presented to illustrate the
essential ideas, no comparative results are available at this
stage (see other paper [88]); however, it doesn’t prevent us 0
x
presenting the new thoughts. Moreover, many graphical
and tabular illustrations are presented. Since it is also a Fig. 1. Empirical probability distribution (density) function con-
survey paper, extensive bibliographies are included in the structed from the discrete observations {x(i) }.
references. But there is no claim that the bibliographies
are complete, which is due to the our knowledge limitation
as well as the space allowance. (see Fig. 1 for illustration)
The rest of this paper is organized as follows: In Section
1 
Np
II, some basic mathematical preliminaries of stochastic fil- P̂ (x) = δ(x − x(i) )
tering theory are given; the stochastic filtering problem is Np i=1
also mathematically formulated. Section III presents the
essential Bayesian theory, particularly Bayesian statistics where δ(·) is a Radon-Nikodým density w.r.t. μ of the
and Bayesian inference. In Section IV, the Bayesian fil- point-mass distribution concentrated at the point x. When
tering theory is systematically investigated. Following the x ∈ X is discrete, δ(x − x(i) ) is 1 for x = x(i) and 0
simplest LQG case, the celebrated Kalman filter is briefly elsewhere. When x ∈ X is continuous, δ(x − x(i) ) is a
derived, followed by the discussion of optimal nonlinear Dirac-delta
 function,
 δ(x − x(i) ) = 0 for all x(i) = x, and
filtering. Section V discusses many popular numerical ap- X
dP̂ (x) = X p̂(x)dx = 1.
proximation techniques, with special emphasis on Monte
B. Notations
Carlo sampling methods, which result in various forms of
particle filters in Section VI. In Section VII, some other Throughout this paper, the bold font is referred to vec-
new Bayesian filtering approaches other than Monte Carlo tor or matrix; the subscript symbol t (t ∈ R+ ) is referred
sampling are also reviewed. Section VIII presents some se- to the index in a continuous-time domain; and n (n ∈ N)
lected applications and one illustrative example of particle is referred to the index in a discrete-time domain. p(x) is
filters. We give some discussions and critiques in Section referred to the pdf in a Lebesque measure or the pmf in
IX and conclude the paper in Section X. a counting measure. E[·] and Var[·] (Cov[·]) are expecta-
tion and variance (covariance) operators, respectively. Un-
less specified elsewhere, the expectations are taken w.r.t.
II. Mathematical Preliminaries and Problem the true pdf. Notations x0:n and y0:n 9 are referred to
Formulation the state and observation sets with elements collected from
time step 0 up to n. Gaussian (normal) distribution is de-
A. Preliminaries noted by N (μ, Σ). xn represents the true state in time
step n, whereas x̂n (or x̂n|n ) and x̂n|n−1 represent the fil-
Definition 1: Let S be a set and F be a family of subsets tered state and predicted state of xn , respectively. f and g
of S. F is a σ-algebra if (i) ∅ ∈ F; (ii) A ∈ F implies are used to represent vector-valued state function and mea-
Ac ∈ F; (iii) A1 , A2 , · · · ∈ F implies ∪∞
i=1 Ai ∈ F. surement function, respectively. f is denoted as a generic
A σ-algebra is closed under complement and union of (vector or scalar valued) nonlinear function. Additional
countably infinitely many sets. nomenclatures will be given wherever confusion is neces-
Definition 2: A probability space is defined by the el- sary to clarify.
ements {Ω, F, P } where F is a σ-algebra of Ω and P is For the reader’s convenience, a complete list of notations
a complete, σ-additive probability measure on all F. In used in this paper is summarized in the Appendix G.
other words, P is a set function whose arguments are ran-
dom events (element of F) such that axioms of probability C. Stochastic Filtering Problem
hold. Before we run into the mathematical formulation of
Definition 3: Let p(x) = dPdμ(x) denote Radon-Nikodým stochastic filtering problem, it is necessary to clarify some
density of probability distribution P (x) w.r.t. a measure μ. basic concepts:
When x ∈ X is discrete and μ is a counting measure, p(x)
is a probability mass function (pmf); when x is continuous Filtering is an operation that involves the extraction of
and μ is a Lebesgue measure, p(x) is a probability density information about a quantity of interest at time t by
function (pdf). using data measured up to and including t.
Intuitively, the true distribution P (x) can be replaced 9 Sometimes it is also denoted by y
1:n , which differs in the assuming
by the empirical distribution given the simulated samples order of state and measurement equations.
MANUSCRIPT 5

Prediction is an a priori form of estimation. Its aim is to u t-1 ut ut+1


input
derive information about what the quantity of interest
will be like at some time t + τ in the future (τ >
0) by using data measured up to and including time
ft-1 ( ) ft( )
t. Unless specified otherwise, prediction is referred to state x t-1 xt x t+1
one-step ahead prediction in this paper.
Smoothing is an a posteriori form of estimation in that g t-1 ( ) g t( ) g t+1 ( )
data measured after the time of interest are used for
measurement yt-1 yt yt+1
the estimation. Specifically, the smoothed estimate at
time t is obtained by using data measured over the
interval [0, t], where t < t. Fig. 2. A graphical model of generic state-space model.

Now, let us consider the following generic stochastic fil-


tering problem in a dynamic state-space form [238], [422]: of mean and state-error correlation matrix are calculated
and propagated. In equations (3a) and (3b), Fn+1,n , Gn
ẋt = f (t, xt , ut , dt ), (1a) are called transition matrix and measurement matrix, re-
yt = g(t, xt , ut , vt ), (1b) spectively.
Described as a generic state-space model, the stochastic
where equations (1a) and (1b) are called state equation and
filtering problem can be illustrated by a graphical model
measurement equation, respectively; xt represents the state
(Fig. 2). Given initial density p(x0 ), transition density
vector, yt is the measurement vector, ut represents the sys-
p(xn |xn−1 ), and likelihood p(yn |xn ), the objective of the
tem input vector (as driving force) in a controlled environ-
filtering is to estimate the optimal current state at time n
ment; f : RNx → RNx and g : RNx → RNy are two vector-
given the observations up to time n, which is in essence
valued functions, which are potentially time-varying; dt
amount to estimating the posterior density p(xn |y0:n ) or
and vt represent the process (dynamical) noise and mea-
p(x0:n |y0:n ). Although the posterior density provides a
surement noise respectively, with appropriate dimensions.
complete solution of the stochastic filtering problem, the
The above formulation is discussed in the continuous-time
problem still remains intractable since the density is a func-
domain, in practice however, we are more concerned about
tion rather than a finite-dimensional point estimate. We
the discrete-time filtering.10 In this context, the following
should also keep in mind that most of physical systems are
practical filtering problem is concerned:11
not finite dimensional, thus the infinite-dimensional system
xn+1 = f (xn , dn ), (2a) can only be modeled approximately by a finite-dimensional
filter, in other words, the filter can only be suboptimal
yn = g(xn , vn ), (2b)
in this sense. Nevertheless, in the context of nonlinear
where dn and vn can be viewed as white noise random filtering, it is still possible to formulate the exact finite-
sequences with unknown statistics in the discrete-time do- dimensional filtering solution, as we will discuss in Section
main. The state equation (2a) characterizes the state tran- IV.
sition probability p(xn+1 |xn ), whereas the measurement In Table I, a brief and incomplete development history of
equation (2b) describes the probability p(yn |xn ) which is stochastic filtering theory (from linear to nonlinear, Gaus-
further related to the measurement noise model. sian to non-Gaussian, stationary to non-stationary) is sum-
The equations (2a)(2b) reduce to the following special marized. Some detailed reviews are referred to [244], [423],
case where a linear Gaussian dynamic system is consid- [247], [205].
ered:12
D. Nonlinear Stochastic Filtering Is an Ill-posed Inverse
xn+1 = Fn+1,n xn + dn , (3a) Problem
yn = G n xn + v n , (3b)
D.1 Inverse Problem
for which the analytic filtering solution is given by the
Stochastic filtering is an inverse problem: Given collected
Kalman filter [250], [253], in which the sufficient statistics13
yn at discrete time steps (hence y0:n ), provided f and g are
10 The continuous-time dynamic system can be always converted known, one needs to find the optimal or suboptimal x̂n . In
into a discrete-time system by sampling the outputs and using “zero- another perspective, this problem can be interpreted as an
order holds” on the inputs. Hence the derivative will be replaced by
the difference, the operator will become a matrix. inverse mapping learning problem: Find the inputs sequen-
11 For discussion simplicity, no driving-force in the dynamic system tially with a (composite) mapping function which yields the
(which is often referred to the stochastic control problem) is consid- output data. In contrast to the forward learning (given in-
ered in this paper. However, the extension to the driven system is
straightforward. puts find outputs) which is a many-to-one mapping prob-
12 An excellent and illuminating review of linear filtering theory is lem, the inversion learning problem is one-to-many, in a
found in [244] (see also [385], [435], [61]); for a complete treatment of sense that the mapping from output to input space is gen-
linear estimation theory, see the classic textbook [247].
13 Sufficient statistics is referred to a collection of quantities which erally non-unique.
uniquely determine a probability density in its entirety. A problem is said to be well-posed if it satisfies three con-
MANUSCRIPT 6

TABLE I
A Development History of Stochastic Filtering Theory.

author(s) (year) method solution comment

Kolmogorov (1941) innovations exact linear, stationary

Wiener (1942) spectral factorization exact linear, stationary, infinite memory

Levinson (1947) lattice filter approximate linear, stationary, finite memory

Bode & Shannon (1950) innovations, whitening exact linear, stationary,

Zadeh & Ragazzini (1950) innovations, whitening exact linear, non-stationary

Kalman (1960) orthogonal projection exact LQG, non-stationary, discrete

Kalman & Bucy (1961) recursive Riccati equation exact LQG, non-stationary, continuous

Stratonovich (1960) conditional Markov process exact nonlinear, non-stationary

Kushner (1967) PDE exact nonlinear, non-stationary

Zakai (1969) PDE exact nonlinear, non-stationary

Handschin & Mayne (1969) Monte Carlo approximate nonlinear, non-Gaussian, non-stationary

Bucy & Senne (1971) point-mass, Bayes approximate nonlinear, non-Gaussian, non-stationary

Kailath (1971) innovations exact linear, non-Gaussian, non-stationary

Beneš (1981) Beneš exact solution of Zakai eqn. nonlinear, finite-dimensional

Daum (1986) Daum, virtual measurement exact solution of FPK eqn. nonlinear, finite-dimensional

Gordon, Salmond, & Smith (1993) bootstrap, sequential Monte Carlo approximate nonlinear, non-Gaussian, non-stationary

Julier & Uhlmann (1997) unscented transformation approximate nonlinear, (non)-Gaussian, derivative-free

ditions: existence, uniqueness and stability, otherwise it is where the second integral is Itô stochastic integral (named
said to be ill posed [87]. In this context, stochastic filtering after Japanese mathematician Kiyosi Ito [233]).15
problem is ill-posed in the following sense: (i) The ubiqui- Mathematically, the ill-posed nature of stochastic filter-
tous presence of the unknown noise corrupts the state and ing problem can be understood from the operator theory.
measurement equations, given limited noisy observations, Definition 4: [274], [87] Let A : Y → X be an operator
the solution is non-unique; (ii) Supposing the state equa- from a normed space Y to X. The equation AY = X is said
tion is a diffeomorphism (i.e. differentiable and regular),14 to be well posed if A is bijective and the inverse operator
the measurement function is possibly a many-to-one map- A−1 : X → Y is continuous. Otherwise the equation is
ping function (e.g. g(ξ) = ξ 2 or g(ξ) = sin(ξ), see also the called ill posed.
illustrative example in Section VIII-G), which also violates Definition 5: [418] Suppose H is a Hilbert space and let
the uniqueness condition; (iii) The filtering problem is per A = A(γ) be a stochastic operator mapping Ω × H in
se a conditional posterior distribution (density) estimation H. Let X = X(γ) be a generalized random variable (or
problem, which is known to be stochastically ill posed es- function) in H, then
pecially in high-dimensional space [463], let alone on-line
processing [412]. A(γ)Y = X(γ) (6)
D.2 Differential Operator and Integral Equation is a generalized stochastic operator equation for the ele-
In what follows, we present a rigorous analysis of stochas- ment Y ∈ H.
tic filtering problem in the continuous-time domain. To Since γ is an element of a measurable space (Ω, F) on
simplify the analysis, we first consider the simple irregular which a complete probability measure P is defined, stochas-
stochastic differential equation (SDE): tic operator equation is a family of equations. The family
of equations has a unique member when P is a Dirac mea-
dxt
= f (t, xt ) + dt , t ∈ T (4) sure. Suppose Y is a smooth functional with continuous
dt first n derivatives, then (6) can be written as
t
where xt is a second-order stochastic process, ω t = 0 ds ds
is a Wiener process (Brownian motion) and dt can be re- 
N
dk Y
A(γ)Y (γ) = ak (t, γ) = X(γ), (7)
garded as a white noise. f : T ×L2 (Ω, F, P ) → L2 (Ω, F, P ) dtk
k=0
is a mapping to a (Lebesque square-integrable) Hilbert
space L2 (Ω, F, P ) with finite second-order moments. The which can be represented in the form of stochastic integral
solution of (4) is given by the stochastic integral equations of Fredholm type or Voltera type [418], with an
 t  t 
15 The Itô stochastic integral is defined as tt σ(t)dω(t) =
xt = x0 + f (s, xs )ds + dω s , (5)   0
0 0 lim n 2
n→∞ j=1 σ(tj−1 )Δωj . The Itô calculus satisfies dω (t) = dt,
14 Diffeomorphism is referred to a smooth mapping with a smooth dω(t)dt = 0, dtN +1 = dω N +2 (t) = 0 (N > 1). See [387], [360] for a
inverse, one-to-one mapping. detailed background about Itô calculus and Itô SDE.
MANUSCRIPT 7

appropriately defined kernel K: off-line; whereas filtering is aimed to sequentially infer


 the signal or state process given some observations by
Y (t, γ) = X(t, γ) + K(t, τ, γ)Y (τ, γ)dτ, (8) assuming the knowledge of the state and measurement
models.
which takes a similar form as the continuous-time Wiener- • Missing data problem: Missing data problem is
Hopf equation (see e.g. [247]) when K is translation invari- well addressed in statistics, which is concerned about
ant. probabilistic inference or model fitting given limited
Definition 6: [418] Any mapping Y (γ) : Ω → H which data. Statistical approaches (e.g. EM algorithm, data
satisfies A(γ)Y (γ) = X(γ) for every γ ∈ Ω, is said to be a augmentation) are used to help this goal by assum-
wide-sense solution of (6). ing auxiliary missing variables (unobserved data) with
The wide-sense solution is a stochastic solution if it is tractable (on-line or off-line) inference.
measurable w.r.t. P and Pr{γ : A(γ)Y (γ) = X(γ)} = 1. • Density estimation: Density estimation shares some
The existence and uniqueness conditions of the solution to commons with filtering in that both of them target at a
the stochastic operator equation (6) is given by the prob- dependency estimation problem. Generally, filtering is
abilistic Fixed-Point Theorem [418]. The essential idea of nothing but to learn the conditional probability distri-
Fixed-Point Theorem is to prove that A(γ) is a stochas- bution. However, density estimation is more difficult
tic contractive operator, which unfortunately is not always in the sense that it doesn’t have any prior knowledge
true for the stochastic filtering problem. on the data (though sometimes people give some as-
Let’s turn our attention to the measurement equation in sumption, e.g. mixture distribution) and it usually
an integral form works directly on the state (i.e. observation process
 t is tantamount to the state process). Most of density
estimation techniques are off-line.
yt = g(s, xs )ds + vt , (9)
0 • Nonlinear dynamic reconstruction: Nonlinear dy-
namic reconstruction arise from physical phenomena
where g : RNx → RNy . For any φ(·) ∈ RNx , the optimal (e.g. sea clutter) in the real world. Given some lim-
(in mean-square sense) filter φ̂(xt ) is the one that seeks an ited observations (possibly not continuously or evenly
minimum mean-square error, as given by recorded), it is concerned about inferring the physi-
 cally meaningful state information. In this sense, it
2 π(xt |y0:t )φ(x)dxt
φ̂(xt ) ≡ arg min{ φ − φ̂ } =  , (10) is very similar to the filtering problem. However, it
π(xt |y0:t )dxt
is much more difficult than the filtering problem in
where π(·) is an unnormalized filtering density. A common that the nonlinear dynamics involving f is totally un-
way to study the unnormalized filtering density is to treat known (usually assuming a nonparametric model to
it as a solution of the Zakai equation, as will be detailed in estimate) and potentially complex (e.g. chaotic), and
Section II-E. the prior knowledge of state equation is very limited,
and thereby severely ill-posed [87]. Likewise, dynamic
D.3 Relations to Other Problems reconstruction allows off-line estimation.
It is conducive to better understanding the stochastic fil-
E. Stochastic Differential Equations and Filtering
tering problem by comparing it with many other ill-posed
problems that share some commons in different perspec- In the following, we will formulate the continuous-time
tives: stochastic filtering problem by SDE theory. Suppose {xt }
• System identification: System identification has is a Markov process with an infinitesimal generator, rewrit-
many commons with stochastic filtering. Both of them ing state-space equations (1a)(1b) in the following form of
belong to statistical inference problems. Sometimes, Itô SDE [418], [360]:
identification is also meant as filtering in stochastic
control realm, especially with a driving-force as in- dxt = f (t, xt )dt + σ(t, xt )dω t , (11a)
put. However, the measurement equation can ad- dyt = g(t, xt )dt + dvt , (11b)
mit the feedback of previous output, i.e. yn =
g(xn , yn−1 , vn ). Besides, identification is often more where f (t, xt ) is often called nonlinear drift and σ(t, xt )
concerned about the parameter estimation problem in- called volatility or diffusion coefficient. Again, the noise
stead of state estimation. We will revisit this issue in processes {ω t , vt , t ≥ 0} are two Wiener processes. xt ∈
the Section IX. RNx , yt ∈ RNy . First, let’s look at the state equation
• Regression: In some perspective, filtering can be (a.k.a. diffusion equation). For all t ≥ 0, we define a
viewed as a sequential linear/nonlinear regression backward diffusion operator Lt as16
problem if state equation reduces to a random walk.
But, regression differs from filtering in the following 
Nx
∂ 1 
Nx
∂2
Lt = fti + aij
t , (12)
sense: Regression is aimed to find a deterministic map- i=1
∂xi 2 i,j=1 ∂xi ∂xj
ping between the input and output given a finite num-
ber of observation pairs {xi , yi }i=1 , which is usually 16 L
t is a partial differential operator.
MANUSCRIPT 8

where aij t = σ (t, xt )σ (t, xt ). Operator L corresponds to


i j
Given conditional pdf (18), suppose we want to calculate
an infinitesimal generator of the diffusion process {xt , t ≥ φ̂(xt ) = E[φ(xt )|Yt ] for any nonlinear function φ ∈ RNx .
0}. The goal now is to deduce conditions under which By interchanging the order of integrations, we have
one can find a recursive and finite-dimensional (close form)  ∞
scheme to compute the conditional probability distribution φ̂(xt ) = φ(x)p(xt |Yt )dx
p(xt |Yt ), given the filtration Yt 17 produced by the observa- −∞
tion process (1b).  ∞
Let’s define an innovations process18 = φ(x)p(x0 )dx
−∞
 t  t  ∞
et = yt − E[g(s, xs )|y0:s ]ds, (13) + φ(x)L̃s p(xs |Ys )dxds
0 0 −∞
 t ∞
where E[g(s, xs )|Ys ] is described as + φ(x)p(xs |Ys )es Σ−1
v,s dxds
0 −∞
ĝ(xt ) = E[g(t, xt )|Yt ]  t ∞
 ∞
= E[φ(x0 )] + p(xs |Ys )Ls φ(x)dxds
= g(xt )p(xt |Ys )dx. (14) 0 −∞
−∞  t
 ∞
+ φ(x)g(s, x)p(xs |Ys )dx
For any test function φ ∈ R Nx
, the forward diffusion oper- 0 −∞
ator L̃ is defined as  ∞
−ĝ(xs ) φ(x)p(xs |Ys )dx Σ−1v,s ds.

Nx
∂φ 1 
Nx
∂2φ −∞
L̃t φ = − fti + aij
t , (15)
i=1
∂xi 2 i,j=1 ∂xi ∂xj The Kushner equation lends itself a recursive form of fil-
tering solution, but the conditional mean requests all of
which essentially is the Fokker-Planck operator. Given ini- higher-order conditional moments and thus leads to an
tial condition p(x0 ) at t = 0 as boundary condition, it turns infinite-dimensional system.
out that the pdf of diffusion process satisfies the Fokker-
On the other hand, under some mild conditions, the un-
Planck-Kolmogorov equation (FPK; a.k.a. Kolmogorov
normalized conditional density of xt given Ys , denoted as
forward equation, [387]) 19
π(xt |Yt ), is the unique solution of the following stochas-
tic partial differential equation (PDE), the so-called Zakai
∂p(xt ) equation (see [505], [238], [285]):
= L̃t p(xt ). (16)
∂t
dπ(xt |Yt ) = L̃π(xt |Yt )dt + g(t, xt )π(xt |Yt )dyt (19)
By involving the innovation process (13) and assuming
E[vt ] = Σv,t , we have the following Kushner’s equation with the same L̃ defined in (15). Zakai equation and Kush-
(e.g., [284]): ner equation have a one-to-one correspondence, but Zakai
equation is much simpler,20 hence we are usually turned
dp(xt |Yt ) = L̃t p(xt |Yt )dt + p(xt |Yt )et Σ−1
v,t dt, (t ≥ 0) (17)
to solve the Zakai equation instead of Kushner equation.
which reduces to the FPK equation (16) when there are no In the early history of nonlinear filtering, the common way
observations or filtration Yt . Integrating (17), we have is to discretize the Zakai equation to seek the numerical
 t solution. Numerous efforts were devoted along this line
[285], [286], e.g. separation of variables [114], adaptive lo-
p(xt |Yt ) = p(x0 ) + p(xs |Ys )ds
0 cal grid [65], particle (quadrature) method [66]. However,
 t these methods are neither recursive nor computationally
+ L̃s p(xs |Ys )es Σ−1
v,s ds. (18) efficient.
0
17 One can imagine filtration as sort of information coding the pre- III. Bayesian Statistics and Bayesian Estimation
vious history of the state and measurement.
18 Innovations process is defined as a white Gaussian noise process. A. Bayesian Statistics
See [245], [247] for detailed treatment.
19 The stochastic process is determined equivalently by the FPK Bayesian theory (e.g., [38]) is a branch of mathemat-
equation (16) or the SDE (11a). The FPK equation can be inter- ical probability theory that allows people to model the
preted as follows: The first term is the equation of motion for a cloud uncertainty about the world and the outcomes of interest
of particles whose distribution is p(xt ), each point of which obeys the
equation of motion dx = f (xt , t). The second term describes the dis-
by incorporating prior knowledge and observational evi-
dt
turbance due to Brownian motion. The solution of (16) can be solved dence.21 Bayesian analysis, interpreting the probability as
exactly by Fourier transform. By inverting the Fourier transform, we
20 This is true because (19) is linear w.r.t. π(x |Y ) whereas (17)
can obtain t t
involves certain nonlinearity. We don’t extend discussion here due to
1  (x − x − f (x )Δt)2 space constraint.
0 0
p(x, t + Δt|x0 , t) = √ exp − , 21 In the circle of statistics, there are slightly different treatments to
2πσ0 Δt 2σ0 Δt
probability. The frequentists condition on a hypothesis of choice and
which is a Guaussian distribution of a deterministic path. put the probability distribution on the data, either observed or not;
MANUSCRIPT 9

a conditional measure of uncertainty, is one of the popu- priors; (iii) updating the hyperparameters of the prior. Op-
lar methods to solve the inverse problems. Before running timization and integration are two fundamental numerical
into Bayesian inference and Bayesian estimation, we first problems arising in statistical inference. Bayesian inference
introduce some fundamental Bayesian statistics. can be illustrated by a directed graph, a Bayesian network
Definition 7: (Bayesian Sufficient Statistics) Let p(x|Y) (or belief network) is a probabilistic graphical model with
denote the probability density of x conditioned on mea- a set of vertices and edges (or arcs), the probability depen-
surements Y. A statistics, Ψ(x), is said to be “sufficient” dency is described by a directed arrow between two nodes
if the distribution of x conditionally on Ψ does not depend that represent two random variables. Graphical models
on Y. In other words, p(x|Y) = p(x|Y ) for any two sets Y also allow the possibility of constructing more complex hi-
and Y s.t. Ψ(Y) = Ψ(Y ). erarchical statistical models [239], [240].

The sufficient statistics Ψ(x) contains all of information B. Recursive Bayesian Estimation
brought by x about Y. The Rao-Blackwell Theorem says In the following, we present a detailed derivation of re-
that when an estimator is evaluated under a convex loss, cursive Bayesian estimation, which underlies the principle
the optimal procedure only depends on the sufficient statis- of sequential Bayesian filtering. Two assumptions are used
tics. Sufficiency Principle and Likelihood Principle are two to derive the recursive Bayesian filter: (i) The states follow
axiomatic principles in the Bayesian inference [388]. a first-order Markov process p(xn |x0:n−1 ) = p(xn |xn−1 );
There are three types of intractable problems inherently (ii) the observations are independent of the given states.
related to the Bayesian statistics: For notation simplicity, we denote Yn as a set of observa-
• Normalization: Given the prior p(x) and likelihood tions y0:n := {y0 , · · · , yn }; let p(xn |Yn ) denote the condi-
p(y|x), the posterior p(x|y) is obtained by the product tional pdf of xn . From Bayes rule we have
of prior and likelihood divided by a normalizing factor
as p(Yn |xn )p(xn )
p(xn |Yn ) =
p(y|x)p(x) p(Yn )
p(x|y) =  . (20) p(yn , Yn−1 |xn )p(xn )
X
p(y|x)p(x)dx =
p(yn , Yn−1 )
• Marginalization: Given the joint posterior (x, z), p(yn |Yn−1 , xn )p(Yn−1 |xn )p(xn )
the marginal posterior is =
p(yn |Yn−1 )p(Yn−1 )

p(yn |Yn−1 , xn )p(xn |Yn−1 )p(Yn−1 )p(xn )
p(x|y) = p(x, z|y)dz, (21) =
Z p(yn |Yn−1 )p(Yn−1 )p(xn )
as shown later, marginalization and factorization plays p(yn |xn )p(xn |Yn−1 )
= . (23)
an important role in Bayesian inference. p(yn |Yn−1 )
• Expectation: Given the conditional pdf, some aver-
aged statistics of interest can be calculated As shown in (23), the posterior density p(xn |Yn ) is de-
 scribed by three terms:
• Prior: The prior p(xn |Yn−1 ) defines the knowledge of
Ep(x|y) [f (x)] = f (x)p(x|y)dx. (22)
X the model

In Bayesian inference, all of uncertainties (including
p(xn |Yn−1 ) = p(xn |xn−1 )p(xn−1 |Yn−1 )dxn−1 , (24)
states, parameters which are either time-varying or fixed
but unknown, priors) are treated as random variables.22
The inference is performed within the Bayesian framework where p(xn |xn−1 ) is the transition density of the state.
given all of available information. And the objective of • Likelihood: the likelihood p(yn |xn ) essentially deter-

Bayesian inference is to use priors and causal knowledge, mines the measurement noise model in the equation
quantitatively and qualitatively, to infer the conditional (2b).
probability, given finite observations. There are usually • Evidence: The denominator involves an integral
three levels of probabilistic reasoning in Bayesian analysis 
(so-called hierarchical Bayesian analysis): (i) starting with p(yn |Yn−1 ) = p(yn |xn )p(xn |Yn−1 )dxn . (25)
model selection given the data and assumed priors; (ii) esti-
mating the parameters to fit the data given the model and Calculation or approximation of these three terms are the
only one hypothesis is regarded as true; they regard the probability essences of the Bayesian filtering and inference.
as frequency. The Bayesians only condition on the observed data and
consider the probability distributions on the hypotheses; they put IV. Bayesian Optimal Filtering
probability distributions on the several hypotheses given some priors;
probability is not viewed equivalent to the frequency. See [388], [38], Bayesian filtering is aimed to apply the Bayesian statis-
[320] for more information. tics and Bayes rule to probabilistic inference problems, and
22 This is the true spirit of Bayesian estimation which is different
from other estimation schemes (e.g. least-squares) where the un- specifically the stochastic filtering problem. To our knowl-
known parameters are usually regarded as deterministic. edge, Ho and Lee [212] were among the first authors to
MANUSCRIPT 10

discuss iterative Bayesian filtering, in which they discussed p(x|y)

in principle the sequential state estimation problem and in- mode mode
cluded the Kalman filter as a special case. In the past few mean
median
mean
decades, numerous authors have investigated the Bayesian mode

filtering in a dynamic state space framework [270], [271],


[421], [424], [372], [480]-[484].
x

A. Optimal Filtering
Fig. 3. Left: An illustration of three optimal criteria that seek
An optimal filter is said “optimal” only in some specific different solutions for a skewed unimodal distribution, in which the
sense [12]; in other other words, one should define a cri- mean, mode and median do not coincide. Right: MAP is misleading
for the multimodal distribution where multiple modes (maxima) exist.
terion which measures the optimality. For example, some
potential criteria for measuring the optimality can be:
where Q(x) is an arbitrary distribution of x. The
1. Minimum mean-squared error (MMSE): It can be de- first term is called Kullback-Leibler (KL) divergence
fined in terms of prediction or filtering error (or equiv- between distributions Q(x) and P (x|y), the second
alently the trace of state-error covariance) term is the entropy w.r.t. Q(x). The minimization
 of free energy can be implemented iteratively by the
E[ xn − x̂n 2 |y0:n ] = xn − x̂n 2 p(xn |y0:n )dxn , expectation-maximization (EM) algorithm [130]:

which is aimed Q(xn+1 ) ←− arg max{Q, xn },


 to find the conditional mean x̂n = Q
E[xn |y0:n ] = xn p(xn |y0:n )dxn .
xn+1 ←− arg max{Q(xn }, x).
2. Maximum a posteriori (MAP): It is aimed to find the x
mode of posterior probability p(xn |y0:n ),23 which is
equal to minimize a loss function
Remarks:
E = E[1 − Ixn :xn −x̂n ≤ζ (xn )], • The above criteria are valid not only for state estima-
tion but also for parameter estimation (by viewing x
where I(·) is an indicator function and ζ is a small as unknown parameters).
scalar. • Both MMSE and MAP methods require the estima-
3. Maximum likelihood (ML): which reduces to a special tion of the posterior distribution (density), but MAP
case of MAP where the prior is neglected.24 doesn’t require the calculation of the denominator (in-
4. Minimax: which is to find the median of posterior tegration) and thereby more computational inexpen-
p(xn |y0:n ). See Fig. 3 for an illustration of the differ- sive; whereas the former requires full knowledge of
ence between mode, mean and median. the prior, likelihood and evidence. Note that how-
5. Minimum conditional inaccuracy25 : Namely, ever, MAP estimate has a drawback especially in a
 high-dimensional space. High probability density does
1
Ep(x,y) [− log p̂(x|y)] = p(x, y) log dxdy. not imply high probability mass. A narrow spike with
p̂(x|y)
very small width (support) can have a very high den-
6. Minimum conditional KL divergence [276]: The con- sity, but the actual probability of estimated state (or
ditional KL divergence is given by parameter) belonging to it is small. Hence, the width
 of the mode is more important than its height in the
p(x, y) high-dimensional case.
KL = p(x, y) log dxdy.
p̂(x|y)p(x) • The last three criteria are all ML oriented. By min-
imizing the negative log-likelihood − log p̂(x|y) and
7. Minimum free energy26 : It is a lower bound of maxi- taking the expectation w.r.t. a fixed or variational
mum log-likelihood, which is aimed to minimize pdf. Criterion 5 chooses the expectation w.r.t. joint
pdf p(x, y); when Q(x) = p(x, y), it is equivalent to
F(Q; P ) ≡ EQ(x) [− log P (x|y)]
 Criterion 7; Criterion 6 is a modified version of the
Q(x)  upper bound of Criterion 5.
= EQ(x) log − EQ(x) [log Q(x)],
P (x|y) The criterion of optimality used for Bayesian filtering is
23 When the mode and the mean of distribution coincide, the MAP the Bayes risk of MMSE.27 Bayesian filtering is optimal
estimation is correct; however, for multimodal distributions, the MAP in a sense that it seeks the posterior distribution which
estimate can be arbitrarily bad. See Fig. 3. integrates and uses all of available information expressed
24 This can be viewed as a least-informative prior with uniform dis-
by probabilities (assuming they are quantitatively correct).
tribution.
25 It is a generalization of Kerridge’s inaccuracy for the case of i.i.d. However, as time proceeds, one needs infinite computing
data. power and unlimited memory to calculate the “optimal”
26 Free energy is a variational approximation of ML in order to
minimize its upper bound. This criterion is usually used in off-line 27 For a discussion of difference between Bayesian risk and frequen-
Bayesian estimation. tist risk, see [388].
MANUSCRIPT 11

• The process noise and measurement noise are mutually


independent: E[dn vmT
] = 0 for all n, m.
Time update: Measurement
One-step prediction update: Correction Let x̂MAP
n denote the MAP estimate of xn that maxi-
of the measurement to the state estimate mizes p(xn |Yn ), or equivalently log p(xn |Yn ). By using the
yn xn
Bayes rule, we may express p(xn |Yn ) by

p(xn , Yn )
p(xn |Yn ) =
p(Yn )
Fig. 4. Schematic illustration of Kalman filter’s update as a
predictor-corrector.
p(xn , yn , Yn−1 )
= , (26)
p(yn , Yn−1 )

solution, except in some special cases (e.g. linear Gaussian where the expression of joint pdf in the numerator is further
or conjugate family case). Hence in general, we can only expressed by
seek a suboptimal or locally optimal solution.
p(xn , yn , Yn−1 ) = p(yn |xn , Yn−1 )p(xn , Yn−1 )
B. Kalman Filtering = p(yn |xn , Yn−1 )p(xn |Yn−1 )p(Yn−1 )
Kalman filtering, in the spirit of Kalman filter [250], = p(yn |xn )p(xn |Yn−1 )p(Yn−1 ). (27)
[253] or Kalman-Bucy filter [249], consists of an iterative
prediction-correction process (see Fig. 4). In the predic- The third step is based on the fact that vn does not depend
tion step, the time update is taken where the one-step on Yn−1 . Substituting (27) into (26), we obtain
ahead prediction of observation is calculated; in the cor- p(yn |xn )p(xn |Yn−1 )p(Yn−1 )
rection step, the measurement update is taken where the p(xn |Yn ) =
p(yn , Yn−1 )
correction to the estimate of current state is calculated.
In a stationary situation, the matrices An , Bn , Cn , Dn in p(yn |xn )p(xn |Yn−1 )p(Yn−1 )
=
(3a) and (3b) are constant, Kalman filter is precisely the p(yn |Yn−1 )p(Yn−1 )
Wiener filter for stationary least-squares smoothing. In p(yn |xn )p(xn |Yn−1 )
= , (28)
other words, Kalman filter is a time-variant Wiener filter p(yn |Yn−1 )
[11], [12]. Under the LQG circumstance, Kalman filter was
originally derived with the orthogonal projection method. which shares the same form as (23). Under the Gaussian
In the late 1960s, Kailath [245] used the innovation ap- assumption of process noise and measurement noise, the
proach developed by Wold and Kolmogorov to reformulate mean and covariance of p(yn |xn ) are calculated by
the Kalman filter, with the tool of martingales theory.28
From innovations point of view, Kalman filter is a whiten- E[yn |xn ] = E[Gn xn + vn ] = Gn xn (29)
ing filter.29 Kalman filter is also optimal in the sense that
and
it is unbiased E[x̂n ] = E[xn ] and is a minimum variance
estimate. A detailed history of Kalman filter and its many Cov[yn |xn ] = Cov[vn |xn ] = Σv , (30)
variants can be found in [385], [244], [246], [247], [238], [12],
[423], [96], [195]. respectively. And the conditional pdf p(yn |xn ) can be fur-
Kalman filter has a very nice Bayesian interpretation ther written as
[212], [497], [248], [366]. In the following, we will show 1
that the celebrated Kalman filter can be derived within a p(yn |xn ) = A1 exp − (yn − Gn xn )T Σ−1 v (yn − Gn xn ) ,
2
Bayesian framework, or more specifically, it reduces to a
(31)
MAP solution. The derivation is somehow similar to the
ML solution given by [384]. For presentation simplicity, where A1 = (2π)−Ny /2 |Σv |−1/2 .
we assume the dynamic and measurement noises are both Consider the conditional pdf p(xn |Yn−1 ), its mean and
Gaussian distributed with zero mean and constant covari- covariance are calculated by
ance. The derivation of Kalman filter in the linear Gaussian
scenario is based on the following assumptions: E[xn |Yn−1 ] = E[Fn,n−1 x̂n + dn−1 |Yn−1 ]
• E[dn dm ] = Σd δmn ; E[vn vm ] = Σv δmn .
T T = Fn−1,n x̂n−1 = x̂n|n−1 , (32)
• The state and process noise are mutually independent:
and
E[xn dTm ] = 0 for n ≤ m; E[xn vmT
] = 0 for all n, m.
28 The martingale process was first introduced by Doob and dis- Cov[xn |Yn−1 ] = Cov[xn − x̂n|n−1 ]
cussed in detail in [139]. = Cov[en,n−1 ], (33)
29 Innovations concept can be used straightforward in nonlinear fil-
tering [7]. From innovations point of view, one of criteria to justify the
optimality of the solution to a nonlinear filtering problem is to check respectively, where x̂n|n−1 ≡ x̂(n|Yn−1 ) represents the
how white the pseudo-innovations are, the whiter the more optimal. state estimate at time n given the observations up to n − 1,
MANUSCRIPT 12

en,n−1 is the state-error vector. Denoting the covariance of noting that en,n−1 = xn − x̂n|n−1 and yn = Gn xn + vn ,
en,n−1 by Pn,n−1 , by Gaussian assumption, we may obtain we further have
1
en = en,n−1 − Kn (Gn en,n−1 + vn )
p(xn |Yn−1 ) = A2 exp − (xn − x̂n|n−1 )T
2 = (I − Kn Gn )en,n−1 − Kn vn , (42)
×P−1
n,n−1 (xn − x̂n|n−1 ) , (34)
and it further follows
−Nx /2 −1/2
where A2 = (2π) |Pn,n−1 | . By substituting equa- Pn = Cov[eMAP ]
n
tions (31) and (34) to (26), it further follows
= (I − Kn Gn )Pn,n−1 (I − Kn Gn )T + Kn Σv KTn .
1
p(xn |Yn ) ∝ A exp − (yn − Gn xn )T Σ−1 v (yn − Gn xn ) Rearranging the above equation, it reduces to
2
1
− (xn − x̂n|n−1 )T P−1 (x
n,n−1 n − x̂ n|n−1 ,
) Pn = Pn,n−1 − Fn,n+1 Kn Gn Pn,n−1 . (43)
2
(35) Thus far, the Kalman filter is completely derived from
MAP principle, the expression of xMAP n is exactly the same
where A = A1 A2 is a constant. Since the denominator is solution derived from the innovations framework (or oth-
a normalizing constant, (35) can be regarded as an unnor- ers).
malized density, the fact doesn’t affect the following deriva- The above procedure can be easily extended to ML case
tion. without much effort [384]. Suppose we want to maximize
Since the MAP estimate of the state is defined by the the marginal maximum likelihood of p(xn |Yn ), which is
condition equivalent to maximizing the log-likelihood

∂log p(xn |Yn ) 
 = 0, (36) log p(xn |Yn ) = log p(xn , Yn ) − log p(Yn ), (44)
∂xn xn =x̂MAP
and the optimal estimate near the solution should satisfy
substituting equation (35) into (36) yields
−1 ∂log p(xn |Yn ) 
−1
 = 0. (45)
x̂MAP
n = GTn Σ−1
v Gn + Pn,n−1
∂xn xn =x̂ML

× P−1 T −1
n,n−1 x̂n|n−1 + Gn Σv yn .
Substituting (35) to (45), we actually want to minimize the
the cost function of two combined Mahalanobis norms 31
By using the lemma of inverse matrix,30 it is simplified as E = yn − Gn xn 2Σ−1 + xn − x̂n 2P−1 . (46)
v n,n−1

x̂MAP = x̂n|n−1 + Kn (yn − Gn x̂n|n−1 ), (37)


n Taking the derivative of E with respect to xn and setting
where Kn is the Kalman gain as defined by as zero, we also obtain the same solution as (37).
Remarks:
Kn = Fn+1,n Pn,n−1 GTn (Gn Pn,n−1 GTn + Σv )−1 . (38) • The derivation of the Kalman-Bucy filter [249] was
rooted in the SDE theory [387], [360], it can be also
Observing derived within the Bayesian framework [497], [248].
• The optimal filtering solution described by Wiener-
en,n−1 = xn − x̂n|n−1 Hopf equation is achieved by spectral factorization
= Fn,n−1 xn−1 + dn − Fn,n−1 x̂MAP technique [487]. By admitting state-space formula-
n−1
tion, Kalman filter elegantly overcomes the station-
= Fn,n−1 eMAP
n−1 + dn−1 , (39) arity assumption and provides a fresh look at the
filtering problem. The signal process (i.e.“state”)
and by virtue of Pn−1 = Cov[eMAP
n−1 ], we have is regarded as a linear stochastic dynamical system
driven by white noise, the optimal filter thus has
Pn,n−1 = Cov[en,n−1 ]
a stochastic differential structure which makes the
= Fn,n−1 Pn−1 FTn,n−1 + Σd . (40) recursive estimation possible. Spectral factorization
is replaced by the solution of an ordinary differen-
Since tial equation (ODE) with known initial conditions.
Wiener filter doesn’t treat the difference between the
en = xn − x̂MAP
n white and colored noises, it also permits the infinite-
= xn − xn|n−1 − Kn (yn − Gn x̂n|n−1 ), (41) dimensional systems; whereas Kalman filter works for
30 For A = B−1 + CD−1 CT , it follows from the matrix inverse 31 The Mahalanobis norm is defined as a weighted norm: A2 =
B
lemma that A−1 = B − BC(D + CT BC)−1 CT B. AT BA.
MANUSCRIPT 13

finite-dimensional systems with white noise assump- approximation (e.g. Gaussian sum filter) or linearization
tion. techniques (i.e. EKF) are usually used. In the EKF, by
• Kalman filter is an unbiased minimum variance estima- defining
tor under LOG circumstance. When the Gaussian as-
sumption of noise is violated, Kalman filter is still opti- df (x)  dg(x) 
F̂n+1,n =  , Ĝn =  ,
mal in a mean square sense, but the estimate doesn’t dx x=x̂n dx x=x̂n|n−1
produce the condition mean (i.e. it is biased), and
the equations (2a)(2b) can be linearized into (3a)(3b), and
neither the minimum variance. Kalman filter is not
the conventional Kalman filtering technique is further em-
robust because of the underlying assumption of noise
ployed. The details of EKF can be found in many books,
density model.
e.g. [238], [12], [96], [80], [195], [205], [206]. Because EKF
• Kalman filter provides an exact solution for linear
always approximates the posterior p(xn |y0:n ) as a Gaus-
Gaussian prediction and filtering problem. Concerning
sian, it works well for some types of nonlinear problems,
the smoothing problem, the off-line estimation version
but it may provide a poor performance in some cases when
of Kalman filter is given by the Rauch-Tung-Striebel
the true posterior is non-Gaussian (e.g. heavily skewed or
(RTS) smoother [384], which consists of a forward fil-
multimodal). Gelb [174] provided an early overview of the
ter in a form of Kalman filter and a backward recursive
uses of EKF. It is noted that the estimate given by EKF is
smoother. The RTS smoother is computationally effi-
usually biased since in general E[f (x)] = f (E[x]).
cient than the optimal smoother [206].
In summary, a number of methods have been developed
• The conventional Kalman filter is a point-valued fil-
for nonlinear filtering problems:
ter, it can be also extended to set-valued filtering [39],
• Linearization methods: first-order Taylor series expan-
[339], [80].
• In the literature, there exists many variants of Kalman sion (i.e. EKF), and higher-order filter [20], [437].
• Approximation by finite-dimensional nonlinear filters:
filter, e.g., covariance filter, information filter, square-
root Kalman filters. See [205], [247] for more details Beneš filter [33], [34], Daum filter [111]-[113], and pro-
and [403] for a unifying review. jection filter [202], [55].
• Classic PDE methods, e.g. [282], [284], [285], [505],

C. Optimum Nonlinear Filtering [496], [497], [235].


• Spectral methods [312].
In practice, the use of Kalman filter is limited by the • Neural filter methods, e.g. [209].
ubiquitous nonlinearity and non-Gaussianity of physical • Numerical approximation methods, as to be discussed
world. Hence since the publication of Kalman filter, numer- in Section V.
ous efforts have been devoted to the generic filtering prob-
lem, mostly in the Kalman filtering framework. A number C.1 Finite-dimensional Filters
of pioneers, including Zadeh [503], Bucy [61], [60], Won-
The on-line solution of the FPK equation can be
ham [496], Zakai [505], Kushner [282]-[285], Stratonovich
avoided if the unnormalized filtered density admits a finite-
[430], [431], investigated the nonlinear filtering problem.
dimensional sufficient statistics. Beneš [33], [34] first ex-
See also the papers seeking optimal nonlinear filters [420],
plored the exact finite-dimensional filter32 in the nonlinear
[289], [209]. In general, the nonlinear filtering problem per
filtering scenario. Daum [111] extended the framework to a
sue consists in finding the conditional probability distribu-
more general case and included Kalman filter and Beneš fil-
tion (or density) of the state given the observations up to
ter as special cases [113]. Some new development of Daum
current time [420]. In particular, the solution of nonlinear
filter with virtual measurement was summarized in [113].
filtering problem using the theory of conditional Markov
The recently proposed projection filters [202], [53]-[57], also
processes [430], [431] is very attractive from Bayesian per-
belong to the finite-dimensional filter family.
spective and has a number of advantages over the other
methods. The recursive transformations of the posterior In [111], starting with SDE filtering theory, Daum intro-
measures are characteristics of this theory. Strictly speak- duced a gradient function
ing, the number of variables replacing the density function ∂
is infinite, but not all of them are of equal importance. r(t, x) = ln ψ(t, x)
∂x
Thus it is advisable to select the important ones and reject
the remainder. where ψ(t, x) is the solution of the FPK equation of (11a)
The solutions of nonlinear filtering problem have two cat- with a form
egories: global method and local method. In the global ap- ∂ψ(t, x) ∂ψ(t, x) ∂f 1 ∂ 2 ψ
proach, one attempts to solve a PDE instead of an ODE =− f − ψtr + tr A ,
∂t ∂x ∂x 2 ∂xxT
in linear case, e.g. Zakai equation, Kushner-Stratonovich
equation, which are mostly analytically intractable. Hence with an appropriate initial condition (see [111]), and A =
the numerical approximation techniques are needed to solve σ(t, xt )σ(t, xt )T . When the measurement equation (11b) is
the equation. In special scenarios (e.g. exponential family) 32 Roughly speaking, a finite-dimensional filter is the one that can
with some assumptions, the nonlinear filtering can admit be implemented by integrating a finite number of ODE, or the one
the tractable solutions. In the local approach, finite sum has the sufficient statistics with finite variables.
MANUSCRIPT 14

linear with Gaussian noise (recalling the discrete-time ver- the MAP estimate, which is partially justified by the fact
sion (3b)), Daum filter admits a finite-dimensional solution that under certain regularity conditions the posterior dis-
1  tribution asymptotically approaches Gaussian distribution
p(xt |Yt ) = ψ s (xt ) exp (xt − mt )T P −1
t (xt − mt ) ,
as the number of samples increases to infinity. Laplace ap-
2 proximation is useful in the MAP or ML framework, this
where s is real number in the interval 0 < s < 1 defined in method usually works for the unimodal distribution but
the initial condition, mt and P t are two sufficient statis- produces a poor approximation result for the multimodal
tics that can be computed recursively.33 The calculation of distribution, especially in a high-dimensional space. Some
ψ(xt ) can be done off line which does not rely on the mea- new development of Laplace approximation can be found
surement, whereas mt and P t will be computed on line in MacKay’s paper [319].
using numerical methods. See [111]-[113] for more details.
The problem of the existence of a finite-dimensional fil- B. Iterative Quadrature
ter is concerned with the necessary and sufficient condi- Iterative quadrature is an important numerical approxi-
tions. In [167], a necessary condition is that the obser- mation method, which was widely used in computer graph-
vations and the filtering densities belong to the exponen- ics and physics in the early days. One of the popular
tial class. In particular, we have the Generalized Fisher- quadrature methods is Gaussian quadrature [117], [377]. In
Darmois-Koopman-Pitman Theorem: particular, a finite integral is approximated by a weighted
Theorem 1: e.g. [388], [112] For smooth nowhere vanish- sum of samples of the integrand based on some quadrature
ing densities, a fixed finite-dimensional filter exists if and formula
only if the unnormalized conditional density is from an ex-  b
ponential family m
f (x)p(x)dx ≈ ck f (xk ), (49)
a
π(xn |y0:n ) = π(xn ) exp[λT (xn )Ψ(y0:n )], (47) k=1

where Ψ(·) is a sufficient statistics, λ(·) is a function in X where p(x) is treated as a weighting function, and xk is
(which turns out to be the solution of specific PDE’s). the quadrature point. For example, it can be the k-th zero
the m-th order orthogonal Hermite polynomial Hm (x),34
The nonlinear finite-dimensional filtering is usually per-
for which the weights are given by
formed with the conjugate approach, where the prior and
posterior are assumed to come from some parametric prob- √
2m−1 m! m
ability function family in order to admit the exact and ana- ck = 2 .
m (Hm−1 (xk ))2
lytically tractable solution. We will come back to this topic
in Section VII. On the other hand, for general nonlinear The approximation is good if f (x) is a polynomial of de-
filtering problem, no exact solution can be obtained, vari- gree not bigger than 2m − 1. The values xk are determined
ous numerical approximation are hence need. In the next by the weighting function p(x) in the interval [a, b].35 This
section, we briefly review some popular numerical approxi- method can produce a good approximation if the nonlinear
mation approaches in the literature and focus our attention function is smooth. Quadrature methods, alone or com-
on the sequential Monte Carlo technique. bined with other methods, were used in nonlinear filtering
(e.g. [475], [287]). The quadrature formulae will be used
V. Numerical Approximation Methods
after a centering about the current estimate of the condi-
A. Gaussian/Laplace Approximation tional mean and rescaling according to the current estimate
Gaussian approximation is the simplest method to ap- of the covariance.
proximate the numerical integration problem because of its
C. Mulitgrid Method and Point-Mass Approximation
analytic tractability. By assuming the posterior as Gaus-
sian, the nonlinear filtering can be taken with the EKF If the state is discrete and finite (or it can be discretized
method. and approximated as finite), grid-based methods can pro-
Laplace approximation method is to approximate the in- vide a good solution and optimal way to update the filtered

tegral of a function f (x)dx by fitting a Gaussian at the density p(z n |y0:n ) (To discriminate from the continuous-
maximum x̂ of f (x), and further compute the volume un- valued state x, we denote the discrete-valued state as z
der the Gaussian [319]: from now on). Suppose the discrete state z ∈ N consists
  −1/2 of a finite number of distinct discrete states {1, 2, · · · , Nz }.
  For the state space z n−1 , let wn−1|n−1
i
denote the condi-
f (x)dx ≈ (2π)Nx /2 f (x̂) − ∇∇ log f (x) (48)
tional probability of each z n−1 given measurement up to
i

The covariance of the fitted Gaussian is determined by the 34 Other orthogonal approximation techniques can be also consid-
Hessian matrix of log f (x) at x̂. It is also used to approxi- ered.
mate the posterior distribution with a Gaussian centered at 35 The Fundamental Theorem of Gaussian Quadrature states that:
the abscissas of the m-point Gaussian quadrature formula are pre-
33 They degenerate into the mean and error covariance when (11a) cisely the roots of the orthogonal polynomial for the same interval
is linear Gaussian, and the filter reduces to the Kalman-Bucy filter. and weighting function.
MANUSCRIPT 15

n − 1, i.e. p(z n−1 = z i |y0:n−1 ) = wn−1|n−1


i
. Then the (a) (b)

posterior pdf at n − 1 can be represented as


Nz
p(z n−1 |y0:n−1 ) = i
wn−1|n−1 δ(z n−1 − z in−1 ), (50)
i=1
(c) (d)

and the prediction and filtering equations are further de-


rived as

Nz
p(z n |y0:n−1 ) = i
wn|n−1 δ(z n − z in ), (51)
(e) (f)
i=1

Nz
p(z n |y0:n ) = i
wn|n δ(z n − z in ), (52)
i=1

where

Nz
j
Fig. 5. Illustration of non-Gaussian distribution approximation: (a)
true distribution; (b) Gaussian approximation; (c) Gaussian sum ap-
i
wn|n−1 = wn−1|n−1 p(z in |z jn ), (53)
proximation; (d) histogram approximation; (e) Riemannian sum (step
j=1 function) approximation; (f) Monte Carlo sampling approximation.

i
i
wn|n−1 p(yn |z in )
wn|n = Nz j
. (54)
j=1 wn|n−1 p(yn |z jn )
rather than for the entire density. Even so, the computa-
If the state space is continuous, the approximate-grid based tion of multigrid-based point-mass approximation method
method can be similarly derived (e.g. [19]). Namely, we is nontrivial and the complexity is high (see [271]).
can always discretize the state space into Nz discrete cell Another sophisticated approximation method, based on
states, then a grid-based method can be further used to the piecewise constant approximation of density, was pro-
approximate the posterior density. The grid must be suf- posed in [271], [258]. The method is similar but not iden-
ficiently dense to obtain a good approximation, especially tical to the point-mass approximation. It defines a sim-
when the dimensionality of Nx is high, however the increase ple grid based on tiling the state space with a number of
of Nz will increase the computational burden dramatically. identical parallelepipeds, over each of them the density ap-
If the state space is not finite, then the accuracy of grid- proximation is constant, and the integration is replaced by
based methods is not guaranteed. As we will discuss in a discrete linear convolution problem. The method also al-
Section VII, HMM filter is quite fitted to the grid-based lows error propagation analysis along the calculation [271].
methods. The disadvantage of grid-based method is that
it requires the state space cannot be partitioned unevenly D. Moment Approximation
to give a great resolution to the state with high density
[19]. Some adaptive grid-based methods were proposed to Moment approximation is targeted at approximating the
overcome this drawback [65]. Given the predefined grid, moments of density, including mean, covariance, and higher
different methods were used to approximate the functions order moments. The approximation of the first two mo-
and carry out the dynamic Bayesian estimation and fore- ments is widely used in filtering [367]. Generally, we can
casting [62], [258], [271], [424], [373], [372]. empirically use the sample moment to approximate the true
In studying the nonlinear filtering, Bucy [62] and Bucy moment, namely
and Senne [63] introduced the point-mass method, which

1  (i) k
is a global function approximation method. Such method N

uses a simple rectangular grid, spline basis, step function, mk = E[xk ] = xk p(x)dx = |x |
X N i=1
the quadrature methods are used to determine the grid
points [64], [475], [271], the number of grid points is pre-
scribed to provide an adequate approximation. The density where mk denotes the m-th order moment and x(i) are
is assumed to be represented by a set of point masses which the samples from true distribution. Among many, Gram-
carry the information about the data; mesh grid and direc- Charlier and Edgeworth expansion are two popular higher-
tions are given in terms of eignevalues and eigenvectors of order moment approximation approaches. Due to space
conditional error covariance; the floating grid is centered at constraint, we cannot run into the details here, and re-
the current mean estimate and rotated from the state co- fer the reader to [ ] for more information. The applica-
ordinate frame into the principal axes of error ellipsoid (co- tions of higher-order moment approximation to nonlinear
variance); the grid along the axes is chosen to extend over filters are found in [427]. However, the computation cost of
a sufficient distance to cover the true state. For the multi- these approaches are rather prohibitive, especially in high-
modal density, it is suggested to define a grid for each mode dimensional space.
MANUSCRIPT 16

E. Gaussian Sum Approximation Kalman filtering, with a name of nprKF. The nprKF filter-
ing technique was also used to train the neural networks
Different from the linearized EKF or second-order ap-
[166].
proximation filter that only concentrate on the vicinity
of the mean estimate, Gaussian sum approximation uses The idea of derivative-free state estimation is following:
a weighted sum of Gaussian densities to approximate the In order to estimate the state information (mean, covari-
posterior density(the so-called Gaussian mixture model): ance, and higher-order moments) after a nonlinear trans-
formation, it is favorable to approximate the probability

m distribution directly instead of approximating the nonlin-
p(x) = cj N (x̂j , Σj ) (55) ear function (by linear localization) and apply the Kalman
j=1 filter in the transformed domain. The derivative-free UKF
m can overcome the drawback by using a deterministic sam-
where the weighting coefficients cj > 0 and j=1 cj = 1. pling approach to calculate the mean and covariance. In
The approximation is motivated by the observation that particular, the (2Nx + 1) sigma-points are generated and
any non-Gaussian density can be approximated to some propagated through the true nonlinearity, and the weighted
accurate degree by a sufficiently large number of Gaussian mean and covariance are further calculated [242], [474].
mixture densities, which admits tractable solution by cal- Compared with the EKF’s first-order accuracy, the esti-
culating individual first and second order moments. The mation accuracy of UKF is improved to the third-order for
Gaussian sum filter [421], [8], essentially uses this idea and Gaussian data and at least second-order for non-Gaussian
runs a bank of EKFs in parallel to obtain the suboptimal data [242], [474].
estimate. The following theorem reads the underlying prin- However, UT and UKF often encounter the ill-
ciple: conditioned 37 problem of covariance matrix in practice
Theorem 2: [12] Suppose in equations (2a)(2b) the (though it is theoretically positive semi-definite), although
noise vectors dn and vn are white Gaussian noises with the regularization trick and square-root UKF [460] can al-
zero mean and covariances Σd and Σv , respectively. leviate this. For enhancing the numerical robustness, we
If p(xn |y0:n ) = N (xn ; μn|n−1 , Σn|n−1 ), then for fixed propose another derivative-free KF based on singular-value
g(·), μn|n−1 and Σv , the filtered density p(xn |y0:n ) = decomposition (SVD).
cn p(xn |y0:n−1 )p(yn |xn ) (where cn is a normalizing The SVD-based KF is close in spirit to UKF, it only
constant) converges uniformly to N (xn ; μn|n , Σn|n ) as differs in that the UT is replaced by SVD and the sigma-
Σn|n−1 → 0. If p(xn |y0:n ) = N (xn ; μn|n , Σn|n ), point covariance becomes an eigen-covariance matrix, in
then for fixed f (·), μn|n and Σd , the predicted density which the pairwise (±) eigenvectors are stored into the col-

p(xn+1 |y0:n ) = p(xn+1 |xn )p(xn |y0:n )dxn converges uni- umn vector of the new covariance matrix. The number
formly to N (xn+1 ; μn+1|n , Σn+1|n ) as Σn|n → 0. of eigen-points to store is the same as the sigma points in
Some new development of Gaussian sum filter (as well UT. The idea behind SVD is simple: We assume the covari-
as Gaussian-quadrature filter) is referred to [235], [234], ance matrix be characterized by a set of eigenvectors which
where the recursive Bayesian estimation is performed, and correspond to a set of eigenvalues.38 For the symmetric co-
no Jacobian matrix evaluation is needed (similar to the variance matrix C, ED and SVD are equivalent, and the
unscented transformation technique discussed below). eigenvalues are identical to the singular values. We prefer
to calculate SVD instead of eigen-decomposition because
F. Deterministic Sampling Approximation the former is more numerically robust. The geometrical
interpretation of SVD compared with UT is illustrated in
The deterministic sampling approximation we discussed
Fig. 6. By SVD of square-root of the covariance matrix C
below is a kind of method called unscented transformation
(UT). 36 It can be viewed as a special numerical method

1/2 S 0
to approximate the sufficient statistics of mean and co- C =U VT (56)
0 0
variance. The intuition of UT is somewhat similar to the
point-mass approximation discussed above: it uses the so-
where C1/2 = chol(C) and chol represents Cholesky fac-
called sigma-points with additional skewed parameters to
torization; S is a diagonal matrix S = diag{s1 , · · · , sk },
cover and propagate the information of the data. Based on
when C1/2 is symmetric, U = V. Thus the eigenvalues
UT, the so-called unscented Kalman filter (UKF) was de-
are λk = s2k , and the eigenvectors of C is represented by
rived. The most mentionable advantage of UKF over EKF
is its derivative-nonlinear estimation (no need of calcula- the column vectors of matrix UUT . A Monte Carlo sam-
tion of Jacobians and Hessians), though its computational pling of a two-dimensional Gaussian distribution passing
complexity is little higher than the EKF’s. There are also a Gaussian nonlinearity is shown in Fig. 6. As shown,
other derivative-free estimation techniques available. In the sigma points and eigen-points can both approximately
[355], a polynomial approximation using interpolation for- characterize the structure of the transformed covariance
mula was developed and subsequently applied to nonlinear 37 Namely, the conditional number of the covariance matrix is very
large.
36 The name is somehow ad hoc and the word “unscented” does not 38 By assuming that, we actually assume that the sufficient statistics
imply its original meaning (private communication with S. Julier). of underlying data is second-order, which is quite not true.
MANUSCRIPT 17

20 0.185

15 0.18

10 0.175
mean
weighted mean
5 0.17
x mean y
0 0.165

covariance + covariance
−5 0.16
+ weighted
Px SVD _ f Py
−10 0.155 covariance

−15 0.15

−20 0.145

−25 0.14
−8 −6 −4 −2 0 2 4 6 8 10 12 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16

Fig. 6. SVD against Choleksy factorization in UT. Left: 1,000 data points are generated from a two-dimensional Gaussian distribution.
The small red circles linked by two thin lines are sigma points using UT (parameters α = 1, β = 2, κ = 0; see the paper [ ] for notations); the
two black arrows are the eigenvector multiplied by ρ = 1.4; the ellipses from inside to outside correspond to the scaling factors σ = 1, 1.4, 2, 3;
Middle: After the samples pass a Gaussian nonlinearity, the sigma points and eigen-points are calculated again for the transformed covariance;
Right: SVD-based derivative-free estimation block diagram.

matrix. For state space equations (2a)(2b) with additive Strong Law of Large Numbers (under some mild regular-
noise, the SVD-based derivative-free KF algorithm for the ity conditions), fˆNp (x) converges to E[f (x)] almost surely
state estimation is summarized in Table X in Appendix E. (a.s.) and its convergence rate is assessed by the Central
Limit Theorem
G. Monte Carlo Sampling Approximation 
Monte Carlo methods use statistical sampling and esti- Np (fˆNp − E[f ]) ∼ N (0, σ 2 ),
mation techniques to evaluate the solutions to mathemati- where σ 2 is the variance of f (x). Namely, the error rate is
cal problems. Monte Carlo methods have three categories: −1/2
of order O(Np ), which is slower than the order O(Np−1 )
(i) Monte Carlo sampling, which is devoted to developing
for deterministic quadrature in one-dimensional case. One
efficient (variance-reduction oriented) sampling technique
crucial property of Monte Carlo approximation is the es-
for estimation; (ii) Monte Carlo calculation, which is aimed
timation accuracy is independent of the dimensionality of
to design various random or pseudo-random number gen-
the state space, as opposed to most deterministic numerical
erators; and (iii) Monte Carlo optimization, which is de-
methods.39 The variance of estimate is inversely propor-
voted to applying the Monte Carlo idea to optimize some
tional to the number of samples.
(nonconvex or non-differentiable) functions, to name a few,
There are two fundamental problems arising in Monte
simulated annealing [257], dynamic weighting [494], [309],
Carlo sampling methods: (i) How to draw random sam-
[298], and genetic algorithm. In recent decades, modern
ples {x(i) } from a probability distribution P (x)?; and (ii)
Monte Carlo techniques have attracted more and more at-
How to estimate the expectation of a function
 w.r.t. the
tention and have been developed in different areas, as we
distribution or density, i.e. E[f (x)] = f (x)dP (x)? The
will briefly overview in this subsection. Only Monte Carlo
first problem is a design problem, and the second one is
sampling methods are discussed. A detailed background of
an inference problem invoking integration. Besides, several
Monte Carlo methods can refer to the books [168], [389],
central issues are concerned in the Monte Carlo sampling:
[306], [386] and survey papers [197], [318].
• Consistency: An estimator is consistent if the esti-
The underlying mathematical concept of Monte Carlo
mator converges to the true value almost surely as the
approximation is simple. Consider a statistical problem
number of observations approaches infinity.
estimating a Lebesque-Stieltjes integral:
• Unbiasedness: An estimator is unbiased if its ex-

pected value is equal to the true value.
f (x)dP (x),
X • Efficiency: An estimator is efficient if it produces
the smallest error covariance matrix among all unbi-
where f (x) is an integrable function in a measurable space.
ased estimators, it is also regarded optimally using the
As a brute force technique, Monte Carlo sampling uses a
information in the measurements. A well-known effi-
number of (independent) random variables in a probabil-
ciency criterion is the Cramér-Rao bound.
ity space (Ω, F, P ) to approximate the true integral. Pro-
• Robustness: An estimator is robust if it is insensitive
vided one draws a sequence of Np i.i.d. random samples
to the gross measurement errors and the uncertainties
{x(1) , · · · , x(Np ) } from probability distribution P (x), then
of the model.
the Monte Carlo estimate of f (x) is given by
• Minimal variance: Variance reduction is the central
issue of various Monte Carlo approximation methods,
1 
Np
fˆNp = f (x(i) ), (57) most improvement techniques are variance-reduction
Np i=1 oriented.
39 Note that, however, it doesn’t mean Monte Carlo methods can
for which E[fˆNp ] = E[f ] and Var[fˆNp ] = N1p Var[f ] = N
σ2
p beat the curse of dimensionality, an issue that will be discussed in
(see Appendix A for a general proof). By the Kolmogorov Section VI-P.
MANUSCRIPT 18

In the rest of this subsection, we will provide a brief in- estimate (59) is given by [59]
troduction of many popular Monte Carlo method relevant
1
to our paper. No attempt is made here to present a com- Varq [fˆ] = Varq [f (x)W (x)]
plete and rigorous theory. For more theoretical details or Np
applications, reader is referred to the books [199], [389], 1
= Varq [f (x)p(x)/q(x)]
[168], [306]. Np
  2
1 f (x)p(x)
= − Ep [f (x)] q(x)dx
Np q(x)
 
G.1 Importance Sampling 1 (f (x)p(x))2 
= − 2p(x)f (x)Ep [f (x)] dx
Np q(x)
Importance sampling (IS) was first introduced by Mar- (Ep [f (x)]) 2
shall [324] and received a well-founded treatment and dis- +
Np
cussion in the seminal book by Hammersley and Hanscomb  
[199]. The objective of importance sampling is aimed to 1 (f (x)p(x))2  (Ep [f (x)])2
= dx − . (61)
sample the distribution in the region of “importance” in Np q(x) Np
order to achieve computational efficiency. This is impor-
The variance can be reduced when an appropriate q(x) is
tant especially for the high-dimensional space where the
chosen to (i) match the shape of p(x) so as to approximate
data are usually sparse, and the region of interest where
the true variance; or (ii) match the shape of |f (x)|p(x) so as
the target lies in is relatively small in the whole data space.
to further reduce the true variance.40 Importance sampling
The idea of importance sampling is to choose a proposal
estimate given by (60) is biased (thus a.k.a. biased sam-
distribution q(x) in place of the true probability distribu-
pling)41 but consistent, namely the bias vanishes rapidly
tion p(x), which is hard-to-sample. The support of q(x) is
at a rate O(Np ). Provided q is appropriately chosen, as
assumed to cover that of p(x). Rewriting the integration
Np → ∞, from the Weak Law of Large Numbers, we know
problem as
Eq [W (x)f (x)]
fˆ → .
  Eq [W (x)]
p(x)
f (x)p(x)dx = f (x) q(x)dx, (58) It was also shown [180] that if E[W̃ (x)] < ∞ and
q(x)
E[W̃ (x)f 2 (x)] < ∞, (60) converges to Ep [f ] a.s. and the
Lindeberg-Lévy Central Limit Theorem still holds:

Monte Carlo importance sampling is to use a number of Np (fˆ − Ep [f ]) ∼ N (0, Σf ),
(say Np ) independent samples drawn from q(x) to obtain
a weighted sum to approximate (58): where
 
Σf = Varq W̃ (x)(f (x) − Ep [f (x)]) . (62)

1 
Np
A measure of efficiency of importance sampler is given by
fˆ = W (x(i) )f (x(i) ), (59) Σ
the normalized version of (62): Varpf[f ] , which is related to
Np i=1
the effective sample size, as we will discuss later.
Remarks:
• Importance sampling is useful in two ways [86]: (i) it
where W (x(i) ) = p(x(i) )/q(x(i) ) are called the importance provides an elegant way to reduce the variance of the
weights (or importance ratios). If the normalizing factor estimator (possibly even less than the true variance);
of p(x) is not known, the importance weights can be only and (ii) it can be used when encountering the diffi-
evaluated up to a normalizing constant, hence W (x(i) ) ∝ culty to sample from the true probability distribution
Np
p(x(i) )/q(x(i) ). To ensure that i=1 W (x(i) ) = 1, we nor- directly.
malize the importance weights to obtain • As shown in many empirical experiments [318], impor-
tance sampler (proposal distribution) should have a
heavy tail so as to be insensitive to the outliers. The
1
Np super-Gaussian distributions usually have long tails,
Np i=1 W (x(i) )f (x(i) ) 
Np
fˆ = Np ≡ W̃ (x(i) )f (x(i) ), (60) with kurtosis bigger than 3. Alternatively, we can
1 (j)
Np j=1 W (x ) i=1 roughly verify the “robust” behavior from the acti-
vation function defined as ϕ(x) = −d log dx
q(x)
: if ϕ(x)
is bounded, q(x) has a long tail, otherwise not.
W (x(i) ) 40 Inan ideal case, q(x) ∝ |f (x)|p(x), the variance becomes zero.
where W̃ (x(i) ) =  Np are called the normalized 41 It
j=1 W (x
(j) )
is unbiased only when all of importance weights W̃ (i) = 1
importance weights. The variance of importance sampler (namely p(·) = q(·), and it reduces to the estimate fˆNp in (57)).
MANUSCRIPT 19

• Although theoretically the bias of importance sampler • It usually takes a long time to get the samples when
vanishes at a rate O(Np ), the accuracy of estimate is the ratio p(x)/Cq(x) is close to zero [441].
not guaranteed even with a large Np . If q(·) is not
close to p(·), it can be imagined that the weights are G.3 Sequential Importance Sampling
very uneven, thus many samples are almost useless A good proposal distribution is essential to the efficiency
because of their negligible contributions. In a high- of importance sampling, hence how to choose an appropri-
dimensional space, the importance sampling estimate ate proposal distribution q(x) is the key to apply a suc-
is likely dominated by a few samples with large impor- cessful importance sampling [200], [506], [266]. However,
tance weights. it is usually difficult to find a good proposal distribution
• Importance sampler can be mixed with Gibbs sampling especially in a high-dimensional space. A natural way to
or Metropolis-Hastings algorithm to produce more ef- alleviate this problem is to construct the proposal distri-
ficient techniques [40], [315]. bution sequentially, which is the basic idea of sequential
• Some advanced (off-line) importance sampling schemes, importance sampling (SIS) [198], [393].
such as adaptive importance sampling [358], annealed
In particular, if the proposal distribution is chosen in a
importance sampling [348], [350], smoothed impor-
factorized form [144]
tance sampling [49], [322], dynamic importance sam-
pling [494], (regularized) greedy importance sampling, 
n
Bayesian importance sampling [382] etc. are also avail- q(x0:n |y0:n ) = q(x0 ) q(xt |x0:t−1 , y0:t ), (64)
able. t=1

G.2 Rejection Sampling then the importance sampling can be performed recur-
sively. We will give the derivation detail when discussing
Rejection sampling (e.g. [199]) is useful when we know the SIS particle filter in Section VI. At this moment, we
(pointwise) the upper bound of underlying distribution or consider a simplified (unconditional pdf) case for the ease of
density. The basic assumption of rejection sampling is sim- understanding. According to the “telescope” law of prob-
ilar to that of importance sampling. Assume there exists a ability, we have the following:
known constant C < ∞ such that p(x) < Cq(x) for every
x ∈ X, the sampling procedure reads as follows: p(x0:n ) = p(x0 )p(x1 |x0 ) · · · p(xn |x0 , · · · , xn−1 ),
• Generate a uniform random variable u ∼ U(0, 1);
q(x0:n ) = q0 (x0 )q1 (x1 |x0 ) · · · qn (xn |x0 , · · · , xn−1 ).
• Draw a sample x ∼ q(x);
p(x)
• If u < Cq(x) , return x, otherwise go to step 1. Hence the importance weights W (x0:n ) can be written as
The samples from rejection sampling are exact, and the
acceptance probability for a random variable is inversely p(x0 )p(x1 |x0 ) · · · p(xn |x0 , · · · , xn−1 )
W (x0:n ) = ,
proportional to the constant C. In practice, the choice q0 (x0 )q1 (x1 |x0 ) · · · qn (xn |x0 , · · · , xn−1 )
of constant C is critical (which relies on the knowledge of
p(x)): if C is too small, the samples are not reliable be- which be recursively calculated as
cause of low rejection rate; if C is too large, the algorithm
p(xn |x0:n−1 )
will be inefficient since the acceptance rate will be low. In Wn (x0:n ) = Wn−1 (x0:n−1 ) .
Bayesian perspective, rejection sampling naturally incor- qn (xn |x0:n−1 )
porates the normalizing denominator into the constant C.
Remarks:
If the prior p(x) is used as proposal distribution q(x), and
the likelihood p(y|x) ≤ C where C is assumed to be known, • The advantage of SIS is that it doesn’t rely on the un-
the bound on the posterior is given by derlying Markov chain. Instead, many i.i.d. replicates
are run to create an importance sampler, which con-
p(y|x)p(x) Cq(x) sequently improves the efficiency. The disadvantage
p(x|y) = ≤ ≡ C  q(x),
p(y) p(y) of SIS is that the importance weights may have large
variances, resulting in inaccurate estimate [315].
and the acceptance rate for drawing a sample x ∈ X is • SIS method can be also used in a non-Bayesian com-

p(x|y) p(y|x) putation, such as evaluation of the likelihood function


= , (63) in the missing-data problem [266].
C  q(x) C
• It was shown in [266] that the unconditional variance
which can be computed even the normalizing constant p(y) of the importance weights increases over time, which
is not known. is the so-called weight degeneracy problem: Namely,
Remarks: after a few iterations of algorithm, only few or one
• The draws obtained from rejection sampling are exact of W (x(i) ) will be nonzero. This is disadvantageous
[414]. since a lot of computing effort is wasted to update
• The prerequisite of rejection sampling is the prior those trivial weight coefficients. In order to cope with
knowledge of constant C, which is sometimes unavail- this situation, resampling step is suggested to be used
able. after weight normalization.
MANUSCRIPT 20

;;;;
;; ;;
;;
Cq (x)
q(x)

p(x) p(x)

Fig. 7. Illustration of importance sampling (left) and acceptance-rejection sampling (right). p(x) is the true pdf (solid line), q(x) is the
proposal distribution (dashed line). For rejection sampling, some random samples x(i) are generated below Cq(x), which are rejected if they
lie in the region between p(x) and Cq(x); if they also lie below p(x), they are accepted.

G.4 Sampling-Importance Resampling for the future use, though resampling doesn’t neces-
sarily improve the current state estimate because it
The sampling-importance resampling (SIR) is motivated
also introduces extra Monte Carlo variation.
from the Bootstrap and jackknife techniques. Bootstrap
• Resampling schedule can be deterministic or dynamic
technique is referred to a collection of computationally in-
[304], [308]. In deterministic framework, resampling is
tensive methods that are based on resampling from the ob-
taken at every k time step (usually k = 1). In a dy-
served data [157], [408], [321]. The seminal idea originated
namic schedule, a sequence of thresholds (that can be
from [155] and was detailed in [156], [157]. The intuition
constant or time-varying) are set up and the variance
of bootstrapping is to evaluate the properties of an estima-
of the importance weights are monitored; resampling
tor through the empirical cumulative distribution function
is taken only when the variance is over the threshold.
(cdf) of the samples instead of the true cdf.
The validity of inserting a resampling step in SIS algo-
In the statistics literature, Rubin [395], [396] first ap-
rithm has been justified by [395], [303], since resampling
plied SIR technique to Monte Carlo inference, in which
step also brings extra variation, some special schemes are
the resampling is inserted between two importance sam-
needed. There are many types of resampling methods avail-
pling steps. The resampling step42 is aimed to eliminate
able in the literature:
the samples with small importance weights and duplicate
the samples with big weights. The generic principle of SIR 1. Multinomial resampling [395], [414], [193]: the
proceeds as follows: procedure reads as follows (see also [19])

• Draw Np random samples {x(i) }i=1 pN


from proposal dis- • Produce a uniform distribution u ∼ U(0, 1), construct
a cdf for importance weights (see Fig. 1), calculate
tribution q(x); i
• Calculate importance weights W (i) ∝ p(x)/q(x) for si = j=1 W̃ (j) ;
• Find si s.t. si−1 ≤ u < si , the particle with index i
each sample x(i) ;
• Normalize the importance weights to obtain W̃ (i) ; is chosen;
(i) (i)
• Given {x , W̃ }, for j = 1, · · · , Np , generate new
• Resample with replacement N times from the discrete
Np
set {x(i) }i=1 , where the probability of resampling from samples x by duplicating x(i) according to the asso-
(j)

ciated W̃ (i) ;
each x(i) is proportional to W̃ (i) . (i)
• Reset W = 1/Np .
Remarks (on features): Multinomial resampling uniformly generates Np new
• Resampling usually (but not necessarily) occurs be- independent particles from the old particle set. Each
tween two importance sampling steps. In resampling particle is replicated Ni times (Ni can be zero),
step, the particles and associated importance weights namely each x(i) produces Ni children. Note that
{x(i) , W̃ (i) } are replaced by the new samples with Np (i)
here i=1 Ni = Np , E[Ni ] = Np W̃ , Var[Ni ] =
equal importance weights (i.e. W̃ (i) = 1/Np ). Re- (i) (i)
Np W̃ (1 − W̃ ).
sampling can be taken at every step or only taken if
2. Residual resampling [211], [304]: Liu and Chen
regarded necessary.
[304] suggested a partially deterministic resampling
• As justified in [303], resampling step plays an criti-
method. The two-step selection procedure is as fol-
cal role in importance sampling since (i) if importance
lows [304]:
weights are uneven distributed, propagating the “triv-
ial” weights through the dynamic system is a waste • For each i = 1, · · · , Np , retain ki = Np W̃ (i)  copies
of computing power; (ii) when the importance weights (i)
of xn ;
are skewed, resampling can provide chances for select- • Let Nr = Np − k1 − · · · − kNp , obtain Nr i.i.d.
ing “important” samples and rejuvenate the sampler (i)
draws from {xn } with probabilities proportional to
(i)
42 It is also called selection step. Np W̃ − ki (i = 1, · · · , Np );
MANUSCRIPT 21

• Reset W (i) = 1/Np . from importance sampling. Suppose the state space is de-
composed into two equal, disjoint strata (subvolumes), de-
Residual resampling procedure is computationally noted as a and b, for stratified sampling, the total number
cheaper than the conventional SIR and achieves a of Np samples are drawn from two strata separately and we
lower sampler variance, and it doesn’t introduce ad- have the stratified mean fˆ = 12 (fˆa + fˆb ), and the stratified
ditional bias. Every particle in residual resampling is variance
replicated.
3. Systematic resampling (or Minimum variance Vara [fˆ] + Varb [fˆ]
Var[fˆ ] =
sampling) [259], [69], [70]: the procedure proceeds as 4
follows: Vara [f ] + Varb [f ]
= , (65)
2Np
• u ∼ U(0, 1)/Np ; j = 1; = 0;i = 0;
• do while u < 1 where the second equality uses the facts that Vara [fˆ] =
• if  > u then 2 ˆ 2
Np Vara [f ] and Varb [f ] = Np Varb [f ]. In addition, it can be
• u = u + 1/Np ; output x(i)
• else proved that43
• pick k in {j, · · · , Np } Np Var[fˆ] = Var[f ]
• i = x(k) ,  =  + W (i)
Vara [f ] + Varb [f ] (Ea [f ] − Eb [f ])2
• switch (x(k) , W (k) ) with (x(j) , W (j) ) = +
• j =j+1 2 4
2
• end if  (E a [f ] − Eb [f ])
= Np Var[fˆ ] +
• end do 4
ˆ
≥ Np Var[f ], (66)
The systematic resampling treats the weights as con-
tinuous random variables in the interval (0, 1), which where the third line follows from (65). Hence, the vari-
are randomly ordered. The number of grid points ance of stratified sampling Var[fˆ ] is never bigger than that
{u + k/Np } in each interval is counted [70]. Every par- of conventional Monte Carlo sampling Var[fˆ], whenever
ticle is replicated and the new particle set is chosen to Ea [f ] = Eb [f ].
minimize Var[Ni ] = E[(Ni − E[Ni ])2 ]. The complexity In general, provided the numbers of simulated samples
of systematic resampling is O(Np ). from strata a and b are Na and Nb ≡ Np −Na , respectively,
4. Local Monte Carlo resampling [304]: The sam- (65) becomes
ples are redrawn using rejection method or Metropolis-
Hastings method. We will briefly describe this scheme 1  Vara [f ] Varb [f ] 
Var[fˆ ] = + , (67)
later in Section VI. 4 Na Np − N a
Remarks (on weakness):
the variance is minimized when
• Different from the rejection sampling that achieves ex-
act draws from the posterior, SIR only achieves ap- Na σa
= , (68)
proximate draws from the posterior as Np → ∞. Some Np σa + σb
variations of combining rejection sampling and impor-
and the achieved minimum variance is
tance sampling are discussed in [307].
• Although resampling can alleviate the weight degener- (σa + σb )2
Var[fˆ ]min = . (69)
acy problem, it unfortunately introduces other prob- 4Na
lems [144]: after one resampling step, the simulated
trajectories are not statistically independent any more, Remarks:
thus the convergence result due to the original central • In practice, it is suggested [376] that (67) be changed
limit theorem is invalid; resampling causes the samples to the generic form
that have high importance weights to be statistically 1  Vara [f ] Varb [f ] 
selected many times, thus the algorithm suffers from Var[fˆ ] = + ,
4 (Na ) α (Np − Na )α
the loss of diversity.
• Resampling step also limits the opportunity to paral- with an empirical value α = 2.
lelize since all of the particles need to be combined for • Stratified sampling works very well and is efficient in
selection. a not-too-high dimension space (say Nx ≤ 4), when
Nx grows higher, the use of this technique is limited
G.5 Stratified Sampling because one needs to estimate the variance of each
The idea of stratified sampling is to distribute the sam- stratum. In [376], an adaptive recursive stratified sam-
ples evenly (or unevenly according to their respective vari- pling procedure was developed to overcome this weak-
ance) to the subregions dividing the whole space. Let fˆ ness (see [377] for implementation details).
(statistics of interest) denote the Monte Carlo sample av- 43 The inequality (66) is called the “parallel axis theorem” in
erage of a generic function f (x) ∈ RNx , which is attained physics.
MANUSCRIPT 22

G.6 Markov Chain Monte Carlo reversible, ergodic Markov chain with invariant distribu-
tion Q.46 Generally, we don’t know how fast the Markov
Consider a state vector x ∈ RNx in a probability space
chain will converge to an equilibrium,47 neither the rate
(Ω, F, P ), K(·, ·) is assumed to be a transition kernel in
of convergence or error bounds. Markov chain can be also
the state space, which represents the probability of moving
used for importance sampling, in particular, we have the
from x to a point in a set S ∈ B (where B s a Borel σ-field
following theorem:
on RNx ), a Markov chain is a sequence of random variable
{xn }n≥0 such that Theorem 4: [315] Let K(x, x ) denote a transitional ker-
nel for a Markov chain on RNx with p(x) as the den-
Pr(xn ∈ B|x0 , · · · , xn−1 ) = Pr(xn ∈ B|xn−1 ), sity of its invariant distribution, let q(x) denote the pro-
posal
 distribution with W (x) as importance weights, then
and K(xn−1 , xn ) = p(xn |xn−1 ). A Markov chain is charac- W (x)q(x)K(x, x )dx = p(x ) for all x ∈ RNx .
terized by the properties of its states, e.g. transiency, pe-
Metropolis-Hastings Algorithm. Metropolis-Hastings
riodicity, irreducibility,44 and ergodicity. The foundation
algorithm,48 initially studied by Metropolis [329], and later
of Markov chain theory is the Ergodicity Theorem, which
redeveloped by Hastings [204], is a kind of MCMC al-
establishes under which a Markov chain can be analyzed
gorithm whose transition is associated with the accep-
to determine its steady state behavior.
tance probability. Assume q(x, x ) as the proposal dis-
Theorem 3: If a Markov chain is ergodic, then there ex-
tribution (candidate target) that doesn’t satisfy the re-
ists a unique steady state distribution π independent of the
versibility condition, without loss of generality, suppose
initial state.
π(x)q(x, x ) > π(x )q(x , x), which means the probability
Markov chain theory is mainly concerned about finding moving from x to x is bigger (more frequent) than the
the conditions under which there exists an invariant dis- probability moving from x to x. Intuitively, we want to
tribution Q and conditions under which iterations of tran- change this situation to reduce the number of moves from
sition kernel converge to the invariant distribution [185], x to x . By doing this, we introduce a probability of move,
[91]. The invariant distribution satisfies 0 < α(x, x ) < 1, if the move is not performed, the process
 returns x as a value from the target distribution. Hence
Q(dx ) = K(x, dx )π(x)dx, the the transition from x to x now becomes:
X

π(x ) = K(x, x )π(x)dx pMH (x, x ) = q(x, x )α(x, x ), (71)
X 
where x = x . In order to make (71) satisfy reversibility
where x ∈ S ⊂ RNx , and π is the density w.r.t. Lebesgue condition, α(x, x ) need to be set to [204]:
measure of Q such that Q(dx ) = π(x )dx . The n-th it-    
)q(x ,x)
eration is thus given by X K (n−1) (x, dx )K(x , S). When  min π(xπ(x)q(x,x ) , 1 , if π(x)q(x, x ) > 0,
α(x, x ) = (72)
n → ∞, the initial state x will converge to the invariant 1 otherwise
distribution Q.
Markov chain Monte Carlo (MCMC) algorithms turn Hence the probability that Markov process stays at x is
around the Markov chain theory. The invariant distribu- given by
tion or density is assumed to be known which correspond 
to the target density π(x), but the transition kernel is un- 1− q(x, x )α(x, x )dx , (73)
X
known. In order to generate samples from π(·), the MCMC
methods attempt to find a K(x, dx ) whose n-th iteration and the transition kernel is given by
(for large n) converges to π(·) given an arbitrary starting
KMH (x, dx ) = q(x, x )α(x, x )dx
point.   
One of important properties of Markov chain is the re- + 1− q(x, x )α(x, x )dx δx (dx(74)

).
versible condition (a.k.a. “detailed balance”)45 X

π(x)K(x, x ) = π(x )K(x , x), (70) In summary, a generic Metropolis-Hastings algorithm


proceeds as follows [91]:
which states that the unconditional probability of moving
x to x is equal to the unconditional probability of moving • For i = 1, · · · , Np , at iteration n = 0, draw a starting
x to x, where x, x are both generated from π(·). The point x0 from a prior density;
distribution Q is thus the invariant distribution for K(·, ·). 46 Note that the samples are independent only when the Markov

In the MCMC sampling framework, unlike the impor- chain is reversible and uniformly ergodic, otherwise they are depen-
dent for which the Central Limit Theorem doesn’t hold for the con-
tance or rejection sampling where the samples are drawn in- vergence.
dependently, the samples are generated by a homogeneous, 47 Only the samples that are drawn after the Markov chain ap-
proaches the equilibrium are regarded as the representative draws
44 A Markov chain is called irreducible if any state can be reached from the posterior. The time for Markov chain converging to equilib-
from any other state in a finite number of iterations. rium is called the burn-in time.
45 Markov chains that satisfy the detailed balance are called re- 48 This algorithm appears as the first entry of a recent list of great
versible Markov chains. algorithms of 20th-century scientific computing.
MANUSCRIPT 23
• generate a uniform random variable u ∼ U(0, 1), and x′ ∼ q(xn, ·);
• if u < α(xn, x′), set xn+1 = x′; else set xn+1 = xn;
• n = n + 1; repeat steps 2 and 3 until a certain number (say k) of steps (i.e. the burn-in time) has been taken, then store x^(i) = xk;
• i = i + 1; repeat the procedure until Np samples are drawn, and return the samples {x^(1), · · · , x^(Np)}.

Remarks:
• If the candidate-generating density is symmetric (e.g. a random walk), i.e. q(x, x′) = q(x′, x), the probability of move reduces to π(x′)/π(x); hence (72) reduces to the rule: if π(x′) ≥ π(x), the chain moves to x′, and it remains where it is otherwise. This is the original algorithm in [329]; it was also used in simulated annealing [257].
• The probability of move does not require knowledge of the normalizing constant of π(·).
• The draws are regarded as samples from the target density only after the chain has passed the transient phase; convergence to the invariant distribution occurs under mild regularity conditions (irreducibility and aperiodicity) [416].
• The efficiency of the Metropolis algorithm is determined by the ratio of accepted samples to the total number of samples. Too large or too small a variance of the driving-force noise may result in inefficient sampling.
• It was suggested in [95] to use a Gaussian proposal distribution N(μ, Σ) for the Metropolis-Hastings algorithm (or in the MCMC step of a particle filter), where the mean and covariance are determined by

μ = Σ_{i=1}^{Np} W^(i) x^(i) / Σ_{i=1}^{Np} W^(i),

Σ = Σ_{i=1}^{Np} W^(i) (x^(i) − μ)(x^(i) − μ)^T / Σ_{i=1}^{Np} W^(i).
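To make the recursion above concrete, the following is a minimal sketch of a random-walk Metropolis-Hastings sampler; the unnormalized log-target log_target and the step size sigma are illustrative assumptions, not anything prescribed by the text.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, burn_in=1000, sigma=1.0,
                        rng=np.random.default_rng(0)):
    """Random-walk Metropolis-Hastings: the proposal is symmetric, so the
    acceptance ratio reduces to pi(x')/pi(x) (computed in the log domain)."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for n in range(burn_in + n_samples):
        x_prop = x + sigma * rng.standard_normal(x.shape)   # x' ~ q(x_n, .)
        # accept with probability min{1, pi(x')/pi(x)}
        if np.log(rng.uniform()) < log_target(x_prop) - log_target(x):
            x = x_prop
        if n >= burn_in:                                    # discard transient phase
            samples.append(x.copy())
    return np.array(samples)

# Example: sample from an unnormalized 1-D bimodal density
draws = metropolis_hastings(lambda x: -0.5 * ((np.abs(x) - 2.0) ** 2).sum(),
                            x0=[0.0], n_samples=5000)
```

Because only the ratio π(x′)/π(x) is formed, the sampler never needs the normalizing constant, exactly as noted in the remarks above.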
Gibbs Sampling. Gibbs sampling, initially developed by Geman and Geman in image restoration [178], is a special form of MCMC [185], [173], or a special form of the Metropolis-Hastings algorithm [329], [204], [175], [176]. The Gibbs sampler uses the concept of alternating (marginal) conditional sampling. Given an Nx-dimensional state vector x = [x1, x2, · · · , xNx]^T, we are interested in drawing samples from the marginal density in the case where the joint density is inaccessible or hard to sample. The generic procedure is as follows (e.g., [73]):
• At iteration n = 0, draw x0 from the prior density p(x0);
• At iterations n = 1, 2, · · · , draw a sample x1,n from p(x1|x2,n−1, x3,n−1, · · · , xNx,n−1);
• draw a sample x2,n from p(x2|x1,n, x3,n−1, · · · , xNx,n−1);
· · ·
• draw a sample xNx,n from p(xNx|x1,n, x2,n, · · · , xNx−1,n).
To illustrate the idea of Gibbs sampling, an example with four-step iterations in a two-dimensional probability space p(x1, x2) is presented in Fig. 8; a programmatic version is sketched below.

Fig. 8. An illustration of Gibbs sampling in a two-dimensional space (borrowed and changed from MacKay (1998) with permission). Left: Starting from state xn, x1 is sampled from the conditional pdf p(x1|x2,n−1). Middle: A sample is drawn from the conditional pdf p(x2|x1,n). Right: Four-step iterations in the probability space (contour).
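As a programmatic counterpart of Fig. 8, here is a minimal sketch of a two-dimensional Gibbs sampler; the zero-mean bivariate Gaussian target with correlation rho is an assumed example, chosen because its full conditionals are available in closed form.

```python
import numpy as np

def gibbs_2d_gaussian(rho, n_iter, rng=np.random.default_rng(0)):
    """Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho.
    Each full conditional p(x1|x2) and p(x2|x1) is N(rho*other, 1 - rho**2)."""
    x1, x2 = 0.0, 0.0                       # arbitrary starting state
    cond_std = np.sqrt(1.0 - rho ** 2)
    chain = np.empty((n_iter, 2))
    for n in range(n_iter):
        x1 = rho * x2 + cond_std * rng.standard_normal()  # x1 ~ p(x1 | x2)
        x2 = rho * x1 + cond_std * rng.standard_normal()  # x2 ~ p(x2 | x1)
        chain[n] = (x1, x2)
    return chain

chain = gibbs_2d_gaussian(rho=0.9, n_iter=4)  # four-step iteration, as in Fig. 8
```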
Remarks:
• Gibbs sampling is an alternating sampling scheme; since the conditional density to be sampled is low-dimensional, the Gibbs sampler is a nice solution for estimating hierarchical or structured probabilistic models.
• Gibbs sampling can be viewed as a Metropolis method in which the proposal distribution is defined in terms of the conditional distributions of the joint distribution, and in which every proposal is always accepted [318].
• Gibbs sampling has been used extensively for the dynamic state space model [71] within the Bayesian framework.
• An adaptive rejection Gibbs sampling algorithm was also developed in [187].

In addition to the Metropolis-Hastings algorithm and Gibbs sampling, MCMC methods are powerful and have a huge literature. We cannot extend the discussion here due to the space constraint; we refer the reader to [176], [182], [185] for more discussion of MCMC methods, and to the review paper [416] for Bayesian estimation using MCMC methods. In the context of sequential state estimation, the Metropolis-Hastings algorithm and Gibbs sampling are less attractive because of their computational inefficiency in a non-iterative fashion. Moreover, both of them use a random walk to explore the state space, so the efficiency is low when Nx is big. Another important issue with MCMC methods is their convergence: How long does it take an MCMC to converge to equilibrium? How fast is the convergence rate? Many papers have been devoted to investigating these questions [99], [140].49 One way to reduce the "blind" random-walk behavior in Gibbs sampling is the method of over-relaxation [2], [349], [318]; another way is the so-called hybrid Monte Carlo method, which we discuss next.

49 See also the recent special MCMC issue in Statistical Science, vol. 16, no. 4, 2001.

G.7 Hybrid Monte Carlo

The hybrid Monte Carlo (HMC) algorithm [152] is a kind of asymptotically unbiased MCMC algorithm for sampling from complex distributions. In particular, it can be viewed as a Metropolis method which uses gradient information to reduce random walk behavior. Assume the probability distribution is written as [346], [318]

P(x) = exp(−E(x)) / Z,   (75)

where Z is a normalizing constant. The key idea of HMC is to use not only the energy E(x) but also its gradient (w.r.t. x), since the gradient direction might indicate the way to a state with higher probability [318].

In the HMC,50 the state space x is augmented by a momentum variable η, and two proposals are used alternately. The first proposal randomizes the momentum variable with the state x unchanged; the second proposal changes both x and η using the simulated Hamilton dynamics [318]

H(x, η) = E(x) + K(η),   (76)

where K(η) is a kinetic energy of the form K(η) = (1/2) η^T η. These two proposals are used to produce samples from the joint distribution

P_H(x, η) = (1/Z_H) exp[−H(x, η)] = (1/Z_H) exp[−E(x)] exp[−K(η)],   (77)

where Z_H = Z Z_K is a normalizing constant. The distribution P_H(x, η) is separable, and the marginal distribution of x is the desired distribution exp[−E(x)]/Z. By discarding the momentum variables, a sequence of random samples x^(i) can be generated that can be viewed as asymptotically being drawn from P(x). The first proposal draws a new momentum from the Gaussian density exp[−K(η)]/Z_K. In the second proposal, the momentum determines where the state should go, and the gradient of E(x) determines how the momentum η changes, according to the following differential equations:

ẋ = η,   (78a)
η̇ = −∂E(x)/∂x.   (78b)

50 A pseudocode of the HMC algorithm was given in [318].

Since the motion of x is driven by the direction of the momentum η, intuitively the state converges faster than in conventional MC methods. With perfect simulation of the Hamilton dynamics, the total energy H(x, η) is constant; thus (72) is always 1 and the proposal is always accepted. With imperfect simulation we can still obtain, asymptotically, samples from P_H(x, η) [318].

Remarks:
• The HMC method can be used for the particle filter [94]: instead of being weighted by the likelihood, each particle produces a Markov chain that follows the gradient of the posterior over large distances, which allows it to rapidly explore the state space and produce samples from the target distribution.
• Some improved HMC methods were developed in [347], [346].
• The idea of using gradient information in HMC can be extended to a sequential framework, e.g. the HySIR algorithm [120].
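In practice the continuous dynamics (78a)-(78b) are discretized, most commonly with a leapfrog integrator, and the discretization error is corrected by a Metropolis accept/reject test on the total energy. The following is a minimal sketch under that standard scheme; E, grad_E, the step size and the number of leapfrog steps are assumed inputs.

```python
import numpy as np

def hmc_step(x, E, grad_E, eps=0.1, n_leapfrog=20, rng=np.random.default_rng(0)):
    """One HMC transition: randomize the momentum, simulate (78a)-(78b) by
    leapfrog, then accept/reject on the change in H(x, eta) = E(x) + K(eta)."""
    eta = rng.standard_normal(x.shape)          # first proposal: eta ~ exp(-K)/Z_K
    x_new, eta_new = x.copy(), eta.copy()
    eta_new -= 0.5 * eps * grad_E(x_new)        # half step for the momentum
    for _ in range(n_leapfrog - 1):
        x_new += eps * eta_new                  # full step for the state (78a)
        eta_new -= eps * grad_E(x_new)          # full step for the momentum (78b)
    x_new += eps * eta_new
    eta_new -= 0.5 * eps * grad_E(x_new)
    # imperfect simulation changes H; accept with probability min{1, exp(-dH)}
    dH = (E(x_new) + 0.5 * eta_new @ eta_new) - (E(x) + 0.5 * eta @ eta)
    return x_new if np.log(rng.uniform()) < -dH else x
```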
G.8 Quasi-Monte Carlo

Another important Monte Carlo method attempting to accelerate the convergence is quasi-Monte Carlo (QMC) (e.g., see [353], [425], [363]), which has been used extensively in computer graphics. The mathematical foundation of QMC is number theory instead of probability theory, hence it is a deterministic method. The idea of QMC methods is to substitute the pseudo-randomly generated sequence used in regular MC methods with a deterministic sequence in order to minimize the discrepancy, and also to replace the probabilistic error bounds of regular MC with deterministic bounds. In QMC, a popular class of deterministic sequences called low-discrepancy sequences (LDS) is often used to generate the sample points [353]. The LDS has a minimum discrepancy51 of O((log Np)^{Nx−1}/Np) (for a large Np), which is faster than the regular MC methods' error bound O(1/√Np) (from the Central Limit Theorem). There are many methods for constructing an LDS; among them, the lattice rule (LR) is a popular one due to its simplicity and potential variance reduction advantage [295], [296]. Using some lattice rule, one generates the point set

S = { ((i − 1)/Np) (1, a, · · · , a^{Nx−1}) mod 1,  i = 1, · · · , Np },

where Np is the number of lattice points in S and a is an integer between 1 and Np − 1. For a square-integrable function f over [0, 1)^{Nx}, the estimator of QMC via a lattice rule is given by

f̂_LR = (1/Np) Σ_{i=1}^{Np} f((x_i + Δ) mod 1).   (79)

It was shown in [295] that the estimate (79) is unbiased and Var[f̂_LR] ≤ Var[f̂_MC]; in particular, when f is linear, Var[f̂_LR] = (1/Np) Var[f̂_MC]; in some cases where f is nonlinear, the convergence rate O(1/Np²) might be achieved.

Remarks:
• QMC can be viewed as a special quadrature technique with a different scheme for choosing the quadrature points; it can be used for marginal density estimation [363].
• The QMC method can also be applied to particle filters [361].

51 It is a measure of the uniformity of distribution of finite point sets.

To end this subsection, we summarize some popular Monte Carlo methods available in the literature in Table II for the reader's convenience.

TABLE II
A List of Popular Monte Carlo Methods.

author(s)            method                inference     references
Metropolis           MCMC                  off line      [330], [329]
Marshall             importance sampling   on/off line   [324], [199], [180]
N/A                  rejection sampling    off line      [199], [197]
N/A                  stratified sampling   on/off line   [376], [377], [69]
Hastings             MCMC                  off line      [204]
Geman & Geman        Gibbs sampling        off line      [178], [175]
Handschin & Mayne    SIS                   off line      [200], [506], [266]
Rubin                multiple imputation   off line      [394], [395]
Rubin                SIR                   on/off line   [397], [176]
Gordon et al.        bootstrap             on line       [191], [193]
Duane et al.         HMC                   on/off line   [152], [347], [346]
N/A                  QMC                   on/off line   [353], [425], [354]
Chen & Schmeiser     hit-and-run MC        off line      [81], [417]
N/A                  slice sampling        off line      [336], [351]
N/A                  perfect sampling      off line      [133], [490]

VI. Sequential Monte Carlo Estimation: Particle Filters

With the background knowledge of stochastic filtering, Bayesian statistics, and Monte Carlo techniques, we are now in a good position to discuss the theory and paradigms of particle filters. In this section, we focus our attention on the sequential Monte Carlo approach to sequential state estimation. The sequential Monte Carlo technique is a kind of recursive Bayesian filter based on Monte Carlo simulation; it is also known as the bootstrap filter [193] and shares many common features with the so-called interacting particle system approximation [104], [105], [122], [123], [125], CONDENSATION [229], [230], the Monte Carlo filter [259]-[261], [49], sequential imputation [266], [303], survival of the fittest [254], and the likelihood weighting algorithm [254].

The working mechanism of particle filters is the following: the state space is partitioned into many parts, in which the particles are filled according to some probability measure; the higher the probability, the more densely the particles are concentrated. The particle system evolves along time according to the state equation, with the evolving pdf determined by the FPK equation. Since the pdf can be approximated by a point-mass histogram, by randomly sampling the state space we get a number of particles representing the evolving pdf. However, since the posterior density model is unknown or hard to sample, we would rather choose another distribution for the sake of efficient sampling.

To avoid intractable integration in the Bayesian statistics, the posterior distribution or density is empirically represented by a weighted sum of Np samples drawn from the posterior distribution,

p(xn|Yn) ≈ (1/Np) Σ_{i=1}^{Np} δ(xn − xn^(i)) ≡ p̂(xn|Yn),   (80)

where the xn^(i) are assumed to be i.i.d. samples drawn from p(xn|Yn). When Np is sufficiently large, p̂(xn|Yn) approximates the true posterior p(xn|Yn). With this approximation, we can estimate the mean of a nonlinear function

E[f(xn)] ≈ ∫ f(xn) p̂(xn|Yn) dxn
        = (1/Np) Σ_{i=1}^{Np} ∫ f(xn) δ(xn − xn^(i)) dxn
        = (1/Np) Σ_{i=1}^{Np} f(xn^(i)) ≡ f̂_{Np}(x).   (81)

Since it is usually impossible to sample from the true posterior, it is common to sample from an easy-to-implement distribution, the so-called proposal distribution52, denoted q(xn|Yn); hence

E[f(xn)] = ∫ f(xn) [p(xn|Yn)/q(xn|Yn)] q(xn|Yn) dxn
        = ∫ f(xn) [Wn(xn)/p(Yn)] q(xn|Yn) dxn
        = (1/p(Yn)) ∫ f(xn) Wn(xn) q(xn|Yn) dxn,   (82)

where

Wn(xn) = p(Yn|xn) p(xn) / q(xn|Yn).   (83)

Equation (82) can be rewritten as

E[f(xn)] = ∫ f(xn) Wn(xn) q(xn|Yn) dxn / ∫ p(Yn|xn) p(xn) dxn
        = ∫ f(xn) Wn(xn) q(xn|Yn) dxn / ∫ Wn(xn) q(xn|Yn) dxn
        = E_{q(xn|Yn)}[Wn(xn) f(xn)] / E_{q(xn|Yn)}[Wn(xn)].   (84)

By drawing i.i.d. samples {xn^(i)} from q(xn|Yn), we can approximate (84) by

E[f(xn)] ≈ [(1/Np) Σ_{i=1}^{Np} Wn(xn^(i)) f(xn^(i))] / [(1/Np) Σ_{i=1}^{Np} Wn(xn^(i))]
        = Σ_{i=1}^{Np} W̃n(xn^(i)) f(xn^(i)) ≡ f̂(x),   (85)

where

W̃n(xn^(i)) = Wn(xn^(i)) / Σ_{j=1}^{Np} Wn(xn^(j)).   (86)

52 It is also called the importance density or importance function. The optimal proposal distribution is the one that minimizes the conditional variance given the observations up to n.
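In code, the self-normalized importance sampling estimator (85)-(86) amounts to a few lines; the unnormalized target and the Gaussian proposal below are assumed placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Npart = 10_000

# assumed example: unnormalized target p(x) ∝ exp(-|x|), proposal q = N(0, 2^2)
x = rng.normal(0.0, 2.0, Npart)                               # x^(i) ~ q
log_q = -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2 * np.pi))
log_w = -np.abs(x) - log_q                                    # unnormalized weights (83)
W = np.exp(log_w - log_w.max())                               # stabilized in log domain
W_tilde = W / W.sum()                                         # normalized weights (86)

f_hat = np.sum(W_tilde * x ** 2)       # estimate of E[f(x)] with f(x) = x^2, eq. (85)
```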
Suppose the proposal distribution has the following factorized form:

q(x0:n|y0:n) = q(xn|x0:n−1, y0:n) q(x0:n−1|y0:n−1) = q(x0) Π_{t=1}^{n} q(xt|x0:t−1, y0:t).

Similar to the derivation steps in (23), the posterior p(x0:n|y0:n) can be factorized as

p(x0:n|y0:n) = p(x0:n−1|y0:n−1) p(yn|xn) p(xn|xn−1) / p(yn|y0:n−1).

Thus the importance weights Wn^(i) can be updated recursively:

Wn^(i) = p(x0:n^(i)|y0:n) / q(x0:n^(i)|y0:n)
      ∝ p(yn|xn^(i)) p(xn^(i)|xn−1^(i)) p(x0:n−1^(i)|y0:n−1) / [q(xn^(i)|x0:n−1^(i), y0:n) q(x0:n−1^(i)|y0:n−1)]
      = Wn−1^(i) p(yn|xn^(i)) p(xn^(i)|xn−1^(i)) / q(xn^(i)|x0:n−1^(i), y0:n).   (87)

A. Sequential Importance Sampling (SIS) Filter

In practice, we are more interested in the current filtered estimate p(xn|y0:n) than in p(x0:n|y0:n). Provided q(xn|x0:n−1^(i), y0:n) is assumed to be equivalent to q(xn|x0:n−1^(i), yn), (87) can be simplified to

Wn^(i) = Wn−1^(i) p(yn|xn^(i)) p(xn^(i)|xn−1^(i)) / q(xn^(i)|x0:n−1^(i), yn).   (88)

As discussed earlier, the problem of the SIS filter is that the distribution of the importance weights becomes more and more skewed as time increases. Hence, after some iterations, only very few particles have non-zero importance weights. This phenomenon is often called weight degeneracy or sample impoverishment [396], [193], [40], [304]. An intuitive solution is to multiply the particles with high normalized importance weights and discard the particles with low normalized importance weights, which can be done in the resampling step. To monitor how bad the weight degeneracy is, we need a measure. A suggested measure of degeneracy, the so-called effective sample size Neff, was introduced in [266] (see also [303], [305], [315], [144], [350])53

Neff = Np / (1 + Var_{q(·|y0:n)}[W̃(x0:n)]) = Np / E_{q(·|y0:n)}[(W̃(x0:n))²] ≤ Np.   (89)

The second equality follows from the facts that Var[ξ] = E[ξ²] − (E[ξ])² and E_q[W̃] = 1. In practice, the true Neff is not available, so its estimate, N̂eff, is used instead [305], [303]:

N̂eff = 1 / Σ_{i=1}^{Np} (W̃n^(i))².   (90)

53 It was claimed in [70], [162] that the estimate N̂eff is not robust; see the discussion in Section VI-P.3.

When N̂eff falls below a predefined threshold NT (say Np/2 or Np/3), the resampling procedure is performed. The above procedure was also used in the rejection control [304], which combines the rejection method [472] and importance sampling. The idea is the following: when N̂eff < NT (where NT can be either a predefined value or the median of the weights), each sample is accepted with probability min{1, Wn^(i)/NT}; all accepted samples are given a new weight Wn^(j) = max{NT, Wn^(i)}, and the rejected samples are restarted and rechecked at all previously violated thresholds. It is obvious that this procedure becomes computationally expensive as n increases. An advanced scheme, partial rejection control [308], was thus proposed to reduce the computational burden while preserving the dynamic control of the resampling schedule. A generic algorithm of the SIS particle filter with resampling is summarized in Table III (a code sketch follows the table).

TABLE III
SIS Particle Filter with Resampling.

For time steps n = 0, 1, 2, · · ·
1: For i = 1, · · · , Np, draw the samples xn^(i) ∼ q(xn|x0:n−1^(i), y0:n) and set x0:n^(i) = {x0:n−1^(i), xn^(i)}.
2: For i = 1, · · · , Np, calculate the importance weights Wn^(i) according to (88).
3: For i = 1, · · · , Np, normalize the importance weights W̃n^(i) according to (86).
4: Calculate N̂eff according to (90); return if N̂eff > NT, otherwise generate a new particle set {xn^(j)} by resampling with replacement Np times from the previous set {x0:n^(i)} with probabilities Pr(x0:n^(j) = x0:n^(i)) = W̃n^(i), and reset the weights to W̃n^(i) = 1/Np.
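A minimal sketch of the algorithm in Table III follows, with multinomial resampling triggered by (90); the model enters through vectorized functions for the proposal sampler, the transition and likelihood pdfs, and the proposal pdf, all assumed inputs.

```python
import numpy as np

def sis_filter_step(particles, weights, y, sample_proposal, trans_pdf, lik_pdf,
                    prop_pdf, N_T=None, rng=np.random.default_rng(0)):
    """One SIS step (Table III): propagate, reweight by (88), normalize by (86),
    and resample when the effective sample size (90) drops below N_T."""
    Npart = len(particles)
    new_particles = sample_proposal(particles, y)                  # step 1
    weights = weights * (lik_pdf(y, new_particles)                 # step 2, eq. (88)
                         * trans_pdf(new_particles, particles)
                         / prop_pdf(new_particles, particles, y))
    weights = weights / weights.sum()                              # step 3, eq. (86)
    n_eff = 1.0 / np.sum(weights ** 2)                             # step 4, eq. (90)
    if N_T is None:
        N_T = Npart / 2.0
    if n_eff < N_T:                                                # resample w/ replacement
        idx = rng.choice(Npart, size=Npart, p=weights)
        new_particles = new_particles[idx]
        weights = np.full(Npart, 1.0 / Npart)
    return new_particles, weights
```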
Fig. 9. An illustration of the generic particle filter with importance sampling and resampling. (The figure shows one cycle: a particle cloud {xn^(i)} approximating p(xn|yn−1); a correction step weighting by p(yn|xn) to obtain p(xn|yn); a resampling step; and a prediction step yielding p(xn+1|yn) and the new cloud {xn+1^(i)}.)

B. Bootstrap/SIR filter

The Bayesian bootstrap filter, due to Gordon, Salmond and Smith [193], is very close in spirit to the sampling-importance resampling (SIR) filter developed independently in statistics by different researchers [304], [307], [369], [370], [69], with a slight difference in the resampling scheme. Here we treat them as the same class for discussion. The key idea of the SIR filter is to introduce the resampling step, as we have discussed in Section V-G.4. The resampling step is flexible and varies from problem to problem, as do the selection scheme and schedule. It should be noted that resampling does not really prevent the weight degeneracy problem; it just saves further calculation time by discarding the particles associated with insignificant weights. What it really does is artificially conceal the impoverishment by replacing the highly weighted particles with many replicates, thereby introducing high correlation between particles.

A generic algorithm of the Bayesian bootstrap/SIR filter using the transition prior density as proposal distribution is summarized in Table IV (a code sketch follows the table), where the resampling step is performed at each iteration using any available resampling method discussed earlier.

TABLE IV
SIR Particle Filter Using Transition Prior as Proposal Distribution.

For time steps n = 0, 1, 2, · · ·
1: Initialization: for i = 1, · · · , Np, sample x0^(i) ∼ p(x0), W0^(i) = 1/Np.
2: Importance Sampling: for i = 1, · · · , Np, draw samples x̂n^(i) ∼ p(xn|xn−1^(i)), and set x̂0:n^(i) = {x0:n−1^(i), x̂n^(i)}.
3: Weight update: calculate the importance weights Wn^(i) = p(yn|x̂n^(i)).
4: Normalize the importance weights: W̃n^(i) = Wn^(i) / Σ_{j=1}^{Np} Wn^(j).
5: Resampling: generate Np new particles xn^(i) from the set {x̂n^(i)} according to the importance weights W̃n^(i).
6: Repeat steps 2 to 5.
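One cycle of Table IV can be sketched compactly as below; systematic resampling is used in step 5 as one common low-variance implementation choice (an assumption, since the table allows any resampling method).

```python
import numpy as np

def systematic_resample(weights, rng=np.random.default_rng(0)):
    """Systematic resampling: one uniform draw, Np evenly spaced pointers."""
    Npart = len(weights)
    cum = np.cumsum(weights)
    cum[-1] = 1.0                                   # guard against round-off
    positions = (rng.uniform() + np.arange(Npart)) / Npart
    return np.searchsorted(cum, positions)

def bootstrap_step(particles, y, sample_transition, lik_pdf,
                   rng=np.random.default_rng(0)):
    """One SIR cycle: propagate by the transition prior, weight by the
    likelihood, normalize, and resample at every iteration (Table IV)."""
    particles = sample_transition(particles, rng)   # step 2: x̂ ~ p(x_n | x_{n-1})
    W = lik_pdf(y, particles)                       # step 3: W = p(y_n | x̂)
    W = W / W.sum()                                 # step 4
    return particles[systematic_resample(W, rng)]   # step 5
```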
Remarks:
• Both the SIS and SIR filters use the importance sampling scheme. The difference between them is that in the SIR filter the resampling is always performed (usually between two importance sampling steps), whereas in the SIS filter the importance weights are calculated sequentially and resampling is taken only when needed; thus the SIS filter is less computationally expensive.
• The choice of proposal distribution in the SIS and SIR filters plays a crucial role in their final performance.
• The resampling step is suggested to be done after the filtering [75], [304], because resampling brings extra random variation to the current samples. Normally (esp. in off-line processing), the posterior estimate (and its relevant statistics) should be calculated before resampling.
• As suggested by some authors [259], [308], in the resampling stage the new importance weights of the surviving particles are not necessarily reset to 1/Np, but may instead abide by certain procedures.
• To alleviate the sample degeneracy in the SIS filter, we can change (88) to

Wn = Wn−1^α p(yn|xn) p(xn|xn−1) / q(xn|x0:n−1, yn),

where the scalar 0 < α < 1 plays the role of an annealing factor that controls the impact of the previous importance weights.

C. Improved SIS/SIR Filters

In the past few years, many efforts have been devoted to improving the particle filters' performance [69], [189], [428], [345], [456], [458], [357]. Here, due to space limitation, we only focus on the improved schemes for (efficient) sampling/resampling and variance reduction.

In order to alleviate the sample impoverishment problem, a simple improvement strategy is prior boosting [193]. Namely, in the sampling step, one can increase the number of simulated samples drawn from the proposal to Np′ > Np; in the resampling step, however, only Np particles are preserved.

Carpenter, Clifford, and Fearnhead [69] proposed using a sophisticated stratified sampling (also found in [259]) for particle filtering. In particular, the posterior density is assumed to comprise Np distinct mixture strata54

p(x) = Σ_{i=1}^{Np} ci pi(x),   Σ_{i=1}^{Np} ci = 1.   (91)

54 This is the so-called survey sampling technique [199], [162].
According to [69], a population quantity can be estimated efficiently by sampling a fixed number Mi from each stratum, with Σ_{i=1}^{Np} Mi = Np (Np ≫ Mi). The efficiency is attained with the Neyman allocation Mi ∝ ci σi (where σi is the variance of the generic function f(x) in the i-th stratum), or with the proportional allocation Mi = ci Np for simplicity. It was argued that in most cases the proportional allocation is more efficient than simple random sampling from p(x). In the particle filtering context, the coefficients ci and pi(x) are determined recursively [69]:

ci = ∫ p(xn|xn−1^(i)) p(yn|xn) dxn / Σ_{i=1}^{Np} ∫ p(xn|xn−1^(i)) p(yn|xn) dxn,   (92)

pi(xn) = p(xn|xn−1^(i)) p(yn|xn) / ∫ p(xn|xn−1^(i)) p(yn|xn) dxn.   (93)

For the i-th stratum, the importance weights associated with the Mi particles are updated recursively by

Wn^(j) = Wn−1^(i) p(xn^(j)|xn−1^(i)) p(yn|xn^(j)) / (ci pi(xn^(j)))   (94)

for Σ_{ℓ=1}^{i−1} Mℓ < j ≤ Σ_{ℓ=1}^{i} Mℓ. By stratified sampling in the update stage, variance reduction is achieved.55 In the resampling stage, a sample set of size Np is selected from the 10 × Np predicted values to keep the size of the particle set unchanged.56 By taking advantage of the method of simulating order statistics [386], an improved SIR algorithm with O(Np) complexity via stratified sampling was developed [69], to which the reader is referred for more details.

55 Intuitively, they use the weighted measure before resampling rather than resampling and then using the unweighted measure, because the weighted samples are expected to contain more information than an equal number of unweighted points.
56 The number 10 was suggested by Rubin [395], where Np ≫ 10. The size of the particle set is assumed to be unchanged.

Many improved particle filters are devoted to the resampling step. For instance, given the discrete particle set {xn^(i), W̃n^(i)}_{i=1}^{Np}, it was suggested [308] that in the resampling stage a new independent particle set {xn^(j), W̃n^(j)}_{j=1}^{Np} be generated as follows:
• For j = 1, · · · , Np, xn^(j) replaces xn^(i) with probability proportional to a(i);
• The associated new weights W̃n^(j) are updated as W̃n^(j) = W̃n^(i)/a(i).

In the conventional multinomial resampling scheme (Section V-G.4), a(i) = Np Wn^(i); in general, however, the choices of a(i) are flexible, e.g. a(i) = √(Wn^(i)) or a(i) = |Wn^(i)|^α. Liu, Chen and Logvinenko [308] also proposed using a partially deterministic reallocation scheme instead of resampling, to overcome the extra variation in the resampling step. The reallocation procedure proceeds as follows [308] (a code sketch is given after this list):
• For i = 1, · · · , Np, if a(i) ≥ 1, retain ki = ⌊a(i)⌋ (or ki = ⌊a(i)⌋ + 1) copies of xn^(i), and assign the weight Wn^(j) = Wn^(i)/ki to each copy;
• if a(i) < 1, remove the sample with probability 1 − a(i), and assign the weight Wn^(j) = Wn^(i)/a(i) to the surviving sample;
• return the new particle set {x^(j), Wn^(j)}.
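A minimal sketch of this reallocation scheme, taking a(i) = Np·W̃n^(i) as in the multinomial case (an assumed choice), is given below; note that, unlike resampling, the size of the returned particle set is itself random.

```python
import numpy as np

def reallocate(particles, weights, rng=np.random.default_rng(0)):
    """Partially deterministic reallocation [Liu, Chen & Logvinenko]:
    a(i) >= 1 -> keep floor(a(i)) copies, each with weight W/k_i;
    a(i) <  1 -> keep the particle with probability a(i), weight W/a(i)."""
    a = len(particles) * weights / weights.sum()
    new_p, new_w = [], []
    for x, W, ai in zip(particles, weights, a):
        if ai >= 1.0:
            k = int(np.floor(ai))
            new_p.extend([x] * k)
            new_w.extend([W / k] * k)
        elif rng.uniform() < ai:             # survive with probability a(i)
            new_p.append(x)
            new_w.append(W / ai)
    return np.array(new_p), np.array(new_w)
```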
D. Auxiliary Particle Filter

A potential weakness of the generic particle filters discussed above is that the particle-based approximation of the filtered density is not sufficient to characterize the tail behavior of the true density, due to the use of a finite mixture approximation; this is more severe when outliers are present. To alleviate this problem, Pitt and Shephard [370], [371] introduced the so-called auxiliary particle filter (APF). The idea behind it is to augment the existing "good" particles {x^(i)}, in the sense that the predictive likelihoods p(yn|x0:n−1^(i)) are large for the "good" particles. When p(yn|x0:n−1^(i)) cannot be computed analytically, the APF uses an analytic approximation; when p(yn|x0:n−1^(i)) can be computed exactly, it uses the optimal proposal distribution (which is thus called "perfect adaptation" [370]). The APF differs from SIR in that it reverses the order of sampling and resampling, which is possible when the importance weights are dependent on xn. By inserting the likelihood inside the empirical density mixture, we may rewrite the filtered density as

p(xn|y0:n) ∝ p(yn|xn) ∫ p(xn|xn−1) p(xn−1|y0:n−1) dxn−1
          ∝ Σ_{i=1}^{Np} Wn−1^(i) p(yn|xn) p(xn|xn−1^(i)),   (95)

where p(xn−1|y0:n−1) = Σ_{i=1}^{Np} Wn−1^(i) δ(xn−1 − xn−1^(i)). Now the product Wn−1^(i) p(yn|xn) is treated as a combined probability contributing to the filtered density. By introducing an auxiliary variable ξ (ξ ∈ {1, · · · , Np}) that plays the role of the index of the mixture component, the augmented joint density p(xn, ξ|y0:n) is updated as

p(xn, ξ = i|y0:n) ∝ p(yn|xn) p(xn, ξ = i|y0:n−1)
               = p(yn|xn) p(xn|ξ = i, y0:n−1) p(i|y0:n−1)
               = p(yn|xn) p(xn|xn−1^(i)) Wn−1^(i).   (96)

Henceforth a sample can be drawn from the joint density (96) by simply neglecting the index ξ, whereby a set of particles {xn^(i)}_{i=1}^{Np} is drawn from the marginalized density p(xn|y0:n) and the indices ξ are simulated with probabilities proportional to p(ξ|y0:n). Thus, (95) can be approximated by

p(xn|y0:n) ∝ Σ_{i=1}^{Np} Wn−1^(i) p(yn|xn^(i), ξ^i) p(xn|xn−1^(i)),   (97)

where ξ^i denotes the index of the particle xn^(i) at time step n − 1, namely ξ^i ≡ {ξ = i}. The proposal distribution used to draw {xn^(i), ξ^i}_{i=1}^{Np} is chosen in a factorized form

q(xn, ξ|y0:n) ∝ q(ξ|y0:n) q(xn|ξ, y0:n),   (98)

where

q(ξ^i|y0:n) ∝ p(yn|μn^(i)) Wn−1^(i),   (99)
q(xn|ξ^i, y0:n) = p(xn|xn−1^(i)),   (100)

where μn^(i) is a value (e.g. mean, mode, or sample value) associated with p(xn|xn−1^(i)), from which the i-th particle is drawn. Thus the true posterior is further approximated by

p(xn|y0:n) ∝ Σ_{i=1}^{Np} Wn−1^(ξ=i) p(yn|μn^(ξ=i)) p(xn^(i)|xn−1^(i)).   (101)

From (99) and (100), the importance weights are recursively updated as

Wn^(i) = Wn−1^(ξ=i) p(yn|xn^(i)) p(xn^(i)|xn−1^(ξ=i)) / q(xn^(i), ξ^i|y0:n) ∝ p(yn|xn^(i)) / p(yn|μn^(ξ=i)).   (102)

The APF is essentially a two-stage procedure: at the first stage, simulate the particles with large predictive likelihoods; at the second stage, reweight the particles and draw the augmented states. This is equivalent to making a proposal that has a high conditional likelihood a priori, thereby avoiding inefficient sampling [370]. The auxiliary variable idea can be used for SIS or SIR filters. An auxiliary SIR filter algorithm is summarized in Table V (a code sketch follows the remarks below).

TABLE V
Auxiliary Particle Filter.

For time steps n = 1, 2, · · ·
1: For i = 1, · · · , Np, calculate μn^(i) (e.g. μn^(i) = E[xn|xn−1^(i)]).
2: For i = 1, · · · , Np, calculate the first-stage weights Wn^(i) = Wn−1^(i) p(yn|μn^(i)) and normalize the weights: W̃n^(i) = Wn^(i) / Σ_{j=1}^{Np} Wn^(j).
3: Use the resampling procedure of the SIR filter algorithm to obtain a new set {xn^(i), ξ^i}_{i=1}^{Np}.
4: For i = 1, · · · , Np, sample xn^(i) ∼ p(xn|xn−1^(i), ξ^i) and update the second-stage weights Wn^(i) according to (102).

It is worthwhile to compare the APF and the SIR filter in terms of statistical efficiency, measured through the random measure E[W̃²(x^(i))]. Pitt and Shephard [370] showed that when the likelihood does not vary over different ξ, the variance of the APF is smaller than that of the SIR filter. The APF can be understood as one-step-ahead filtering [369]-[371]: the particle xn−1^(i) is propagated to ξn^(i) in the next time step in order to assist the sampling from the posterior. On the other hand, the APF resamples p(xn−1|y0:n) instead of the p(xn|y0:n) used in SIR; hence it usually achieves lower variance, because the past estimate is more reliable. Thus the APF takes advantage of the information from the likelihood model beforehand to avoid inefficient sampling, because the particles with low likelihood are deemed less informative; in other words, the particles to be sampled are intuitively pushed into the high-likelihood region. But when the conditional likelihood is not sensitive to the state, the difference between the APF and the SIR filter is insignificant. The APF calculates the likelihood and importance weights twice; in general it achieves better performance than the SIS and SIR filters.

Remarks:
• In conventional particle filters, estimation is usually performed after the resampling step, which is less efficient because resampling introduces extra random variation into the current state [75], [304]. The APF essentially overcomes this problem by doing one-step-ahead estimation based on the point estimate μn^(i) that characterizes p(xn|xn−1^(i)).
• When the process noise is small, the performance of the APF is usually better than that of the SIR filter; however, when the process noise is large, the point estimate μn^(i) does not provide sufficient information about p(xn|xn−1^(i)), and the superiority of the APF is not guaranteed [19].
• In the APF, the proposal distribution is a mixture density that depends on both the past state and the most recent observations.
• The idea of the APF is also identical to that of the local Monte Carlo method proposed in [304], where the authors proposed two methods for drawing samples {x, ξ}, based on either the joint distribution or the marginal distribution.
• A disadvantage of the APF is that the sampling is drawn in an augmented (thus higher-dimensional) space; if the auxiliary index varies a lot for a fixed prior, the gain is negligible and the variance of the importance weights will be higher.
• The APF is computationally slower since the proposal is used twice. It was argued in [162] (chap. 5) that the resampling step of the APF is unnecessary and introduces nothing but inaccuracy; this claim, however, is not sufficiently justified.
• The idea of the auxiliary variable can also be used for MCMC methods [210], [328].
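A minimal sketch of the two-stage APF update in Table V follows; taking mu as the conditional mean of the transition density is one of the point-estimate choices mentioned in step 1, and the vectorized model functions are assumed inputs.

```python
import numpy as np

def apf_step(particles, weights, y, trans_mean, sample_transition, lik_pdf,
             rng=np.random.default_rng(0)):
    """One auxiliary-particle-filter step (Table V)."""
    Npart = len(particles)
    mu = trans_mean(particles)                        # step 1: mu_n^(i)
    first = weights * lik_pdf(y, mu)                  # step 2: first-stage weights
    first = first / first.sum()
    xi = rng.choice(Npart, size=Npart, p=first)       # step 3: sample indices xi
    new_particles = sample_transition(particles[xi], rng)    # step 4: propagate
    second = lik_pdf(y, new_particles) / lik_pdf(y, mu[xi])  # eq. (102)
    return new_particles, second / second.sum()
```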
E. Rejection Particle Filter

It was suggested in [222], [441], [444], [49] that the rejection sampling method is more favorable than the importance sampling method for particle filters, because rejection sampling achieves exact draws from the posterior. Usually rejection sampling does not admit a recursive update, hence how to design a sequential procedure is the key issue for the rejection particle filter.

Tanizaki [441]-[444] has developed a rejection sampling framework for particle filtering. The samples are drawn from the filtering density p(xn|y0:n) without evaluating any integration. Recalling (20) and inserting equations (24) and (25) into (23), the filtering density can be approximated as

p(xn|y0:n) = (1/Cn) ∫ p(yn|xn) p(xn|xn−1) p(xn−1|y0:n−1) dxn−1
          ≈ (1/Np) Σ_{i=1}^{Np} (Cn^(i)/Cn) · p(yn|xn) p(xn|xn−1^(i)) / Cn^(i)
          = Σ_{i=1}^{Np} λn^(i) p(yn|xn) p(xn|xn−1^(i)) / Cn^(i),   (103)

where λn^(i) = Cn^(i)/(Np Cn). The normalizing constant Cn is given as

Cn = ∫∫ p(yn|xn) p(xn|xn−1) p(xn−1|y0:n−1) dxn−1 dxn ≈ (1/Np²) Σ_{i=1}^{Np} Σ_{j=1}^{Np} p(yn|xn|n−1^(ji)) ≡ Ĉn,   (104)

where xn|n−1^(ji) is obtained from f(xn−1^(i), dn^(j)). In addition, Cn^(i) is given as

Cn^(i) = ∫ p(yn|xn) p(xn|xn−1^(i)) dxn ≈ (1/Np) Σ_{j=1}^{Np} p(yn|xn|n−1^(ji)) ≡ Ĉn^(i).   (105)

Hence the filtering density is approximated as a mixture distribution associated with the weights λn^(i), which are approximated by Ĉn^(i)/(Np Ĉn). The acceptance probability, denoted α(·), is defined as

α(z) = [p(yn|z) p(z|xn−1^(i)) / q(z)] / sup_z { p(yn|z) p(z|xn−1^(i)) / q(z) },   (106)

where q(·) is a proposal distribution. The estimation procedure of the rejection particle filter is summarized in Table VI (see the sketch after the remarks below).

The proposal distribution q(xn) can be chosen as the transition density p(xn|xn−1) or as a mixture distribution (e.g. a Gaussian mixture, see Section VI-M.4). The variance of the proposal distribution should, however, be bigger than the posterior density's, since it is supposed to have a broad support.

TABLE VI
Rejection Particle Filter.

For time steps n = 1, 2, · · ·
1: For i = 1, draw xn−1^(i) with probability λn^(i);
2: Generate a random draw z ∼ q(xn);
3: Draw a uniform random variable u ∼ U(0, 1);
4: If u ≤ α(z), accept z as xn^(i); otherwise go back to step 2;
5: i = i + 1; repeat the procedure until i = Np;
6: Calculate the sample average f̂Np, and calculate the posterior according to (103).

Remarks:
• The rejection particle filter usually produces better results than the SIR filter if the proposal distribution is appropriate and the supremum of the ratio p(·)/q(·) exists. However, if the acceptance probability α(z) is small, it takes a long time to produce a sufficient sample set.
• Another drawback of the rejection particle filter is that the computing time for every time step fluctuates because of the uncertainty of the acceptance probability; if the acceptance rate is too low, the real-time processing requirement is not satisfied.
• It was suggested by Liu [305] to use Var[f̂]/Np as a measure to verify the efficiency of rejection sampling and importance sampling. It was claimed, based on many experiments, that for a large Np importance sampling is more efficient in practice.
• Rejection sampling can also be used for the APF. In fact, the proposal of the APF accounts for the most recent observations and is thus closer to the true posterior, which may increase the average acceptance rate.
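For concreteness, the accept/reject loop of Table VI for a single draw, with the transition density as proposal q — in which case the ratio inside (106) reduces to the likelihood and α(z) = p(yn|z)/sup_z p(yn|·) — might be sketched as follows; the likelihood bound lik_sup is an assumed input.

```python
import numpy as np

def rejection_draw(x_prev, y, sample_transition, lik_pdf, lik_sup,
                   rng=np.random.default_rng(0)):
    """Draw one particle exactly from p(x_n | y_0:n) by rejection (Table VI),
    using q(x_n) = p(x_n | x_{n-1}), so alpha(z) = p(y_n|z)/sup p(y_n|.)."""
    while True:
        z = sample_transition(x_prev, rng)      # step 2: z ~ q
        u = rng.uniform()                       # step 3
        if u <= lik_pdf(y, z) / lik_sup:        # step 4: accept with prob alpha(z)
            return z
```

Note that the running time of this loop is random, which is exactly the fluctuation of per-step computing time pointed out in the remarks above.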
F. Rao-Blackwellization

Rao-Blackwellization, motivated by the Rao-Blackwell theorem, is a kind of marginalization technique. It was first used in [175] to calculate the marginal density with a Monte Carlo sampling method. Casella and Robert [74] also developed Rao-Blackwellization methods for rejection sampling and for the Metropolis algorithm with an importance sampling procedure. Because of its intrinsic property of variance reduction, it has been used in particle filters to improve performance [304], [14], [315], [145], [119]. There are a couple of ways to use Rao-Blackwellization: (i) state decomposition; (ii) model simplification; and (iii) data augmentation, all of which are based on the underlying Rao-Blackwell theorem:

Theorem 5: [388] Let f̂(Y) be an unbiased estimate of f(x) and let Ψ be a sufficient statistic for x. Define f̂(Ψ(y)) = E_{p(x)}[f̂(Y)|Ψ(Y) = Ψ(y)]; then f̂(Ψ(Y)) is also an unbiased estimate of f(x). Furthermore,

Var_{p(x)}[f̂(Ψ(Y))] ≤ Var_{p(x)}[f̂(Y)],

with equality if and only if Pr(f̂(Y) = f̂(Ψ(Y))) = 1.

The proof of this theorem is based on Jensen's inequality (see e.g. [462]). The importance of the Rao-Blackwell theorem is that, with a sufficient statistic Ψ, we can improve any unbiased estimator that is not a function of Ψ by conditioning on Ψ; in addition, if Ψ is sufficient for x and if there is a unique function of Ψ that is an unbiased estimate of f(x), then that function is a minimum-variance unbiased estimate of f(x).

For the dynamic state space model, the basic principle of Rao-Blackwellization is to exploit the model structure in order to improve the inference efficiency and consequently to reduce the variance. For example, we can attempt to decompose the dynamic state space into two parts, one part being calculated exactly using the Kalman filter, the other part being inferred approximately using the particle filter. Since the first part is inferred exactly and quickly, computing power is saved and the variance is reduced. The following observations were given in [143], [144]. Let the state vector be partitioned into two parts, xn = [x1n x2n], where the marginal density p(x2n|x1n) is assumed to be tractable analytically. The expectation of f(xn) w.r.t. the posterior can be rewritten as

E[f(xn)] = ∫ f(x1n, x2n) p(x1n, x2n|y0:n) dxn
        = ∫ λ(x1_{0:n}) p(x1_{0:n}) dx1_{0:n} / ∫ [∫ p(y0:n|x1_{0:n}, x2_{0:n}) p(x2_{0:n}|x1_{0:n}) dx2_{0:n}] p(x1_{0:n}) dx1_{0:n}
        = ∫ λ(x1_{0:n}) p(x1_{0:n}) dx1_{0:n} / ∫ p(y0:n|x1_{0:n}) p(x1_{0:n}) dx1_{0:n},

where

λ(x1_{0:n}) = ∫ f(x1n, x2n) p(y0:n|x1n, x2n) p(x2_{0:n}|x1_{0:n}) dx2_{0:n}.

The weighted Monte Carlo estimate is then given by

f̂_RB = Σ_{i=1}^{Np} λ(x1_{0:n}^(i)) W(x1_{0:n}^(i)) / Σ_{i=1}^{Np} W(x1_{0:n}^(i)).   (107)

The lower variance of the marginalized estimate is achieved because of the Rao-Blackwell theorem:

Var[f(x)] = Var[E[f(x1, x2)|x1]] + E[Var[f(x1, x2)|x1]].

It has been proved [143], [315] that the variance of the ratio of two joint densities is not less than that of the two marginal densities:

Var_q[p(x1, x2)/q(x1, x2)] = Var_q[∫ p(x1, x2) dx2 / ∫ q(x1, x2) dx2] + E_q[ Var_q[p(x1, x2)/q(x1, x2) | x1] ]
                          ≥ Var_q[∫ p(x1, x2) dx2 / ∫ q(x1, x2) dx2],   (108)

where

∫ p(x1, x2) dx2 / ∫ q(x1, x2) dx2 = E_q[ p(x1, x2)/q(x1, x2) | x1 ].

Hence, by decomposing the variance, it is easy to see that the variance of the importance weights obtained via Rao-Blackwellization is smaller than that obtained using the direct Monte Carlo method.

The Rao-Blackwellization technique is somewhat similar to the data augmentation method based on marginalization [445], in that it introduces a latent variable with assumed knowledge to ease the probabilistic inference. For instance, consider the following state-space model

xn+1 = f(xn, dn),   (109a)
zn = g(xn, vn),   (109b)
yn ∼ p(yn|zn),   (109c)

where the latent variable zn is related to the measurement yn through an analytic (e.g. exponential family) conditional pdf p(yn|zn). Hence, the state estimation problem can be written as

p(x0:n|y0:n) = ∫ p(x0:n|z0:n) p(z0:n|y0:n) dz0:n.   (110)

The probability distribution p(z0:n|y0:n) is approximated by Monte Carlo simulation:

p(z0:n|y0:n) ≈ Σ_{i=1}^{Np} Wn^(i) δ(z0:n − z0:n^(i)),   (111)

thus the filtered density p(xn|y0:n) is obtained as

p(xn|y0:n) ≈ Σ_{i=1}^{Np} Wn^(i) p(xn|z0:n^(i)),   (112)

which is a form of mixture model. When p(xn|z0:n^(i)) is Gaussian, this can be done by conventional Kalman filter techniques, as exemplified in [83], [14], [325]; if f and/or g are nonlinear, p(xn|z0:n^(i)) can be inferred by running a bank of EKFs. For any nonlinear function f(x), Rao-Blackwellization achieves a lower-variance estimate:

Var[f(xn)|y0:n] ≥ Var[ E[f(xn)|z0:n, y0:n] | y0:n ].

Remarks:
• In practice, an appropriate model transformation (e.g. from Cartesian coordinates to polar coordinates) may simplify the model structure and admit Rao-Blackwellization.57
• Examples of marginalized Rao-Blackwellization in particle filtering are the Conditionally Gaussian State-Space Model, the Partially Observed Gaussian State-Space Model, and the Finite State HMM Model. Rao-Blackwellization can also be used for MCMC [74].
• Similar to the idea of the APF, Rao-Blackwellization can also be done one step ahead [338], in which the sampling and resampling steps are switched when the importance weights are independent of the measurements and the importance proposal distribution can be analytically computed.

57 The same idea was often used in the EKF for improving the linearization accuracy.
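The content of Theorem 5 can be demonstrated numerically in a toy setting (an assumed illustration, not an example from the text): both estimators below are unbiased for E[x1·x2] = rho under a standard bivariate Gaussian, but the conditioned one, which averages E[x1·x2 | x1] = rho·x1², has smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, Npart = 0.5, 2000

# joint samples of (x1, x2) from a standard bivariate Gaussian with corr. rho
x1 = rng.standard_normal(Npart)
x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(Npart)

plain = np.mean(x1 * x2)               # crude MC estimate of E[x1*x2]
rao_blackwell = np.mean(rho * x1 ** 2)  # conditioned: E[x1*x2 | x1] = rho*x1^2

# Both estimates are unbiased for rho; repeated runs with different seeds show
# the lower variance of the conditioned (Rao-Blackwellized) estimator.
```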
G. Kernel Smoothing and Regularization

In their seminal paper [193], Gordon, Salmond and Smith used an ad hoc approach called jittering to alleviate the sample impoverishment problem. In each time step, a small amount of Gaussian noise is added to each resampled particle, which is equivalent to using a Gaussian kernel to smooth the posterior. Another byproduct of jittering is to prevent the filter from diverging, as is similarly done in the EKF literature.

Motivated by the kernel smoothing techniques in statistics, we can use a kernel to smooth the posterior estimate by replacing the Dirac delta function with a kernel function58

p(xn|y0:n) ≈ Σ_{i=1}^{Np} Wn^(i) Kh(xn, xn^(i)),   (113)

where Kh(x) = h^{−Nx} K(x/h), with K being a symmetric, unimodal and smooth kernel and h > 0 being the bandwidth of the kernel. Candidate kernels include the Gaussian kernel and the Epanechnikov kernel [345]

K(x) = (Nx + 2)/(2 V_{Nx}) (1 − ||x||²)  if ||x|| < 1,  and 0 otherwise,   (114)

where V_{Nx} denotes the volume of the unit hypersphere in R^{Nx}. The variance-reduction advantage of kernel smoothing comes at the cost of an increase in bias, but this problem can be alleviated by gradually decreasing the kernel width h as time progresses, an approach employed in [481].

58 It was also called localization sampling or local multiple imputation [3].

Kernel smoothing is de facto a regularization technique [87]. Some regularized particle filters (RPF) were also developed in the past few years [222], [364], [365], [345]. Within the particle filtering update, regularization can be taken before or after the correction step, resulting in the so-called pre-regularized particle filter (pre-RPF) and post-regularized particle filter (post-RPF) [345]. The pre-RPF is also close to the kernel particle filter [222], where the kernel smoothing is performed in the resampling step. The implementation of the RPF is similar to that of the regular particle filter, except in the resampling stage. For the post-RPF, the resampling procedure reads as follows [345] (a code sketch is given after this list):
• Generate ξ ∈ {1, · · · , Np} with Pr(ξ = i) = Wn^(i);
• Draw a sample from a selected kernel, s ∼ K(x);
• Generate the particles xn^(i) = xn^(ξ) + h An s, where h is the optimal bandwidth of the kernel, and An is chosen to be the square root of the empirical covariance matrix if whitening is used, otherwise An = I.

The resampling of the pre-RPF is similar to that of the post-RPF, except that an additional rejection step is performed; the reader is referred to [222], [345] for details. It was proved that the RPF converges to the optimal filter in the weak sense, with rate O(h² + 1/√Np); when h = 0, this reduces to the rate of the regular particle filter, O(1/√Np).

In [364], [345], an algorithm called "progressive correction" was proposed for particle filters, in which the correction step is split into several subcorrection steps associated with a decreasing sequence of (fictitious) variance matrices for the observation noise (similar to the idea of annealing). The intuition of progressive correction is to decompose the likelihood function into multiple stages, since the error induced in the correction step is usually unbounded (e.g. when the measurement noise is small) and thus deserves more attention. Though theoretically attractive, the implementation of partitioned sampling is quite complicated; the details are left for the interested reader and are not discussed here.
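A sketch of the post-RPF resampling step with a Gaussian kernel (one admissible choice of K) is given below; particles is assumed to be an (Np, Nx) array, and the bandwidth h is left as a user parameter rather than the optimal value discussed in [345].

```python
import numpy as np

def post_rpf_resample(particles, weights, h, rng=np.random.default_rng(0)):
    """Post-regularized resampling: select ancestors by weight, then jitter
    each copy with a scaled kernel sample x = x_xi + h*A*s (Gaussian K here)."""
    Npart, dim = particles.shape
    xi = rng.choice(Npart, size=Npart, p=weights)        # Pr(xi = i) = W_n^(i)
    cov = np.cov(particles.T) + 1e-10 * np.eye(dim)      # empirical covariance
    A = np.linalg.cholesky(cov)                          # whitening: A A^T = cov
    s = rng.standard_normal((Npart, dim))                # s ~ K (Gaussian kernel)
    return particles[xi] + h * s @ A.T
```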
H. Data Augmentation

The data augmentation idea arises from the missing data problem; it refers to a scheme of augmenting the observed data, thereby making the probabilistic inference easier. Data augmentation was first proposed by Dempster et al. [130] in a deterministic framework for the EM algorithm, and later generalized by Tanner and Wong [445] for posterior distribution estimation in a stochastic framework, where it can be viewed as a Rao-Blackwell approximation of the marginal density.

H.1 Data Augmentation is an Iterative Kernel Smoothing Process

Data augmentation is an iterative procedure for solving a fixed operator equation (the following content follows closely [445], [446]). Simply suppose

p(x|y) = ∫_Z p(x|y, z) p(z|y) dz,   (115)

p(z|y) = ∫_X p(z|x′, y) p(x′|y) dx′.   (116)

Substituting (116) into (115), it follows that the posterior satisfies

π(x) = ∫ K(x, x′) π(x′) dx′,   (117)

K(x, x′) = ∫ p(x|y, z) p(z|x′, y) dz,   (118)

where (118) is a Fredholm integral equation of the first kind, which can be written in the operator form

T f(x) = ∫ K(x, x′) f(x′) dx′,   (119)

where f is an arbitrary integrable function and T is an integral operator; (119) is an operator fixed-point equation. Noticing the mutual dependence of p(x|y) and p(z|y), by applying successive substitution we obtain the iterative method

πn+1(x) = (T πn)(x) = (T^{n+1} π0)(x).   (120)

It was shown in [445] that under some regularity conditions ||πn+1(x) − p|| ≤ ||πn(x) − p||, so πn(x) → p(x|y) as n → ∞.

If (T πn)(x) cannot be calculated analytically, then πn+1(x) can be approximated by Monte Carlo sampling,

πn+1(x) = (1/Np) Σ_{i=1}^{Np} p(x|y, z^(i)).   (121)

The quantities z^(i) are called multiple imputations by Rubin [395], [397]. The data augmentation algorithm consists of iterating the Imputation (I) step and the Posterior (P) step:
1. I-Step: Draw the samples {z^(i)}_{i=1}^{Np} from the current approximation πn(x) to the predictive distribution p(z|y), which comprises two substeps:
• Generate x^(i) from πn(x);
• Generate z^(i) from p(z|y, x^(i)).
2. P-Step: Update the current approximation to p(x|y) to be the mixture of conditional densities via (121), where p(x|y, z) is supposed to be analytically calculable or easy to sample.

H.2 Data Augmentation as a Bayesian Sampling Method

Data augmentation can be used as a Bayesian sampling technique in MCMC [388]. In order to generate a sample from a distribution π(x|y), the procedure proceeds as follows:
• Start with an arbitrary z^(0).
• For 1 ≤ k ≤ N, generate
  • x^(k) according to the conditional distribution π(x|y, z^(k−1));
  • z^(k) according to the conditional distribution π(z|y, x^(k)).

When N is large and the chain x^(k) is ergodic with invariant distribution π(x|y), the final sample x^(N) can be regarded as a sample x^(i) ∼ π(x|y).

The sample set {x^(i)}_{i=1}^{Np} obtained in this way has a conditional structure [175], [388]. It is interesting to note that one can take advantage of the dual samples {z^(i)}_{i=1}^{Np}. Indeed, if the quantity of interest is E_π[f(x)|y], one can calculate the average of conditional expectations whenever it is analytically computable,

ρ̂₂ = (1/Np) Σ_{i=1}^{Np} E_π[f(x)|y, z^(i)],   (122)

instead of the unconditional Monte Carlo average

ρ̂₁ = (1/Np) Σ_{i=1}^{Np} f(x^(i)).   (123)

The justification for substituting (123) with (122) is the Rao-Blackwell theorem, since

E_π[(ρ̂₁ − E_π[f(x)|y])² | y] = (1/Np) Var_π[f(x)|y]
                            ≥ (1/Np) Var_π[ E_π[f(x)|y, z] | y ]
                            = E_π[(ρ̂₂ − E_π[f(x)|y])² | y].

Generally, under a quadratic loss (or any other strictly convex loss), it is favorable to work with conditional expectations. Hence, data augmentation provides a way to approximate the posterior p(x|y) by the average of the conditional densities [388]

p(x|y) = (1/Np) Σ_{i=1}^{Np} p(x|y, z^(i)),   (124)

which is identical to (121).

Remarks:
• Data augmentation can be viewed as a two-step Gibbs sampling in which the augmented data z and the true state x are alternately marginalized.
• In the APF, the auxiliary variable can be viewed as a sort of data augmentation technique.
• Similar to the EM algorithm [130], the data augmentation algorithm exploits the simplicity of the posterior distribution of the parameter given the augmented data. A detailed discussion of state-of-the-art data augmentation techniques can be found in [461], [328].
• A comparative discussion of data augmentation and SIR methods is given in [445].

I. MCMC Particle Filter

When the state space is of very high dimension (say Nx > 10), the performance of particle filters depends to a large extent on the choice of proposal distribution. In order to tackle more general and more complex probability distributions, MCMC methods are needed. In the particle filtering framework, MCMC is used for drawing the samples from an invariant distribution, either in the sampling step or in the resampling step.

Many authors have tried to integrate the MCMC technique into particle filtering, e.g., [40], [304], [162], [315], [370], [164]. Berzuini et al. [40] used Metropolis-Hastings importance sampling for the filtering problem. Recalling the Metropolis-Hastings algorithm in Section V-G.6, within the Bayesian estimation framework π(x) = p(x|y) ∝ p(x)p(y|x), the proposal q(x′, x) is rewritten as q(x|x′), and the acceptance probability (for moving from x to x′) in (72) can be rewritten as

α(x, x′) = min{ p(y|x′) p(x′) q(x|x′) / (p(y|x) p(x) q(x′|x)), 1 }.   (125)

Provided we use the prior as proposal (i.e. q(x|x′) = p(x)), (125) reduces to

α(x, x′) = min{ p(y|x′)/p(y|x), 1 },   (126)

which says that the acceptance rate depends only on the likelihood. Equivalently, we can define the transition function K(x, x′) = p(x′|x) as

K(x, x′) = q(x′) min{1, W(x′)/W(x)}   if x′ ≠ x,
K(x, x′) = 1 − ∫_{z≠x} q(z) min[1, W(z)/W(x)] dz   if x′ = x,

where W(x) = p(x)/q(x) represents the importance weight. The samples are drawn from the Metropolis-Hastings algorithm only after the "burn-in" time of the Markov chain; namely, the samples during the burn-in time are discarded, and the next Np samples are stored.59 However, there are some disadvantages of this algorithm. When the dynamic noise (Σd) is small,60 the Markov chain usually takes a long time to converge, and the burn-in time varies.

59 It was also suggested by some authors to discard the burn-in period for particle filters for the purpose of on-line processing.
60 Σd is directly related to the variation of samples drawn from the transition prior, and consequently related to the sample impoverishment problem.

It was also suggested to perform a reversible jump MCMC step, after the resampling, on each particle, in order to increase the diversity of the simulated samples without affecting the estimated posterior distribution (see Fig. 10). The advantages are twofold [41]: (i) if the particles are already distributed according to the posterior, then applying a Markov-chain transition kernel with the same invariant distribution will not change the new particles' distribution; in addition, it also reduces the correlations between the particles; (ii) on the other hand, if the particles are not in the region of interest, the MCMC step may be able to move them to the interesting part of the state space. Nevertheless, adding an MCMC move step also increases the computational burden of the particle filter, so the merit of such a step can only be justified by the specific application.

Fig. 10. Sampling-importance-resampling (SIR) followed by a reversible jump MCMC step. The particles are moved w.r.t. an invariant transition kernel without changing the distribution.

One special MCMC particle filter is the resample-move algorithm [186], [41], which combines SIR and MCMC sampling; it was shown experimentally that this methodology can somewhat alleviate the progressive degeneration problem. The basic idea is as follows [186]: the particles are grouped into a set Sn = {xn^(i)}_{i=1}^{Np} at time step n, and they are propagated through the state-space equations by using SIR and MCMC sampling; at time n + 1, the resampled particles are moved according to a Markov chain transition kernel to form a new set Sn+1. In the rejuvenation stage, two steps are performed: (i) in the resample step, draw the samples {xn^(i)} from Sn such that they are selected with probability proportional to {W(xn^(i))}; (ii) in the move step, the selected particles are moved to new positions by sampling from a Markov chain transition kernel (a sketch of such a move step is given at the end of this subsection). The resample-move algorithm essentially includes SIS [200], [506], [266] as a special case, where the rejuvenation step is neglected, as well as the previous work by West [481] and Liu and Chen [304], in the latter of which a Gibbs sampling form of the move step was performed.

Lately, Fearnhead [164] has proposed an efficient method to implement the MCMC step for the particle filter based on sufficient statistics. Usually, the whole trajectories of the particles need to be stored [186]; Fearnhead instead used a summary of the trajectories as sufficient statistics on which the MCMC move is applied. Let Ψ = Ψ(x0:n−1, z0:n) denote the sufficient statistics for xn; according to the Factorization theorem (e.g. [388]), the unnormalized joint distribution can be factorized into the product of two functions:

π(xn, x0:n−1, z0:n) = λ1(xn, Ψ) λ2(x0:n−1, z0:n).

The implementation idea is to assume that the invariant distribution is p(xn|Ψ), conditioning on the sufficient statistics instead of on the whole state and measurement trajectories. The sufficient statistics are also allowed to be updated recursively; see [164] for some examples.
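As an illustration of the move step, the following sketch applies one random-walk Metropolis-Hastings transition, with the current posterior as invariant distribution, to every resampled particle; the vectorized unnormalized log-posterior log_post and the step size are assumed inputs.

```python
import numpy as np

def mcmc_move(particles, log_post, step=0.1, rng=np.random.default_rng(0)):
    """Move step of resample-move: one M-H transition per particle with the
    posterior as invariant distribution, so the particle distribution is
    unchanged while sample diversity increases."""
    proposals = particles + step * rng.standard_normal(particles.shape)
    log_alpha = log_post(proposals) - log_post(particles)  # symmetric proposal
    accept = np.log(rng.uniform(size=len(particles))) < log_alpha
    particles[accept] = proposals[accept]
    return particles
```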
J. Mixture Kalman Filters

The mixture Kalman filter (MKF) is essentially a stochastic bank of (extended) Kalman filters, each of which is run within a Monte Carlo sampling approach. The idea was first explored in [6], and further explored by Chen and Liu [83] (also implicitly in [144]) with resampling and rejection control schemes. This also follows West's idea that the posterior can be approximated by a mixture model [481]. In fact, the MKF can be viewed as a special case of the particle filter with marginalization and Rao-Blackwellization on a conditionally Gaussian linear dynamic model. The advantage of the MKF is its obvious computational efficiency; it has also found many successful applications in tracking and communications [83], [84], [476].

K. Mixture Particle Filters

It is necessary to discriminate between two kinds of mixture particle filters in the literature: (i) mixture posterior (arising from a mixture transition density or a mixture measurement density), and (ii) mixture proposal distribution. An example of the first kind is the Gaussian sum particle filter [268], where the posterior is approximated by a Gaussian sum, which can be further used in a sampling-based particle filter for inference. Examples of the second kind were proposed by many authors from different perspectives [162], [69], [370], [144], [459]. The mixture proposal is especially useful and efficient in situations where the posterior is multimodal. We give a more general discussion as follows.
The idea is to assume that the underlying posterior is a mixture distribution, such that we can decompose the proposal distribution in a similar way. For instance, to calculate an expected function of interest, we have

E[f(x)] = ∫ f(x) Σ_{j=1}^{m} cj pj(x) dx
       = Σ_{j=1}^{m} cj ∫ f(x) pj(x) dx
       = Σ_{j=1}^{m} cj ∫ f(x) [pj(x)/qj(x)] qj(x) dx
       = Σ_{j=1}^{m} ∫ Wj(x) f(x) qj(x) dx,   (127)

where Wj(x) = cj pj(x)/qj(x). Namely, for m mixture components qj(x) with a total number of Np particles, each mixture has Np/m particles if allocated evenly (but not necessarily). The form of qj(x) can differ, and the number of particles associated with qj(x) can also differ according to prior knowledge (e.g. their variances). In this context, we have the mixture particle filters (MPF). Each particle filter has an individual proposal. The idea of the MPF is similar to the stratified sampling and partitioned sampling ideas, and it includes the idea of using the EKF/UKF as a Gaussian proposal approximation as a special case, as will be discussed shortly. Also note that the MPF allows parallel implementation, and each proposal distribution allows a different form and sampling scheme.

The estimate given by the MPF is represented as

E[f(xn)] = Σ_{j=1}^{m} ∫ Wn,j(xn) f(xn) qj(xn|Yn) dxn
        = Σ_{j=1}^{m} E_{qj(xn|Yn)}[Wn,j(xn) f(xn)] / E_{qj(xn|Yn)}[Wn,j(xn)]
        ≈ Σ_{j=1}^{m} Σ_{i=1}^{Np/m} W̃j,n(x_{j,n}^(i)) f(x_{j,n}^(i)),   (128)

where W̃j,n(x_{j,n}^(i)) is the normalized importance weight of the j-th mixture associated with the i-th particle.

L. Other Monte Carlo Filters

There are also some other Monte Carlo filters that have not been covered in our paper, which are either not updated sequentially (but still of a recursive nature), or are based on HMC or QMC methods. Due to the space constraint, we do not extend the discussion and only refer the reader to the specific references.
• Gibbs sampling for the dynamic state space model [71], [72]. These Monte Carlo filters are useful when the real-time processing requirement is not too demanding.
• Quasi Monte Carlo filters or smoothers, which use the Metropolis-Hastings algorithm [440], [443].
• Non-recursive Monte Carlo filters [439], [438], [443].
• Particle filters based on the HMC technique [94].
• Particle filters based on the QMC and lattice techniques [361].
• Annealed particle filter [131].
• The branching and interacting particle filters discussed in the continuous-time domain [122], [123], [125], [104], [105].
• Genetic particle filter via evolutionary computation [455].

M. Choices of Proposal Distribution

The potential criteria for choosing a good proposal distribution should include:
• The support of the proposal distribution should cover that of the posterior distribution; in other words, the proposal should have a broader distribution.
• The proposal distribution should have long-tailed behavior to account for outliers.
• Ease of sampling implementation, preferably with linear complexity.
• It should take into account the transition prior and the likelihood, as well as the most recent observation data.
• It should achieve minimum variance.
• It should be close (in shape) to the true posterior.

However, achieving any of these goals is not easy, and we don't know what the posterior is supposed to look like. Theoretically, it was shown [506], [6], [266] that the choice of proposal distribution q(xn|x0:n−1^(i), y0:n) = p(xn|xn−1^(i), yn) minimizes the variance of the importance weights Wn^(i) conditional upon x0:n−1^(i) and y0:n (see [144] for a simple proof). With this choice, the importance weights can be recursively calculated as Wn^(i) = Wn−1^(i) p(yn|xn−1^(i)). However, this optimal proposal distribution suffers from certain drawbacks [144]: it requires sampling from p(xn|xn−1^(i), yn) and evaluating the integral p(yn|xn−1^(i)) = ∫ p(yn|xn) p(xn|xn−1^(i)) dxn.61 On the other hand, it should also be pointed out that there is no universal choice of proposal distribution; the choice is usually problem dependent. Choosing an appropriate proposal distribution requires a good understanding of the underlying problem. In the following, we present some rules of thumb available in the literature and discuss their features.

61 Generally the integral has no analytic form and thus requires approximation; however, it is possible to obtain the analytic evaluation in some cases, e.g. the Gaussian state-space model with nonlinear state equation.

M.1 Prior Distribution

The prior distribution was first used as the proposal distribution [200], [201] because of its intuitive simplicity. If q(xn|x0:n−1, y0:n) = p(xn|xn−1), the importance weights are updated by

Wn^(i) = Wn−1^(i) p(yn|xn^(i)),   (129)

which essentially neglects the effect of the most recent observation yn (a sketch contrasting this with the optimal proposal is given below).
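For the analytically tractable case mentioned in footnote 61 — a Gaussian state-space model with nonlinear state equation and, here, a scalar linear observation yn = xn + vn (an assumed illustrative model, not one prescribed by the text) — both the optimal proposal p(xn|xn−1, yn) and the weight update Wn = Wn−1 p(yn|xn−1) have closed forms:

```python
import numpy as np

def optimal_proposal_step(particles, weights, y, f, sig_d, sig_v,
                          rng=np.random.default_rng(0)):
    """Optimal-proposal update for x_n = f(x_{n-1}) + d_n, y_n = x_n + v_n
    with scalar Gaussian noises d ~ N(0, sig_d^2), v ~ N(0, sig_v^2):
    q(x_n|x_{n-1}, y_n) is Gaussian, and the weight update uses the
    predictive likelihood p(y_n|x_{n-1}) -- unlike (129), which ignores y_n."""
    fx = f(particles)
    s2 = 1.0 / (1.0 / sig_d ** 2 + 1.0 / sig_v ** 2)     # proposal variance
    m = s2 * (fx / sig_d ** 2 + y / sig_v ** 2)          # proposal mean
    new_particles = m + np.sqrt(s2) * rng.standard_normal(len(particles))
    pred_var = sig_d ** 2 + sig_v ** 2                   # Var[y_n | x_{n-1}]
    weights = weights * np.exp(-0.5 * (y - fx) ** 2 / pred_var)
    return new_particles, weights / weights.sum()
```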
In the CONDENSATION (CONditional DENSity propagATION) algorithm [229], [230], a transition prior was used as the proposal distribution for visual tracking. This kind of proposal distribution is easy to implement, but it usually results in high variance because the most recent observation yn is neglected in p(xn|xn−1). The problem becomes more serious when the likelihood is peaked and the predicted state is near the likelihood's tail (see Fig. 11 for an illustration); in other words, when the measurement noise model is sensitive to outliers.

From (129), we know that the importance weights are proportional to the likelihood model. It is obvious that W(x) will be very uneven if the likelihood model is not flat. In the Gaussian measurement noise situation, the flatness is determined by the variance. If Σv is small, the distribution of the measurement noise is peaked, hence W(x) will be peaked as well, which makes the sample impoverishment problem more severe. Hence we can see that choosing the transition prior as proposal is really a brute-force approach whose result can be arbitrarily bad, though it has been widely used in the literature and sometimes produces reasonably good results (really depending on the noise statistics!). Our caution is: do not reach for this proposal model unless you know something about your problem; do not use something just because of its simplicity!

For some applications, state equations are modeled as an autoregressive (AR) model xn+1 = An xn + dn, where the time-varying An can be determined sequentially or block-by-block (by solving the Yule-Walker equations). In on-line estimation, it can be augmented into a pseudo-state vector. However, it should be cautioned that for a time-varying AR model, the use of the transition prior proposal is not recommended. Many experimental results have confirmed this [189], [467]. This is due to the special stability condition of the AR process. When the Monte Carlo samples of the AR coefficients are generated in violation of the stability condition, the AR-driven signal will oscillate and the filtered states will deviate from the true ones. The solution to this problem is Rao-Blackwellization [466] or a careful choice of proposal distribution [189].

M.2 Annealed Prior Distribution

The motivation for using the transition prior as proposal is its simplicity. However, it doesn't take account of the noise statistics Σd and Σv. Without too much difficulty, one can imagine that if the samples drawn from the prior don't cover the likelihood region, the performance of the particle filter will be very poor, since the contributions of most particles are negligible.63 If we let q(xn|xn−1, yn) = p(xn|xn−1)^β, then

Wn = Wn−1 p(yn|xn) p(xn|xn−1) / q(xn|xn−1, yn)
   = Wn−1 p(yn|xn) p(xn|xn−1) / p(xn|xn−1)^β
   = Wn−1 p(yn|xn) p(xn|xn−1)^α,

where α = 1 − β and 0 ≤ α ≤ 1. When α = 1, this reduces to the normal SIR filter (129); when α = 0, it is equivalent to taking a uniform distribution (infinitely flat) as proposal. The choice of the annealing parameter α depends on the knowledge of the noise statistics:
• When Σd < Σv, the support of the prior distribution lies largely outside the flat likelihood (see the first illustration of Fig. 11). In this case, we let 0 < α < 1, which makes the shape of the prior flatter. This is also tantamount to the effect of "jitter": adding some artificial noise makes the drawn samples more broadly located.64
• When Σd ≈ Σv, most of the support of the prior overlaps that of the likelihood (see the second illustration of Fig. 11). In this case, the prior proposal is fine and we let α = 1.
• When Σd > Σv, the prior is flat compared to the peaked likelihood (see the third illustration of Fig. 11). In this case, we cannot do much about it by changing α,65 and we will discuss this problem in detail in subsections M.3 and M.5.

Another perspective from which to understand the parameter β is the following: by taking the logarithm of the posterior p(xn|y0:n), we have

log p(xn|y0:n) ∝ log p(yn|xn) + β log p(xn|xn−1),

which essentially states that the log-posterior can be interpreted as a penalized log-likelihood, with log p(xn|xn−1) acting as a smoothing prior and β a tuning parameter controlling the trade-off between likelihood and prior.

M.3 Likelihood

When the transition prior is used as proposal, the current observation yn is neglected. However, the particles that have larger importance weights at the previous time step n − 1 don't necessarily have large weights at the current step n. In some cases, the likelihood is far tighter than the prior and is comparably closer (in shape) to the posterior. Hence we can employ the likelihood as proposal distribu-
are insignificant. This fact further motivates us to use an- tion,66 which results in the likelihood particle filter. The
nealed prior as proposal to alleviate this situation. idea behind that is instead of drawing samples from the
Recall the update equation of importance weights (88), state transition density and then weighting them according
to their likelihood, samples are drawn from the likelihood
63 β can be viewed as a variational parameter.
64 The pdf of the sum of two random variables is the convolution of
the two pdf’s of respective random variables.
65 Note that letting α > 1 doesn’t improve the situation.
62 A sufficient condition for stability of AR model is that the poles 66 Here likelihood can be viewed as an “observation density” in
are strictly within the unit circle. terms of the states.
MANUSCRIPT 37

p(x n |x n-1 ) p(y n |x n )

;;; ;;;
;; ;;
;;
;
p(y n |x n ) p(x n |x n-1 ) p(y n |x n )
p(x n |x n-1 )

Fig. 11. Left: Σd < Σv , transition prior p(xn |xn−1 ) is peaked compared to the flat likelihood p(yn |xn ), and their overlapping region is
indicated by the thick line; Middle: Σd ≈ Σv , the support of prior and likelihood largely overlap, where the prior proposal works well;
Right: an illustration of poor approximation of transition prior as proposal distribution when the likelihood is peaked Σd > Σv . Sampling
from the prior doesn’t generate sufficient particles in the overlapping region.

TABLE VII
and then assigned weights proportional to the state transi-
tion density.67 In some special cases where the likelihood Likelihood Particle Filter (an example in the text).

model can be inverted easily xn = g−1 (yn , vn ), one can al-


teratively use likelihood as proposal distribution. To give For time steps n = 0, 1, 2, · · ·
(i)
an example [19], assume the likelihood model is quadratic, 1: Draw i.i.d. samples sn ∼ p̂(sn
|yn ) ∝ p(yn |sn );
say yn = Gn x2n + vn , without loss of generality. Denote (i)
2: u = U (0, 1), xn = sgn(u − 12 )
(i)
sn ;
sn = |xn |2 , then we can sample sn from the equation 3:
(i) (i) (i) (i)
Importance weight update: Wn = Wn−1 p(xn |xn−1 )|xn |;
(i)

sn = G−1 n (yn − vn ). From the Bayes rule, the proposal 4:


(i)
Weight normalization to get W̃n ;
can be chosen to be [19] 5:
(i) (i) Np
Resampling to get new {xn , Wn }i=1 using SIS procedure.

p(yn |sn ), if sn ≥ 0
p(sn |yn ) ∝ , (130)
0, otherwise
• Note that it is not always possible to sample from like-
(i) lihood because the mapping yn = g(xn , vn ) is usu-
then p(xn |sn ) is chosen to be a pair of Dirac delta func-
ally many-to-one. Above example is only a two-to-one
tions
mapping whose distribution p(xn |yn ) is bimodal.
 
(i) (i) • It is cautioned that using likelihood as proposal dis-
δ xn − sn + δ xn + sn tribution will increase the variance of the simulated
(i)
p(xn |sn ) = . (131)
2 samples. For instance, from the measurement equa-
tion yn = xn + vn (vn ∼ N (0, Σv )), we can draw
(i)
By letting the proposal q(xn |xn−1 , y0:n ) ∝ p(xn |sn )p(sn |yn ), (i) (i)
samples from xn = yn − vn , thus E[xn ] = E[yn ],
(i)
The importance weights Wn are updated as [19] Var[xn ] = Var[yn ] + Σv . This is a disadvantage for
the Monte Carlo estimate. Hence it is often not rec-
(i)
(i) (i) p(xn |yn ) ommended especially when Σv is large.
Wn(i) ∝ Wn−1 p(x(i)
n |xn−1 ) (i)
, (132)
p(sn |yn )
M.4 Bridging Density and Partitioned Sampling
n |yn )
p(x(i)
where the ratio (i) is the determinant of the Jacobian Bridging density [189], was proposed for proposal distri-
p(sn |yn )
of the transformation from sn to xn [19] bution as an intermediate distribution between the prior
and likelihood. The particles are reweighed according to
(i)
p(xn |yn )  ds  the intermediate distribution and resampled.
 n
(i)
∝   = 2|xn |. (133) Partitioned sampling [313], was also proposed for a pro-
p(sn |yn ) dxn
posal distribution candidate, especially when the distribu-
Hence (132) is rewritten as tions are the functions of part of the states and the peaked
likelihood can be factorized into several broader distribu-
(i) (i) tions. The basic procedure is as follows [313], [314]:
Wn(i) ∝ Wn−1 p(x(i) (i)
n |xn−1 )|xn |. (134)
• Partition the state space into two or more parts;
Taking the likelihood as proposal amounts to pushing the • Draw the samples in the partitioned space, and pass
particles to the high likelihood region, this is efficient when the samples into the factorized dynamics respectively;
the transition prior is broad (Σd is large) compared to the • Generate new particle sets via resampling.
peaked likelihood (Σv is small). In above quadratic likeli- Since the particles are drawn independently from different
hood example, the procedure of likelihood particle filter is partitioned spaces, which are little or not correlated, par-
given in Table VII. titioned sampling leads to a considerable improvement in
Remarks: sampling efficiency and reduction of the need of the sam-
67 The likelihood particle filter is similar but not identical to the
ples. This scheme is very useful especially when the mea-
APF in that neither the auxiliary variable is introduced, nor is the surement components are independent and have different
mixture density proposal involved. individual likelihood models, e.g. [313], [464].
MANUSCRIPT 38

M.5 Gradient-Based Transition Density where dn and vn are assumed to be Gaussian distributed.
Following [143], [144], we denote the log-likelihood of
Bearing in mind the second and third proposal crite-
p(xn |xn−1 , yn ) as l(x) = log p(xn |xn−1 , yn ), and
ria in the beginning of this subsection, we also proposed
∂l(x)  ∂l2 (x) 
another proposal distribution by using the gradient infor-
mation [88]. Before sampling from the transition density l (x) =  , l (x) =  ,
(i) ∂x x=xn ∂x∂xT x=xn
xn ∼ p(xn |xn−1 ), we attempt to use the information ig-
nored in the current observation yn . To do that, we plug thus l(xn ) can be approximated by the second-order Taylor
in an intermediate step (MOVE-step) to move the particles series:
in previous step towards the gradient descent direction, 68
by using first-order information. The idea behind that is 1
l(xn ) ≈ l(x) + l (x)(xn − x) + (xn − x)T l (x)(xn − x).
to push the particles into the high likelihood region, where 2
the likelihood is evaluated by current observation yn and
previous state xn−1 . For instance, the MOVE-step can be Under the assumption that l(xn ) being concave, the pro-
implemented through posal distribution can be shown to have a Gaussian distri-
bution
• Gradient descent

∂(yn − g(x))2  q(xn |xn−1 , yn ) ∼ N (μ(x) + x, Σ(x)), (136)


x̂n|n−1 = x̂n−1|n−1 − η  ,
∂x x=x̂n−1|n−1
where the covariance and mean are given by Σ(x) =
where the scalar 0 < η < 1 is the learning rate param- −l (x)−1 and μ(x) = Σ(x)l (x), respectively; when
eter. p(xn |xn−1 , yn ) is unimodal, it reduces to the zero mean
• Natural gradient μ(x) = 0.

∂(yn − g(x))2  M.7 Unscented Particle Filter


x̂n|n−1 = x̂n−1|n−1 − ηΣ−1
d  ,
∂x x=x̂n−1|n−1 In [459], [474], the unscented Kalman filter (UKF) was
used to approximate the proposal distribution of the par-
• EKF updates [120] ticle filter, which results in the so-called unscented particle
filter (UPF). The advantage of UKF over EKF to approx-
Pn|n−1 = Pn−1|n−1 + Σd imate the proposal distribution lies in the fact that UKF
Kn = Pn|n−1 ĜTn (Ĝn Pn|n−1 ĜTn + Σv )−1 can better handle the heavy-tailed distributions thus more
tailored for non-Gaussian scenarios. In fact, UPF has been
x̂n|n−1 = x̂n−1|n−1 + Kn (yn − g(x̂n−1|n−1 ))
successfully applied in object tracking [398], financial time
Pn|n = Pn|n−1 − Kn Ĝn Pn|n−1 , series modeling, robot navigation. Detailed implementa-
tion of UPF is referred to [459], [474]. EKF proposal and
where Ĝn = ∂g(x)
∂x |x=x̂n|n−1 .
UPF both use Gaussian approximation of proposal, but
The MOVE-step is followed by the normal sampling UKF produces more accurate estimate than EKF and it is
from transition density, this new proposal distribution can derivative-free.
be understood as a one-step-ahead transition density in a
sense that it uses the likelihood model (gradient informa- N. Bayesian Smoothing
tion) a priori to help choose samples. In this sense, it is As discussed in the beginning, filtering technique can
similar to the APF and likelihood particle filter. For more be extended to the smoothing problem,69 where the fu-
discussions and experimental results of this gradient-based ture observations are allowed to estimate current state. In
SIR filter, see [88]. the Bayesian/particle filtering framework, the task is to es-
timate the posterior density p(xn |y0:n+τ ). In particular,
M.6 EKF as Proposal Distribution three kinds of smoothing are discussed in the below.
The proposal distribution q(xn |xn−1 , yn ) can be as-
sumed to be a parameterized mixture distribution (e.g. N.1 Fixed-point smoothing
Gaussian mixture), with finite-dimensional parameters de- Fixed-point smoothing is concerned with achieving
termined by xn−1 and yn . If the optimal proposal distri- smoothed estimate of state xn at a fixed point n, i.e. with
bution is nonlinear, it can be approximated by an EKF, obtaining x̂n|n+τ for fixed n and all τ ≥ 1. In linear case,
as shown in [144], [83]. In this case, the state-space model the fixed-point smoothing problem is a Kalman filtering
reduces to a nonlinear additive Gaussian model: problem in disguise and therefore able to be solved by di-
rect use of Kalman filter techniques [12]. Suppose the index
xn+1 = f (xn ) + dn , (135a) of the fixed point is m at time step n (m ≤ n), we want
yn = g(xn ) + vn , (135b) to estimate the posterior p(xm |y0:n ). By forward filtering
68 Similar idea was also used in [120] for training neural networks. 69 The multiple-step ahead prediction was discussed in [144], [443].
MANUSCRIPT 39

forward sampling, at time n we know the posterior distri- [98] proposed an alternative way. Using Bayes rule, the
bution P (x0:n |y0:n ), by marginalization, we can obtain fixed-lag smoothing density is factorized by
p(yn+τ |y0:n+τ −1 , x0:n )p(x0:n |y0:n+τ −1 )

Np
p(x0:n |y0:n+τ ) =
P (xm |y0:n ) ≈ W̃n(i) δ(xm − x(i)
m ),
p(yn+τ |y0:n+τ −1 )
i=1 p(yn+τ |y0:n+τ −1 , xn )
= ×
namely, we use current important weights to replace the p(yn+τ |y0:n+τ −1 )
previous values. p(xn |yn:n+τ −1 , x0:n−1 )p(x0:n−1 |y0:n+τ −1 ).
In the simplest case where only one-step backward Using a factorized proposal distribution
smoothing (i.e. τ = 1) is considered, it reduces to

n


Np q(x0:n |y0:n+τ ) = q(x0 |y0:τ ) q(xt |x0:t−1 , y0:t+τ )
(i)
P (xn−1 |y0:n ) ≈ W̃n(i) δ(xn−1 − xn−1 ), t=1
i=1 = q(xn |x0:n−1 , y0:n+τ )q(x0:n−1 |y0:n+τ −1 ),

the justification for this approximation is to assume the the unnormalized importance weights can be updated by
(i) (i)
important weights W̃n are more accurate than W̃m (and
(i) W (x0:n+τ ) = W (x0:n+τ −1 ) ×
W̃n−1 ), since they are calculated based on more informa-
tion. p(yn+τ |yn−1:n+τ −1 , x0:n )p(xn |yn:n+τ −1 , x0:n−1 )
.
If the fixed point is the current time step (i.e. τ = 0), q(xn |x0:n−1 , y0:n+τ )p(yn+τ |y0:n+τ −1 )
we can also smooth the estimate by sampling the state tra- Generally, p(yn+τ |yn−1:n+τ −1 , x0:n ) is not evaluated, but
(i) (i) (i)
jectory history [162]: xn ∼ p(xn |Xn−1 ) where Xn−1 = for sufficiently large τ , it can be approximately viewed as a
(n−τ ) (i)
{xn , · · · , xn−1 } (1 ≤ τ ≤ n). Namely, the current constant for all x0:n [98]. The fixed-lag smoothing is a for-
particles are sampled from a τ -length state history, and ward sampling backward chaining procedure. However, the
consequently the memory requirement is τ Np . The new smoothing density p(xn |yn+τ ) can be also obtained using
(i) the filtered density instead of fixed-lag smoothing technique
state history Xn is generated by simply augmenting the
(i) (i) by using the forward filtering backward sampling technique
f (xn−1 , dn−1 ) to Xn−1 and discard the least recent one.
[71], [143], [98], [466]. Besides, the joint estimation problem
This procedure certainly is more computationally demand-
(with state and uncertain parameter) can be also tackled
ing.
using fixed-lag smoothing technique, reader is referred to
N.2 Fixed-lag smoothing [98] for details.

Fixed-lag smoothing is concerned with on-line smoothing N.3 Fixed-interval smoothing


of data where there is a fixed delay τ between state recep-
Fixed-interval smoothing is concerned with the smooth-
tion and the availability of its estimate, i.e. with obtaining
ing of a finite set of data, i.e. with obtaining x̂n|M for
x̂n|n+τ for all n and fixed τ .
fixed M and all n in the interval 0 ≤ n ≤ M . Fixed-
Similar to the fixed-point smoothing, at the step n + interval smoothing is usually discussed in an off-line esti-
τ , the particle filter yields the approximated distribution mation framework. But for short interval, the sequential
P̂ (x0:n+τ |y0:n+τ ) estimation is still possible with the increasing computer
power nowadays.

Np
(i) (i)
P̂ (x0:n+τ |y0:n+τ ) = W̃n+τ δ(x0:n+τ − x0:n+τ ). (137) Firstly in the forward step, we run a particle filter to
i=1
obtain p(xn |y0:n ) for all 0 < n < M . Secondly in the back-
ward step, the smoothing process is recursively updated
By marginalization, we can obtain the approximated fixed- by
lag smoothing distribution
p(xn:M |y0:M ) = p(xn+1:M |y0:M )p(xn |xn+1:M , y0:M )

Np
(i)
= p(xn+1:M |y1:M )p(xn |xn+1 , y0:n )
P̂ (xn |y0:n+τ ) ≈ W̃n+τ δ(xn − x(i)
n ). (138) p(xn+1 |xn , y0:n )p(xn |y0:n )
i=1 = p(xn+1:M |y1:M )
p(xn+1 |y0:n )
Hence in order to get the smoothing density, we need to re- (139)
store the trajectories of states and draw the samples from
respective distribution. Ideally this will give a better re- where the second step uses the assumption of first-order
sult, in practice however, this is not true. First, when τ Markov dynamics. In (139), p(xn:M |y0:M ) denotes cur-
is big, the approximations (137) and (138) are poor [144]; rent smoothed estimate, p(xn+1:M |y0:M ) denotes future
second, resampling brings inaccuracy to the approximation smoothed estimate, p(xn |y0:n ) is the current filtered es-
n+1 |xn ,y0:n )
especially in SIR where resampling is performed in every timate, p(x
p(xn+1 |y0:n ) is the incremental ratio of modified
iteration. To overcome these problems, Clapp and Godsill dynamics.
MANUSCRIPT 40

Similar to the fixed-lag smoothing, at time step n, we which can be used to approximate (142) to get P̂ (yn ) =
1
Np (i)
can have the following distribution i=1 Wn . However, this is an a priori likelihood
Np


Np P̂n|n−1 (yn ) which uses the predicted estimate x̂n|n−1 in-
(i) (i)
P̂ (x0:M |y0:M ) = W̃M δ(x0:M − x0:M ). stead of the filtered estimate x̂n|n ; on the other hand, the
i=1 resampling step makes the a posteriori likelihood estimate
impossible. Alternatively, we can use another method for
By marginalizing the above distribution, we can further estimating likelihood [144]. By factorization of (142), we
obtain p̂(xn |y0:M ) for any 0 ≤ n ≤ M . In practice, this is obtain
infeasible because of the weight degeneracy problem [144]:
(i) Np 
n
At time M , the state trajectories {x0:M }i=1 have been pos- p(y0:n ) = p(y0 ) p(yt |y0:t−1 ), (143)
sibly resampled many times (M −1 times in the worst case), t=1
hence there are only a few distinct trajectories at times n where
for n  M . Doucet, Godsill and Andrieu proposed [144] a 
new fixed-interval smoothing algorithm as follows. Rewrit- p(y |y
n 0:n−1 ) = p(yn |xn )p(xn |y0:n−1 )dxn
ing p(xn |y0:M ) via [258] 
 = p(yn |xn−1 )p(xn−1 |y0:n−1 )dxn−1 .
p(xn+1 |y0:M )p(xn+1 |xn )
p(xn |y0:M ) = p(xn |y0:n ) dxn+1 ,
p(xn+1 |y0:n ) where the first equality uses the predicted estimate (at time
step n) based on p(xn−1 |y0:n−1 ), and second equality uses
the smoothing density p(xn |y0:M ) is approximated by
the filtered estimate at time step n − 1. The likelihood
based these estimates are given respectively by

Np
(i) (i)
p̂(xn |y0:M ) = W̃n|M δ(xn − xn ), (140) Np
(i)
i=1 P̂ (yn |y0:n−1 ) = W̃n−1 p(yn |x(i)
n ), (144a)
i=1
where p̂(xn |y0:M ) is assumed to have the same support (de-
scribed by the particles) as the filtering density p̂(xn |y0:n ) 
Np
(i) (i)
P̂ (yn |y0:n−1 ) = W̃n−1 p(yn |xn−1 ). (144b)
but with different important weights. The normalized im-
(i) i=1
portance weights W̃n|M are calculated as follows:
(i) (i)
A detailed discussion on the likelihood estimate using
• Initialization: At time n = M , set W̃n|M = W̃M . different particle filters and different sampling schemes is
• Evaluation: For n = M − 1, · · · , 0, referred to [443].


Np (i) (j) (i)
(i) (i) W̃n p(xn+1 |xn ) P. Theoretical and Practical Issues
W̃n|M = W̃n+1|M Np (i) (j) (i)
(141)
i=1 W̃n p(xn+1 |xn )
j=1
P.1 Convergence and Asymptotic Results
As discussed earlier, although the convergence71 of
The derivation of (141) is referred to [144]. The algo-
Monte Carlo approximation is quite clear (e.g. [180]), the
rithmic complexity is O(M Np2 ) with memory requirement
convergence behavior of sequential Monte Carlo method
O(M Np ). Some other work on fixed-interval smoothing us-
or particle filter is different and deserves special attention.
ing rejection particle filters are found in [259], [438], [222].
Many authors have explored this issue from different per-
O. Likelihood Estimate spectives, but most results are available in the probability
literature. In particular, it has been theoretically shown
Particle filters can be also used to estimate the likeli- that under some mild conditions the particle methods con-
hood [259], [144], [223], wherever the maximum-likelihood verge to the solution of the Zakai equation [103], [107]
estimation principle can be applied.70 and Kushner-Stratonovich equation [104]. Crisan [106] pre-
Suppose we want to estimate the likelihood of the data sented a rigorous mathematical treatment of convergence
 of particle filters and gave the sufficient and necessary con-
p(y0:n ) = W (x0:n )q(x0:n |y0:n )dx0:n , (142) ditions for the a.s. convergence of particle filter to the true
posterior. A review of convergence results on particle filter-
as discussed earlier, if the proposal distribution is transition ing methods has been recently given by Crisan and Doucet
prior, the conditional likelihood (observation density) will from practical point of view [106], [102]. We summarize
be given by the main results from their survey paper.
Almost Sure Convergence: If the the transition ker-
1  (i)
Np
nel K(xt |xt−1 ) is Feller,72 , importance weights are up-
p̂(yn |xn ) = W (xn ), per bounded, and the likelihood function is continuous,
Np i=1 n
71 A brief introduction of different concepts of convergence is given
70 In fact, the Monte Carlo EM (MCEM), or quasi Monte Carlo EM in Appendix B.
algorithms can be developed within this framework [389], however, 72 A kernel is Feller means that for any continuous bounded function
further discussion is beyond the scope of current paper. φ, Kφ is also a continuous bounded function.
MANUSCRIPT 41

bounded, and strictly positive, then with Np → ∞ the where J(x) is the Fisher information matrix defined by
filtered density given by particle filter converges asymptot-
∂ T

ically to the true posterior. J(x) = E log p(x, y) log p(x, y) .
Mean Square Convergence: If likelihood function is ∂x ∂x
bounded, for any bounded function φ ∈ RNx , then for t ≥ If the estimate is unbiased (namely E[x̂ − x] = 0), then
0, there exists a Ct|t independent of Np s.t. E(x) is equal to the variance, and (146) reduces to

2 φ 2
E (P̂t|t , φ) − (Pt|t , φ) ≤ Ct|t , (145) E(x) ≥ J−1 (x) (147)
Np
 and the estimate satisfying (147) is called Fisher efficient.
where (P̂t|t , φ) = φ(x0:t )P (dx0:t |y0:t ), φ = sup |φ(x0:t )|. Kalman filter is Fisher-efficient under LQG circumstance in
x0:t
It should be cautioned that, it seems at the first sight that which the state-error covariance matrix plays a similar role
particle filtering method beats the curse of dimensional- as the inverse Fisher information matrix.76 Many efforts
ity,73 as the rate of convergence, 1/Np , is independent on were also devoted to studying the error bounds of non-
the state dimension Nx . This is nevertheless not true be- linear filtering [504], [45], [138], [188], [407], [451] (see also
cause in order to assure (145) holds, the number of par- [410] for a review and unified treatment, and the references
ticles Np needs to increase over the time since it depends therein). Naturally, the issue is also interesting within the
on Ct|t , a term that further relies on Nx . As discussed in particle filtering framework. Recently, it has been estab-
[102], in order to assure the uniform convergence, both Ct|t lished in [36] that under some regularity conditions, the
and the approximation error accumulates over the time.74 particle filters also satisfy the Cramér-Rao bound77
This phenomenon was actually observed in practice and ex-
emplified in [359], [116], [361]. Daum and Huang particu- E[x̃n x̃Tn ] ≥ Pn (148)
larly gave a critical comment on this problem and presented E[ x̃n 2 ] ≥ tr(Pn ) (149)
some empirical formula for complexity estimate. Besides,
where x̃n = xn − x̂n|n is the one-step ahead prediction
the uniform convergence and stability issues were also dis-
error, and
cussed in [294].
In a high-dimensional space (order of tens or higher), Pn+1 = Fn (P−1
n + Rn )
−1 −1 T
Fn + Gn Qn G−1
n ,
particle filters still suffer the problem of curse of dimen-  ∂ 
sionality. Empirically, we can estimate the requirement of P−1
0 = E − log p(x0 ) ,
∂x0 x0
the number of particles, although this bound in practice is  ∂ 
loose and usually data/problem dependent. Suppose the Fn = E f (xn , dn ) ,
minimum number is determined by the effective volume ∂xn
 ∂ 
(variance) of the search space (proposal) against the tar- R−1
n = E − log p(yn |xn ) ,
get space (posterior). If the proposal and posterior are ∂xn xn
 ∂ 
uniform in two Nx -dimensional hyperspheres with radii r GTn = E f (xn , dn ) ,
and R (R > r) respectively,75 the effective particle number ∂dn
Nef f is approximately measured by the the volume ratio  ∂ 
in the proposal space against posterior space, namely Q−1
n = E − log p(dn ) .
∂dn dn
Nef f ≈ Np × (r/R)Nx The upper bound is time-varying and can be recursively
when the ratio is low (r  R), the effective number de- updated by replacing the expectation with Monte Carlo
creases exponentially as Nx increases; on the other hand, average. For derivation details and discussions, see [35],
if we want to keep the effective number as a constant, we [36]; for more general unified treatment (filtering, predic-
need to increase Np exponentially as Nx increases. tion, smoothing) and extended situations, see [410]. A spe-
An important asymptotic result is the error bound of the cific Cramér-Rao bound in multi-target tracking scenario
filter. According to the Cramér-Rao theorem, the expected was also given in [218].
square error of an estimate is generally given by P.2 Bias-Variance
E(x) = E[(x − x̂)2 ] Let’s first consider the exact Monte Carlo sampling. The
 2
1 + dE[x̂−x] true and Monte Carlo state-error covariance matrices are
dx
≥ + (E[x̂ − x])2 , (146) defined by
J(x)
73 This
Σ = Ep [(x − μ)(x − μ)T ],
term was first used by Bellman in 1961, which refers to the
exponential growth of hypervolume as a function of dimensionality. Σμ̂ = Ep [(x − μ̂)(x − μ̂)T ],
74 Unfortunately, most convergence results did not specify very
clearly and might produce confusion for the reader. We must caution 76 For the information filter, the information matrix is equivalent to
that any claim of an established theoretical result should not violate the J(x).
the underlying assumption, e.g. smoothness, regularity, exponential 77 In contrast to the conventional Cramér-Rao bound for determin-
forgetting; any unsatisfied condition will invalidate the claim. istic parameters, it is not required that the estimated state x̂ be
75 More generalized discussion for hyperellipses is given in [94]. unbiased, as many authors have suggested [462], [410].
MANUSCRIPT 42

;;; ;;;;
TABLE VIII p q

;;; ;;;;
A List of Statistics Notations. p q
C C
D min { KL( q||p ) }

;;; ;;;;
notation definition comment
D
f (x) N/A nonlinear function in RNx A
B A B
fˆN (x) (58) exact MC estimate
p
fˆ(x) (60) weighted estimate of IS

Ep [f ] p(x)f (x)dx true mean

Σf ≡ Varp [f ] Ep [(f − Ep [f ])2 ] true variance


Fig. 12. A geometrical interpretation of Monte Carlo estimate statis-
Σ̂ ˆ
f
(151) sample variance
tics. The points A, B, C, D represent Ep [f ], Eq [fˆ], fˆ, fˆNp , respec-

Eq [f ] q(x)f (x)dx mean w.r.t. proposal distribution q tively. |AB| = |Ep [f ] − Eq [fˆ]| represents the bias, |AC| = |Ep [f ] − fˆ|,

Ep [fˆN ]
p
p(x)fˆN (x)dx
p
mean of fˆN , equal to Ep [f ]
p
p, q represent two probability densities in the convex set, p is target
density, q is the proposal distribution. Left: when q = p, the es-
Varp [fˆN ] Ep [(fˆN − Ep [fˆN ])2 ] variance of exact MC estimate
p

p p timate is biased, the variance Eq [AC2 ] varies. Right: when q is
Eq [fˆ] q(x)fˆ(x)dx mean of fˆ w.r.t. q, equal to Eq [f ] close to p, or KL(qp) is small, bias vanishes (A approaches B) and
Varq [fˆ] Eq [(fˆ − Eq [fˆ])2 ] variance of weighted sampler w.r.t q C approaches D, the variance decrease with increasing Np ; when A
overlaps B, AC2 represents the total error.
VarMC [fˆN ] EMC [(f − Ep [fˆN ])2 ] w.r.t. Monte Carlo runs
p p
VarMC [fˆ] EMC [(fˆ − Eq [fˆ])2 ] w.r.t. Monte Carlo runs

p(x0:n |y0:n ) needs storing the data up to n), not to mention


the sampling inaccuracy as well as the existence of noise.
1

Np
In the Monte Carlo filtering context, suppose x̂n is an
where μ = Ep [x], μ̂ = Np x(i) where {x(i) } are i.i.d.
i=1 estimate given by the particle filter, by writing
samples drawn from true pdf p(x). It can be proved that
[49] xn − x̂n = (xn − Eq [x̂n |y0:n ]) + (Eq [x̂n |y0:n ] − x̂n ),

1 we may calculate the expected gross error


Σμ̂ = (1 + )Σ   
Np 
E = Eq tr (xn − x̂n )(xn − x̂n )T y0:n
= Σ + Varp [μ̂], (150)   

= tr Eq (xn − x̂n )(xn − x̂n )T y0:n
where the second line follows the fact that Ep [(μ − μ̂)(μ −    
μ̂)T ] = N1p Σ (see Appendix A). Hence, the uncertainty 
= tr Eq (x̂n − Eq [x̂n |y0:n ])(x̂n − Eq [x̂n |y0:n ])T y0:n
from the exact Monte Carlo sampling part is the order of   
Np−1 , for example, Np = 20 adds an extra 5% to the true Covariance

variance. In practice, we usually calculate the sample vari-
+ (Eq [x̂n |y0:n ] − xn )(Eq [x̂n |y0:n ] − xn )T
(152)
ance in place of true variance, for Monte Carlo simulation,   
we have Bias2

 Np where
1 
Σ̂μ̂ = (μ̂ − x(i) )(μ̂ − x(i) )T . (151)
Np − 1 i=1 Eq [xn |y0:n ] = xn W (xn )q(xn |y0:n )dxn ,

It should be cautioned that Σ̂μ̂ is an unbiased estimate of and W (xn ) = p(xn |y0:n )/q(xn |y0:n ). If p = q, the bias
Σ instead of Σμ̂ , the unbiased estimate of Σμ̂ is given by vanishes to zero at a rate O(Np ), then E only accounts for
(1 + Np−1 )Σ̂μ̂ . variance, and the state-error covariance is the true covari-
Second, we particularly consider the importance sam- ance. If p = q, E generally consists of both bias and vari-
pling where the i.i.d. samples are drawn from the pro- ance where the bias is a nonzero constant. Hence, equation
posal distribution. Recalling some notations defined ear- (152) represents the bias-(co)variance dilemma.79 When
lier (for the reader’s convenience, they are summarized in the loss E is fixed, the bias and variance is a trade-off.80
Table VIII, a geometrical interpretation of Monte Carlo es- As suggested in [322], generally, we can define the bias and
timates is shown in Fig. 12), it must be cautioned again variance of importance sampling or MCMC estimate as:
that although fˆNp is unbiased (i.e. Ep [f (x)] = Ep [fˆNp (x)]), Bias = Eq [fˆ(x)] − Ep [f (x)],
however, fˆ is biased (i.e. Ep [f (x)] = Eq [fˆ(x)]). In prac-  2 
tice, with moderate sample size, it was shown in [256] that Var = Eq fˆ(x) − Eq [fˆ(x)] ,
the bias is not negligible.78 The bias accounts for the fol- 79 It is also called the trade-off between approximation error and
lowing sources: limited simulated samples, limited com- estimation error.
puting power and limited memory (calculation of posterior 80 In a very loose sense, Kalman filter can be imagined as a special
particle filter with only one “perfect” particle propagation, in which
78 An improved Bayesian bootstrap method was proposed for re- the unique sample characterizes the sufficient information of the pro-
ducing the bias of the variance estimator, which is asymptotically totype data from the distribution. The variance estimate of Kalman
equivalent to the Bayesian bootstrap method but has better finite filter or EKF is small, whereas its bias (innovation error) is relatively
sample properties [256]. larger than that of particle filter.
MANUSCRIPT 43

TABLE IX
where fˆ(x) is given by the weighted importance sampling.
Monte Carlo Experimental Results of Example 1. (The
The quality of approximation is measured by a loss function
E, as decomposed by results are averaged on 100 independent runs using 10,000
samples with different random seeds. The bold font
 2  indicates the statistics are experimentally measured,
E = Eq fˆ(x) − Ep [f (x)]
whereas the others are analytically calculated.)
= Bias2 + Var.
statistics f1 (x) f2 (x)

Ep [f ] 0.1103 0.0488
Example 1: Consider two bounded functions
 Ep [fˆN ]
p
0.1103 0.0488

Cx, if 0 ≤ x ≤ 1 Eq [f ] 0.1570 0.0720


f1 (x) = ,
0, otherwise fˆN (x) 0.1103 0.0489
 p

Cx3 , if 0 ≤ x ≤ 1 fˆ(x) 0.1103 0.0489


f2 (x) = , Σf ≡ Varp [f ]
0, otherwise 0.0561 0.0235

Σ̂ ˆ 0.0562 0.0236
fN
p
where the constant C = 1. The true pdf p(x) is a Cauchy Σ̂ ˆ 0.0646 0.0329
f
density and the proposal distribution q(x) is a Gaussian Varp [fˆN ] 0.0561 × 10−4 0.0235 × 10−4
p
pdf (see the illustration in Fig. 14), as follows Varq [f ] 0.0748 0.0336

1 VarMC [fˆN ] (0.0012)2 (0.0009)2


p
p(x) = ,
πσ(1 + x2 /σ 2 ) VarMC [fˆ] (0.0014)2 (0.0012)2


N̂ef 3755 6124
1 f
q(x) = √ exp(−x2 /2σ 2 ), Nef f /Np 2208/10000 (22.8%)
2πσ
N̂ef f /Np 6742/10000 (67.4%)

both with variance σ 2 = 1. Hence the means of f1 (x) and NKL 4.0431

f2 (x) w.r.t. two distributions are calculated as Var[NKL ] 0.0161

 1
x ln 2
Ep [f1 (x)] = 2
dx = ,
0 π(1 + x ) 2π • Rao-Blackwellization [74], [304], [315], [144], [145],
 1 [119], [458], [338], [23].
x3 (1 − ln 2)
Ep [f2 (x)] = 2)
dx = , • Stratified sampling [376], [69].
0 π(1 + x 2π
 1 • Importance sampling [199], slicing sampling [351].
1 1 1
Eq [f1 (x)] = √ x exp(−x2 /2)dx = √ − √ , • Survey sampling [199], [162].
0 2π 8π 8πe • Partition sampling [313].
 1  
1 2 9 • Antithetic variate [200], [201], [442] and control variate
Eq [f2 (x)] = √ x3 exp(−x2 /2)dx = − . [5], [201] (see Appendix D).
0 2π π 2πe
• Group averaging [267].
We draw Monte Carlo samples from two distributions (see • Moment matching method [52].
Appendix C for implementation) with Np varying from 100 • jitter and prior boosting [193].
to 10,000. The analytic calculation results are compared • Kernel smoothing [222], [345].
with the ensemble average over 100 independent runs of • QMC and lattice method [413], [299], [368], [361],
Monte Carlo simulation with different initial random seeds. [295], [296].
The experimental results are partially summarized in Table
IX and shown in Fig. 13. P.3 Robustness
Remarks (on experimental results): Robustness (both algorithmic robustness and numerical
• As observed in Fig. 13, as Np increases, the estimates robustness) issue is important for the discrete-time filter-
of both fˆNp and fˆ become more accurate; and the ing. In many practical scenarios, the filter might encounter
variances decrease at a rate O(Np−1 ). the possibility of divergence where the algorithmic assump-
• As seen from Table IX, fˆ is equal to fˆNp (mean value tion is violated or the numerical problem is encountered
based on 100 Monte Carlo runs), but their variances (e.g., ill-conditioned matrix factorization). In retrospect,
are different (see right plot of Fig. 13). many authors have explored this issue from different per-
• Noting in experiments we use C = 1, but it can be spectives, e.g. robust KF [80], robust EKF [80], [158], min-
expected that when C > 1 (C < 1), the variance in- imax filter [273], or hybrid Kalman/minimax filter. Many
creases (decreases) by a ratio C 2 . useful rules of thumb for improving robustness were dis-
cussed in [80]. Here we focus our attention on the particle
To the end of the discussion of bias-variance, we summa- filters.
rize the popular variance reduction techniques as follows: There are two fundamental problems concerning the ro-
• Data augmentation [445], [446]. bustness in particle filters. First, when there is an outlier,
MANUSCRIPT 44

0.115
0.025
0.114
0.02
0.113
0.015
0.112

0.111 0.01

0.11 0.005

0.109
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0
100
Np 100 500 1000 2000 4000 8000 10000

0.02
0.0505
0.015
0.05

0.0495
0.01
0.049
0.005
0.0485

0.048
100 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0
Np 100 500 1000 2000 4000 8000 10000

0.08
100
0.07

Neff Nest N’eff (f1)


eff N’ (f )
0.06 eff 2
4

0.05

0.04
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
Np

50
0.045

0.04
2
0.035

0.03

0.025

0.02

0.015

0.01
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
Np 0
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

Fig. 13. Monte Carlo experimental results of Example 1. The first row shows the results of f1 (x) and the second row for f2 (x). Top Left:
Monte Carlo Mean of fˆ compared to the true mean Ep [f ] (solid line). Top Right: Monte Carlo varianceof fˆ within 100 independent runs.
Bottom Left: Error bar of the sample variance of Σ̂fˆ (solid line) compared to the sample variance Σ̂fˆ (dotted line). The dots are given
Np

by the means of 100 trial results of sample variance, the bars denote their standard deviations. Bottom Right: Ordered − log10 W̃ (x(i) )
(left ordinate) and W (x(i) ) (right ordinate; both in ascending order of abscissa) and effective sample size estimates (in one trial).

30

the importance weights will be very unevenly distributed


and it usually requires a large number of Np to assure the 25

accuracy of empirical density approximation. Hence the


measurement density p(yn |xn ) is supposed to insensitive
20

to the xn . Second, the empirical distribution from the 15

samples often approximates poorly for the long-tailed dis-


10

tribution, either for proposal distribution or for posterior.


This is imaginable because the probability sampling from 5

the tail part of distribution is very low, and resampling 0

somehow makes this problem more severe. Many results 0 0.5 1 1.5
x
2 2.5 3 3.5

have shown that even the mixture distribution can not well
Fig. 14. The ratio curve of important ratio function W (x) of Example
describe the tail behavior of the target distribution. Hence,  exp(x2 /2)
1. Solid line: true W (x) = 2/π 1+x2 ; dotted line: bounded
outliers will possibly cause the divergence of filter or pro-
curve specified by C.
duce a very bad performance.
Recently, it has been shown in [162], [70] that the sample
size estimate given by (89) is not robust, the approximated Fearnhead gave a simple example and illustrated that, the
expression might be infinitely wrong for certain f (x), p(x) estimate expression (89) of Nef f (derived by using first
and q(x). It can be derived that two moments of W (x) and f (x)) can be very poor (for two

simple cases, one leads to Nef f /Nef f → 0 and the other
1 
Nef f /Nef f → ∞). In [70], a more robust effective sample
Varq [fˆ] = Varq [f (x)W (x)]
Np size estimate was proposed
1  2 
= Eq f (x) − Ep [f (x)] W 2 (x) + O(Np−2 ), 
Np
Np Np (f (x(i) ) − fˆ(x))2 W (x(i) )
 i=1
where W (x) = p(x)/q(x). For a large Np , the true effective N̂ef f = . (154)

Np
sample size is given as [162], [70] (f (x(i) ) − fˆ(x))2 W 2 (x(i) )
i=1
 Varp [f ]
Nef f = Another critical issue is the estimate of the important
Varq [fˆ] weights within the IS, SIS, SIR framework. Note that
Np Ep [(f (x) − Ep [f (x)])2 ] W (x) = p(x)/q(x) is a function81 instead of a point esti-
≈  . (153)
Eq (f (x) − Ep [f (x)])2 W 2 (x) 81 More precisely, W (x) is a ratio function between two pdf’s. Es-
MANUSCRIPT 45

(a) (b)
mate. Being a function usually implies certain prior knowl- 0.5 0.5
edge, e.g. smoothness, non-negativeness, finite support.
0.4 0.4
However, when we use a finite number of random (uneven)
0.3 0.3
samples to represent this function, the inaccuracy (both
bias and variance) is significant. This problem becomes 0.2 0.2

more severe if the outliers come in. Experimentally, we 0.1 0.1


found that in a simple problem (Example 1), the distri-
0 0
bution of important weights are very peaked, even with a −10 −5 0 5 10 −10 −5 0 5 10

very large Np (e.g. 10,000 to 100,000). Most importantly, (c) (d)


as we can see in Fig. 14, the ratio curve (for Example 1) 0.5 0.5

W (x) = 2/π exp(x
2

1+x2
/2)
is unbounded.82 When x is bigger 0.4 0.4

than 3 (namely 3σ where σ 2 = 1; for Gaussian it accounts


2 0.3 0.3

for 99% support of the distribution), the ratio becomes 0.2 0.2
very large.83 Imaginably, this phenomenon is the intrin- 0.1 0.1
sic reason of weight unevenness when outliers come in, no
0 0
matter in sequential or non-sequential framework. To alle- −10 −5 0 5 10 −10 −5 0 5 10

viate this problem, a natural solution is simply to bound


Fig. 15. An illustration of some heavy tailed densities and robust den-
the important ratio function: sity model. (a) Cauchy density p(ξ) = πσ(1+ξ 1
2 /σ 2 ) ; (b) Laplace den-
 1 1
p(ξ)/q(ξ) 0≤ξ<C sity p(ξ) = 2σ exp(−|ξ|/σ); (c) Hyperbolic cosine p(ξ) = π cosh(ξ) ;
W (ξ) = ,
p(C)/q(C) ξ ≥ C (d) Huber’s robust density with = 0.2 and c = 0.8616. The dashed
line is zero-mean Gaussian density for comparison, all of densities
or have unity variances σ 2 = 1.

W (ξ) = ϕ(p(ξ)/q(ξ)),
Robust issues can be addressed in the robust statis-
where ϕ(·) is a bounded function, e.g. piecewise linear tics framework [214], [255]. Here we are particularly in-
function or scaled sigmoid function. The constant C here terested in the robust proposal or likelihood model. As
plays a similar role of C in the rejection sampling discussed discussed earlier, proposal distribution used in importance
in Section V-G.2, both of which determine the acceptance sampler is preferred to be a heavy-tailed density. In the
rate of the samples. The choices of the bound C or scal- Bayesian perspective, we know that the proposal distribu-
ing parameters also requires strong prior knowledge of the tion q(x|y) is assumed to approximate the posterior p(x|y)
problem (e.g. the support of target density). The use of and q(x|y) ∝ p(y|x)p(x). If the likelihood p(y|x) is upper-
bounded important weights essentially implies that we only bounded, say p(y|x) ≤ C, then the prior can be a good
use the reliable samples, ignoring the samples with very big candidate for proposal distribution since q(x|y) ∝ Cp(x)
weights. The reason is intuitively justified by the following: and it is also easy to implement. This fact motivates us
Since W (x) is an ratio function between two pdf’s, in prac- to come up with a robust loss function or robust likeli-
tice, the support of these pdf’s are often limited or com- hood density p(y|x),84 which assumes an -contaminated
pact, which means the distributions are sparse (esp. when mixture density. In spirit of robustness, the following like-
Nx is high). In order to handle the outliers and improve lihood model is used
the robustness, we only use the samples from the reliable ⎧ 
support based on prior knowledge and discard the others ⎨ √1− exp − ξ22 |ξ| < cσ
2πσ  2σ
p(ξ) = (155)
as outliers, though we also encounter the risk of neglecting ⎩ √1− exp c22 − c|ξ| |ξ| > cσ
2πσ 2σ σ
the tail behavior of target density. This is tantamount to
specifying a upper bound for the important ratio function
where 0 <  < 1, and c is determined from the normaliza-
W (x).
tion condition [463]
Another improved strategy is use kernel smoothing tech-

nique (Section VI-G) to smooth the importance ratio func- 1 −  cσ 2
tion, namely K(W (ξ)), where K(·) can be a Gaussian ker- 1= √ exp(−ξ 2 /2)dξ + exp(−c2 /2) .
2πσ −cσ c
nel. The disadvantage of this strategy is the increase of
computational cost, which brings inefficiency in on-line pro- The idea here is to bound the error and discard the in-
cessing. fluence of outliers;85 it was also suggested by West [480],
timating the ratio of two pdf’s given limited observations is stochas- in which he developed a robust sequential approximate
tically ill-posed
 x[463] (chap. 7). This amounts to solve the inte- Bayesian estimation for some special non-Gaussian distri-
gral equation −∞ W (x)dQ(x) = P (x). Given Np simulated sam- bution families. In Fig. 15, some heavy-tailed densities
ples {x(i) }, it turns out to solve an approximated operator equation:
ANp W = 0x W (x)dQNp (x). 84 The relationship between loss function and likelihood is estab-
82 That is the reason we are recommended to choose a proposal with lished by E = − log p(y|x).
heavy tail. 85 The idea of “local search” in prediction [456] is close in spirit to
83 This can be arbitrary bad if W (x) is not upper bounded. this.
MANUSCRIPT 46

and Huber’s robust density are illustrated. Those den- weights and obtain
sity models are more insensitive to the outliers because
1 
Np
of their bounded activation function. In addition, there is
a large amount of literature on robust Bayesian analysis KL(q p) ≈ − log(W̃ (x(i) )) ≡ NKL , (157)
Np i=1
(e.g. [226]) in terms of robust priors, robust likelihoods,
and robust (minimax) risks, however, extended discussion min
which achieves the minimum value NKL = log(Np ) when
is beyond the scope of current paper.
all W̃ (x(i) ) = 1/Np . Equation (157) can be also used as a
P.4 Adaptive Procedure measure of effective samples (for reampling), which leads
the following adaptive procedure:
Another way to enhance robustness is the adaptive par- • If NKL (n) > κ log(Np )
ticle methods [262], [447], which allow to adjust the num- • resample and increase Np (i.e. prior boosting) via
ber of particles through the filtering process. The common • Np (n + 1) = κNp
criterion is based on the likelihoods (which are equal to im- • Else
portance weights if the proposal is transition prior) [262]. • Np (n + 1) = Np , and resample if N̂ef f < NT
The intuition behind that is if the samples are well suited • End
to the real posterior, each individual importance weight is
where κ > 1 is a threshold defined a priori. We can also
large, and the variance of the importance weights is large,
calculate the variance approximately by
which means the mismatch between proposal distribution
and true posterior is large, and we keep Np small. An-
1 
Np
other method proposed in [171] is based on the stochastic Var[− log(W̃ )] ≈ (log(W̃ (x(i) )))2 − (NKL )2 .
bounds on the sample-based approximation quality. The Np i=1
idea is to bound the error induced by the samples and se-
quentially approximate the upper bound with additional Although above adaptive procedure is sort of hindsight in a
computational overhead. sense that it can only boost the samples in next step based
To monitor the efficiency of sampling in each step, we on current NKL , while NKL (n + 1) may not be less than
propose another adaptive procedure as follows. Besides κ log(Np ). Our empirical results show that it is still a useful
effective sample number Nef f or Nef  measure for monitoring the sample efficiency. This proce-
f , another useful effi-
ciency measure will be W (x) = p(x)/q(x) itself. Since pro- dure is particularly useful for APF when the importance
posal q(·) is supposed to be close to posterior p(·), the close- weights are evaluated after the first stage.
ness of two probability distribution (density) is naturally
P.5 Evaluation and Implementation
the Kullback-Leibler (KL) divergence KL(q p),86 which is
approximated by We should keep in mind that designing particular parti-
cle filter is problem dependent. In other words, there is no
 q(x)  1 
Np
q(x(i) ) general rule or universal good particle filter. For instance,
KL(q p) = Eq log ≈ log (i) in certain case like robot global localization [332], we pre-
p(x) Np i=1 p(x )
fer to keep the spread of particles wide (to prevent missing
1 
Np
hypothesis), but in another case like target tracking [357],
= − log(W (x(i) )) (156) we instead prefer to keep the support of particles bounded
Np i=1
(to improve the accuracy). To give another example, in
(i) many cases we want the particle filter robust to the outliers,
when q(·) = p(·) and W (x ) = 1 for all i, KL(q p) = 0.
thereby an insensitive likelihood model is preferred, how-
From (156), we can also see that if the proposal is chosen as
ever in some case where the cost is unaffordable even the
transition prior, KL(q p) will only depend on the likelihood
Np (i)
likelihood is low, a risk-sensitive model is needed [448]. On
i=1 log p(y|x ), thus the KL divergence reduces to a log- the other hand, one particle filter Algorithm A works well
likelihood measure; in a sequential framework, (88) can be (better than another particle filter Algorithm B) doesn’t
rewritten as necessarily mean that it has the gain over Algorithm B

Np
Np

Np on the other problems - this is the spirit of no-free-lunch
(i) (i) (i)
− log W (xn ) = − log W (xn−1 ) − log p(yn |xn ). (NFL) theorem! (see Appendix F) Hence it is not fair to
i=1 i=1 i=1 conclude that Algorithm A is superior to Algorithm B for
only one particular problem being tested. Justification of
Generally, KL(q p) = 0, thus (156) can be used as a mea- the superiority of certain algorithm over the others even
sure to monitor the efficiency of proposal. Intuitively, if on a specific problem is also unfair without Monte Carlo
KL(q p) is small or decreases, we can remain or decrease simulations.
the particle number Np ; if KL(q p) is big or increases, we One of the merits about particle filter is the implementa-
can increase the Np . In order to let − log(W (x(i) )) be non- tion complexity is O(N ), independent of the state dimen-
p
negative (since KL(q p) ≥ 0), we calculate the normalized sion N . As to the evaluation criteria of Monte Carlo or
x
86 KL divergence can be viewed as the expected log-likelihood, particle filters, a straightforward indicator of performance
where the likelihood is defined by q(·)/p(·). of different algorithms can be seen from the MSE between
MANUSCRIPT 47

as well as result comparison and justification.


PF 1
x 0(1)
• Kalman filters and particle filters: We particularly
y 0:n PF 2 xn
refer the reader to a Kalman/particle filter Mat-
x 0(2)
lab87 toolbox “ReBEL” (Recursive Bayesian Esti-
mation Library), developed by Rudolph van der
Merwe, which is available on line for academic purpose
PF m
x 0(m) http://varsha.ece.ogi.edu/rebel/index.html. The tool-
box cover many state-of-the-art Kalman/particle fil-
Fig. 16. A Parallel particle filters structure. tering methods, including joint/dual estimation, UKF,
UPF and their extensions. Demos and data sets are
also available.
the estimate and true value. Due to the Monte Carlo na-
• Monte Carlo methods: A website dedicated to the
ture, variance is an important criterion, e.g. (co)variance
sequential Monte Carlo approaches (including soft-
of estimate and variance of importance weights, both of
wares), maintained by Nando de Freitas, is available on
which are calculated based on Monte Carlo averaging re-
line http://www.cs.ubc.ca/∼ nando/smc/index.html.
sults (say 100 ∼ 1000 independent runs). This requirement
A shareware package called BUGS (Bayesian infer-
is deemed necessary when comparing different particle fil-
ence Using Gibbs Sampling) is available on line
ters’ performance, otherwise it is unfair to say one is better
http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml. A
than the others or the opposite. Other evaluation issues in-
website dedicated to MCMC methods is available on
clude sampling and resampling efficiency, trade-off between
line http://www.statslab.cam.ac.uk/∼ mcmc.
performance and computational complexity, parallel archi-
tecture, ease of implementation, etc.
VII. Other Forms of Bayesian Filtering and
The implementation issue of particle filters also deserves
Inference
special attention, though it is not formally discussed before
in the literature. As discussed earlier, for certain particle A. Conjugate Analysis Approach
filter, e.g. SIS filter, does allow the parallel implementation
One of important Bayesian filtering techniques is the
since the simulated particles are independent, but the re-
conjugate method, which admits the nonlinear filter-
sampling step usually makes the parallelization unfriendly
ing/inference in a close finite-dimensional form. In par-
because it requests all of the information of importance
ticular, when prior information about the model is limited,
weights. Nevertheless, we do can consider parallel imple-
the prior distribution is often chosen from a parametric
mentation in another perspective. Let’s consider a parallel
family P. The families P that are closely under sampling
particle filter structure (see Fig. 16) that comprises of a
(that is for every prior p ∈ P, the posterior distribution
bunch of (say m) particle filters, each particle filter is run
also belongs to P) are of particular interest. These fam-
independently with different initial conditions (e.g., differ-
ilies are called conjugate families and the associated pri-
ent seeds for the same random generator, different dynamic
ors are called conjugate priors, which can only belong to
noises), different simulated samples for the same proposal
the exponential family according to the Pitman-Koopman
distribution, different proposal distributions, or different
Lemma. The main motivation for using conjugate priors is
resampling schemes. The estimated result is based on the
their analytical tractability and ease of interpretation.
average of the estimates from m particle filters, namely
In [469], Vidoni introduced a finite-dimensional nonlin-

m ear and non-Gaussian filtering method for exponential fam-
x̂n = ck x̂n(k) ily of state space models. Specifically, he defined a conju-
k=1 gate latent process, in which the likelihood belongs to an

m exponentially family, initial state density is conjugate to
where ck = 1, ck can be a same constant 1/m or be dif- the likelihood, and the transition prior also remains con-
k=1
ferent, which allows on-line estimation (for instance, ck can jugate in the prediction step. The update and inference
be associated to the filtered error of the k-th particle filter). in each step follows a Bayes rule. Examples of exponen-
The complexity is proportional to the number of particle tial families include Gaussian, Gamma, Poisson, binomial,
filters, but different particle filters can be implemented in inverse Gaussian, Laplace, etc.
different processors or computers. The structure of par-
allel particle filters is somewhat similar to the interacting B. Differential Geometrical Approach
multiple models (to be discussed in Section VII). Statistical inference has an intrinsic link with differential
Finally, we would like to point out couple research re- geometry [9], [10]. A family of probability distributions
sources about Kalman filter, particle filters, and Monte corresponds to a geometric structure as a certain manifold
Carlo methods available in the Internet, an increasingly with a Riemannian metric. By transforming the statistical
growing database and resource open for researchers. We models to the geometric manifold, information geometry
deem it very important for multidisciplinary research in-
tersection, quick access of research results, open discussion, c
87 Matlab  is the trade mark of MathWorks, Inc.
MANUSCRIPT 48

provides many new insights to Bayesian filtering and infer- soft and fit a perfect niche for Bayesian filtering.
ence. In the conventional IMM, the assumption is limited by
In a series of papers [276]-[281], Kulhavý explored the the linearity and Gaussianity which allows to use Kalman
idea of recursive Bayesian parameter estimation using dif- filter or EKF for each potential hypothesis. However,
ferential geometry method. The basic idea is to approxi- this is not realistic in the real world. For the nonlinear
mate the true point by orthogonal projection onto a tan- non-Gaussian multiple-model problem, the estimate from
gent surface. He suggested to use an invariant metric called EKF’s are not accurate. Naturally, particle filtering can
conditional inaccuracy as error criterion, and formulated be used straightforward in IMM for target tracking [326].
the inverse problem to an approximation problem; the true Applications of particle filters in multiple models were also
density is assumed to come from a parameterized known found in computer vision and visual tracking [43], [356].
family, and the filtered density is approximated by the em-
pirical density given the observations. This methodology D. Bayesian Kernel Approaches
was also further extended to state estimation problem [279],
[225]. In particular, Iltis [225] used the disjoint basis func- Recently, kernel methods have attracted much attention
tion (similar to the Haar basis) to represent the posterior in machine learning [405]. We will briefly discuss some
density, the filtering density is an affine transformation of popular Bayesian kernel methods, the reader is strongly
the state vector; and the filtering problem is reduced to fit referred to [405] for more details. The discussions here are
the model density in each step to the true posterior. In- applicable to parameter as well as state estimation.
stead of using L2 norm, the KL divergence (cross-entropy) From Bayesian point of view, instead of defining a prior
criterion is used to measure the approximation accuracy on the parameter space, kernel methods directly define a
with the reduced statistics.88 The algorithm works under prior on the functional space, choosing a kernel K is equiva-
several assumptions [225]: (i) the transition density is ap- lent to assuming a Gaussian prior on the functional, with a
proximated by a piecewise constant function; (ii) the arith- normalized covariance kernel being K. On the other hand,
metic mean of posterior is close to the geometric mean; and instead of working on raw data space, kernel learning works
(iii) the bias in the affine approximation is constant. in the high-dimensional feature space by a “kernel trick”.
Brigo [55]-[57], and Brigo et al. [53], [54] also applied the • Gaussian Process, as a well-studied stochastic pro-
differential geometry approach to the finite-dimensional fil- cess, is one of the popular kernel machines for regres-
tering. By using the notion of projection filter [202], they sion [489]. The covariance of the random variables
projected the infinite-dimensional Kushner-Stratonovich {f (x1 ), · · · , f (x )} are defined by a symmetric posi-
equation onto a tangent space of a finite-dimensional mani- tive definite kernel K ≈ Cov{f (x1 ), · · · , f (x )} with
fold of square root of probability density (from exponential Kij = Cov[f (xi ), f (xj )], (i, j = 1, · · · , ). An on-line
family) according to the Fisher information metric, where algorithm for Gaussian processes for sequential regres-
the optimal filter is further sought in the tangent space. sion has been developed [508], [109].
More details can be found in the thesis of Brigo [55]. • Laplacian Process, which uses the Laplacian prior as
regularization functional, admits a sparse approxima-
C. Interacting Multiple Models tion for regression. The kernel is a Laplacian radial
One of important Bayesian filtering methods in literature basis function.
is the multiple models, e.g., generalized pseudo-Bayesian • Relevance vector machine (RVM) [454], is a kind of
(GPB) [1], interacting multiple models (IMM) [27], which kernel method to obtain sparse solutions while main-
are widely used in the data association and target track- taining the Bayesian interpretability. The basic idea
ing [501], [28]. The intuition of using multiple models is to is the use the hyperparameters to determine the pri-
tackle the multiple hypotheses problem. For instance, in ors on the individual expansion coefficients. RVM also
target tracking, the dynamic system can switch under dif- allows on-line estimation.
ferent modes (so-called switching dynamics). A single lin-
ear/nonlinear filter thus is not sufficient to characterize the E. Dynamic Bayesian Networks
underlying dynamics, once the filter loses the target, the
risk might be unaffordable. In order to tackle this situation, In the Bayesian perspective, many dynamic state-space
multiple filters are run in parallel to track the target, each models can be formalized into the so-called belief networks
one responsible to match a different target motion. The or dynamic Bayesian networks (DBN) (e.g., [183], [184]),
final estimate is calculated based on the weighted results which covers the following HMM and switching state-space
from the multiple filters, with the weighting probability model as special cases.89 Bayesian statistics has provided a
determined by the posterior probability of each hypothe- principled approach for probabilistic inference, with incor-
sis. Usually it is assumed the target switch from one mode poration of prior, causal, or domain knowledge. Recently,
to another with a known transition probability (via prior particle filtering has been applied in DBN [262], [263], [145],
knowledge or estimatation from data), all of decisions are [344], a detailed treatment was also given in [162].
88 Opposed to the sufficient statistics for original posterior estima-
tion problem, reduced statistics is used for seeking an equivalent class 89 A Matlab toolbox of DBN is available on line
of posterior, thereby making the inference more flexible. http://www.cs.berkeley.edu/∼ murphyk/Bayes/bnt.html.
MANUSCRIPT 49

HMM Filters. Hidden Markov models (HMM), or HMM straint, here we can only shortly describe several represen-
filters [380], [379], 90 can be viewed as a finite discrete- tative and well-studied problems in Bayesian learning com-
valued state space model.91 Given continuous-valued ob- munity. However, the idea rooted in these applications can
servations y0:n , the HMM filters are anticipated to esti- be extended to many scientific and engineering problems.
mate the discrete state z n (z ∈ NNz = {1, 2, · · · , Nz })
given the model parameters (transition probability matrix A. Target Tracking
p(z n |z n−1 ), emission probability matrix p(yn |z n ), and ini- Target tracking is one of the most important applica-
tial state distribution p(z 0 )).92 In contrast to the Kalman tions of sequential state estimation, which naturally admits
filtering, there are two popular algorithms used to train Kalman filters and particle filters as the main tools. Many
HMM filters93 papers have been published with particle filtering applica-
• Viterbi algorithm [470], [170]: It is used to calculate tions in this field [193], [192], [24], [35], [48]. Bearings-only
the MAP estimate of the path through the trellis, that tracking and multiple-target tracking [313], [216], [217],
is, the sequence of discrete states that maximize the [302], [362] are both well addressed. Some performance
probability of the state sequence given the observa- bounds for multiple-target tracking were also given [218].
tions. In addition, particle filters were extensively used for visual-
• Baum-Welch algorithm [379], [381]: It is used to to based human motion tracking or audio-based speaker local-
calculate the probability of each discrete state at each ization/tracking. In [88], we give some quantitative com-
epoch given the entire data sequence. parisons of different particle filters on several tracking prob-
Recently, many algorithms have been developed for non- lems.
stationary HMM in Monte Carlo framework [?], [390],
[136]. Specific particle filtering algorithms were also de- B. Computer Vision and Robotics
veloped for HMM [142], [162]. The pioneering work applying particle filtering in com-
Switching State-Space Models. Switching state-space puter vision is due to Isard and Blake [229], [230], [228],
model share the same form as the general state-space model where they called CONDENSATION for their algorithm.
(1a)(1b) but with a jump Markov dynamics (either in Since then, many papers have been published along this
state model or measurement model), which can be lin- line [231], [232], [313], [44], [43], [131], [457], [458], [94].
ear/nonlinear and Gaussian/non-Gaussian. It might also The motion and sensor models correspond to the state and
have mixed states consisting of both continuous and dis- measurement equations, respectively.
crete components. Many exact or approximate inference Another important application area of particle filter in
methods were proposed: artificial intelligence is robot navigation and localization
• Exact inference: e.g. switching Kalman filter and [447], [448], [171], [332], [288], which refers to the ability of
switching AR model [343] via EM algorithm. a robot to predict and maintain its position and orientation
• Monte Carlo simulation: e.g., random sampling ap-
within its environment.
proach [6], state estimation of jump Markov linear C. Digital Communications
systems (JMLS) using [146], [147], multi-class mixed-
state dynamics [43], [356] via EM combined with par- Particle filter and Monte Carlo methods have also found
ticle filtering. numerous applications in digital communications, includ-
• Variational approximation [236], [241], [237] and ing blind deconvolution [303], [83], demodulation [378],
mean-field approximation [241], [401]: variational channel equalization [97], estimation and coding [84], [507],
Kalman filter [30], variational switching state space and wireless channel tracking [215], [88]. Some reviews of
models [213], variational DBN [183], [184], variational Monte Carlo methods in wireless communication are also
Bayesian inference [22], variational Rao-Blackwellized found in [415] and [477], [85].
particle filter [23], variational MCMC [121]. • In [98], a fixed-lag particle smoothing algorithm was

With no doubt, there is still much research space for used for blind deconvolution and equalization.
further exploration along these lines. • In [476], the delayed-pilot sampling (which uses future
observations for generating samples) was used in MKF
VIII. Selected Applications for detection and decoding in fading channels.
• In [499], particle filter was used as blind receiver for
Bayesian filtering and Bayesian inference have found nu- orthogonal frequency-division multiplexing (OFDM)
merous applications in different areas. Due to space con- system in frequency-selective fading channels.
90 Kalman filter is also a HMM filter, except that the state space is • The time-varying AR(p) process was used for Rayleigh

continuous-valued. fast-fading wireless channel tracking, where particle


91 An excellent review paper on hidden Markov processes was given filtering was applied for improving symbol detector
in [160]. [269]. In [93], APF was used for semi-blind MIMO
92 Note that particle filter is more computationally efficient than the
HMM. Suppose we discretize the continuous state-space for formulate channel tracking.
94
the HMM filter with Nz discrete states, the complexity of HMM filter • Jump Markov linear systems (JMLR) has many
is O(Nz2 ), as opposed to O(Nz ) for particle filter.
93 Some on-line algorithms were also developed for HMM [26], [429]. 94 Jump Markov system is referred to the system whose parameters
MANUSCRIPT 50

implications in communications, where particle filters • spectral estimation [148]


can be applied [147]. • positioning and navigation [35], [196]
• time series analysis [484], financial analysis [310]
D. Speech Enhancement and Speech Recognition • economics and econometrics [436], [437], [443]
The speech signal is well known for its non-Gaussianity • biology sequence alignment [306]
and non-stationarity, by accounting for the existence of • beamforming [478]
non-Gaussian noise in real life, particle filter seems a • source separation [23]
perfect candidate tool for speech/audio enhancement and • automatic control [200], [5], [6]
noise cancellation. Lately, many research results have been
G. An Illustrative Example: Robot-Arm Problem
reported within this framework [467], [466], [169], [500]. It
was also proposed for solving the audio source separation or At the end of this section, we present a simple example
(restricted and simplified version of ) cocktail party prob- to illustrate the practical use of the particle filter discussed
lem [4]. thus far. Consider the kinematics of a two-link robot arm,
It would be remiss of us to overlook the important ap- as shown in Fig. 17(a). For given the values of pair an-
plication of HMM filters in automatic speech recognition gles (α1 , α2 ), the end effector position of the robot arm is
(ASR). Within the Bayesian framework, HMM filters have described by the Cartesian coordinates as follows:
been extensively used in speech recognition (see e.g. [380],
[379], [381], [219], [220]) and speech enhancement [159], in y1 = r1 cos(α1 ) − r2 cos(α1 + α2 ), (158a)
which the latent states are discrete and finite, which corre- y2 = r1 sin(α1 ) − r2 sin(α1 + α2 ), (158b)
spond to the letters in the alphabet.
where r1 = 0.8, r2 = 0.2 are the lengths of the two links
E. Machine Learning of the robot arm; α1 ∈ [0.3, 1.2] and α2 ∈ [π/2, 3π/2] are
The Kalman filtering methodology has been extensively the joint angles restricted in specific region. The solid and
used in neural networks training (see [206] and the ref- dashed lines in Fig. 17(a) show the “elbow up” and “el-
erences therein), especially in the area of real-time sig- bow down” situation, respectively. Finding the mapping
nal processing and control. On the other hand, in recent from (α1 , α2 ) to (y1 , y2 ) is called as forward kinematics,
decade, Bayesian inference methods have been widely ap- whereas the inverse kinematics is referred to the mapping
plied to machine learning, probabilistic inference, and neu- from (y1 , y2 ) to (α1 , α2 ). The inverse kinematics is not a
ral networks. Many papers can be found in the literature one-to-one mapping, namely the solution is not unique (e.g.
[58], [317], [120], [323], including a number of Ph.D. theses the “elbow up” and “elbow down” in Fig. 17(a) both give
[316], [346], [118], [333]. Applying Monte Carlo methods es- the same position). Now we want to formulate the prob-
pecially sequential Monte Carlo techniques also attracted lem as a state space model and solve the inverse kinematics
many researchers’ attention [120], [145], [262], [263]. In problem. Let α1 and α2 are augmented into a state vec-
particular in [120], a novel hybrid SIR (HySIR) algorithm tor, denoted as x ≡ [α1 , α2 ]T , the measurement vector is
was developed for training neural networks, which used a given by y = [y1 , y2 ]T . Equations (158a) and (158b) are
EKF update to move the particles towards the gradient rewritten in the following form of state space model
descent direction and consequently speech up the conver-
xn+1 = xn + dn ,
gence. To generalize the generic state-space model, a more


powerful learning framework will be the dynamic Bayesian cos(α1,n ) − cos(α1,n + α2,n ) r1
yn = + vn .
networks that admit more complex probabilistic graphical sin(α1,n ) − sin(α1,n + α2,n ) r2
models and include Fig. 2 as a special case. Another in-
teresting branch is the Bayesian kernel machines that are The state equation is essentially a random-walk with as-
rooted in the kernel method [405], which can tackle the sumed white Gaussian noise d ∼ N (0, diag{0.0082 , 0.082 }),
high-dimensional data and don’t suffer the curse of dimen- the measurement equation is nonlinear with measurement
sionality. How to explore the (sequential) Monte Carlo noise v ∼ N (0, 0.005 × I). As observed in Fig. 17(b),
methods to this area is still an open topic. the state trajectories of α1 and α2 are independent, thus
p(α1 , α2 |y) = p(α1 |y)p(α2 |y). α1 is a a slowly increasing
F. Others process with periodic random walk, α2 is a periodic fast
linearly-increasing/decreasing process. The SIR filter are
It is impossible to include all of applications of Bayesian used in our experiment.95 Considering the fast monotoni-
filtering and sequential Monte Carlo estimation, the litera- cally increasing behavior of α2 , random walk model is not
ture of them is growing exponentially nowadays. We only efficient. To be more accurate, we can roughly model the
list some of them available within our reach: states as a time-varying first or second-order (or higher-
• fault diagnosis [119], [338] order if necessary) AR process with unknown parameter
• tempo tracking [76], speaker tracking [464], direction An , namely αn+1 = An αn + dn . The uncertainty of
of arrival (DOA) tracking [290]
95 The Matlab code for generating robot-arm prob-
evolve with time according to a finite-state Markov chain. It is also lem data and a SIR filter demo are available on line
called switching Markov dynamics or switching state space model. ∼
http://soma.crl.mcmaster.ca/ zhechen/demo robot.m.
MANUSCRIPT 51

1.6

1.4

1.2
0.6
1

α1
0.8
0.5
0.6
0.7
0.4
0.6
0.2 0.4
0 100 200 300 400 500 600 700
0.5
time
0.4

P(α2|y)
5 0.3
0.3
4.5
0.2
4
5 0.2
3.5 0.1 4.5

α2
3 0 4
200 3.5 0.1
2.5
150 3
2 100 2.5
1.5 50 2
0
0 100 200 300 400 500 600 700 1.5 α2
Time index 0
time

Fig. 17. Schematic illustration of a two-link robot arm in two dimensions. (a) Left: for given joint angles (α1 , α2 ), the position of the end
effector (circle symbol), described by the Cartesian coordinates (y1 , y2 ), is uniquely determined. (b) Middle: the state trajectories (solid)
of (α1 , α2 ) in experiment. The dotted lines are the estimates given by SIR filter (multinomial resampling) using a random-walk model with
Np = 200. (c) Right: the pdf evolution of α2 in the first 200 steps.

An = [a1,n , b1,n , a2,n , b2,n ]T is augmented into the state model to fit the observed data, and the Bayesian proce-
for joint estimation (to be discussed in next section). In dure is used for model selection (not discussed here), hy-
this context, the new augmented state equation becomes perparameter selection (specifying priors or regularization
coefficient, not discussed here), and probabilistic inference
xan+1 = Fn+1,n xan + dn (of the unknown parameters). Parameter estimation has
been extensively used in off-line Bayesian estimation [272],
where
Bayesian learning (e.g. for neural networks) [58], [316],
xa T
n+1 = [α1,n+1 , α1,n , α2,n+1 , α2,n , a1,n+1 , b1,n+1 , a2,n+1 , b2,n+1 ] , [346], [118], or Bayesian identification [366], [367], [280]. It
is also related to Bayesian modeling and time series analy-
and sis [480], [483], [484], [372], [373].
⎡ ⎤
a1,n b1,n 0 0 0 0 0 0 Parameter estimation can be also treated in an on-line
⎢ 1 0 0 0 0 0 0 0 ⎥ estimation context. Formulated in a state space model,
⎢ ⎥
⎢ 0 0 a2,n b2,n 0 0 0 0 ⎥ the transition density of the parameters is a random-walk
⎢ ⎥
⎢ 0 0 1 0 0 0 0 0 ⎥ (or random-field) model, the likelihood is often described
Fn+1,n =⎢

⎥.
⎥ by a parametric model (e.g. a neural network). It is
⎢ 0 0 0 0 1 0 0 0 ⎥
⎢ 0 0 0 0 0 1 0 0 ⎥ also possible to use the gradient information to change the
⎢ ⎥
⎣ 0 0 0 0 0 0 1 0 ⎦ random-walk behavior to accelerate the convergence in a
0 0 0 0 0 0 0 1 dynamic environment, as illustrated in [?]. Recently, many
authors have applied particle filters or sequential Monte
Since An doesn’t enter the likelihood, by condition- Carlo methods for parameter estimation or static model
ing on α, A is a linear Gaussian model, therefore it [310], [13], [95]. In many cases, particle filters are also com-
can be estimated separately by other methods, such as bined with other inference techniques such as data augmen-
gradient descent, recursive least-squares (RLS), or Rao- tation [13], EM [43], or gradient-based methods. However,
Blackwellization.96 Namely, the joint estimation problem there are two intrinsic open problems arising from param-
is changed to a dual estimation problem (see next section). eter estimation using particle filtering technique. (i) The
It can be also solved with the EM algorithm, in which E- pseudo state is neither “causal” nor “ergodic”, the con-
step uses Bayesian filtering/smoothing for state estimation, vergence property is lost; (ii) The state space can be very
and M-step estimates the AR parameters via ML principle. large (order of hundreds), where the curse of dimensional-
The marginalization approach allows particle filter to work ity problem might be very severe. These two problems can
in a lower-dimensional space, thereby reducing the variance somehow be solved with MCMC techniques, some papers
and increasing the robustness. Hence, the Kalman filter up- are devoted to this direction [13], [16].
date is embedded in every iteration for every particle. The
detailed derivation and comparative experimental results B. Joint Estimation and Dual Estimation
will be given elsewhere.
If one encounters some parameter uncertainty in state
IX. Discussion and Critique estimation, the problem of state estimation and parameter
(either fixed parameter or time-varying parameter) estima-
A. Parameter Estimation
tion simultaneously arises. Generally, there is no unique
The parameter estimation problem arises from the fact optimal solution for this problem. Hence we are turn into
that we want to construct a parametric or nonparametric finding a suboptimal solution. One way is to treat the un-
96 This arises from the fact that p(A |α
known parameters θ as part of the states, by this trick
n 0:n , y0:n ) is Gaussian dis-
tributed which can be estimated a Kalman filter, and p(An , αn |y0:n ) one can use conventional filtering technique to infer the
can be obtained from p(α0:n |y0:n ). parameter and state simultaneously. This is usually called
MANUSCRIPT 52

by a Riemannian metric with a natural length element


|H(θ)|1/2 , the natural length elements of Riemannian met-
ric are invariant to reparameterization. The Jeffrey’s prior
has a nice geometrical interpretation: the natural volume
elements generate “uniform” measures on the manifolds,
in the sense that equal mass is assigned to regions of equal
volume, which makes Lebesque measure intuitively appeal-
ing. Another approach to construct a noninformative prior
is the so-called “reference priors” [38], [389], which maxi-
Fig. 18. A suboptimal solution of dual estimation problem. mize asymptotically the expected KL divergence.
In order to use conjugate approach in Bayesian filtering
or inference, conjugate priors are often chosen [388], [38],
joint estimation [473]. The problem of joint estimation is which can be of a single or a mixture form. the mixture
to find out the joint probability distribution (density) of conjugate priors allows us to have much freedom in model-
the unknown parameters and states, p(xn , θ|y0:n ), which ing the prior distribution. Within the conjugate approach-
usually has no analytic form. Another problem of joint based filtering, the inference can be tackled analytically.
estimation using particle filtering is that, when the param- Dirichlet prior is an important conjugate prior in the ex-
eter is part of the state, the augmented state space model ponential family and widely used in Bayesian inference. In
is not ergodic, and the uniform convergence result doesn’t addition, priors can be designed in the robust priors frame-
hold any longer [102]. An alternative solution is dual esti- work [226], e.g. the -contaminated robust priors.
mation, which uses an iterative procedure to estimate the
state and parameters alternatingly. Dual estimation was D. Localization Methods
first suggested in [12], and was lately studied in detail in
The intuition of localization idea is that, realizing the
[473], [352], with some new development. The idea of dual
fact that it is infeasible to store the whole state trajectories
estimation is illustrated in Fig. 18, where a suboptimal se-
or data due to limited storage resource in practice, instead
quential estimation solution is sought. Dual estimation can
of ambitiously finding an optimal estimate in a global sense,
be understood as a generalized EM algorithm: E-step uses
we are turn to find a locally optimal estimate by taking ac-
Kalman or particle filter for state estimation; whereas M-
count of most important observations or simulated data.
step performs model parameter estimation. The iterative
Mathematically, we attempt to find a locally unbiased but
optimization process guarantees the algorithm to converge
with minimum variance estimator. This idea is not new
to the suboptimal solution.
and has been widely used in machine learning [50], [463],
C. Prior control [337], signal processing (e.g. forgetting factor), and
statistics (e.g. kernel smoothing). Localization can be ei-
In the Bayesian estimation (filtering or inference) con- ther time localization or space localization. By time local-
text, choosing an appropriate prior (quantitatively and ization, it is meant that in the time scale, a local model
qualitatively) is a central issue.97 In the case where no is sought to characterize the most recent observation data,
preferred prior is available, it is common to choose a non- or the data are introduced with an exponential discount-
informative prior. It was called because the prior can ing/forgetting factor. By space localization, it is referred
be merely determined from the data distribution which to in any time instant, the sparse data are locally repre-
is the only available information. The purpose of non- sented, or the data are smoothed in a predefined neigh-
informative priors is to attain an “objective” inference borhood around the current observation, among the whole
within the Bayesian framework.98 data space.
Laplace was among the first who used noninformative The localization idea has been used for Monte Carlo sam-
methods ([388], chap. 3). In 1961, Jeffrey first proposed a pling [304], [3]. In the context of filtering, the forgetting
kind of noninformative prior based on Fisher information, factor has been introduced for particle filter [137]. Bear-
which is the so-called Jeffrey’s prior [388], [38] ing in mind that we encounter the risk that the particle
π(θ) ∝ |H(θ)|1/2 , (159) filters might accumulate the estimate inaccuracy along the
time, it is advisable to take the localization approach w.r.t.
where the trajectory. Namely, in order to estimate x̂n at time
 n, we only use partial observations, i.e. the posterior re-
∂2
|H(θ)|ij = − p(x|θ) log p(x|θ)dx (160) duces to p(xn |yn−τ :n ) (1 ≤ τ ≤ n) instead of p(xn |y0:n ).
∂θi ∂θj Kernel-based smoothing is one of the popular localization
is a Fisher information matrix. The logarithmic divergence methods, and it is straightforward to apply it to particle
locally behaves like the square of a distance, determined filters. The candidate kernel can be Gaussian or Epanech-
nikov. In addition to the disadvantage of introducing bias
97 When a flat prior is chosen, the Bayesian result reduces to the
(see Section VI-G), another shortcoming of kernel smooth-
frequentist approach.
98 Maximum-likelihood based methods essentially ignore the priors, ing is the curse of dimensionality, and it cannot be updated
or regard the priors as uniform. sequentially.
MANUSCRIPT 53

E. Dimensionality Reduction and Projection


Many state space models usually satisfy Ny ≤ Nx . When p(x n- 1 |y 0: n- 1 )
p(x n |y 0: n )
Ny > Nx (e.g., the observation is an image), some di-
mensionality reduction or feature extraction techniques are
Bayes rule
necessary. In this case, the observation data are usually yn
yn+ 1
sparely distributed, we can thus project the original high- p(y n |x n ) p(x n |x n- 1 )

dimensional data to a low-dimensional subspace. Such q(x n |x 0: n- 1, y0: n ) q(x n+ 1 |x 0: n, y0: n+ 1 )


techniques include principal component analysis (PCA),
SVD, factor analysis, nearest-neighborhood model. For
example, in visual tracking, people attempted to perform Fig. 19. A geometrical illustration of projection/marginalization of
the sampling in a subspace, namely to find a 2D image Bayesian filtering.
space for the 3D object motion. Likewise in robot local-
ization, the sensor information is usually high-dimensional
have
with an unknown measurement model, in on-line process-

ing the sensor information arrives much faster than the
update of the filter, not to mention the audio-visual data q(xn |x0:n−1 , y0:n ) ≈ p̂(xn−1 |y0:n−1 , z)p(z)dz,
association problem. In order to handle such situation, di- 
mensionality reduction becomes a must-be,99 either for a p(z) = p(z|xn−1 )p̂(xn−1 |y0:n−1 )dxn−1
fixed measurement model or a nonparametric model [471].
1 
Np
Projection idea is to project the object (data, distribu- (i)
= p(z|xn−1 ),
tion, or function) to a subspace which is “well-posed”, this Np i=1
geometrical insight has been widely used in filtering, learn-
ing, and inference. The idea of projection can be also con- where p̂(xn−1 |y0:n−1 ) is the previous posterior estimate
sidered for the proposal distribution. The basic intuition is (i)
represented by a discrete set {xn−1 }i=1
Np
. Let z(0) = yn ,
to assume that the the current posterior p(xn |y0:n ) is close we can use the similar sampling procedure discussed in
to the previous posterior p(xn−1 |y0:n−1 ), the only update Section VI-H.2. The details of the methodology will be
arises from the new observation yn . In order to draw sam- presented elsewhere [?]. Our idea of projection filtering100
ples from proposal q(xn |x0:n−1 , y0:n ), we project the pre- is similar but not identical to the one in [51], in which they
vious posterior to the subspace (called proposal space) by used marginalization idea for the belief update in the DBN,
marginalization (see Fig. 19). In the subspace we draw the but their method involved neither data augmentation nor
(i)
samples {xn } and use Bayes rule to update the posterior. Bayesian sampling.
Usually the update will deviate again from the subspace
(but not too far away), hence it is hoped that in the next F. Unanswered Questions
step we can project it back to the proposal space. The rea-
son behind it is that the subspace is usually simpler than Having discussed many features of particle filters, at this
the true posterior space and it is also easy to sample. To position, a question naturally occurring to us is:
do this, we can use data augmentation technique discussed Does particle filtering have free lunch?
earlier in Section VI-H. Suppose at time step n we have the In particular, we feel that the following issues have not
approximate posterior p̂(xn−1 |y0:n−1 ), given new observa- been satisfactorily addressed in the literature.
tion yn , we use the marginalization approach to alternat- First, how to choose effective particles still lacks rigor-
ingly generate the augmented z(i) (they are thus called the ous theoretical justification. How many independent sam-
“imputations” of the observations). First we assume ples (or antithetic variables) are needed in the sequential
Monte Carlo methods? Is it possible to get some upper and
q(xn |x0:n−1 , y0:n ) = q(xn |x0:n−1 , y0:n−1 , yn )
lower bounds of necessity of number of particles (see an at-
≈ p̂(xn−1 |y0:n−1 , yn ). tempted effort in [171]), though they are usually quite loose
By viewing the new observation as an augmented data z, and are problem-dependent? Of course, we can blindly
we can draw the samples from the proposal through the increase the number of particles to improve the approxi-
marginalized density mation accuracy, however, it will also inevitably increase
 the variance (due to the bias-variance dilemma, we can-
q(xn |x0:n−1 , y0:n ) ≈ p̂(xn−1 |y0:n−1 , z)p(z|y0:n−1 )dz, not make bias and variance simultaneously small according
 to the Uncertainty Principle), not to mention the increas-
p(z|y0:n−1 ) = p(z|xn−1 , y0:n−1 )p̂(xn−1 |y0:n−1 )dxn−1 . ing computational effort and sampling inefficiency (No free
lunch!). Albeit many techniques were used to improve the
Since z is supposed to be independent of the previous obser- degenerate problem, it seems to the authors that none of
vations, hence p(z|y0:n−1 ) reduces to p(z) and we further them are totally satisfactory. On the other hand, how to
99 Another novel method called real-time particle filter [288] has 100 Note that the term “projection filter” has been abused in the
been lately proposed to address the same problem in a different way. literature with different meanings.
MANUSCRIPT 54

seek an adaptive procedure of choosing/adding informa- in the machine learning literature (e.g. [463]). In the con-
tive particles (or “support particles”), still remains an open text of Bayesian filtering, the quantitative prior will be the
problem.101 This issue becomes crucial when we encounter chosen proposal distribution, initial state density p(x0 ) and
the scaling problem: the algorithm remains computation- noise statistics. Unfortunately, none of them of is assured
ally feasible when dimensionality of Nx is order of hundreds in practice. To our best knowledge, this question has not
or thousands. In addition, the number of sufficient parti- been addressed appropriately in the literature. Neverthe-
cles depends largely on the chosen proposal distribution, less, it is suspected that we might benefit from the rigorous
with a good choice, the error might vanish at a linear rate theoretical results established in the dependency estima-
of the increasing Np ; with a bad choice, the error might tion and statistical/computational learning literature [463],
increase exponentially with increasing Nx no matter how many notions such as metric entropy, VC dimension, infor-
large Np is. mation complexity, are potentially useful for establishing
Second, the cumulative error due to the inaccuracy of strong mathematical results for Monte Carlo filtering. For
the simulated samples at each iteration may grow exponen- example, since the integrand is known, how do we incorpo-
tially. For SIR or SIS filters, bias and variance will both rate the prior knowledge into Monte Carlo sampling?102 Is
increases along the time; for rejection particle filter, the it possible to introduce structural hypothesis class for pro-
variance also increases given a moderate number of par- posal distribution? Is it possible to establish a upper bound
ticles. In addition, as recalled in the discussion of conver- or lower bound for particular Monte Carlo integration (i.e.
gence behavior, the uniform convergence cannot be assured a problem-dependent bound that is possibly much tighter
unless Np increases over the time or the particle filter has than the generic Cramér-Rao bound)?
the capability to forget the error exponentially. A good Particle filters certainly enjoy some free lunches in cer-
example is given in [361]: tain special circumstances, e.g. partially observable Gaus-
Suppose the transition density p(xn |xn−1 ) is sian model, decoupled weakly Gaussian model. However,
uniform and independent of xn−1 , the likelihood answering the all of concerns of a general problem, un-
is binary with p(yn = 1|xn ) if xn < 0.2 and fortunately, have no free lunch. It was felt that the cur-
p(yn = 0|xn ) otherwise. If the true states hap- rent status of particle filter research is very similar to the
pen to stay in [0, 0.2) so that yn = 1 for all n. situation encountered in the early 1990s of neural net-
However, the probability of having no particles works and machine learning. Such examples include the
(which are binomially distributed) within [0, 0.2) bootstrap technique, asymptotic convergence result, bias-
in any one of n time steps is 1 − (1 − 0.8Np )n , variance dilemma, curse of dimensionality, and NFL theo-
which converges to 1 exponentially with increas- rem. In no doubt, there are still a lot of space left for the-
ing n; in other words, the particle filter almost oretical work on particle filters. As firstly addressed in the
loses the true trajectory completely. theoretic exposition [128], the theories of interacting parti-
Although this is an extreme example which might never cle systems [300], large deviation theory [59], [126], Feller
happen in the real life, it does convince us that the inaccu- semigroups, limit theorem, etc. are the heart of Monte
racy will bring a “catastrophic” effect as time evolves such Carlo or particle filtering theory. But they are certainly
that the filter either diverges or deviates far away from not the whole story.
the true states. In this sense, “Bayesian statistics without One of theoretical issue, for example, is about the abuse
tears” will be probably rephrased as “particle filtering with the information in Monte Carlo simulation, since it is usu-
tears”. Although the above example is a special toy prob- ally hard to verify quantitatively the information we use
lem, it does make us realize the importance of the robust- and ignore. Recently, Kong et al. [267] have partially
ness issue posed earlier. On the other hand, it is noted that approached this question, in which they formulated the
convergence behavior is a transient phenomenon, nothing problem of Monte Carlo integration as a statistical model
is said about the error accumulation in a long run. Does with simulation draws as data, and they further proposed
error approach a steady state? How to characterize the a semi-parametric model with the baseline measure as a
steady-state behavior of particle filter? To our best knowl- parameter, which makes explicit what information is ig-
edge, theoretical results are still missing. nored and what information is retained in the Monte Carlo
Third, Bayesian principle is not the only induction prin- methods; the parameter space can be estimated by the ML
ciple for statistical inference. There might also exist approach.
other principles, e.g. minimax (worst case analysis), SRM It is also noteworthy to keep in mind that the classic
(structural risk minimization), MDL (minimum description Monte Carlo methods belong to the frequentist procedure,
length), or Occam’s razor. Is Bayesian solution always opti- a question naturally arising is: Can one seek a Bayesian
mal in any sense? The answer is no. The Bayesian method version of Monte Carlo method? [318]. Lately, this ques-
makes sense only when the quantitative prior is correct tion has been partially tackled by Rasmussen and Ghahra-
[463]. In other words, in the situation lack of a priori knowl- mani [382], in which they proposed a Bayesian Monte
edge, Bayesian solution is possibly misleading. In fact, the Carlo (BMC) method to incorporate prior knowledge (e.g.
conflict between SRM and Bayesianism has been noticed 102 As matter of fact, as we discussed earlier in importance sampling,
the proposal distribution can be chosen in a smart way to even lower
101 This issue was partially addressed in the paper [88]. down the true variance.
MANUSCRIPT 55

smoothness) of the integrand to the Monte Carlo inte- • Efficient use of simulated samples and monitoring the
gration: Given a large number of samples, the integrand sample efficiency;
Np
{f (x(i) )}i=1 is assumed to be a Gaussian process (i.e. the • Exploration of smoothing, regularization, data aug-
prior is defined in the functional space instead of data mentation, Rao-Blackwellization, and MCMC varia-
space) [489], their empirical experimental results showed tions.
that the BMC is much superior to the regular Monte Carlo • Exploration of of different (or new) Monte Carlo inte-
methods. It would be beneficial to introduce this tech- gration rules for efficient sampling.
nique to the on-line filtering context. Besides, in real-life Another promising future direction seems to be combin-
applications, the noise statistics of dynamical systems are ing particle filtering with other inference methods to pro-
unknown, which are also needed to be estimated within duce a fruitful outcome. The geometrical and conjugate
Bayesian framework via introducing hyperparameters; thus approaches provide many insights for application of Rao-
the hierarchical Bayesian inference are necessary. To sum- Blackwellization and data augmentation.
marize, there can be several levels of Bayesian analysis for In no doubt, modern Monte Carlo methods have opened
different objects: data space, parameter/hyperparameter the door to more realistic and complex probabilistic mod-
space, and functional space. els. For many complex stochastic processes or dynamics
Currently, we are investigating the average/worst case where the posterior distributions are intractable, various
of Monte Carlo filtering/inference. The objective is to at- approximate inference methods other than Monte Carlo ap-
tempt to find the upper/lower bounds using variational proximation come in (e.g., mean-field approximation, vari-
methods [241], [237], [236]. The potential applications ational approximation), or they can be combined to use
combining deterministic variational Bayesian approxima- together (e.g. [121]). Alternatively, one can also simplify
tion and stochastic Monte Carlo approximation are very the complex stochastic processes by the ways of decompo-
promising, which are also under investigation. sition, factorization, and modulation for the sake of infer-
ence tractability. For the higher-order Markov dynamics,
X. Summary and Concluding Remarks mixture or hierarchical structure seems necessary and ef-
ficient approximation inference are deemed necessary. To
In this paper, we have attempted to present a tutorial conclude, from the algorithm to practice, it is a rocky road,
exposition of Bayesian filtering, which covers such top- but there is no reason to disbelieve that we can pave the
ics as stochastic filtering theory, Bayesian estimation, and way forward.
Monte Carlo methods. Within the sequential state esti-
mation framework, Kalman filter reduces to be a special Appendix A: A Proof
case of Bayesian filtering in the LQG scenario; particle fil- Assuming that x(i) (i = 1, · · · , Np ) are Np i.i.d. samples,
ter, rooted deeply in Bayesian statistics and Monte Carlo Np (i)
μ = E[x] and μ̂ = N1p i=1 x are the expected mean
technique, comes up as a powerful solution candidate for
tackling the real-life problems in the physical world where and sample mean, respectively. The covariance of sample
the nonlinearity and non-Gaussianity abound. estimate μ̂ is calculated as
It is our purpose to provide the reader a complete pic- Cov[μ̂] = E[(μ̂ − μ)(μ̂ − μ)T ]
ture of particle filters originated from stochastic filtering = E[μ̂μ̂T ] − μμT
theory. Besides Monte Carlo filtering, other Bayesian fil-

1  (i) 1  (j) T
Np Np
tering or Bayesian inference procedures are also addressed.
= E ( x )( x ) − μμT
It is obvious that the theory of Bayesian filtering presented Np i=1 Np j=1
here has a lot of potentials in variety of scientific and engi-
1 
Np Np
neering areas, thus suitable for a wide circle of readers.
Certain applications in artificial intelligence, signal pro- = E[xxT ] − μμT
Np2 i=1 j=1
cessing, communications, statistics, and machine learning,
have been already mentioned in Section VIII. In addition to Np E[xxT ] + (Np2 − Np )μμT
= − μμT
the sequential Monte Carlo nature of estimation, another Np2
attractive property of particle filter is that it allows flex-
E[xxT ] − μμT 1
ibility design and parallel implementation. On the other = = Cov[x]
hand, it should be cautioned that particle filters are not Np Np
the panacea, designing special particle filter in practice is where Cov[x] is the covariance of random vector x, the
problem dependent and requires a good understanding of fourth step in above equation uses the independence as-
the problem at hand. We should also be borne in mind sumption of x
that this area is far from mature and has left a lot of space  T
  E[xxT ] i=j
for theoretical work. E (x(i) )(x(j) ) = (i) (j) T
E[x ]E[x ] = μμ T
i = j
In summary, most of research issues of particle filters
focused on (and will still concentrate on) the following: Appendix B: Convergence of Random Variables
• Choices of proposal distribution; Definition 8: Almost Convergence (or Convergence with
• Choices of resampling scheme and schedule; Probability 1): A sequence of {Xn } is said to converge to
MANUSCRIPT 56

a random variable X with probability 1 if for any ζ > 0, • Start with an arbitrary seed x0 ;
>0 • xn = (69069xn−1 + 1) mod 232 ,
• un = 2−32 xn .
Pr{ω : |Xn (ω) − X(ω)| < } > 1 − ζ
is satisfied for all n > N where N may depend on both ζ where the sequence un can be regarded as the i.i.d. uni-
and . Or equivalently, form random variables drawn from U(0, 1). Some uniform
distribution random number generator functions in Matlab
Pr{ω : | lim Xn (ω) = X(ω)} = 1. are rand, unifrnd, and unidrnd.
n→∞
Definition 9: Mean Square Convergence: A sequence of
{Xn } of random variables is said to converge to a random Normal (Gaussian) distribution
variable X in the mean-square sense if Suppose u1 and u2 are two random variables uniformly
2
E[(Xn (ω) − X(ω)) ] → 0 (n → ∞) distributed in U(0, 1), by taking

or lim E[(Xn (ω) − X(ω))2 ] = 0. x1 = μ + σ −2 log(u1 ) cos(2πu2 ),
n→∞ 
Definition 10: Convergence in Probability: A sequence x2 = μ + σ −2 log(u1 ) sin(2πu2 ),
of {Xn } of random variables converges in probability to
the random variable X if for every  > 0 then x1 and x2 can be regarded as two independent draws
from N (μ, σ 2 ); this algorithm is exact [389].
lim Pr{|Xn (ω) − X(ω)| ≥ } = 0. It can be also generated by the transformation method
n→∞
Definition 11: Convergence in Distribution: A sequence by calculating the cdf
of {Xn } of random variables is said to converge to a ran-  x
1 (ξ − μ)2
dom variable X in distribution if the distribution functions F (x) = √ exp(− )dξ
Fn (x) of Xn converge to the distribution function F (x) of 0 2πσ 2 σ2
1  x−μ 
X at all points of continuity of F , namely, = 1 + erf √ ,
2 2σ 2
lim Fn (x) = F (x)
n→∞
then the random number can be generated by the inverse
for all x at which F (x) is continuous. function

Appendix C: Random Number Generator x = F −1 (u) = μ + 2σ 2 erf−1 (2u − 1).
In what follows, we briefly discuss some popular random Some normal distribution random number generator func-
number generators. Strictly speaking, we can only con- tions in Matlab include mvnrnd or normrnd or randn (for
struct the pseudo-random or quasi-random number gener- N (0, I)).
ators, which are deterministic in nature but the samples
they generated exhibit the same or similar statistical prop- Exponential and Logistic distribution
erties as the true random samples. For standard distribu-
Let u be one random variable uniformly distributed in
tions such as uniform, Gaussian, exponential, some exact
U(0, 1), by taking x = − log(u)/λ, then x can be regarded
random sampling algorithms exist. Other standard dis-
as a draw from exponential distribution Exponential(λ);
tributions are generally obtained by passing an inverse of u
by calculating x = log 1−u , then x can be regarded as a
the cumulative distribution function (cdf) with a pseudo-
draw from logistic distribution Logistic(0, 1) [389]. An
random sequence, the resulting distributions are mostly ap-
exponential distribution random number generator func-
proximate rather than exact.
tion in Matlab is exprnd.
Theorem 6: [168] Let {F (z), a ≤ z ≤ b} denote a distri-
bution function with an inverse distribution function as Cauchy distribution
−1
F (z) = inf{z ∈ [a, b] : F (z) ≥ u, 0 ≤ u ≤ 1}. To generate the Cauchy distribution, we can use the
Let u denote a random variable from U(0, 1), then z = transformation method. The pdf of zero-mean Cauchy dis-
F −1 (u) has the distribution function F (z). tribution is given by
Reader is referred to [168], [389], [386], [132] for more σ 1
p(x) =
information. For simulation purpose, the Matlab user can π σ + x2
2
find many random number generators for various distribu-
tions in the Statistics Toolbox (MathWorks Inc.). where σ 2 is the variance. The cdf of Cauchy distribution is
 x
σ 1 1 x 1
Uniform distribution F (x) = 2 2
dξ = arctan( ) + .
−∞ π σ + ξ π σ 2
The uniform random variable is the basis on which the
other random number generators (other than uniform dis- The transformation is then given by the inverse transform
tribution) are constructed. There are many uniform ran- x = F −1 (u):
dom number generators available [386]. The following rou- 1
tine is a one based on the congruencial method F −1 (u) = σ tan(π(u − )) = −σ cot(πu).
2
MANUSCRIPT 57

Hence given some uniform random numbers u ∈ U(0, 1), we show Var[f (x)] ≥ Var[f (x) − h(x)], which is equivalent to
can use above relationship to produce the Cauchy random Var[h(x)] < 2Cov[f (x), h(x)], where
numbers by x = −σ cot(πu). The acceptance-rejection 
sampling approach to generate Cauchy distribution pro- Cov[f (x), h(x)] = (f (x) − θ)(h(x) − μ)dx.
ceeds as follows [168]:
• repeat
Suppose θ̂ is an unbiased Monte Carlo estimate obtained
• generate u1 and u2 from U(−1/2, 1/2) from exact draws, namely E[θ̂] = θ. We can find another
2
• until u1 + u2 ≤ 1/4
unbiased estimator μ̂ (E[μ̂] = μ), as a control variate, to
• return u1 /u2 .
construct a new estimator
Laplace distribution θ = θ̂ + μ − μ̂.
Laplace distribution is also called double exponential dis- It is obvious that θ is also an unbiased estimate of θ. The
tribution. It is the distribution of differences between two variance of this new estimator is given by
independent variates with identical exponential distribu-
tions. The pdf of Laplace distribution is given by Var[θ ] = Var[θ̂ − μ̂]

1 |x| = Var[θ̂] + Var[μ̂] − 2Cov[θ̂, μ̂],


p(x) = exp −
2σ σ hence Var[θ ] < Var[θ̂] if Var[μ̂] < 2Cov[θ̂, μ̂]. In some
where σ is a positive constant. The distribution function sense, controlled variate can be understood as a kind of
of Laplace distribution is variational method.
Antithetic variate is a variance-reduction method ex-
 1 x ploiting the negative correlation. Suppose θ̂ and θ are two
F (x) = 2 exp( σ ) x<0
,
1 − 2 exp( −x
1
σ ) x≥0 unbiased estimates of θ, we construct another unbiased es-
timate as
and the inverse transform x = F −1 (u) is given by θ̂ + θ
 μ̂ = ,
σ ln(2u) 0 < u < 1/2 2
F −1 (u) = .
−σ ln(2 − 2u) 1/2 ≤ u < 1 whose variance is given by
1 1 1
Given some uniform random numbers u ∈ U(0, 1), we can Var[μ̂] = Var[θ̂] + Var[θ ] + Cov[θ̂, θ ].
use above relationship to produce the Laplace distributed 4 4 2
random variables x = F −1 (u). Suppose θ̂ and θ are two Monte Carlo estimates obtained
from exact draws, if θ is chosen s.t. Cov[θ̂, θ ] < 0 (i.e. the
Appendix D: Control Variate and Antithetic Monte Carlo samples are negatively correlated instead of
Variate independent; a.k.a. correlated sampling), variance reduc-
Control variate and antithetic variate are two useful tion is achieved.
variance-reduction techniques by exploring the knowledge For example, if the integrand is a symmetric function
of integrand. To illustrate the idea, only one-dimensional w.r.t. a+b 2 over the region [a, b], we can write f (x) =
f (x)+f (a+b−x)
variable is considered here.
2 (when −a = b, it reduces to an even func-
Suppose we want to estimate an integral of interest tion). Thus we can introduce negative correlation since
  generally Cov[f (x), f (a + b − x)] < 0; if a = 0, b = 1 and
θ = φ(x)p(x)dx ≡ f (x)dx. f (x) ∼ U(0, 1), then Cov[f (x), f (1 − x)] = −1.
More generally, if f (·) is a monotonically increas-
ing/decreasing function, then f (x) and f (1 − x) are neg-
To achieve this, we use another known statistics
atively correlated. Hence in order to reduce the variance,
  one may construct a Monte Carlo estimate
μ = φ(x)q(x)dx ≡ h(x)dx
1 
Np
(f (x(i) ) + f (1 − x(i) )),
to further construct an equivalent integral 2Np i=1
 Np
θ = μ + (f (x) − h(x))dx, instead of using the naive estimates N1p i=1 f (x(i) or
1
 N (i)
i=1 f (1 − x .
p
Np
where μ is a known constant, h(x) is called as a “control Example 2: To give a more specific example, consider
variate”, which is usually chosen to be close to f (x). drawing the samples from a zero mean Cauchy distribu-
In order to reduce the variance (i.e. the right-hand tion discussed in Appendix C. Given uniform random vari-
side is no more than the left-hand side), we need to ables u ∼ U(0, 1), we can produce the Cauchy random
MANUSCRIPT 58

TABLE X
numbers by x1 = −σ cot(πu). On the other hand, not-
ing that 1 − u are also uniformly distributed that is neg- The SVD-based Derivative-free Kalman Filtering for State
Estimation.
atively correlated with u. Utilizing this symmetry prop-
erty, we can generate another set of Cauchy random num-
bers x2 = −σ cot(π(1 − u)) = σ tan(πu). Obviously, x1 Initialization
and x2 are slightly negatively correlated, their covariance
T
x̂0 = E[x0 ], P̂0 = E[(x0 − x̂0 )(x0 − x̂0 )
is also usually negative. By drawing Np /2 samples of ].

x1 and Np /2 samples of x2 , we obtain some negatively Compute the SVD and eigen-point covariance matrix
correlated samples from Cauchy distribution. Alterna-
T
tively, by constructing Np samples x = (x1 + x2 )/2, we Pn = Un Sn Vn
χ0,n−1 = x̂n−1
have Var[x] < max{Var[x1 ], Var[x2 ]}, and Var[x] is ex- 
χi,n−1 = x̂n−1 + ρUi,n si,n , i = 1, · · · , Nx
pected to reduce, compared to the two independent runs χi,n−1 = x̂n−1 − ρUi,n

si,n , i = Nx + 1, · · · , 2Nx
for x1 and x2 . The sample estimate of x is unbiased, i.e.,
E[x] = E[x1 ] = E[x2 ]. Also note that when x1 and x2 are Time updates
negatively correlated, f (x1 ) and f (x2 ) are usually nega- χi,n|n−1 = f (χi,n−1 , un ), i = 0, 1, · · · , 2Nx
tively correlated when f (·) is a monotonic function. 2N
x (m)
x̂n|n−1 = χ0,n|n−1 + W (χi,n|n−1 − χ0,n|n−1 )
This approach can be utilized in any transformation- i=1
i

based random number generation technique (Appendix C) 2N


x (c) T
Pn|n−1 = W (χi,n|n−1 − x̂n|n−1 )(χi,n|n−1 − x̂n|n−1 ) + Σd
whenever applicable (i.e., using uniform random variable, i=0
i

and F −1 being a monotonic function). Such examples in- Yi,n|n−1 = g(χi,n|n−1 , un ), i = 0, 1, · · · , 2Nx

clude exponential distribution, logistic distribution, and ŷn|n−1 = Y0,n|n−1 +


2N
x
W
(m)
(Yi,n|n−1 − Y0,n|n−1 )
i
Laplace distribution. i=1

Measurement updates
Appendix E: Unscented Transformation Based on
SVD 2N
x (c) T
Pŷ ŷ = W (Yi,n|n−1 − ŷn|n−1 )(Yi,n|n−1 − ŷn|n−1 ) + Σv
n n i
i=0
There are many types of matrix factorization techniques 2N
x
[42], e.g. Cholesky factorization, U-D factorization, LDU T Px̂ ŷ
n n
= W
i
(c)
(χi,n|n−1 − x̂n|n−1 )(Yi,n|n−1 − ŷn|n−1 )
T

factorization.103 Hence we can use different factorization


i=0
−1
Kn = Px̂ ŷ P
n n ŷn ŷn
methods to implement the unscented transformation (UT). x̂n = x̂n|n−1 + Kn (yn − ŷn|n−1 )
The basic idea here is to use singular value decomposition T
Pn = Pn|n−1 − Kn Pŷ ŷ Kn
(SVD) instead of Cholesky factorization in the UT. In Ta- n n

ble X, the state estimation procedure is given, the extension


to parameter estimation is straightforward and is omitted Weights : W
(m)
=
1
,
(c)
W0 =
κ
, W
(c)
=
1
i i
2Nx Nx + κ 2Nx + 2κ
here. As to the notations, P denotes the state-error corre-
lation matrix, K denotes the Kalman √ gain, ρ is a scaling
parameter (a good choice is 1 ≤ ρ ≤ 2) for controlling the
extent of covariance,104 κ is a small tuning parameter. The
computational complexity of SVD-KF is the same order of on some set of target S, then Algorithm B must be supe-
O(Nx3 ) as UKF. rior to Algorithm A if averaging over all target not in S.
Such examples also include sampling theory and Bayesian
Appendix F: No Free Lunch Theorem analysis [491].
The no-free lunch (NFL) 105 theorems basically claim For the particle filters (which certainly belong to random
that no learning algorithms can be universally good; in based algorithm class), the importance of prior knowledge
other words, an algorithm that performs exceptionally well is very crucial. Wolpert [491], [492] has given a detailed
in certain situations will perform comparably poorly in mathematical treatment of the issues of existence and lack
other situations. For example, NFL for optimization [493], of prior knowledge in machine learning framework. But
for cross-validation, for noise prediction, for early stopping, the discussions can be certainly borrowed to stochastic fil-
for bootstrapping, to name a few (see also [87] for some tering context. In Monte Carlo filtering methods, the most
discussions on NFL in the context of regularization the- valuable and important prior knowledge is the proposal dis-
ory). The implication of NFL theorem is that, given two tribution. No matter what kind of particle filter is used, an
random based algorithms Algorithm A and Algorithm B, appropriately chosen proposal is directly related to the fi-
suppose Algorithm A is superior to Algorithm B averaged nal performance. The choice of proposal is further related
to the functions f and g, the likelihood model or mea-
103 The factorization is not unique but the factorization techniques surement noise density. Another crucial prior knowledge
are related, they can be used to develope various forms of square-root is the noise statistics, especially the dynamical noise. If
Kalman filters [42], [247].
104 In one-dimensional Gaussian distribution, variance σ 2 accounts the Σd is small, the weight degeneracy problem is severe,
for 95% covering region of data (2σ 2 for 98%, 3σ 2 for 99%). which requires us to either add “jitter” or choose regular-
105 The term was first used by David Haussler. ization/smoothing technique. Also, the prior knowledge of
MANUSCRIPT 59

the model structure is helpful for using data augmentation Xn equivalent to x0:n ≡ {x0 , · · · , xn }
and Rao-Blackwellization techniques. Yn equivalent to y0:n ≡ {y0 , · · · , yn }
Appendix G: Notations

Symbol           Description

N                integer number set
R (R+)           (positive) real-valued number set
u                input vector as driving force
x                continuous-valued state vector
z                discrete-valued state vector
y                measurement vector
z                augmented (latent) variable vector
e                state error (innovations)
ω                Wiener process
d                dynamical noise vector
v                measurement noise vector
Σd, Σv           covariance matrices of noises
P                correlation matrix of state-error
I                identity matrix
J                Fisher information matrix
K                Kalman gain
f(·)             nonlinear state function
g(·)             nonlinear measurement function
F                state transition matrix
G                measurement matrix
H                Hessian matrix
l(x)             logarithm of optimal proposal distribution
μ                true mean E[x]
μ̂                sample mean from exact sampling
Σ                true covariance
Σ̂                sample covariance
f̂Np              Monte Carlo estimate from exact sampling
f̂                Monte Carlo estimate from importance sampling
x(i)             the i-th simulated sample (particle)
x̃n               prediction error xn − x̂n(Yn)
∅                empty set
S                set
f, g, φ          generic nonlinear functions
F                distribution function
sgn(·)           signum function
erf(·)           error function
⌊·⌋              floor function
δ(·)             Dirac delta function
I(·)             indicator function
K(·, ·)          kernel function
α(·, ·)          probability of move
Pr(·)            probability
P                parametric probability function family
P, Q             probability distribution
p                probability density (mass) function
q                proposal distribution, importance density
π                (unnormalized) density/distribution
E                energy
K                kinetic energy
Nx               the dimension of state
Ny               the dimension of measurement
Np               the number of particles
Nz               the number of discrete states
Neff, N̂eff       the number of effective particles
NT               the threshold of effective particles
NKL              KL(q‖p) estimate from importance weights
m                the number of mixtures
c                mixture coefficient
C                constant
W                importance weight
W̃                normalized importance weight
ξ                auxiliary variable
t                continuous-time index
n                discrete-time index
τ                time delay (continuous or discrete)
X, Y, Z          sample space
Xn               equivalent to x0:n ≡ {x0, · · ·, xn}
Yn               equivalent to y0:n ≡ {y0, · · ·, yn}
X                sigma points of x in unscented transformation
Y                sigma points of y in unscented transformation
W                sigma weights in unscented transformation
E[·]             mathematical expectation
Var[·], Cov[·]   variance, covariance
tr(·)            trace of matrix
diag             diagonal matrix
A^T              transpose of vector or matrix A
|·|              determinant of matrix
‖·‖              norm operator
‖·‖A             weighted norm operator
E                loss function
Ψ(·)             sufficient statistics
N(μ, Σ)          Normal distribution with mean μ and covariance Σ
U(0, 1)          uniform distribution in the region (0, 1)
(Ω, F, P)        probability space
O(·)             order of
∼                sampled from
A                operator
Ã                adjoint operator
L                differential operator
T                integral operator
a.k.a.           also known as
a.s.             almost sure
e.g.             exempli gratia
i.e.             id est
i.i.d.           identically and independently distributed
s.t.             such that
w.r.t.           with respect to

Acknowledgement

This paper would not have been possible without the contributions of
numerous researchers in this ever-growing field. The author would
like to thank Drs. Simon Haykin and Thia Kirubarajan (McMaster) for
reading the manuscript and providing much feedback. We are also
grateful to Dr. Fred Daum (Raytheon) for sharing his unpublished
papers with us as well as for many helpful comments, and to Dr. Yuguo
Chen (Duke) for providing his Stanford Ph.D. thesis at an early stage
for a better understanding of sequential Monte Carlo methods. We also
thank Dr. David J. C. MacKay (Cambridge) for allowing us to reproduce
a figure in his paper. Finally, special thanks are due to Prof.
Rudolph E. Kalman for his seminal contribution that directly
motivated the writing of this paper.
References

[28] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with
Applications to Tracking and Navigation: Theory, Algorithms,
[1] G. A. Ackerson and K. S. Fu, “On state estimation in switching and Software, New York: Wiley, 2001.
environments,” IEEE Trans. Automat. Contr., vol. 15, pp. 10–17,
[29] T. R. Bayes, “Essay towards solving a problem in the doctrine
1970.
of chances,” Phil. Trans. Roy. Soc. Lond., vol. 53, pp. 370–418,
[2] S. L. Adler, “Over-relaxation method for the Monte-Carlo evalua-
1763. Reprinted in Biometrika, vol. 45, 1958.
tion of the partition function for multiquadratic actions,” Phys.
[30] M. Beal and Z. Ghahramani, “The variational Kalman
Rev. D, vol. 23, no. 12, pp. 2901–2904, 1981.
smoother,” Tech. Rep., GCNU TR2001-003, Gatsby Computa-
[3] M. Aerts, G. Claeskens, N. Hens, and G. Molenberghs, “Local
tional Neuroscience Unit, Univ. College London, 2001.
multiple imputation,” Biometrika, vol. 89, no. 2, pp. 375–388.
[31] E. R. Beadle and P. M. Djurić, “Fast weighted bootstrap fil-
[4] A. Ahmed, “Signal separation,” Ph.D. thesis, Univ.
ter for non-linear state estimation,” IEEE Trans. Aerosp. Elect.
Cambridge, 2000. Available on line http://www-
Syst., vol. 33, pp. 338–343, 1997.
sigproc.eng.cam.ac.uk/publications/theses.html
[32] Ya. I. Belopolskaya and Y. L. Dalecky, Stochastic Equations and
[5] H. Akashi and H. Kumamoto, “Construction of discrete-time non-
Differential Geometry, Kluwer Academic Publishers, 1990.
linear filter by Monte Carlo methods with variance-reducing tech-
niques,” Systems and Control, vol. 19, pp. 211–221, 1975 (in [33] V. E. Beneš, “Exact finite-dimensional filters for certain diffusions with
Japanese). nonlinear drift,” Stochastics, vol. 5, no. 1/2, pp. 65–92, 1981.
[6] ———, “Random sampling approach to state estimation in [34] ———, “New exact nonlinear filters with large Lie algebras,”
switching environments,” Automatica, vol. 13, pp. 429–434, 1977. Syst. Contr. Lett., vol. 5, pp. 217-221, 1985.
[7] D. F. Allinger and S. K. Mitter, “New results in innovations prob- [35] N. Bergman, “Recursive Bayesian estimation: Navigation and
lem for nonlinear filtering,” Stochastics, vol. 4, pp. 339–348, 1981. tracking applications,” Ph.D. thesis, Linköping Univ., Sweden,
[8] D. L. Alspach and H. W. Sorenson, “Nonlinear Bayesian estima- 1999.
tion using Gaussian sum approximation,” IEEE Trans. Automat. [36] ———, “Posterior Cramér-Rao bounds for sequential estima-
Contr., vol. 20, pp. 439–447, 1972. tion,” in Sequential Monte Carlo Methods in Practice, A. Doucet,
[9] S. Amari, Differential Geometrical Methods in Statistics, Lecture J. F. G. de Freitas, N. J. Gordon, Eds. Berlin: Springer Verlag,
Notes in Statistics, Berlin: Springer, 1985. 2001.
[10] S. Amari and H. Nagaoka, The Methods of Information Geom- [37] N. Bergman, A. Doucet, and N. Gordon, “Optimal estimation
etry, New York: AMS and Oxford Univ. Press, 2000. and Cramer-Rao bounds for partial non-Gaussian state space
[11] B. D. O. Anderson and J. B. Moore, “The Kalman-Bucy filter models,” Ann. Inst. Statist. Math., vol. 53, no. 1, pp. 97–112,
as a true time-varying Wiener filter,” IEEE Trans. Syst., Man, 2001.
Cybern., vol. 1, pp. 119–128, 1971. [38] J. M. Bernardo and A. F. M. Smith, Bayesian Theory, 2nd ed.,
[12] ———, Optimal Filtering, Prentice-Hall, 1979. New York: Wiley, 1998.
[13] C. Andrieu and A. Doucet, “Recursive Monte Carlo algorithms [39] D. P. Bertsekas and I. B. Rhodes, “Recursive state estimation
for parameter estimation in general state space models” in Proc. for a set-membership description of uncertainty,” IEEE Trans.
IEEE Signal Processing Workshop on Statistical Signal Process- Automat. Contr., vol. 16, pp. 117–128, 1971.
ing, pp. 14–17, 2001. [40] C. Berzuini, N. G. Best, W. Gilks, and C. Larizza, “Dynamic
[14] ———, “Particle filtering for partially observed Gaussian state conditional independent models and Markov chain Monte Carlo
space models,” J. Roy. Statist. Soc., Ser. B, vol. 64, no. 4, pp. methods,” J. Amer. Statist. Assoc., vol. 92, pp. 1403–1412, 1997.
827–836, 2002. [41] C. Berzuini and W. Gilks, “RESAMPLE-MOVE filtering with
[15] C. Andrieu, N. de Freitas, and A. Doucet, “Rao-Blackwellised cross-model jumps,” in Sequential Monte Carlo Methods in Prac-
particle filtering via data augmentation,” in Adv. Neural Inform. tice, A. Doucet, J. F. G. de Freitas, N. J. Gordon, Eds. Berlin:
Process. Syst. 14, Cambridge, MA: MIT Press, 2002. Springer Verlag, 2001.
[16] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, “An [42] G. J. Bierman, Factorization Methods for Discrete Sequential
introduction to MCMC for machine learning,” Machine Learning, Estimation, New York: Academic Press, 1977.
vol. 50, no. 1/2, pp. 5–43, 2003. [43] A. Blake, B. North, and M. Isard, “Learning multi-class dy-
[17] C. Andrieu, M. Davy, and A. Doucet, “Efficient particle filtering namics,” in Adv. Neural Inform. Process. Syst. 11, pp. 389–395,
for jump Markov systems”, in Proc. IEEE ICASSP2002, vol. 2, Cambridge, MA: MIT Press, 1999.
pp. 1625–1628. [44] A. Blake, B. Bascle, M. Isard, and J. MacCormick, “Statistical
[18] ———, “Improved auxiliary particle filtering: Application to models of visual shape and motion,” Proc. Roy. Soc. Lond. Ser.
time-varying spectral analysis”, in Proc. IEEE Signal Processing A, vol. 356, pp. 1283–1302, 1998.
Workshop on Statistical Signal Processing, 2001, pp. 14–17. [45] B. Z. Bobrovsky, E. Mayer-Wolf, and M. Zakai, “Some classes
[19] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, of global Cramér-Rao bounds,” Ann. Statist., vol. 15, pp. 1421–
“A tutorial on particle filters for online nonlinear/non-Gaussian 1438, 1987.
Bayesian tracking,” IEEE Trans. Signal Processing, vol. 50, no. [46] H. W. Bode and C. E. Shannon, “A simplified derivation of linear
2, pp. 174–188, Feb. 2002. least square smoothing and prediction theory,” Proc. IRE, vol. 38,
[20] M. Athans, R. P. Wishner, and A. Bertolini, “Suboptimal state pp. 417–425, 1950.
estimation for continuous time nonlinear systems from discrete [47] Y. Boers, “On the number of samples to be drawn in parti-
noisy measurements,” IEEE Trans. Automat. Contr., vol. 13, pp. cle filtering,” Proc. IEE Colloquium on Target Tracking: Algo-
504–514, 1968. rithms and Applications, Ref. No. 1999/090, 1999/215, pp. 5/1–
[21] H. Attias, “Inferring parameters and structure of latent variable 5/6, 1999.
models by variational Bayes,” in Proc. 15th Conf. UAI, UAI’99, [48] Y. Boers and J. N. Driessen, “Particle filter based detection for
1999. tracking,” in Proc. Amer. Contr. Conf., vol. 6, pp. 4393–4397,
[22] ———, “A variational Bayesian framework for graphical mod- 2001.
els,” in Adv. Neural Inform. Process. Syst., 12, Cambridge, MA: [49] E. Bølviken, P. J. Acklam, N. Christophersen, J-M. Størdal,
MIT Press, 2000. “Monte Carlo filters for nonlinear state estimation,” Automatica,
[23] ———, “Source separation with a microphone array using vol. 37, pp. 177–183, 2001.
graphical models and subband filtering,” in Adv. Neural Inform. [50] L. Bottou and V. Vapnik, “Local learning algorithms,” Neural
Process. Syst., 15, Cambridge, MA: MIT Press, 2003. Comput., vol. 4, pp. 888–900, 1992.
[24] D. Avitzour, “A stochastic simulation Bayesian approach to mul- [51] X. Boyen, and D. Koller, “Tractable inference for complex
titarget tracking,” IEE Proc.-F, vol. 142, pp. 41–44, 1995. stochastic process,” in Proc. 14th Conf. Uncertainty in AI,
[25] B. Azimi-Sadjadi, “Approximate nonlinear filtering with appli- UAI’98, pp. 33–42, 1998.
cations to navigation,” Dept. Elect. Comput. Engr., Univ. Mary- [52] P. Boyle, M. Broadie, and P. Glasserman, “Monte Carlo methods
land, College Park, 2001. for security pricing,” J. Economic Dynamics and Control, vol. 3,
[26] P. Baldi and Y. Chauvin, “Smooth on-line learning algorithms pp. 1267–1321, 1998.
for hidden Markov models,” Neural Comput., vol. 6, pp. 307–318, [53] D. Brigo, B. Hanzon, and F. LeGland, “A differential geometric
1994. approach to nonlinear filtering: the projection filter,” IEEE Trans.
[27] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Asso- Automat. Contr., vol. 43, no. 2, pp. 247–252, 1998.
ciation, New York: Academic Press, 1988. [54] ———, “Approximate nonlinear filtering by projection on ex-
ponential manifolds of densities,” Bernoulli, vol. 5, no. 3, pp. ing,” IEEE Trans. Informa. Theory, vol. 46, no. 6, pp. 2079–2094,
495–534, 1999. 2000.
[55] D. Brigo, “Filtering by projection on the manifold of exponential [85] R. Chen, J. S. Liu, and X. Wang, “Convergence analyses and
densities,” Ph.D. thesis, Dept. Economics and Econometrics, Free comparison of Markov chain Monte Carlo algorithms in digital
University of Amsterdam, the Netherlands, 1996. Available on communications,” IEEE Trans. Signal Processing, vol. 50, no. 2,
line http://www.damianobrigo.it/. pp. 255–269, Feb. 2002.
[56] ———, “Diffusion processes, manifolds of exponential densities, [86] Y. Chen, “Sequential importance sampling with resampling:
and nonlinear filtering,” in Geometry in Present Day Science, O. Theory and applications,” Ph.D. thesis, Stanford Univ., 2001.
E. Barndorff-Nielsen and E. B. V. Jensen, Eds., World Scientific, [87] Z. Chen and S. Haykin, “On different facets of regularization
1999. theory,” Neural Comput., vol. 14, no. 12, pp. 2791–2846, 2002.
[57] ———, “On SDE with marginal laws evolving in finite- [88] Z. Chen and K. Huber,, “Robust particle filters with applications
dimensional exponential families,” Statist. Prob. Lett., vol. 49, in tracking and communications”, Tech. Rep., Adaptive Systms
pp. 127–134, 2000. Lab, McMaster University, 2003.
[58] W. L. Buntine and A. S. Weigend, “Bayesian back- [89] J. Cheng and M. J. Druzdzel, “AIS-BN: An adaptive importance
propagation,” Complex Syst., vol. 5, pp. 603–643, 1991. sampling algorithm for evidential reasoning in large Bayesian net-
[59] J. A. Bucklew, Large Deviation Techniques in Decision, Simu- works,” J. Artif. Intell. Res., vol. 13, pp. 155–188, 2000.
lations, and Estimation, Wiley, 1990. [90] S. Chib and E. Greenberg, “Understanding the Metropolis-
[60] R. S. Bucy and P. D. Joseph, Filtering for Stochastic Processes Hastings algorithm,” Am. Stat., vol. 49, pp. 327–335, 1995.
with Applications to Guidance, New York: Wiley, 1968. [91] Y. T. Chien and K. S. Fu, “On Bayesian learning and stochastic
[61] R. S. Bucy, “Linear and nonlinear filtering,” Proc. IEEE, vol. approximation,” IEEE Trans. Syst. Sci. Cybern., vol. 3, no. 1,
58, no. 6, pp. 854–864, 1970. pp. 28-38.
[62] ———, “Bayes theorem and digital realization for nonlinear fil- [92] W. H. Chin, D. B. Ward, and A. G. Constantinides, “Semi-
ters,” J. Astronaut. Sci., vol. 17, pp. 80–94, 1969. blind MIMO channel tracking using auxiliary particle filtering,”
[63] R. S. Bucy and K. D. Senne, “Digital synthesis of non-linear in Proc. GLOBECOM, 2002.
filters,” Automatica, vol. 7, pp. 287–298, 1971. [93] K. Choo and D. J. Fleet, “People tracking with hybrid Monte
[64] R. S. Bucy and H. Youssef, “Nonlinear filter representation via Carlo filtering,” in Proc. IEEE Int. Conf. Comp. Vis., vol. II, pp.
spline functions,” in Proc. 5th Symp. Nonlinear Estimation, 51– 321–328, 2001.
60, 1974. [94] N. Chopin, “A sequential particle filter method for static mod-
[65] Z. Cai, F. LeGland, and H. Zhang, “An adaptive local grid refine- els,” Biometrika, vol. 89, no. 3, pp. 539–552, Aug. 2002.
ment method for nonlinear filtering,” Tech. Rep., INRIA, 1995. [95] C. K. Chui and G. Chen, Kalman Filtering: With Real-Time
[66] F. Campillo, F. Cérou, and F. LeGland, “Particle and cell ap- [95] C. K. Chui and G. Chen, Kalman Filtering: With Real-Time
proximation for nonlinear filtering,” Tech. Rep. 2567, INRIA, Applications, 2nd ed., Berlin: Springer-Verlag, 1991.
1995. communications data,” Ph.D. thesis, Dept. Eng., Univ.
[67] B. P. Carlin, N. G. Polson, and D. S. Stoffer, “A Monte Carlo Cambridge, U.K., 2000. Available on line http://www-
approach to non-normal and non-linear state-space modelling,” sigproc.eng.cam.ac.uk/publications/theses.html
J. Amer. Statist. Assoc., vol. 87, pp. 493–500, 1992. [97] T. Clapp and S. J. Godsill, “Fixed-lag smoothing using se-
[68] C. Cargnoni, P. Müller, and M. West, “Bayesian forecasting of quential importance sampling,” in Bayesian Statistics 6, J. M.
multinomial time series through conditionally gaussian dynamic Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith Eds.
models,” J. Amer. Statist. Assoc., vol. 92, pp. 587–606, 1997. pp. 743–752, Oxford: Oxford Univ. Press, 1999.
[69] J. Carpenter, P. Clifford, and P. Fearnhead, “Improved particle [98] M. K. Cowles and B. P. Carlin, “Markov chain Monte Carlo con-
filter for nonlinear problems,” IEE Proc. -F Radar, Sonar Navig., vergence diagnostics — A comparative review,” J. Amer. Statist.
vol. 146, no. 1, pp. 2–7, 1999. Assoc., vol. 91, pp. 883–904, 1996.
[70] ———, “Building robust simulation-based filters for evolving [99] F. G. Cozman, “An informal introduction to quasi-Bayesian the-
data sets,” Tech. Rep., Statist. Dept., Oxford Univ., 1998. Avail- ory,” Tech. Rep., CMU-RI-TR 97-24, Robotics Institute, Carnegie
able on line http://www.stats.ox.ac.uk/∼ clifford/particles/. Mellon Univ., 1997.
[71] C. K. Carter and R. Kohn, “On Gibbs sampling for state space [100] ———, “Calculation of posterior bounds given convex sets of
models,” Biometrika, vol. 81, no. 3, pp. 541–553, 1994. prior probability measures and likelihood functions,” J. Comput.
[72] ———, “Markov chain Monte Carlo in conditionally Gaussian Graph. Statist., vol. 8, no. 4, pp. 824–838, 1999.
state-space models,” Biometrika, vol. 83, no. 3, pp. 589–601, 1996. [101] D. Crisan and A. Doucet, “A survey of convergence results on
[73] G. Casella and E. George, “Explaining the Gibbs sampler,” Am. particle filtering methods for practitioners,” IEEE Trans. Signal
Statist., vol. 46, pp. 167–174, 1992. Processing, vol. 50, no. 3, pp. 736–746, 2002.
[74] G. Casella and C. P. Robert, “Rao-Blackwellization of sampling [102] D. Crisan, J. Gaines, T. Lyons, “Convergence of a branching
schemes,” Biometrika, vol. 83, no. 1, pp. 81–94, 1996. particle method to the solution of the Zakai equation,” SIAM J.
[75] G. Casella, “Statistical inference and Monte Carlo algorithms,” Appl. Math., vol. 58, no. 5, pp. 1568–1598, 1998.
Test, vol. 5, pp. 249–344, 1997. [103] D. Crisan, P. Del Moral, T. Lyons, “Interacting particle sys-
[76] A. T. Cemgil and B. Kappen, “Rhythm quantization and tempo tems approximations of the Kushner Stratonovitch equation,”
tracking by sequential Monte Carlo,” in Adv. Neural Inform. Pro- Adv. Appl. Prob., vol. 31, no. 3, pp. 819–838, 1999.
cess. Syst. 14, Cambridge, MA: MIT Press, 2002. [104] ———, “Non-linear filtering using branching and interacting
[77] F. Cérou and F. LeGland, “Efficient particle methods for resid- particle systems,” Markov Processes Related Fields, vol. 5, no. 3,
ual generation in partially observed SDE’s,” in Proc. 39th Conf. pp. 293–319, 1999.
Decision and Control, pp. 1200–1205, 2000. [105] D. Crisan, “Particle filters - A theoretical perspective,” in Se-
[78] S. Challa and Y. Bar-Shalom, “Nonlinear filter design using quential Monte Carlo Methods in Practice, A. Doucet, J. F. G.
Fokker-Planck-Kolmogorov probability density evolutions,” IEEE de Freitas, N. J. Gordon, Eds. Berlin: Springer Verlag, 2001.
Trans. Aero. Elect. Syst., vol. 36, no. 1, pp. 309–315, 2000. [106] ———, “Exact rates of convergence for a branching particle
[79] C. D. Charalambous and S. M. Diouadi, “Stochastic nonlinear approximation to the solution of the Zakai equation,” Ann. Prob.,
minimax filtering in continous-time,” in Proc. 40th IEEE Conf. vol. 32, April 2003.
Decision and Control, vol. 3, pp. 2520–2525, 2001. [107] ———, “A direct computation of the Benes filter conditional
[80] G. Chen, Ed. Approximate Kalman Filtering, Singapore: World density,” Stochastics and Stochastic Reports, vol. 55, pp. 47–54,
Scientific, 1993. 1995.
[81] M.-H. Chen and B. W. Schmeiser, “Performances of the Gibbs, [108] L. Csató and M. Opper, “Sparse on-line Gaussian process,”
hit-and-run, and Metropolis samplers,” J. Comput. Graph. Stat., Neural Comput., vol. 14, pp. 641–668, 2002.
vol. 2, pp. 251–272, 1993. [109] A. I. Dale, A History of Inverse Probability: From Thomas
[82] M.-H. Chen, Q.-M. Shao, and J. G. Ibrahim, Monte Carlo Meth- Bayes to Karl Pearson, New York: Springer-Verlag, 1991.
ods in Bayesian Computation, Springer, 2000. [110] F. E. Daum, “Exact finite dimensional nonlinear filters,” IEEE
[83] R. Chen and J. S. Liu, “Mixture Kalman filters,” J. Roy. Statist. Trans. Automat. Contr., vol. 31, no. 7, pp. 616–622, 1986.
Soc., Ser. B, vol. 62, pp. 493–508, 2000. [111] ———, “New exact nonlinear filters,” in Bayesian Analysis of
[84] R. Chen, X. Wang, and J. S. Liu, “Adaptive joint detection Time Series and Dynamic Models, J. C. Spall, Ed. New York:
and decoding in flat-fading channels via mixture Kalman filter- Marcel Dekker, 1988, pp. 199–226.
[112] ———, “Industrial strength nonlinear filters,” in Proc. Estima- gence of the Markov chain simulation,” Ann. Statist., vol. 24, pp.
tion, Tracking, and Fusion Workshop: A Tribute to Prof. Yaakov 69–100, 1996.
Bar-Shalom, 2001. [140] A. Doucet, N. de Freitas, and N. Gordon, Eds. Sequential
[113] ———, “Solution of the Zakai equation by separation of vari- Monte Carlo Methods in Practice, Springer, 2001.
ables,” IEEE Trans. Automat. Contr., vol. 32, no. 10, pp. 941– [141] A. Doucet, “Monte Carlo methods for Bayesian estimation of
943, 1987. hidden Markov models: Application to radiation signals,” Ph.D.
[114] ———, “Dynamic quasi-Monte Carlo for nonlinear filters,” in thesis, Univ. Paris-Sud Orsay, 1997.
Proc. SPIE, 2003. [142] ———, “On sequential simulation-based methods for Bayesian
[115] F. E. Daum and J. Huang, “Curse of dimensionality for particle filtering,” Tech. Rep., Dept. Engineering, CUED-F-TR310, Cam-
filters,” submitted paper preprint. bridge Univ., 1998.
[116] P. J. Davis and P. Rabinowitz, Methods of Numerical Integra- [143] A. Doucet, S. Godsill, and C. Andrieu, “On sequential Monte
tion, 2nd ed. New York: Academic Press, 1984. Carlo sampling methods for Bayesian filtering,” Statist. Comput.,
[117] J. F. G. de Freitas, “Bayesian methods for neural networks,” vol. 10, pp. 197–208, 2000.
Ph.D. thesis, Dept. Eng., Univ. Cambridge, 1998. Available on [144] A. Doucet, N. de Freitas, K. Murphy, and S. Russell, “Rao-
line http://www.cs.ubc.ca/∼ nando/publications.html. Blackwellised particle filtering for dynamic Bayesian networks,”
[118] ———, “Rao-Blackwellised particle filtering for fault diagno- in Proc. UAI2000, pp. 176–183, 2000.
sis,” in Proc. IEEE Aerospace Conf., vol. 4, pp. 1767–1772, 2002. [145] A. Doucet, N. Gordon, and V. Krishnamurthy, “Stochastic
[119] J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet, sampling algorithms for state estimation of jump Markov linear
“Sequential Monte Carlo methods to train neural network mod- systems,” IEEE Trans. Automat. Contr., vol. 45, pp. 188– , Jan.
els,” Neural Comput., vol. 12, no. 4, pp. 955–993, 2000. 2000.
[120] J. F. G. de Freitas, P. Højen-Sørensen, M. Jordan, and S. Rus- [146] ———, “Particle filters for state estimation of jump Markov
sell, “Variational MCMC,” Tech. Rep., UC Berkeley, 2001. linear systems,” IEEE Trans. Signal Processing, vol. 49, pp. 613–
[121] P. Del Moral, “Non-linear filtering using random particles,” 624, Mar. 2001.
Theo. Prob. Appl., vol. 40, no. 4, pp. 690–701, 1996. [147] A. Doucet, S. J. Godsill, and M. West, “Monte Carlo filtering
[122] ———, “Non-linear filtering: Interacting particle solution,” and smothing with application to time-varying spectral estima-
Markov Processes Related Fields, vol. 2, no. 4, pp. 555–580, 1996. tion,” in Proc. ICASSP2000, vol. 2, pp. 701–704, 2000.
[123] P. Del Moral and G. Salut, “Particle interpretation of non-linear [148] ———, “Maximum a posteriori sequence estimation using
filtering and optimization,” Russian J. Mathematical Physics, Monte Carlo particle filters,” Ann. Inst. Stat. Math., vol. 52, no.
vol. 5 , no. 3, pp. 355–372, 1997. 1, pp. 82–96, 2001.
[124] P. Del Moral and A. Guionnet, “Central limit theorem for [149] A. Doucet and V. B. Tadic, “Parameter estimation in gen-
nonlinear filtering and interacting particle systems,” Ann. Appl. eral state-space models using particle methods,” Ann. Inst. Stat.
Prob., vol. 9, pp. 275–297, 1999. Math., 2003.
[125] ———, “Large deviations for interacting particle systems: Ap- [150] A. Doucet, C. Andrieu, and M. Davy, “Efficient particle fil-
plications to nonlinear filtering problems” Stochast. Process. Ap- tering for jump Markov systems - Applications to time-varying
plicat., vol. 78, pp. 69–95, 1998. autoregressions,” IEEE Trans. Signal Processing, 2003.
[126] P. Del Moral and M. Ledoux, “On the convergence and the [151] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth,
applications of empirical processes for interacting particle systems “Hybrid Monte Carlo,” Phys. Lett. B, vol. 195, pp. 216–222, 1987.
and nonlinear filtering,” J. Theoret. Prob., vol. 13, no. 1, pp. 225– [152] J. Durbin and S. J. Koopman, “Monte Carlo maximum
257, 2000. likelihood estimation for non-Gaussian state space models,”
[127] P. Del Moral and L. Miclo, “Branching and interacting particle Biometrika, vol. 84, pp. 1403–1412, 1997.
systems approximations of Feynman-Kac formulae with applications [152] J. Durbin and S. J. Koopman, “Monte Carlo maximum
to nonlinear filtering,” in Seminaire de Probabilities XXXIV, Lec- based on state space models from both classical and Bayesian
ture Notes in Mathematics, no. 1729, pp. 1–145, Berlin: Springer- perspectives,” J. Roy. Statist. Soc., Ser. B, vol. 62, pp. 3–56,
Verlag, 2000. 2000.
[128] P. Del Moral, J. Jacod, and Ph. Protter, “The Monte-Carlo [155] ———, The Jackknife, the Bootstrap and Other Resampling
method for filtering with discrete-time observations,” Probability Plans, Philadelphia: SIAM, 1982.
Theory and Related Fields, vol. 120, pp. 346–368, 2001. [156] B. Efron and R. J. Tibshirani, An Introduction to the Boot-
[129] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum strap, London: Chapman & Hall, 1994.
likelihood from incomplete data via the EM algorithm,” J. Roy. [157] G. A. Einicke and L. B. White, “Robust extended Kalman fil-
Statist. Soc., Ser. B, vol. 39, pp. 1–38, 1977. strap, London: Chapman & Hall, 1994.
[130] J. Deutscher, A. Blake and I. Reid, “Articulated body motion [157] G. A. Einicke and L. B. White, “Robust extended Kalman fil-
capture by annealed particle filtering,” in Proc. Conf. Computer ter,” IEEE Trans. Signal Processing, vol. 47, no. 9, pp. 2596–
Vision and Pattern Recognition (CVPR), 2000, vol. 2, pp. 126– 2599, Sept. 1999.
133. [158] Y. Ephraim, “Bayesian estimation approach for speech en-
[131] L. Devroye, Non-uniform Random Variate Generation, Berlin: hancement using hidden Markov models,” IEEE Trans. Signal
Springer, 1986. Processing, vol. 40, no. 4, pp. 725–735, April 1992.
[132] X. Dimakos, “A guide to exact simulation,” Int. Statist. Rev., [159] Y. Ephraim and N. Merhav, “Hidden Markov processes,” IEEE
vol. 69, 27–48, 2001. Trans. Informat. Theory, vol. 48, no. 6, pp. 1518–1569, June 2002.
[133] P. M. Djurić, Y. Huang, and T. Ghirmai, “Perfect sampling: A [160] R. Everson and S. Roberts, “Particle filters for non-stationary
review and applications to signal processing,” IEEE Trans. Signal ICA,” in Advances in Independent Component Analysis, pp. 23-
Processing, vol. 50, no. 2, pp. 345–356, 2002. 41, Springer, 2000.
[134] P. M. Djurić, J. H. Kotecha, J.-Y. Tourneret, and S. Lesage, [161] P. Fearnhead, “Sequential Monte Carlo methods in filter
“Adaptive signal processing by particle filters and discounting of theory,” Ph.D. thesis, Univ. Oxford, 1998. Available on line
old measurements,” in Proc. ICASSP’01, vol. 6, pp. 3733–3736, http://www.stats.ox.ac.uk/∼ fhead/thesis.ps.gz.
2001. [162] ———,, “Particle filters for mixture models with unknown
[135] P. M. Djurić and J-H. Chun, “An MCMC sampling approach number of components,” paper preprint, 2001. Available on line
to estimation of nonstationary hidden Markov models,” IEEE http://www.maths.lancs.ac.uk/∼ fearnhea/.
Trans. Signal Processing, vol. 50, no. 5, pp. 1113–1122, 2002. [163] ———, “MCMC, sufficient statistics, particle filters,” J. Com-
[136] P. M. Djurić and J. H. Kotecha, “Estimation of non-Gaussian put. Graph. Statist., vol. 11, pp. 848–862, 2002.
autoregressive processes by particle filter with forgetting factors,” [164] P. Fearnhead and P. Clifford, “Online inference for well-log
in Proc. IEEE-EURASIP Workshop on Nonlinear Signal and Im- data,” J. Roy. Statist. Soc. Ser. B., paper preprint, 2002.
age Processing, 2001. [165] L. A. Feldkamp, T. M. Feldkamp, and D. V. Prokhorov, “Neu-
[137] P. C. Doerschuk, “Cramér-Rao bounds for discrete-time non- ral network training with the nprKF,” in Proc. IJCNN01, pp.
linear filtering problems,” IEEE Trans. Automat. Contr., vol. 40, 109–114.
no. 8, pp. 1465–1469, 1995. [166] M. Ferrante and W. J. Runggaldier, “On necessary conditions
[138] J. L. Doob, Stochastic Processes. New York: Wiley, 1953. for existence of finite-dimensional filters in discrete time,” Syst.
[139] H. Doss, J. Sethuraman, and K. B. Athreya, “On the conver- Contr. Lett., vol. 14, pp. 63–69, 1990.
[167] G. S. Fishman, Monte Carlo - Concepts, Algorithms and Ap- ing, navigation, and tracking,” IEEE Trans. Signal Processing,
plications, New York, Springer, 1996. vol. 50, no. 2, pp. 425–436, 2002.
[168] W. Fong, S. J. Godsill, A. Doucet, and M. West, “Monte Carlo [196] J. H. Halton, “A retrospective and prospective survey of the
smoothing with application to audio signal processing,” IEEE Monte Carlo method,” SIAM Rev., vol. 12, pp. 1–63, 1970.
Trans. Signal Processing, vol. 50, no. 2, pp. 438–448, Feb. 2002. [197] J. M. Hammersley and K. W. Morton, “Poor man’s Monte
[169] G. D. Forney, “The Viterbi algorithm,” Proc. IEEE, vol. 61, Carlo,” J. Roy. Statist. Soc. Ser. B, vol. 16, pp. 23–38, 1954.
pp. 268–278, Mar. 1973. [198] J. M. Hammersley and D. C. Hanscomb, Monte Carlo Methods,
[170] D. Fox, “KLD-sampling: Adaptive particle filters,” in Adv. London: Chapman & Hall, 1964.
Neural Inform. Process. Syst. 14, Cambridge, MA: MIT Press, [199] J. E. Handschin and D. Q. Mayne, “Monte Carlo techniques to
2002. estimate conditional expectation in multi-state non-linear filter-
[171] S. Frühwirth-Schnatter, “Applied state space modelling of non- ing,” Int. J. Contr., vol. 9, no. 5, pp. 547–559, 1969.
Gaussian time series using integration-based Kalman filtering,” [200] J. E. Handschin, “Monte Carlo techniques for prediction and
Statist. Comput., vol. 4, pp. 259–269, 1994. filtering of non-linear stochastic processes,” Automatica, vol. 6,
[172] D. Gamerman, Markov Chain Monte Carlo: Stochastic Simu- pp. 555–563, 1970.
lation for Bayesian Inference, London: Chapman & Hall, 1997. [201] B. Hanzon, “A differential-geometric approach to approximate
[173] A. Gelb, Ed. Applied Optimal Estimation, Cambridge, MA: nonlinear filtering,” in Geometrization of Statistical Theory, C.
MIT Press, 1974. T. J. Dodson, Ed., Univ. Lancaster: ULMD Pub., pp. 219–223,
[174] A. Gelfand and A. F. M. Smith, “Sampling-based approaches [202] P. J. Harrison and C. F. Stevens, “Bayesian forecasting (with
to calculating marginal densities,” J. Amer. Statist. Assoc., vol. discussion),” J. Roy. Statist. Soc. Ser. B, vol. 38, pp. 205–247,
85, pp. 398–409, 1990. discussion),” J. Roy. Statist. Soc. Ser. B, vol. 38, pp. 205–247,
[175] A. Gelman and D. B. Rubin, “Inference from iterative algo- 1976.
rithms (with discussions),” Statist. Sci., vol. 7, pp. 457–511, 1992. [203] W. K. Hastings, “Monte Carlo sampling methods using Markov
[176] A. Gelman and X.-L. Meng, “Simulating normalizing constants: chains and their applications,” Biometrika, vol. 57, pp. 97–109,
From importance sampling to bridge sampling to path sampling,” 1970.
Statist. Sci., vol. 13, pp. 163–185, 1998. [204] S. Haykin, Adaptive Filter Theory, 4th ed. Upper Saddle River,
[177] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distri- NJ: Prentice-Hall, 2002.
butions, and the Bayesian restoration of images,” IEEE Trans. [205] ———, Ed., Kalman Filtering and Neural Networks, New
Pattern Anal. Machine Intell., vol. 6, pp. 721–741, 1984. York: Wiley, 2001.
[178] J. E. Gentle, Random Number Generation and Monte Carlo, [206] S. Haykin and B. Widrow, Eds., Least-Mean-Squares Filters,
2nd ed., Berlin: Springer-Verlag, 2002. New York: Wiley, 2003.
[179] J. Geweke, “Bayesian inference in Econometrics models using [207] S. Haykin and N. de Freitas, Eds, Sequential State Estimation,
Monte Carlo integration,” Econometrica, vol. 57, pp. 1317–1339, forthcoming special issue Proc. IEEE, 2003.
1989.
[208] S. Haykin, P. Yee, and E. Derbez, “Optimum nonlinear filter,”
[180] J. Geweke and H. Tanizaki, “On Markov chain Monte Carlo
IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2774-2786,
methods for nonlinear and non-gaussian state-space models,”
1997.
Commun. Stat. Simul. C, vol. 28, pp. 867–894, 1999.
[209] S. Haykin, K. Huber, and Z. Chen, “Bayesian sequential state
[181] C. Geyer, “Practical Markov chain Monte Carlo (with discus-
estimation for MIMO wireless communication,” submitted to
sions),” Statist. Sci., vol. 7, pp. 473–511, 1992.
Proc. IEEE.
[182] Z. Ghahramani, “Learning dynamic Bayesian networks,” in
[210] D. M. Higdon, “Auxiliary variable methods for Markov chain
Adaptive Processing of Sequence and Data Structure, C. L.
Monte Carlo with applications,” J. Amer. Statist. Assoc., vol. 93,
Giles and M. Gori, Eds. Lecture Notes in Artificial Intelligence,
pp. 585–595, 1998.
Springer-Verlag, 1998, pp. 168–197.
[183] ———, “An introduction to hidden Markov models and [211] T. Higuchi, “Monte Carlo filter using the genetic algorithm
Bayesian networks,” Int. J. Pattern Recognition and Artificial operators,” J. Statist. Comput. Simul., vol. 59, no. 1, pp. 1–23,
Intelligence, vol. 15, no. 1, pp. 9–42, 2001. 1997.
[184] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, Eds. [212] Y. C. Ho and R. C. K. Lee, “A Bayesian approach to prob-
Markov Chain Monte Carlo Methods in Practice, London: Chap- lems in stochastic estimation and control,” IEEE Trans. Automat.
man & Hall, 1996. Contr., vol. 9, pp. 333–339, Oct. 1964.
[185] W. R. Gilks and C. Berzuini, “Following a moving target – [213] A. Honkela, “Nonlinear switching state-space models,” Master
Monte Carlo inference for dynamic Bayesian models,” J. Roy. Thesis, Helsinki Univ. Technology, 2001.
Statist. Soc., Ser. B, vol. 63, pp. 127–146, 2001. pp. 585–595, 1998.
[186] W. R. Gilks and P. Wild, “Adaptive rejection sampling for [215] K. Huber and S. Haykin, “Application of particle filters to
Gibbs sampling,” J. Roy. Statist. Soc. Ser. C, vol. 41, pp. 337– MIMO wireless communications,” in Proc. IEEE Int. Conf.
348, 1992. Commu., ICC2003, pp. 2311–2315.
[187] R. D. Gill and B. Y. Levit, “Application of the van Trees in- [216] C. Hue, J. Le Cadre, and P. Pérez, “Sequential Monte Carlo
equality: A Bayesian Cramér-Rao bound,” Bernoulli, vol. 1, no. methods for multiple target tracking and data fusion,” IEEE
1/2, pp. 59–79, 1995. Trans. Signal Processing, vol. 50, no. 2, pp. 309–325, 2002.
[188] S. Godsill and T. Clapp, “Improved strategies for Monte Carlo [217] ———, “Tracking multiple objects with particle filtering,”
particle filters,” in Sequential Monte Carlo Methods in Practice, IEEE Trans. Aero. Electr. Syst., vol. 38, no. 3, pp. 791-812, 2002.
A. Doucet, J. F. G. de Freitas, N. J. Gordon, Eds. Berlin: Springer [218] ———, “Performance analysis of two sequential Monte Carlo
Verlag, 2001. methods and posterior Cramér-Rao bounds for multi-target track-
[189] S. Godsill, A. Doucet, and M. West, “Maximum a posteriori ing,” Tech. Rep., no. 4450, INRIA, 2002.
sequence estimation using Monte Carlo particle filters,” in Ann. [219] Q. Huo, and C.-H. Lee, “On-line adaptive learning of the con-
Inst. Statist. Math., vol. 53, no. 1, pp. 82–96, 2001. tinuous density hidden Markov model based on approximate re-
[190] N. Gordon, “Bayesian methods for tracking,” Ph.D. thesis, cursive Bayes estimate,” IEEE Trans. Speech Audio Processing,
Univ. London, 1993. vol. 5, pp. 161–172, 1997.
[191] ———, “A hybrid bootstrap filter for target tracking in clut- [220] ———, “A Bayesian predictive approach to robust speech
ter,” IEEE Trans. Aerosp. Elect. Syst., vol. 33, pp. 353–358, 1997. recognition,” IEEE Trans. Speech Audio Processing, vol. 8, pp.
[192] N. Gordon, D. Salmond, and A. F. M. Smith, “Novel approach 200–204, 2000.
to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proc. recognition,” IEEE Trans. Speech Audio Processing, vol. 8, pp.
-F Radar, Sonar Navig., vol. 140, pp. 107–113, 1993. els,” Ph.D. thesis, Dept. Math., ETH Zürich, Zürich, 1998.
[193] P. J. Green, “Reversible jump Markov chain Monte Carlo com- [222] M. Hürzeler and H. R. Künsch, “Monte Carlo approximations
putation and Bayesian model determination” Biometrika, vol. 82, for general state-space models ,” J. Comput. Graphical Statist.,
pp. 711–732, 1995. vol. 7, no. 2, pp. 175–193, 1998.
[194] M. S. Grewal, Kalman Filtering: Theory and Practice, Engle- [223] ———, “Approximating and Maximising the likelihood for a
wood Cliffs, NJ: Prentice-Hall, 1993. general state-space model,” in Sequential Monte Carlo Methods
[195] F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssel, J. Jans- in Practice, A. Doucet, J. F. G. de Freitas, N. J. Gordon, Eds.
son, R. Karlsson, and P.-J. Nordlund, “Particle filters for position- Berlin: Springer Verlag, 2001.
[224] Y. Iba, “Population Monte Carlo algorithms,” Trans. Japanese tion algorithms for dynamic probabilistic networks,” in Proc. 11th
Soc. Artificial Intell., vol. 16, no. 2, pp. 279–286, 2001. Conf. UAI, pp. 346–351, 1995.
[225] R. A. Iltis, “State estimation using an approximate reduced [255] S. A. Kassam and H. V. Poor, “Robust statistics for signal
statistics algorithm,” IEEE Trans. Aero. Elect. Syst., vol. 35, no. processing,” Proc. IEEE, vol. 73, no. 3, pp. 433–481, 1985.
4, pp. 1161–1172, Oct. 1999. [256] J. K. Kim, “A note on approximate Bayesian bootstrap impu-
[226] D. R. Insua and F. Ruggeri, Eds. Robust Bayesian Analysis, tation,” Biometrika, vol. 89, no. 2, pp. 470–477, 2002.
Lecture Note in Statistics 152, Berlin: Springer, 2000. [257] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, “Opti-
[227] M. Irwin, N. Cox, and A. Kong, “Sequential imputation for mization by simulated annealing,” Science, vol. 220, pp. 671–680,
multilocus linkage analysis,” Proc. Natl. Acad. Sci., vol. 91, pp. 1983.
11684–11688, 1994. [258] G. Kitagawa, “Non-Gaussian state-space modeling of non-
[228] M. Isard, “Visual motion analysis by probabilistic propagation stationary time series,” J. Amer. Statist. Assoc., vol. 82, pp. 503–
of conditional density,” D.Phil. Thesis, Oxford Univ., 1998. Avail- 514, 1987.
able on line http://research.microsoft.com/users/misard/ [259] ———, “Monte Carlo filter and smoother for non-Gaussian non-
[229] M. Isard and A. Blake, “Contour tracking by stochastic propa- linear state space models,” J. Comput. Graph. Statist., vol. 5, no.
gation of conditional density,” in Proc. 4th European Conf. Com- 1, pp. 1–25, 1996.
puter Vision, vol. 1, pp. 343–356, 1996. [260] ———, “Self-organising state space model,” J. Amer. Statist.
[230] ———, “CONDENSATION: conditional density propagation Assoc., vol. 93, pp. 1203–1215, 1998.
for visual tracking,” Int. J. Comput. Vis., vol. 29, no. 1, pp. 5– [261] G. Kitagawa and W. Gersch, Smoothness Priors Analysis
28, 1998. of Time Series, Lecture Notes in Statistics, 116, New York:
[231] ———, “ICONDENSATION: Unifying low-level and high-level Springer-Verlag, 1996.
tracking in a stochastic framework,” in Proc. 5th European Conf. [262] D. Koller and R. Fratkina, “Using learning for approximation in
Computer Vision, vol. 1, pp. 893–908, 1998. stochastic processes,” in Proc. 15th Int. Conf. Machine Learning,
[232] ———, “A smoothing filter for Condensation,” in Proc. 5th 1998, pp. 287–295.
European Conf. Computer Vision, vol. 1, pp. 767–781, 1998. [263] D. Koller and U. Lerner, “Sampling in Factored Dynamic Sys-
[233] Kiyosi Itô, “On a formula concerning stochastic differentials,” tems,” in Sequential Monte Carlo Methods in Practice, A. Doucet,
Nagoya Math. J., vol. 3, pp. 55–65, 1951. J.F.G. de Freitas, and N. Gordon, Eds., Springer-Verlag, 2001.
[234] K. Ito and K. Xiong, “Gaussian filters for nonlinear filtering [264] A. N. Kolmogorov, “Stationary sequences in Hilbert spaces,”
problems,” IEEE Trans. Automat. Contr., vol. 45, no. 5, pp. 910- Bull. Math. Univ. Moscow (in Russian), vol. 2, no. 6, p. 40, 1941.
927, 2000. [265] ———, “Interpolation and extrapolation of stationary random
[235] K. Ito, “Approximation of the Zakai equation for nonlinear fil- sequences,” Izv. Akad. Nauk USSR, Ser. Math., vol. 5, no. 5, pp.
tering,” SIAM J. Contr. Optim., vol. 34, pp. 620–634, 1996. 3–14, 1941.
[236] T. Jaakkola, “Tutorial on variational approximation methods,” [266] A. Kong, J. S. Liu, and W. H. Wong, “Sequential imputations
in Advanced Mean Field Methods: Theory and Practice, D. Saad and Bayesian missing data problems,” J. Amer. Statist. Assoc.,
and M. Opper, Eds. Cambridge, MA: MIT Press, 2001. vol. 89, pp. 278–288, 1994.
[237] T. Jaakkola and M. Jordan, “Bayesian parameter estimation [267] A. Kong, P. McCullagh, D. Nicolae, Z. Tan and X.-L. Meng,
via variational methods,” Statist. Comput., vol. 10, pp. 25–37, “A theory of statistical models for Monte Carlo integration,” J.
2000. Roy. Statist. Soc. Ser. B, vol. 65, 2003.
[238] A. H. Jazwinski, Stochastic Processes and Filtering Theory, [268] J. H. Kotecha and P. M. Djurić, “Gaussian sum particle filtering
New York: Academic Press, 1970. for dynamic state space models,” in Proc. ICASSP2001, pp. 3465–
[239] F. V. Jensen, An Introduction to Bayesian Networks, New 3468, 2001.
York: Springer-Verlag, 1996. [269] ———, “Sequential Monte Carlo sampling detector for
[240] ———, Bayesian Networks and Decision Graphs, Berlin: Rayleigh fast-fading channels,” in Proc. ICASSP2000, vol. 1, pp.
Springer, 2001. 61–64, 2000.
[241] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, “An in- [270] S. C. Kramer, “The Bayesian approach to recursive state esti-
troduction to variational methods for graphical models,” Machine mation: Implementation and application,” Ph.D. thesis, UC San
Learning, vol. 37, no. 2, pp. 183–233, 1999. Diego, 1985.
[242] S. Julier and J. Uhlmann, “A new extension of the Kalman [271] S. C. Kramer and H. W. Sorenson, “Recursive Bayesian esti-
filter to nonlinear systems,” in Proc. AeroSense, 1997. mation using piece-wise constant approximations,” Automatica,
[243] S. Julier, J. Uhlmann, and H. F. Durrant-Whyte, “A new vol. 24, pp. 789–901, 1988.
method for nonlinear transformation of means and covariances [272] ———, “Bayesian parameter estimation,” IEEE Trans. Au-
in filters and estimators,” IEEE Trans. Automat. Contr., vol. 45, tomat. Contr., vol. 33, pp. 217–222, 1988.
no. 3, pp. 477–482, 2000. [273] A. J. Krener, “Kalman-Bucy and minimax filtering,” IEEE
[244] T. Kailath, “A view of three decades of linear filtering theory,” Trans. Automat. Contr., vol. 25, pp. 291–292, 1980.
IEEE Trans. Inform. Theory, vol. 20, no. 2, pp. 146–181, 1974. [274] R. Kress, Linear Integral Equations (2nd ed.), Berlin: Springer-
[245] ———, “The innovations approach to detection and estimation Verlag, 1999.
theory,” Proc. IEEE, vol. 58, pp. 680–695, 1970. [275] V. Krishnan, Nonlinear Filtering and Smoothing: An Intro-
[246] ———, Lecture on Wiener and Kalman Filtering, New York: duction to Martingales, Stochastic Integrals and Estimation, New
Springer-Verlag, 1981. York: Wiley, 1984.
[247] T. Kailath, A. H. Sayed and B. Hassibi, Linear Estimation, [276] R. Kulhavý, “Recursive nonlinear estimation: A geometric ap-
Upper Saddle River, NJ: Prentice-Hall, 2000. proach,” Automatica, vol. 26, no. 3, pp. 545–555, 1990.
[248] G. Kallianpur, Stochastic Filtering Theory, New York: [277] ———, “Recursive nonlinear estimation: Geometry of a space
Springer-Verlag, 1980. of posterior densities,” Automatica, vol. 28, no. 2, pp. 313–323,
[249] R. E. Kalman and R. S. Bucy, “New results in linear filtering 1992.
and prediction theory,” Trans. ASME, Ser. D, J. Basic Eng., vol. [278] ———, Recursive Nonlinear Estimation: A Geometric Ap-
83, pp. 95–107, 1961. proach. Lecture Notes in Control and Information Sciences, 216,
[250] R. E. Kalman, “A new approach to linear filtering and predic- London: Springer-Verlag, 1996.
tion problems,” Trans. ASME, Ser. D, J. Basic Eng., vol. 82, pp. 3–14, 1941.
34–45, 1960. estimation to state estimation,” in Mathematical Theory of Net-
[251] ———, “When is a linear control system optimal?” Trans. works and Systems, A. Beghi, L. Finesso and G. Picci (Eds), pp.
ASME, Ser. D, J. Basic Eng., vol. 86, pp. 51–60, 1964. 827–830, 1998.
[252] ———, “Mathematical description of linear dynamical sys- [280] ———, “Quo vadis, Bayesian identification?” Int. J. Adaptive
tems” SIAM J. Contr., vol. 1, pp. 152–192, 1963. Control and Signal Processing, vol. 13, pp. 469–485, 1999.
[253] ———, “New methods in Wiener filtering theory,” in Proc. 1st [281] ———, “Bayesian smoothing and information geometry,” in
Symp. Engineering Applications of Random Function Theory and Learning Theory and Practice, J. Suykens Ed, IOS Press, 2003.
Probability J. Bogdanoff and F. Kozin, Eds., pp. 270–388, New [282] H. J. Kushner, “On the differential equations satisfied by con-
York: Wiley, 1963. ditional probability densities of Markov processes with applica-
[254] K. Kanazawa, D. Koller, and S. Russell, “Stochastic simula- tions,” SIAM J. Contr., vol. 2, pp. 106–119, 1965.
[283] ———, “Approximations to optimal nonlinear filters,” IEEE in Practice, A. Doucet, N. de Freitas, and N. J. Gordon, Eds. New
Trans. Automat. Contr., vol. 12, pp. 546–556, Oct. 1967. York: Springer, 2001.
[284] ———, “Dynamical equations for optimal nonlinear filtering,” [311] S. V. Lototsky, and B. L. Rozovskii, “Recursive nonlinear filter
J. Differential Equations, vol. 3, pp. 179–190, 1967. for a continuous-discrete time model: Separation of parameters
[285] ———, Probability Methods for Approximations in Stochastic and observations,” IEEE Trans. Automat. Contr., vol. 43, no. 8,
Control and for Elliptic Equations, New York: Academic Press, pp. 1154–1158, 1996.
1977. [312] S. V. Lototsky, R. Mikulevicius, and B. L. Rozovskii, “Non-
[286] H. J. Kushner and P. Dupuis, Numerical Methods for Stochas- linear filtering revisited: A spectral approach,” SIAM J. Contr.
tic Control Problems in Continuous Time, New York: Springer- Optim., vol. 35, pp. 435–461, 1997.
Verlag, 1992. [313] J. MacCormick and A. Blake, “A probabilistic exclusion prin-
[287] H. Kushner and A. S. Budhiraja, “A nonlinear filtering algo- ciple for tracking multiple objects,” in Proc. Int. Conf. Comput.
rithm based on an approximation of the conditional distribution,” Vision, 1999, pp. 572–578.
IEEE Trans. Automat. Contr., vol. 45, no. 3, pp. 580–585, March [314] J. MacCormick and M. Isard, “Partitioned sampling, articu-
2000. lated objects, and interface-quality hand tracking,” Tech. Rep.,
[288] C. Kwok, D. Fox, and M. Meila, “Real-time particle filter,” Dept. Eng. Sci., Univ. Oxford, 2000.
in Adv. Neural Inform. Process. Syst. 15, Cambridge, MA: MIT [315] S. N. MacEachern, M. Clyde, and J. S. Liu, “Sequential im-
Press, 2003. portance sampling for nonparametric Bayes models: The next
[289] D. G. Lainiotis, “Optimal nonlinear estimation,” Int. J. Contr., generation,” Canadian J. Statist., vol. 27, pp. 251–267, 1999.
vol. 14, no. 6, pp. 1137–1148, 1971. [316] D. J. C. MacKay, “Bayesian methods for adaptive models,”
[290] J-R. Larocque, J. P. Reilly, and W. Ng, “Particle filters for Ph.D. thesis, Dept. Computation and Neural Systems, Caltech,
tracking an unknown number of sources,” IEEE Trans. Signal 1992. Available on line http://wol.ra.phy.cam.ac.uk/mackay/.
Processing, vol. 50, no. 12, pp. 2926–2937, 2002. [317] ———, “Probable networks and plausible predictions - A re-
[291] D. S. Lee and N. K. Chia, “A particle algorithm for sequen- view of practical Bayesian methods for supervised neural net-
tial Bayesian parameter estimation and model selection,” IEEE works,” Network, vol. 6, pp. 469–505, 1995.
Trans. Signal Processing, vol. 50, no. 2, pp. 326–336, Feb. 2002. [318] ———, “Introduction to Monte Carlo methods,” in Learning in
[292] F. LeGland, “Monte-Carlo methods in nonlinear filtering,” in Graphical Models, M. Jordan Ed., Kluwer Academic Publishers,
Proc. IEEE Conf. Decision and Control, pp. 31–32, 1984. 1998.
[293] ———, “Stability and approximation of nonlinear filters: An [319] ———, “Choice of basis for Laplace approximation,” Machine
information theoretic approach,” in Proc. 38th Conf. Decision Learning, vol. 33, no. 1, pp. 77–86, 1998.
and Control, pp. 1889–1894, 1999. [320] D. M. Malakoff, “Bayes offers ‘new’ way to make sense of num-
[294] F. LeGland, and N. Oudjane, “Stability and uniform approxi- bers,” Science, vol. 286, pp. 1460–1464, 1999.
mation of nonlinear filters using the Hilbert metric, and applica- [321] B. Manly, Randomization, Bootstrap and Monte Carlo Methods
tion to particle filters,” in Proc. 39th Conf. Decision and Control, in Biology, 2nd ed., CRC Press, 1997.
pp. 1585-1590, 2000. [322] Z. Mark and Y. Baram, “The bias-variance dilemma of
[295] P. L’Ecuyer and C. Lemieux, “Variance reduction via lattice the Monte Carlo method,” in Artificial Neural Networks
rules,” Management Sci., vol. 46, pp. 1214–1235, 2000. (ICANN2001), G. Dorffner, H. Bischof, and K. Hornik, Eds.
[296] C. Lemieux and P. L’Ecuyer, “Using lattice rules for variance Berlin: Springer-Verlag, 2001.
reduction in simulation,” in Proc. 2000 Winter Simulation Conf., [323] ———, “Manifold stochastic dynamics for Bayesian learning,”
509–516, 2000. Neural Comput., vol. 13, pp. 2549–2572, 2001.
[297] N. Levinson, “The Wiener rms (root-mean-square) error crite- [324] A. Marshall, “The use of multi-stage sampling schemes in
rion in filter design and prediction,” J. Math. Phys., vol. 25, pp. Monte Carlo computations,” in Symposium on Monte Carlo
261–278, Jan. 1947. Methods, M. Meyer Ed. New York: Wiley, pp. 123–140, 1956.
[298] F. Liang, “Dynamically weighted importance sampling in [325] S. Maskell, Orton, and N. Gordon, “Efficient inference for con-
Monte Carlo computation” J. Amer. Statist. Assoc., vol. 97, 2002. ditionally Gaussian Markov random fields”, Tech. Rep. CUED/F-
[299] J. G. Liao, “Variance reduction in Gibbs sampler using quasi INFENG/TR439, Cambridge Univ., August 2002.
random numbers,” J. Comput. Graph. Statist., vol. 7, no. 3, pp. [326] S. McGinnity and G. W. Irwin, “Manoeuvring target track-
253–266, 1998. ing using a multiple-model bootstrap filter” in Sequential Monte
[300] T. M. Liggett, Interacting Particle Systems, Springer-Verlag, Carlo Methods in Practice, A. Doucet, N. de Freitas, and N. J.
1985. Gordon, Eds. New York: Springer, 2001.
[301] T-T. Lin and S. S. Yau, “Bayesian approach to the optimization [327] I. W. McKeague and W. Wefelmeyer, “Markov Chain Monte
of adaptive systems,” IEEE Trans. Syst. Sci. Cybern., vol. 3, no. Carlo and Rao-Blackwellization,” Statistical Planning and Infer-
2, pp. 77–85. ence, vol. 85, pp. 171–182, 2000.
[302] X. Lin, T. Kirubarajan, Y. Bar-Shalom, and S. Maskell, “Com- [328] X.-L. Meng and D. A. van Dyk, “Seeking efficient data aug-
parison of EKF, pseudomeasurement and particle filters for a mentation schemes via conditional and marginal augmentation,”
bearings-only target tracking problem,” in Proc. SPIE on Sig- Biometrika, vol. 86, pp. 301–320, 1999.
nal and Data Processing of Small Targets, vol. 4728, 2002. [329] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H.
[303] J. S. Liu and R. Chen, “Blind deconvolution via sequential Teller, and E. Teller, “Equations of state calculations by fast
imputation,” J. Amer. Statist. Assoc., vol. 90, pp. 567–576, 1995. computing machines,” J. Chem. Phys., vol. 21, pp. 1087–1091,
[304] ———, “Sequential Monte Carlo methods for dynamical sys- 1953.
tems,” J. Amer. Statist. Assoc., vol. 93, pp. 1032–1044, 1998. [330] N. Metropolis and S. Ulam, “The Monte Carlo method,” J.
[305] J. S. Liu, “Metropolized independent sampling with compar- Amer. Statist. Assoc., vol. 44, pp. 335–341, 1949.
isons to rejection sampling and importance sampling,” Statist. [331] J. Miguez and P. M. Djuric, “Blind equalization by sequen-
Comput., vol. 6, pp. 113–119, 1996. tial importance sampling,” in Proc. IEEE Symp. Circuit Syst.,
[306] ———, Monte Carlo Strategies in Scientific Computing, ISCAS’02, vol. 1, pp. 845–848, 2002.
Berlin: Springer, 2001. [332] A. Milstein, J. Sánchez, and E. T. Williamson, “Robust global
[307] J. S. Liu, R. Chen, and W. H. Wong, “Rejection control and localization using clustered particle filtering,” in Proc. 8th AAAI,
sequential importance sampling,” J. Amer. Statist. Assoc., vol. 2002.
93, pp. 1022–1031, 1998. [333] T. Minka, “A family of algorithms for approximate Bayesian
[308] J. S. Liu, R. Chen, and T. Logvinenko, “A theoretical frame- inference,” Ph.D. thesis, Department of Computer Science
work for sequential importance sampling with resampling,” in Se- and Electrical Engineering, MIT, 2001. Available on line
quential Monte Carlo Methods in Practice, A. Doucet, J. F. G. http://www.stat.cmu.edu/∼ minka/.
de Freitas, N. J. Gordon, Eds. Berlin: Springer Verlag, 2001. [334] ———, “Expectation propagation for approximate Bayesian
[309] J. S. Liu, F. Liang, and W. H. Wong, “A theory for dynamic inference,” in Proc. UAI’2001, 2001.
weighting in Monte Carlo computation,” J. Amer. Statist. Assoc., [335] ———, “Using lower bounds to approximate integrals,” Tech.
vol. 96, pp 561–573, 2001. Rep., Dept. Statist., CMU, 2001.
[310] J. Liu and M. West, “Combined parameter and state estimation [336] A. Mira, J. Møller, and G. Roberts, “Perfect slice samplers,”
in simulation-based filtering,” in Sequential Monte Carlo Methods J. Roy. Statist. Soc., Ser. B, vol. 63, pp. 593–606, 2001.
[337] A. W. Moore, C. G. Atkeson, and S. A. Schaal, “Locally [366] V. Peterka, “Bayesian approach to system identification,” in
weighted learning for control,” Artificial Intell. Rev., vol. 11, pp. Trends and Progress in System Identification, pp. 239–304, Perg-
75–113, 1997. amon Press, 1981.
[338] R. Morales-Menendez, N. de Freitas, and D. Poole, “Real-time [367] ———, “Bayesian system identification,” Automatica, vol. 17,
monitoring of complex industrial processes with particle filters,” sampling for condensation,” in Proc. Euro. Conf. Comp. Vis., vol.
in Adv. Neural Info. Process. Syst. 15, Cambridge, MA: MIT [368] V. Philomin, R. Duraiswami, and L. Davis, “Quasi-random
Press, 2003. sampling for condesation,” in Proc. Euro. Conf. Comp. Vis., vol.
[339] D. R. Morrell and W. C. Stirling, “Set-valued filtering and II, pp. 134–149, 2000.
smoothing,” IEEE Trans. Syst. Man Cybern., vol. 21, pp. 184– [369] M. Pitt and N. Shephard, “A fixed lag auxillary particle filter
193, 1991. with deterministic sampling rules,” unpublished paper, 1998.
[340] K. Mosegaard and M. Sambridge, “Monte Carlo analysis of [369] M. Pitt and N. Shephard, “A fixed lag auxiliary particle filter
inverse problems,” Inverse Problems, vol. 18, pp. 29–54, 2002. Amer. Statist. Assoc., vol. 94, pp. 590–599, 1999.
[341] P. Müller, “Monte Carlo integration in general dynamic mod- [370] ———, “Filtering via simulation: Auxiliary particle filter,” J.
els,” Contemporary Mathematics, vol. 115, pp. 145–163, 1991. Monte Carlo Methods in Practice, A. Doucet, J. F. G. de Freitas,
[342] ———, “Posterior integration in dynamic models,” Comput. N. J. Gordon, Eds. Berlin: Springer Verlag, 2001.
Sci. Statist., vol. 24, pp. 318–324, 1992. [372] A. Pole, M. West, and P. J. Harrison, Applied Bayesian Fore-
[343] K. Murphy, “Switching Kalman filter,” Tech. Rep., Dept. Com- casting and Time Series Analysis. New York: Chapman-Hall,
put. Sci., UC Berkeley, 1998. 1994.
[344] ———, “Dynamic Bayesian networks: Represen- [373] ———, “Non-normal and non-linear dynamic Bayesian mod-
tation, inference and learning,” Ph.D. thesis, Dept. elling,” in Bayesian Analysis of Time Series and Dynamic Mod-
Comput. Sci., UC Berkeley, 2002. Available on line els, J. C. Spall Ed., pp. 167–198, New York: Marcel Dekker, 1988.
http://www.ai.mit.edu/∼ murphyk/papers.html. [374] N. G. Polson, B. P. Carlin, and D. S. Stoffer, “A Monte-Carlo
[345] C. Musso, N. Oudjane, and F. LeGland, “Improving regularised approach to non-normal and nonlinear state-space modelling,” J.
particle filters,” in Sequential Monte Carlo Methods in Practice, Amer. Statist. Assoc., vol. 87, pp. 493–500, 1992.
A. Doucet, N. de Freitas, and N. J. Gordon, Eds. New York: [375] S. J. Press, Subjective and Objective Bayesian Statistics: Prin-
Springer, 2001. ciples, Models, and Applications (2nd ed.), New York: Wiley,
[346] R. Neal, Bayesian Learning for Neural Networks. Lecture Notes 2003.
in Statistics, 118, Berlin: Springer, 1996. [376] W. H. Press and G. R. Farrar, “Recursive stratified sampling
[347] ———, “An improved acceptance procedure for the hybrid for multidimensional Monte Carlo integration,” Computers in
Monte Carlo,” J. Comput. Phys., vol. 111, pp. 194–203, 1994. Physics, vol. 4, pp. 190–195, 1990.
[348] ———, “Sampling from multimodal distributions using tem- [377] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flan-
pered transitions,” Statist. Comput., vol. 6, pp. 353–366, 1996. nery, Numerical Recipes in C, 2nd ed., Cambridge Univ. Press,
[349] ———, “Suppressing random walks in Markov chain Monte 1997.
Carlo using ordered overrelaxation,” in Learning in Graphical [378] E. Punskaya, C. Andrieu, A. Doucet, and W. J. Fitzgerald,
Models, M. I. Jordan, Ed, pp. 205–228, Kluwer Academic Pub- “Particle filtering for demodulation in fading channels with non-
lishers, 1998. Gaussian additive noise,” IEEE Trans. Commu., vol. 49, no. 4,
[350] ———, “Annealed importance sampling,” Statist. Comput., pp. 579–582, Apr. 2001.
vol. 11, pp. 125–139, 2001.
[379] L. R. Rabiner, “A tutorial on hidden Markov models and se-
[351] ———, “Slice sampling (with discussions),” Ann. Statist., vol. lected applictions in speech recognition,” Proc. IEEE, vol. 77, no.
31, no. 3, June 2003. 2, pp. 257–285, Feb. 1989.
[352] A. T. Nelson, “Nonlinear estimation and modeling of noisy time
[380] L. R. Rabiner and B.-H. Juang, “An introduction to hidden
series by dual Kalman filtering methods,” Ph.D. thesis, Dept.
Markov models,” IEEE Acoust., Speech, Signal Processing Mag.,
Elect. Comput. Engin., Oregon Graduate Institute, 2000.
pp. 4–16, Jan. 1986.
[353] H. Niederreiter, Random Number Generation and Quasi-Monte
[381] ———, Fundamentals of Speech Recognition, Englewood Cliffs,
Carlo Methods, Philadelphia, PA: SIAM, 1992.
NJ: Prentice Hall, 1993.
[354] H. Niederreiter and J. Spanier, Eds. Monte Carlo and Quasi-
Monte Carlo Methods, Berlin: Springer-Verlag, 2000. [382] C. E. Rasmussen and Z. Ghahramani, “Bayesian Monte Carlo,”
in Adv. Neural Inform. Process. Syst. 15, Cambridge, MA: MIT
[355] M. Norgaard, N. Poulsen, and O. Ravn, “Adavances in
Press, 2003.
derivative-free state estimation for nonlinear systems,” Tech.
Rep., Technical Univ. Denmark, 2000. Available on-line [383] H. E. Rauch, “Solutions to linear smoothing problem,” IEEE
http://www.imm.dtu.dk/nkp/. Trans. Automat. Contr., vol. 8, pp. 371–372, 1963.
[356] B. North, A. Blake, M. Isard, and J. Rittscher, “Learning and [384] H. E. Rauch, T. Tung, and T. Striebel, “Maximum likelihood
classification of complex dynamics,” IEEE Trans. Pattern Anal. estimates of linear dynamic systems,” AIAA J., vol. 3, pp. 1445–
Mach. Intel., vol. 22, no. 9, pp. 1016–1034, Sept. 2000. 1450, 1965.
[357] J.P. Norton and G. V. Veres, “Improvement of the particle filter [385] I. B. Rhodes, “A tutorial introduction to estimation and filter-
by better choice of the predicted sample set,” in Proc. 15th IFAC, ing,” IEEE Trans. Automat. Contr., vol. 16, pp. 688–707, 1971.
pp. 904–909, 2002. [386] B. Ripley, Stochastic Simulation, New York: Wiley, 1987.
[358] G. W. Oehlert, “Faster adaptive importance sampling in low di- [387] H. Risken, The Fokker-Planck Equation (2nd ed.), Berlin:
mensions,” J. Comput. Graph. Statist., vol. 7, pp. 158–174, 1998. Springer-Verlag, 1989.
[359] M.-S. Oh, “Monte Carlo integration via importance sampling: [388] C. P. Robert, The Bayesian Choice: A Decision-Theoretic Mo-
Dimensionality effect and an adaptive algorithm,” Contemporary tivation (2nd ed.), New York: Springer, 2001.
Mathematics, vol. 115, pp. 165–187, 1991. [389] C. P. Robert and G. Casella, Monte Carlo Statistical Methods,
[360] B. Oksendal, Stochastic Differential Equations (5th ed.), Berlin: Springer, 1999.
Berlin: Springer, 1998. [390] C. P. Robert, T. Rydén, and D. M. Titterington, “Bayesian
[361] D. Ormoneit, C. Lemieux and D. Fleet, “Lattice particle fil- inference in hidden Markov models through the reverse jump
ters,” in Proc. UAI2001, 2001, pp. 395–402. Markov chain Monte Carlo method,” J. Roy. Statist. Soc., Ser.
[362] M. Orton and W. Fitzgerald, “A Bayesian approach to tracking B, vol. 62, pp. 57–75, 2000.
multiple targets using sensor arrays and particle filters,” IEEE [391] C. P. Robert, C. Celeux, and J. Diebolt, “Bayesian estimation
Trans. Signal Processing, vol. 50, no. 2, pp. 216–223, Feb. 2002. of hidden Markov chains: A stochastic implementation,” Statist.
[363] M. Ostland and B. Yu, “Exploring quasi Monte Carlo for Probab. Lett., vol. 16, pp. 77–83, 1993.
marginal density approximation,” Statist. Comput., vol. 7, pp. [392] G. O. Roberts and J. S. Rosenthal, “Markov chain Monte Carlo:
217–228, 1997. Some practical implications of theoretical results,” Can. J. Stat.,
[364] N. Oudjane and C. Musso, “Progressive correction for reguarl- vol. 25, pp. 5–31, 1998.
ized particle filters,” in Proc. 3rd Int. Conf. Inform. Fusion, 2000, [393] M. N. Rosenbluth and A. W. Rosenbluth, “Monte Carlo cal-
Paris, ThB2-2. culation of the average extension of molecular chains,” J. Chem.
[365] N. Oudjane, “Stabilité et approximations particulaires en fil- Phys., vol. 23, pp. 356–359, 1955.
trage non-linéaire. Application au pistage,” Ph.D. thesis (in [394] D. B. Rubin, “Multiple imputations in sample survey: A phe-
French), Université de Rennes, 2000. nomeonological Bayesian approach to nonresponse,” in Proc. Sur-
[395] ———, Multiple Imputation for Nonresponse in Surveys, New York: Wiley, 1987.
[396] ———, “Comment on ‘The calculation of posterior distributions by data augmentation’ by M. A. Tanner and W. H. Wong,” J. Amer. Statist. Assoc., vol. 82, pp. 543–546, 1987.
[397] ———, “Using the SIR algorithm to simulate posterior distributions,” in Bayesian Statistics 3, J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, Eds., pp. 395–402, Oxford Univ. Press, 1988.
[398] Y. Rui and Y. Chen, “Better proposal distributions: Object tracking using unscented particle filter,” in Proc. CVPR 2001, vol. II, pp. 786–793, 2001.
[399] W. J. Runggaldier and F. Spizzichino, “Finite dimensionality in discrete time nonlinear filtering from a Bayesian statistics viewpoint,” in Stochastic Modeling and Filtering, A. Germani, Ed., Lecture Notes in Control and Information Science 91, pp. 161–184, Berlin: Springer, 1987.
[400] J. S. Rustagi, Variational Methods in Statistics, New York: Academic Press, 1976.
[401] D. Saad and M. Opper, Eds., Advanced Mean Field Methods: Theory and Practice, Cambridge, MA: MIT Press, 2001.
[402] A. P. Sage and J. L. Melsa, Estimation Theory with Applications to Communications and Control, McGraw-Hill, 1973.
[403] A. H. Sayed and T. Kailath, “A state-space approach to adaptive RLS filtering,” IEEE Signal Processing Mag., vol. 11, pp. 18–60, 1994.
[404] M. Schetzen, “Nonlinear system modeling based on the Wiener theory,” Proc. IEEE, vol. 69, pp. 1557–1572, 1981.
[405] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Cambridge, MA: MIT Press, 2002.
[406] D. Schulz, W. Burgard, D. Fox, and A. B. Cremers, “Tracking multiple moving targets with a mobile robot using particle filters and statistical data association,” in Proc. 2001 IEEE Int. Conf. Robotics & Automation, pp. 1665–1670, 2001.
[407] L. Shan and P. C. Doerschuk, “Performance bounds for nonlinear filters,” IEEE Trans. Aerosp. Elect. Syst., vol. 33, no. 1, pp. 316–318, 1997.
[408] J. Shao and D. Tu, The Jackknife and the Bootstrap, Springer, 1996.
[409] N. Shephard and M. K. Pitt, “Likelihood analysis of non-Gaussian measurement time series,” Biometrika, vol. 84, pp. 653–667, 1997.
[410] M. Šimandl, J. Královec, and P. Tichavský, “Filtering, predictive, and smoothing Cramér-Rao bounds for discrete-time nonlinear dynamic systems,” Automatica, vol. 37, pp. 1703–1716, 2001.
[411] M. Šimandl and O. Straka, “Nonlinear estimation by particle filters and Cramér-Rao bound,” in Proc. 15th IFAC’2002, 2002.
[412] I. N. Sinitsyn, “Ill-posed problems of on-line conditionally optimal filtering,” in Ill-Posed Problems in Natural Sciences, A. Tikhonov, Ed., VSP/TVP, The Netherlands, 1992.
[413] I. H. Sloan and S. Joe, Lattice Methods for Multiple Integration, Oxford: Clarendon Press, 1994.
[414] A. F. M. Smith and A. E. Gelfand, “Bayesian statistics without tears: A sampling-resampling perspective,” Am. Stat., vol. 46, no. 4, pp. 84–88, 1992.
[415] P. J. Smith, M. Shafi, and H. Gao, “Quick simulation: A review of importance sampling techniques in communications systems,” IEEE J. Selected Areas Commu., vol. 15, no. 4, pp. 597–613, 1997.
[416] A. F. M. Smith and G. Roberts, “Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods,” J. Roy. Statist. Soc., Ser. B, vol. 55, no. 1, pp. 3–23, 1993.
[417] R. L. Smith, “The hit-and-run sampler: A globally reaching Markov chain sampler for generating arbitrary multivariate distributions,” in Proc. 28th Conf. Winter Simulation, pp. 260–264, New York: ACM Press, 1996.
[418] K. Sobczyk, Stochastic Differential Equations: With Applications to Physics and Engineering, Kluwer Academic Publishers, 1991.
[419] H. Solomon, “Buffon needle problem, extensions, and estimation of π,” in Geometric Probability, chap. 1, pp. 1–24, Philadelphia, PA: SIAM, 1978.
[420] H. W. Sorenson and A. R. Stubberud, “Nonlinear filtering by approximation of the a posteriori density,” Int. J. Contr., vol. 8, pp. 33–51, 1968.
[421] H. W. Sorenson and D. L. Alspach, “Recursive Bayesian estimation using Gaussian sums,” Automatica, vol. 7, pp. 465–479, 1971.
[422] H. W. Sorenson, “On the development of practical nonlinear filters,” Inform. Sci., vol. 7, pp. 253–270, 1974.
[423] ———, Ed., Kalman Filtering: Theory and Application, IEEE Press, 1985.
[424] ———, “Recursive estimation for nonlinear dynamic systems,” in Bayesian Analysis of Time Series and Dynamic Models, J. C. Spall, Ed., pp. 127–165, New York: Marcel Dekker, 1988.
[425] J. Spanier and E. H. Maize, “Quasi-random methods for estimating integrals using relatively small samples,” SIAM Rev., vol. 33, no. 1, pp. 18–44, 1994.
[426] J. Spragins, “A note on the iterative application of Bayes’ rule,” IEEE Trans. Inform. Theory, vol. 11, no. 4, pp. 544–549, 1965.
[427] K. Srinivasan, “State estimation by orthogonal expansion of probability distributions,” IEEE Trans. Automat. Contr., vol. 15, no. 1, pp. 3–10, 1970.
[428] P. Stavropoulos and D. M. Titterington, “Improved particle filters and smoothing,” in Sequential Monte Carlo Methods in Practice, A. Doucet, N. de Freitas, and N. J. Gordon, Eds., New York: Springer, 2001.
[429] J. C. Stiller and G. Radons, “Online estimation of hidden Markov models,” IEEE Signal Process. Lett., vol. 6, no. 8, pp. 213–215, 1999.
[430] R. L. Stratonovich, “Conditional Markov processes,” Theor. Prob. Appl. (USSR), vol. 5, pp. 156–178, 1960.
[431] ———, Conditional Markov Processes and Their Application to the Theory of Optimal Control, New York: Elsevier, 1968.
[432] G. Storvik, “Particle filters for state-space models with the presence of unknown static parameters,” IEEE Trans. Signal Processing, vol. 50, no. 2, pp. 281–289, Feb. 2002.
[433] V. B. Svetnik, “Applying the Monte Carlo method for optimum estimation in systems with random disturbances,” Automation and Remote Control, vol. 47, pp. 818–825, 1986.
[434] P. Swerling, “A proposed stagewise differential correction procedure for satellite tracking and prediction,” Tech. Rep. P-1292, Rand Corporation, 1958.
[435] ———, “Modern state estimation methods from the viewpoint of the method of least squares,” IEEE Trans. Automat. Contr., vol. 16, pp. 707–720, 1971.
[436] H. Tanizaki and R. S. Mariano, “Prediction, filtering and smoothing in non-linear and non-normal cases using Monte Carlo integration,” J. Appl. Econometrics, vol. 9, no. 2, pp. 163–179, 1994.
[437] ———, “Nonlinear filters based on Taylor series expansion,” Commu. Statist. Theory and Methods, vol. 25, no. 6, pp. 1261–1282, 1996.
[438] ———, “Nonlinear and non-Gaussian state-space modeling with Monte Carlo integration,” J. Econometrics, vol. 83, no. 1/2, pp. 263–290, 1998.
[439] H. Tanizaki, Nonlinear Filters: Estimation and Applications (2nd ed.), New York: Springer-Verlag, 1996.
[440] ———, “Nonlinear and non-Gaussian state estimation: A quasi-optimal estimator,” Commu. Statist. Theory and Methods, vol. 29, no. 12, 1998.
[441] ———, “On the nonlinear and non-normal filter using rejection sampling,” IEEE Trans. Automat. Contr., vol. 44, no. 2, pp. 314–319, 1999.
[442] ———, “Nonlinear and non-normal filter using importance sampling: Antithetic Monte-Carlo integration,” Commu. Statist. Simu. and Comput., vol. 28, no. 2, pp. 463–486, 1999.
[443] ———, “Nonlinear and non-Gaussian state-space modeling with Monte Carlo techniques: A survey and comparative study,” in Handbook of Statistics, C. R. Rao and D. N. Shanbhag, Eds., North-Holland, 2000.
[444] ———, “Nonlinear and non-Gaussian state space modeling using sampling techniques,” Ann. Inst. Statist. Math., vol. 53, no. 1, pp. 63–81, 2001.
[445] M. A. Tanner and W. H. Wong, “The calculation of posterior distributions by data augmentation (with discussions),” J. Amer. Statist. Assoc., vol. 82, pp. 528–550, 1987.
[446] M. A. Tanner, Tools for Statistical Inference: Methods for Exploration of Posterior Distributions and Likelihood Functions (3rd ed.), Berlin: Springer Verlag, 1996.
[447] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, “Robust Monte Carlo localization for mobile robots,” Artificial Intelligence, vol. 128, no. 1–2, pp. 99–141, May 2001.
[448] S. Thrun, J. Langford, and V. Verma, “Risk sensitive particle filters,” in Adv. Neural Inform. Process. Syst. 14, Cambridge, MA: MIT Press, 2002.
[449] S. Thrun, J. Langford, and D. Fox, “Monte Carlo hidden Markov models: Learning non-parametric models of partially observable stochastic processes,” in Proc. Int. Conf. Machine Learning, 1999.
[450] S. Thrun, “Particle filters in robotics,” in Proc. UAI02, 2002.
[451] P. Tichavský, C. Muravchik, and A. Nehorai, “Posterior Cramér-Rao bounds for discrete-time nonlinear filtering,” IEEE Trans. Signal Processing, vol. 46, no. 5, pp. 1386–1396, 1998.
[452] L. Tierney, “Markov chains for exploring posterior distributions (with discussion),” Ann. Statist., vol. 22, pp. 1701–1762, 1994.
[453] L. Tierney, R. E. Kass, and J. B. Kadane, “Approximate marginal densities of nonlinear functions,” Biometrika, vol. 76, pp. 425–433, 1989.
[454] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. Machine Learning Research, vol. 1, pp. 211–244, 2001.
[455] E. Tito, M. Vellasco, and M. Pacheco, “Genetic particle filter: An evolutionary perspective of SMC methods,” paper preprint, available on line http://www.ica.ele.puc-rio.br/cursos/download/TAIC-GPFilter.pdf.
[456] P. Torma and C. Szepesvári, “LS-N-IPS: An improvement of particle filters by means of local search,” in Proc. Nonlinear Control Systems, 2001.
[457] ———, “Combining local search, neural networks and particle filters to achieve fast and reliable contour tracking,” paper preprint, 2002. Available on line http://www.mindmaker.hu/~szepes/research/onlinepubs.htm.
[458] ———, “Sequential importance sampling for visual tracking reconsidered,” in Proc. 9th Workshop AI and Statistics, 2003.
[459] R. van der Merwe, J. F. G. de Freitas, A. Doucet, and E. Wan, “The unscented particle filter,” Tech. Rep. CUED/F-INFENG/TR 380, Cambridge Univ. Engineering Dept., 2000. Also in Adv. Neural Inform. Process. Syst. 13, Cambridge, MA: MIT Press, 2001.
[460] R. van der Merwe and E. Wan, “The square-root unscented Kalman filter for state and parameter estimation,” in Proc. ICASSP’01, vol. 6, pp. 3461–3464, 2001.
[461] D. A. van Dyk and X.-L. Meng, “The art of data augmentation (with discussion),” J. Comput. Graph. Statist., vol. 10, pp. 1–111, 2001.
[462] H. L. Van Trees, Detection, Estimation and Modulation Theory, New York: Wiley, 1968.
[463] V. Vapnik, Statistical Learning Theory, New York: Wiley, 1998.
[464] J. Vermaak, M. Gangnet, A. Blake, and P. Pérez, “Sequential Monte Carlo fusion of sound and vision for speaker tracking,” in Proc. 8th IEEE Int. Conf. Comput. Vision (ICCV’01), vol. 1, pp. 741–746, 2001.
[465] J. Vermaak and A. Blake, “Nonlinear filtering for speaker tracking in noisy and reverberant environment,” in Proc. ICASSP’01, vol. 5, pp. 3021–3024, 2001.
[466] J. Vermaak, C. Andrieu, A. Doucet, and S. J. Godsill, “Particle methods for Bayesian modelling and enhancement of speech signals,” IEEE Trans. Speech Audio Processing, vol. 10, no. 3, pp. 173–185, March 2002.
[467] J. Vermaak, “Bayesian modelling and enhancement of speech signals,” Ph.D. thesis, Cambridge Univ., 2000. Available on line at http://svr-www.eng.cam.ac.uk/~jv211/publications.html.
[468] J. Vermaak, N. D. Lawrence, and P. Pérez, “Variational inference for visual tracking,” paper preprint, 2002.
[469] P. Vidoni, “Exponential family state space models based on a conjugate latent process,” J. Roy. Statist. Soc., Ser. B, vol. 61, pp. 213–221, 1999.
[470] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inform. Theory, vol. 13, pp. 260–269, 1967.
[471] N. Vlassis, B. Terwijn, and B. Kröse, “Auxiliary particle filter robot localization from high-dimensional sensor observations,” in Proc. 2002 IEEE Int. Conf. Robot. Automat., pp. 7–12, 2002.
[472] J. von Neumann, “Various techniques used in connection with random digits,” National Bureau of Standards Applied Mathematics, vol. 12, pp. 36–38, 1959.
[473] E. Wan and A. Nelson, “Dual extended Kalman filter methods,” in Kalman Filtering and Neural Networks (chap. 5), S. Haykin, Ed., New York: Wiley, 2001.
[474] E. Wan and R. van der Merwe, “The unscented Kalman filter,” in Kalman Filtering and Neural Networks (chap. 7), S. Haykin, Ed., New York: Wiley, 2001.
[475] A. H. Wang and R. L. Klein, “Optimal quadrature formula nonlinear estimators,” Inform. Sci., vol. 16, pp. 169–184, 1978.
[476] X. Wang, R. Chen, and D. Guo, “Delayed-pilot sampling for mixture Kalman filter with application in fading channels,” IEEE Trans. Signal Processing, vol. 50, no. 2, pp. 241–253, Feb. 2002.
[477] X. Wang, R. Chen, and J. S. Liu, “Monte Carlo signal processing for wireless communications,” J. VLSI Signal Processing, vol. 30, no. 1–3, pp. 89–105, 2002.
[478] D. B. Ward and R. C. Williamson, “Particle filter beamforming for acoustic source localization in a reverberant environment,” in Proc. ICASSP’2002, vol. II, pp. 1777–1780, 2002.
[479] D. B. Ward, E. A. Lehmann, and R. C. Williamson, “Particle filtering algorithms for acoustic source localization,” IEEE Trans. Speech Audio Processing, 2003 (to appear).
[480] M. West, “Robust sequential approximate Bayesian estimation,” J. Roy. Statist. Soc., Ser. B, vol. 43, pp. 157–166, 1981.
[481] ———, “Mixture models, Monte Carlo, Bayesian updating and dynamic models,” Comput. Sci. Statist., vol. 24, pp. 325–333, 1992.
[482] ———, “Modelling with mixtures,” in Bayesian Statistics 4, London: Clarendon Press, 1992.
[483] M. West, P. J. Harrison, and H. S. Migon, “Dynamic generalised linear models and Bayesian forecasting (with discussion),” J. Amer. Statist. Assoc., vol. 80, pp. 73–97, 1985.
[484] M. West and J. Harrison, Bayesian Forecasting and Dynamic Models (2nd ed.), New York: Springer, 1997.
[485] B. Widrow and M. E. Hoff, Jr., “Adaptive switching circuits,” IRE Wescon Conv. Record, Pt. 4, pp. 96–104, 1960.
[486] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, 1985.
[487] N. Wiener and E. Hopf, “On a class of singular integral equations,” in Proc. Prussian Acad. Math.-Phys. Ser., p. 696, 1931.
[488] N. Wiener, Extrapolation, Interpolation and Smoothing of Time Series, with Engineering Applications, New York: Wiley, 1949. Originally appeared in 1942 as a classified National Defense Research Council report; also published under the title Time Series Analysis by MIT Press.
[489] C. K. I. Williams, “Prediction with Gaussian processes: From linear regression to linear prediction and beyond,” in Learning in Graphical Models, M. I. Jordan, Ed., Kluwer Academic Publishers, 1998.
[490] D. B. Wilson, “Annotated bibliography of perfectly random sampling with Markov chains,” in Microsurveys in Discrete Probability, D. Aldous and J. Propp, Eds., pp. 209–220, Providence: American Math. Society, 1998.
[491] D. Wolpert, “The lack of a priori distinctions between learning algorithms,” Neural Comput., vol. 8, pp. 1341–1390, 1996.
[492] ———, “The existence of a priori distinctions between learning algorithms,” Neural Comput., vol. 8, pp. 1391–1420, 1996.
[493] ———, “No free lunch theorems for optimization,” IEEE Trans. Evolu. Comput., vol. 1, pp. 77–82, 1997.
[494] W. H. Wong and F. Liang, “Dynamic importance weighting in Monte Carlo and optimization,” Proc. Natl. Acad. Sci., vol. 94, pp. 14220–14224, 1997.
[495] W. S. Wong, “New classes of finite-dimensional nonlinear filters,” Syst. Contr. Lett., vol. 3, pp. 155–164, 1983.
[496] W. M. Wonham, “Some applications of stochastic differential equations to optimal nonlinear filtering,” SIAM J. Contr., vol. 2, pp. 347–369, 1965.
[497] ———, “Random differential equations in control theory,” in Probabilistic Methods in Applied Mathematics, A. T. Bharucha-Reid, Ed., vol. 2, pp. 131–212, New York: Academic Press, 1970.
[498] H. Wozniakowski, “Average case complexity of multivariate integration,” Bull. Amer. Math. Soc., vol. 24, pp. 185–194, 1991.
[499] Z. Yang and X. Wang, “A sequential Monte Carlo blind receiver for OFDM systems in frequency-selective fading channels,” IEEE Trans. Signal Processing, vol. 50, no. 2, pp. 271–280, Feb. 2002.
[500] K. Yao and S. Nakamura, “Sequential noise compensation by sequential Monte Carlo method,” in Adv. Neural Inform. Process. Syst. 14, Cambridge, MA: MIT Press, 2002.
[501] M. Yeddanapudi, Y. Bar-Shalom, and K. R. Pattipati, “IMM estimation for multitarget-multisensor air traffic surveillance,” Proc. IEEE, vol. 85, no. 1, pp. 80–94, 1997.
[502] L. A. Zadeh and J. R. Ragazzini, “An extension of Wiener’s theory of prediction,” J. Appl. Phys., vol. 21, pp. 644–655, 1950.
[503] L. A. Zadeh, “Optimum nonlinear filters,” J. Appl. Phys., vol. 24, pp. 396–404, 1953.
[504] M. Zakai and J. Ziv, “Lower and upper bounds on the optimal filtering error of certain diffusion processes,” IEEE Trans. Inform. Theory, vol. 18, no. 3, pp. 325–331, 1972.
[505] M. Zakai, “On the optimal filtering of diffusion processes,” Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 11, no. 3, pp. 230–243, 1969.
[506] V. S. Zaritskii, V. B. Svetnik, and L. I. Shimelevich, “Monte Carlo technique in problems of optimal data processing,” Autom. Remote Control, vol. 12, pp. 95–103, 1975.
[507] J. Zhang and P. M. Djuric, “Joint estimation and decoding of space-time trellis codes,” EURASIP J. Appl. Signal Processing, no. 3, pp. 305–315, March 2002.
[508] H. Zhu and R. Rohwer, “Bayesian regression filters and the issue of priors,” Neural Comput. Appl., vol. 4, pp. 130–142, 1996.