Rodrigo Labouriau
December, 1996
Acknowledgement
I have learned to do research from many people: my supervisor Professor Ole E. Barndorff-Nielsen, my previous supervisor at IMPA and co-worker Professor Bent Jørgensen, Professor Richard Gill, Professor Amari, and, from my childhood, Professor Luiz Fernando Gouvêa Labouriau and Professor Maria Lea Salgado Labouriau, among others. I would like to thank all of them.
My research on estimating functions, one of the main topics of this thesis, was started under the supervision of Professor Bent Jørgensen at IMPA. When it was not possible to continue my formation at IMPA, Professor Barndorff-Nielsen offered to take me under his umbrella at the University of Aarhus. There my work gained special momentum, especially when working with Professor Barndorff-Nielsen and Professor Amari.
This work was conducted initially with financial support from the Conselho Nacional de Pesquisa - CNPq (Brazil); later on I received a grant from the European programme Human Capital and Mobility (HCM) at the MRI under the supervision of Professor Richard Gill (University of Utrecht). Partial financial support was also provided by Fundação Apolodoro Plausônio. I would like to thank the Research Centre Foulum, especially the Department of Biometry and Informatics, for providing facilities and for being flexible in allowing me time off for the preparation of this thesis. In particular, Aage Nielsen was always supportive and a good friend.
Contents
1 Introduction
1.1 Semiparametric models
1.1.1 Classical optimality theory and estimating functions for semiparametric models
1.1.2 Some classes of semiparametric models
1.2 Description of the thesis
1.3 Basic set-up
Introduction
8 CHAPTER 1. INTRODUCTION
on some very basic aspects of the phenomena under study. In other words, the specificity of interpretation of parametric models is necessarily lost when using classic nonparametric
models. That is the price one sometimes has to pay for dealing with extremely large
families of distributions.
Recently some attention has been devoted to a kind of nonparametric model that can
be placed in an intermediate position between the two extreme situations described above.
They are called by the suggestive name of semiparametric models. Roughly speaking, semiparametric models are families of distributions indexed by a parameter that can be decomposed into two sub-parameters: the first sub-parameter is called the interest parameter and belongs to a finite dimensional space; the second sub-parameter is termed the nuisance parameter and lives in an infinite dimensional space. Semiparametric models are genuine nonparametric models in the sense that they cannot be indexed by any finite dimensional parameter. What distinguishes semiparametric models from classic nonparametric models is that in the former we can identify a finite dimensional interest parameter. This identification of the interest parameter is in principle guided by our interest. In this way, we incorporate in the model, via the interest parameter, parameters that reflect characteristics of the phenomena under study, and keep the flexibility of nonparametric models to adapt to the unknown peculiarities of the phenomena studied.
Another distinctive characteristic of the theory of semiparametric models is the set of methods of statistical inference that are used. Typically, when dealing with semiparametric
models one tries to take advantage of the (partially) parametric structure of the model.
We discuss this point more precisely below.
In the second part of this thesis we carry out all the computations necessary to apply
the optimality theory of regular asymptotic linear estimating sequences to some classes
of semiparametric models.
In this thesis we also study the use, in the context of semiparametric models, of a relatively well developed technique for parametric models: the so-called estimating functions.
In the approach of estimating functions we consider estimators which can be expressed
as solutions of an equation such as
Ψ(x; θ) = 0 . (1.1)
Here Ψ is a function of the given data, say x, and the parameter, say θ, of a certain
statistical model. We call Ψ an estimating function, also known as an inference function
(the precise definition will be given later). Following the same procedure as in the classical theories, one introduces some constraints on the class of inference functions to be considered, gives a criterion for ordering the estimators obtained from the estimating functions in the restricted class, and chooses the uniformly best estimator. It is clear that
in most of the classical “well-behaved” cases, the maximum likelihood estimator is given
by the solution of an estimating equation. Moreover, the criterion for ordering inference
functions is closely related to the asymptotic variance of the associated estimators. In
that way, the approach of estimating equations can be viewed as a generalization of the
maximum likelihood theory. Due to the optimal behavior of the maximum likelihood in
“regular” cases, it is not surprising that the optimal inference function will give us ex-
actly the maximum likelihood estimator. However, there are some situations in which the
maximum likelihood theory fails and the estimating equation theory works well. More-
over, the theory of estimating equations provides alternative justifications for important
statistical techniques for parametric models, such as conditional inference (see Jørgensen
and Labouriau, 1995).
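As a toy illustration of this approach (a hypothetical sketch, not an example from the thesis): for i.i.d. normal data with known variance, the score function is an unbiased estimating function, and the root of the corresponding estimating equation (1.1) is the sample mean, i.e. the maximum likelihood estimator.

```python
# Hypothetical sketch: for i.i.d. normal data with known variance, the
# score function U(theta) = sum_i (x_i - theta) is an estimating function
# (its expectation is zero at the true theta) and the root of the
# estimating equation U(theta) = 0 is the sample mean, i.e. the MLE.

def score(data, theta):
    """Score of the normal location model: sum of residuals."""
    return sum(x - theta for x in data)

def mle_root(data, theta0=0.0, steps=50):
    """Newton iteration for U(theta) = 0; here U'(theta) = -n is constant."""
    n, theta = len(data), theta0
    for _ in range(steps):
        theta = theta - score(data, theta) / (-n)
    return theta

data = [0.4, 1.6, 2.0, 2.0]
theta_hat = mle_root(data)
# theta_hat equals the sample mean 1.5
```

Because the derivative of the score is constant here, Newton's method lands on the sample mean in a single step; the iteration is kept only to mimic the general root-finding scheme.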
Estimating functions have been used for inference with parametric models for a long
time. The earliest mention of the idea of estimating equations is probably due to Fisher
(1935) (he used the term “equation of estimation”). A remarkable example of an early
non-trivial use of inference functions can be found in Kimball (1946), where estimating
equations were used to give confidence regions for the parameters of the family of Gumbel
distributions (or extreme value distributions). There, the idea of “stable” estimating
equations, i.e. inference functions whose expectations are independent of the parameter,
was introduced, anticipating the theory of sufficiency and ancillarity for inference functions
proposed by McLeish and Small (1987) and Small and McLeish (1988). The theory of
optimality of inference functions appears in the pioneering paper of Godambe (1960). In
the same year Durbin (1960) introduced the notion of unbiased linear inference function
and proved some optimality theorems particularly suited to applications in time series
analysis. Since that time, the theory of inference functions has been developed a great
deal, both by Godambe (cf. Godambe, 1976, 1980, 1984; Godambe and Thompson, 1974,
1976), and by others in different contexts and with different names and approaches. We
mention, for instance, the so-called theory of M-estimators developed in the Seventies in order to obtain robust estimators, and the quasi-likelihood methods used in generalized
linear models. As one can see, the theory of inference functions was not inspired only by the search for an alternative optimality theory for point estimation. One could say that there is now a
firm and well established theory of inference functions for parametric models, with many
branches, some of them based on very deep mathematical foundations.
Estimating functions have been applied with relative success to estimation under para-
metric models with nuisance parameters (see Jørgensen and Labouriau, 1995 and the
references therein). It is then natural to ask whether this technique produces reasonable
results in a context of semiparametric models. We will show in this thesis that in fact the
class of estimators derived from estimating functions is rather limited for semiparametric
models. However, estimating functions will prove to be useful as auxiliary tools for obtaining efficient regular asymptotic linear estimating sequences.
∫_IR x z(x)dµ(x) = 0 , or equivalently ∫_IR x z(x − θ)dµ(x) = θ ,   (1.5)

and

∫_IR x² z(x)dµ(x) < ∞ .   (1.6)

Conditions (1.3) and (1.4) ensure that z is a density of a probability measure with support equal to the whole real line. From condition (1.5) the parametrization (θ, z) ↦ Pθz is identifiable, i.e. this map is one-to-one.
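As a hypothetical numerical sanity check (not part of the thesis), one can verify conditions of the type (1.5) and (1.6) for a concrete choice of z, here the standard normal density, by integrating on a grid:

```python
# Hypothetical numerical check (not from the thesis): the standard normal
# density z satisfies the centring condition (1.5) and the second-moment
# condition (1.6) of the semiparametric location model.
import math

def z(x):
    """Standard normal density: positive on all of IR and integrates to 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Midpoint-rule integration on [-10, 10]; the tails beyond are negligible.
a, b, n = -10.0, 10.0, 40_000
h = (b - a) / n
total = first_moment = second_moment = 0.0
for i in range(n):
    x = a + (i + 0.5) * h
    w = z(x) * h
    total += w                  # ~ 1: z is a probability density
    first_moment += x * w       # ~ 0: condition (1.5)
    second_moment += x * x * w  # finite (here ~ 1): condition (1.6)
```

Any other density with mean zero and finite second moment would pass the same check, which is exactly the flexibility the semiparametric location model exploits.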
1.1. SEMIPARAMETRIC MODELS 11
Clearly, the model described above is a semiparametric extension of the so-called location model. This model is an example contained in the main class of models studied
in this thesis. The models in the class referred to are constructed by assuming that the expectations of a number of given square integrable functions are equal to some fixed functions of a finite dimensional interest parameter. In the example above this corresponds to condition (1.5). These models are termed “L2 - restricted models” and they incorporate semiparametric extensions of many important statistical models such as: multivariate location and shape models, growth curve models, linear structural relationship models (LISREL) and some graphical models.
Two other closely related classes of models are also studied. The first is obtained by considering L2 - restricted models that coincide entirely with a parametric model (parametrized exclusively by the interest parameter) in some given region of the sample space. The distributions are assumed to vary freely (subject only to the L2 - restricted model constraints) outside those regions. An example here is the semiparametric location model described above with the additional condition that the densities of the distributions in the model coincide, in the interval [θ − 2σ, θ + 2σ], with the density φ of a normal distribution with mean θ and variance σ² := ∫_IR x²φ(x)dµ(x). These models are called “partial parametric models”. The best known examples are the trimmed models and the location model with free tails.
The second class of models is constructed by adding to an L2 - restricted model constraints obtained by assuming that a (non-linear) function of the expectations of some square integrable functions equals a function of the interest parameter. These models are referred to as “extended L2 - restricted models”. For example, one could additionally introduce the assumption that the coefficient of variation is constant in the semiparametric location model described above, i.e. insert the condition

√( ∫_IR x² z(x)dµ(x) ) / ∫_IR x z(x)dµ(x) = k .
Examples of extended L2 - restricted models are regression models with link function and some types of covariance selection models.
Chapter 1 This chapter contains some introductory material, an overview of the thesis
and some notational conventions.
Chapter 2 In this chapter the classic optimality theory for non- and semiparametric models is studied. A range of notions of path differentiability and tangent spaces are introduced and their inter-relations studied. Next, some concepts of differentiability of statistical functionals are studied. Here, differentiability is considered relative to a pointed cone contained in the tangent space, and not relative to the whole tangent space, as is current in the literature. These cones are referred to as tangent cones. The optimality
theory of differentiable functionals is reviewed next. Again, the results are stated relative
to the tangent cone and not with respect to the whole tangent space, as is usual. The
estimation of the interest parameter of semiparametric models is studied by applying
the optimality theory to a specially designed functional called the interest parameter
functional, which associates to any probability measure in the model in play the value of
the interest parameter associated to it. An increasing range of tangent cones is considered.
Here, the larger the tangent cone used, the sharper the bound obtained for the concentration of regular estimators. However, too large a tangent cone may imply that the interest parameter functional is differentiable only under somewhat stringent regularity
1.2. DESCRIPTION OF THE THESIS 13
conditions on the model. It is shown how the imposition of such conditions, usually done in the literature, can be avoided by adequate choices of tangent cones. The bound for the concentration of regular estimating sequences obtained with this choice of the tangent cone is referred to as the semiparametric Cramér-Rao bound.
Chapter 3 The chapter extends the theory of estimating functions classically consid-
ered for parametric models to a context of semiparametric models. A class of regular
estimating functions (REF) is defined and characterized in two alternative forms. The
first characterization says essentially that the components of any REF are in the intersec-
tion (over the values of the nuisance parameter) of the so called (strong or L2 ) nuisance
tangent spaces. This result is original, even though it can be found already in Jørgensen
and Labouriau (1995, chapter 4). The second characterization, shown to be equivalent
to the first, is a modification of the (informal) characterization recently given by Amari
and Kawanabe (1996) based on differential geometric considerations. The first character-
ization is used to obtain an optimality theory of REFs by using a projection technique.
It is proved that the semiparametric Cramér-Rao bound coincides with the bound for
the concentration of estimators based on REF if and only if the orthogonal complement
of the nuisance tangent space does not depend on the nuisance parameter. This result
is original and is used in the subsequent chapters to check in a range of semiparametric
models whether the semiparametric Cramér-Rao bound is attained by estimators based
on REFs. The chapter closes with a discussion of a generalization of the notion of REFs
in which the dependence on the nuisance parameter is allowed.
Most of the material presented in this chapter is the result of the author's work over the last years. The result concerning the coincidence of the semiparametric Cramér-Rao bound and the bound for the concentration of estimators based on REFs is original.
Chapter 4 The chapter studies a one dimensional semiparametric extension of the lo-
cation and scale model. The goal there is to gain intuition, and restrictions are introduced
largely in order to make the basic theory work in a simple way. The treatment of the
location and scale model is refined later in chapter 5. The basic computations necessary
for obtaining efficient regular asymptotic linear estimating sequences (RALES) are given
in detail for location and scale models where a number of standardized cumulants are
fixed. The main point there is the calculation of a version of the efficient score function.
The efficient score function does not depend on the nuisance parameter in the case where
only the first two standardized cumulants are fixed and its roots are the sample mean
and the sample standard deviation. In that case, the efficient score function is a regular
estimating function (REF) and its roots provide efficient RALES, or in other words attain
the semiparametric Cramér-Rao bound. In the case where more than two standardized
cumulants are fixed the semiparametric Cramér-Rao bound is not always attained by
estimators based on REFs. Moreover, the efficient score function does depend on the
nuisance parameter but only through an intermediate finite dimensional parameter, sug-
gesting the use of some “plug-in” estimating procedure for obtaining efficient estimation.
The location and scale models considered in this chapter do not incorporate the assump-
tion of symmetry of the distributions, as occurs in the previous treatments of similar
models found in the literature. However, strong conditions on the behavior of the tails
of the distributions and the Laplace transform of some given functions of the density are
imposed. This will allow the calculations to be performed with the help of polynomial
expansions, a technique used in the early stages of the work. These technical restrictions
are eliminated in chapter 5 in a more general context. An auxiliary technical condition
for having the class of polynomials dense in L2 is given in the appendices of the chapter.
The chapter reports essentially a joint work with Professor Shun-Ichi Amari (University
of Tokyo) and Professor Ole E. Barndorff-Nielsen (University of Aarhus), developed in a
preliminary stage of the thesis work.
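The case of the first two standardized cumulants can be made concrete with a small sketch (hypothetical code, not from the thesis): the estimating equations obtained by fixing the first two cumulants are solved exactly by the sample mean and the divisor-n sample standard deviation.

```python
# Hypothetical sketch: fixing the first two standardized cumulants of a
# location-scale model yields the estimating equations
#   (1/n) sum_i (x_i - theta)             = 0
#   (1/n) sum_i ((x_i - theta)**2 - s2)   = 0
# whose exact roots are the sample mean and the divisor-n sample variance.
import math

def solve_location_scale(data):
    """Return (theta_hat, sigma_hat) solving the two equations above."""
    n = len(data)
    theta = sum(data) / n                           # root of the first equation
    s2 = sum((x - theta) ** 2 for x in data) / n    # root of the second
    return theta, math.sqrt(s2)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
theta_hat, sigma_hat = solve_location_scale(data)
# For this data set theta_hat = 5.0 and sigma_hat = 2.0
```

This is the situation in which the efficient score function is itself a regular estimating function, so these closed-form roots already attain the semiparametric Cramér-Rao bound.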
Chapter 5 This chapter studies the main classes of models considered in the thesis,
the L2 - restricted semiparametric models, which we define next. The results presented in the chapter are original and will be presented in relatively more detail than those of the previous chapters.
Consider a measurable space (X , B) on which a σ-finite measure λ and a family of
probability measures P are defined. The sample space X is assumed to be a locally
compact Hausdorff space, typically a Euclidean space.
The family P is parametrized in the following way
P = {Pθz : θ ∈ Θ , z ∈ Z} . (1.7)
Here Θ is an open set of IRq and Z is a class of arbitrary nature, typically infinite
dimensional. We consider z as a nuisance parameter and θ as a parameter of interest
which we want to estimate. It is assumed that the parametrization of P is identifiable,
i.e. the mapping (θ, z) 7→ Pθz is a bijection between Θ × Z and P. Each element of the
family P is dominated by λ and has support equal to the whole sample space X . We can
then represent P alternatively by
P ∗ = { p( · ; θ, z) = (dPθz /dλ)( · ) : θ ∈ Θ, z ∈ Z } .   (1.8)
It is assumed without loss of generality that the versions of the Radon-Nikodym derivative
used in (1.8) to define p( · ; θ, z) : X −→ IR+ are strictly positive, i.e. for all x ∈ X , θ ∈ Θ
and z ∈ Z, p(x; θ, z) > 0.
p is of class C l ;   (1.11)

∫_X p(x)λ(dx) = 1;   (1.12)

∫_X fj²(x)p(x)λ(dx) < ∞;   (1.13)

∫_X fj (x)p(x)λ(dx) = Mj (θ0 );   (1.14)

for each z ∈ Z, ∫_X gi (x, z)p(x)λ(dx) ∈ Bi (θ0 ),   (1.15)

where Mj (θ0 ) ∈ IR and Bi (θ0 ) is a given open subset of IR. We refer to the models of the form
described above as L2 - restricted semiparametric models or simply L2 - restricted models.
The conditions (1.10) and (1.12) ensure that p is a probability density of a distribution
with support equal to the whole sample space X . Conditions (1.13)-(1.15) are used to
restrict each submodel Pθ0 (and consequently shrink the model P). These conditions could be used to express partial a priori knowledge about the phenomena we study or
to ensure some desirable mathematical characteristics of the model, such as identifiability
and regularity of the partial score functions. For instance, the conditions (1.15) can be
used to ensure that the partial score functions are in L2 . Condition (1.11) can be assumed
to hold apart from a λ- null set.
The following examples of L2 - restricted models are considered in detail in the thesis:
location and scale (without the restrictive conditions on the tails and existence of the
Laplace transform used in chapter 4), multivariate location and shape models, covariance
selection models defined on the location and shape model, linear structural relationship
models (LISREL) and some growth curve models with modeled covariance structure.
The covariance selection models are distinguished among the other examples because the functions used to impose the restrictions are not polynomials.
It is shown that all the notions of nuisance tangent spaces considered in the first
part coincide and are equal to the orthogonal complement of the L2 - closure of the space
spanned by the functions f1 − E(f1 ), . . . , fk − E(fk ). More precisely, define Hk (θ, z) as the closure in L2 (Pθz ) of the linear span of f1 − Eθz (f1 ), . . . , fk − Eθz (fk ).
The orthogonal complement of Hk (θ, z) in L20 (Pθz ) is denoted by Hk⊥ (θ, z) and it is equal
to the nuisance tangent space. Note that Hk depends only on θ, which in the light of
the theory developed in chapter 3 implies that the semiparametric Cramér-Rao bound
coincides with the bound for REFs.
The next step is to calculate the efficient score function by projecting the partial
score function onto the orthogonal complement of the nuisance tangent space. To do so,
consider the result of a Gram-Schmidt orthonormalization process, in the space L2 (Pθz ),
applied to the functions 1, f1 , . . . , fk and denoted by 1, ξ1θz , . . . , ξkθz . Since ξ1θz , . . . , ξkθz form
an orthonormal basis of Hk = TN⊥ (θ, z), the projection of l/θi ( · , θ, z) onto TN⊥ (θ, z) is, for i = 1, . . . , q,
lE/θi ( · , θ, z) = Σ_{j=1}^{k} ⟨ l/θi ( · , θ, z), ξjθz ( · ) ⟩L2 (Pθz ) ξjθz ( · ) ,   (1.16)
where ⟨ · , · ⟩L2 (Pθz ) is the inner product of L2 (Pθz ). The representation above can be written in matrix form, for each (θ, z) ∈ Θ × Z, in terms of a matrix A(θ, z). The inverse of the matrix A(θ, z) provides a lower bound for the asymptotic variance
of regular asymptotic linear estimating sequences i.e. the semiparametric Cramér-Rao
bound. Here we use the partial order of matrices (i.e. , A ≥ B iff A − B is positive
semidefinite).
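The partial order of matrices used here can be checked numerically: A ≥ B holds exactly when all eigenvalues of the symmetric matrix A − B are non-negative. A hypothetical 2 × 2 illustration:

```python
# Hypothetical illustration of the matrix partial order used above:
# A >= B iff A - B is positive semidefinite, i.e. all eigenvalues of the
# symmetric matrix A - B are >= 0.

def eigenvalues_2x2_symmetric(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    tr, det = a + d, a * d - b * b
    disc = (tr * tr / 4.0 - det) ** 0.5   # always real for symmetric m
    return (tr / 2.0 - disc, tr / 2.0 + disc)

def loewner_geq(A, B, tol=1e-12):
    """True iff A - B is positive semidefinite (2x2 symmetric case)."""
    diff = [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]
    return all(ev >= -tol for ev in eigenvalues_2x2_symmetric(diff))

A = [[2.0, 0.0], [0.0, 2.0]]
B = [[1.0, 0.5], [0.5, 1.0]]
# A - B = [[1, -0.5], [-0.5, 1]] has eigenvalues 0.5 and 1.5, so A >= B
```

Note that this order is only partial: for many pairs of matrices neither A ≥ B nor B ≥ A holds.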
The class of regular estimating functions is studied next. Using the first characteri-
zation of estimating functions given in chapter 3 it is shown that any regular estimating
function can be represented in the following way,
Ψ( · ; θ) = α(θ){f ( · ) − M (θ)} , (1.19)
where α(θ) is a q×k matrix, f ( · ) = (f1 ( · ), . . . , fk ( · ))T and M (θ) = (M1 (θ), . . . , Mk (θ))T .
It can be shown from the properties of the regular estimating functions that the only pos-
sible root of any REF, under a repeated sampling scheme with sample x = (x1 , . . . , xn )T ,
is the solution of the system
0 = IPn f (x) − M (θ̂) , where IPn f (x) = (1/n) Σ_{i=1}^{n} f (xi ) is the empirical mean of f .
In other words, there is one and only one moment estimator associated with any REF. The basic properties of this moment estimator, such as consistency and asymptotic normality, are studied, and the Hampel influence function of the moment estimator referred to is derived.
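A hypothetical sketch of such a moment estimator (the pair f and M below is invented purely for illustration): with f(x) = x and M(θ) = exp(θ), the equation 0 = IPn f(x) − M(θ̂) is solved by θ̂ = log of the sample mean, recovered here by bisection.

```python
# Hypothetical moment estimator sketch: solve IP_n f(x) - M(theta) = 0
# for an invented pair f(x) = x, M(theta) = exp(theta). The root is then
# theta_hat = log(mean(data)); we recover it by bisection on M.
import math

def moment_estimator(data, M, lo, hi, tol=1e-12):
    """Solve mean(f(data)) = M(theta) by bisection; M must be increasing."""
    target = sum(data) / len(data)        # IP_n f with f(x) = x
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if M(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

data = [1.0, 2.0, 3.0, 6.0]               # sample mean = 3.0
theta_hat = moment_estimator(data, math.exp, -10.0, 10.0)
# theta_hat is approximately log(3)
```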
Chapter 6 In this chapter two extensions of the L2 - restricted models are studied. We
term these: extended L2 - restricted models and partial parametric models.
and
∀z ∈ Z, h( ∫_X h1 (x, z)p(x)λ(dx), . . . , ∫_X hs (x, z)p(x)λ(dx) ) ∈ H(θ) .   (1.22)
The result is used to characterize the class of REFs for the covariance selection models in which the covariance matrix is not part of the interest parameter. Moreover, the estimators derived from REFs are shown to be necessarily moment estimators of the same type as those derived from REFs for the L2 - restricted models considered in chapter 5. The following result, proved in the chapter, can be used to calculate precisely the nuisance tangent spaces for the example of regression with link function. If, in addition to the previous assumptions, one assumes that the function b is injective, then

TN (θ, z) = [span{f1 − E(f1 ), . . . , fk − E(fk ), g1 − E(g1 ), . . . , gs − E(gs )}]⊥ .
{Hk (θ, z)}⊥ = {Hk }⊥ = Hk⊥ (θ, z) .
Here supp(ν) is the support of the function ν ∈ L20 (Pθz ). We identify the L2 functions
that are almost surely equal and adopt the convention that supp(ν) ⊆ A means that
ν( · )χAc ( · ) = 0, λ-almost everywhere.
Theorem 1 Under a partial parametric semiparametric model, for each θ ∈ Θ, z ∈ Z and m ∈ [1, ∞],

i) TN^m (θ, z) = Hk⊥ (θ, z);

ii) TN^m⊥ (θ, z) = Hk (θ, z). Moreover,

Hk (θ, z) = span[ {fi ( · )χIθc ( · ) − Ei (θ) : i = 1, . . . , k} ∪ {f ∈ L20 (Pθz ) : supp(f ) ⊆ Iθ } ] ,
Note that the function Ei (θ) in fact does not depend on the nuisance parameter, and we sometimes write simply Hk (θ).
The regular estimating function Ψ in part ii) of the corollary above has the matrix representation
Ψ( · ; θ) = ξ( · ; θ) + α(θ){f ( · ) − E(θ)} , (1.25)
where ξ( · ; θ) = (ξ1 ( · ; θ), . . . , ξq ( · ; θ))T , α(θ) = [αij (θ)]i,j and E(θ) = (E1 (θ), . . . , Ek (θ))T .
Theorem 2 Consider a partial parametric model. Suppose that the function E is dif-
ferentiable. Then we have for any regular estimating function with representation (1.25)
and for each θ ∈ Θ and z ∈ Z:
a) JΨ (θ, z) = Jξ (θ, z) + {∇E(θ)}T {Covθz (f χIθc )}^{−1} {∇E(θ)} .   (1.26)
b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating
functions by taking
Finally, the following theorem, which solves the problem of optimality, is proved.
Theorem 3 Consider a partial parametric model. Suppose that the function E is differentiable. Then the semiparametric Cramér-Rao bound is attained by regular estimating functions at (θ, z) ∈ Θ × Z if and only if,
The usual norm of Lq (p) will be denoted by ∥ · ∥Lq (p) . In the special case of the Hilbert space L2 (p), the natural inner product will be denoted, for all f, g ∈ L2 (p), by
⟨f, g⟩P = ⟨f, g⟩p = ∫_X f (x)g(x)p(x)λ(dx) .
Given a set A ⊆ L20 (P ), we denote the closure of A with respect to the topology of L2 (P )
by clL2 (P ) (A) = clL2 (p) (A), and the orthogonal complement of A in L20 (P ) by A⊥ . Finally,
we consider the space L∞ (P ) = L∞ (p) of functions from X to IR essentially bounded with respect to the probability P ∈ P (or p ∈ P ∗ ), equipped with the norm given, for each f ∈ L∞ (p), by

∥f ∥L∞ (p) = ess sup |f |

(see Dunford and Schwartz, 1964). We stress that for 1 ≤ r ≤ q ≤ ∞, Lq (p) ⊆ Lr (p). Moreover, for all f ∈ Lq (p),

∥f ∥Lr (p) ≤ ∥f ∥Lq (p) .
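The inclusion Lq (p) ⊆ Lr (p) for r ≤ q reflects the monotonicity of the norms ∥f ∥Lr (p) ≤ ∥f ∥Lq (p), which holds because p is a probability density; a hypothetical numerical check on a finite probability space:

```python
# Hypothetical check of ||f||_{L^r(p)} <= ||f||_{L^q(p)} for r <= q on a
# finite probability space: the weights sum to 1 and play the role of p dlambda.

def lp_norm(values, weights, q):
    """||f||_{L^q(p)} on a finite space with probability weights."""
    return sum(w * abs(v) ** q for v, w in zip(values, weights)) ** (1.0 / q)

values = [0.5, 2.0, -3.0, 1.0]           # the function f on four points
weights = [0.1, 0.4, 0.2, 0.3]           # a probability vector (sums to 1)
norms = [lp_norm(values, weights, q) for q in (1, 2, 4, 8)]
# norms is nondecreasing in q
```

This monotonicity fails without the total-mass-one restriction, which is why the inclusion is specific to probability spaces.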
∀x ∈ X , p(x; θ0 , z0 ) > 0 .
Chapter 2

Path and Functional Differentiability

2.1 Introduction
We consider in this chapter some aspects of the general theory of non-parametric statistical
models which will be useful for the theory of semiparametric models. The key notions
introduced here are the path differentiability, the associated concept of tangent spaces
and tangent sets, and the notions of functional differentiability.
In section 2.2 we study a range of concepts of path differentiability and comparisons
of those notions are provided. An important point there is the equivalence between the
Hellinger differentiability, often used in the literature (see Bickel et al., 1993), and the
weak differentiability (see Pfanzagl, 1982, 1985 and 1990). Two auxiliary notions of path
differentiability are introduced: strong and mean differentiability. It is proved that weak (or Hellinger) differentiability is an intermediate notion of path differentiability, weaker than strong differentiability and stronger than mean differentiability. A new notion of
path differentiability, called essential differentiability, is introduced. We will interpret the tangents of essentially differentiable paths as score functions of one dimensional “regular
submodels” in the classical sense. Since the essential differentiability is weaker than the
other notions provided, this interpretation extends immediately to all the other path
differentiability notions considered.
In section 2.3 some differentiability notions of functionals are studied. In the approach adopted, a cone contained in the tangent set (i.e. the class of tangents of differentiable paths) is chosen and the differentiability of the functional in question is defined relative to this cone (termed the tangent cone). Alternative notions of functional differentiability are
given by adopting different notions of path differentiability and/or using different tangent
cones. As we will see, the stronger the path differentiability notion used and the smaller the tangent cone, the weaker the notion of differentiable functionals induced, in the sense that more statistical functionals are differentiable. We provide next some lower bounds
28 CHAPTER 2. PATH AND FUNCTIONAL DIFFERENTIABILITY
The main purpose of this section is to introduce the mathematical machinery necessary to
extend the notion of score function, classically defined for parametric models, to a context
where no (or only a partial) finite dimensional parametric structure is assumed. The key
idea here is to consider one-dimensional submodels of the family P of probability measures
(typically infinite dimensional). These submodels will be called paths. Following the steps
of Stein (1956), one should consider a class of submodels (or paths) sufficiently regular
in order to have a score function well defined and well behaved for each submodel, in the
sense that, at least, each score function should be unbiased (i.e. have expectation zero)
and have finite variance. Stein’s idea is to use the worst possible regular submodel to
assess the difficulty of statistical inference procedures for the entire family P. Evidently,
if the class of “regular submodels” is too small, no sensible results are to be expected from that procedure. On the other hand, if the class of “regular submodels” is too large, Stein's procedure can become intractable or no real simplification is gained, which is not in the spirit of the method proposed. Hence, when applying the Stein procedure it is our task to find a class of “regular submodels” of adequate size.
The idea of a “regular submodel” mentioned above will be formalized by introducing the notion of path differentiability. A range of concepts of path differentiability are studied in this section, all of them fulfilling the minimal requirement for a “regular submodel”, i.e. the score functions of the differentiable paths (viewed as submodels) will be automatically well defined, unbiased and possess finite variances. The strongest notion of
path differentiability considered is the L∞ differentiability (or pointwise differentiability)
and the weakest notion is the essential differentiability. It will turn out that a notion
of path differentiability called “Hellinger differentiability” (or “weak differentiability”) is
the weakest notion that captures some important essential statistical properties of the
model P. Another distinguished notion considered is the L2 differentiability which will
involve calculations with Hilbert spaces, simplifying all the computations required. The
L2 differentiability coincides with the Hellinger differentiability in most of the examples
considered in this thesis. It turns out that the L2 differentiability will be useful in the theory of estimating functions.
This section is organized as follows. Subsection 2.2.1 studies the basic notion of path
differentiability and some general properties of differentiable paths. Some specific concepts
of differentiability are introduced in the subsections 2.2.2, 2.2.3 and 2.2.4 where weak
or Hellinger, Lq and essential differentiability are studied, respectively. The associated
notions of tangent sets and tangent spaces are discussed in subsection 2.2.5.
with representation given by (2.1), with ν ∈ L20 (p). Then we have, for each t ∈ V
rt ( · ) = (pt ( · ) − p( · )) / (t p( · )) − ν( · )   (2.3)

and

∫_X rt (x)p(x)λ(dx) = ∫_X { (pt (x) − p(x)) / (t p(x)) − ν(x) } p(x)λ(dx) = 0 .
and

(1/t) ∫_{x : t|rt (x)|>1} |rt (x)| p(x)λ(dx) −→ 0 , as t ↓ 0 ,   (2.4)

∫_{x : t|rt (x)|≤1} |rt (x)|² p(x)λ(dx) −→ 0 , as t ↓ 0 .   (2.5)
Let us introduce now the Hellinger differentiability of paths. The key idea in this ap-
proach is to characterize the family P of probability measures by the class of square roots
of the densities, instead of the densities. The advantage of this alternative characterization
is that the square roots of the densities are in the Hilbert space
L2 (λ) = { f : X −→ IR : ∫_X f ²(x)λ(dx) < ∞ } .
In this way the statistical model in play is naturally embedded into a space with a rich
mathematical structure. Using the usual topology of L2 (λ) one defines the differentiability
of paths in the sense of Fréchet (or in this case, since the domain of the path is contained
in IR, the equivalent notions of Hadamard and Gateaux differentiability could also be used). The precise definition of Hellinger differentiability is the following. A path {pt}t∈V is Hellinger differentiable at p ∈ P* with tangent ν ∈ L20(p) if there exists a generalized sequence {st}t∈V in L20(p) converging to zero as t ↓ 0, i.e.
\[ \| s_t \|_p \longrightarrow 0 , \quad \text{as } t \downarrow 0 , \tag{2.6} \]
such that, for each t ∈ V,
\[ p_t^{1/2}(\,\cdot\,) = p^{1/2}(\,\cdot\,) + \tfrac{1}{2}\, t\, p^{1/2}(\,\cdot\,)\,\nu(\,\cdot\,) + t\, p^{1/2}(\,\cdot\,)\, s_t(\,\cdot\,) . \tag{2.7} \]
The factor 1/2 in the second term of the right-hand side of (2.7) serves to make this notion consistent with the other notions of differentiability. Note that each st is in fact in L20(p), for, from (2.7),
\[ s_t(\,\cdot\,) = \frac{p_t^{1/2}(\,\cdot\,) - p^{1/2}(\,\cdot\,)}{t\, p^{1/2}(\,\cdot\,)} - \frac{\nu(\,\cdot\,)}{2} . \tag{2.8} \]
Since
\[ \int_X \left\{ \frac{p_t^{1/2}(x)}{p^{1/2}(x)} \right\}^2 p(x)\,\lambda(dx) = \int_X p_t(x)\,\lambda(dx) = 1 < \infty , \]
we have that pt1/2( · )/p1/2( · ) ∈ L2(p), and hence
\[ \frac{p_t^{1/2}(\,\cdot\,) - p^{1/2}(\,\cdot\,)}{t\, p^{1/2}(\,\cdot\,)} = \frac{1}{t} \left\{ \frac{p_t^{1/2}(\,\cdot\,)}{p^{1/2}(\,\cdot\,)} - 1 \right\} \in L^2(p) .
\]
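The expansion (2.7)-(2.8) can be checked on the same toy path used before (again a sketch; the path, grid and tolerances are our own choices): for p_t = N(t, 1) through p = N(0, 1) the square-root remainder s_t of (2.8) vanishes in L2(p) as t ↓ 0.

```python
import numpy as np

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # density of N(0, 1)
nu = x                                        # tangent of the path p_t = N(t, 1)

def hellinger_remainder_norm(t):
    """L^2(p) norm of s_t from (2.8)."""
    sqrt_ratio = np.exp((t * x - t**2 / 2) / 2)   # p_t^{1/2}(x) / p^{1/2}(x)
    s = (sqrt_ratio - 1.0) / t - nu / 2
    return float(np.sqrt(np.sum(s**2 * p) * dx))

h_norms = [hellinger_remainder_norm(t) for t in (0.5, 0.1, 0.02)]
```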
Proposition 1 A path {pt} is Hellinger differentiable if and only if {pt} is weakly differentiable.
Proof: The proposition follows immediately from the fact that convergence in Lq(p) implies convergence in Lr(p) for q ≥ r. □
There are two distinguished cases of Lq path differentiability: strong and mean differentiability, corresponding to L2 and L1 differentiability respectively. The L1 differentiability is remarkable because it is the weakest notion of differentiability found in the literature, and the L2 differentiability distinguishes itself because the L2 spaces, when endowed with the natural inner product, are Hilbert spaces, which simplifies the calculations significantly.
We study next the relation between weak and Lq path differentiability. As we will see in propositions 3 and 4 below, weak differentiability is an intermediate notion of path differentiability between L2 and L1 differentiability.
Proof: Let {pt } be a differentiable path in the L2 sense with representation (2.9) and
krt kL2 (p) −→ 0 as t ↓ 0. We show that the path {pt } fulfills the conditions (2.4) and (2.5)
for the convergence of the remainder term in the sense of the weak path differentiability.
For,
\[ \frac{1}{t}\int_{\{x:\,t|r_t(x)|>1\}} |r_t(x)|\,p(x)\,\lambda(dx) \le \frac{1}{t}\int_{\{x:\,t|r_t(x)|>1\}} t\,|r_t(x)|\,|r_t(x)|\,p(x)\,\lambda(dx) \]
\[ = \int_{\{x:\,t|r_t(x)|>1\}} |r_t(x)|^2\,p(x)\,\lambda(dx) \le \int_X |r_t(x)|^2\,p(x)\,\lambda(dx) = \| r_t \|_p^2 \longrightarrow 0 , \quad \text{as } t \downarrow 0 . \]
Hence {rt} satisfies (2.4). On the other hand,
\[ \int_{\{x:\,t|r_t(x)|\le 1\}} |r_t(x)|^2\,p(x)\,\lambda(dx) \le \int_X |r_t(x)|^2\,p(x)\,\lambda(dx) = \| r_t \|_p^2 \longrightarrow 0 , \quad \text{as } t \downarrow 0 . \]
Hence {rt} satisfies (2.5). We conclude that {pt} is differentiable in the weak sense with tangent ν. □
Proof: Take a path {pt} weakly differentiable at p with tangent ν. There exists a generalized sequence of functions {rt} satisfying (2.4) and (2.5) such that, for all t ∈ V,
\[ p_t(\,\cdot\,) = p(\,\cdot\,) + t\,p(\,\cdot\,)\,\nu(\,\cdot\,) + t\,p(\,\cdot\,)\,r_t(\,\cdot\,) . \]
Note that (2.4) implies that
\[ \int_{\{x:\,t|r_t(x)|>1\}} |r_t(x)|\,p(x)\,\lambda(dx) \longrightarrow 0 , \quad \text{as } t \downarrow 0 , \tag{2.11} \]
and for any sequence {kn }n∈N ⊆ V such that kn −→ 0 as n → ∞ there is a subsequence
{ki }i∈N ⊆ {kn }n∈N such that rki ( · ) −→ 0 p-almost surely as i → ∞.
We show next that essential differentiability is weaker than differentiability in mean, which, in view of propositions 3 and 4, implies that essential differentiability is the weakest notion of path differentiability considered here.
Proof: The generalized sequence {rt}t∈V is Cauchy, because it converges in L1 to zero. Using theorem 3.12 in Rudin (1987, page 68) the essential differentiability follows. □
The following scheme represents the interrelation between the various notions of path differentiability considered:

L∞ differentiability
⇓
Lp differentiability
⇓
Lq differentiability
⇓
L2 differentiability
⇓
Weak differentiability ⇔ Hellinger differentiability
⇓
L1 differentiability
⇓
essential differentiability

Here 2 < q < p < ∞.
space as we will call the tangent space of the submodel we mentioned). It will then be convenient to work with a closed subspace of L2. We remark that it can be proved that the tangent set is a pointed cone, but in general not even a vector space; hence the need to introduce the notion of tangent space as given here.
The formal definition of tangent space and tangent set depends on the notion of path
differentiability one uses. We give next a general definition of tangent set and tangent
space which will be made precise when we specify the notion of path differentiability we
use. Suppose we adopt a certain definition of path differentiability according to which a differentiable path at p ∈ P*, say {pt}, has representation, for each t ∈ V,
\[ p_t(\,\cdot\,) = p(\,\cdot\,) + t\,p(\,\cdot\,)\,\nu(\,\cdot\,) + t\,p(\,\cdot\,)\,r_t(\,\cdot\,) \tag{2.14} \]
and
\[ r_t \longrightarrow 0 , \quad \text{as } t \downarrow 0 , \tag{2.15} \]
where the convergence in (2.15) is in a certain known sense. Then the tangent set of P at p ∈ P* is the class
\[ T^0(p) = T^0(p, \mathcal{P}) = \left\{ \nu \in L_0^2(p) : \exists\, V,\ \{p_t\}_{t\in V} \subseteq P^*,\ \{r_t\}_{t\in V}, \text{ such that } \forall t \in V, \text{ (2.14) and (2.15) hold} \right\} . \]
Since the tangent sets and spaces depend on the notion of path differentiability adopted, we speak of Lq (for q ∈ [1, ∞]), weak (or Hellinger) tangent sets and tangent spaces. When necessary we use the notation T^W for the weak tangent space; the Lq tangent spaces are represented by T^q and the essential tangent spaces by T^e.
The following proposition relates the notions of tangent sets and tangent spaces given.
Proof: Straightforward from the interrelations between the notions of path differentiability. □
We close this section with two examples of the calculation of tangent spaces.
We claim that for t sufficiently small, pt ∈ P*, which implies that ν ∈ T0(p, P). It suffices to verify that pt is positive and integrates to 1. For t small, pt is positive because ν is bounded and p is bounded in the support of ν, hence the second term on the right-hand side of (2.16) is smaller than p (for t small). That pt integrates to 1 follows from the fact that ν has expectation zero (with respect to p). □
It is not surprising that the previous enormous class of distributions possesses a "full" tangent space. The next example shows that this can be the case even in families where we have a lot of information about the distributions of the family.
Example 2 (Full tangent space for families with information on the moments)
Consider the class P of all distributions in IR dominated by the Lebesgue measure, with continuous density (with respect to the Lebesgue measure) and with support (of the density) equal to the whole real line. Suppose further that the moments of all orders exist and that there exist a δ > 0, a k ∈ N and constants m1, …, mk such that for each i ∈ {1, …, k} the moment of order i is contained in the open interval (mi − δ, mi + δ). We claim that the tangent space of P at any p ∈ P* is L20(p). The proof follows the same line of argument as in the previous example. Take a path as in (2.16) with ν ∈ Cb ∩ L20(p). For t sufficiently small, pt will be positive, integrate to one, possess finite moments of all orders, and its moments of order i, for i ≤ k, will be contained in the interval (mi − δ, mi + δ). □
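The construction used in both examples can be checked numerically (our own toy instance, not from the thesis: p standard normal, which is continuous with full support and all moments finite, and ν(x) = tanh x, which is bounded, continuous and has mean zero under p).

```python
import numpy as np

x = np.linspace(-12, 12, 40001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # continuous, full support, all moments
nu = np.tanh(x)                               # bounded, mean zero under p (odd function)

t = 0.3
pt = p * (1.0 + t * nu)                       # the perturbed path p_t = p + t p nu

positive = bool(np.all(pt > 0))               # p_t is a positive function
mass = float(np.sum(pt) * dx)                 # p_t still integrates to one
mean_shift = float(np.sum(x * pt) * dx)       # the first moment moves only by O(t)
```

For |t| < 1 positivity holds because |tanh| < 1, and the total mass is unchanged because ν has mean zero under p; the moments move continuously with t, as the second example requires.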
2.3. FUNCTIONAL DIFFERENTIABILITY 39
\[ P^* = \{ p(\,\cdot\,; \theta, z) : \theta \in \Theta \subseteq \mathbb{R}^q ,\ z \in Z \} , \]
\[ \phi\{ p(\,\cdot\,; \theta, z) \} = \theta . \quad \square \]
Since the definition of functional differentiability depends on the notion of path dif-
ferentiability, we speak of L∞ , Lp , strong (L2 ), weak, mean (L1 ) and essential functional
differentiability. When necessary we superimpose a symbol indicating the notion of path differentiability in play. When we are speaking generically, or when it is clear from the
context which notion of path differentiability is in play, we just use the notation φ•p for
the gradient and T 0 (p, P ∗ ) = T 0 (p), T (p, P ∗ ) = T (p) for the tangent set and the tangent
space of P ∗ at p respectively.
Note that the notion of functional differentiability introduced here involves a subset T(p) of the tangent space and not necessarily the whole tangent space, as is current in the literature. This gives much more flexibility to the estimation theory developed. Clearly, the smaller the class T(p) (or the stronger the notion of path differentiability) used, the weaker the related notion of functional differentiability. On the other hand, the larger the class T(p), the sharper the results of the related estimation theory, in the sense that the bounds for the asymptotic variance will be larger or the optimality results will include more estimating sequences. In this sense the ideal would be to choose the largest T(p) (and the strongest path differentiability) that makes the functional under study differentiable. Of course, we will have to require some mathematical properties of the classes T(p) in order to obtain a notion of functional differentiability useful for the estimation theory of differentiable functionals. For instance, it will be assumed throughout (and silently) that T(p) is a pointed cone (i.e. if ν ∈ T(p), then for each α ∈ IR+ ∪ {0}, αν ∈ T(p)). We will refer from now on to T(p) as the tangent cone. It will sometimes be necessary to require the tangent cones to be convex.
We consider next a trivial example that illustrates the mechanics of functional differentiability.
\[ \int_X p(x)\,\lambda(dx) = 1 ; \tag{2.20} \]
\[ p \text{ is continuous} ; \tag{2.21} \]
\[ \int_X x^2\, p(x)\,\lambda(dx) \in \mathbb{R}_+ . \tag{2.22} \]
We denote the class of densities of the elements of P with respect to λ by P*. Define the functional M : P* → IR by, for each p ∈ P*,
\[ M(p) = \int_X x\, p(x)\,\lambda(dx) . \]
Since (2.24) and (2.23) hold for any L2 differentiable path, we conclude that M is differentiable with respect to the L2 tangent set and that M•p is a gradient of M. An argument based on subsequences (cf. Labouriau, 1996) yields the differentiability of M with respect to the essential tangent set, i.e. the mean functional is differentiable in the strongest sense we can define in our set-up.
Note that in this example (2.17) holds for any differentiable path with tangent ν ∈ T(p). However, according to our definition of functional differentiability it would be enough if condition (2.17) held for one path with tangent ν. □
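The defining property of a gradient — its inner product with the tangent reproduces the derivative of the functional along the path — can be verified numerically for the mean functional M (a sketch; the direction ν and the grid are our own choices, and x − M(p) plays the role of the gradient M•p of this example).

```python
import numpy as np

x = np.linspace(-12, 12, 40001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # base density p = N(0, 1)
nu = np.tanh(x)                                # bounded, mean-zero direction

def M(density):
    return float(np.sum(x * density) * dx)     # the mean functional

grad = x - M(p)                                # candidate gradient of M at p

t = 1e-4
pt = p * (1.0 + t * nu)                        # path with tangent nu
directional = (M(pt) - M(p)) / t               # derivative of M along the path
inner = float(np.sum(grad * nu * p) * dx)      # <nu, grad>_p
```

Because M is linear in the density, the difference quotient equals the inner product exactly, not just in the limit.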
We conclude from the remark above that if φ•p is a gradient of φ at p and ξ ∈ {T (p)}⊥ (i.e.
ξ is in the orthogonal complement of the tangent space with respect to L20 (p)), then φ•p + ξ
is also a gradient of φ at p. Hence, in general the gradient of a differentiable functional is
not unique.
A gradient φ•p of a differentiable functional at p ∈ P* is said to be a canonical gradient if φ•p( · ) ∈ T̄(p). Here T̄(p) denotes the L2 closure of the space spanned by T(p). The following proposition shows that there exists only one canonical gradient (apart from almost surely equal functions) and gives a recipe for computing the canonical gradient, namely by orthogonally projecting any gradient onto T̄(p). We will see that the canonical gradient plays a crucial role in the theory of estimation of functionals.
\[ \big( \Pi\{\phi_{1p}^{\bullet}\,|\,\bar{T}(p)\}, \ldots, \Pi\{\phi_{qp}^{\bullet}\,|\,\bar{T}(p)\} \big)^{T} = \big( \Pi\{\phi_{1p}^{*}\,|\,\bar{T}(p)\}, \ldots, \Pi\{\phi_{qp}^{*}\,|\,\bar{T}(p)\} \big)^{T} , \]
p-almost surely. Here Π{ · | T̄(p)} denotes the orthogonal projection onto T̄(p).
Proof: We prove the proposition for the case where q = 1; the same argument applied componentwise proves the case q ∈ N, but with more notation. From the projection theorem we have the orthogonal decomposition
\[ \phi_p^{\bullet} = \Pi\{\phi_p^{\bullet}\,|\,\bar{T}(p)\} + \Pi\{\phi_p^{\bullet}\,|\,\bar{T}^{\perp}(p)\} . \]
Since Π{φ•p | T̄⊥(p)} is orthogonal to T̄(p), we conclude from (2.25) that Π{φ•p | T̄(p)} is a gradient.
which yields
Example 5 (Mean functional continued) It can be shown that the tangent space of the model P given by (2.18) at each p ∈ P* is the whole space L20(p). Hence the gradient calculated in example 4 is the canonical gradient. Moreover, the canonical gradient is the only possible gradient for the mean functional. Note that if we drop the condition requiring the existence of the variance of p (i.e. condition (2.22)), then M•p is no longer a gradient (because it is not in L2) and M is not differentiable at p. □
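The projection recipe of the proposition above can be illustrated in a finite-dimensional analogue (entirely our own construction: a weighted inner product on IR^n stands in for L20(p) and a 5-dimensional subspace for T̄(p)); two gradients differing by a component orthogonal to T̄(p) project onto the same canonical gradient, which moreover has the smallest norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
w = rng.random(n) + 0.1                 # positive weights playing the role of p

def inner(u, v):
    return float(np.sum(u * v * w))     # discrete analogue of <u, v>_p

def project(u, B):
    """Orthogonal projection of u onto the span of the columns of B, w.r.t. <.,.>_w."""
    G = B.T @ (B * w[:, None])          # Gram matrix <b_i, b_j>_w
    c = np.linalg.solve(G, B.T @ (u * w))
    return B @ c

B = rng.standard_normal((n, 5))         # basis of the closed "tangent space" T-bar
g = rng.standard_normal(n)              # some gradient
xi = rng.standard_normal(n)
xi = xi - project(xi, B)                # xi orthogonal to T-bar: g + xi is again a gradient

canon1 = project(g, B)
canon2 = project(g + xi, B)             # same canonical gradient from either gradient
```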
We consider next a proposition giving trivial (but useful) rules for calculating gradients of "composed" functionals.
Proof:
i) Straightforward.
ii) We give the proof for the case where q = 1; the general case is obtained in a similar way. Take an arbitrary differentiable path {pt} with tangent ν and define ξ(t) = φ(pt). We have
\[ \frac{\phi(p_t) - \phi(p)}{t} \longrightarrow \langle \nu, \phi^{\bullet} \rangle_p = \xi'(0) . \]
Now,
\[ \frac{(g \circ \phi)(p_t) - (g \circ \phi)(p)}{t} \longrightarrow (g \circ \xi)'(0) = g'(\xi(0))\, \langle \nu, \phi^{\bullet} \rangle_p = \langle \nu, g'(\phi(p))\, \phi^{\bullet} \rangle_p . \quad \square \]
functions {φ̂n}n∈N = {φ̂n} such that for each n ∈ N, φ̂n : X^n → IRq is (An, B(IRq))-measurable is said to be an estimating sequence. Next we introduce two notions of regularity of estimating sequences often found in the literature. An estimating sequence {φ̂n} is said to be weakly regular (for estimating φ, with respect to the choice of tangent cones made) if for each p ∈ P* and each ν ∈ T(p) there exists a differentiable path {p_{n^{-1/2}}}n∈N converging to p and with domain V = {n^{-1/2} : n ∈ N}, for which
\[ \sqrt{n}\,\{ \phi(p_{n^{-1/2}}) - \phi(p) \} \longrightarrow \int_X \phi^{\bullet}(x, p)\,\nu(x)\,p(x)\,\lambda(dx) , \]
and there exists a probability distribution L_{pν} (not depending on the path) such that
\[ \mathcal{L}_{p_{n^{-1/2}}} \big[ \sqrt{n}\, \{ \hat{\phi}_n(\,\cdot\,) - \phi(p) \} \big] \stackrel{D}{\longrightarrow} L_{p\nu} . \]
If the distributions L_{pν} above do not depend on the tangent ν, then we say that {φ̂n} is regular.
An important class of estimating sequences is the class of asymptotic linear sequences defined next. An estimating sequence {φ̂n} is said to be asymptotic linear (for estimating φ) if there exists a function ICφ : X × P* → IR such that for each p ∈ P* the function ICφ( · ; p) : X → IR is in L20(p) and, for each n ∈ N, given a sample x = (x1, …, xn) of size n, φ̂n admits the representation
\[ \hat{\phi}_n(x) = \phi(p) + \frac{1}{n} \sum_{i=1}^{n} IC_{\phi}(x_i; p) + o_{p^n}(n^{-1/2}) . \tag{2.27} \]
The function ICφ is called the influence function of φ. The representation (2.27) can be re-written as
\[ \sqrt{n}\, \{ \hat{\phi}_n - \phi(p) \} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} IC_{\phi}(x_i; p) + o_{p^n}(1) , \]
where
\[ \mathrm{Cov}_p \{ IC_{\phi}(\,\cdot\,; p) \} = \int_X IC_{\phi}(x; p)\, IC_{\phi}^{T}(x; p)\, p(x)\,\lambda(dx) . \tag{2.28} \]
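For the sample mean the representation (2.27) holds exactly — the o_{p^n}(n^{-1/2}) remainder vanishes identically — which makes it a convenient numerical illustration (a sketch; the distribution and constants below are our own choices).

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0                                  # phi(p), the mean of the true law
n = 100_000
xs = rng.normal(mu, 1.5, size=n)          # a sample from p

phi_hat = xs.mean()                       # the estimator
ic = xs - mu                              # influence function IC(x; p) = x - phi(p)

lhs = np.sqrt(n) * (phi_hat - mu)         # sqrt(n){phi_hat - phi(p)}
rhs = ic.sum() / np.sqrt(n)               # n^{-1/2} sum of IC(x_i; p)
```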
Theorem 4 Let {φ̂n} be an asymptotic linear estimating sequence with influence function IC. Suppose that for each p ∈ P* the tangent cone is given by T(p) = T0_W(p, P*), the weak tangent set. Then {φ̂n} is regular if and only if for all p ∈ P*, φ is differentiable at p (with respect to T(p)) and IC( · ; p) is a gradient of φ at p.
Proof: See Pfanzagl (1990) for the case where q = 1, or Bickel et al. (1995). □
The theorem above identifies (influence functions of) regular asymptotic linear sequences of estimators for estimating the functional φ with the gradients of φ. The covariance, ∫_X φ•p(x)φ•p(x)^T p(x)λ(dx), of a gradient φ•p of φ is the asymptotic covariance of the corresponding regular asymptotic linear estimating sequence (under p) with influence function φ•p. On the other hand, since the components of the canonical gradient φ* of φ are the orthogonal projections of the components of any gradient onto the tangent space, we have, for a given gradient φ•p and for all p ∈ P*,
\[ \phi_p^{\bullet}(\,\cdot\,) = \phi_p^{*}(\,\cdot\,) + R(\,\cdot\,; p) , \]
for some R( · ; p) ∈ {T⊥(p; P)}^q. A standard argument then yields that, for all p ∈ P*,
\[ \int_X \{ \phi_p^{*}(x) \} \{ \phi_p^{*}(x) \}^{T} p(x)\,\lambda(dx) \ \le\ \int_X \{ \phi_p^{\bullet}(x) \} \{ \phi_p^{\bullet}(x) \}^{T} p(x)\,\lambda(dx) , \tag{2.29} \]
with the inequality in the sense of the Löwner partial order of matrices. That is, the covariance of the canonical gradient is a lower bound for the asymptotic covariance of regular asymptotic linear estimating sequences. Moreover, only an asymptotic linear estimating sequence with influence function equal to the canonical gradient achieves this bound. We say that an asymptotic linear estimating sequence is optimal if, for each p ∈ P*, its influence function is the canonical gradient of φ. The bound (2.29) is sometimes called the semiparametric Cramér-Rao bound.
In spite of the elegance of this theory, some care should be observed in applying it.
Firstly, there is a certain degree of arbitrariness in choosing only the class of regular
asymptotic linear estimating sequences. When restricting to that class one can discard
many interesting sequences. This criticism applies, of course, to any optimality approach.
A second, more specific criticism is the following: it occurs very often that the tangent space of large (semi- or non-parametric) models is the whole space L20 (see the examples
at the end of the section on tangent spaces). In those cases, due to the uniqueness of the
canonical gradient, each differentiable functional possesses only one gradient. We conclude
from the previous discussion that then there is only one possible influence function and
hence all regular asymptotic linear estimating sequences are asymptotically equivalent (as
far as the asymptotic variance is concerned). Therefore an optimality theory for regular
asymptotic linear estimators is meaningless for the models with tangent spaces equal to
the whole L20 . We refine next the optimality theory for functional estimation.
It is convenient to introduce the following notation. Given a functional φ differentiable with respect to the tangent cones {T(p) : p ∈ P*} and with canonical gradient φ*( · , p) at each p ∈ P*, denote ∫_X φ*p(x)φ*p(x)^T p(x)λ(dx) by Iφ(p); that is, Iφ(p) is the covariance matrix of the canonical gradient. A weakly regular estimating sequence {φ̂n} is asymptotically of constant bias at p ∈ P* if for each ν, η ∈ T(p),
\[ \int x\, dL_{p\nu}(x) = \int x\, dL_{p\eta}(x) \in \mathbb{R}^q . \]
where the symbol "≥" is understood in the sense of the Löwner partial order of matrices¹. Moreover, equality in (2.30) occurs only if
\[ \sqrt{n}\, \{ \hat{\phi}_n - \phi(p) \} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \phi_p^{*}(x_j) + o_P(1) . \tag{2.31} \]
We see from the theorem above that the larger the tangent cones T(p) used, the sharper the inequalities (2.30). Small tangent cones make the differentiability of the functional more likely, but can also make the bound in (2.30) unattainable.
Another important optimality result in the theory of estimation of functionals is the convolution theorem, of which we give the following version.
Theorem 6 (Convolution theorem) Suppose that T(p) is convex and that φ : P* → IRq is differentiable at p ∈ P* with respect to T(p). Then any limiting distribution Lp of a
regular estimating sequence for φ at p satisfies
Lp = N (0, Iφ (p)) ∗ M , (2.32)
where M is a probability measure on IRq .
Proof: See Pfanzagl (1990) for the case where q = 1 and T(p) = T^W(p), and van der Vaart (1980) for the general case. □
The expression (2.32) shows that, under the assumptions of the convolution theorem, a regular estimating sequence cannot possess asymptotic covariance smaller than the squared L2(p) norm of the canonical gradient. This provides an extension of the interpretation of the optimality theory for regular asymptotic linear estimating sequences. In fact, even when the tangent cone is the whole L20, the "optimal" regular asymptotic linear estimating sequence attains the bound for the concentration of regular estimating sequences given by the convolution theorem, provided the functional is differentiable. An advantage of the version of the convolution theorem presented is that we need not work with the whole tangent space, but only with a convex cone contained in it. This can be useful when the functional under study is not differentiable or when the calculation of the (weak) tangent space is not feasible.
We close this section presenting a theorem that gives a minimax approach to the problem of estimation of functionals. A function l : IRq → IR is said to be bowl-shaped if l(0) = 0, l(x) = l(−x) and, for all k ∈ IR, the set {x : l(x) ≤ k} is convex.
¹That is, A ≥ B means that A − B is nonnegative definite.
ii) For any bowl-shaped loss function l and any estimating sequence {φ̂n },
\[ \lim_{c \to \infty}\, \liminf_{n \to \infty}\, \sup_{Q \in H_n(p,c)} E_Q \big\{ l \big[ \sqrt{n}\, \{ \hat{\phi}_n - \phi(Q) \} \big] \big\} \ \ge\ \int l(x)\, dN(0, I_{\phi}(p))(x) , \tag{2.34} \]
where H_n(p, c) := {Q ∈ P : n ∫ {dQ^{1/2}(x) − p^{1/2}(x)}² λ(dx) ≤ c} is the intersection of P with the ball, in Hellinger distance, of centre p and radius (c/n)^{1/2}.
Note that from part i) one can obtain a bound for the concentration of weakly regular estimating sequences based on the canonical gradient, provided φ is differentiable with respect to some convex tangent cones. In particular, if there exists an optimal asymptotic linear estimating sequence and the assumptions of the theorem hold (i.e. differentiability of φ and convexity of the tangent cone), then the bound for weakly regular estimating sequences given by (2.33) is attained by this regular asymptotic linear estimating sequence. In this way, in the case where the tangent space is the whole L20, the optimality of the (unique) regular asymptotic linear estimating sequence can be justified. The bound of the second part of the theorem above holds for the whole class of estimators; however, it is in general not attainable.
2.4. ASYMPTOTIC BOUNDS FOR SEMIPARAMETRIC MODELS 49
with rt( · ) → 0 λ-almost everywhere. Hence the path {pt} is (L∞) differentiable with tangent l^T( · ; θ, z)α. Moreover,
\[ \frac{\phi(p_t) - \phi(p)}{t} = \frac{\theta + t\alpha - \theta}{t} = \alpha . \]
Defining φ*p( · ) = Cov_{θz}{l( · ; θ, z)}^{-1} l( · ; θ, z), we obtain
\[ \int_X \phi_p^{*}(x)\,\nu(x)\,p(x)\,\lambda(dx) = \int_X \mathrm{Cov}_{\theta z}\{ l(\,\cdot\,; \theta, z) \}^{-1}\, l(x; \theta, z)\, l^{T}(x; \theta, z)\,\alpha\, p(x)\,\lambda(dx) = \alpha = \lim_{t \to 0} \frac{\phi(p_t) - \phi(p)}{t} . \]
We conclude that φ is differentiable at p with respect to T1(p) and, moreover, that
\[ \mathrm{Cov}_{\theta z}\{ l(\,\cdot\,; \theta, z) \}^{-1}\, l(\,\cdot\,; \theta, z) \]
is the canonical gradient of φ. Note that we used (in (2.35)) implicitly the L∞ path differentiability; however, the argument presented holds for any weaker path differentiability. For, note that the essential point is that we identify (through (2.35)) each element of the tangent cone T1(p) with an L∞ differentiable path. If we adopt a path differentiability weaker than the L∞ differentiability, then the L∞ differentiable paths identified with the elements of the tangent cone are differentiable in the current sense also, and the differentiability of the functional φ follows from the argument presented above.
The matrix Iφ(p) (i.e. the covariance matrix of the canonical gradient of φ at p) is the inverse of the covariance matrix of the score function l( · ; θ, z). The bounds for the asymptotic variance obtained with this naive choice of tangent cones are not attainable in general. This will be apparent from the development presented next, where sharper bounds will be presented.
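The claim that the covariance of the canonical gradient Cov{l}^{-1} l is the inverse of the covariance of the score can be checked by simulation in a two-parameter Gaussian model (a numerical sketch under our own choice of model and constants, not an example from the thesis).

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sig2 = 0.5, 2.0
xs = rng.normal(mu, np.sqrt(sig2), size=500_000)

# score function of N(mu, sig2): components for mu and sig2
l = np.stack([(xs - mu) / sig2,
              ((xs - mu)**2 - sig2) / (2 * sig2**2)])

V = np.cov(l)                      # sample Cov(l): the Fisher information matrix
grad = np.linalg.inv(V) @ l        # canonical gradient Cov(l)^{-1} l
C = np.cov(grad)                   # its covariance: should equal Cov(l)^{-1}
```

For this model the inverse Fisher information is diag(sig2, 2 sig2²), which the simulated covariance C reproduces up to Monte Carlo error.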
We introduce the notion of nuisance tangent space, which plays a fundamental role in the estimation theory of semiparametric models. For each θ0 ∈ Θ consider the submodels
\[ P^*_{\theta_0} = \{ p(\,\cdot\,; \theta_0, z) : z \in Z \} . \]
The nuisance tangent set at (θ, z) ∈ Θ × Z, TN0(θ, z), is the tangent set of P*θ, i.e. TN0(θ, z) = T0(p, P*θ). The closure of the space spanned by the nuisance tangent set is called the nuisance tangent space and denoted by TN(θ, z). Here we do not specify the notion of path differentiability adopted, but when necessary a symbol will be superimposed.
An alternative for the tangent cone, better than T1(p), is
We show next that φ is differentiable with respect to T2(p), no matter which notion of path differentiability we use. Consider ν ∈ TN0(p) ⊆ T2(p). There is a differentiable path {pt} contained in P*θ with tangent ν. Since for each t, pt ∈ P*θ, we have φ(pt) = θ = φ(p) and
\[ \frac{\phi(p_t) - \phi(p)}{t} = 0 . \]
From the definition of functional differentiability, any gradient φ•p of φ must satisfy, for each ν ∈ TN0(θ, z),
\[ 0 = \lim_{t \searrow 0} \frac{\phi(p_t) - \phi(p)}{t} = \int_X \phi_p^{\bullet}(x)\,\nu(x)\,p(x)\,\lambda(dx) . \tag{2.36} \]
On the other hand, the argument presented in the case where the tangent cone is T1(p) implies that if ν ∈ span{li(x; θ, z) : i = 1, …, q}, say ν( · ) = l( · ; θ, z)^T α for some α ∈ IRq, then any gradient φ•p of φ satisfies
\[ \alpha = \int_X \phi_p^{\bullet}(x)\,\nu(x)\,p(x)\,\lambda(dx) . \tag{2.37} \]
Clearly, the conditions (2.36) and (2.37) are sufficient to ensure that φ•p is a gradient of φ. From these considerations, a natural candidate for a gradient of φ is the (standardized) projection of the score function onto the orthogonal complement of the nuisance tangent space. Formally, define the function l^E : X × Θ × Z → IRq by, for each (θ, z) ∈ Θ × Z, l^E( · ; θ, z) = (l1^E( · ; θ, z), …, lq^E( · ; θ, z))^T where, for i = 1, …, q,
Here Π(g | A) is the orthogonal projection of g ∈ L20(Pθz) onto A ⊆ L20(Pθz), and TN⊥(θ, z) is the orthogonal complement of TN(θ, z) in L20(Pθz). The function l^E is called the efficient score function and we define the efficient information by
\[ J(\theta, z) = \int_X l^{E}(x; \theta, z)\, l^{E}(x; \theta, z)^{T}\, p(x)\,\lambda(dx) . \]
Define
\[ \phi_p^{*}(\,\cdot\,) = J(\theta, z)^{-1}\, l^{E}(\,\cdot\,; \theta, z) . \]
Clearly φ*p satisfies (2.36) and (2.37). We conclude that φ*p is a gradient of φ. Moreover, φ*p is the canonical gradient (with respect to T2(p)), since φ*p is in the closure of the span of the tangent cone.
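The projection defining the efficient score can be carried out by Monte Carlo in a toy problem of our own making (not a model from the thesis): under p = Exp(1) we take u(x) = x − 1 as the "interest score" and let the single direction v(x) = x² − 2x span the nuisance tangent space; both have mean zero under p. Analytically ⟨u, v⟩ = 2 and ⟨v, v⟩ = 8, so the efficient score is u − v/4 and the efficient information is ⟨u, u⟩ − ⟨u, v⟩²/⟨v, v⟩ = 1/2.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.exponential(1.0, size=1_000_000)

u = xs - 1.0               # interest score (mean zero under Exp(1))
v = xs**2 - 2.0 * xs       # spans the nuisance tangent space (mean zero)

beta = np.mean(u * v) / np.mean(v * v)   # projection coefficient <u, v>/<v, v>
l_eff = u - beta * v                     # efficient score: u minus its projection on v
J = np.mean(l_eff * u)                   # efficient information <l_eff, l_eff>
```

By construction l_eff is orthogonal to the nuisance direction, and J comes out close to the analytic value 1/2.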
Note that, choosing T2(p) as the tangent cone, the functional φ is still differentiable and we obtain a bound, related to the extended Cramér-Rao inequality, sharper than the bound obtained with T1(p). However, since T2(p) is not necessarily convex, we cannot use the convolution theorem and the local minimax theorem.
A third alternative for the tangent cone is
pt ( · ) = p( · ; tα + θ, zt ) (2.38)
is the canonical gradient. In other words, we obtain the same canonical gradient of φ whether we work with T2(p) or T3(p), and consequently the extended Cramér-Rao bound is also the same for the two choices of tangent cone. Note that T3(p) is convex, hence we can use the convolution and the local asymptotic minimax theorems. This provides an additional justification of the extended Cramér-Rao bound (via the convolution theorem) and an optimality theory involving a larger class of estimators, namely the weakly regular asymptotic linear estimating sequences (as in the first part of the local asymptotic minimax theorem) or even arbitrary estimating sequences (as in the second part of the local asymptotic minimax theorem). However, we pay a price for these improvements: we have to introduce regularity conditions on the model in order to obtain the differentiability of the interest parameter functional.
It is current in the literature to take the whole (weak or Hellinger) tangent set as the tangent cone, to assume that the tangent set is equal to T3(p), and to use (implicitly) assumptions equivalent to (2.38) (see Pfanzagl, 1990, page 17). The strength of the approach based on tangent cones, and not necessarily on the whole tangent set, is that it allows us to graduate the regularity conditions: we can avoid the assumptions mentioned above in the difficult cases, or take full advantage of them in the sufficiently regular cases. The approach based on tangent cones allows us to treat the cases where the tangent set is difficult (or virtually impossible) to calculate (see for instance the semiparametric extended L2-restricted models considered in chapter 5).
We conclude the chapter with a comment regarding reparametrizations. Suppose that we reparametrize the model by considering the interest parameter g(θ) instead of θ. Here g is a one-to-one differentiable mapping from IRq to IRq. The interest parameter functional becomes g ∘ ψ(Pθz) = g(θ). An application of proposition 8 and the chain rule shows that if an estimating sequence {θ̂n} attains the semiparametric Cramér-Rao bound for estimating θ, then the transformed sequence {g(θ̂n)} attains the Cramér-Rao bound for estimating g(θ).
Chapter 3
Estimating and Quasi Estimating Functions
3.1 Introduction
In this chapter the theory of estimating functions for semiparametric models is studied. The basic definitions and properties of estimating functions are given in section 3.2. There a related notion, called quasi estimating function, is also introduced. Quasi estimating functions are essentially functions of the observations, the interest parameter and (unlike estimating functions) the nuisance parameter. They will provide a way to formalize the theory of estimating functions more clearly and to relate estimating functions with regular asymptotic linear estimators. In order to construct an optimality theory for estimating functions, we define a class of what we call regular estimating functions. Two alternative (and equivalent) characterizations of the regular estimating functions are provided in subsections 3.2.2 and 3.2.3. The second characterization is motivated by differential geometric considerations concerning the statistical model (inspired by Amari and Kawanabe, 1996).
The characterizations referred to are used to derive an optimality theory in section 3.3. A necessary and sufficient condition for the coincidence of the bound for the concentration of estimators based on estimating functions with the semiparametric Cramér-Rao bound is provided in subsection 3.3.3. This condition says essentially that the nuisance tangent space should not depend on the nuisance parameter.
The last section contains some complementary material. Subsection 3.4.1 studies a technique for obtaining optimal estimating functions when the likelihood function can be decomposed in a certain way. In this way an alternative justification for the so-called principle of conditioning will be provided. A generalization of the notion of estimating function is introduced in subsection 3.4.2. The chapter closes with a result that will allow
56 CHAPTER 3. ESTIMATING AND QUASI ESTIMATING FUNCTIONS
Under regularity conditions each θ̂n is well defined and the sequence {θ̂n} is consistent (for estimating θ) and asymptotically normally distributed. We explore this fact to construct an optimality theory.
We introduce next a notion related to estimating functions. A function Ψ : X × Θ × Z → IRq, of the parameters and the observations, such that for each θ ∈ Θ and each z ∈ Z the function Ψ( · ; θ, z) : X → IRq is measurable, is called a quasi-estimating function. Each estimating function can be naturally identified with a quasi-estimating function, namely the quasi-estimating function that is constant in the nuisance parameter. We make no distinction between estimating functions and the corresponding quasi-estimating functions; this abuse of language causes, in general, no risk of ambiguity.
A quasi-estimating function Ψ : X × Θ × Z → IRq such that the conditions (3.2)-(3.6) below are satisfied is said to be a regular quasi-estimating function. The conditions are, with ψi denoting the ith component of Ψ and for all θ0 ∈ Θ, all z ∈ Z and all i, j ∈ {1, …, q},
\[ \psi_i(\,\cdot\,; \theta_0, z) \in L_0^2(P_{\theta_0 z}) ; \tag{3.2} \]
the partial derivative with respect to θ is well defined (almost everywhere), i.e.
\[ \frac{\partial}{\partial \theta^j}\, \psi_i(\,\cdot\,; \theta, z) \Big|_{\theta = \theta_0} \text{ exists} ; \tag{3.3} \]
the order of differentiation with respect to θ and integration can be exchanged in the following sense:
\[ \frac{\partial}{\partial \theta^j} \int_X \psi_i(x; \theta, z)\, p(x; \theta, z)\,\lambda(dx) \Big|_{\theta = \theta_0} = \int_X \frac{\partial}{\partial \theta^j} \big[ \psi_i(x; \theta, z)\, p(x; \theta, z) \big]_{\theta = \theta_0} \lambda(dx) ; \tag{3.4} \]
and
\[ E_{\theta z} \big\{ \Psi(\,\cdot\,; \theta_0, z)\, \Psi^{T}(\,\cdot\,; \theta_0, z) \big\} = \left[ \int_X \psi_i(x; \theta_0, z)\, \psi_j(x; \theta_0, z)\, p(x; \theta_0, z)\,\lambda(dx) \right]_{i,j = 1, \ldots, q} \tag{3.6} \]
is positive definite.
It is presupposed that the parametric partial score function is a regular quasi-estimating
function.
A regular quasi-estimating function that does not depend on the nuisance parameter z is said to be a regular estimating function.
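A minimal concrete instance (ours, not the thesis's): ψ(x; θ) = x − θ is a regular estimating function for the mean θ in a location model whose error law is the (infinite-dimensional) nuisance — condition (3.2)-style unbiasedness holds whatever the mean-zero, finite-variance error distribution, and the estimating equation is solved by the sample mean.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = 1.5
xs = theta_true + rng.standard_t(df=5, size=200_000)  # unknown symmetric errors

def psi(x, theta):
    return x - theta        # estimating function, free of the nuisance parameter

# unbiasedness at the truth, whatever the (mean-zero) error law
mean_at_truth = float(np.mean(psi(xs, theta_true)))

# the estimating equation sum_i psi(x_i; theta) = 0 is solved by the sample mean
theta_hat = float(xs.mean())
```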
\[ \psi_i(\,\cdot\,; \theta, z) \in T_N^{\perp}(\theta, z) . \]
Here, and in the rest of this chapter, TN(θ, z) = T²N(θ, z), and TN⊥(θ, z) is the orthogonal complement of the nuisance tangent space T²N(θ, z) in L20(Pθz).
Proof: Take (θ, z) ∈ Θ × Z and i ∈ {1, …, q} fixed and ν ∈ TN0(θ, z) arbitrary. We prove that ν and ψi( · ; θ, z) are orthogonal in L2(Pθz); this implies the proposition, because of the continuity of the inner product.
Let {pt}t∈V be a differentiable path at (θ, z) with tangent ν and remainder term {rt}t∈V. Using (2.1), for each t ∈ V,
\[ \langle \nu(\,\cdot\,), \psi_i(\,\cdot\,; \theta, z) \rangle_{\theta z} = \frac{1}{t} \int_X \psi_i(x; \theta, z)\, p_t(x)\,\lambda(dx) - \frac{1}{t} \int_X \psi_i(x; \theta, z)\, p(x; \theta, z)\,\lambda(dx) - \langle r_t(\,\cdot\,), \psi_i(\,\cdot\,; \theta, z) \rangle_{\theta z} = - \langle r_t(\,\cdot\,), \psi_i(\,\cdot\,; \theta, z) \rangle_{\theta z} . \]
3.2. BASIC DEFINITIONS AND PROPERTIES 59
Since rt → 0 in L2(Pθz), from the continuity of the inner product we conclude that
\[ \langle \nu(\,\cdot\,), \psi_i(\,\cdot\,; \theta, z) \rangle_{\theta z} = 0 . \quad \square \]
If the quasi-estimating function does not depend on the nuisance parameter (i.e. it corresponds to a genuine estimating function), then we can obtain a sharper result.
Proposition 10 Let Ψ : X × Θ −→ IR^q be a regular estimating function with components ψ_1, . . . , ψ_q. For all θ ∈ Θ and i ∈ {1, . . . , q},
\[
\psi_i(\,\cdot\,;\theta) \in \bigcap_{z\in\mathcal Z} T_N^{\perp}(\theta,z)\, .
\]
In fact, the proposition above holds for the class of quasi-estimating functions whose expectation is invariant with respect to the nuisance parameter.
Proof: Take θ ∈ Θ and i ∈ {1, . . . , q} fixed and arbitrary ξ ∈ Z and ν ∈ T_N⁰(θ, ξ). We prove that ν and ψ_i( · ; θ) are orthogonal in L²(P_θξ).
Let {p_t}_{t∈V} be a differentiable path at (θ, ξ) with tangent ν and remainder term {r_t}_{t∈V}. Using (2.1), for each t ∈ V,
\[
\langle \nu(\,\cdot\,), \psi_i(\,\cdot\,;\theta)\rangle_{\theta\xi} = -\langle r_t(\,\cdot\,), \psi_i(\,\cdot\,;\theta)\rangle_{\theta\xi}\, .
\]
Since \(r_t \xrightarrow{L^2(P_{\theta\xi})} 0\), from the continuity of the inner product, we conclude that
\[
\langle \nu(\,\cdot\,), \psi_i(\,\cdot\,;\theta)\rangle_{\theta\xi} = 0\, . \qquad \square
\]
If a possesses a finite expectation under P_θz∗ we define the e-parallel transport of a from z to z∗ by
\[
\Pi^{(e)}_{z z_*}\, a(\,\cdot\,) = a(\,\cdot\,) - \int_{\mathcal X} a(x)\, p(x;\theta,z_*)\,\lambda(dx)\, .
\]
The basic properties of the m- and e-parallel transport are given next.
Proposition 11 We have for each z, z∗ ∈ Z and each a, b ∈ L₀²(p) ∩ L₀²(p∗):
\[
\int_{\mathcal X}\Pi^{(m)}_{z z_*} b(x)\, p_*(x)\,\lambda(dx) = \int_{\mathcal X}\Pi^{(e)}_{z z_*} b(x)\, p_*(x)\,\lambda(dx) = 0\, ; \tag{3.7}
\]
\[
\langle a, \Pi^{(m)}_{z z_*} b\rangle_{\theta z_*} = \langle a, b\rangle_{\theta z}\, ; \tag{3.8}
\]
\[
\langle \Pi^{(e)}_{z z_*} a, \Pi^{(m)}_{z z_*} b\rangle_{\theta z_*} = \langle a, b\rangle_{\theta z} \tag{3.9}
\]
and
\[
\Pi^{(m)}_{z_* z}\Pi^{(m)}_{z z_*}\, a(\,\cdot\,) = \Pi^{(e)}_{z_* z}\Pi^{(e)}_{z z_*}\, a(\,\cdot\,) = \Pi^{(m)}_{z z}\, a(\,\cdot\,) = \Pi^{(e)}_{z z}\, a(\,\cdot\,) = a(\,\cdot\,)\, . \tag{3.10}
\]
The parallel transports defined above have their origin in differential geometric considerations for statistical parametric models (the α-connections). We will not enter into the details of the geometric theory for semiparametric models, but refer instead to Amari and Kawanabe (1996) for an informal discussion. The parallel transports permit us to change the inner product (see (3.8)), i.e. they permit us to move from one L² space to another, keeping to a certain extent the structure given by the inner product of the first space. For instance, L² orthogonality (i.e. noncorrelation) is preserved under m-parallel transport. From the statistical viewpoint the e- and the m-parallel transport correspond to correcting for the mean and correcting for the distribution, respectively, when we move from one L² space to another.
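The properties (3.7)-(3.9) can be made concrete on a finite sample space. The m-parallel transport was defined earlier in the chapter; here its density-ratio form Π^{(m)}_{zz∗}b = b · p/p∗ is an assumption (the definition precedes this excerpt), chosen to be consistent with (3.8), together with the e-transport of the display above. A minimal numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
p     = rng.dirichlet(np.ones(n))   # density of P_{theta z} on a 6-point space
pstar = rng.dirichlet(np.ones(n))   # density of P_{theta z*}

# a, b in L_0^2(P_{theta z}): centred under p
a = rng.normal(size=n); a -= a @ p
b = rng.normal(size=n); b -= b @ p

m_transport = lambda g: g * p / pstar   # assumed density-ratio form of Pi^(m)_{z z*}
e_transport = lambda g: g - g @ pstar   # mean correction under z*: Pi^(e)_{z z*}

inner = lambda g, h, w: np.sum(g * h * w)   # <g, h> in L^2 with density w

assert abs(np.sum(m_transport(b) * pstar)) < 1e-12          # (3.7), m-part
assert abs(np.sum(e_transport(b) * pstar)) < 1e-12          # (3.7), e-part
assert np.isclose(inner(a, m_transport(b), pstar), inner(a, b, p))              # (3.8)
assert np.isclose(inner(e_transport(a), m_transport(b), pstar), inner(a, b, p)) # (3.9)
```

The m-transport reweights by the density ratio (correcting for the distribution), while the e-transport merely recentres (correcting for the mean), matching the statistical interpretation above.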
The following class of functions will be of interest in the theory of estimating functions,
\[
\mathcal F_{IA}(\theta,z) = \left\{ r \in T_N^{\perp}(\theta,z) :
\forall z_* \in \mathcal Z \text{ and } \forall \nu_* \in T_N(\theta,z_*),\;
\langle \Pi^{(m)}_{z z_*}\nu_*, r\rangle_{L^2(P_{\theta z})} = 0 \right\} .
\]
When the e-parallel transport is well defined one can alternatively use the relation ⟨ν∗, Π^{(e)}_{zz∗} r⟩_{L²(P_θz∗)} = 0 instead of ⟨Π^{(m)}_{zz∗} ν∗, r⟩_{L²(P_θz)} = 0. Informally, F_IA is the class of functions r in T_N⊥(θ, z) such that r corrected for the mean or corrected for the distribution is orthogonal to each T_N(θ, z∗) under P_θz∗ (for z∗ running over the whole Z).
Proposition 12 For each (θ, z) ∈ Θ × Z, F_IA(θ, z) is a closed subspace of L₀²(P_θz).
Proof: The linearity and the continuity of ⟨Π^{(m)}_{zz∗} ν∗, ( · )⟩_{L²(P_θz)} imply that F_IA(θ, z) is a vector subspace and a closed set in L²(P_θz), respectively. □
\[
\psi_i(\,\cdot\,;\theta) \in \mathcal F_{IA}(\theta,z)\, .
\]
The proposition above can easily be sharpened by making the components of regular estimating functions belong to the intersection (over the nuisance parameter) of the F_IA's. However, the following theorem shows that this is not necessary, since F_IA does not in fact depend on the nuisance parameter. We will sometimes use the notation F_IA(θ).
Here T_N∗⊥(θ, z∗) denotes the orthogonal complement of T_N(θ, z∗) in L₀²(P_θz∗).
Proof:
'⊆' Take η ∈ F_IA(θ, z), z∗ ∈ Z arbitrary and ν∗ ∈ T_N(θ, z∗). Applying (3.8) yields
\[
\langle \nu_*, \eta\rangle_{L^2(P_{\theta z_*})} = \langle \Pi^{(m)}_{z z_*}\nu_*, \eta\rangle_{L^2(P_{\theta z})} = 0\, .
\]
Hence η ∈ T_N∗⊥(θ, z∗). Since z∗ was chosen arbitrarily in Z, η ∈ ∩_{z∗∈Z} T_N∗⊥(θ, z∗).
'⊇' Take an arbitrary z∗ ∈ Z, η ∈ ∩_{z∗∈Z} T_N∗⊥(θ, z∗) and ν∗ ∈ T_N(θ, z∗). Using (3.8) we obtain
\[
\langle \Pi^{(m)}_{z z_*}\nu_*, \eta\rangle_{L^2(P_{\theta z})} = \langle \nu_*, \eta\rangle_{L^2(P_{\theta z_*})} = 0\, .
\]
We stress that K(θ, z) must not depend on the observation x. Clearly, two equivalent
estimating functions have the same roots almost surely and hence produce essentially the
same estimators, i.e. they are equivalent from the statistical point of view. Moreover,
it is easy to see that two equivalent quasi-estimating functions share the same Godambe
information for each value of the parameters.
is defined for each quasi-estimating function whose components are in L₀², not only for regular estimating functions. According to the new definition both Ψ and Ψ^I possess the same sensitivity. □
The following proposition gives further properties of regular inference functions, which
will allow us to establish an upper bound for the Godambe information.
(i) Ψ^I ∼ l^I. Moreover, for j = 1, . . . , q,
\[
\langle l_j^I(\,\cdot\,;\theta,z), \psi_i(\,\cdot\,;\theta,z)\rangle_{\theta z}
= -\int_{\mathcal X}\left\{\frac{\partial}{\partial\theta^j}\,\psi_i(x;\theta,z)\right\} p(x;\theta,z)\,d\lambda(x)\, . \tag{3.16}
\]
We conclude from (3.15) and (3.16) that Ψ^I( · ; θ, z) = S_Ψ(θ, z) l^I( · ; θ, z), which means that Ψ^I and l^I are equivalent.
(ii) From the previous discussion span{Ψ_i^I( · ; θ, z) : i = 1, . . . , q} is the space spanned by −S_Ψ(θ, z) l^I( · ; θ, z), which is the span of {l_i^I( · ; θ, z) : i = 1, . . . , q}, since the sensitivity is by assumption of full rank.
(iii) Straightforward. □
A consequence of the last two propositions is that J_{l^I} is an upper bound for the Godambe information of regular quasi-inference functions. This upper bound is attained by any (if any exists) extended regular inference function with components in E. In particular, if l^I is a regular (quasi-) inference function, then it is an optimal (quasi-) inference function.
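The role of the Godambe information as the criterion being bounded can be illustrated in the simplest scalar case. The expression J_ψ(θ) = (E_θ{ψ'(θ)})²/E_θ{ψ²(θ)} is an assumption here, since the formal definition precedes this excerpt. A Monte Carlo sketch for a normal location model shows the score attaining the bound while another unbiased estimating function stays strictly below it:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.0
x = rng.normal(theta, 1.0, size=200_000)

def godambe(psi, dpsi):
    # scalar Godambe information: (sensitivity)^2 / variability
    return np.mean(dpsi(x)) ** 2 / np.mean(psi(x) ** 2)

# score of N(theta, 1): psi(x) = x - theta, with d psi / d theta = -1
J_score = godambe(lambda v: v - theta, lambda v: -np.ones_like(v))
# another unbiased estimating function: psi(x) = (x - theta)^3
J_cubic = godambe(lambda v: (v - theta) ** 3, lambda v: -3 * (v - theta) ** 2)

assert abs(J_score - 1.0) < 0.02          # attains the Fisher information bound
assert abs(J_cubic - 9.0 / 15.0) < 0.05   # 9/15 < 1: strictly below the bound
```

For the cubic function the exact value is (3E(x−θ)²)²/E(x−θ)⁶ = 9/15 under the normal model, illustrating the loss incurred by a non-optimal inference function.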
stronger than (or equal to) the weak path differentiability; in particular the L² path differentiability is allowed. Moreover, in the examples we have in mind (L²-restricted models) the notions of weak and L² path differentiability coincide. Take a fixed p( · ; θ, z) = p( · ) in P∗ and consider the functional given by
\[
\Phi(p(\,\cdot\,;\theta,z)) = \theta\, .
\]
The following lemma will allow us to connect the optimality theory for estimating functions with the semiparametric Cramér-Rao lower bound.
and
\[
\int_{\mathcal X}\Phi^{\bullet}_p(x)\, l^T(x;\theta,z)\, p(x)\,\lambda(dx) = I_q\, , \tag{3.18}
\]
that is, the condition (3.18) holds. We conclude that Φ•_p is a gradient of Φ at p with respect to the tangent cone T(p). □
3.3. OPTIMALITY THEORY FOR ESTIMATING FUNCTIONS

According to the lemma above Φ•_p is a gradient of Φ at p, but not necessarily the canonical gradient. In fact the canonical gradient of the functional Φ at p with respect to T(p) is
\[
\Phi^{\star}_p(\,\cdot\,) = J^{-1}(\theta,z)\, l^E(\,\cdot\,;\theta,z)\, ,
\]
where l^E( · ; θ, z) is the efficient score function at (θ, z) and J(θ, z) is the covariance matrix of l^E( · ; θ, z) under P_θz (see chapter 2). The uniqueness of the canonical gradient implies that Φ•_p is the canonical gradient if and only if it is equal to Φ⋆_p, and this occurs if and only if T_N⊥(θ, z) = F_IA(θ). The covariance of Φ⋆_p( · ) (under P_θz), that is J⁻¹(θ, z), gives the semiparametric Cramér-Rao lower bound. On the other hand, the lower bound for the asymptotic covariance of estimators obtained from regular estimating functions is the covariance (under P_θz) of Φ•_p. We conclude that the following result holds.
Theorem 8 The semiparametric Cramér-Rao lower bound coincides with the bound for the asymptotic covariance of estimators defined through regular estimating functions at (θ, z) ∈ Θ × Z if and only if
\[
T_N^{\perp}(\theta,z) = \mathcal F_{IA}(\theta)\, .
\]
The theorem above implies that estimating functions produce efficient estimators only if the orthogonal complement of the nuisance tangent space does not depend on the nuisance parameter. Examples of that situation are the L²-restricted and the partial parametric models considered in chapters 5 and 6.
Theorem 9 Assume that there exists a statistic T such that one has the decomposition (3.19). Moreover, suppose that the class {P^T_θz : z ∈ Z}, where P^T_θz is the distribution of T(x) under P_θz (i.e. x ∼ P_θz), is complete. Then the regular inference function given by
\[
\Psi(x;\theta) = \frac{\partial}{\partial\theta}\log f_t(x;\theta), \quad \forall x \in \mathcal X, \ \forall\theta \in \Theta \tag{3.20}
\]
is optimal. Moreover, if Φ is also an optimal inference function then Φ is equivalent to Ψ.
The theorem above gives an alternative justification for the use of conditional inference.
The following technical (and trivial) lemma will be the kernel of the proofs that follow. But first it is convenient to introduce the following notation. Given a regular inference function Ψ : X × Θ −→ IR^k, we define
\[
\tilde\Psi(x;\theta) = \frac{\Psi(x;\theta)}{E_{\theta z}\{\Psi'(\theta)\}}\, ,
\]
which is called the standardized version of Ψ. Here Ψ′(θ) = ∇_θ Ψ(θ). Along this section we denote the class of all regular estimating functions by G.
(i)
\[
\frac{E_{\theta z}\{\Psi(\theta)\, l(\theta;z)\}}{E_{\theta z}\{\Psi'(\theta)\}} = -1\, ,
\]
where l(θ; z) is the partial score function at (θ; z);
3.4. FURTHER ASPECTS 71
(ii)
\[
E_{\theta z}\{\tilde\Phi^2(\theta)\}
= E_{\theta z}\!\left[\{\tilde\Phi(\theta)-\tilde\Psi(\theta)\}^2\right]
+ 2E_{\theta z}\{\tilde\Phi(\theta)\tilde\Psi(\theta)\}
- E_{\theta z}\{\tilde\Psi^2(\theta)\}\, .
\]
Differentiating the expectation above with respect to θ and interchanging the order of differentiation and integration, we obtain
\[
E_{\theta z}\!\left\{\frac{\partial}{\partial\theta}\Psi(\theta)\right\} + E_{\theta z}\{\Psi(\theta)\, l(\theta;z)\} = 0\, ,
\]
which is equivalent to the first part of the lemma. The second part is straightforward. □
The following lemma gives a useful tool for computing optimal inference functions.
Lemma 3 Assume the previous regularity conditions. Consider two functions A : Θ −→ IR\{0} and R : X × Θ × Z −→ IR. Suppose that, for each regular inference function Φ, one has, for each θ ∈ Θ and z ∈ Z,
\[
\int_{\mathcal X} R(x;\theta,z)\,\Phi(x;\theta)\, p(x;\theta,z)\, d\mu(x) = 0\, .
\]
\[
= -\frac{A(\theta)}{E_{\theta z}\{\Psi'(\theta)\}}\, .
\]
\begin{align}
E_{\theta z}\{\tilde\Phi^2(\theta)\}
&= E_{\theta z}\!\left[\{\tilde\Phi(\theta)-\tilde\Psi(\theta)\}^2\right]
+ 2E_{\theta z}\{\tilde\Phi(\theta)\tilde\Psi(\theta)\}
- E_{\theta z}\{\tilde\Psi^2(\theta)\} \tag{3.23}\\
&= E_{\theta z}\!\left[\{\tilde\Phi(\theta)-\tilde\Psi(\theta)\}^2\right]
+ E_{\theta z}\{\tilde\Psi^2(\theta)\} \tag{3.24}\\
&\ge E_{\theta z}\{\tilde\Psi^2(\theta)\}\, .
\end{align}
We conclude that Ψ is optimal. For the second part of the theorem, note that one has equality in (3.23), and hence in (3.24), if and only if ∀θ ∈ Θ, ∀z ∈ Z, E_θz[{Φ̃(θ) − Ψ̃(θ)}²] = 0. That is, if a regular inference function Φ is optimal then Φ̃( · ; θ) = Ψ̃( · ; θ) P_θz-a.s., ∀θ ∈ Θ, ∀z ∈ Z. □
\[
l(x;\theta,z) = \frac{\partial}{\partial\theta}\log p(x;\theta,z)
= \frac{\partial}{\partial\theta}\log f_t(x;\theta) + \frac{\partial}{\partial\theta}\log h\{t(x);\theta,z\}\, . \tag{3.25}
\]
We apply Lemma 3 to prove that Ψ is a ("unique") optimal inference function. More precisely, defining A(θ) = 1 and R(x; θ, z) = −∂ log h{t(x); θ, z}/∂θ, and using (3.25) we can write Ψ in the form
\[
\Psi(x;\theta) = \frac{\partial}{\partial\theta}\log f_t(x;\theta) = A(\theta)\, l(x;\theta,z) + R(x;\theta,z)\, .
\]
According to Lemma 3, if R is orthogonal to every regular inference function, then Ψ is optimal; moreover Ψ is the unique optimal inference function, apart from equivalent inference functions.
Take an arbitrary regular inference function φ. We show that φ and R are orthogonal. Note that for each z ∈ Z,
\[
0 = \int \phi(x;\theta)\, p(x;\theta,z)\, d\mu(x) = \int \phi(x;\theta)\, f_t(x;\theta)\, h\{t(x);\theta,z\}\, d\mu(x)\, .
\]
On the other hand E_θz(φ|T) = ∫ φ(x; θ) f_t(x; θ) dμ(x), which is independent of z. We write E_θ(φ|T) for E_θz(φ|T), and we have E_θz{E_θ(φ|T)} = 0. Since T is complete, E_θ(φ|T) = 0, P_θz-almost surely. We have then,
□
which is equivalent to
\[
\sum_{i=1}^n \Phi(x_i;\hat\theta_n) = 0\, .
\]
In other words, for each generalized estimating function there is an estimating function that yields the same estimating sequence. In fact generalized estimating functions are just a tool that will simplify some formalizations. Examples of generalized estimating functions are the efficient estimating functions of most of the models studied in the subsequent chapters. The following result will be useful later on.
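In practice an estimating sequence of this kind is obtained by solving the displayed equation numerically. A toy sketch under stated assumptions: the estimating function tanh(x − θ) is our own illustrative choice (bounded and unbiased at a symmetric location model, not one from the text), and since the sum is decreasing in θ, bisection finds the root:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.5, 1.0, size=1000)

# a toy unbiased estimating function for a location parameter (bounded in x)
psi = lambda theta: np.sum(np.tanh(x - theta))

# Psi is strictly decreasing in theta, so bisection brackets the root
lo, hi = x.min(), x.max()
for _ in range(80):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if psi(mid) > 0 else (lo, mid)
theta_hat = 0.5 * (lo + hi)

assert abs(psi(theta_hat)) < 1e-6     # theta_hat solves the estimating equation
assert abs(theta_hat - 1.5) < 0.2     # and is close to the true location
```

Any root-finding scheme would do; bisection is used only because the monotonicity of the summed estimating function makes it provably convergent here.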
Proposition 17 If for each (θ, z) ∈ Θ × Z, F_IA(θ) = T_N⊥(θ, z), and the efficient score function is a generalized estimating function, then the estimating function equivalent to the efficient score function attains the semiparametric Cramér-Rao bound.
Proof: Take an arbitrary (θ, z) ∈ Θ × Z. Since F_IA(θ) = T_N⊥(θ, z), the efficient score function l^E coincides with the information score function l^I at (θ, z). Hence the extended sensitivity of the efficient score function (at (θ, z)) is the q × q identity matrix and the Godambe information of l^E at (θ, z) is
\[
J_{l^E}(\theta,z) = \mathrm{Cov}^{-1}_{\theta z}(l^E)\, ,
\]
Chapter 4

Semiparametric Location and Scale Models
4.1 Introduction
This chapter studies some semiparametric extensions of the location-scale model. The
classic location-scale model is constructed by taking a (fixed) distribution and applying
to it a shift (location) and rescaling (scale) transformation. Now consider the situation
where instead of having a fixed distribution one deals with a given class of distributions. A
shift-rescaling transformation is applied to an unknown element of this class. Our interest
is in estimating the shift and the rescaling, but now in the presence of the indetermination given by the unknown particular element of the class of distributions that generated the data, i.e. the location (shift) and the scale (rescaling) are the parameters of interest and the unknown distribution is the nuisance parameter.
We will consider location-scale models defined for distributions contained in exponen-
tial families and with the support equal to the whole real line. Here we do not know in
which exponential family the supposed data distribution is contained. Additionally, we
consider some models for which some standardized cumulants are fixed and known. This
will allow us to obtain a range of semiparametric models of various sizes.
The main purpose here is to study the behavior of the efficient score function (i.e. the canonical gradient of the interest parameter functional) for estimating the location and the scale. This will allow us to gain intuition in a simple example before treating more complicated cases. We will not be concerned at this stage with generality, and stringent restrictions on the family of distributions will be used freely in order to keep the mathematical argumentation transparent. The location and scale model will be studied as an example in chapter 5 under less restrictive conditions.
Section 4.2 presents and discusses the semiparametric location-scale model with which
we work. The nuisance tangent spaces are calculated in section 4.3 and some details
supplied in the appendices. Section 4.4 studies the efficient score function and in the
subsections 4.4.1 and 4.4.2 we specialize to the case where the first two and the first three
standardized cumulants are fixed, respectively. Some discussion is provided in section
4.5. There is an appendix with a brief summary of the theory of the Laplace transform, adapted to the context we need, in which we prove a sufficient condition, in terms of the Laplace transform, for the class of polynomials to be dense in the L² space associated with a given probability measure.
4.2. SEMIPARAMETRIC LOCATION-SCALE MODELS 79
\[
\int_{\mathbb R} x\, a(x)\,\lambda(dx) = 0\, ; \tag{4.4}
\]
\[
\int_{\mathbb R} x^2 a(x)\,\lambda(dx) = 1 \tag{4.5}
\]
and
\[
a \text{ is differentiable } \lambda\text{-almost everywhere.} \tag{4.6}
\]
We consider also the following technical conditions involving the Laplace transform and the behavior of the function a in the tails. Assume that there exists a δ > 0 (we stress that δ may depend on a) such that for all s ∈ (−δ, δ)
\[
M[s, a(\,\cdot\,)] = \int_{\mathbb R} e^{sx} a(x)\,\lambda(dx) < \infty\, ; \tag{4.7}
\]
and
\[
l_{/\sigma}(\,\cdot\,) = -\frac{1}{\sigma}\left\{ 1 + \frac{\,\cdot\, - \mu}{\sigma}\,
\frac{a'\!\left(\frac{\,\cdot\, - \mu}{\sigma}\right)}{a\!\left(\frac{\,\cdot\, - \mu}{\sigma}\right)} \right\}\, . \tag{4.12}
\]
Note that condition (4.8) together with proposition 19 in appendix 4.6.1 ensures that the functions {a′( · )}²/a( · ) and ( · )²{a′( · )}²/a( · ) are Lebesgue integrable.
It can be seen that condition (4.9) implies that
\[
\int_{\mathbb R} l_{/\mu}(x)\, a(x)\,\lambda(dx) = \int_{\mathbb R} l_{/\sigma}(x)\, a(x)\,\lambda(dx) = 0\, ,
\]
i.e. the location and the scale partial scores are unbiased. We will need the condition (4.9) with polynomials of arbitrary order in the calculation of the nuisance tangent space and the projection of the score function onto the orthogonal complement of the nuisance tangent space.
Condition (4.7) implies that the distribution associated with a possesses finite moments of all orders and that the polynomials are dense in L²(a) (see proposition 19 and theorem 11 in appendix 4.6.1). These properties will be crucial in the calculation of the nuisance tangent space and in the projection of the score function onto the orthogonal complement of the nuisance tangent space.
The following proposition gives a useful sufficient condition for verifying whether a
given probability density satisfies the technical conditions (4.7)-(4.9).
Proposition 18 Let a : IR −→ IR be a function for which (4.2), (4.3) and (4.6) hold. Assume moreover that there exists δ > 0 such that for all s ∈ (−δ, δ),
\[
\lim_{x\to+\infty} e^{sx} a(x) = \lim_{x\to-\infty} e^{sx} a(x) = 0
\]
and
\[
\lim_{x\to+\infty} e^{sx}\,\frac{a'(x)}{\sqrt{a(x)}} = \lim_{x\to-\infty} e^{sx}\,\frac{a'(x)}{\sqrt{a(x)}} = 0\, .
\]
Then the technical conditions (4.7)-(4.9) hold.
Proof: Conditions (4.7) and (4.8) follow from proposition 20 in appendix 4.6.1, part i), and (4.9) from part ii). □
Using the proposition above it is easy to see that the following classic families of distributions have probability densities satisfying the technical conditions (4.7)-(4.9): the normal distributions, the hyperbolic distributions, the Gumbel distributions and the double exponential (Laplace) distributions.
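The two limit conditions of Proposition 18 are easy to check numerically for a given density. A small sketch for the Laplace density a(x) = e^{−|x|}/2; the choice δ = 1/2 is ours (any smaller δ also works for this density):

```python
import numpy as np

# double exponential (Laplace) density a(x) = exp(-|x|)/2
a  = lambda x: 0.5 * np.exp(-np.abs(x))
da = lambda x: -np.sign(x) * 0.5 * np.exp(-np.abs(x))   # a'(x) for x != 0

s = 0.25   # any s with |s| < delta = 1/2 (a hedged choice for this density)
for x in (100.0, -100.0, 400.0, -400.0):
    assert np.exp(s * x) * a(x) < 1e-8                        # e^{sx} a(x) -> 0
    assert abs(np.exp(s * x) * da(x) / np.sqrt(a(x))) < 1e-8  # e^{sx} a'(x)/sqrt(a(x)) -> 0
```

For the second limit the ratio a′(x)/√a(x) decays like e^{−|x|/2}, which is why δ must be taken below 1/2 rather than below 1 here.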
Theorem 10 The L²-nuisance tangent space of the location-scale model given by (4.1) at (μ, σ, a) ∈ IR × IR₊ × A is
\[
T_N^2(\mu,\sigma,a) = \mathrm{cl}_{L^2(P_{\mu\sigma a})}\!\left[\mathrm{span}\{e_i^*(\,\cdot\,) : i = k+1, k+2, \dots\}\right]\, .
\]
Proof: We give now the main steps of the proof (the details can be found in appendix 4.6.2).
First of all, it can be shown that under a semiparametric location-scale model the L²-nuisance tangent space at (μ, σ, a) ∈ IR × IR₊ × A is given by
\[
T_N^2(\mu,\sigma,a) = \left\{ \nu\!\left(\frac{\,\cdot\, - \mu}{\sigma}\right) : \nu \in T_N^2(0,1,a) \right\}
\]
(see theorem 12 in appendix 4.6.3 for a detailed proof). Therefore there is no loss of generality in restricting our attention to the case where (μ, σ, a) = (0, 1, a) and showing that
\[
T_N^2(0,1,a) = \mathrm{cl}_{L^2(P_{01a})}\!\left\{\mathrm{span}\!\left[\{e_i(\,\cdot\,) : i = k+1, k+2, \dots\}\right]\right\}\, .
\]
For notational simplicity, in this proof we denote T_N²(0, 1, a) by T_N (and use the analogous convention for the tangent sets). Define, for each i ∈ N,
\[
H_i = \mathrm{span}\{e_1(\,\cdot\,), \dots, e_i(\,\cdot\,)\}\, .
\]
4.3. CALCULATION OF THE NUISANCE TANGENT SPACE 83
Next we sketch the proof that H_k⊥ ⊆ T_N. The verification of this inclusion can be reduced to proving that for each i ∈ {k + 1, k + 2, ...}
\[
h_i(\,\cdot\,) = \begin{cases} e_i(\,\cdot\,), & \text{if } i \text{ is even,} \\ e_{i+1}(\,\cdot\,) - e_i(\,\cdot\,), & \text{if } i \text{ is odd,} \end{cases}
\]
is in T_N(0, 1, a). The proof is done by showing that for t small enough
\[
a_t(\,\cdot\,) = a(\,\cdot\,) + t\, a(\,\cdot\,)\, h_i(\,\cdot\,)
\]
belongs to A. The conditions (4.2)-(4.10) for a_t are verified in appendix 4.6.2. There, the crucial point is the verification of (4.7) and (4.8), which is done with the Laplace transform properties given in appendix 4.6.1. □
We remark that in fact the weak and the L1 tangent spaces coincide with the L2
tangent space for the location and scale models. This will be proved in a more general
context in chapter 5. In the rest of the chapter we suppress the symbol “2” indicating
that we work with the L2 tangent space.
and hence the location component of the efficient score function at (μ, σ, a) is given by
\[
l_{/\mu}^E(\,\cdot\,) = c_1 e_1^*(\,\cdot\,) + \dots + c_k e_k^*(\,\cdot\,) \tag{4.15}
\]
and analogously the scale component of the efficient score function is
\[
l_{/\sigma}^E(\,\cdot\,) = d_1 e_1^*(\,\cdot\,) + \dots + d_k e_k^*(\,\cdot\,)\, . \tag{4.16}
\]
Here the Fourier coefficients in (4.15) and (4.16) are given, for all i ∈ N, by
\begin{align}
c_i &= \langle l_{/\mu}(\,\cdot\,), e_i^*(\,\cdot\,)\rangle \tag{4.17}\\
&= \int_{\mathbb R}\left\{-\frac{1}{\sigma}\,\frac{a'\!\left(\frac{x-\mu}{\sigma}\right)}{a\!\left(\frac{x-\mu}{\sigma}\right)}\right\}
e_i\!\left(\frac{x-\mu}{\sigma}\right)\frac{1}{\sigma}\, a\!\left(\frac{x-\mu}{\sigma}\right)\lambda(dx)\\
&= -\frac{1}{\sigma}\left\{\Big[e_i(y)\,a(y)\Big]_{-\infty}^{+\infty} - \int_{\mathbb R} e_i'(y)\,a(y)\,\lambda(dy)\right\}\\
&= \frac{1}{\sigma}\int_{\mathbb R} e_i'(y)\,a(y)\,\lambda(dy)\, .
\end{align}
4.4. CALCULATION OF THE EFFICIENT SCORE FUNCTION 85
Here we used the condition (4.9). A similar calculation leads to the following formula for the coefficients of (4.16):
\[
d_i = \langle l_{/\sigma}(\,\cdot\,), e_i^*(\,\cdot\,)\rangle = \frac{1}{\sigma}\int_{\mathbb R} y\, e_i'(y)\, a(y)\,\lambda(dy), \quad \text{for } i = 1, \dots, k\, . \tag{4.18}
\]
We discuss now the dependence of the efficient score function on the nuisance parameter. First of all, since for each i ∈ N, e_i( · ) is a polynomial of degree i, the coefficient c_i is a linear combination of the standardized cumulants up to order k − 1 of the distribution with density a (see (4.17)). Moreover, the coefficients d_i are linear combinations of the standardized cumulants of the distribution in play up to order k. We conclude that the coefficients of l^E_{/μ}( · ) and l^E_{/σ}( · ) (given in (4.15) and (4.16)) depend on the nuisance parameter a only through the standardized cumulants up to order k. However, the dependence of the efficient score function on the nuisance parameter is more complex, because the polynomials e_0, e_1, ..., e_k generated by the orthonormalization procedure (in L²(a)) depend on higher order standardized cumulants. In fact the polynomial e_k( · ) depends on the moments of order up to 2k, because in order to normalize the polynomial of degree k in the Gram-Schmidt procedure we have to divide the polynomial by its L²(a) norm, which clearly depends on the standardized cumulant of order 2k.
Summing up, the dependence of the efficient score function on the infinite dimensional parameter for the location-scale model under study occurs via a finite dimensional intermediate parameter involving only the standardized cumulants of order up to 2k.
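The Gram-Schmidt construction of the orthonormal polynomials e_i in L²(a) can be sketched numerically. For the standard normal density (chosen for illustration because its moments are known) Gauss-Hermite quadrature makes the inner products exact, and the output matches (4.19) below with m_3 = 0, m_4 = 3, ∆_2 = √2:

```python
import numpy as np

# Gauss-Hermite quadrature adapted to the standard normal weight a(y) = phi(y)
t, w = np.polynomial.hermite.hermgauss(40)
y, w = t * np.sqrt(2.0), w / np.sqrt(np.pi)      # sum(w*f(y)) ~ E[f(Y)], Y ~ N(0,1)

ip = lambda f, g: np.sum(w * f(y) * g(y))        # inner product in L^2(a)

# Gram-Schmidt on 1, y, y^2, y^3 -> orthonormal polynomials e_0, ..., e_3
monos = [np.poly1d([1.0]), np.poly1d([1.0, 0.0]),
         np.poly1d([1.0, 0.0, 0.0]), np.poly1d([1.0, 0.0, 0.0, 0.0])]
es = []
for m in monos:
    u = m
    for e in es:
        u = u - ip(m, e) * e                     # remove components along earlier e_i
    es.append(u / np.sqrt(ip(u, u)))             # normalize in L^2(a)

# normal case: Delta_2 = sqrt(m4 - m3^2 - 1) = sqrt(2), so e_2(y) = (y^2 - 1)/sqrt(2)
assert np.allclose(es[1].coeffs, [1.0, 0.0], atol=1e-10)                  # e_1(y) = y
assert np.allclose(es[2].coeffs, [1/np.sqrt(2), 0.0, -1/np.sqrt(2)], atol=1e-10)
```

Replacing the quadrature rule and weight by ones adapted to another density a yields the e_i (and hence the coefficients c_i, d_i via (4.17)-(4.18)) for that density.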
4.4.1 The case where the first two standardized cumulants are fixed

We study now in detail the case where only the standardized cumulants up to order 2 are fixed (i.e. k = 2). It will be shown that in this case the efficient score function is equivalent to an estimating function which is independent of the nuisance parameter and has the sample mean and standard deviation as roots.
The first three elements of the basis {e_i}_{i∈N₀} are
\[
e_0(\,\cdot\,) = 1\, , \qquad e_1(\,\cdot\,) = (\,\cdot\,)\, , \tag{4.19}
\]
\[
e_2(\,\cdot\,) = \frac{1}{\Delta_2}\left\{(\,\cdot\,)^2 - m_3(\,\cdot\,) - 1\right\}\, , \quad \text{where } \Delta_2 = \sqrt{m_4 - m_3^2 - 1}\, .
\]
The detailed calculations are given in appendix 4.6.4, where an argument showing that ∆_2 > 0, and hence that (4.19) is well defined, can also be found. Using the formulas (4.17) and (4.18) to calculate the coefficients c_0, c_1, c_2, d_0, d_1, d_2 of the efficient score function
gives
\[
c_1 = \frac{1}{\sigma}\, , \quad c_2 = -\frac{m_3}{\sigma\Delta_2}\, , \qquad d_1 = 0\, , \quad d_2 = \frac{2}{\sigma\Delta_2}\, .
\]
Inserting the coefficients given above into (4.15) and (4.16) we obtain that the efficient score function at (μ, σ, a) is given by
\begin{align}
l_{/\mu}^E(\,\cdot\,) &= c_1 e_1^*(\,\cdot\,) + c_2 e_2^*(\,\cdot\,) \tag{4.20}\\
&= \frac{1}{\sigma}\left[\frac{\,\cdot\, - \mu}{\sigma} - \frac{m_3}{\Delta_2^2}\left\{\left(\frac{\,\cdot\, - \mu}{\sigma}\right)^2 - m_3\,\frac{\,\cdot\, - \mu}{\sigma} - 1\right\}\right]
\end{align}
and
\[
l_{/\sigma}^E(\,\cdot\,) = d_2 e_2^*(\,\cdot\,) = \frac{2}{\sigma\Delta_2^2}\left\{\left(\frac{\,\cdot\, - \mu}{\sigma}\right)^2 - m_3\,\frac{\,\cdot\, - \mu}{\sigma} - 1\right\}\, . \tag{4.21}
\]
Under independent repeated sampling, with sample x = (x_1, ..., x_n)^T, we obtain the following expression for the efficient score function,
\[
l^E(x;\mu,\sigma,a) = \sigma^{-1}
\begin{pmatrix}
\displaystyle\sum_{i=1}^n \frac{x_i-\mu}{\sigma} - \frac{m_3}{\Delta_2^2}\sum_{i=1}^n\left\{\left(\frac{x_i-\mu}{\sigma}\right)^2 - m_3\,\frac{x_i-\mu}{\sigma} - 1\right\}\\[2mm]
\displaystyle\frac{2}{\Delta_2^2}\sum_{i=1}^n\left\{\left(\frac{x_i-\mu}{\sigma}\right)^2 - m_3\,\frac{x_i-\mu}{\sigma} - 1\right\}
\end{pmatrix}\, .
\]
Multiplying by the matrix
\[
M = \sigma\begin{bmatrix} 1 & \frac{m_3}{2}\\[1mm] m_3 & \frac{m_3^2+\Delta_2^2}{2}\end{bmatrix}
\]
we obtain the following estimating function which is equivalent to the efficient score function
\[
M\cdot l^E(x;\mu,\sigma,a) =
\begin{pmatrix}
\displaystyle\sum_{i=1}^n \frac{x_i-\hat\mu}{\hat\sigma}\\[1mm]
\displaystyle\sum_{i=1}^n \left(\frac{x_i-\hat\mu}{\hat\sigma}\right)^2 - n
\end{pmatrix}\, .
\]
Note that the matrix M is indeed nonsingular (its determinant is σ²∆₂²/2 ≠ 0; see the justification in appendix 4.6.4). Hence the efficient score function is equivalent to an estimating function independent of the nuisance parameter with roots
\[
\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{and} \quad \hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2}\, .
\]
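It is immediate to verify numerically that the sample mean and the divisor-n sample standard deviation solve the two displayed estimating equations. A short sketch, with simulated Gumbel data as an arbitrary choice of a within the model:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gumbel(size=500)     # any underlying distribution in the model class

mu_hat = x.mean()
sd_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # note: divisor n, not n - 1

g1 = np.sum((x - mu_hat) / sd_hat)                       # first estimating equation
g2 = np.sum(((x - mu_hat) / sd_hat) ** 2) - len(x)       # second estimating equation
assert abs(g1) < 1e-8 and abs(g2) < 1e-8
```

Both equations vanish identically at these roots (up to floating-point error), whatever the data, which is the sense in which the estimating function is independent of the nuisance parameter.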
4.4.2 The case where the first three standardized cumulants are fixed

We show in this section that in the case where the standardized cumulants up to order 3 are fixed (i.e. k = 3) the roots of the efficient score function do depend on the nuisance parameter, through the cumulants up to order 6. Moreover, we exemplify some situations where the roots of the efficient score function are not the sample mean and the sample variance.
We have computed in the last section the first three coefficients of the efficient score function, namely c_0, c_1, c_2, d_0, d_1 and d_2. We calculate now the coefficients c_3 and d_3, which will allow us to compute the efficient score function by using (4.15) and (4.16). Note that the polynomial e_3 is given by (see appendix 4.6.4)
\[
e_3(\,\cdot\,) = \frac{1}{\Delta_3}\left\{(\,\cdot\,)^3 - \frac{\gamma}{\Delta_2^2}(\,\cdot\,)^2 - \left(m_4 - \frac{m_3\gamma}{\Delta_2^2}\right)(\,\cdot\,) - \left(m_3 - \frac{\gamma}{\Delta_2^2}\right)\right\}
\]
where
\[
\gamma = m_5 - m_3 m_4 - m_3
\]
and
\[
\Delta_3 = \left\|(\,\cdot\,)^3 - \frac{\gamma}{\Delta_2^2}(\,\cdot\,)^2 - \left(m_4 - \frac{m_3\gamma}{\Delta_2^2}\right)(\,\cdot\,) - \left(m_3 - \frac{\gamma}{\Delta_2^2}\right)\right\|_{L^2(a)}\, .
\]
Note also that according to the argument given in appendix 4.6.4, ∆_3 > 0 and hence e_3( · ) is well defined.
Using (4.17) and (4.18) we obtain
\[
c_3 = \frac{1}{\sigma}\int_{\mathbb R} e_3'(x)\,a(x)\,\lambda(dx) = \frac{1}{\sigma\Delta_3}\left\{3 - \left(m_4 - \frac{m_3\gamma}{\Delta_2^2}\right)\right\}
\]
and
\[
d_3 = \frac{1}{\sigma}\int_{\mathbb R} x\, e_3'(x)\,a(x)\,\lambda(dx) = \frac{1}{\sigma\Delta_3}\left(3m_3 - \frac{2\gamma}{\Delta_2^2}\right)\, . \tag{4.22}
\]
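The expressions for e_3, c_3 and d_3 can be checked numerically: the defining property of e_3 is orthogonality to 1, y and y² in L²(a). For illustration we use the standardized exponential density a(y) = e^{−(y+1)} on [−1, ∞); it violates the chapter's full-support assumption, but its moments (m_3 = 2, m_4 = 9, m_5 = 44) make the quadrature exact:

```python
import numpy as np

# standardized exponential: density a(y) = exp(-(y+1)) on [-1, inf), mean 0, variance 1
t, w = np.polynomial.laguerre.laggauss(60)   # integrates g(t) e^{-t} on [0, inf)
y = t - 1.0
E = lambda g: np.sum(w * g(y))

m3, m4, m5 = E(lambda u: u**3), E(lambda u: u**4), E(lambda u: u**5)
gamma = m5 - m3 * m4 - m3            # = 24 here
D2sq  = m4 - m3**2 - 1.0             # Delta_2^2 = 4

u3 = lambda u: (u**3 - (gamma / D2sq) * u**2
                - (m4 - m3 * gamma / D2sq) * u - (m3 - gamma / D2sq))
# u3 must be orthogonal to 1, y, y^2 in L^2(a) if the formula for e_3 is right
for g in (lambda u: u3(u), lambda u: u * u3(u), lambda u: u**2 * u3(u)):
    assert abs(E(g)) < 1e-8

D3 = np.sqrt(E(lambda u: u3(u) ** 2))        # Delta_3 = 6 here
c3 = (3.0 - (m4 - m3 * gamma / D2sq)) / D3   # closed form, sigma = 1
d3 = (3.0 * m3 - 2.0 * gamma / D2sq) / D3
assert np.isclose(c3, 1.0) and np.isclose(d3, -1.0)
```

The closed-form values c_3 = 1 and d_3 = −1 for this density follow from the exact moment algebra (∆_3 = 6), so the check is a useful sanity test of the displayed formulas.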
Example 7 We study now the case where m_3 = m_5 = 0 and m_4 ≠ 3. This is the case, for instance, for the Laplace distribution (double exponential) or the hyperbolic distribution with vanishing symmetry parameter β, which are symmetric (hence m_3 = m_5 = 0) and have the standardized cumulant of fourth order different from 3.
In this case the coefficients of the efficient score function are given by
\[
c_3 = \frac{1}{\sigma\Delta_3}(3 - m_4) \neq 0\, , \qquad d_3 = \frac{1}{\sigma\Delta_3}\left(3m_3 - \frac{2\gamma}{\Delta_2^2}\right) = 0\, ,
\]
\[
c_2 = -\frac{m_3}{\sigma\Delta_2} = 0\, , \qquad d_2 = \frac{2}{\sigma\Delta_2}\, ,
\]
\[
c_1 = \frac{1}{\sigma}\, , \qquad d_1 = 0\, .
\]
Equating (4.23) to zero, inserting (4.25) and rearranging, we obtain the following equation
\[
-\,n(2A+1)\hat\mu^3 + (4A+3)\Big(\sum_{i=1}^n x_i\Big)\hat\mu^2
- \left[\{(n+1)A+n\}\sum_{i=1}^n x_i^2 + 2(A+1)\Big(\sum_{i=1}^n x_i\Big)^{\!2}\right]\frac{\hat\mu}{n}
+ \left\{A\sum_{i=1}^n x_i^3 + \frac{A+1}{n}\Big(\sum_{i=1}^n x_i^2\Big)\Big(\sum_{i=1}^n x_i\Big)\right\} = 0\, ,
\]
4.5 Discussion

There exist in the literature many studies of extensions of the pure location model. We refer to Stein (1956), van der Vaart (1988, 1991) and Bickel et al. (1993), among others. In all the studies referred to, the unknown distributions are assumed to be symmetric, which simplifies the mathematical treatment considerably. In this chapter we presented a class of distributions that are not necessarily symmetric and we treated at the same time the problem of estimating the scale. It is to be noted that we did not attempt to obtain the largest class (or even a very large class) of distributions, containing asymmetric distributions, for which it is possible to treat the problem of estimating the location (and the scale). Rather, we restricted the discussion to some interesting infinite dimensional classes of distributions. It would be a considerable improvement to eliminate the technical assumptions on the Laplace transform and the polynomial decay (of all orders) of the tails of the densities (i.e. conditions (4.7)-(4.9)).
In the case of the location-scale model where only the first two moments are fixed, the efficient score function does not essentially depend on the nuisance parameter and is in fact a regular estimating function. This implies, according to the discussion in chapter 2 (see also Jørgensen and Labouriau, 1995), that the efficient score function is an optimal estimating function, in the sense that it maximizes the Godambe information. The roots of the efficient score function yield the sample mean and the sample standard deviation as estimators of the location and scale respectively. This is in agreement with the literature for the location model with symmetry and with common sense in statistics. In this way, the theory of estimating functions does not suggest new estimators and its merit apparently is only to justify the classic statistical estimation procedure. However, some care should be taken at this point. We showed that the only possible estimating sequence derived from regular estimating functions, in this example, is given by the sample mean and the sample standard deviation. Hence, that estimating sequence is optimal in a class containing only one element. Moreover, the usual claim that it is possible to obtain B-robust estimating sequences from estimating functions fails in this example.
Note that in the case of the location-scale model where only the first two moments are fixed, the (global) tangent space is the whole space L₀² (see Labouriau, 1996a). That is, the model is not a semiparametric model in the terminology of J. Wellner (see Groeneboom and Wellner, 1992, page 7) but rather is a nonparametric model. This implies that there is essentially only one sequence of regular asymptotic linear estimators. Clearly, this sequence of estimators is optimal, even though optimality in a class containing only one element does not say very much. Hence, a naive application of the direct extension of the estimating function theory to semiparametric models and the classical theory of regular asymptotic linear estimators both fail in this example.
When one fixes some standardized cumulants of higher order, the model becomes a genuine semiparametric model, in the sense that its (global) tangent space becomes a proper subspace of L₀². In this case the estimation problem is harder, even though the family of distributions is now smaller than in the "less restricted case" discussed before. It will be shown in chapter 5 that, in fact, the semiparametric Cramér-Rao bound is not attained in this example. The efficient score function and its roots now depend on the nuisance parameter, although through a finite dimensional intermediate parameter, namely only through a finite number of standardized cumulants (up to order 2k). This suggests that plugging some reasonable estimators of the standardized cumulants (of order up to 2k) into the efficient score function can produce reasonable asymptotic results. In fact, since the standardized cumulants can be estimated consistently from the sample moments, plugging these estimators into the efficient score function should produce efficient estimators. However, since one has to estimate standardized cumulants of high order (2k), one can expect poor performance for finite samples of moderate size. The method of locally efficient estimation and the method of sieves can perhaps offer some attractive alternatives.
4.6 Appendices

4.6.1 The Laplace transform and polynomial approximation in L²

Basic properties of the Laplace Transform

In this section we review the basic properties of the Laplace transform and prove some technical lemmas required in the study of the semiparametric location-scale model defined in section 4.2. The properties presented here are essentially well known for the case where the distribution is concentrated on the positive real line; however, the result concerning the density of the polynomials in L² is original.
Let f : IR −→ [0, ∞) be a function such that for some s ∈ IR the integral
\[
M(s;f) = \int_{\mathbb R} e^{sx} f(x)\,\lambda(dx) \tag{4.26}
\]
converges. The function M( · ; f) : IR −→ [0, ∞] such that for each s ∈ IR, M(s; f) is given by (4.26) is said to be the Laplace transform of f.
We now study some properties of the functions with finite Laplace transform in a neighborhood of zero.
Proposition 19 Let f : IR −→ [0, ∞) be a continuous function such that for some δ > 0 and for all s ∈ (−δ, δ), M(s; f) < ∞. Then the measure with density f possesses finite moments of all orders.
Proof: Since for all s ∈ (−δ, δ), M(s; f) < ∞, and e^{|sx|} ≤ e^{sx} + e^{−sx}, using the series version of the monotone convergence theorem (see Billingsley 1986, page 214, theorem 16.6¹) we have
\begin{align}
\infty &> \int_{\mathbb R} e^{\delta x} f(x)\,\lambda(dx) + \int_{\mathbb R} e^{-\delta x} f(x)\,\lambda(dx)\\
&\ge \int_{\mathbb R} e^{|\delta x|} f(x)\,\lambda(dx)\\
&= \int_{\mathbb R}\left\{\sum_{k=0}^{\infty}\frac{|\delta x|^k}{k!}\right\} f(x)\,\lambda(dx)\\
&= \sum_{k=0}^{\infty}\left\{\int_{\mathbb R}\frac{|\delta x|^k}{k!}\, f(x)\,\lambda(dx)\right\}\, ,
\end{align}
the last step from theorem 16.6 in Billingsley 1986; hence each term of the series, and therefore each moment, is finite.

¹ The theorem referred to states: "If f_n ≥ 0, then ∫ Σ_n f_n dμ = Σ_n ∫ f_n dμ."
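The series comparison in the proof yields an explicit bound on the absolute moments, E|X|^k ≤ k!{M(δ; f) + M(−δ; f)}/δ^k. A quick check against the rate-1 Laplace density, whose absolute moments (k!) and Laplace transform (1/(1 − s²) for |s| < 1) are known in closed form:

```python
from math import factorial

# rate-1 Laplace density f(x) = exp(-|x|)/2: M(s; f) = 1/(1 - s^2), E|X|^k = k!
delta = 0.5
M = lambda s: 1.0 / (1.0 - s * s)

# moment bound implied by the proof's term-by-term comparison
bound = lambda k: factorial(k) * (M(delta) + M(-delta)) / delta ** k
for k in range(1, 9):
    assert factorial(k) <= bound(k)   # exact moment vs. Laplace-transform bound
```

The bound is crude (here it exceeds the exact moment by a factor (8/3)·2^k), but it suffices for the finiteness assertion of Proposition 19.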
The notion of the Laplace transform can be extended to functions with range the whole real line in the following way. Given a function f : IR −→ IR we define the positive and the negative part of f respectively by
\[
f^+(\,\cdot\,) = f(\,\cdot\,)\,\chi_{\{f \ge 0\}}(\,\cdot\,) \quad \text{and} \quad f^-(\,\cdot\,) = -f(\,\cdot\,)\,\chi_{\{f < 0\}}(\,\cdot\,)\, .
\]
Here χ_A( · ) is the indicator function of the set A. We have clearly the decomposition
\[
f(\,\cdot\,) = f^+(\,\cdot\,) - f^-(\,\cdot\,)\, .
\]
The Laplace transform of f is then defined by
\[
M(\,\cdot\,;f) = M(\,\cdot\,;f^+) - M(\,\cdot\,;f^-)\, , \tag{4.28}
\]
provided that at least one of the terms on the right side of (4.28) is finite (otherwise the Laplace transform of f is not defined). The following proposition will be useful for the calculation of the L²-nuisance tangent space of the location-scale model considered in section 4.2.
Proposition 20 Let f : IR −→ IR and δ > 0 be such that for all s ∈ [−δ, δ], M(s; f) ∈ IR.
Then, for all n ∈ N and all s ∈ (−δ/2, δ/2) we have

    M[s; ( · )^n f( · )] ∈ IR .
Proof: Assume without loss of generality that the function f is nonnegative. Take an
arbitrary s ∈ (−δ/2, δ/2) and n ∈ N. By hypothesis, f has finite Laplace transform in
a neighborhood of zero; then, from proposition 19, f has finite moments of all orders, in
particular

    ∫_IR x^{2n} f(x) λ(dx) ∈ IR .

Hence, by the Cauchy-Schwarz inequality (note that 2s ∈ (−δ, δ)),

    M[s; ( · )^n f( · )] = ∫_IR { e^{sx} √f(x) } { x^n √f(x) } λ(dx)
    ≤ { ∫_IR e^{2sx} f(x) λ(dx) }^{1/2} { ∫_IR x^{2n} f(x) λ(dx) }^{1/2} < ∞ . □
In that case we can define the sequence of polynomials {e_i( · )}_{i∈N0} ⊆ L2(a) as the result
of a Gram-Schmidt orthonormalization process applied to the sequence {1, ( · ), ( · )^2, ...}.
The following theorem gives a sufficient condition for {e_i( · )} to be a complete sequence
in L2(a), which implies that the polynomials are dense in L2(a). The condition is:

    ∃δ > 0 such that ∀s ∈ [−δ, δ], M(s; a) = ∫_IR e^{sx} a(x) λ(dx) < ∞ .    (4.30)
Proof: First of all we observe that condition (4.30) implies that the measure determined
by a possesses finite moments of all orders (see proposition 19).
Take f ∈ L2(a) orthogonal to every polynomial. We prove that f( · ) = 0 a-a.e., which
implies the theorem (see Luenberger, 1969, lemma 1, page 61).
Define for each k ∈ N0, t ∈ [−δ/2, δ/2] and x ∈ IR,

    f_k(x) = ((xt)^k / k!) f(x) a(x) .
We will use a series version of the dominated convergence theorem applied to {f_k}. In
the following we find a Lebesgue integrable function dominating uniformly (i.e. for all
k) the partial sums of the f_k, which will enable us to use the referred theorem. We have
for each n ∈ N, k ∈ N0, t ∈ [−δ/2, δ/2] and x ∈ IR,

    | Σ_{k=0}^n f_k(x) | ≤ Σ_{k=0}^n |f_k(x)| = Σ_{k=0}^n (|xt|^k / k!) |f(x)| a(x)    (4.32)
    ≤ Σ_{k=0}^∞ (|xt|^k / k!) |f(x)| a(x) = e^{|xt|} |f(x)| a(x)
    ≤ (e^{xt} + e^{−xt}) |f(x)| a(x) .
Define

    g( · ) = (e^{( · )t} + e^{−( · )t}) |f( · )| a( · ) = { |f( · )| √a( · ) } { √a( · ) (e^{( · )t} + e^{−( · )t}) } .    (4.33)

Since f ∈ L2(a), ‖ |f( · )| √a( · ) ‖²_{L²(λ)} = ∫_IR f²(x) a(x) λ(dx) < ∞, so the first term
on the right side of (4.33) is in L2(λ). On the other hand,

    ‖ √a( · ) e^{( · )t} ‖²_{L²(λ)} = ∫_IR e^{2tx} a(x) λ(dx) = M(2t; a) < ∞

and

    ‖ √a( · ) e^{−( · )t} ‖²_{L²(λ)} = ∫_IR e^{−2tx} a(x) λ(dx) = M(−2t; a) < ∞ .

Then the second term on the right side of (4.33) is in L2(λ). Using the Cauchy-Schwarz
inequality (see Luenberger, 1969, lemma 1, page 47) we obtain

    ∫_IR g(x) λ(dx) = < |f( · )| √a( · ) , √a( · ) (e^{( · )t} + e^{−( · )t}) >_λ
    ≤ ‖ |f( · )| √a( · ) ‖_{L²(λ)} ‖ √a( · ) (e^{( · )t} + e^{−( · )t}) ‖_{L²(λ)} < ∞ .
Since (4.32) holds for each n ∈ N, x ∈ IR, t ∈ [−δ/2, δ/2] and g is Lebesgue integrable,
we can use the series version of the dominated convergence theorem (see Billingsley, 1986,
theorem 16.7, page 214; the theorem states: "If Σ_n f_n converges almost everywhere and
|Σ_{k=1}^n f_k| ≤ g almost everywhere, where g is integrable, then Σ_n f_n and the f_n
are integrable, and ∫ Σ_n f_n dµ = Σ_n ∫ f_n dµ") to obtain

    ∫_IR e^{xt} f(x) a(x) λ(dx) = ∫_IR { Σ_{k=0}^∞ (xt)^k / k! } f(x) a(x) λ(dx)
    (from the series dominated convergence theorem)
    = Σ_{k=0}^∞ { ∫_IR ((xt)^k / k!) f(x) a(x) λ(dx) } = 0 ,

where the last equality holds because f is orthogonal in L2(a) to every polynomial. Hence
the Laplace transform of f( · )a( · ) vanishes in a neighbourhood of zero, which implies
that f( · )a( · ) = 0 λ-a.e., and since a is strictly positive, f( · ) = 0 a-a.e. □
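The Gram-Schmidt construction of {e_i( · )} can be sketched numerically. In the snippet below the weight a is the standard normal density (an illustrative choice satisfying (4.30)), for which the procedure reproduces the normalized Hermite polynomials; the inner product is approximated by a midpoint sum with ad hoc bounds.

```python
import math

# Inner product <g, h> = ∫ g(x) h(x) a(x) λ(dx) with a the standard normal
# density, approximated by a midpoint Riemann sum (illustration only).
def inner(g, h, lo=-10.0, hi=10.0, n=16_000):
    step = (hi - lo) / n
    tot = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        tot += g(x) * h(x) * math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return tot * step

# Gram-Schmidt applied to the monomials 1, x, x^2, ... in L2(a).
def orthonormal_polys(k):
    basis = []
    for j in range(k + 1):
        p = lambda x, j=j: x ** j            # start from the monomial x^j
        for e in basis:                      # subtract projections on e_0, ..., e_{j-1}
            c = inner(p, e)
            p = lambda x, p=p, e=e, c=c: p(x) - c * e(x)
        nrm = math.sqrt(inner(p, p))         # normalize
        basis.append(lambda x, p=p, nrm=nrm: p(x) / nrm)
    return basis

e = orthonormal_polys(3)
assert abs(inner(e[1], e[2])) < 1e-8          # orthogonality
assert abs(inner(e[2], e[2]) - 1.0) < 1e-8    # normalization
# e_2 agrees up to sign with the normalized Hermite polynomial (x^2 - 1)/sqrt(2).
assert abs(abs(e[2](1.5)) - abs((1.5 ** 2 - 1) / math.sqrt(2))) < 1e-3
```

Completeness of this sequence in L2(a), guaranteed by condition (4.30), is what the theorem above establishes; for weights with heavier tails the orthonormal polynomials need not be complete.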
4.6. APPENDICES 97
Proposition 21 Let f : IR −→ [0, ∞) be a continuous function such that for some δ > 0
and for all s ∈ [−δ, δ],

    lim_{x→±∞} e^{sx} f(x) = 0 .    (4.35)

Then we have:
i) for all s ∈ (−δ, δ), M(s; f) < ∞;
ii) for all k ∈ N0, lim_{x→±∞} x^k f(x) = 0.
Proof:
i) Take s ∈ (−δ, δ). Condition (4.35) implies that there exists L ∈ IR+ such that for all
x ∈ IR \ [−L, L], e^{δx} f(x) < 1 and e^{−δx} f(x) < 1. We have then

    M(s; f) = ∫_IR e^{sx} f(x) λ(dx)
    = ∫_{[−L,L]} e^{sx} f(x) λ(dx) + ∫_{[L,∞)} e^{sx} f(x) λ(dx) + ∫_{(−∞,−L]} e^{sx} f(x) λ(dx)
    = ∫_{[−L,L]} e^{sx} f(x) λ(dx) + ∫_{[L,∞)} e^{(s−δ)x} e^{δx} f(x) λ(dx) + ∫_{(−∞,−L]} e^{(s+δ)x} e^{−δx} f(x) λ(dx)
    ≤ ∫_{[−L,L]} e^{sx} f(x) λ(dx) + ∫_{[L,∞)} e^{(s−δ)x} λ(dx) + ∫_{(−∞,−L]} e^{(s+δ)x} λ(dx) < ∞ .

ii) Given k ∈ N0,

    lim_{x→+∞} x^k f(x) = lim_{x→+∞} {e^{−δx} x^k}{e^{δx} f(x)} = 0

and

    lim_{x→−∞} x^k f(x) = lim_{x→−∞} {e^{δx} x^k}{e^{−δx} f(x)} = 0 . □
98 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS
    H_i = span{e_0( · ), ..., e_i( · )} .

We will prove that for i ∈ {k + 1, k + 2, ...}, h_i( · ) ∈ TN0 ⊆ TN ; this implies the lemma.
Indeed, any linear combination of {h_{k+1}( · ), h_{k+2}( · ), ...} is then in TN ; in particular,
if i is even, then e_i( · ) ∈ TN , and if i is odd, then e_i( · ) = h_i( · ) − h_{i−1}( · ) ∈ TN .
Hence for all i ∈ {k + 1, k + 2, ...}, e_i( · ) is in TN . Since TN is a closed linear subspace of
L2(a),

    cl_{L2(a)} [ span{ e_i( · ) : i ∈ {k + 1, k + 2, ...} } ] ⊆ TN .

On the other hand, condition (4.7) and theorem 11 imply that the polynomials are dense
in L2(a) and, since H_k = span{e_0( · ), ..., e_k( · )} and {e_i( · )}_{i∈N0} is an orthonormal
system, we have

    cl_{L2(a)} [ span{ e_i( · ) : i ∈ {k + 1, k + 2, ...} } ] = H_k^⊥ ,

and hence H_k^⊥ ⊆ TN . We conclude that the lemma is proved if we show that for all
i ∈ {k + 1, k + 2, ...}, h_i( · ) ∈ TN0 , and this is proved below.
Let ν( · ) = h_i( · ) for a fixed but arbitrary i ∈ {k + 1, k + 2, ...}. We prove that, for t
small enough,

    a_t( · ) := a( · ) + t a( · ) ν( · ) ∈ A .    (4.36)

Note that (4.36) amounts to taking a differentiable path with vanishing remainder term.
Hence, if we show (4.36), we prove in fact that ν ∈ TN0 . We verify next that each a_t (for t
in a neighborhood of zero) satisfies the conditions (4.2)-(4.10).
Verification of (4.2):
Note that

    lim_{x→+∞} ν(x) = lim_{x→−∞} ν(x) = ∞ .

Hence there exists L > 0 such that for all x ∈ IR \ [−L, L], ν(x) > 0. Then, since a( · ) is
strictly positive, for all x ∈ IR \ [−L, L] and all t ∈ IR+,

    a_t(x) = a(x){1 + t ν(x)} > 0 .

On the other hand, since ν( · ) is continuous, its restriction to the compact interval [−L, L]
is bounded, and since a( · ) is continuous and strictly positive, the restriction of a( · ) to
[−L, L] is bounded away from zero. It can easily be shown then that for t small enough
and for all x ∈ [−L, L], a_t(x) > 0.
Verification of (4.3)-(4.5) and (4.10):
Given i ∈ {1, 2, ..., k} and t ∈ IR+,

    ∫_IR x^i a_t(x) λ(dx) = ∫_IR x^i a(x) λ(dx) + t ∫_IR x^i ν(x) a(x) λ(dx) = m_i + t·0 = m_i .

The second last equality comes from the fact that {e_i}_{i∈N0} is an orthogonal system in
L2(a) and from (4.3)-(4.5) and (4.10).
Verification of (4.6):
Since the polynomials are of class C∞, the property follows immediately for a_t .
Verification of (4.7):
We have for all s ∈ (−δ, δ) (δ = δ(a)) and all t ∈ IR+

    M(s; a_t) = M(s; a) + t M(s; aν) .

Since ν is a polynomial and M(s; a) < ∞ by hypothesis, from proposition 20 (in the
appendix on the Laplace transform), M(s; aν) < ∞; then, for all s ∈ (−δ, δ),

    M(s; a_t) < ∞ .
Verification of (4.8):
A routine calculation yields

    {a_t′( · )}² = p( · ){a′( · )}² + q( · ){a( · )}² + w( · ) a( · ) a′( · ) ,    (4.37)

for some polynomials p, q, and w. We will show that for all s ∈ [−δ/2, δ/2] and for t ∈ IR+
small enough,

    M[s; p( · ){a′( · )}²/a_t( · )] , M[s; q( · ) a²( · )/a_t( · )] , M[s; w( · ) a( · ) a′( · )/a_t( · )] ∈ IR .    (4.38)
We verify that each of the terms on the right side of (4.40) is finite. Note that, since
lim_{x→±∞} ν(x) = ∞, there exists an L > 0 such that for all x ∈ IR \ [−L, L],
a_t(x) > a(x). We have then, for every polynomial r,

    ‖ e^{( · )s/2} r( · ) a( · )/√a_t( · ) ‖²_{L²(λ)} = ∫_IR e^{sx} r²(x) (a²(x)/a_t(x)) λ(dx)    (4.41)
    = ∫_{[−L,L]} e^{sx} r²(x) (a²(x)/a_t(x)) λ(dx) + ∫_{IR\[−L,L]} e^{sx} r²(x) (a²(x)/a_t(x)) λ(dx)
    ≤ ∫_{[−L,L]} e^{sx} r²(x) (a²(x)/a_t(x)) λ(dx) + ∫_IR e^{sx} r²(x) a(x) λ(dx) < ∞ .

The first integral in the last line is finite because its integrand is continuous and hence
bounded on [−L, L], and the second integral is finite because it coincides with the Laplace
transform of r²( · )a( · ), which according to proposition 20 is finite. Moreover,
    ‖ e^{( · )s/2} a′( · )/√a_t( · ) ‖²_{L²(λ)} = ∫_IR e^{sx} ({a′(x)}²/a_t(x)) λ(dx)    (4.42)
    ≤ ∫_{[−L,L]} e^{sx} ({a′(x)}²/a_t(x)) λ(dx) + ∫_IR e^{sx} ({a′(x)}²/a(x)) λ(dx) < ∞ .

The first integral in the last line is finite because its integrand is continuous and hence
bounded on [−L, L], and the second integral is finite because it coincides with the Laplace
transform of {a′( · )}²/a( · ), which according to condition (4.8) is finite. Inserting (4.41)
and (4.42) in (4.40) we obtain, for all s ∈ [−δ/2, δ/2],

    M[ s; r( · ) a( · ) a′( · )/a_t( · ) ] < ∞ .
We show now that M[s; q( · ) a²( · )/a_t( · )] is finite. Using the Cauchy-Schwarz inequality
we obtain

    M[ s; q( · ) a²( · )/a_t( · ) ] = < e^{( · )s/2} q( · ) a( · )/√a_t( · ) , e^{( · )s/2} a( · )/√a_t( · ) >_{L²(λ)}    (4.43)
    ≤ ‖ e^{( · )s/2} q( · ) a( · )/√a_t( · ) ‖_{L²(λ)} ‖ e^{( · )s/2} a( · )/√a_t( · ) ‖_{L²(λ)} < ∞ .
To see that the right side of the expression above is finite, note that for every polynomial,
say r, we have

    ‖ e^{( · )s/2} r( · ) a( · )/√a_t( · ) ‖²_{L²(λ)} = ∫_IR e^{sx} r²(x) ({a(x)}²/a_t(x)) λ(dx)    (4.44)
    ≤ ∫_{[−L,L]} e^{sx} r²(x) ({a(x)}²/a_t(x)) λ(dx) + ∫_{IR\[−L,L]} e^{sx} r²(x) a(x) λ(dx)
    ≤ ∫_{[−L,L]} e^{sx} r²(x) ({a(x)}²/a_t(x)) λ(dx) + M[s; r²( · )a( · )] < ∞ .
Note that proposition 20 implies that M[s; r²( · )a( · )] < ∞.
Finally, we show that M[s; p( · ){a′( · )}²/a_t( · )] is finite. Indeed,

    M[ s; p( · ){a′( · )}²/a_t( · ) ] = ∫_IR e^{sx} p(x) ({a′(x)}²/a_t(x)) λ(dx)
    ≤ ∫_{[−L,L]} e^{sx} |p(x)| ({a′(x)}²/a_t(x)) λ(dx) + ∫_{IR\[−L,L]} e^{sx} |p(x)| ({a′(x)}²/a(x)) λ(dx)
    ≤ ∫_{[−L,L]} e^{sx} |p(x)| ({a′(x)}²/a_t(x)) λ(dx) + M[ s; |p( · )| {a′( · )}²/a( · ) ] < ∞ .
Verification of (4.9):
Given i ∈ N0 we have, for each t ∈ IR+,

    lim_{x→±∞} x^i a_t(x) = lim_{x→±∞} { x^i a(x) + t x^i ν(x) a(x) } = 0 ,

because x^i ν(x) is a polynomial and, from (4.9), for each polynomial p( · ) we have

    lim_{x→±∞} p(x) a(x) = 0 . □
    TN(µ0, σ0, a0) = { ν( ( · − µ0)/σ0 ) : ν ∈ TN(0, 1, a0) } .
The theorem above holds for any of the notions of path differentiability given in Labouriau
(1996a), and in particular for any path differentiability used in this paper. Therefore we
do not specify the path differentiability adopted in this appendix.
The point (µ0 , σ0 , a0 ) will be considered fixed in the rest of this section. We introduce
the following notation:
    P_{µ0 σ0 a0} = P0 ;
    a_•( · ) = (1/σ0) a0( ( · − µ0)/σ0 ) .
The inner product and the norm of L2(P0) will be denoted by < · , · >_0 and ‖ · ‖_0
respectively. We also use the symbol P0* to denote P*_{µ0 σ0}. Note that
    P0* = { (1/σ0) a( ( · − µ0)/σ0 ) : a ∈ A }
and
    TN0(µ0, σ0, a0) = { ν ∈ L0²(P0) : ∃ ε > 0, {a_t^0}_{t∈[0,ε)} ⊆ P0* and {r_t^0}_{t∈[0,ε)} ⊆ L²(P0)
                       such that (4.45) and (4.46) hold } ,

where

    a_t^0( · ) = a_•( · ) + t a_•( · ) ν( · ) + t a_•( · ) r_t^0( · ) , for all t ∈ [0, ε) ,    (4.45)
and

    r_t^0( · ) −→ 0 , as t ↓ 0 ,    (4.46)
where the convergence above is one of the forms of convergence defined in section ??.
Lemma 5 If ν ∈ TN0(0, 1, a0), then ν( ( · − µ0)/σ0 ) ∈ TN0(µ0, σ0, a0).
Proof: Since ν ∈ TN0(0, 1, a0), there exist ε > 0, {a_t}_{t∈[0,ε)} ⊆ P*_{01} = A and
{r_t}_{t∈[0,ε)} ⊆ L²(P_{01a0}) such that, for all t ∈ [0, ε),

    a_t( · ) = a0( · ) + t a0( · ) ν( · ) + t a0( · ) r_t( · )    (4.47)

and

    r_t( · ) −→ 0 , as t ↓ 0 .
Using (4.47), for all t ∈ [0, ε), we can write

    (1/σ0) a_t( ( · − µ0)/σ0 ) = (1/σ0) a0( ( · − µ0)/σ0 ) + t (1/σ0) a0( ( · − µ0)/σ0 ) ν( ( · − µ0)/σ0 )    (4.48)
        + t (1/σ0) a0( ( · − µ0)/σ0 ) r_t( ( · − µ0)/σ0 )
    = a_•( · ) + t a_•( · ) ν( ( · − µ0)/σ0 ) + t a_•( · ) r_t( ( · − µ0)/σ0 ) .
Clearly, (1/σ0) a_t( ( · − µ0)/σ0 ) ∈ P0*, because a_t( · ) ∈ A. Suppose now that the
convergence of the remainder term is in the L^q sense (for q ∈ [1, ∞)). We have then

    ‖ r_t( ( · − µ0)/σ0 ) ‖^q_{L^q(P0)} = ∫_IR | r_t( (x − µ0)/σ0 ) |^q a_•(x) λ(dx)    (4.49)
    = ∫_IR | r_t(y) |^q a0(y) λ(dy) = ‖ r_t( · ) ‖^q_{L^q(a0)} .
Since, for all t ∈ [0, ε), r_t( · ) ∈ L^q(P_{01a0}), it follows from (4.49) that
r_t( ( · − µ0)/σ0 ) ∈ L^q(P0), and since ‖ r_t( · ) ‖_{L^q(a0)} −→ 0, also
‖ r_t( ( · − µ0)/σ0 ) ‖_{L^q(P0)} −→ 0. We conclude that ν( ( · − µ0)/σ0 ) ∈ TN0(µ0, σ0, a0).
An analogous argument can be used to prove the lemma for the weak nuisance tangent
spaces; the idea again is to use a suitable change of variables in the integrals used to
define the convergence of the remainder term. □
Lemma 6 If ν( ( · − µ0)/σ0 ) ∈ TN0(µ0, σ0, a0), then ν( · ) ∈ TN0(0, 1, a0).
Proof: We give next the argument for the L2 - nuisance tangent space. For the other
notions of path differentiability the argument is analogous. Since
ν( ( · − µ0)/σ0 ) ∈ TN0(µ0, σ0, a0), there exist ε > 0, {a_t^0}_{t∈[0,ε)} ⊆ P0* and
{r_t^0}_{t∈[0,ε)} ⊆ L²(P0) such that

    a_t^0( · ) = a_•( · ) + t a_•( · ) ν( ( · − µ0)/σ0 ) + t a_•( · ) r_t^0( · )    (4.50)
and

    r_t^0( · ) −→ 0 , as t ↓ 0 .
Since for each t ∈ [0, ε), a_t^0( · ) ∈ P0*, there exists a_t( · ) ∈ A such that

    a_t^0( · ) = (1/σ0) a_t( ( · − µ0)/σ0 ) .

Then (4.50) is equivalent to

    (1/σ0) a_t( ( · − µ0)/σ0 ) = (1/σ0) a0( ( · − µ0)/σ0 ) + t (1/σ0) a0( ( · − µ0)/σ0 ) ν( ( · − µ0)/σ0 )
        + t (1/σ0) a0( ( · − µ0)/σ0 ) r_t^0( · ) .

Eliminating the common factor 1/σ0 and changing variables we obtain

    a_t( · ) = a0( · ) + t a0( · ) ν( · ) + t a0( · ) r_t^0[ σ0( · ) + µ0 ] .
Note that

    ‖ r_t^0[ σ0( · ) + µ0 ] ‖²_{L²(P_{01a0})} = ∫_IR { r_t^0[ σ0 x + µ0 ] }² a0(x) λ(dx)    (4.51)
    = ∫_IR { r_t^0(y) }² (1/σ0) a0( (y − µ0)/σ0 ) λ(dy)
    = ‖ r_t^0( · ) ‖²_{L²(P0)} ,

so the remainder term converges to zero in L²(a0) as t ↓ 0, and hence ν( · ) ∈ TN0(0, 1, a0). □
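The change-of-variables step in (4.51) can be checked numerically. In the sketch below a0 is the standard normal density (an illustrative choice), P0 has density (1/σ0)a0((x − µ0)/σ0), both squared norms are approximated by midpoint sums, and the test function r is arbitrary.

```python
import math

def midpoint(g, lo, hi, n=100_000):
    # Midpoint Riemann sum of g on [lo, hi] (illustration only).
    h = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * h) for i in range(n)) * h

a0 = lambda y: math.exp(-y * y / 2) / math.sqrt(2 * math.pi)
mu0, s0 = 3.0, 2.0
r = lambda y: y ** 3 - y          # an arbitrary test function

# ‖r((· - µ0)/σ0)‖² in L2(P0) versus ‖r‖² in L2(a0): equal by substitution.
lhs = midpoint(lambda x: r((x - mu0) / s0) ** 2 * a0((x - mu0) / s0) / s0,
               mu0 - 20 * s0, mu0 + 20 * s0)
rhs = midpoint(lambda y: r(y) ** 2 * a0(y), -20.0, 20.0)
assert abs(lhs - rhs) < 1e-4
assert abs(rhs - 10.0) < 1e-3     # E[(Y^3 - Y)^2] = 15 - 6 + 1 = 10 for Y ~ N(0, 1)
```

The same substitution is what makes the affine reparametrization an isometry between L2(P0) and L2(a0), which is the content of lemmas 5 and 6.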
    = cl_{L²(P0)} { ν( ( · − µ0)/σ0 ) : ν ∈ span{TN0(0, 1, a0)} }
    (from lemma 8)
    = { ν( ( · − µ0)/σ0 ) : ν ∈ cl_{L²(a0)}[ span{TN0(0, 1, a0)} ] }
    = { ν( ( · − µ0)/σ0 ) : ν ∈ TN(0, 1, a0) } . □
We present now the two technical lemmas required in the proof given above.
Lemma 7 Given a class of functions A, we have

    span{ ν( ( · − µ)/σ ) : ν ∈ A } = { ν( ( · − µ)/σ ) : ν ∈ span(A) } .
Proof:
'⊆': Take h ∈ span{ ν( ( · − µ)/σ ) : ν ∈ A }. Then there exist n ∈ N, t_1, ..., t_n ∈ IR
and h_1, ..., h_n ∈ { ν( ( · − µ)/σ ) : ν ∈ A } such that

    h( · ) = Σ_{i=1}^n t_i h_i( · ) .

Since for each i ∈ {1, ..., n}, h_i ∈ { ν( ( · − µ)/σ ) : ν ∈ A }, there exists ν_i ∈ A such
that

    ν_i( ( · − µ)/σ ) = h_i( · ) .

Clearly Σ_{i=1}^n t_i ν_i( · ) ∈ span(A) and

    h( · ) = Σ_{i=1}^n t_i ν_i( ( · − µ)/σ ) .

Hence h ∈ { ν( ( · − µ)/σ ) : ν ∈ span(A) }.
'⊇': Take z ∈ { ν( ( · − µ)/σ ) : ν ∈ span(A) }. Then there exists ν ∈ span(A) such that
z( · ) = ν( ( · − µ)/σ ). Since ν ∈ span(A), there exist n ∈ N, t_1, ..., t_n ∈ IR and
ν_1, ..., ν_n ∈ A such that ν( · ) = Σ_{i=1}^n t_i ν_i( · ). We have then

    z( · ) = Σ_{i=1}^n t_i ν_i( ( · − µ)/σ )

and hence z ∈ span{ ν( ( · − µ)/σ ) : ν ∈ A }. □
Lemma 8 With the notation above,

    cl_{L²(P0)} { ν( ( · − µ0)/σ0 ) : ν ∈ A } ⊇ { ν( ( · − µ0)/σ0 ) : ν ∈ cl_{L²(a0)}(A) } .

Proof: Take z in the right-hand side; then z( · ) = ν( ( · − µ0)/σ0 ) for some
ν ∈ cl_{L²(a0)}(A), so there exists a sequence {ν_n} ⊆ A such that ν_n −→ ν in L²(a0).
Note that {ν_n} is a Cauchy sequence in L²(a0). Define the sequence {z_n} in L²(P0) by,
for all n, z_n( · ) = ν_n( ( · − µ0)/σ0 ). From (4.52) we see that {z_n} is a Cauchy sequence
in L²(P0), and then it is convergent with limit, say
ξ ∈ cl_{L²(P0)} { ν( ( · − µ0)/σ0 ) : ν ∈ A }. Using (4.52) and the continuity of the norm
‖ · ‖_{L²(P0)} we obtain

    ‖ ξ( · ) − z( · ) ‖_{L²(P0)} = lim_{n↑∞} ‖ z_n( · ) − ν( ( · − µ0)/σ0 ) ‖_{L²(P0)}
    = lim_{n↑∞} ‖ ν_n( · ) − ν( · ) ‖_{L²(a0)} = 0 .    (4.53)

Hence z = ξ belongs to the left-hand side. □
and

    ‖ e_1 ‖² = ∫_IR x² a(x) λ(dx) = 1 .    (4.54)

We define

    e_2( · ) = (1/∆_2) { ( · )² − m_3 ( · ) − 1 } .
5 Semiparametric models with L2 restrictions
5.1 Introduction
This chapter treats a class of semiparametric models defined via restrictions imposed on
the moments of some square integrable functions. The class of models studied includes
semiparametric extensions of many important models, such as the multivariate location
and shape models, covariance selection models, growth curve models with modelled
variance, generalized linear models, factor analysis, multivariate structural models, and
linear structural relationships models, among others. In fact we present a tool to produce
semiparametric extensions of parametric models for which the moment structure plays a
structural role.
The class of semiparametric models presented possesses a simple mathematical structure,
which makes it attractive for testing general inference procedures. We will be able to
calculate explicitly the nuisance tangent spaces, which are the orthogonal complements
of the spaces spanned by the square integrable functions used to introduce the
restrictions defining the model. It is interesting to observe that the different notions of
nuisance tangent spaces defined in chapter 2 coincide for the models considered.
Moreover, the orthogonal complement of the nuisance tangent spaces does not depend on
the nuisance parameter. This simplifies the treatment of some classic techniques for
inference in semiparametric models. For instance, any regular estimating function will be
a linear combination of the restriction functions used to define the model, corrected by
their means. In fact it will be shown that there is only one possible root for a regular
estimating function; that root is obtained in the form of a moment estimator. The
efficient score function will be a linear combination of the mean corrected restriction
functions, however
with the coefficients in general depending on the nuisance parameter. The semiparametric
Cramér-Rao bound for the asymptotic variance of regular asymptotic linear estimating
sequences will be obtained from the quadratic norm of the efficient score function. We
will give a necessary and sufficient condition for attaining this bound with estimating
sequences based on regular estimating functions. Examples will be presented where the
bound is attained and examples where it is not attained by roots of regular estimating
functions.
The chapter is organized in the following way. Section 5.2 introduces the main class
of semiparametric models we deal with, and some examples are given in section 5.3. The
estimation via regular asymptotic linear estimating sequences and regular estimating
functions is considered in section 5.4. Most of the technical proofs are given in section 5.5.
5.2 Semiparametric models with L2 restrictions
    p is of class C^l ,    (5.2)

where for r ∈ N, C^r is the class of functions r times continuously differentiable, C^0 is the
class of continuous functions, and C^∞ is the class of infinitely differentiable functions;

    ∫_X p(x) λ(dx) = 1 ;    (5.3)

    ∫_X f_j²(x) p(x) λ(dx) < ∞ ;    (5.4)

    ∫_X f_j(x) p(x) λ(dx) = M_j(θ0) ;    (5.5)

    for each z ∈ Z, ∫_X g_i(x, z) p(x) λ(dx) ∈ B_i(θ0) .    (5.6)
5.3 Examples
5.3.1 Location-scale models
In this example we consider a collection of location-scale models defined on the real line,
for which we fix the first k (for k ∈ N, k ≥ 2) standardized cumulants of the
distributions. The models we define are larger than the semiparametric extensions of the
location-scale models considered in chapter 4. In particular, we avoid using strong
conditions on the tails and on the Laplace transform of the distributions, but those
conditions can be incorporated in the model through condition (5.6) without essentially
modifying the results obtained. These models will serve as prototypes of L2 - restricted
models.
Here the sample space X is the real line, B is the Borel σ-algebra in IR and λ is
the Lebesgue measure defined on (IR, B). Consider the following family of probability
measures dominated by λ:

    P = { P_{µσa} : dP_{µσa}/dλ ( · ) = (1/σ) a( ( · − µ)/σ ) , µ ∈ IR, σ ∈ IR+, a ∈ A } ,    (5.7)
where A is the class of probability densities a : IR −→ IR+ such that the conditions
(5.8)-(5.16) are satisfied. The conditions referred to are:

    ∀x ∈ IR, a(x) > 0 ;    (5.8)

    a ∈ C¹ ;    (5.9)

    ∫_IR a(x) λ(dx) = 1 ;    (5.10)

    ∫_IR x a(x) λ(dx) = 0 ;    (5.11)

    ∫_IR x² a(x) λ(dx) = 1 ;    (5.12)

    ∫_IR x^{2k} a(x) λ(dx) < ∞ ;    (5.13)

    for j = 3, . . . , k, ∫_IR x^j a(x) λ(dx) = m_j ;    (5.14)

    ∫_IR { a′(x)/a(x) }² a(x) λ(dx) ∈ (0, ∞) ;    (5.15)
    ∫_IR { x a′(x)/a(x) }² a(x) λ(dx) ∈ (0, ∞) .    (5.16)
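The standardization conditions can be checked numerically for a candidate density; the sketch below verifies (5.10)-(5.12) for the standard normal density and shows that a shifted copy violates the zero-mean condition (5.11). The grid bounds and step are ad hoc choices.

```python
import math

# j-th moment ∫ x^j a(x) λ(dx), approximated by a midpoint Riemann sum.
def moment(a, j, lo=-15.0, hi=15.0, n=60_000):
    h = (hi - lo) / n
    return sum((lo + (i + 0.5) * h) ** j * a(lo + (i + 0.5) * h)
               for i in range(n)) * h

std_normal = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
shifted = lambda x: std_normal(x - 1.0)   # same shape, mean 1

assert abs(moment(std_normal, 0) - 1.0) < 1e-6   # (5.10): total mass one
assert abs(moment(std_normal, 1)) < 1e-6          # (5.11): mean zero
assert abs(moment(std_normal, 2) - 1.0) < 1e-4   # (5.12): unit variance
assert abs(moment(shifted, 1) - 1.0) < 1e-4       # the shifted density violates (5.11)
```

In the model (5.7), a density violating (5.11) or (5.12) is not excluded from P; it simply corresponds to a different value of the parameter (µ, σ), which is what makes the standardization an identifiability device.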
The parameter of interest is θ := (µ, σ) ∈ Θ := IR × IR+. The nuisance parameter is
a ∈ A.
The characterization given above is natural for the location and scale model; however, it
is not in the form of a L2 - restricted semiparametric model. In fact, we define the
submodel P*_{(0,1)} (= A) and then use the affine transformation ( · ) ↦ σ( · ) + µ to
characterize the whole model P. On the other hand, in the definition of the L2 - restricted
model we give conditions satisfied by each submodel P_θ (for θ ∈ Θ).
We give next an alternative characterization of the location and scale model given by
(5.7), which is in the form required for the L2 - restricted models. For each
(µ, σ) ∈ IR × IR+ consider the following class of probability densities on the real line:

    P*_{µσ} = { p_{µ,σ} : IR −→ IR+ : (5.18)-(5.24) hold } .    (5.17)
The conditions referred to in the definition above are, for each (µ, σ) ∈ IR × IR+:

    ∀x ∈ IR, p_{µ,σ}(x) > 0 ;    (5.18)

    p_{µ,σ} ∈ C¹ ;    (5.19)

    ∫_IR p_{µ,σ}(x) λ(dx) = 1 ;    (5.20)

    for j = 1, . . . , k, ∫_IR x^j p_{µ,σ}(x) λ(dx) = Σ_{i=0}^j (j choose i) σ^{j−i} µ^i m_{j−i} ;    (5.21)

    ∫_IR { x p′_{µ,σ}(x)/p_{µ,σ}(x) }² p_{µ,σ}(x) λ(dx) ∈ (0, ∞) .    (5.24)
It is easy to see that the model given by (5.17) coincides with the model given by (5.7).
Moreover, the model given by (5.17) is clearly a L2 - restricted model. Here the condition
(5.21) corresponds to the condition (5.5) in the definition of the L2 - restricted model, with
f_j( · ) = ( · )^j and M_j(µ, σ) given by the right side of (5.21).
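The transformed-moment condition (5.21) is the binomial expansion of E[(σX + µ)^j] in terms of the standardized moments m_j. A quick check with X standard normal (so m_3 = 0, m_4 = 3) against the known closed form E[(σX + µ)^4] = µ⁴ + 6µ²σ² + 3σ⁴:

```python
from math import comb

# Standardized moments m_j = E[X^j] for X ~ N(0, 1) (illustrative choice).
m = {0: 1.0, 1: 0.0, 2: 1.0, 3: 0.0, 4: 3.0}
mu, sigma, j = 2.0, 3.0, 4

# Right side of (5.21): sum over i of C(j, i) σ^{j-i} µ^i m_{j-i}.
lhs = sum(comb(j, i) * sigma ** (j - i) * mu ** i * m[j - i] for i in range(j + 1))
# Known fourth moment of a N(µ, σ²) variable, used as an independent reference.
rhs = mu ** 4 + 6 * mu ** 2 * sigma ** 2 + 3 * sigma ** 4
assert lhs == rhs == 475.0
```

This is exactly the M_j(µ, σ) that enters condition (5.5) for the location-scale example.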
    p ∈ C¹ ;    (5.27)

    ∫_{IR^q} p(x) λ(dx) = 1 ;    (5.28)

    ∫_{IR^q} x p(x) λ(dx) = µ ;    (5.29)

    ∫_{IR^q} x x^T p(x) λ(dx) = Σ + µµ^T ;    (5.30)

    ∫_{IR^q} (x^T x)² p(x) λ(dx) < ∞ .    (5.31)

Assume moreover that the components of the partial score function are in L2, i.e.

    for i = 1, . . . , q, ∫_{IR^q} { l_{/µ_i}(x) }² p(x) λ(dx) ∈ (0, ∞) ;    (5.32)

    for i, j = 1, . . . , q, i ≤ j, ∫_{IR^q} { l_{/σ_{ij}}(x) }² p(x) λ(dx) ∈ (0, ∞) ,    (5.33)
where l_{/µ_i} and l_{/σ_{ij}} are the components corresponding to the partial derivatives with
respect to the ith component of µ and the (i, j)th entry of Σ, respectively, in the score
function for the location and shape generated by the distribution associated with the
density p. Note that by (5.26) and (5.28), p is a probability density on IR^q with support
equal to the whole of IR^q. Moreover, (5.27) implies that the partial score functions
referred to in (5.32) and (5.33) are well defined.
The multivariate location and shape model is defined by
where f^{ij} is a function from IR^{q+(q−1)+...+1} into IR. Note that in the definition of f^{ij} we
used only the triangle below the diagonal of the matrix Σ, in order to avoid ambiguity due
to the symmetry of Σ. In fact one should adopt a similar convention in the parametrization
of the location and shape model.
    σ^{ij} = 0 .    (5.34)

    σ^{kl} ≠ 0 .    (5.35)
By specifying conditions like (5.34) and (5.35) we can define whether each pair of variables
(entries of the random vector of observations) is conditionally correlated or not given the
other variables. This corresponds to specifying a covariance selection model.
The conditions referred to are, for some given pairs (i, j) and (k, l), with
i, j, k, l ∈ {1, . . . , q}, i > j and k > l:

    p ∈ C¹ ;    (5.37)

    ∫_{IR^q} p(x) λ(dx) = 1 ;    (5.38)

    ∫_{IR^q} x p(x) λ(dx) = µ ;    (5.39)

    ∫_{IR^q} |x|² p(x) λ(dx) < ∞ ;    (5.40)

    for i = 1, . . . , q, ∫_{IR^q} { l_{/µ_i}(x) }² p(x) λ(dx) ∈ (0, ∞) ;    (5.41)

    f^{ij}( ∫_{IR^q} x_1 x_1 p(x) λ(dx), . . . , ∫_{IR^q} x_q x_q p(x) λ(dx) ) = 0 ;    (5.42)

    f^{kl}( ∫_{IR^q} x_1 x_1 p(x) λ(dx), . . . , ∫_{IR^q} x_q x_q p(x) λ(dx) ) ≠ 0 ,    (5.43)
where for x ∈ IR^q, |x|² = x_1² + . . . + x_q² and x_1, . . . , x_q are the components of the
vector x. The model under study is given by P* = ∪_{µ∈IR^q} P*_µ .
Note that the conditions (5.42) and (5.43) cannot be expressed in terms of the conditions
(5.1)-(5.6), and hence the model presented is not a L2 - restricted model. We will study
this kind of model later.
We close this example by considering in detail the case where the dimension is q = 3,
which is relatively simple. Let us consider the case where the pair formed by the first and
the second variables is conditionally uncorrelated and the other pairs are conditionally
correlated given the other variables. The conditions in terms of the unknown entries of
the inverse of the covariance matrix are

    σ^{12} = 0 , σ^{13} ≠ 0 and σ^{23} ≠ 0 ,

or equivalently,

    σ_{13} σ_{23} − σ_{12} σ_{33} = 0 , σ_{12} σ_{23} − σ_{13} σ_{22} ≠ 0 and σ_{13} σ_{12} − σ_{11} σ_{23} ≠ 0 .
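The equivalence between σ^{12} = 0 and σ_{13}σ_{23} − σ_{12}σ_{33} = 0 comes from the cofactor formula for the inverse of a 3 × 3 matrix. The sketch below builds a precision matrix K with K_{12} = 0 (an illustrative choice), inverts it by cofactors, and checks both the vanishing and the non-vanishing conditions.

```python
# Inverse of a 3x3 matrix via the cofactor (adjugate) formula.
def inv3(M):
    d = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
         - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
         + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    # Cyclic-index form of the 3x3 cofactors (signs are absorbed by the
    # cyclic ordering); inverse = adjugate/det = transpose(cofactors)/det.
    cof = [[(M[(i + 1) % 3][(j + 1) % 3] * M[(i + 2) % 3][(j + 2) % 3]
             - M[(i + 1) % 3][(j + 2) % 3] * M[(i + 2) % 3][(j + 1) % 3])
            for j in range(3)] for i in range(3)]
    return [[cof[j][i] / d for j in range(3)] for i in range(3)]

# Precision matrix with a zero (1,2) entry: variables 1 and 2 are
# conditionally uncorrelated given variable 3 (illustrative values).
K = [[2.0, 0.0, 0.5], [0.0, 2.0, 0.5], [0.5, 0.5, 2.0]]
S = inv3(K)                                   # the implied covariance matrix
assert abs(sum(S[0][m] * K[m][0] for m in range(3)) - 1.0) < 1e-12  # sanity: S K = I

assert abs(S[0][2] * S[1][2] - S[0][1] * S[2][2]) < 1e-12  # σ13σ23 − σ12σ33 = 0
assert abs(S[0][1] * S[1][2] - S[0][2] * S[1][1]) > 1e-3   # σ^{13} ≠ 0 condition holds
```

Condition (5.42) with f^{12} equal to this cofactor expression is exactly how the zero pattern is imposed on the second-moment structure without referring to the inverse matrix directly.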
The latent variables η and ξ are related to the observed vectors Y and X through the
measurement equations

    Y = Λ_y η + ε  and  X = Λ_x ξ + δ .

Here Λ_y and Λ_x are p × m and q × n matrices of parameters respectively; ε and δ are
p × 1 and q × 1 random vectors respectively, with

    cov(ε) = Θ_ε ;  cov(δ) = Θ_δ ;  cov(ξ) = Φ ;  E(ξ) = 0 ;  E(η) = 0 .
We assume that Θ_ε, Θ_δ and Φ are given positive definite matrices. Here E( · ) and
cov( · ) are the expectation and covariance operators respectively. The latent variables
are related through the following equation:

    η = Γξ + ζ ,

where Γ is a m × n matrix of parameters and ζ is a m × 1 random vector with

    E(ζ) = 0 ;  cov(ζ) = Ψ .

Here Ψ is a given positive definite matrix. We assume additionally that ζ, ε and δ are
mutually uncorrelated, that ε and η are uncorrelated, and that δ is uncorrelated with ξ.
It is assumed that m, n, p and q are such that the number of parameters is less than or
equal to the number of entries of the lower triangle of the joint covariance matrix of Y and
X, i.e. (p + q)(p + q + 1)/2. Some of the entries of Θ_ε, Θ_δ, Ψ or Φ can sometimes be
considered as parameters.
The class of possible joint distributions of Y and X with the properties mentioned
above is said to be a linear structural relationships (LISREL) model (see Johnson and
Wichern, 1992). Sometimes in the literature maximum likelihood estimation is performed
based on the assumption of multivariate normal distributions for the random quantities.
In general, however, no particular distribution is specified for the random quantities
referred to, and moment estimation via the covariance matrix is performed. When using
LISREL models one is interested in modeling the covariance structure of a group of
variables (for example in factor analysis), and an alternative definition of a LISREL
model is given by taking the class of distributions with a special covariance structure.
The following covariance structure can be easily obtained from the model description
given above:
    cov(Y) = Λ_y (ΓΦΓ^T + Ψ) Λ_y^T + Θ_ε ,    (5.44)

    cov(X) = Λ_x Φ Λ_x^T + Θ_δ ,    (5.45)

and

    cov(Y, X) = Λ_y Γ Φ Λ_x^T .    (5.46)
We show that the LISREL models are examples of L2 - restricted models. The conditions
(5.44)-(5.46) can be expressed in terms of conditions of the type of (5.5), with integrands
given, for all x = (x_1, . . . , x_{p+q}) ∈ IR^{p+q} and for i, j = 1, . . . , p + q, i > j, by

    f_{ij}(x) = x_i x_j

(here we index the functions f_i given in (5.5) by a double index), and the right side of
the equation (i.e. M_{ij}(θ)) determined by the equations (5.44)-(5.46).
If the matrices Θ_ε, Θ_δ, Ψ or Φ are viewed as parameters (or some entries as parameters
and some entries known), the model is still a L2 - restricted model; in this case the right
side of the equations of type (5.5) becomes more complicated. However, if some entries
of Θ_ε, Θ_δ, Ψ or Φ are nuisance parameters (i.e. are not interest parameters and are
unknown), then the model in general is no longer a L2 - restricted model. An extension of
the L2 - restricted models given in the next chapter will cover these cases.
Example 8 This example is extracted from Johnson and Wichern (1992, page 445). Suppose
that we want to model firm performance and managerial talent in a certain economic
system. Since these two quantities cannot be measured directly, one might represent them
by two latent variables, say η and ξ, and use a set of related observable variables in a
LISREL model. In this example the firm performance is characterized by the profit, Y_1,
and the common stock price, Y_2. The managerial talent is represented by the time of
chief executive experience, X_1, and memberships of boards of directors, X_2. A LISREL
model with the specifications given below is indicated by Johnson and Wichern (1992) as
an alternative for modeling this situation. Take m = n = 1, p = q = 2,

    Λ_y = (1, λ_1)^T ,  Λ_x = (λ_2, 1)^T ,  Θ_ε = diag(θ_1, θ_2) ,  Θ_δ = diag(θ_3, θ_4) ,
var(ξ) = Φ and var(ζ) = Ψ. Denoting the covariance of the ith and the jth entries of
the joint vector (Y^T | X^T)^T by κ_{ij}, we obtain from the covariance structure of a
LISREL model a system of moment equations. The parameters were estimated by
replacing the second order moments in these equations by the sample moments and
solving for the parameters. We will show that this intuitively appealing procedure can be
justified by the theory of semiparametric models.
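A moment-estimation sketch for a LISREL structure of the form in Example 8: the covariance expressions below are derived from the measurement and structural equations stated above, while the numeric parameter values and the particular solving order are hypothetical illustrations.

```python
# Illustrative parameter values (not from the source).
l1, l2, gamma, phi, psi = 0.8, 1.5, 0.6, 2.0, 0.5
th = [0.3, 0.4, 0.2, 0.6]                  # error variances θ1..θ4

veta = gamma ** 2 * phi + psi              # var(η) = γ²Φ + Ψ
k = {                                       # κij = cov of entries i, j of (Y1, Y2, X1, X2)
    (1, 2): l1 * veta,            (1, 3): l2 * gamma * phi,
    (1, 4): gamma * phi,          (2, 3): l1 * l2 * gamma * phi,
    (2, 4): l1 * gamma * phi,     (3, 4): l2 * phi,
    (1, 1): veta + th[0],         (2, 2): l1 ** 2 * veta + th[1],
    (3, 3): l2 ** 2 * phi + th[2], (4, 4): phi + th[3],
}

# Solve the moment equations back (one convenient order among several):
l1_hat = k[(2, 4)] / k[(1, 4)]             # λ1 from cov(Y2, X2)/cov(Y1, X2)
l2_hat = k[(1, 3)] / k[(1, 4)]             # λ2 from cov(Y1, X1)/cov(Y1, X2)
phi_hat = k[(3, 4)] / l2_hat               # Φ from cov(X1, X2)
gamma_hat = k[(1, 4)] / phi_hat            # γ from cov(Y1, X2)
psi_hat = k[(1, 2)] / l1_hat - gamma_hat ** 2 * phi_hat
th4_hat = k[(4, 4)] - phi_hat

assert abs(l1_hat - l1) < 1e-12 and abs(l2_hat - l2) < 1e-12
assert abs(phi_hat - phi) < 1e-12 and abs(gamma_hat - gamma) < 1e-12
assert abs(psi_hat - psi) < 1e-12 and abs(th4_hat - th[3]) < 1e-12
```

In practice the κ's are replaced by sample covariances, so the recovered values are moment estimates; the theory developed in this chapter addresses exactly when such moment estimators are efficient.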
and for i = 1, . . . , n,
where ∆t_i ≡ t_i − t_{i−1}, V_1 is a suitable fixed function, and α, β, η and λ are
parameters. Moreover, suppose that X_0, ∆X_1, . . . , ∆X_n are uncorrelated. It is easy to
see that we have the following moment structure,
and
" #
ηV1 (α/η) ηV1 (α/η)eT
Cov(X) = (5.50)
ηV1 (α/η)e ηV1 (α/η)E + λV1 (β/λ)T
where e = (1, . . . , 1)^T, and E = [E_{ij}] and T = [T_{ij}] are n × n matrices with
E_{ij} = 1 for all (i, j) and T_{ij} = T_{ji} = t_i for j ≥ i. Hence the model is adequate for
studying linear growth with data whose variance is not constant over time. When the
family F is an exponential dispersion model with variance function V_1, parametrized in
a suitable way (such that X satisfies (5.49) and (5.50)), the model described above
coincides with the latent growth process studied in Jørgensen, Labouriau and
Lundbye-Christensen (1996). Consider now the class F of all families of distributions in
IR parametrized by the mean and the variance in an identifiable form, for which all the
members of the family possess finite moments of fourth order. For each F ∈ F we can
generate a linear growth curve model by using (5.47) and (5.48). The union of all these
models, obtained by taking each element of F and applying the construction described
above, provides a semiparametric extension of the original parametric model described
above. Adding suitable differentiability conditions (satisfied by the exponential dispersion
models, for example) one obtains a L2 - restricted model.
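The block structure of (5.50) can be assembled directly; the parameter values and the variance function V_1 below are illustrative only. A useful sanity check is that increments of the latent process over disjoint intervals are uncorrelated under this covariance.

```python
# Covariance matrix (5.50) of X = (X0, X1, ..., Xn); illustrative sketch.
def cov_X(alpha, beta, eta, lam, V1, t):
    n = len(t)                      # observation times t1 < ... < tn
    v0 = eta * V1(alpha / eta)      # var(X0) = η V1(α/η)
    w = lam * V1(beta / lam)        # increment-variance scale λ V1(β/λ)
    C = [[0.0] * (n + 1) for _ in range(n + 1)]
    C[0][0] = v0
    for i in range(1, n + 1):
        C[0][i] = C[i][0] = v0      # cov(X0, Xi) = η V1(α/η)
        for j in range(1, n + 1):
            # E_ij = 1 and T_ij = t_min(i,j): stationary independent increments
            C[i][j] = v0 + w * t[min(i, j) - 1]
    return C

C = cov_X(1.0, 0.5, 2.0, 1.5, lambda x: 1.0 + x * x, [1.0, 2.0, 4.0])
# Increments over disjoint intervals are uncorrelated:
assert abs(C[2][3] - C[2][2] - C[1][3] + C[1][2]) < 1e-12
assert C[1][1] < C[2][2] < C[3][3]   # variance grows with time
```

The restrictions corresponding to (5.5) are read off entrywise from this matrix, with M indexed by the pair (i, j) and θ = (α, β, η, λ).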
Let us consider now a more complicated structure. Suppose that X is not directly
observable. We can observe only a vector Y = (Y1 , . . . , Yn )T with, for i = 1, . . . , n,
and
Hence we have again a model suitable for linear growth data with variance varying with
time. If the families F and G are exponential dispersion models with variance functions
V1 and V2 respectively suitably parametrized, then the model defined by (5.47), (5.48)
and (5.51) coincides with a linear growth curve considered in Jørgensen, Labouriau and
Lundbye-Christensen (1996). The covariance matrix of Y is
Here D is a diagonal matrix with ith diagonal entry α + βt_i, and E and T are defined as
before. The component of the covariance matrix proportional to E reflects the
between-subjects variation at time t_0; the component proportional to T arises because
the X-process has stationary and independent increments; and the third component,
proportional to D, can be interpreted as a measure of the noise. If the family G is allowed
to vary freely in a class of distribution families, as we did before with the semiparametric
model for X, one obtains a L2 - restricted model. The restrictions corresponding to (5.5)
are obtained from equations (5.52) and (5.53).
If the matrix D is the identity and the functions V1 and V2 are constant functions, then
the model described above becomes a generalization of the univariate Brownian growth
model of Lundbye-Christensen (1991).
Linear growth in biology occurs only in a few cases. Let us generalize the model
described above to a situation where the growth does not follow a linear pattern in time.
Suppose that there is an unobserved linear latent process X defined as in (5.47) and (5.48).
We observe a growth process Y = (Y_1, . . . , Y_n)^T with, for i = 1, . . . , n,
where b is a given, sufficiently smooth, one-to-one increasing real function (the growth
curve). We have, for i = 0, 1, . . . , n,
and Cov(Y ) is given by (5.53) with D diagonal with ith diagonal given by b(α + βti ).
Applying the same construction used before, we obtain a L2 - restricted model.
The orthogonal complement of H_k(θ, z) in L0²(P_{θz}) is denoted by H_k^⊥(θ, z). Note that
H_k depends only on θ, but H_k^⊥ depends on both parameters, because we took the
orthogonal complement in L0²(P_{θz}).
We denote the L^p nuisance tangent space (for 1 ≤ p ≤ ∞) and the weak (or Hellinger)
nuisance tangent space at (θ, z) ∈ Θ × Z by T̄_N^p(θ, z) and T̄_N^w(θ, z) respectively (see
Labouriau, 1996a for details).

Theorem 13 Under the L2 - restricted semiparametric model the nuisance tangent spaces
are given, for θ ∈ Θ and z ∈ Z, by

    T̄_N^1(θ, z) = T̄_N^w(θ, z) = T̄_N^2(θ, z) = H_k^⊥(θ, z) .

The proof of theorem 13 is given in two lemmas in appendix 5.5.1. First it is proved
that T̄_N^1 ⊆ H_k^⊥ (lemma 9), by using an argument based on the existence of almost surely
convergent subsequences of each L¹-convergent sequence. The second step is to prove that
H_k^⊥ ⊆ T̄_N^2 (lemma 10), which is done by directly verifying that the continuous compactly
supported elements of H_k^⊥ are in the L² nuisance tangent set. The inclusion follows from
the fact that the class of continuous compactly supported functions is dense in L²,
provided the sample space is locally compact Hausdorff.
A careful analysis of the argument presented shows that in fact theorem 13 holds for any Lp nuisance tangent space (for 1 ≤ p ≤ ∞), and even for weaker notions of nuisance tangent space.
Note that the nuisance tangent space is determined only by the functions involved in condition (5.5). The restrictions associated with (5.6) have no effect on the nuisance tangent space. Note also that condition (5.5) can be weakened by replacing the property of the integral being equal to a constant by the property of belonging to a real set that possesses only isolated points.
The condition (5.4) associated with (5.5) is indeed necessary: otherwise the space Hk would not be contained in L2, which would produce irregularities in the nuisance tangent spaces. That is, square integrable functions should indeed be used to restrict the model.
l/θi ( · , θ0 , z0 ) = ∂/∂θi log(dPθz0/dλ)( · ) |θ=θ0 .
It is assumed that for each (θ0 , z0 ) ∈ Θ × Z and each i ∈ {1, . . . , q}, l/θi ( · , θ0 , z0 ) is well
defined and is in L2 (Pθ0 z0 ).
The efficient score function is obtained by orthogonally projecting the parametric partial score function onto the orthogonal complement of the nuisance tangent space. Note that different versions of the efficient score function are obtained by using alternative definitions of the nuisance tangent space. However, by theorem 13, the different notions of tangent space coincide for the models we work with. Hence we can refer generically to the efficient score function.
Consider the function l/θ^E : X × Θ × Z −→ IRq with components l/θ1^E , . . . , l/θq^E : X × Θ × Z −→ IR given, for i = 1, . . . , q and (θ0 , z0 ) ∈ Θ × Z, by
l/θi^E ( · , θ0 , z0 ) = Π( l/θi ( · , θ0 , z0 ) | TN⊥(θ0 , z0 ) ) ,
where Π( · |A) is the orthogonal projection operator onto A ⊆ L20 (Pθ0z0 ) and TN⊥(θ0 , z0 ) is the orthogonal complement of the nuisance tangent space at (θ0 , z0 ) in L20 (Pθ0z0 ). The function l/θ^E is the efficient score function.
We calculate now the efficient score function at (θ, z) ∈ Θ × Z. Denote by 1, ξ1^θz , . . . , ξk^θz the result of a Gram-Schmidt orthonormalization process, in the space L2 (Pθz ), applied to the functions 1, f1 , . . . , fk . Since ξ1^θz , . . . , ξk^θz form an orthonormal basis of Hk = TN⊥(θ, z), the projection of l/θi ( · , θ, z) onto TN⊥(θ, z) is, for i = 1, . . . , q,
l/θi^E ( · , θ, z) = Σ_{j=1}^{k} ⟨l/θi ( · , θ, z), ξj^θz ( · )⟩L2(Pθz) ξj^θz ( · ) ,   (5.56)
where ⟨ · , · ⟩L2(Pθz) is the inner product of L2 (Pθz ). The representation above can be written in matrix form in the following way, for each (θ, z) ∈ Θ × Z,
l/θ^E ( · , θ, z) = A(θ, z) ξ^θz ( · ) ,
where ξ^θz ( · ) = (ξ1^θz ( · ), . . . , ξk^θz ( · ))T and A(θ, z) is the q × k matrix with (i, j)th entry ⟨l/θi ( · , θ, z), ξj^θz ( · )⟩L2(Pθz).
The inverse of the covariance matrix above provides a lower bound for the asymptotic variance of regular asymptotic linear estimating sequences. Here we use the partial order of matrices (i.e., A ≥ B iff A − B is positive semidefinite).
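The Gram-Schmidt orthonormalization used above can be mimicked numerically. The following minimal sketch (an illustration only; the grid approximation of the inner product and the choice of a standard normal distribution are assumptions made here) orthonormalizes 1, f1(x) = x, f2(x) = x² with respect to a discretized L2(P) inner product:

```python
import numpy as np

def gram_schmidt_L2(fs, w):
    # Orthonormalize the arrays in fs with respect to the weighted inner
    # product <g, h> = sum_i g_i h_i w_i, a discrete stand-in for L2(P).
    basis = []
    for f in fs:
        v = np.asarray(f, dtype=float).copy()
        for e in basis:
            v -= np.sum(v * e * w) * e          # remove the projection on e
        v /= np.sqrt(np.sum(v * v * w))         # normalize
        basis.append(v)
    return basis

# Grid approximation of a standard normal P (an assumption for illustration).
x = np.linspace(-8.0, 8.0, 4001)
w = np.exp(-x ** 2 / 2.0)
w /= w.sum()
e0, e1, e2 = gram_schmidt_L2([np.ones_like(x), x, x ** 2], w)
# e0 is the constant 1, e1 is essentially x, and e2 is proportional to x**2 - 1.
```

The returned functions play the role of 1, ξ1, ξ2 in the construction above.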
(iii) For j = 1, . . . , q,
∂/∂θj ∫X ψi (x; θ) (dPθz/dλ)(x) λ(dx) = ∫X ψi (x; θ) ∂/∂θj {(dPθz/dλ)(x)} λ(dx) ;
Here ψ1 , . . . , ψq are the components of the function Ψ and Eθz (X) is the expectation of a
random vector X under Pθz .
Theorem 14 Under the semiparametric L2 - restricted model we have that any regular
estimating function Ψ : X × Θ −→ IRq can be expressed in the following form, for all
θ ∈ Θ and for i = 1, . . . , q,
ψi ( · ; θ) = Σ_{j=1}^{k} αij (θ) {fj ( · ) − Mj (θ)} .   (5.60)
The representation (5.60) can be written in matrix form in the following way, for each θ ∈ Θ,
Ψ( · ; θ) = α(θ){f ( · ) − M (θ)} ,
where α(θ) = [αij (θ)] is the q × k matrix of the coefficients in (5.60), f ( · ) = (f1 ( · ), . . . , fk ( · ))T and M (θ) = (M1 (θ), . . . , Mk (θ))T .
This can be easily seen by differentiating Ψ and observing that Eθz {f ( · ) − M (θ)} = 0.
The Godambe information of Ψ is given by, for each θ ∈ Θ and z ∈ Z,
JΨ (θ, z) := SΨ^T (θ, z) VΨ^{-1} (θ, z) SΨ (θ, z) = {∇M (θ)}T [Covθz {f ( · )}]^{-1} {∇M (θ)} .
The condition (iii) in the definition of regular estimating functions says that, for all (θ, z) ∈ Θ × Z, SΨ (θ, z) is non-singular. We have then rank(SΨ (θ, z)) = q. Since the rank of a product of matrices is at most the minimum of the ranks of the factors (see Mardia et al., 1979, page 464, formula A.4.2.d), α(θ) and ∇M (θ) must be of full rank (i.e. have rank equal to q).
We study now the roots of a regular estimating function Ψ based on a repeated independent sample x = (x1 , . . . , xn )T . The estimator of θ based on Ψ and x is given by θ̂ such that
Σ_{l=1}^{n} Ψ(xl ; θ̂) = 0 ,   (5.65)
which is equivalent to
0 = Σ_{l=1}^{n} α(θ̂){f (xl ) − M (θ̂)} = α(θ̂) Σ_{l=1}^{n} {f (xl ) − M (θ̂)} ≡ α(θ̂){ (1/n) Σ_{l=1}^{n} f (xl ) − M (θ̂) } = α(θ̂){IP n f (x) − M (θ̂)} .
Here IP n is the empirical sample operator given by IP n g(x) = (1/n) Σ_{l=1}^{n} g(xl ), for any g : X −→ IRq . Since the rank of α(θ̂) is q, we can construct a q × q matrix γ(θ̂) whose columns are q linearly independent columns of α(θ̂). The system (5.65) is equivalent to
0 = γ(θ̂){IP n f (x) − M (θ̂)} .
Since the rank of γ(θ̂) is q, the matrix γ(θ̂) is non-singular and the system above has the unique solution M (θ̂) = IP n f (x). Hence we have proved the following.
Theorem 15 Under a L2 - restricted model, the only possible estimator for the interest
parameter θ based on a regular estimating function and a sample x = (x1 , . . . , xn )T is
θ̂n = θ̂n (x) such that
M (θ̂n ) = IP n f (x) := (1/n) Σ_{l=1}^{n} f (xl ) .   (5.66)
In other words, under a L2 - restricted model, the moment estimator (5.66) is the only possible estimating sequence that can arise from the method of (regular) estimating sequences. This shows how poor this class of estimators is for the models under discussion. In particular, any optimality theory for estimating sequences (such as maximizing the Godambe information) is absolutely meaningless for L2 - restricted models.
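As a concrete illustration of theorem 15, the following minimal numerical sketch computes the moment estimator by solving M(θ̂) = IPn f(x). The location-scale choice f1(x) = x, f2(x) = x², with M(µ, σ) = (µ, µ² + σ²), and the normal sample are assumptions made only for this illustration:

```python
import numpy as np

def moment_estimator(x):
    # Empirical operator IP_n applied to f = (f1, f2), f1(x) = x, f2(x) = x^2.
    m1 = np.mean(x)
    m2 = np.mean(x ** 2)
    # Invert M(mu, sigma) = (mu, mu^2 + sigma^2) at the empirical moments.
    return m1, np.sqrt(max(m2 - m1 ** 2, 0.0))

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=10_000)
mu_hat, sigma_hat = moment_estimator(sample)
# mu_hat and sigma_hat are close to 2.0 and 1.5 at this sample size.
```

Any regular estimating function for this model leads, by theorem 15, to exactly this estimator.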
We give next conditions for consistency and asymptotic normality of the moment estimators arising from regular estimating functions.
Theorem 16 Consider a L2 - restricted model, a regular estimating function Ψ and the
estimator θ̂n (based on a sample of size n) given by the solution of (5.66) in theorem 15.
i) If the mapping θ ↦ M (θ) from Θ ⊆ IRq to IRq is invertible with continuous inverse, then {θ̂n } is consistent;
then {θ̂n } is consistent;
ii) If in addition to the assumptions of the previous item, each Mj ( · ) and αij ( · ) are
twice differentiable, then
√n (θ̂n − θ) −D→ N(0, JΨ^{-1}(θ, z)) .
Proof:
i) Let M^{-1} be the inverse of M , i.e. for all θ ∈ Θ, M^{-1}{M (θ)} = θ. The law of large numbers gives
IP n f −P→ Eθz (f ) = M (θ) .
Since M^{-1} is continuous,
θ̂n = M^{-1}{IP n f } −P→ M^{-1}{M (θ)} = θ .
ii) Take (θ, z) ∈ Θ × Z fixed. Note that by the assumptions the entries of Eθz {∇∇Ψ( · ; θ)} are in IR. Moreover, from the previous item, θ̂n −P→ θ. Using the results 4.9 and 4.10 in Jørgensen and Labouriau (1996, pages 10 and 171) we conclude the proof. □
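Part ii) of theorem 16 can be checked by simulation in the simplest case k = q = 1 with f1(x) = x and M(θ) = θ, where the moment estimator is the sample mean and JΨ^{-1}(θ, z) = Covθz(f1). The normal nuisance density below is an assumption made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, theta, sigma = 200, 2000, 0.0, 1.5
# Moment estimator theta_hat = sample mean; theorem 16 ii) predicts that
# sqrt(n) * (theta_hat - theta) is approximately N(0, sigma^2).
samples = rng.normal(theta, sigma, size=(reps, n))
roots = np.sqrt(n) * (samples.mean(axis=1) - theta)
# The empirical variance of `roots` is close to sigma**2 = 2.25.
```

The empirical distribution of `roots` matches the limiting normal law predicted by the theorem.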
Assuming M invertible and differentiable and applying the chain rule for the Hampel influence function (see Labouriau, 1989), we obtain that the Hampel influence function of the estimator given by (5.66) is
IF(x; θ) = {∇M (θ)}^{-1} {f (x) − M (θ)} .
Hence the Hampel influence function of an estimator derived from a regular estimating function is bounded if and only if the function f is bounded. That is, θ̂n is resistant to gross errors, i.e. B-robust, if and only if the function f is bounded. On the other hand, the Hampel influence function is continuous (in x) if and only if f is continuous, which is the condition for having bounded sensitivity to local shifts, i.e. V-robustness.
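The B-robustness criterion can be illustrated with a finite-sample sensitivity curve, the empirical counterpart of the Hampel influence function. For the moment estimator with unbounded f(x) = x (i.e. the sample mean) the curve is linear and unbounded in the contamination point, in agreement with the discussion above (the sample below is an illustrative assumption):

```python
import numpy as np

def sensitivity_curve(estimator, sample, x):
    # (n + 1) * (T(x_1, ..., x_n, x) - T(x_1, ..., x_n)): finite-sample
    # analogue of the Hampel influence function at the contamination point x.
    n = len(sample)
    return (n + 1) * (estimator(np.append(sample, x)) - estimator(sample))

rng = np.random.default_rng(2)
base = rng.normal(size=100)
# For the sample mean the curve equals x - mean(base): linear and unbounded
# in x, so the associated moment estimator is not B-robust.
curve = [sensitivity_curve(np.mean, base, x) for x in (0.0, 10.0, 1000.0)]
```

A bounded f would instead produce a bounded curve, i.e. a B-robust estimator.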
We show that the efficient score function is a generalized estimating function. Recall
that the efficient score function has representation
lE ( · ; θ, z) = A(θ, z)ξ θz ( · ) ,
where A(θ, z) is given by (5.58). On the other hand, f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ) generate the same space as the orthonormal basis {ξ1^θz ( · ), . . . , ξk^θz ( · )}. Then there exists a non-singular matrix, say B(θ, z), such that
ξ^θz ( · ) = B(θ, z){f ( · ) − M (θ)} .
Hence,
lE ( · ; θ, z) = A(θ, z)B(θ, z){f ( · ) − M (θ)} .
We conclude that lE ( · ; θ, z) is equivalent to {f ( · ) − M (θ)} provided A(θ, z) is of full rank. An application of proposition 17 in chapter 3 shows that in fact the semiparametric Cramér-Rao bound is attained by the estimating function given in theorem 15 (i.e. given by (5.66)), provided A(θ, z) is of full rank. We have then proved the following result.
Proposition 22 Suppose that for each (θ, z) ∈ Θ × Z the matrix A(θ, z) given by (5.58) is of full rank. Then the semiparametric Cramér-Rao bound is attained by the moment estimator given by (5.62).
The following construction gives a more precise result about the attainability of the semiparametric Cramér-Rao bound for L2 - restricted models. The first step is to obtain an alternative representation of regular estimating functions that will permit us to compare the Godambe information with the covariance of the efficient score function.
Take (θ, z) ∈ Θ × Z fixed. Define ξ^θz ( · ) = (ξ1^θz ( · ), . . . , ξk^θz ( · ))T , where 1, ξ1^θz , . . . , ξk^θz is the result of an orthonormalization process (with respect to the inner product of L2 (Pθz )) applied to 1, f1 , . . . , fk . That is, the equations of type (5.5) become, for each (θ0 , z0 ) ∈ Θ × Z and for i = 1, . . . , k,
∫X ξi^θz (x) (dPθ0z0/dλ)(x) λ(dx) = Mi^θz (θ0 ) .
Note that by construction Mi^θz (θ) = 0. From theorem 14, any regular estimating function Ψ evaluated at θ can be represented as
Ψ( · ; θ) = γ^θz (θ){ ξ^θz ( · ) − M^θz (θ) } = γ^θz (θ) ξ^θz ( · ) ,
where M^θz (θ0 ) = (M1^θz (θ0 ), . . . , Mk^θz (θ0 ))T . The sensitivity at (θ, z) is given by
SΨ (θ, z) = −γ^θz (θ) ∇M^θz (θ) ,
and the variability is
VΨ (θ, z) = γ^θz (θ) Covθz {ξ^θz } γ^θz (θ)T .
Comparing the Godambe information given above with the variance of the efficient score
function given by (5.59) and using theorem 16 yields the following result.
Theorem 17 Consider a L2 - restricted model for which the assumptions of theorem 16
(item ii)) hold. Then the bound for the asymptotic variance of regular asymptotic linear
estimating sequences is attained by a regular estimating function at (θ, z) ∈ Θ × Z if and
only if
∇M θz (θ)T = A(θ, z) , (5.67)
where A(θ, z) is the matrix defined in (5.58).
We will present one example where the semiparametric bound is not attained by estimators
derived from regular estimating functions. In fact, in the example the bound is not
attained even by regular asymptotic linear estimating sequences.
5.4.6 Examples
In this section the examples considered before are revisited.
Σ_{j=1}^{k} jα1j (µ, σ) m_{j−1} Σ_{j=1}^{k} jα2j (µ, σ) mj − Σ_{j=1}^{k} jα2j (µ, σ) m_{j−1} Σ_{j=1}^{k} jα1j (µ, σ) mj ≠ 0 .   (5.69)
Note that m3 (a) and ∆2 (a) depend on the fixed density a. For each (µ0 , σ0 ) ∈ IR × IR+ ,
M1^µσa (µ0 , σ0 ) := Eµ0σ0a {ξ1^µσa ( · )} = (µ0 − µ)/σ
and
M2^µσa (µ0 , σ0 ) := Eµ0σ0a {ξ2^µσa ( · )} = (1/∆2 (a)) { (σ0² + (µ0 − µ)²)/σ² − m3 (a)(µ0 − µ)/σ − 1 } .
Hence
∇M^µσa (µ0 , σ0 )|(µ0,σ0)=(µ,σ) = [ 1/σ , 0 ; −m3 (a)/(∆2 (a)σ) , 2/(∆2 (a)σ) ] = AT (µ, σ, a) .
We conclude from theorem 17 that the bound for regular asymptotically linear estimating sequences is attained by estimators based on regular estimating functions, for the semiparametric location and scale model with two standardized cumulants fixed. It is easy to see that the moment estimator associated with any regular estimating function in this example is given by the sample mean and the sample variance.
An alternative argument is that the global tangent space is the whole L20, which implies the efficiency of any regular asymptotic linear estimating sequence, in particular of the roots of regular estimating functions.
We study now the case where k = 3. The efficient score function has representation (5.70) with ξ^µσa ( · ) = (ξ1^µσa ( · ), ξ2^µσa ( · ), ξ3^µσa ( · ))T , where ξ1^µσa and ξ2^µσa are as in the case where k = 2 and
ξ3^µσa ( · ) = (1/∆3) [ (( · − µ)/σ)³ − ((m3 m4 − m3)/∆2²) (( · − µ)/σ)² − { m4 − m3 (m5 − m3 m4 − m3)/∆2² } (( · − µ)/σ) − { m3 − (m5 − m3 m4 − m3)/∆2² } ]
=: (1/∆3) { (( · − µ)/σ)³ + k2 (( · − µ)/σ)² + k1 (( · − µ)/σ) + k0 } .
(see chapter 4 for a detailed calculation). Here ∆3 is a constant depending on the standardized cumulants up to order 6 of the density a. Let us study the case where m3 = m5 = 0. We have then
ξ3^µσa ( · ) = (1/∆3) [ (( · − µ)/σ)³ − m4 (( · − µ)/σ) ] .
Moreover,
⟨ξ3^µσa ( · ), l/µ^µσa⟩µσa = (3σ² + 1)/∆3 .
and
(∂/∂µ0) M3^µσa (µ0 , σ) |µ0=µ = (1/∆3) { 3 − m4/σ } ,
which is not equal to ⟨ξ3^µσa ( · ), l/µ^µσa⟩µσa. We conclude from theorem 17 that the bound for regular asymptotically linear estimating sequences is not attained by estimators based on regular estimating functions. That is, the moment estimator associated with regular estimating functions does not generate an optimal regular asymptotic linear estimating sequence. Note that this conclusion extends easily to the case where k > 3.
f1 (x) = x1 , . . . , fq (x) = xq ,
and the following functions of the interest parameter on the right-hand side of (5.5),
M1 (µ) = µ1 , . . . , Mq (µ) = µq .
The orthonormal basis of Hk is composed of the functions
ξ1 ( · ) = (f1 ( · ) − µ1)/‖f1 − µ1‖L2(p) , . . . , ξq ( · ) = (fq ( · ) − µq)/‖fq − µq‖L2(p) .
The Fourier coefficients for the expansion of the efficient score function with respect to this basis are, for i, j = 1, . . . , q with i ≠ j,
⟨l/µi , ξj ⟩L2(p) = 0 ,
and, for i = 1, . . . , q,
⟨l/µi , ξi ⟩L2(p) =: ki ≠ 0 .
( k1 ( · − µ1 ), . . . , kq ( · − µq ) )T ,
where A(µ, p) is given by (5.58) and f and M are defined as before. Note that the (i, i)th entry (for i = 1, . . . , q) of the matrix A(µ, p) is given by
aii (µ, p) := ⟨ξi , l/µi ⟩µp = 1/‖fi ( · ) − µi ‖L2(p) .
Mi^µp (µ∗ ) = (µ∗i − µi)/‖fi ( · ) − µi ‖L2(p)
and
(∂/∂µi) Mi^µp (µ∗ ) |µ∗=µ = 1/‖fi ( · ) − µi ‖L2(p) .
Hence the diagonals of the matrices A and ∇M are equal. It is easy to see that both A and ∇M are diagonal matrices, and hence they are equal. We conclude from theorem 17 that the bound for regular asymptotically linear estimating sequences is attained by estimators based on regular estimating functions.
It is easy to see that for the multivariate location and shape model the orthonormal basis of Hk is composed of the functions
ξi ( · ) = (fi ( · ) − µi)/‖fi − µi ‖L2(p) ,  ξij ( · ) = (fij ( · ) − σij)/‖fij − σij ‖L2(p) ,
for i, j = 1, . . . , q, i ≥ j. Here fij (x) = xi xj and σij is the (i, j)th entry of the covariance matrix. Reasoning as in the case of the multivariate location model given above, we conclude that the semiparametric Cramér-Rao bound is attained by estimating sequences derived from regular estimating functions.
fij (z) = zi zj ,
5.5 Appendices
5.5.1 Technical proofs related to L2 - restricted models
Lemma 9 Under the L2 restricted semiparametric model, for each (θ, z) ∈ Θ × Z,
TN^1(θ, z) ⊆ Hk⊥ (θ, z) .
Proof: Take (θ, z) ∈ Θ × Z fixed and denote by p the density of Pθz . For simplicity we drop p, θ and z from the notation throughout this proof. We prove that TN0^1 ⊆ Hk⊥, which implies the lemma.
Suppose, for contradiction, that there is a ν ∈ TN0^1 (ν ≠ 0) such that ν ∉ Hk⊥. Since ν ∈ L20, ν ∈ Hk . That is, ν is of the form ν( · ) = αT f ( · ), for some α ∈ IRk and f ( · ) := ( f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ) )T . Since ν ∈ TN0^1, there exists a differentiable path {pt } ⊂ Pθ∗ and {rt } such that
and
From (5.71),
rt ( · ) = (pt ( · ) − p( · ))/(t p( · )) − αT f ( · ) .
Hence
‖rt ‖L1 = ‖ (pt ( · ) − p( · ))/(t p( · )) − αT f ( · ) ‖L1 ≥ ‖αT f ( · )‖L1 − ‖ (pt ( · ) − p( · ))/(t p( · )) ‖L1 ¹ = ‖αT f ( · )‖L1 > 0 .   (5.73)
¹ For, ‖f − g‖L1 = ∫ |f − g| p dλ ≥ ∫ (|f | − |g|) p dλ = ‖f ‖L1 − ‖g‖L1 .
Lemma 10 Under the semiparametric model with L2 restrictions, for each (θ, z) ∈ Θ×Z,
Hk⊥ (θ, z) ⊆ TN^2(θ, z) .   (5.74)
Proof: Take (θ, z) ∈ Θ × Z and ν ∈ Hk⊥ (θ, z) ∩ Cc fixed and denote by p the density of Pθz . Here Cc is the class of continuous, infinitely differentiable functions from X into IR with compact support. For simplicity we drop p, θ and z from the notation throughout this proof. Define for each t ∈ IR+ the generalized sequence {pt }t∈IR+ by
pt ( · ) = p( · ) + tp( · )ν( · ) .
We show that for t small enough pt ∈ Pθ∗ , i.e. ν is in the L2 tangent set at (θ, z), and we conclude that
Hk⊥ (θ, z) ∩ Cc ⊆ TN0^2(θ, z) .
Since Cc is dense in L2 (Pθz ) (because X is a locally compact Hausdorff space, see Rudin, 1987), taking the closure in the expression above yields (5.74) and proves the lemma.
We verify next that for t small enough pt satisfies the conditions (5.1)-(5.6). Since ν is continuous and compactly supported, it is bounded, and so is the restriction of the continuous function p to the support of ν. Hence ν( · )p( · ) is bounded. Moreover, the restriction of the strictly positive and continuous function p( · ) to the compact support of ν is bounded away from zero. Hence for t small enough pt is strictly positive, i.e. (5.1) holds.
Condition (5.2) holds for all t ∈ IR+ , because ν is continuous and infinitely differentiable. The function pt (for each t ∈ IR+ ) integrates to 1 because ν integrates to 0 with respect to p( · )λ(d · ).
The functions pt satisfy condition (5.4) because for each fj
∫X fj²(x)pt (x)λ(dx) = ∫X fj²(x)p(x)λ(dx) + t ∫supp(ν) fj²(x)ν(x)p(x)λ(dx) ≤ ∫X fj²(x)p(x)λ(dx) + tξ ∫X fj²(x)p(x)λ(dx) ,
where supp(ν) is the support of ν and ξ is an upper bound for the bounded function ν. The condition (5.5) holds for pt because ν is orthogonal to each fj (j = 1, . . . , k).
The following argument shows that for t small enough pt satisfies condition (5.6). We have, for i = 1, . . . , k,
| ∫X gi (x)pt (x)λ(dx) − ∫X gi (x)p(x)λ(dx) | = t | ∫X gi (x)ν(x)p(x)λ(dx) | ≤ tξ ∫X gi (x)p(x)λ(dx) ,
which implies that (5.6) holds for t small enough. □
Σ_{j=1}^{k} jα1j (µ, σ) m_{j−1} Σ_{j=1}^{k} jα2j (µ, σ) mj − Σ_{j=1}^{k} jα2j (µ, σ) m_{j−1} Σ_{j=1}^{k} jα1j (µ, σ) mj ≠ 0 .   (5.75)
Proof:
Let Ψ : IR×IR×IR+ −→ IR2 be a function in the form (5.68), with the functions αij (for
i = 1, 2 and j = 1, . . . , k) being differentiable and satisfying (5.75) for all (µ, σ) ∈ IR×IR+ .
We show that in this case Ψ is a regular estimating function.
Condition (i) holds because, for j = 1, . . . , k, the functions (( · − µ)/σ)^j − mj are in L20 (Pµσa ). The differentiability of the functions αij and of (( · − µ)/σ)^j − mj implies condition (ii).
We check the first part of condition (iii) by direct calculation (i.e. the part related to differentiability with respect to µ). Take i = 1, 2 and (µ, σ) ∈ IR × IR+ fixed. Since Ψ is unbiased,
(∂/∂µ) ∫ ψi (x; µ, σ) (1/σ) a((x − µ)/σ) λ(dx) = 0 .
Hence the proof of the first part of (iii) reduces to the verification of
∫ (∂/∂µ) { ψi (x; µ, σ) (1/σ) a((x − µ)/σ) } λ(dx) = 0 .   (5.76)
Note that
∫ (∂/∂µ) { ψi (x; µ, σ) } (1/σ) a((x − µ)/σ) λ(dx)
= ∫ (∂/∂µ) [ Σ_{j=1}^{k} αij (µ, σ) { ((x − µ)/σ)^j − mj } ] (1/σ) a((x − µ)/σ) λ(dx)
= Σ_{j=1}^{k} [ (∂/∂µ) αij (µ, σ) ] ∫ { ((x − µ)/σ)^j − mj } (1/σ) a((x − µ)/σ) λ(dx)
  + Σ_{j=1}^{k} αij (µ, σ) ∫ (−j/σ) ((x − µ)/σ)^{j−1} (1/σ) a((x − µ)/σ) λ(dx)
= Σ_{j=1}^{k} (−j/σ) αij (µ, σ) m_{j−1} ,   (5.77)
since each integral in the first sum vanishes. Similarly,
∫ ψi (x; µ, σ) (∂/∂µ) { (1/σ) a((x − µ)/σ) } λ(dx) = Σ_{j=1}^{k} (j/σ) αij (µ, σ) m_{j−1} .   (5.78)
Inserting (5.77) and (5.78) into the left-hand side of (5.76) leads to the conclusion that the equality (5.76) holds, and the proof of the first part of (iii) is concluded.
We verify now the second part of condition (iii). Since Ψ is unbiased,
(∂/∂σ) ∫ ψi (x; µ, σ) (1/σ) a((x − µ)/σ) λ(dx) = 0 .
Hence the proof of the second part of (iii) reduces to the verification of
∫ (∂/∂σ) { ψi (x; µ, σ) (1/σ) a((x − µ)/σ) } λ(dx) = 0 .   (5.79)
Proceeding in a similar way as in the proof of the first part of property (iii), after some routine calculations, one obtains
∫ (∂/∂σ) { ψi (x; µ, σ) } (1/σ) a((x − µ)/σ) λ(dx) = Σ_{j=1}^{k} (−j/σ) αij (µ, σ) mj   (5.80)
and
∫ ψi (x; µ, σ) (∂/∂σ) { (1/σ) a((x − µ)/σ) } λ(dx) = Σ_{j=1}^{k} (j/σ) αij (µ, σ) mj .   (5.81)
Using (5.80) and (5.81) in the left-hand side of (5.79) we conclude that (5.79) holds.
We verify property (iv). From the previous calculations (see (5.77) and (5.80)) we have
Eµσa {∇Ψ( · ; µ, σ)} = −(1/σ) [ Σ_{j=1}^{k} jα1j (µ, σ) m_{j−1} , Σ_{j=1}^{k} jα1j (µ, σ) mj ; Σ_{j=1}^{k} jα2j (µ, σ) m_{j−1} , Σ_{j=1}^{k} jα2j (µ, σ) mj ] .
The functions given in (5.82) are affinely independent in the sense of the lemma quoted below², i.e. there do not exist constants a1 , . . . , ak and b such that
Σ_{j=1}^{k} aj { ((x − µ)/σ)^j − mj } = b
with probability one.
The vectors (α11 (µ, σ), . . . , α1k (µ, σ)) and (α21 (µ, σ), . . . , α2k (µ, σ)) do not vanish, otherwise the determinant in (5.75) vanishes. Moreover, these vectors are not equal, otherwise the determinant given in (5.75) also vanishes.
This, together with the affine independence of the functions given in (5.82), implies that ψ1 ( · ; µ, σ) and ψ2 ( · ; µ, σ) are affinely independent. Since ψ1 ( · ; µ, σ) and ψ2 ( · ; µ, σ) are in L2 (Pµσa ), we conclude that the covariance matrix of ψ1 ( · ; µ, σ) and ψ2 ( · ; µ, σ) is positive definite. □
² Lemma 7.1 (Lehmann, 1983, page 124) - For any random variables X1 , . . . , Xn with finite second moments, the covariance matrix is positive semidefinite; it is positive definite if and only if the Xi are affinely independent, that is, there do not exist constants (a1 , . . . , ar ) and b such that Σ ai Xi = b with probability 1.
Chapter 6
Variants of the L2 - Restricted Models
6.1 Introduction
We study in this chapter two extensions of the L2 - restricted models, which we call extended L2 - restricted models and partial parametric models. The first is obtained by imposing on a L2 - restricted model an additional condition involving a function of the expected values of a group of functions. Examples of extended L2 - restricted models are the covariance selection model defined on a pure location model and regression models with a link. The second class of models considered here is constructed by assuming that the density is known (or known to lie in a parametric model) on a region of the sample space and belongs to a semiparametric L2 - restricted model outside this region. A typical example of a partial parametric model is given by the trimmed models derived from a location model.
As will be apparent from the text, the mathematical structure of the models considered in this chapter is more complex than that of the L2 - restricted models.
150 CHAPTER 6. VARIANTS OF THE L2 - RESTRICTED MODELS
and
∀z ∈ Z,  h( ∫X h1 (x, z)p(x)λ(dx), . . . , ∫X hs (x, z)p(x)λ(dx) ) ∈ H(θ) .   (6.2)
Clearly (6.3) is a condition of the type of (6.1) and (6.4) is of the type of (6.2). We
conclude that the covariance selection model defined on the location model is an extended
L2 - restricted model. We will work more with this example at the end of this section.
The calculation of the nuisance tangent spaces for an extended L2 - restricted model
is in general a difficult task. We give in the following a sequence of lemmas that will,
under regularity conditions, yield a characterization of the intersection (over the values
of the nuisance parameter) of the nuisance tangent spaces for each value of the interest
parameter. This will allow us to give a representation of the regular estimating functions
and to study the effect of the introduction of restrictions of the type of (6.1) and (6.2) on
the problem of estimation via estimating functions. The proofs of the lemmas are given in appendix 6.4.1.
It is convenient to introduce the following notation. For each (θ, z) ∈ Θ × Z denote
Hk (θ, z) = Hk = span{ f1 ( · ) − Eθz (f1 ), . . . , fk ( · ) − Eθz (fk ) }
and
(Hk + Br )(θ, z) = (Hk + Br ) = span{ f1 ( · ) − Eθz (f1 ), . . . , fk ( · ) − Eθz (fk ), b1 ( · ) − Eθz (b1 ), . . . , br ( · ) − Eθz (br ) } .
Moreover, Hk⊥ (θ, z) = Hk⊥ and (Hk + Br )⊥ (θ, z) = (Hk + Br )⊥ denote the orthogonal complements of Hk (θ, z) and (Hk + Br )(θ, z) in L20 (Pθz ), respectively. Note that Hk (θ, z) in fact does not depend on z, and we sometimes write simply Hk (θ).
Theorem 20 Under an extended L2 - restricted model that satisfies the condition given in lemma 11 iii),
i) For each θ ∈ Θ,
∩z∈Z TN^2⊥ (θ, z) = ∩z∈Z TN^1⊥ (θ, z) = Hk (θ) ;
ii) any regular estimating function Ψ, with components ψ1 , . . . , ψq , can be written in the form, for each θ ∈ Θ and i = 1, . . . , q,
ψi ( · ; θ) = Σ_{j=1}^{k} αij (θ) { fj ( · ) − Mj (θ) } .
The theorem shows that the estimation via regular estimating functions is not affected
by the introduction of the constraints given by (6.1) and (6.2).
A direct verification shows that conditions (6.5)-(6.7) hold for both p and q, however
Ep (bi ) 6= Eq (bi ), for i = 1, . . . , 4.
We conclude that theorem 20 holds for the covariance selection model described above
(it holds also for covariance selection models defined on a location model in general) and
hence the estimation of the mean via estimating functions is not affected by the constraints
defined via the inverse of the covariance matrix. This is in accordance with the literature
for covariance selection models.
There is a situation in which we can calculate the nuisance tangent space for an extended L2 - restricted model.
Example 11 (Location models with arbitrary tails) Suppose that X = IR, λ is the
Lebesgue measure on the real line, Θ = IR and for each θ ∈ IR
Pθ∗ = {p : X −→ IR such that (6.9)-(6.15) hold} .
Suppose furthermore that
∀x ∈ X , p(x) > 0 ;   (6.9)
p is of class C¹ ;   (6.10)
∫X p(x)λ(dx) = 1 ;   (6.11)
∫IR x² p(x)λ(dx) < ∞ ;   (6.12)
∫IR x p(x)λ(dx) = θ ;   (6.13)
∫IR { p′(x)/p(x) }² p(x)λ(dx) < ∞ ;   (6.14)
and
∀x ∈ Iθ , p(x) = fθ (x) ,   (6.15)
where Iθ = [θ − ξ1 , θ + ξ2 ] (for ξ1 > 0 and ξ2 > 0) and fθ ( · ) = f0 ( · − θ) for a given probability density f0 ( · ) with ∫IR x f0 (x)λ(dx) = 0.
The model described above is composed of distributions that coincide, on the central interval Iθ , with the location model generated by the density f0 , and possess free tails. If we define ξ1 and ξ2 in such a way that ∫_{−∞}^{−ξ1} f0 (x)λ(dx) = ∫_{ξ2}^{+∞} f0 (x)λ(dx) = α/2, for some α ∈ (0, 1), then the model is called the α-trimmed model.
The following theorem characterizes the nuisance tangent spaces and their orthogonal
complements in L20 for the partial parametric models. It is convenient to introduce the
following notation, for each θ ∈ Θ and z ∈ Z,
Iθc = X \ Iθ ;
Hk⊥ (θ, z) = Hk⊥ = { ν ∈ L20 (Pθz ) : ν ∈ Hk⊥ (θ, z) and supp(ν) ⊆ Iθc } ;
Hk (θ, z) = Hk = { Hk⊥ (θ, z) }⊥ .
Here supp(ν) is the support of the function ν ∈ L20 (Pθz ). We identify L2 functions that are almost surely equal, and we adopt the convention that supp(ν) ⊆ A means that ν( · )χAc ( · ) = 0, λ-almost everywhere.
the function Ei (θ) does not depend on the nuisance parameter. Furthermore, TN⊥ (θ, z) does not depend on the nuisance parameter z, and we sometimes write simply Hk (θ).
ii) Any regular estimating function Ψ, with components ψ1 , . . . , ψq , can be written in the form, for i = 1, . . . , q,
ψi ( · ) = ξi ( · ; θ) + Σ_{j=1}^{k} αij (θ) { fj ( · )χIθc ( · ) − Ej (θ) } ,   (6.16)
The regular estimating function Ψ in part ii) of the corollary above has the matrix representation
Ψ( · ; θ) = ξ( · ; θ) + α(θ) { f ( · )χIθc ( · ) − E(θ) } ,   (6.17)
where ξ( · ; θ) = (ξ1 ( · ; θ), . . . , ξq ( · ; θ))T , α(θ) = [αij (θ)]i,j and E(θ) = (E1 (θ), . . . , Ek (θ))T .
Theorem 23 Consider a partial parametric model. Suppose that the function E is dif-
ferentiable. Then we have for any regular estimating function with representation (6.17)
and for each θ ∈ Θ and z ∈ Z:
JΨ (θ, z) = Jξ (θ, z) + {∇E(θ)}T [Covθz (f χIθc )]^{-1} {∇E(θ)} .   (6.18)
b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating functions by taking
where A(θ, z) is the matrix formed by the aij (θ, z)'s and ξ^θz ( · ) = (ξ1^θz ( · ), . . . , ξq^θz ( · ))T .
The covariance of the efficient score function is given by,
The matrix γ^θz is the change-of-basis matrix. Now define the functions f1^θz , . . . , fk^θz : X −→ IR such that f^θz ( · ) = (f1^θz ( · ), . . . , fk^θz ( · ))T and f ( · ) = γ^θz f^θz ( · ). Clearly, for i = 1, . . . , k, ξi^θz ( · ) = fi^θz ( · )χIθc ( · ). For any θ0 ∈ Θ, z ∈ Z and i = 1, . . . , k, we have
∫X fi^θz (x) p(x; θ0 , z) λ(dx) = Mi^θz (θ0 ) ,   (6.23)
that is, the integral above does not depend on the choice of the nuisance parameter z. The
condition (6.23) can be used to characterize alternatively the model instead of condition
(5.5). The nuisance tangent spaces can be characterized alternatively in the following
way, for each (θ0 , z0 ) ∈ Θ × Z,
where
Ei^θz (θ0 ) = Mi^θz (θ0 ) − ∫Iθ0 fi^θz (x) (dPθ0z0/dλ)(x) λ(dx) .
where E^θz (θ0 ) = γ^θz E(θ0 ). Using the theory already developed, the Godambe information of Ψ at (θ, z) is given by
Theorem 24 Consider a partial parametric model. Suppose that the function E is differentiable. Then the semiparametric Cramér-Rao bound is attained by regular estimating functions at (θ, z) ∈ Θ × Z if and only if
Example 12 (Location models with arbitrary tails) Consider the location model with
arbitrary tails introduced in example 11. We have k = 1, f1 ( · ) = ( · ) and M1 (θ) = θ.
Suppose that ξ1 = ξ2 and that f0 ( · ) is symmetric about the origin. We have then,
E1 (θ) = θ − ∫_{θ−ξ1}^{θ+ξ1} x f0 (x − θ) λ(dx) = θ(1 − Q) ,
where Q = ∫_{−ξ1}^{ξ1} f0 (x)λ(dx). Any regular estimating function Ψ is of the form
Hence,
∇E^θz (θ)T = ∇E(θ)T {γ^θz }T = (1 − Q)/‖ ( · ) χIR\(θ−ξ1,θ+ξ1) ( · ) ‖L2(pθ) .
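The identity E1(θ) = θ(1 − Q) in example 12 can be verified numerically. A minimal sketch (taking f0 to be the standard normal density is an assumption made only for this illustration):

```python
import numpy as np

theta, xi = 2.0, 1.0
y = np.linspace(-12.0, 12.0, 24001)   # symmetric grid around 0
dy = y[1] - y[0]
f0 = np.exp(-y ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
inside = np.abs(y) <= xi              # the interval [-xi, xi]
Q = np.sum(f0[inside]) * dy           # mass of f0 on [-xi, xi]
# E1(theta) = theta - int_{theta-xi}^{theta+xi} x f0(x - theta) dx; the
# substitution x = y + theta turns this into int_{-xi}^{xi} (y + theta) f0(y) dy.
E1 = theta - np.sum((y[inside] + theta) * f0[inside]) * dy
# Since the odd part of the integrand integrates to zero, E1 = theta * (1 - Q).
```

The symmetry of f0 about the origin is what makes the odd part vanish, exactly as used in the example.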
6.4 Appendices
6.4.1 Technical proofs related to extended L2 - restricted models
Lemma 13 (lemma 11) Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,
i) (Hk + Br )⊥ (θ, z) ⊆ TN^2 (θ, z) ;
ii) TN^2⊥ (θ, z) ⊆ (Hk + Br )(θ, z) ;
iii) ∩z∈Z TN^2⊥ (θ, z) ⊆ ∩z∈Z (Hk + Br )(θ, z) = Hk (θ), provided for each bi (i = 1, . . . , r) there exist zi and wi ∈ Z such that Eθzi (bi ) ≠ Eθwi (bi ).
Proof:
i) Take (θ, z) ∈ Θ × Z fixed and ν ∈ (Hk + Br )⊥ (θ, z). Define for each t ∈ IR+ the
generalized sequence {pt }t∈IR+
pt ( · ) = p( · ) + tp( · )ν( · ) .
We show that for t small enough pt ∈ Pθ∗ . The functions pt (for t small) satisfy the
conditions (5.1)-(5.6) as verified in the proof of lemma 10. We check the conditions (6.1)
and (6.2). For i = 1, . . . , r,
Z Z Z
bi (x)pt (x)λ(dx) = bi (x)p(x)λ(dx) + t bi (x)ν(x)p(x)λ(dx)
X X X
Z
= bi (x)p(x)λ(dx)
X
span {bi ( · ) − Eθzi (bi )} ∩ span {bi ( · ) − Eθwi (bi )} = {0} . (6.25)
Hence,
∩z∈Z (Hk + Br )(θ, z) = Hk (θ) . □
Lemma 14 (lemma 12) Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,
i) TN^1 (θ, z) ⊆ Hk (θ)⊥ ;
ii) Hk (θ) ⊆ TN^1⊥ (θ, z) ;
iii) Hk (θ) ⊆ ∩z∈Z TN^1⊥ (θ, z) .
Proof:
i) The same argument as in lemma 9 yields this part of the lemma.
ii) Straightforward from i).
iii) Follows immediately from taking the intersection (over z ∈ Z) on both sides of ii) and observing that Hk (θ) does not depend on z. □
Theorem 25 (theorem 20) Under an extended L2 - restricted model that satisfies the condition given in lemma 11 iii),
i) For each θ ∈ Θ,
∩z∈Z TN^2⊥ (θ, z) = ∩z∈Z TN^1⊥ (θ, z) = Hk (θ) ;
ii) any regular estimating function Ψ, with components ψ1 , . . . , ψq , can be written in the form, for each θ ∈ Θ and i = 1, . . . , q,
ψi ( · ; θ) = Σ_{j=1}^{k} αij (θ) { fj ( · ) − Mj (θ) } .
Proof:
i) Take (θ, z) ∈ Θ × Z fixed. Since TN^2 (θ, z) ⊆ TN^1 (θ, z), taking intersections yields
∩z∈Z TN^2 (θ, z) ⊆ ∩z∈Z TN^1 (θ, z) .   (6.26)
Combining (6.26) with lemma 13 item iii) and lemma 14 item iii) implies the first part of theorem 25.
ii) Take (θ, z) ∈ Θ × Z and i ∈ {1, . . . , q} fixed. The result follows from item i) and from the fact that ψi ( · ; θ) ∈ ∩z∈Z TN^2⊥ (θ, z) (see chapter 3). □
Proof:
“⊆” Take ν ∈ TN0^1. Then, for each t, ν = (pt − p)/(tp) − rt , where pt ∈ Pθ∗ and rt −L1→ 0. Using the same argument given in the proof of lemma 9 we conclude that ⟨fi , ν⟩ = 0. Moreover,
b( ∫X b1 (x)pt (x)λ(dx), . . . , ∫X bs (x)pt (x)λ(dx) ) = b( ∫X b1 (x)p(x)λ(dx), . . . , ∫X bs (x)p(x)λ(dx) ) .
Hence, for i = 1, . . . , s,
⟨bi , ν⟩ = (1/t) { ∫X bi (x)pt (x)λ(dx) − ∫X bi (x)p(x)λ(dx) } − ∫X bi (x)rt (x)p(x)λ(dx) = − ∫X bi (x)rt (x)p(x)λ(dx) .
Using the same argument given in the proof of lemma 9 we conclude that hbi , νi = 0.
“⊇” Take ν ∈ [span{f1 , . . . , fk , g1 , . . . , gs }]⊥ ∩ Cc . Define, for each t ∈ IR+
pt ( · ) = p( · ) + tp( · )ν( · ) .
We show that for t small enough, pt ∈ Pθ∗ . Note that the argument given in the proof of
lemma 10 shows that the conditions (5.1)-(5.6) hold. To check condition (6.1), observe
that, for i = 1, . . . , s,
∫X bi (x)pt (x)λ(dx) = ∫X bi (x)p(x)λ(dx) + t ∫X bi (x)ν(x)p(x)λ(dx) = ∫X bi (x)p(x)λ(dx) ,
hence (6.1) holds. Reasoning in a similar way we conclude that (6.2) holds too, which concludes the proof. □
Proof: Take (θ, z) ∈ Θ × Z fixed. We suppress θ and z from the notation in the rest of this proof. Define
Fk = { fi ( · )χIθc ( · ) − Ei (θ) : i = 1, . . . , k } ∪ { f ∈ L20 (Pθz ) : supp(f ) ⊆ Iθ } .
We conclude that Fk⊥ ⊆ Hk⊥ . We check whether supp(ν) ⊆ Iθc . Using (6.27), for any
g ∈ L20 (Pθz ), hν, gχIθc iθz = 0. In particular ν is orthogonal to any function in an orthogonal
basis of the space of functions in L20 (Pθz ) with support contained in Iθ . We conclude that
ν( · )χIθ ( · ) ≡ 0, and hence supp(ν) ⊆ Iθc . u
t
Proof: The lemma follows from an argument similar to the proof of lemma 9 in chapter
5. □
Proof: Take ν ∈ Hk⊥ (θ, z) ∩ Cc . We prove that $\nu \in TN^{0}(\theta, z)$. Define, for each $t \in \mathbb{R}_+$,
$$p_t(\,\cdot\,) = p(\,\cdot\,) + t\,p(\,\cdot\,)\,\nu(\,\cdot\,)\,.$$
We show that for t small enough, pt ∈ Pθ∗ , which proves the lemma. The conditions
(5.1)-(5.6) are satisfied by pt , for t small enough, as shown in the proof of lemma 10.
The additional condition (6.8) holds because supp(ν) ⊆ Iθc (by the definition of Hk⊥ ). Hence
$H_k^\perp \cap C_c \subseteq TN^{0}$ and the lemma follows by taking the closure. □
Theorem 27 Consider a partial parametric model. Suppose that the function E is differentiable. Then, for any regular estimating function with representation (6.17) and for
each θ ∈ Θ and z ∈ Z:
a)
$$J_\Psi(\theta, z) = J_\xi(\theta, z)
+ \{\nabla E(\theta)\}\,\left\{\mathrm{Cov}_{\theta z}(f\chi_{I_\theta^c})\right\}^{-1}\{\nabla E(\theta)\}^{\top}\,. \qquad (6.29)$$
b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating
functions by taking
Proof: The first part follows immediately from the definition of Godambe information
and the representation (6.17).
We prove the second part. Note that the second term on the right-hand side of (6.29)
does not depend on the choice of Ψ; it depends only on the functions E and f, which are
characteristics of the model and not of the estimating function. Hence, to maximize the
Godambe information of a regular estimating function it is enough to maximize Jξ , i.e.
we can concentrate on the interval Iθ where the model is parametric. The rest of the
proof follows from the classical theory of estimating functions for semiparametric models;
see, for example, theorem 4.9 in Jørgensen and Labouriau (1996, pages 150-152). The
details are given below.
We prove that Jξ is maximized by
Define
Note that the differentiation under the integral sign performed above is allowed because
Ψ is a regular estimating function.
We have then
Hence
The last inequality in the expression above holds because the components of ξ lie
in the space spanned by the components of U. Equality in the last expression is
attained if ξ is equivalent to U. □
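The optimality statement in theorem 27 b) can be illustrated numerically. The sketch below is not the partial parametric setting of the theorem: it assumes a fully parametric scalar N(θ, 1) model and the textbook scalar Godambe information $J_\Psi = \{E\,\partial_\theta\psi\}^2 / \mathrm{Var}(\psi)$, and compares the score, which attains the bound, with a suboptimal unbiased estimating function:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
# Sample from the (hypothetical) model N(theta, 1).
x = rng.normal(theta, 1.0, size=200_000)

def godambe(psi, dpsi):
    """Monte Carlo estimate of the scalar Godambe information
    J = {E d(psi)/d(theta)}^2 / Var(psi)."""
    s = np.mean(dpsi)
    v = np.var(psi)
    return s * s / v

# Score of N(theta, 1): psi(x; theta) = x - theta (optimal; J = 1).
j_score = godambe(x - theta, -np.ones_like(x))

# A suboptimal unbiased estimating function: psi = (x - theta)^3.
# Here E d(psi)/d(theta) = -3 and Var(psi) = E z^6 = 15, so J = 9/15 = 0.6.
j_cubic = godambe((x - theta) ** 3, -3.0 * (x - theta) ** 2)

print(round(j_score, 2), round(j_cubic, 2))  # approximately 1.0 and 0.6
```

The ordering J(score) > J(cubic) is the finite-sample counterpart of the general principle used in the proof: estimating functions outside the span of the score lose Godambe information.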
Bibliography
[2] Barndorff-Nielsen, O.E. ; Jensen, J.L. and Sørensen, M. (1990). Parametric modelling
of turbulence. Phil. Trans. R. Soc. Lond. A 332, 439-445.
[3] Begun, J.M.; Hall, W.J.; Huang, W.M. and Wellner, J.A. (1983). Information and
asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11, 432–452.
[4] Bickel, P.J.; Klaassen, C.A.J.; Ritov, Y. and Wellner, J.A. (1993). Efficient and
Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press,
London.
[5] Billingsley, P. (1986). Probability and Measure. Second edition. John Wiley and Sons.
New York.
[6] Chow, Y.S. and Teicher, H. (1978). Probability Theory: Independence, Interchange-
ability, Martingales. Springer-Verlag, Heidelberg.
[8] Dieudonné, J. (1960). Foundations of Modern Analysis. Academic Press, New York.
[9] Dunford, N. and Schwartz, J.T. (1958). Linear Operators, Part I . Interscience, New
York.
[11] Fisher, R.A. (1934). Two new properties of mathematical likelihood. Proc. Royal Soc.
London Ser. A 144, 285–307.
[12] Godambe, V.P. (1960). An optimum property of regular maximum likelihood esti-
mation. Ann. Math. Statist. 31, 1208–1211.
[13] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating
equations. Biometrika 63, 277–284.
[14] Godambe, V.P. (1980). On sufficiency and ancillarity in the presence of a nuisance
parameter. Biometrika 67, 269–276.
[15] Godambe, V.P. (1984). On ancillarity and Fisher information in the presence of a
nuisance parameter. Biometrika 71, 626–629.
[16] Godambe, V.P. and Thompson, M.E. (1974). Estimating equations in the presence
of a nuisance parameter. Ann. Statist. 2, 568–571.
[17] Godambe, V.P. and Thompson, M.E. (1976). Some aspects of the theory of estimating
equations. J. Statist. Plann. Inference 2, 95–104.
[18] Hájek, J. (1962). Asymptotically most powerful rank-order tests. Ann. Math. Statist.
33, 1124–1147.
[20] Jørgensen, B. and Labouriau, R. (1995). Exponential Families and Theoretical Infer-
ence. Lecture notes at the University of British Columbia, Vancouver.
[22] Kendall, M.G. and Stuart, A. (1952). The Advanced Theory of Statistics. Vol. 1.
Charles Griffin, London.
[23] Kimball, B.K. (1946). Sufficient statistical estimation functions for the parameters
of the distribution of maximum values. Ann. Math. Statist. 17, 299–309.
[28] LeCam, L. (1966). Likelihood functions for large number of independent observations.
In: Research Papers in Statistics. Festschrift for J. Neyman (F.N. David, ed.), 167-
187, Wiley, London.
[29] Luenberger, D.G. (1969). Optimization by Vector Space Methods. John Wiley and Sons.
New York.
[30] Lukacs, E. (1975). Stochastic Convergence. Second Edition. Academic Press, Inc;
New York.
[31] Lundbye-Christensen (1991). A multivariate growth curve model for pregnancy. Bio-
metrics 47, 637-657.
[32] McLeish, D.L. and Small, C.G. (1987). The Theory and Applications of Statistical
inference Functions. Lecture Notes in Statistics 44, Springer-Verlag, New York.
[34] Pfanzagl, J. (1985). Asymptotic Expansions for General Statistical Models. Lecture
Notes in Statistics 31. Springer-Verlag.
[36] Plausonio, A. (1996) De Re Ætiopia. 74th edition. Editora Rodeziana, São Paulo-
Barcelona.
[37] Rudin, W. (1966). Real and Complex Analysis. McGraw-Hill, New York.
[40] Stein, C. (1956). Efficient nonparametric testing and estimation. Proc. Third Berkeley
Symp. Math. Statist. 1, 187-195, Univ. California Berkeley.
[41] Vaart, A.W. van der (1988). Estimating a real parameter in a class of semiparametric
models. Ann. Statist. 16, 1450–1474.
[42] Vaart, A.W. van der (1991). Efficiency and Hadamard differentiability. Scand. J.
Statist. 18, 63–75.
[43] Vaart, A.W. van der (1988). Statistical Estimation in Large Parameter Spaces. CWI
Tracts 44, Amsterdam.