
Estimating Functions and Semiparametric Models

Rodrigo Labouriau

December, 1996

Department of Biometry and Informatics, Foulum Research Center, Danish Ministry of Agriculture and Fisheries, and Department of Theoretical Statistics, Aarhus University.

To Luiz, Maria Léa and Luiza.



Acknowledgement
I have learned to do research from many people: my supervisor Professor Ole E. Barndorff-Nielsen, my previous supervisor at IMPA and co-worker Professor Bent Jørgensen, Professor Richard Gill, Professor Amari, from my childhood Professor Luiz Fernando Gouvêa Labouriau and Professor Maria Lea Salgado Labouriau, among others. I would like to thank all of them.
My research on estimating functions, one of the main topics of this thesis, was started under the supervision of Professor Bent Jørgensen at IMPA. When it was not possible to continue my formation at IMPA, Professor Barndorff-Nielsen offered me a place under his umbrella at the University of Aarhus. There my work gained special momentum, especially when working with Professor Barndorff-Nielsen and Professor Amari.
This work was conducted initially with financial support from the Conselho Nacional de Pesquisa - CNPq (Brazil); later on I received a grant from the European programme Human Capital and Mobility (HCM) at the MRI under the supervision of Professor Richard Gill (University of Utrecht). Partial financial support was also provided by Fundação Apolodoro Plausônio. I would like to thank the Research Centre Foulum, especially the Department of Biometry and Informatics, for providing facilities and for being flexible in allowing me time off for the preparation of this thesis. In particular, Aage Nielsen was always supportive and a good friend.
Contents

1 Introduction
   1.1 Semiparametric models
       1.1.1 Classical optimality theory and estimating functions for semiparametric models
       1.1.2 Some classes of semiparametric models
   1.2 Description of the thesis
   1.3 Basic set-up

I General Theory of Semiparametric Models

2 Path and Functional Differentiability
   2.1 Introduction
   2.2 Differentiable paths
       2.2.1 General definition of path differentiability
       2.2.2 Hellinger and weak path differentiability
       2.2.3 Lq path differentiability
       2.2.4 Essential path differentiability
       2.2.5 Tangent spaces and tangent sets
   2.3 Functional differentiability
       2.3.1 Definition and first properties of functional differentiability
       2.3.2 Asymptotic bounds for functional estimation
   2.4 Asymptotic bounds for semiparametric models

3 Estimating and Quasi Estimating Functions
   3.1 Introduction
   3.2 Basic definitions and properties
       3.2.1 Estimating and quasi-estimating functions
       3.2.2 First characterization of regular estimating functions
       3.2.3 Amari-Kawanabe's geometric characterization of regular estimating functions
   3.3 Optimality theory for estimating functions
       3.3.1 Classic optimality theory
       3.3.2 Lower bound for the asymptotic covariance of estimators obtained through estimating functions
       3.3.3 Attainability of the semiparametric Cramér-Rao bound
   3.4 Further aspects
       3.4.1 Optimal estimating functions via conditioning
       3.4.2 Generalized estimating functions

II Detailed Study of Some Classes of Semiparametric Models

4 Semiparametric Location and Scale Models
   4.1 Introduction
   4.2 Semiparametric Location-Scale Models
   4.3 Calculation of the nuisance tangent space
   4.4 Calculation of the efficient score function
       4.4.1 The case where the first two standardized cumulants are fixed
       4.4.2 The case where the first three standardized cumulants are fixed
   4.5 Discussion
   4.6 Appendices
       4.6.1 The Laplace transform and polynomial approximation in L2
       4.6.2 Calculation of the L2-nuisance tangent space at (0, 1, a)
       4.6.3 Calculation of the tangent space at an arbitrary point
       4.6.4 Calculation of the first four orthogonal polynomials in L2(a)

5 Semiparametric Models with L2 Restrictions
   5.1 Introduction
   5.2 Semiparametric Models with L2 Restrictions
   5.3 Examples
       5.3.1 Location-scale models
       5.3.2 Multivariate location-shape models
       5.3.3 Covariance selection models
       5.3.4 Linear structural relationships models (LISREL)
       5.3.5 Growth curve models
   5.4 Efficient Estimation
       5.4.1 Calculation of the tangent spaces
       5.4.2 Efficient score functions
       5.4.3 Characterization of the class of regular estimating functions
       5.4.4 Robustness of estimators derived from regular estimating functions
       5.4.5 Attainability of the semiparametric Cramér-Rao bound via estimating functions
       5.4.6 Examples
   5.5 Appendices
       5.5.1 Technical proofs related with L2-restricted models
       5.5.2 Proof of a theorem related with location and scale models

6 Variants of the L2-Restricted Models
   6.1 Introduction
   6.2 Extended L2-restricted models
   6.3 Partial parametric models
   6.4 Appendices
       6.4.1 Technical proofs related with extended L2-restricted models
       6.4.2 Technical proofs related with partial parametric models
Chapter 1

Introduction

1.1 Semiparametric models


A common problem in scientific research is to draw conclusions about some phenomena
on the basis of some observations. Statistics provides tools to support the procedures for
drawing such conclusions. In statistical models the observations are thought of as origi-
nating from a random mechanism governed by a certain probability law. The underlying
probability law is usually unknown and assumed to belong to a certain given class of laws, termed the statistical model or simply the model. The problem of modeling, in this
context, can be viewed as composed of a deductive process, where the class of supposed
possible underlying laws is determined, and an inductive process, where one tries to gain
information about the subjacent law on the basis of the observations. In this thesis we treat some techniques related to the inductive process mentioned, i.e. given certain classes of laws, we develop techniques for drawing inference about the specific law that generated the observations.
It occurs very often that the given class of subjacent laws (i.e. the model) is indexed
by a finite number of parameters. Those classes are called parametric models. In many
situations the parameters reflect the state of nature of some phenomena under study.
It is then of interest to obtain information about these characteristics or parameters. In
these cases, one may take advantage of the well developed inferential theory of parametric
statistical models. However, there are situations where, due to the nature of the problem or due to our lack of knowledge, one cannot deduce a reasonable parametric model.
A way to circumvent the lack of adequate parametric models is to use nonparametric models. Those models are, as the name says, composed of large classes of distributions that cannot be indexed by a finite number of parameters. Typical examples are models that only require the distributions to be symmetric or continuous. Nonparametric models are useful statistical tools; however, when using them one can typically draw inference only on some very basic aspects of the phenomena under study. In other words, the specificity of
interpretation of parametric models is necessarily lost when using classic nonparametric
models. That is the price one sometimes has to pay for dealing with extremely large
families of distributions.
Recently some attention has been devoted to a kind of nonparametric model that can
be placed in an intermediate position between the two extreme situations described above.
They are called by the suggestive name of semiparametric models. Roughly speaking,
semiparametric models are families of distributions indexed by a parameter that can
be decomposed into two sub-parameters: the first sub-parameter is called the interest
parameter and belongs to a finite dimensional space, the second sub-parameter is termed
the nuisance parameter and lives in an infinite dimensional space. Semiparametric models
are genuine nonparametric models in the sense that they cannot be indexed by any finite
dimensional parameter. What distinguishes semiparametric models from the classic nonparametric models is that in the former we can identify a finite dimensional interest parameter. This identification of the interest parameter is in principle guided by our
interest. In this way, we incorporate in the model parameters that reflect characteristics
of the phenomena under study via the interest parameter and keep the flexibility of the
nonparametric models to adapt to the unknown peculiarities of the phenomena studied.
Another distinctive characteristic of the theory of semiparametric models is the methods of statistical inference that are used. Typically, when dealing with semiparametric
models one tries to take advantage of the (partially) parametric structure of the model.
We discuss this point more precisely below.

1.1.1 Classical optimality theory and estimating functions for semiparametric models
There is a developed branch of the theory of parametric models that treats models with
(finite dimensional) nuisance parameters. Due to the nature of semiparametric models, it
is natural to hope that the techniques for dealing with parametric models with nuisance
parameters could be somehow adapted to semiparametric models. Unfortunately this is
only partially true. However there are some classic notions from the parametric theory,
such as the efficient score function and the efficient Fisher information, that possess an
adapted version in the modern theory of semiparametric models. The generalization of
these structures to semiparametric models generated a beautiful optimality theory that in
a certain sense is analogous to some classic theories for parametric models. We will review
this theory in the first part of this thesis. The theory we refer to here is called optimality
theory for regular asymptotic linear estimating sequences (see chapter 2 for details). It
should be noted that in spite of the elegance of the optimality theory mentioned above, it usually involves non-trivial work, even in some rather simple semiparametric models.

In the second part of this thesis we carry out all the computations necessary to apply
the optimality theory of regular asymptotic linear estimating sequences to some classes
of semiparametric models.
In this thesis we study also the use, in a context of semiparametric models, of a rela-
tively well developed technique for parametric models: the so-called estimating functions.
In the approach of estimating functions we consider estimators which can be expressed
as solutions of an equation such as

Ψ(x; θ) = 0 . (1.1)

Here Ψ is a function of the given data, say x, and the parameter, say θ, of a certain
statistical model. We call Ψ an estimating function, also known as an inference function
(the precise definition will be given later). Following the same procedure as in the classical theories, one introduces some constraints on the class of inference functions to be considered, gives a criterion for ordering the estimators obtained from the estimating functions in the restricted class, and chooses the uniformly best estimator. It is clear that
in most of the classical “well-behaved” cases, the maximum likelihood estimator is given
by the solution of an estimating equation. Moreover, the criterion for ordering inference
functions is closely related to the asymptotic variance of the associated estimators. In
that way, the approach of estimating equations can be viewed as a generalization of the
maximum likelihood theory. Due to the optimal behavior of the maximum likelihood in
“regular” cases, it is not surprising that the optimal inference function will give us ex-
actly the maximum likelihood estimator. However, there are some situations in which the
maximum likelihood theory fails and the estimating equation theory works well. More-
over, the theory of estimating equations provides alternative justifications for important
statistical techniques for parametric models, such as conditional inference (see Jørgensen
and Labouriau, 1995).
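To fix ideas, here is a minimal numerical sketch (not part of the thesis material; the sample and the choice Ψ(x; θ) = x − θ are illustrative assumptions) of how an estimator arises as the root of an estimating equation of the form (1.1):

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=200)   # a hypothetical sample

    def estimating_equation(theta):
        # Psi(x; theta) = x - theta summed over the sample; this is the
        # normal-model score equation, whose root is the sample mean.
        return np.sum(x - theta)

    theta_hat = brentq(estimating_equation, x.min(), x.max())
    # theta_hat agrees (up to numerical error) with x.mean(), the maximum
    # likelihood estimator in this "well-behaved" case.

In this simple case the root of the estimating equation reproduces the maximum likelihood estimator, in line with the remark above.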
Estimating functions have been used for inference with parametric models for a long
time. The earliest mention of the idea of estimating equations is probably due to Fisher
(1935) (he used the term “equation of estimation”). A remarkable example of an early
non-trivial use of inference functions can be found in Kimball (1946), where estimating
equations were used to give confidence regions for the parameters of the family of Gumbel
distributions (or extreme value distributions). There, the idea of “stable” estimating
equations, i.e. inference functions whose expectations are independent of the parameter,
was introduced, anticipating the theory of sufficiency and ancillarity for inference functions
proposed by McLeish and Small (1987) and Small and McLeish (1988). The theory of
optimality of inference functions appears in the pioneering paper of Godambe (1960). In
the same year Durbin (1960) introduced the notion of unbiased linear inference function
and proved some optimality theorems particularly suited to applications in time series
analysis. Since that time, the theory of inference functions has been developed a great
deal, both by Godambe (cf. Godambe, 1976, 1980, 1984; Godambe and Thompson, 1974, 1976), and by others in different contexts and with different names and approaches. We mention, for instance, the so-called theory of M-estimators developed in the seventies in order to obtain robust estimators, and the quasi-likelihood methods used in generalized
linear models. As one can see, the theory of inference functions was not only inspired by
an alternative optimality theory for point estimation. One could say that there is now a
firm and well established theory of inference functions for parametric models, with many
branches, some of them based on very deep mathematical foundations.
Estimating functions have been applied with relative success to estimation under para-
metric models with nuisance parameters (see Jørgensen and Labouriau, 1995 and the
references therein). It is then natural to ask whether this technique produces reasonable
results in a context of semiparametric models. We will show in this thesis that in fact the
class of estimators derived from estimating functions is rather limited for semiparametric
models. However, estimating functions will prove useful as auxiliary tools for obtaining efficient regular asymptotic linear estimating sequences.

1.1.2 Some classes of semiparametric models


We present next an example of a semiparametric model. Consider the following class of
probability distributions on the real line.
P = { Pθz : (dPθz/dµ)( · ) = z( · − θ), θ ∈ Θ = IR, z ∈ Z } .    (1.2)

Here Z is the class of probability densities z : IR −→ IR such that (1.3)-(1.6) given below
hold.
∀x ∈ IR, z(x) > 0; (1.3)
∫_IR z(x) dµ(x) = 1;    (1.4)

∫_IR x z(x) dµ(x) = 0, or equivalently ∫_IR x z(x − θ) dµ(x) = θ;    (1.5)

and

∫_IR x² z(x) dµ(x) < ∞.    (1.6)

Conditions (1.3) and (1.4) ensure that z is a density of a probability measure with support
equal to the whole real line. From condition (1.5) the parametrization (θ, z) ↦ Pθz is
identifiable, i.e. this map is one-to-one.

Clearly, the model described above is a semiparametric extension of the so-called location model. This model is an example contained in the main class of models studied
in this thesis. The models in the class referred to are constructed by assuming that the
expectations of a number of given square integrable functions are equal to some fixed functions of a finite dimensional interest parameter. In the example above this corresponds to
condition (1.5). These models are termed “L2 - restricted models” and they incorporate
semiparametric extensions of many important statistical models such as: multivariate
location and shape models, growth curve models, linear relationships models (LISREL)
and some graphical models.
Two other closely related classes of models are studied also. The first is obtained
by considering L2 - restricted models that coincide entirely with a parametric model
(parametrized exclusively by the interest parameter) in some given region of the sam-
ple space. The distributions are assumed to vary freely (subject only to the L2-restricted model constraints) outside those regions. Here an example is the semiparametric location
model described above with the additional condition that the densities of the distribu-
tions in the model coincide with the density φ of a normal distribution with mean θ and
variance σ² := ∫_IR x² φ(x) dµ(x) in the interval [θ − 2σ, θ + 2σ]. These models are called “partial parametric models”. The best known examples are the trimmed models and the
location model with free tails.
The second class of models is constructed by adding to an L2-restricted model constraints obtained by assuming that a (non-linear) function of the expectations of some square integrable functions equals a function of the interest parameter. These models are referred to as “extended L2-restricted models”. For example, one could additionally introduce the assumption that the coefficient of variation is constant in the semiparametric location model described above, i.e. insert the condition
√( ∫_IR x² z(x) dµ(x) ) / ∫_IR x z(x) dµ(x) = k .

Examples of extended L2 - restricted models are regression models with link and some
types of covariance selection models.

1.2 Description of the thesis


The thesis is divided into two parts. The first part treats some topics of the estimation theory for semiparametric models in general. There the classic optimality theory is reviewed and presented in a form suitable for the developments given later. Furthermore, the theory of estimating functions is developed in detail in a context of semiparametric models. To the knowledge of the author, no such systematic treatment of estimating functions for semiparametric models exists in the literature.
The second part studies some classes of semiparametric models described previously.
The material contained in this part of the thesis constitutes an original contribution.
There one can find a detailed characterization of the class of regular estimating functions, a calculation of efficient regular asymptotic linear estimating sequences (i.e. the
classical optimality theory) and a discussion of the attainability of the bounds for the con-
centration of regular asymptotic linear estimating sequences by estimators derived from
estimating functions.
There follows a more detailed description of the contents of each chapter of the thesis. Chapters 5 and 6 are intentionally described in more detail since they constitute the main contribution of the thesis.

Chapter 1 This chapter contains some introductory material, an overview of the thesis
and some notational conventions.

Part 1 - General Theory of Semiparametric Models

Chapter 2 In this chapter the classic optimality theory for non- and semiparametric
models is studied. A range of notions of path differentiability and tangent spaces are
introduced and their inter-relations studied. Next, some concepts of statistical functional differentiability are studied. Here the differentiability is considered relative to a pointed cone contained in the tangent space and not relative to the whole tangent space, as is customary in the literature. These cones are referred to as the tangent cones. The optimality theory of differentiable functionals is reviewed next. Again, the results are stated relative to the tangent cone and not with respect to the whole tangent space, as is usual. The
estimation of the interest parameter of semiparametric models is studied by applying
the optimality theory to a specially designed functional called the interest parameter
functional, which associates with each probability measure in the model in play the value of the interest parameter corresponding to it. An increasing range of tangent cones is considered. Here, the larger the tangent cone used, the sharper the bound for the concentration of regular estimators obtained. However, too large a tangent cone may imply that the interest parameter functional is differentiable only under somewhat stringent regularity
conditions on the model. It is shown how the imposition of such conditions, as usually done in the literature, can be avoided by adequate choices of tangent cones. The bound for the concentration of regular estimating sequences obtained with this choice of the
tangent cone is referred to as the semiparametric Cramér-Rao bound.

Chapter 3 The chapter extends the theory of estimating functions classically consid-
ered for parametric models to a context of semiparametric models. A class of regular
estimating functions (REF) is defined and characterized in two alternative forms. The
first characterization says essentially that the components of any REF are in the intersec-
tion (over the values of the nuisance parameter) of the so-called (strong or L2) nuisance
tangent spaces. This result is original, even though it can be found already in Jørgensen
and Labouriau (1995, chapter 4). The second characterization, shown to be equivalent
to the first, is a modification of the (informal) characterization recently given by Amari
and Kawanabe (1996) based on differential geometric considerations. The first character-
ization is used to obtain an optimality theory of REFs by using a projection technique.
It is proved that the semiparametric Cramér-Rao bound coincides with the bound for
the concentration of estimators based on REF if and only if the orthogonal complement
of the nuisance tangent space does not depend on the nuisance parameter. This result
is original and is used in the subsequent chapters to check in a range of semiparametric
models whether the semiparametric Cramér-Rao bound is attained by estimators based
on REFs. The chapter closes with a discussion of a generalization of the notion of REFs
in which the dependence on the nuisance parameter is allowed.
Most of the material presented in this chapter is the result of the author's work in recent years. The result concerning the coincidence of the semiparametric Cramér-Rao bound and the bound for the concentration of estimators based on REFs is original.

Part 2 - Detailed Study of Some Classes of Semiparametric Models

Chapter 4 The chapter studies a one dimensional semiparametric extension of the lo-
cation and scale model. The goal there is to gain intuition, and restrictions are introduced
largely in order to make the basic theory work in a simple way. The treatment of the
location and scale model is refined later in chapter 5. The basic computations necessary
for obtaining efficient regular asymptotic linear estimating sequences (RALES) are given
in detail for location and scale models where a number of standardized cumulants are
fixed. The main point there is the calculation of a version of the efficient score function.
The efficient score function does not depend on the nuisance parameter in the case where
only the first two standardized cumulants are fixed and its roots are the sample mean
and the sample standard deviation. In that case, the efficient score function is a regular
estimating function (REF) and its roots provide efficient RALES, or in other words attain
the semiparametric Cramér-Rao bound. In the case where more than two standardized
cumulants are fixed the semiparametric Cramér-Rao bound is not always attained by
estimators based on REFs. Moreover, the efficient score function does depend on the
nuisance parameter but only through an intermediate finite dimensional parameter, sug-
gesting the use of some “plug-in” estimating procedure for obtaining efficient estimation.
The location and scale models considered in this chapter do not incorporate the assump-
tion of symmetry of the distributions, as occurs in the previous treatments of similar
models found in the literature. However, strong conditions on the behavior of the tails
of the distributions and the Laplace transform of some given functions of the density are
imposed. This will allow the calculations to be performed with the help of polynomial
expansions, a technique used in the early stages of the work. These technical restrictions
are eliminated in chapter 5 in a more general context. An auxiliary technical condition
for having the class of polynomials dense in L2 is given in the appendices of the chapter.
The chapter essentially reports joint work with Professor Shun-Ichi Amari (University of Tokyo) and Professor Ole E. Barndorff-Nielsen (University of Aarhus), developed in a preliminary stage of the thesis work.

Chapter 5 This chapter studies the main classes of models considered in the thesis,
the L2 - restricted semiparametric models which we define next. The results presented in
the chapter are original and will be presented with relatively more detail than those of the previous chapters.
Consider a measurable space (X , B) on which a σ-finite measure λ and a family of
probability measures P are defined. The sample space X is assumed to be a locally
compact Hausdorff space, typically a Euclidean space.
The family P is parametrized in the following way
P = {Pθz : θ ∈ Θ , z ∈ Z} . (1.7)
Here Θ is an open set of IRq and Z is a class of arbitrary nature, typically infinite
dimensional. We consider z as a nuisance parameter and θ as a parameter of interest
which we want to estimate. It is assumed that the parametrization of P is identifiable,
i.e. the mapping (θ, z) 7→ Pθz is a bijection between Θ × Z and P. Each element of the
family P is dominated by λ and has support equal to the whole sample space X . We can
then represent P alternatively by
P∗ = { p( · ; θ, z) = (dPθz/dλ)( · ) : θ ∈ Θ, z ∈ Z } .    (1.8)

It is assumed without loss of generality that the versions of the Radon-Nikodym derivative
used in (1.8) to define p( · ; θ, z) : X −→ IR+ are strictly positive, i.e. for all x ∈ X , θ ∈ Θ
and z ∈ Z, p(x; θ, z) > 0.

It is convenient for the characterization of P given next to introduce the following submodels. For each θ0 ∈ Θ define the submodel

Pθ0 = {Pθ0z : z ∈ Z} .    (1.9)

We introduce also the notation P∗θ0 to denote the class of densities of the probability measures of Pθ0.
Now we can introduce the kind of models studied in the chapter. Suppose that there
are k, m ∈ N ∪ {0}, l ∈ N ∪ {0, ∞}, functions f1 , . . . , fk : X −→ IR and functions
g1 , . . . , gm : X × Z −→ IR such that for each θ0 ∈ Θ the submodel P∗θ0 is the class of functions p : X −→ IR+ for which the conditions (1.10)-(1.15) hold. The conditions
referred to are, for j = 1, . . . , k and i = 1, . . . , m:
∀x ∈ X , p(x) > 0; (1.10)

p is of class C^l;    (1.11)

∫_X p(x) λ(dx) = 1;    (1.12)

∫_X fj²(x) p(x) λ(dx) < ∞;    (1.13)

∫_X fj(x) p(x) λ(dx) = Mj(θ0);    (1.14)

for each z ∈ Z, ∫_X gi(x, z) p(x) λ(dx) ∈ Bi(θ0),    (1.15)

where Mj(θ0) ∈ IR and Bi(θ0) is a given open real set. We refer to the models of the form described above as L2-restricted semiparametric models or simply L2-restricted models.
The conditions (1.10) and (1.12) ensure that p is a probability density of a distribution
with support equal to the whole sample space X . Conditions (1.13)-(1.15) are used to
restrict each submodel Pθ0 (and consequently shrink the model P). These conditions could be used to express partial a priori knowledge about the phenomena we study or
to ensure some desirable mathematical characteristics of the model, such as identifiability
and regularity of the partial score functions. For instance, the conditions (1.15) can be
used to ensure that the partial score functions are in L2. Condition (1.11) can be assumed to hold apart from a λ-null set.
The following examples of L2 - restricted models are considered in detail in the thesis:
location and scale (without the restrictive conditions on the tails and existence of the
Laplace transform used in chapter 4), multivariate location and shape models, covariance
selection models defined on the location and shape model, linear structural relationship
models (LISREL) and some growth curve models with modeled covariance structure.
The covariance selection models are distinguished among the other examples because the functions used in the restrictions are not polynomials.
It is shown that all the notions of nuisance tangent spaces considered in the first
part coincide and are equal to the orthogonal complement of the L2 - closure of the space
spanned by the functions f1 − E(f1 ), . . . , fk − E(fk ). More precisely, define

Hk (θ) = span {f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ)} .

The orthogonal complement of Hk (θ, z) in L20 (Pθz ) is denoted by Hk⊥ (θ, z) and it is equal
to the nuisance tangent space. Note that Hk depends only on θ, which in the light of
the theory developed in chapter 3 implies that the semiparametric Cramér-Rao bound
coincides with the bound for REFs.
The next step is to calculate the efficient score function by projecting the partial
score function onto the orthogonal complement of the nuisance tangent space. To do so,
consider the result of a Gram-Schmidt orthonormalization process, in the space L2 (Pθz ),
applied to the functions 1, f1, . . . , fk and denoted by 1, ξ1^θz, . . . , ξk^θz. Since ξ1^θz, . . . , ξk^θz form an orthonormal basis of Hk = TN⊥(θ, z), the projection of l/θi( · ; θ, z) onto TN⊥(θ, z) is, for i = 1, . . . , q,

lE/θi( · ; θ, z) = Σ_{j=1}^k ⟨ l/θi( · ; θ, z), ξj^θz( · ) ⟩_{L2(Pθz)} ξj^θz( · ) ,    (1.16)

where ⟨ · , · ⟩_{L2(Pθz)} is the inner product of L2(Pθz). The representation above can be written in matricial form in the following way, for each (θ, z) ∈ Θ × Z,

lE( · ; θ, z) = A(θ, z) ξ^θz( · ) ,    (1.17)

where ξ^θz( · ) = ( ξ1^θz( · ), . . . , ξk^θz( · ) )^T and A(θ, z) = [ ⟨ξi, l/θj⟩_θz ] (with rows j = 1, . . . , q and columns i = 1, . . . , k) is a q × k matrix.
The covariance matrix of the efficient score function is given by
Covθz{ lE( · ; θ, z) } = A(θ, z) A^T(θ, z) .    (1.18)

The inverse of the matrix A(θ, z)A^T(θ, z) provides a lower bound for the asymptotic variance of regular asymptotic linear estimating sequences, i.e. the semiparametric Cramér-Rao bound. Here we use the partial order of matrices (i.e. A ≥ B iff A − B is positive
semidefinite).
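To make the construction above concrete, the following sketch (not from the thesis; the reference law and the choices f1(x) = x, f2(x) = x² are assumptions made for illustration) performs the Gram-Schmidt orthonormalization of 1, f1, f2 in L2(Pθz), with inner products approximated by Monte Carlo averages:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100_000)           # sample approximating P_(theta,z)
    funcs = [np.ones_like(x), x, x**2]     # the functions 1, f1, f2

    def inner(u, v):
        # Monte Carlo approximation of <u, v>_{L2(P)} = E{u(X) v(X)}
        return np.mean(u * v)

    xis = []
    for f in funcs:
        for xi in xis:
            f = f - inner(f, xi) * xi      # remove the projection onto xi
        xis.append(f / np.sqrt(inner(f, f)))
    # xis[1], xis[2] are approximately orthonormal and mean zero under P;
    # projecting the partial score onto them yields (1.16).

Under the standard normal reference law the procedure recovers (approximately) the Hermite polynomials x and (x² − 1)/√2 as ξ1 and ξ2.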

The class of regular estimating functions is studied next. Using the first characteri-
zation of estimating functions given in chapter 3 it is shown that any regular estimating
function can be represented in the following way,
Ψ( · ; θ) = α(θ){f ( · ) − M (θ)} , (1.19)
where α(θ) is a q×k matrix, f ( · ) = (f1 ( · ), . . . , fk ( · ))T and M (θ) = (M1 (θ), . . . , Mk (θ))T .
It can be shown from the properties of the regular estimating functions that the only pos-
sible root of any REF, under a repeated sampling scheme with sample x = (x1 , . . . , xn )T ,
is the solution of the system

0 = IPn f(x) − M(θ̂) ,

where IPn f(x) = (1/n) Σ_{i=1}^n f(xi) denotes the empirical mean.
In other words, there is one and only one moment estimator associated with any REF. The
basic properties of this moment estimator, such as consistency and asymptotic normality,
are studied. The Hampel influence function of the moment estimator referred to is given
by

IFH (x; p, θ̂n ) = ∇M (θ){f (x) − M (θ)} .


Hence the Hampel influence function of an estimator derived from a regular estimating function is bounded if and only if the function f is bounded. That is, θ̂n is resistant to gross errors, i.e. B-robust, if and only if the function f is bounded. On the other hand, the Hampel influence function is continuous (in x) if and only if f is continuous, and that is the condition for having bounded sensitivity to lateral shifts, i.e. V-robustness.
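As a numerical illustration (under the assumed choices f(x) = (x, x²) and M(θ) = (θ1, θ1² + θ2²), corresponding to the location-scale case of chapter 4 with the first two standardized cumulants fixed), the moment estimator determined by a REF can be computed as follows:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_gamma(3.0, size=500)   # any sample with finite variance

    # Solve IP_n f(x) = M(theta_hat) with f(x) = (x, x^2) and
    # M(theta) = (theta_1, theta_1^2 + theta_2^2):
    m1, m2 = x.mean(), np.mean(x**2)        # empirical moments IP_n f
    theta1_hat = m1                         # location: the sample mean
    theta2_hat = np.sqrt(m2 - m1**2)        # scale: the sample std deviation

The roots are exactly the sample mean and the (uncorrected) sample standard deviation, and since f(x) = (x, x²) is continuous but unbounded, the resulting estimator is V-robust but not B-robust by the criteria just stated.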
The attainability of the semiparametric Cramér-Rao bound is discussed by comparing
the asymptotic variance associated with the moment estimator derived from the REFs
with the inverse of the covariance of the efficient estimating function. It is proved that
under a semiparametric L2 - restricted model the semiparametric Cramér-Rao bound is
attained by estimators derived from REFs if and only if
∇M^θz(θ)^T = A(θ, z) .    (1.20)
Using this result the attainability of the semiparametric Cramér-Rao bound is discussed in
many examples of L2 - restricted models. In particular it is shown that the semiparametric
Cramér-Rao bound is not attained by any estimator derived from REF in some cases of
the location-scale model with the third standardized cumulant fixed.
The chapter also presents a complete determination of the class of REFs for the semiparametric location-scale models with some standardized cumulants fixed.

Chapter 6 In this chapter two extensions of the L2-restricted models are studied. We term these extended L2-restricted models and partial parametric models.

Extended L2-restricted models   Consider a model defined as an L2-restricted model with the additional restrictions given by (1.21) and (1.22) below. Assume that there are r, s ∈ N ∪ {0}, functions b1, . . . , br : X −→ IR in L2(p), h1, . . . , hs : X × Z −→ IR, b : IR^r −→ IR and a continuous function h : IR^s −→ IR such that

b( ∫_X b1(x)p(x)λ(dx), . . . , ∫_X br(x)p(x)λ(dx) ) = G(θ)    (1.21)

and

∀z ∈ Z, h( ∫_X h1(x, z)p(x)λ(dx), . . . , ∫_X hs(x, z)p(x)λ(dx) ) ∈ H(θ) .    (1.22)

Here G(θ) ∈ IR and H(θ) is an open set in IR.


The model described above is called an extended L2-restricted model. Excluding the trivial choices for b and h, the conditions (1.21) and (1.22) cannot be expressed in terms of the conditions of the L2-restricted models. Hence the model described above is indeed an
extension of the class of L2 - restricted models. Examples of these models are: covariance
selection models on which the covariance matrix is not part of the interest parameter,
regression models with link function and some proportional odds models.
It is convenient to introduce the following notation. For each (θ, z) ∈ Θ × Z denote
Hk(θ, z) = Hk = span{ f1( · ) − Eθz(f1), . . . , fk( · ) − Eθz(fk) }

and

(Hk + Br)(θ, z) = (Hk + Br) = span{ f1( · ) − Eθz(f1), . . . , fk( · ) − Eθz(fk), b1( · ) − Eθz(b1), . . . , br( · ) − Eθz(br) } .
Moreover, let Hk⊥(θ, z) = Hk⊥ and (Hk + Br)⊥(θ, z) = (Hk + Br)⊥ denote the orthogonal complements of Hk(θ, z) and (Hk + Br)(θ, z) in L20(Pθz), respectively. Note that Hk(θ, z) in fact does not depend on z, and we sometimes write simply Hk(θ).
It is difficult, in general, to calculate the nuisance tangent spaces of extended L2 -
restricted models, but the following result is proved. Under weak regularity conditions
(that hold in all the examples considered) one has:
i) For each θ ∈ Θ,

∩_{z∈Z} TN^2⊥(θ, z) = ∩_{z∈Z} TN^1⊥(θ, z) = Hk(θ) .

ii) If Ψ : X × Θ −→ IR^q is a regular estimating function for θ, with components ψ1, . . . , ψq, then for all θ ∈ Θ and i ∈ {1, . . . , q} we have the representation

ψi( · ; θ) = Σ_{j=1}^k αij(θ){ fj( · ) − Mj(θ) } .

The result is used to characterize the class of REFs for the covariance selection models
on which the covariance matrix is not part of the interest parameter. Moreover, the
estimators derived from REFs are shown to be necessarily moment estimators of the
same type as the moment estimators derived from REFs for the case of the L2 - restricted
model considered in chapter 5. The following result proved in the chapter can be used
to calculate precisely the nuisance tangent spaces for the example of regression with link
function. If, in addition to the previous assumptions, one assumes that the function b is injective, then

TN(θ, z) = [ span{ f1 − E(f1), . . . , fk − E(fk), g1 − E(g1), . . . , gs − E(gs) } ]⊥ .

Partial parametric models   Studied next are semiparametric models defined by imposing a restriction of a nature different from that of the restrictions considered previously in the thesis. The restriction says essentially that the distribution is known (or known to be in a certain parametric model) in a certain region of the sample space. A typical example is the trimmed model.
The precise definition is given next. Consider the class of probability measures
P = {Pθz : θ ∈ Θ ⊆ IRq , z ∈ Z} = ∪θ∈Θ Pθ
on (X , A), as before. Suppose that for each θ ∈ Θ there is a function fθ : X −→ IR+ and
a measurable set Iθ such that the class of densities P∗θ is given by

P∗θ = { p : X −→ IR such that (1.10)-(1.15) and (1.23) hold } .
The additional condition in the definition above says that the density p is equal to fθ in
the set Iθ . More precisely, consider the condition
∀x ∈ Iθ, p(x) = fθ(x), λ-a.e.    (1.23)
The family P is termed a partial parametric model. Note that if (1.23) is suppressed from the definition above, the model becomes an L2-restricted model.
The following sequence of results extends the theory developed in chapter 5 for L2-restricted models to the context of partial parametric models. It is convenient to introduce the following notation, for each θ ∈ Θ and z ∈ Z:
Iθc = X \ Iθ ;

ℋk⊥(θ, z) = ℋk⊥ = { ν ∈ L20(Pθz) : ν ∈ Hk⊥(θ, z) and supp(ν) ⊆ Iθc } ;

ℋk(θ, z) = ℋk = { ℋk⊥(θ, z) }⊥ ,

where Hk⊥(θ, z) denotes, as before, the orthogonal complement of Hk(θ, z) in L20(Pθz).

Here supp(ν) is the support of the function ν ∈ L20 (Pθz ). We identify the L2 functions
that are almost surely equal and adopt the convention that supp(ν) ⊆ A means that
ν( · )χAc ( · ) = 0, λ-almost everywhere.
Theorem 1 Under a partial parametric semiparametric model, for each θ ∈ Θ, z ∈ Z and m ∈ [1, ∞],

i) T̄N^m(θ, z) = ℋk⊥(θ, z);

ii) T̄N^m⊥(θ, z) = ℋk(θ, z). Moreover,

ℋk(θ, z) = span[ { fi( · )χIθc( · ) − Ei(θ) : i = 1, . . . , k } ∪ { f ∈ L20(Pθz) : supp(f) ⊆ Iθ } ] ,

where Ei(θ) = ∫_Iθc fi(x) (dPθz/dλ)(x) λ(dx).

Note that the function Ei(θ) in fact does not depend on the nuisance parameter, and we sometimes write simply ℋk(θ).

Corollary 1 Under a partial parametric semiparametric model, for each θ ∈ Θ, z ∈ Z and m ∈ [1, ∞],

i) ∩_{z∈Z} T̄N^m⊥(θ, z) = ℋk(θ);

ii) Any regular estimating function Ψ, with components ψ1, . . . , ψq, can be written in the form, for i = 1, . . . , q,

ψi( · ) = ξi( · ; θ) + Σ_{j=1}^k αij(θ){ fj( · )χIθc( · ) − Ej(θ) } ,    (1.24)

where supp(ξi) ⊆ Iθ and, for i = 1, . . . , q and j = 1, . . . , k, αij(θ) ∈ IR.

The regular estimating function Ψ in part ii) of the corollary above has the matricial representation

Ψ( · ; θ) = ξ( · ; θ) + α(θ){ f( · ) − E(θ) } ,    (1.25)

where ξ( · ; θ) = (ξ1( · ; θ), . . . , ξq( · ; θ))^T, α(θ) = [αij(θ)]_{i,j} and E(θ) = (E1(θ), . . . , Ek(θ))^T.

Theorem 2 Consider a partial parametric model. Suppose that the function E is dif-
ferentiable. Then we have for any regular estimating function with representation (1.25)
and for each θ ∈ Θ and z ∈ Z:

a) The Godambe information is given by

JΨ(θ, z) = Jξ(θ, z) + {∇E(θ)}^T { Covθz(f χIθc) }^{−1} {∇E(θ)} .    (1.26)

b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating functions by taking

ξ( · ; θ) = {∇ log fθ( · )} χ_{int(Iθ)}( · ) .    (1.27)
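(For reference, we recall the standard definition of the Godambe information, which is assumed here to agree with the definition adopted in chapter 3: for an estimating function Ψ with sensitivity SΨ(θ, z) = Eθz{∇θ Ψ( · ; θ)} and variability VΨ(θ, z) = Covθz{Ψ( · ; θ)}, the Godambe information is

JΨ(θ, z) = SΨ(θ, z)^T { VΨ(θ, z) }^{−1} SΨ(θ, z) ,

whose inverse is the asymptotic covariance matrix of the estimator determined by Ψ, when the sensitivity is invertible.)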

Finally, the following theorem, which solves the problem of optimality, is proved.

Theorem 3 Consider a partial parametric model. Suppose that the function E is differentiable. Then the semiparametric Cramér-Rao bound is attained by regular estimating functions at (θ, z) ∈ Θ × Z if and only if

∇E^θz(θ)^T ( = ∇E^θz(θ)^T {γ^θz}^T ) = A(θ, z)  and  ξ( · ; θ) = l/θ( · ; θ) χIθ( · ) .



1.3 Notation and basic set-up


We describe next the basic setup used throughout this thesis. Let us consider a family
of probability measures P defined on a common measurable space (X , A). It is assumed
that the elements of P possess a common support, say X , and that there exists a σ-finite
measure λ defined on (X , A) such that each member of P is absolutely continuous with re-
spect to λ. The family P defines the basic model on which we construct the mathematical
machinery necessary to attack later the problems of estimation under the semiparametric
models we have in mind. In this way, even though no parametric structure is assumed
in P, we will not construct a theory of nonparametric inference in full generality. It will
be seen that the assumption of fixed support and the existence of a common dominating
measure, usually avoided in the ordinary theory of nonparametric models, will provide us with a simple mathematical environment suitable for our purposes. Clearly, each P ∈ P is
identified with a version of its Radon-Nikodym derivative with respect to λ. Denote the
class of these densities by
P∗ = { (dP/dλ)( · ) : P ∈ P } .

We shall use capital letters to denote the elements of P and small letters to represent the
elements of P ∗ . Without loss of generality we assume that the versions of the Radon-
Nikodym derivatives used are such that for all P ∈ P and for each x ∈ X,

p(x) = (dP/dλ)(x) > 0 .    (1.28)

Moreover, when treating elements of P∗ it will be assumed tacitly that λ-almost everywhere equal functions are identified.
We introduce next the notation for Lq spaces used throughout. For q ∈ [1, ∞), the Lq space with respect to the probability measure P ∈ P with density p ∈ P∗ will be denoted by

Lq(P) = Lq(p) = { f : X −→ IR : ∫_X |f(x)|^q p(x) λ(dx) < ∞ } .

The usual norm of Lq(p) will be denoted by ‖ · ‖_{Lq(p)}. In the special case of the Hilbert space L2(p), the natural inner product will be denoted, for all f, g ∈ L2(p), by

⟨f, g⟩_P = ⟨f, g⟩_p = ∫_X f(x) g(x) p(x) λ(dx) .

For notational simplicity, sometimes the norm of L2(p) is denoted by ‖ · ‖_P or ‖ · ‖_p. The space of L2(P) functions that have zero expectation under P is denoted by

L20(P) = L20(p) = { f ∈ L2(P) : ∫_X f(x) p(x) λ(dx) = 0 } .

Given a set A ⊆ L20(P), we denote the closure of A with respect to the topology of L2(P) by cl_{L2(P)}(A) = cl_{L2(p)}(A), and the orthogonal complement of A in L20(P) by A⊥. Finally, we consider the space L∞(P) = L∞(p) of functions from X to IR essentially bounded with respect to the probability P ∈ P (or p ∈ P∗), equipped with the norm given, for each f ∈ L∞(p), by

‖f‖_∞ = ess sup_{x∈X} |f(x)|

(see Dunford and Schwartz, 1964). We stress that for 1 ≤ r ≤ q ≤ ∞, Lq(p) ⊆ Lr(p). Moreover, for all f ∈ Lq(p),

‖f‖_{Lr(p)} ≤ ‖f‖_{Lq(p)} ,    (1.29)

hence Lq(p) convergence implies Lr(p) convergence.
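For completeness, (1.29) can be verified in one line (a sketch for 1 ≤ r ≤ q < ∞; the case q = ∞ is immediate): since t ↦ t^{q/r} is convex and p(x)λ(dx) is a probability measure, Jensen's inequality gives

( ∫_X |f(x)|^r p(x) λ(dx) )^{q/r} ≤ ∫_X |f(x)|^q p(x) λ(dx) ,

and taking q-th roots yields ‖f‖_{Lr(p)} ≤ ‖f‖_{Lq(p)}.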


We introduce next the basic structure and notation common to all the semiparametric models treated in the thesis. Consider a family of distributions P dominated by the σ-finite measure λ with representation

P∗ = { (dPθz/dλ)( · ) = p( · ; θ, z) : θ ∈ Θ ⊆ IR^q, z ∈ Z } .

Here θ is a q-dimensional interest parameter and z is a nuisance parameter of arbitrary nature (typically infinite dimensional). We assume that Θ is open and that the mapping (θ, z) ↦ p( · ; θ, z) is a bijection between Θ × Z and P∗.
Let us assume from now on that for each (θ0 , z0 ) ∈ Θ × Z,

∀x ∈ X , p(x; θ0 , z0 ) > 0 .

We suppose that the partial score function

l(x; θ0, z0) = ∇θ p(x; θ, z0)|_{θ=θ0} / p(x; θ0, z0) = ( l1(x; θ0, z0), . . . , lq(x; θ0, z0) )^T

is λ-almost everywhere well defined, and assume that for i = 1, . . . , q,

li(x; θ0, z0) ∈ L20(Pθ0z0) .


Part I

General Theory of Semiparametric Models
Chapter 2

Path and Functional Differentiability

2.1 Introduction
We consider in this chapter some aspects of the general theory of non-parametric statistical
models which will be useful for the theory of semiparametric models. The key notions
introduced here are the path differentiability, the associated concept of tangent spaces
and tangent sets, and the notions of functional differentiability.
In section 2.2 we study a range of concepts of path differentiability and comparisons
of those notions are provided. An important point there is the equivalence between the
Hellinger differentiability, often used in the literature (see Bickel et al., 1993), and the
weak differentiability (see Pfanzagl, 1982, 1985 and 1990). Two auxiliary notions of path
differentiability are introduced: strong and mean differentiability. It is proved that weak (or Hellinger) differentiability is an intermediate notion of path differentiability, weaker than strong differentiability and stronger than mean differentiability. A new notion of
path differentiability, called essential differentiability, is introduced. We will interpret the
tangents of essentially differentiable paths as score functions of one dimensional “regular submodels” in the classical sense. Since essential differentiability is weaker than the
other notions provided, this interpretation extends immediately to all the other path
differentiability notions considered.
In section 2.3 some differentiability notions of functionals are studied. In the approach
given a cone contained in the tangent set (i.e. the class of tangents of differentiable paths)
is chosen and the differentiability of the functional in question will be defined relatively
to this cone (termed tangent cone). Alternative notions of functional differentiability are
given by adopting different notions of path differentiability and/or using different tangent
cones. As we will see, the stronger the path differentiability notion used and the smaller is
the tangent cone, the weaker the notion of differentiable functionals induced, in the sense
that more statistical functionals are differentiable. We provide next some lower bounds

27
28 CHAPTER 2. PATH AND FUNCTIONAL DIFFERENTIABILITY

for the concentration of “regular” sequences of estimators for a differentiable functional


under a repeated sampling scheme. The weaker the path differentiability required and
the larger the tangent cone adopted, the sharper are the bounds obtained. The theory
will be applied to estimation in semiparametric models in section 2.4.

2.2 Differentiable paths

The main purpose of this section is to introduce the mathematical machinery necessary to
extend the notion of score function, classically defined for parametric models, to a context
where no (or only a partial) finite dimensional parametric structure is assumed. The key
idea here is to consider one-dimensional submodels of the family P of probability measures
(typically infinite dimensional). These submodels will be called paths. Following the steps
of Stein (1956), one should consider a class of submodels (or paths) sufficiently regular
in order to have a score function well defined and well behaved for each submodel, in the
sense that, at least, each score function should be unbiased (i.e. have expectation zero)
and have finite variance. Stein’s idea is to use the worst possible regular submodel to
assess the difficulty of statistical inference procedures for the entire family P. Evidently,
if the class of “regular submodels” is too small, no sensible results are to be expected from that procedure. On the other hand, if the class of “regular submodels” is too large, Stein's procedure can become intractable or no simplification is really gained, which is not in the spirit of the method proposed. Hence, when applying the Stein procedure it is our task to find a class of “regular submodels” of adequate size.

The idea of “regular submodel” mentioned will be formalized by introducing the notion of path differentiability. A range of concepts of path differentiability are studied in this section, all of them fulfilling the minimal requirement for a “regular submodel”, i.e. the score functions of the differentiable paths (viewed as submodels) will be automatically well defined, unbiased, and possess finite variances. The strongest notion of path differentiability considered is the L∞ differentiability (or pointwise differentiability) and the weakest notion is the essential differentiability. It will turn out that a notion of path differentiability called “Hellinger differentiability” (or “weak differentiability”) is the weakest notion that captures some important essential statistical properties of the model P. Another distinguished notion considered is the L2 differentiability, which involves calculations with Hilbert spaces, simplifying all the computations required. The L2 differentiability coincides with the Hellinger differentiability in most of the examples considered in this thesis. It turns out that the L2 differentiability will be useful in the theory of estimating functions.

This section is organized as follows. Subsection 2.2.1 studies the basic notion of path
differentiability and some general properties of differentiable paths. Some specific concepts
of differentiability are introduced in the subsections 2.2.2, 2.2.3 and 2.2.4 where weak
or Hellinger, Lq and essential differentiability are studied, respectively. The associated
notions of tangent sets and tangent spaces are discussed in subsection 2.2.5.

2.2.1 General definition of path differentiability


We give next a more precise definition of the terms “submodel” and “regular submodel”
informally used in the previous discussion. Recall that we were interested in defining a
one-dimensional submodel contained in the family P for which the score function would
be well defined and well behaved.
Let us consider a subset V of [0, ∞) which contains zero and for which zero is an accumulation point. The set V will play the role of the parameter space in the “submodel” we define. Typical examples are: [0, ε) for some ε > 0 and {1/n : n ∈ N} ∪ {0}. A mapping from V into P∗ assuming the value p ∈ P∗ at zero is said to be a path converging to p. Here the image of V under a path plays the role of the “submodel” of P and the path acts as a one-dimensional parametrization of the “submodel”. It is convenient to represent a
path by a generalized sequence {pt }t∈V = {pt }, where for each t ∈ V , pt ∈ P ∗ is the value
of the path at t.
We introduce next the notion of differentiability which will enable us to formalize more
precisely what in the Stein program is the class of “regular submodels”. A path {pt}t∈V
(converging to p) is differentiable at p ∈ P ∗ if for each t ∈ V we have the representation
pt ( · ) = p( · ) + tp( · )ν( · ) + tp( · )rt ( · ) (2.1)
for a certain ν( · ) ∈ L20 (p), and
rt −→ 0, as t ↓ 0 . (2.2)
The convergence in (2.2) is in some appropriate sense to be specified later. In fact, in
the next subsections we explore several notions of path differentiability by introducing
alternative definitions for that convergence. The term rt in (2.1) will be referred to as the
remainder term.
The function ν : X −→ IR given in (2.1) is said to be the tangent associated to the
differentiable path {pt }. Here the tangent plays the role of the score function of the
submodel parametrized by t ∈ V at p0 = p. To see the analogy with the score function
suppose that the convergence of rt in (2.2) is in the sense of the pointwise convergence.
In that case the tangent coincides with the score function of the submodel associated
with the differentiable path {pt } at p0 = p. In the general case, where the convergence
of rt is not necessarily pointwise convergence, the general chain rule for differentiation
of functions in metric spaces (see Dieudonné, 1960) can often be applied to justify our
interpretation of the tangent. We stress that according to our definition, the tangent of
a differentiable path (or alternatively the score of a regular submodel) has automatically
finite variance and mean zero (i.e. it is in L20 (p)).
Before embarking on the study of notions of differentiability generated by some specific definitions of the convergence of rt, we give a useful and trivial general property of remainder terms of differentiable paths. Suppose that a path {pt} is differentiable at p ∈ P∗
with representation given by (2.1), with ν ∈ L20(p). Then we have, for each t ∈ V,

rt( · ) = { pt( · ) − p( · ) } / { t p( · ) } − ν( · )    (2.3)

and

∫_X rt(x) p(x) λ(dx) = ∫_X [ { pt(x) − p(x) } / { t p(x) } − ν(x) ] p(x) λ(dx) = 0 .

2.2.2 Hellinger and weak path differentiability


Most of the estimation theory for non- and semi-parametric models found in the lit-
erature (see Bickel et al., 1993 and references therein) is developed using the notion
of Hellinger differentiability studied next. This notion appears in the literature in two
equivalent forms: weak differentiability (see Pfanzagl 1982, 1985 and 1990) and Hellinger
differentiability (see Hájeck, 1962, LeCam, 1966 and Bickel et al., 1993). This notion of
differentiability plays a central role in the theory presented because it enables us to grasp
some essential statistical properties of the models considered. For instance, the Hellinger
differentiability is equivalent to local asymptotic normality of the submodel defined by
the path. Moreover, the Hellinger differentiability is used in the so called convolution the-
orem, which gives a bound for the concentration of a rich class of estimators (the regular
asymptotic linear estimators).
We begin by introducing the weak differentiability which is in the general form of path
differentiability formulated before. A path {pt }t∈V is weakly differentiable at p ∈ P ∗ if
there exist ν ∈ L20(p) and a generalized sequence of functions {rt}t∈V such that for each t ∈ V

pt( · ) = p( · ) + t p( · ) ν( · ) + t p( · ) rt( · )

and

(1/t) ∫_{{x : t|rt(x)| > 1}} |rt(x)| p(x) λ(dx) −→ 0 , as t ↓ 0 ,    (2.4)

∫_{{x : t|rt(x)| ≤ 1}} |rt(x)|² p(x) λ(dx) −→ 0 , as t ↓ 0 .    (2.5)

In other words, {pt} is weakly differentiable if it is differentiable according to the general definition of path differentiability with the convergence of the generalized sequence {rt} given by (2.4) and (2.5).

Let us now introduce the Hellinger differentiability of paths. The key idea in this ap-
proach is to characterize the family P of probability measures by the class of square roots
of the densities, instead of by the densities themselves. The advantage of this alternative
characterization is that the square roots of the densities lie in the Hilbert space
\[
  L^2(\lambda) = \left\{ f : X \longrightarrow I\!R \;:\; \int_X f^2(x)\,\lambda(dx) < \infty \right\} .
\]
In this way the statistical model in play is naturally embedded into a space with a rich
mathematical structure. Using the usual topology of L2 (λ) one defines the differentiability
of paths in the sense of Fréchet (or, in this case, since the domain of the path is contained
in IR, the equivalent notions of Hadamard and Gateaux differentiability could also be
used). The precise definition of Hellinger differentiability is the following. A path {pt }t∈V
is Hellinger differentiable at p ∈ P ∗ if there exist a generalized sequence {st }t∈V in L2 (p)
converging to zero as t ↓ 0, i.e.
\[
  \| s_t \|_p \;\longrightarrow\; 0 , \quad \text{as } t \downarrow 0 , \tag{2.6}
\]
and ν ∈ L20 (p) such that
\[
  p_t^{1/2}(\,\cdot\,) \;=\; p^{1/2}(\,\cdot\,) \;+\; \tfrac{1}{2}\, t\, p^{1/2}(\,\cdot\,)\, \nu(\,\cdot\,)
  \;+\; t\, p^{1/2}(\,\cdot\,)\, s_t(\,\cdot\,) . \tag{2.7}
\]

The factor 1/2 in the second term on the right-hand side of (2.7) serves to make this
definition consistent with the other notions of differentiability. Note that each st is
automatically in L2 (p). For, from (2.7),
\[
  s_t(\,\cdot\,) \;=\; \frac{p_t^{1/2}(\,\cdot\,) - p^{1/2}(\,\cdot\,)}{t\,p^{1/2}(\,\cdot\,)}
  \;-\; \frac{\nu(\,\cdot\,)}{2} . \tag{2.8}
\]
Since
\[
  \int_X \left\{ \frac{p_t^{1/2}(x)}{p^{1/2}(x)} \right\}^2 p(x)\,\lambda(dx)
  = \int_X p_t(x)\,\lambda(dx) = 1 < \infty ,
\]
we have that p_t^{1/2} ( · )/p^{1/2} ( · ) ∈ L2 (p), and hence
\[
  \frac{p_t^{1/2}(x) - p^{1/2}(x)}{t\,p^{1/2}(x)}
  = \frac{1}{t}\left\{ \frac{p_t^{1/2}(x)}{p^{1/2}(x)} - 1 \right\} \in L^2(p) .
\]

Proposition 1 A path {pt } is Hellinger differentiable if and only if {pt } is weakly differentiable.

Proof: See Pfanzagl (1985). □

2.2.3 Lq path differentiability


We study next a useful range of notions of path differentiability. These notions allow
us to gauge how strong the Hellinger (or weak) differentiability is, and they will serve
as auxiliary tools in the calculation of the weak tangents of weakly differentiable paths.
In spite of the secondary role these differentiability notions play in our development,
they are important in the general theory of differentiability of statistical functionals, in
particular in the theory of von Mises functionals. The L2 differentiability defined below
will be useful when studying the use of estimating functions for semiparametric models.
The main idea here is to consider the Lq convergence for the generalized sequence {rt }
appearing in the definition of differentiable paths. The precise definition is the following.
A path {pt }t∈V ⊆ P ∗ is Lq differentiable at p ∈ P ∗ , for q ∈ [1, ∞], if there exist ν ∈ L20 (p)
and a generalized sequence {rt } in Lq (p) such that for each t ∈ V ,

pt ( · ) = p( · ) + tp( · )ν( · ) + tp( · )rt ( · ) (2.9)

and

krt kLq (p) −→ 0 , as t ↓ 0 . (2.10)

The following proposition relates the notions of Lq path differentiability.

Proposition 2 Consider r, q ∈ [1, ∞] such that r ≤ q. If a path is Lq differentiable at p ∈ P ∗ , then it is also Lr differentiable at p with the same tangent.

Proof: The proposition follows immediately from the fact that convergence in Lq (p) implies convergence in Lr (p). □

There are two distinguished cases of Lq path differentiability: strong and mean dif-
ferentiability, corresponding to L2 and L1 differentiability respectively. The L1 differen-
tiability is remarkable because it is the weakest notion of differentiability found in the
literature, and the L2 differentiability stands out because the L2 spaces, when endowed
with the natural inner product, are Hilbert spaces, which significantly simplifies the
calculations.
We study next the relation between weak and Lq path differentiability. As we will see
in propositions 3 and 4 below, weak differentiability is an intermediate notion of path
differentiability between L2 and L1 differentiability.

Proposition 3 If a path is L2 differentiable at p ∈ P ∗ , then it is weakly (or Hellinger) differentiable at p, with the same tangent.

Proof: Let {pt } be a differentiable path in the L2 sense with representation (2.9) and
krt kL2 (p) −→ 0 as t ↓ 0. We show that the path {pt } fulfills the conditions (2.4) and (2.5)
for the convergence of the remainder term in the sense of weak path differentiability.
For, since t|rt (x)| > 1 on the domain of integration below,
\[
\begin{aligned}
  \frac{1}{t}\int_{\{x:\,t|r_t(x)|>1\}} |r_t(x)|\,p(x)\,\lambda(dx)
  &\le \frac{1}{t}\int_{\{x:\,t|r_t(x)|>1\}} t\,|r_t(x)|\,|r_t(x)|\,p(x)\,\lambda(dx) \\
  &= \int_{\{x:\,t|r_t(x)|>1\}} |r_t(x)|^2\,p(x)\,\lambda(dx) \\
  &\le \int_X |r_t(x)|^2\,p(x)\,\lambda(dx)
   \;=\; \|r_t\|_p^2 \;\longrightarrow\; 0 , \quad \text{as } t \downarrow 0 .
\end{aligned}
\]
Hence {rt } satisfies (2.4). On the other hand,
\[
  \int_{\{x:\,t|r_t(x)|\le 1\}} |r_t(x)|^2\,p(x)\,\lambda(dx)
  \;\le\; \int_X |r_t(x)|^2\,p(x)\,\lambda(dx) \;=\; \|r_t\|_p^2 \;\longrightarrow\; 0 ,
  \quad \text{as } t \downarrow 0 .
\]
Hence {rt } satisfies (2.5). We conclude that {pt } is differentiable in the weak sense with
tangent ν. □

Proposition 4 If a path is weakly (or Hellinger) differentiable at p ∈ P ∗ , then it is L1 differentiable (or differentiable in mean) at p, with the same tangent.

Proof: Take a path {pt } weakly differentiable at p with tangent ν. There exists a
generalized sequence of functions {rt } satisfying (2.4) and (2.5), such that for all t ∈ V ,
\[
  p_t(\,\cdot\,) = p(\,\cdot\,) + t\,p(\,\cdot\,)\,\nu(\,\cdot\,) + t\,p(\,\cdot\,)\,r_t(\,\cdot\,) .
\]
Note that (2.4) implies that
\[
  \int_{\{x:\,t|r_t(x)|>1\}} |r_t(x)|\,p(x)\,\lambda(dx) \;\longrightarrow\; 0 ,
  \quad \text{as } t \downarrow 0 , \tag{2.11}
\]
and (2.5) implies that
\[
  \int_{\{x:\,t|r_t(x)|\le 1\}} |r_t(x)|\,p(x)\,\lambda(dx) \;\longrightarrow\; 0 ,
  \quad \text{as } t \downarrow 0 . \tag{2.12}
\]
For, (2.5) is equivalent to the L2 (p) convergence of st ( · ) := rt ( · )χ{x:t|rt (x)|≤1} ( · ) to
zero, and from (1.29),
\[
  \int_{\{x:\,t|r_t(x)|\le 1\}} |r_t(x)|\,p(x)\,\lambda(dx)
  = \|s_t\|_{L^1(p)} \le \|s_t\|_{L^2(p)} \;\longrightarrow\; 0 .
\]

Combining (2.11) and (2.12) we obtain
\[
  \int_X |r_t(x)|\,p(x)\,\lambda(dx)
  = \int_{\{x:\,t|r_t(x)|\le 1\}} |r_t(x)|\,p(x)\,\lambda(dx)
  + \int_{\{x:\,t|r_t(x)|>1\}} |r_t(x)|\,p(x)\,\lambda(dx)
  \;\longrightarrow\; 0 , \quad \text{as } t \downarrow 0 .
\]
We conclude that {pt } is L1 differentiable at p with tangent ν. □

2.2.4 Essential path differentiability

We study next the weakest notion of path differentiability considered in this text. A
path {pt }t∈V ⊆ P ∗ is essentially differentiable at p ∈ P ∗ if there exist ν ∈ L20 (p) and a
generalized sequence {rt : X −→ IR}t∈V of (A, B(IR))-measurable functions such that for
each t ∈ V ,
\[
  p_t(\,\cdot\,) = p(\,\cdot\,) + t\,p(\,\cdot\,)\,\nu(\,\cdot\,) + t\,p(\,\cdot\,)\,r_t(\,\cdot\,) \tag{2.13}
\]
and, for any sequence {kn }n∈N ⊆ V such that kn −→ 0 as n → ∞, there is a subsequence
{ki }i∈N ⊆ {kn }n∈N such that rki ( · ) −→ 0 p-almost surely as i → ∞.
We show next that essential differentiability is weaker than differentiability in mean,
which, in view of propositions 3 and 4, implies that essential differentiability is the
weakest notion of path differentiability considered here.

Proposition 5 If a path is L1 differentiable, then it is essentially differentiable, with the same tangent.

Proof: The generalized sequence {rt }t∈V converges to zero in L1 (p); by theorem 3.12 in
Rudin (1987, page 68), along any sequence kn ↓ 0 there is a subsequence converging to
zero p-almost surely, and the essential differentiability follows. □

The following scheme represents the interrelations between the various notions of path
differentiability considered (here 2 < q < r < ∞):

L∞ differentiability
⇓
Lr differentiability
⇓
Lq differentiability
⇓
L2 differentiability
⇓
weak differentiability ⇔ Hellinger differentiability
⇓
L1 differentiability
⇓
essential differentiability

2.2.5 Tangent spaces and tangent sets


Returning to the Stein approach, the notion of differentiable path formalizes the idea of
a “regular one-dimensional submodel”, with the tangent of a differentiable path playing
the role of the score function of such submodels. Here we elaborate the notion of tangent
set, which is the class of all possible tangents of differentiable paths. This will be useful
for working with the idea of the “worst possible case” contained informally in the Stein
method, and for specifying global properties common to all the scores of “regular one-
dimensional submodels”. For technical reasons we need in fact to work in many situations
with the smallest closed subspace containing the tangent set, which is called the tangent
space.
In the next section we will define a notion of differentiability for statistical functionals.
There the tangent set will play the role of “test functions”, analogous to the role of test
functions when one defines the differentiability of tempered distributions (see Rudin,
1973). The notion of tangent space plays a crucial role when studying the theory of
models with nuisance parameters. There we will need to obtain a component of a partial
score function orthogonal (in the L2 sense, i.e. uncorrelated) to the scores of a model
obtained by fixing the parameter of interest and letting the nuisance parameter vary.
This component of the partial score function is obtained by orthogonal projection of the
score function onto the orthogonal complement of the tangent space (or nuisance tangent
space, as we will call the tangent space of the submodel just mentioned). It will then be
convenient to work with a closed subspace of L2 . We remark that it can be proved that
the tangent set is a pointed cone, but in general not even a vector space; hence the
necessity of introducing the notion of tangent space as given here.
The formal definition of tangent space and tangent set depends on the notion of path
differentiability one uses. We give next a general definition of tangent set and tangent
space which will be made precise when we specify the notion of path differentiability we
use. Suppose we adopt a certain definition of path differentiability according to which a
differentiable path at p ∈ P ∗ , say {pt }, has representation, for each t ∈ V ,

pt ( · ) = p( · ) + tp( · )ν( · ) + tp( · )rt ( · ) (2.14)

and

rt −→ 0 , as t ↓ 0 , (2.15)

where the convergence in (2.15) is in a certain specified sense. Then the tangent set of P
at p ∈ P ∗ is the class
\[
  T^{o}(p) = T^{o}(p, \mathcal{P})
  = \left\{ \nu \in L_0^2(p) \;:\; \exists\, V,\ \{p_t\}_{t\in V} \subseteq \mathcal{P}^* ,\
  \{r_t\}_{t\in V} , \text{ such that } \forall t \in V, \text{ (2.14) and (2.15) hold} \right\} .
\]
The tangent space of P at p ∈ P ∗ is given by
\[
  T(p) = T(p, \mathcal{P}) = \mathrm{cl}_{L_0^2(p)}\!\left[ \mathrm{span}\{ T^{o}(p, \mathcal{P}) \} \right] .
\]
Since the tangent sets and spaces depend on the notion of path differentiability adopted,
we speak of Lq (for q ∈ [1, ∞]), weak (or Hellinger) and essential tangent sets and tangent
spaces. When necessary we use the notation T^W for the weak tangent space; the Lq
tangent spaces are represented by T^q and the essential tangent spaces by T^e .
The following proposition relates the notions of tangent sets and tangent spaces given.

Proposition 6 For each p ∈ P ∗ and for 2 < q < r < ∞ we have:
\[
  T^{\infty}(p) \subseteq T^{r}(p) \subseteq T^{q}(p) \subseteq T^{2}(p)
  \subseteq T^{W}(p) \subseteq T^{1}(p) \subseteq T^{e}(p) .
\]
Proof: Straightforward from the interrelations between the notions of path differentiability. □

We close this section with two examples of the calculation of tangent spaces.

Example 1 (Full tangent spaces of a large class of distributions) Consider the
class P of all distributions on IR dominated by the Lebesgue measure, with continuous
density (with respect to the Lebesgue measure) and with support (of the density) equal to
the whole real line. Denote the class of densities of P by P ∗ . We calculate the tangent
space of P at each p ∈ P ∗ .
Take an arbitrary element ν of Cb ∩ L20 (p). Here Cb denotes the class of continuous
compactly supported functions from IR to IR. It is a classical result of analysis that Cb is
dense in L2 (p) (see Rudin, 1966); hence Cb ∩ L20 (p) is dense in L20 (p). We show that
ν ∈ T o (p, P) (for any notion of tangent set defined before). Consider the path {pt } given,
for t ≥ 0 small enough, by
\[
  p_t(\,\cdot\,) = p(\,\cdot\,) + t\,p(\,\cdot\,)\,\nu(\,\cdot\,) . \tag{2.16}
\]
We claim that for t sufficiently small, pt ∈ P ∗ , which implies that ν ∈ T o (p, P); note that
the remainder term of this path vanishes identically, so the path is differentiable in any
of the senses considered. It suffices to verify that pt is positive and integrates to 1. Since
ν is bounded, for t < 1/ sup |ν| we have 1 + tν( · ) > 0 and hence pt = p(1 + tν) > 0.
That pt integrates to 1 follows from the fact that ν has expectation zero (with respect
to p). Since the tangent set thus contains the dense set Cb ∩ L20 (p), the tangent space of
P at p is the whole of L20 (p). □
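As a quick numerical illustration of the construction above (our own sketch, not part of the thesis), one can check on a grid that pt = p(1 + tν) remains a density for small t when ν is bounded, compactly supported and has p-mean zero:

```python
# Numerical check (our illustration) of Example 1: for p the standard normal
# density and a bounded, compactly supported, p-mean-zero tangent nu, the
# perturbed curve p_t = p (1 + t nu) stays a probability density for small t.
import numpy as np

x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# An odd smooth bump supported on (-1, 1); odd + symmetric p => p-mean zero.
inside = np.abs(x) < 1
nu = np.where(inside, x * np.exp(-1.0 / np.clip(1 - x**2, 1e-12, None)), 0.0)

print("E_p[nu] =", (nu * p).sum() * dx)                  # ~ 0
for t in (0.1, 0.5, 1.0):
    pt = p * (1 + t * nu)
    print(f"t={t}: min p_t = {pt.min():.3e}, integral = {(pt.sum() * dx):.6f}")
# For t < 1 / max|nu| the minimum stays positive and the integral stays 1,
# so p_t is a genuine density along the path, as claimed in the example.
```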

It is not surprising that the previous enormous class of distributions possesses a “full”
tangent space. The next example shows that this can be the case even for families about
whose members we have a good deal of information.

Example 2 (Full tangent space for families with information on the moments)
Consider the class P of all distributions on IR dominated by the Lebesgue measure, with
continuous density (with respect to the Lebesgue measure) and with support (of the density)
equal to the whole real line. Suppose further that the moments of all orders exist and that
there exist a δ > 0, a k ∈ N and constants m1 , . . . , mk such that for each i ∈ {1, . . . , k}
the moment of order i is contained in the open interval (mi − δ, mi + δ). We claim that the
tangent space of P at any p ∈ P ∗ is L20 (p). The proof follows the same line of argument
as in the previous example. Take a path as in (2.16) with ν ∈ Cb ∩ L20 (p). For t
sufficiently small, pt will be positive, integrate to one, possess finite moments of all orders,
and its moments of order i, for i ≤ k, will be contained in the intervals (mi − δ, mi + δ).
□

2.3 Functional differentiability


2.3.1 Definition and first properties of functional differentiability
We consider in this section a functional φ : P ∗ −→ IRq (for some q ∈ N ) which will
play the role of a parameter of interest that we want to estimate. Typical examples
are the mean and the second moment functionals, defined by φ(p) = ∫X x p(x)λ(dx) and
φ(p) = ∫X x2 p(x)λ(dx) respectively. An important nontrivial example for the theory of
semiparametric models is the interest parameter functional defined next and studied in
detail in section 2.4.

Example 3 Semiparametric models

Suppose that the family P ∗ of probability densities with respect to a measure λ can be
represented in the form
\[
  \mathcal{P}^* = \{ p(\,\cdot\,;\theta,z) : \theta \in \Theta \subseteq I\!R^q ,\ z \in \mathcal{Z} \} .
\]
Here it is assumed that the mapping (θ, z) 7→ p( · ; θ, z) is a bijection between Θ × Z and
P ∗ . The interest parameter functional φ : P ∗ −→ IRq is defined, for each p( · ; θ, z) ∈ P ∗ ,
by
\[
  \phi\{ p(\,\cdot\,;\theta,z) \} = \theta . \qquad \Box
\]

We introduce next a notion of functional differentiability that will enable us to develop
a theory of estimation for the functional φ. Let p be a fixed element of P ∗ . Consider a
non-empty subset T (p) of the tangent space at p. A functional φ : P ∗ −→ IRq is said to
be differentiable at p ∈ P ∗ with respect to T (p) if there exists a function φ•p : X −→ IRq ,
such that φ•p ∈ {L20 (p)}q and for each ν ∈ T (p) there is a differentiable path {pt } with
tangent ν and
\[
  \frac{\phi(p_t) - \phi(p)}{t} \;\longrightarrow\; \langle \phi_p^{\bullet}, \nu \rangle_p ,
  \quad \text{as } t \downarrow 0 . \tag{2.17}
\]
Here < φ•p , ν >p is the vector whose components are the inner products of the q
components of φ•p with ν. The function φ•p : X −→ IRq is said to be a gradient of the
functional φ at p (with respect to T (p)). Note that φ•p depends on the point p at which
we study the differentiability of the functional φ. If a functional φ is differentiable at each
p ∈ P ∗ we say that φ is differentiable.

Since the definition of functional differentiability depends on the notion of path dif-
ferentiability, we speak of L∞ , Lp , strong (L2 ), weak, mean (L1 ) and essential functional
differentiability. When necessary we superpose a symbol indicating the notion of path
differentiability in play. When we speak generically, or when it is clear from the context
which notion of path differentiability is in play, we just use the notation φ•p for the
gradient and T o (p, P ∗ ) = T o (p), T (p, P ∗ ) = T (p) for the tangent set and the tangent
space of P ∗ at p respectively.
Note that the notion of functional differentiability introduced here involves a subset
T (p) of the tangent space, and not necessarily the whole tangent space, as is customary in
the literature. This will give much more flexibility to the estimation theory developed.
Clearly, the smaller the class T (p) (or the stronger the notion of path differentiability)
used, the weaker the related functional differentiability. On the other hand, the larger
the class T (p), the sharper the results of the related estimation theory will be, in the
sense that the lower bounds for the asymptotic variance will be larger and the optimality
results will cover more estimating sequences. In this sense the ideal would be to choose
the largest T (p) (and the strongest path differentiability) that makes the functional under
study differentiable. Of course, we will have to require some mathematical properties
of the classes T (p) in order to obtain a notion of functional differentiability useful for the
estimation theory of differentiable functionals. For instance, it will be assumed throughout
(and silently) that T (p) is a pointed cone (i.e. if ν ∈ T (p), then for each α ∈ IR+ ∪ {0},
αν ∈ T (p)). We will refer from now on to T (p) as the tangent cone. It will sometimes be
necessary to require the tangent cones to be convex.
We consider next a trivial example that illustrates the mechanics of functional
differentiability.

Example 4 (Mean functional) Let λ be a σ-finite measure defined on a measurable
space (X , A). Consider a family of probability measures P on (X , A) dominated by λ
and given by the representation
\[
  \mathcal{P} = \left\{ \frac{dP}{d\lambda}(\,\cdot\,) = p(\,\cdot\,) \;:\; \text{(2.19)--(2.22) hold} \right\} . \tag{2.18}
\]
The conditions defining P are
\[
  \forall x \in X ,\ p(x) > 0 ; \tag{2.19}
\]
\[
  \int_X p(x)\,\lambda(dx) = 1 ; \tag{2.20}
\]
\[
  p \text{ is continuous} ; \tag{2.21}
\]
\[
  \int_X x^2\,p(x)\,\lambda(dx) \in I\!R_+ . \tag{2.22}
\]

We denote the class of densities of the elements of P with respect to λ by P ∗ . Define the
functional M : P ∗ −→ IR by, for each p ∈ P ∗ ,
\[
  M(p) = \int_X x\,p(x)\,\lambda(dx) .
\]

We prove that M is a differentiable functional with respect to the L2 tangent space. As
we have seen in the previous section, the tangent space of P at any p ∈ P ∗ is the whole
space L20 (p).
Take p ∈ P ∗ fixed and an arbitrary L2 -differentiable path at p, say {pt }t∈V , with
representation given, for each t ∈ V , by
\[
  p_t(\,\cdot\,) = p(\,\cdot\,) + t\,p(\,\cdot\,)\,\nu(\,\cdot\,) + t\,p(\,\cdot\,)\,r_t(\,\cdot\,) ,
\]
where ν ∈ L20 (p), {rt } ⊂ L20 (p) and rt −→ 0 in L2 (p) as t ↓ 0. We have
\[
  \frac{M(p_t) - M(p)}{t}
  = \frac{\int_X x\,p_t(x)\,\lambda(dx) - \int_X x\,p(x)\,\lambda(dx)}{t}
  = \langle \nu(\,\cdot\,), (\,\cdot\,) \rangle_p + \langle r_t(\,\cdot\,), (\,\cdot\,) \rangle_p
  \longrightarrow \langle \nu(\,\cdot\,), (\,\cdot\,) \rangle_p ,
  \quad \text{as } t \downarrow 0 . \tag{2.23}
\]
Here ( · ) denotes the identity function, which is in L2 (p) by (2.22). The last convergence
comes from the continuity of the inner product and the L2 (p) convergence of the path
remainder term to zero.
Define the function
\[
  M_p^{\bullet}(\,\cdot\,) = (\,\cdot\,) - \int_X x\,p(x)\,\lambda(dx) .
\]
Clearly, Mp• is in L20 (p) and, since ν has mean zero under p,
\[
  \langle \nu, M_p^{\bullet} \rangle_p
  = \Big\langle \nu(\,\cdot\,),\ (\,\cdot\,) - \int_X x\,p(x)\,\lambda(dx) \Big\rangle_p
  = \langle \nu(\,\cdot\,), (\,\cdot\,) \rangle_p . \tag{2.24}
\]
Since (2.24) and (2.23) hold for any L2 differentiable path, we conclude that M is differ-
entiable with respect to the L2 tangent set and Mp• is a gradient of M . An argument based
on subsequences (cf. Labouriau, 1996) yields the differentiability of M with respect to
the essential tangent set, i.e. the mean functional is differentiable in the strongest sense
we can define in our set-up.
Note that in this example (2.17) holds for any differentiable path with tangent ν ∈
T (p). However, according to our definition of functional differentiability it would be
enough for condition (2.17) to hold for one path with each tangent ν. □
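To make the gradient concrete, here is a small simulation sketch (ours, not from the text): the empirical mean is an asymptotically linear estimator of M whose influence function is exactly the gradient Mp• (x) = x − M (p), so its asymptotic variance is the variance of Mp• under p.

```python
# Simulation sketch (our illustration): the sample mean estimates the mean
# functional M with influence function IC(x; p) = x - M(p), the gradient
# computed in Example 4; sqrt(n){M_hat - M(p)} has variance ~ Var_p(X).
import numpy as np

rng = np.random.default_rng(0)
mu, var = 1.5, 0.75                  # gamma(3, 0.5): mean 3*0.5, variance 3*0.25
reps, n = 5_000, 1_000

x = rng.gamma(3.0, 0.5, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu)   # sqrt(n){M_hat_n - M(p)}
print(z.var(), "~", var)                 # matches Cov_p{M_p_grad} = Var_p(X)

# For the mean the linear representation is exact: sqrt(n){M_hat - M(p)}
# equals n^{-1/2} * sum_i IC(x_i; p), with the remainder identically zero.
```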

Let φ : P ∗ −→ IRq be a differentiable functional at p ∈ P ∗ with gradient φ•p : X −→ IRq .
It follows immediately from the definition of gradient that a function φ?p : X −→ IRq in
{L20 (p)}q is also a gradient of φ at p if and only if
\[
  \forall \nu \in T(p), \quad \langle \nu, \phi_p^{\bullet} \rangle_p = \langle \nu, \phi_p^{\star} \rangle_p . \tag{2.25}
\]
We conclude from the remark above that if φ•p is a gradient of φ at p and ξ ∈ {T (p)}⊥ (i.e.
ξ is in the orthogonal complement of the tangent space with respect to L20 (p)), then φ•p + ξ
is also a gradient of φ at p. Hence, in general, the gradient of a differentiable functional is
not unique.
A gradient φ•p of a differentiable functional at p ∈ P ∗ is said to be a canonical gradient
if φ•p ( · ) ∈ T̄ (p). Here T̄ (p) denotes the L2 closure of the space spanned by T (p). The
following proposition shows that there exists only one canonical gradient (apart from
almost surely equal functions) and gives a recipe for computing the canonical gradient,
namely by orthogonally projecting any gradient onto T̄ (p). We will see that the canonical
gradient plays a crucial role in the theory of estimation of functionals.

Proposition 7 Let φ : P ∗ −→ IRq be a differentiable functional at p ∈ P ∗ . If φ•p :
X −→ IRq is a gradient of φ at p, then the vector formed by the orthogonal projections of
the components of φ•p onto T̄ (p), say
\[
  \left( \Pi\{\phi_{1p}^{\bullet} \,|\, \bar T(p)\}, \ldots, \Pi\{\phi_{qp}^{\bullet} \,|\, \bar T(p)\} \right)^T ,
\]
is also a gradient of φ at p. Furthermore, if φ∗p is another gradient of φ at p, then
\[
  \left( \Pi\{\phi_{1p}^{\bullet} \,|\, \bar T(p)\}, \ldots, \Pi\{\phi_{qp}^{\bullet} \,|\, \bar T(p)\} \right)^T
  = \left( \Pi\{\phi_{1p}^{*} \,|\, \bar T(p)\}, \ldots, \Pi\{\phi_{qp}^{*} \,|\, \bar T(p)\} \right)^T ,
\]
p almost surely.

Proof: We prove the proposition for the case where q = 1. The same argument applied
componentwise proves the case of general q ∈ N , but with heavier notation. From the
projection theorem we have the orthogonal decomposition
\[
  \phi_p^{\bullet} = \Pi\{\phi_p^{\bullet} \,|\, \bar T(p)\} + \Pi\{\phi_p^{\bullet} \,|\, \bar T^{\perp}(p)\} .
\]
Here T̄ ⊥ (p) is the orthogonal complement of T̄ (p) in L20 (p). Hence
\[
  \Pi\{\phi_p^{\bullet} \,|\, \bar T(p)\} = \phi_p^{\bullet} - \Pi\{\phi_p^{\bullet} \,|\, \bar T^{\perp}(p)\} .
\]
Since Π{φ•p |T̄ ⊥ (p)} is orthogonal to T̄ (p), we conclude from (2.25) that Π{φ•p |T̄ (p)} is a
gradient.

Reasoning analogously we conclude that if φ∗p is another gradient of φ at p, then
\[
  \Pi\{\phi_p^{*} \,|\, \bar T(p)\} = \phi_p^{*} - \Pi\{\phi_p^{*} \,|\, \bar T^{\perp}(p)\}
\]
is a gradient of φ at p. From (2.25) (which extends to all of T̄ (p) by linearity and
continuity of the inner product), for all ν ∈ T̄ (p),
\[
  \langle \Pi\{\phi_p^{\bullet} \,|\, \bar T(p)\}, \nu \rangle_p
  = \langle \Pi\{\phi_p^{*} \,|\, \bar T(p)\}, \nu \rangle_p
\]
and hence, for all ν ∈ T̄ (p),
\[
  \langle \Pi\{\phi_p^{\bullet} \,|\, \bar T(p)\} - \Pi\{\phi_p^{*} \,|\, \bar T(p)\}, \nu \rangle_p = 0 . \tag{2.26}
\]
In particular (2.26) holds for
\[
  \nu = \Pi\{\phi_p^{\bullet} \,|\, \bar T(p)\} - \Pi\{\phi_p^{*} \,|\, \bar T(p)\} ,
\]
which yields
\[
  \left\| \Pi\{\phi_p^{\bullet} \,|\, \bar T(p)\} - \Pi\{\phi_p^{*} \,|\, \bar T(p)\} \right\|_{L^2(p)}^2 = 0 .
\]
We conclude that Π{φ•p |T̄ (p)} = Π{φ∗p |T̄ (p)} p almost surely. □

Example 5 (Mean functional continued) It can be shown that the tangent space of
the model P given by (2.18) at each p ∈ P ∗ is the whole space L20 (p). Hence the gradient
calculated in example 4 is the canonical gradient; moreover, it is the only possible gradient
for the mean functional. Note that if we drop the condition requiring the existence of the
variance of p (i.e. condition (2.22)), then Mp• is no longer a gradient (because it is not in
L2 ) and M is not differentiable at p. □
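Since by proposition 7 the canonical gradient is obtained by orthogonal projection onto T̄ (p), here is a small numerical sketch of that projection (ours; the tangent space and the gradient below are made-up finite-dimensional choices used purely for illustration), computed via the Gram matrix of a spanning set.

```python
# Sketch (our illustration): orthogonal projection of a gradient onto a
# made-up finite-dimensional tangent space T = span{b_1, b_2} in L^2_0(p),
# via the normal equations with Gram matrix G_ij = <b_i, b_j>_p.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)                 # Monte Carlo draws from p = N(0,1)

def inner(f, g):                                 # <f, g>_p = E_p[f(X) g(X)]
    return float(np.mean(f(x) * g(x)))

basis = [lambda u: u, lambda u: u**2 - 1]        # spanning set of T, centered under p
grad  = lambda u: u + 0.3 * u**3                 # a gradient in L^2_0(p) (odd => mean 0)

G = np.array([[inner(bi, bj) for bj in basis] for bi in basis])   # Gram matrix
c = np.array([inner(bi, grad) for bi in basis])
coef = np.linalg.solve(G, c)                     # normal equations

proj  = lambda u: sum(a * b(u) for a, b in zip(coef, basis))      # canonical gradient
resid = lambda u: grad(u) - proj(u)              # lies in the orthogonal complement
print([round(inner(b, resid), 3) for b in basis])                 # ~ [0.0, 0.0]
```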

We consider next a proposition giving trivial (but useful) rules for calculating gradients
of transformed and composed functionals.

Proposition 8 Let Ψ, φ : P ∗ −→ IRq be two differentiable functionals with gradients Ψ•
and φ• (canonical gradients Ψ? and φ? ) at p ∈ P ∗ , respectively. Let g : IRq −→ IRq be a
differentiable function.

i) For all a, b ∈ IR, aΨ + bφ is differentiable at p and its gradient (canonical gradient) is
given by aΨ• + bφ• (aΨ? + bφ? ).

ii) g ◦ φ is a functional differentiable at p with gradient ∇g{φ(p)}{φ• ( · )}T . If φ• is the
canonical gradient of φ then ∇g{φ(p)}{φ• ( · )}T is the canonical gradient of g ◦ φ.

Proof:
i) Straightforward.
ii) We give next the proof for the case where q = 1; the general case is obtained in a
similar way. Take an arbitrary differentiable path {pt } with tangent ν and define
ξ(t) = φ(pt ). We have
\[
  \frac{\phi(p_t) - \phi(p)}{t} \longrightarrow \langle \nu, \phi^{\bullet} \rangle_p = \xi'(0) .
\]
Now,
\[
  \frac{(g\circ\phi)(p_t) - (g\circ\phi)(p)}{t} \longrightarrow (g\circ\xi)'(0)
  = g'(\xi(0))\,\langle \nu, \phi^{\bullet} \rangle_p
  = \langle \nu,\ g'\{\phi(p)\}\,\phi^{\bullet} \rangle_p . \qquad \Box
\]

2.3.2 Asymptotic bounds for functional estimation


We study next some results concerning the estimation of a differentiable statistical func-
tional under repeated sampling. These results will illustrate the importance of the canon-
ical gradient and will guide the choice of the notion of path differentiability and tangent
cone to be used.
We start by defining sequences of estimators, based on samples, of a given functional
φ : P ∗ −→ IRq differentiable with respect to some tangent cones T (p). A sequence of
functions {φ̂n }n∈N = {φ̂n } such that for each n ∈ N , φ̂n : X n −→ IRq is (An , B(IRq ))-
measurable is said to be an estimating sequence. Next we introduce two notions of
regularity of estimating sequences often found in the literature. An estimating sequence
{φ̂n } is said to be weakly regular (for estimating φ, with respect to the choice of tangent
cones made) if for each p ∈ P ∗ and each ν ∈ T (p) there exists a differentiable path
{pn−1/2 }n∈N with tangent ν, converging to p and with domain V = {n−1/2 : n ∈ N },
for which
\[
  \sqrt{n}\,\{ \phi(p_{n^{-1/2}}) - \phi(p) \}
  \longrightarrow \int_X \phi^{\bullet}(x, p)\,\nu(x)\,p(x)\,\lambda(dx)
\]
and there exists a probability distribution Lpν (not depending on the path) such that
\[
  \mathcal{L}_{p_{n^{-1/2}}^{\,n}}\!\left[ \sqrt{n}\,\{ \hat\phi_n(\,\cdot\,) - \phi(p) \} \right]
  \;\overset{D}{\longrightarrow}\; L_{p\nu} .
\]
If the distributions Lpν above do not depend on the tangent ν, then we say that {φ̂n }n∈N
is regular.

An important class of estimating sequences are the asymptotic linear sequences defined
next. An estimating sequence {φ̂n } is said to be asymptotic linear (for estimating φ) if
there exists a function ICφ : X × P ∗ −→ IRq such that for each p ∈ P ∗ the function
ICφ ( · ; p) : X −→ IRq has components in L20 (p) and, for each n ∈ N , given a sample
x = (x1 , . . . , xn ) of size n, φ̂n admits the representation
\[
  \hat\phi_n(x) = \phi(p) + \frac{1}{n}\sum_{i=1}^{n} IC_\phi(x_i; p)
  + o_{p^n}\!\left( n^{-1/2} \right) . \tag{2.27}
\]
The function ICφ is called the influence function of the sequence {φ̂n }. The representation
(2.27) can be rewritten as
\[
  \sqrt{n}\,\{ \hat\phi_n - \phi(p) \}
  = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} IC_\phi(x_i; p) + o_{p^n}(1) .
\]
From the central limit theorem and the Slutsky theorem,
\[
  \sqrt{n}\,\{ \hat\phi_n - \phi(p) \}
  \;\overset{D}{\longrightarrow}\; N\big[\,0,\ \mathrm{Cov}_p\{ IC_\phi(\,\cdot\,;p) \}\,\big] ,
\]
where
\[
  \mathrm{Cov}_p\{ IC_\phi(\,\cdot\,;p) \}
  = \int_X IC_\phi(x;p)\, IC_\phi^T(x;p)\, p(x)\,\lambda(dx) . \tag{2.28}
\]

Theorem 4 Let {φ̂n } be an asymptotic linear estimating sequence with influence function
IC. Suppose that for each p ∈ P ∗ the tangent cone is the weak tangent set, T (p) =
T^{Wo} (p, P ∗ ). Then {φ̂n } is regular if and only if for all p ∈ P ∗ , φ is differentiable at p
(with respect to T (p)) and IC( · ; p) is a gradient of φ at p.

Proof: See Pfanzagl (1990) for the case where q = 1, or Bickel et al. (1993). □

The theorem above identifies (the influence functions of) regular asymptotic linear se-
quences of estimators of the functional φ with the gradients of φ. The covariance,
∫X φ•p (x)φ•p (x)T p(x)λ(dx), of a gradient φ•p of φ is the asymptotic covariance (under p)
of the corresponding regular asymptotic linear estimating sequence with influence function
φ•p . On the other hand, since the components of the canonical gradient φ? of φ are the
orthogonal projections of the components of any gradient onto the tangent space, we have,
for a given gradient φ•p and for all p ∈ P ∗ ,
\[
  \phi_p^{\bullet}(\,\cdot\,) = \phi_p^{\star}(\,\cdot\,) + R(\,\cdot\,;p) ,
\]
for some R( · ; p) ∈ {T ⊥ (p; P)}q . Since the components of φ?p are orthogonal to those of
R( · ; p), the covariance of φ•p splits as the covariance of φ?p plus that of R( · ; p), which
yields, for all p ∈ P ∗ ,
\[
  \int_X \{\phi_p^{\star}(x)\}\{\phi_p^{\star}(x)\}^T p(x)\,\lambda(dx)
  \;\le\; \int_X \{\phi_p^{\bullet}(x)\}\{\phi_p^{\bullet}(x)\}^T p(x)\,\lambda(dx) , \tag{2.29}
\]
with inequality in the sense of the Löwner partial order of matrices. That is, the covari-
ance of the canonical gradient is a lower bound for the asymptotic covariance of regular
asymptotic linear estimating sequences. Moreover, only an asymptotic linear estimating
sequence with influence function equal to the canonical gradient attains this bound. We
say that an asymptotic linear estimating sequence is optimal if, for each p ∈ P ∗ , its in-
fluence function is the canonical gradient of φ. The bound (2.29) is sometimes called the
semiparametric Cramér-Rao bound.
In spite of the elegance of this theory, some care should be observed in applying it.
Firstly, there is a certain degree of arbitrariness in choosing only the class of regular
asymptotic linear estimating sequences. When restricting to that class one can discard
many interesting sequences. This criticism applies, of course, to any optimality approach.
A second, more specific criticism is the following: it occurs very often that the tangent
space of large (semi- or non-parametric) models is the whole space L20 (see the examples
at the end of the section on tangent spaces). In those cases, due to the uniqueness of the
canonical gradient, each differentiable functional possesses only one gradient. We conclude
from the previous discussion that then there is only one possible influence function and
hence all regular asymptotic linear estimating sequences are asymptotically equivalent (as
far as the asymptotic variance is concerned). Therefore an optimality theory for regular
asymptotic linear estimators is meaningless for the models with tangent spaces equal to
the whole L20 . We refine next the optimality theory for functional estimation.
It is convenient to introduce the following notation. Given a functional φ differentiable
with respect to the tangent cones {T (p) : p ∈ P ∗ } and with canonical gradient φ? ( · , p)
at each p ∈ P ∗ , denote
\[
  I_\phi(p) = \int_X \phi_p^{\star}(x)\,\phi_p^{\star}(x)^T\, p(x)\,\lambda(dx) ;
\]
that is, Iφ (p) is the covariance matrix of the canonical gradient. A weakly regular
estimating sequence {φ̂n } is asymptotically of constant bias at p ∈ P ∗ if for each
ν, η ∈ T (p)
\[
  \int x\, dL_{p\nu}(x) = \int x\, dL_{p\eta}(x) \in I\!R^q .
\]
In particular, any regular estimating sequence is asymptotically of constant bias.


Theorem 5 (van der Vaarts extended Crámer-Rao theorem) Let φ : P ∗ −→ IRq
w
be a differentiable at p ∈ P ∗ with respect to T (p) ⊆T (p). Suppose that the sequence {φ̂n }
is weakly regular and asymptotically of constant bias at p ∈ P ∗ . Suppose also that the
covariance matrix of Lp0 exists. Then
Cov(Lp0 ) ≥ Iφ (p) , (2.30)
2.3. FUNCTIONAL DIFFERENTIABILITY 47

where the symbol 00 ≥00 is understood in the sense of the Löwner partial order of matrices
1
. Moreover, the equality in (2.30) occurs only if
√ n
1 X
n{φ̂n − φ(p)} = √ φ?p (xj ) + oP (1) . (2.31)
n j=1

Proof: See van der Vaart (1980). u


t

We see from the theorem above that the larger the tangent cones T (p) used, the sharper
the inequality (2.30). Small tangent cones make the differentiability of the functional more
likely, but can also make the bound in (2.30) unattainable.
Another important optimality result in the theory of estimation of functionals is the
convolution theorem, of which we give the following version.

Theorem 6 (Convolution theorem) Suppose that T (p) is convex and that φ : P ∗ −→ IRq
is differentiable at p ∈ P ∗ with respect to T (p). Then any limiting distribution Lp of a
regular estimating sequence for φ at p satisfies
\[
  L_p = N(0, I_\phi(p)) * M , \tag{2.32}
\]
where M is a probability measure on IRq .

Proof: See Pfanzagl (1990) for the case where q = 1 and T (p) = T^W (p), and van der
Vaart (1980) for the general case. □

The expression (2.32) shows that, under the assumptions of the convolution theorem, a
regular estimating sequence cannot possess an asymptotic covariance smaller than Iφ (p),
the covariance of the canonical gradient. This provides an extension of the interpretation
of the optimality theory for regular asymptotic linear estimating sequences. In fact, even
when the tangent cone is the whole L20 , the “optimal” regular asymptotic linear esti-
mating sequence attains the bound for the concentration of regular estimating sequences
given by the convolution theorem, provided the functional is differentiable. An advantage
of the version of the convolution theorem presented is that we need not work with the
whole tangent space, but only with a convex cone contained in it. This can be useful when
the functional under study is not differentiable or when the calculation of the (weak)
tangent space is not feasible.
We close this section by presenting a theorem that gives a minimax approach to the
problem of estimation of functionals. A function l : IRq −→ IR is said to be bowl-shaped
if l(0) = 0, l(x) = l(−x) and, for all k ∈ IR, {x : l(x) ≤ k} is convex.

Theorem 7 (Local asymptotic minimax theorem) Suppose that for each p ∈ P ∗ ,
T (p) ⊆ T^W (p) is convex and that φ : P ∗ −→ IRq is differentiable at p ∈ P ∗ with respect
to T (p). Then:
i) For any sequence of estimators which is weakly regular at p and any bowl-shaped loss
function l,
\[
  \sup_{\nu \in T(p)} \int l(x)\, dL_{p\nu}(x)
  \;\ge\; \int l(x)\, dN(0, I_\phi(p))(x) . \tag{2.33}
\]
ii) For any bowl-shaped loss function l and any estimating sequence {φ̂n },
\[
  \lim_{c\to\infty} \liminf_{n\to\infty} \sup_{Q \in H_n(p,c)}
  E_Q\!\left\{ l\!\left[ \sqrt{n}\,\{\hat\phi_n - \phi(Q)\} \right] \right\}
  \;\ge\; \int l(x)\, dN(0, I_\phi(p))(x) , \tag{2.34}
\]
where Hn (p, c) := {Q ∈ P : n ∫ {q 1/2 (x) − p1/2 (x)}2 λ(dx) ≤ c2 } (with q = dQ/dλ) is the
intersection of P with the ball, in the Hellinger distance, of centre p and radius c n−1/2 .

Proof: See van der Vaart (1980). □

Note that from part i) one can obtain a bound, based on the canonical gradient, for the
concentration of weakly regular estimating sequences, provided φ is differentiable with
respect to some convex tangent cones. In particular, if there exists an optimal asymptotic
linear estimating sequence and the assumptions of the theorem hold (i.e. differentiability
of φ and convexity of the tangent cone), then the bound for weakly regular estimating
sequences given by (2.33) is attained by this regular asymptotic linear estimating sequence.
In this way, in the case where the tangent space is the whole L20 , the optimality of the
(unique) regular asymptotic linear estimating sequence can be justified. The bound in
the second part of the theorem above holds for the whole class of estimators; however, it
is in general not attainable.

2.4 Asymptotic bounds for semiparametric models


We consider a family of distributions P dominated by a σ-finite measure λ with
representation
\[
  \mathcal{P}^* = \left\{ \frac{dP_{\theta z}}{d\lambda}(\,\cdot\,) = p(\,\cdot\,;\theta,z)
  \;:\; \theta \in \Theta \subseteq I\!R^q ,\ z \in \mathcal{Z} \right\} .
\]
Here θ is a q-dimensional interest parameter and z is a nuisance parameter of arbitrary
nature. We assume that Θ is open and that the mapping (θ, z) 7→ p( · ; θ, z) is a bijection
between Θ × Z and P ∗ . The interest parameter functional φ : P ∗ −→ IRq is defined, for
each p( · ; θ, z) ∈ P ∗ , by
\[
  \phi\{ p(\,\cdot\,;\theta,z) \} = \theta .
\]
We will consider the differentiability of the interest parameter functional φ for a range of
tangent cones.
Recall that we assumed that for each (θ0 , z0 ) ∈ Θ × Z,
\[
  \forall x \in X ,\ p(x;\theta_0,z_0) > 0 ,
\]
that the partial score function
\[
  l(x;\theta_0,z_0)
  = \frac{\nabla_\theta\, p(x;\theta,z_0)\big|_{\theta=\theta_0}}{p(x;\theta_0,z_0)}
  = \left( l_1(x;\theta_0,z_0), \ldots, l_q(x;\theta_0,z_0) \right)^T
\]
is λ-almost everywhere well defined, and that for i = 1, . . . , q,
\[
  l_i(\,\cdot\,;\theta_0,z_0) \in L_0^2(P_{\theta_0 z_0}) .
\]
Let us consider a fixed (θ, z) ∈ Θ × Z at which we will study the differentiability of
φ. For notational simplicity we denote p( · ; θ, z) by p( · ).
The first tangent cone we consider is
T1 (p) = span{li (x; θ, z) : i = 1, . . . , q} .
Take ν ∈ T1 (p). There exists α ∈ IRq such that ν( · ) = lT ( · ; θ, z)α. Define (for t small
enough) the path
pt ( · ) = p( · ; θ + tα, z) .
Clearly, there exists {rt } such that
\[
  l^T(\,\cdot\,;\theta,z)\,\alpha
  = \frac{p(\,\cdot\,;\theta+t\alpha,z) - p(\,\cdot\,)}{t\,p(\,\cdot\,)} + r_t(\,\cdot\,) , \tag{2.35}
\]

with rt ( · ) −→ 0 λ-almost everywhere. Hence the path {pt } is (L∞ ) differentiable with
tangent lT ( · ; θ, z)α. Moreover,
\[
  \frac{\phi(p_t) - \phi(p)}{t} = \frac{\theta + t\alpha - \theta}{t} = \alpha .
\]
Defining φ?p ( · ) = Covθz { l( · ; θ, z) }−1 l( · ; θ, z) we obtain
\[
  \int_X \phi_p^{\star}(x)\,\nu(x)\,p(x)\,\lambda(dx)
  = \mathrm{Cov}_{\theta z}\{ l(\,\cdot\,;\theta,z) \}^{-1}
    \int_X l(x;\theta,z)\, l^T(x;\theta,z)\,\alpha\, p(x)\,\lambda(dx)
  = \alpha = \lim_{t\downarrow 0} \frac{\phi(p_t) - \phi(p)}{t} .
\]
We conclude that φ is differentiable at p with respect to T1 (p). Moreover,
\[
  \mathrm{Cov}_{\theta z}\{ l(\,\cdot\,;\theta,z) \}^{-1}\, l(\,\cdot\,;\theta,z)
\]
is the canonical gradient of φ. Note that in (2.35) we used the L∞ path differentiability
implicitly; however, the argument presented holds for any weaker path differentiability.
For, the essential point is that we identify (through (2.35)) each element of the tangent
cone T1 (p) with an L∞ differentiable path. If we adopt a path differentiability weaker
than L∞ differentiability, then the L∞ differentiable paths identified with the elements
of the tangent cone are also differentiable in the current sense, and the differentiability
of the functional φ follows from the argument presented above.
The matrix Iφ (p) (i.e. the covariance matrix of the canonical gradient of φ at p) is then
the inverse of the covariance matrix of the score function l( · ; θ, z). The bounds for the
asymptotic variance obtained with this naive choice of tangent cones are not attainable
in general. This will be apparent from the development presented next, where sharper
bounds are derived.
We introduce now the notion of nuisance tangent space, which plays a fundamental role
in the estimation theory of semiparametric models. For each θ0 ∈ Θ consider the
submodels
\[
  \mathcal{P}^*_{\theta_0} = \{ p(\,\cdot\,;\theta_0,z) : z \in \mathcal{Z} \} .
\]
The nuisance tangent set at (θ, z) ∈ Θ × Z, TN o (θ, z), is the tangent set of Pθ∗ , i.e.
TN o (θ, z) = T o (p, Pθ∗ ). The closure of the space spanned by the nuisance tangent set is
called the nuisance tangent space and denoted by TN (θ, z). Here we do not specify the
notion of path differentiability adopted, but when necessary a symbol will be superimposed.
A better alternative for the tangent cone than T1 (p) is
\[
  T_2(p) = \mathrm{span}\{ l_i(\,\cdot\,;\theta,z) : i = 1, \ldots, q \} \cup T_N^{o}(\theta,z) .
\]



We show next that φ is differentiable with respect to T2 (p), no matter which notion of
path differentiability we use. Consider ν ∈ TN o (θ, z) ⊂ T2 (p). There is a differentiable
path {pt } contained in Pθ∗ with tangent ν. Since pt ∈ Pθ∗ for each t, φ(pt ) = θ = φ(p) and
\[
  \frac{\phi(p_t) - \phi(p)}{t} = 0 .
\]
From the definition of functional differentiability, any gradient φ•p of φ must satisfy, for
each ν ∈ TN o (θ, z),
\[
  0 = \lim_{t\downarrow 0} \frac{\phi(p_t) - \phi(p)}{t}
  = \int_X \phi_p^{\bullet}(x)\,\nu(x)\,p(x)\,\lambda(dx) . \tag{2.36}
\]
On the other hand, the argument presented in the case of the tangent cone T1 (p)
implies that, if ν ∈ span{li ( · ; θ, z) : i = 1, . . . , q}, say ν( · ) = l( · ; θ, z)T α for some
α ∈ IRq , then any gradient φ•p of φ satisfies
\[
  \alpha = \int_X \phi_p^{\bullet}(x)\,\nu(x)\,p(x)\,\lambda(dx) . \tag{2.37}
\]

Clearly, the conditions (2.36) and (2.37) are sufficient to ensure that φ•p is a gradient
of φ. From these considerations, a natural candidate for a gradient of φ is the
(standardized) projection of the score function onto the orthogonal complement of the
nuisance tangent space. Formally, define the function lE : X × Θ × Z −→ IRq by, for each
(θ, z) ∈ Θ × Z, lE ( · ; θ, z) = (l1E ( · ; θ, z), . . . , lqE ( · ; θ, z))T where, for i = 1, . . . , q,
\[
  l_i^{E}(\,\cdot\,;\theta,z) = \Pi\big( l_i(\,\cdot\,;\theta,z) \,\big|\, T_N^{\perp}(\theta,z) \big) .
\]
Here Π(g|A) is the orthogonal projection of g ∈ L20 (Pθz ) onto A ⊆ L20 (Pθz ), and TN⊥ (θ, z)
is the orthogonal complement of TN (θ, z) in L20 (Pθz ). The function lE is called the
efficient score function and we define the efficient information by
\[
  J(\theta,z) = \int_X l^{E}(x;\theta,z)\, l^{E}(x;\theta,z)^T\, p(x)\,\lambda(dx) .
\]
Define
\[
  \phi_p^{\star}(\,\cdot\,) = J(\theta,z)^{-1}\, l^{E}(\,\cdot\,;\theta,z) .
\]
Clearly φ?p satisfies (2.36) and (2.37). We conclude that φ?p is a gradient of φ. Moreover,
φ?p is the canonical gradient (with respect to T2 (p)), since φ?p is in the closure of the span
of the tangent cone.
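As a concrete illustration (a standard textbook case of our own, not worked out in the text), consider the normal model with mean θ as interest parameter and variance z > 0 as nuisance parameter; there the score for θ is already orthogonal to the nuisance scores, so the efficient score coincides with the partial score.

```latex
% Illustrative case (ours): X ~ N(theta, z), theta the interest parameter,
% z > 0 the nuisance. Partial score and nuisance score at (theta, z):
\[
  l(x;\theta,z) = \frac{x-\theta}{z} , \qquad
  l_N(x;\theta,z) = \frac{(x-\theta)^2 - z}{2z^2} .
\]
% The nuisance tangent space is spanned by l_N; by the symmetry of the
% normal distribution E[(X-\theta)^3] = 0, so <l, l_N>_{\theta z} = 0, whence
\[
  l^{E}(x;\theta,z) = l(x;\theta,z) , \qquad
  J(\theta,z) = \frac{1}{z} , \qquad
  \phi_p^{\star}(x) = J^{-1}\, l^{E}(x;\theta,z) = x - \theta .
\]
% The bound I_phi(p) = Cov(phi*) = z is attained by the sample mean.
```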
Note that choosing T2 (p) as the tangent cone, the functional φ is still differentiable
and we obtain a bound, related to the extended Cramér-Rao inequality, that is sharper
than the bound obtained with T1 (p). However, since T2 (p) is not necessarily convex, the
convolution theorem and the local asymptotic minimax theorem cannot be applied.
A third alternative for the tangent cone is
\[
  T_3(p) = \mathrm{span}\{ l_i(\,\cdot\,;\theta,z) : i = 1, \ldots, q \} + T_N^{o}(\theta,z)
  = \left\{ l(\,\cdot\,;\theta,z)^T\alpha + \eta(\,\cdot\,)
  \;:\; \alpha \in I\!R^q ,\ \eta \in T_N^{o}(\theta,z) \right\} .
\]
Clearly T3 (p) is convex; however, the functional φ is not necessarily differentiable with
respect to it. We next introduce an additional assumption on the model that will make φ
differentiable. Suppose that for each α ∈ IRq and each η ∈ TN o (θ, z) there exists a
generalized sequence {zt } = {zt (θ, z)} such that {pt } ⊂ P ∗ , given by
\[
  p_t(\,\cdot\,) = p(\,\cdot\,;\theta + t\alpha, z_t) , \tag{2.38}
\]
is a differentiable path with tangent lT ( · ; θ, z)α + η( · ). This assumption can often be
found in the literature in an implicit form (see for instance Pfanzagl, 1990, page 17, for
the case where q = 1). We prove the differentiability of φ at p with respect to T3 (p) under
(2.38). Given ν( · ) = l( · ; θ, z)T α + η( · ) ∈ T3 (p), and taking a path {pt } as in (2.38), we
obtain
\[
  \frac{\phi(p_t) - \phi(p)}{t} = \frac{\theta + t\alpha - \theta}{t} = \alpha .
\]
On the other hand, since lE is orthogonal to η ∈ TN o (θ, z),
\[
  \int_X J^{-1}(\theta,z)\, l^{E}(x;\theta,z)\,\nu(x)\,p(x)\,\lambda(dx)
  = J^{-1}(\theta,z) \int_X l^{E}(x;\theta,z)\, l^{T}(x;\theta,z)\, p(x)\,\lambda(dx)\,\alpha
  + J^{-1}(\theta,z) \int_X l^{E}(x;\theta,z)\,\eta(x)\,p(x)\,\lambda(dx)
  = \alpha = \lim_{t\downarrow 0} \frac{\phi(p_t) - \phi(p)}{t} .
\]
Hence φ is differentiable at p with respect to T3 (p) and
\[
  \phi_p^{\star}(\,\cdot\,) = J^{-1}(\theta,z)\, l^{E}(\,\cdot\,;\theta,z)
\]
is the canonical gradient. In other words, we obtain the same canonical gradient of
φ whether we work with T2 (p) or with T3 (p), and consequently the extended Cramér-Rao
bound is also the same for the two choices of tangent cone. Note that T3 (p) is convex,
hence we can use the convolution and the local asymptotic minimax theorems. This
provides an additional justification of the extended Cramér-Rao bound (via the convolution
theorem) and an optimality theory involving a larger class of estimators, namely the weakly
regular asymptotic linear estimating sequences (as in the first part of the local asymptotic
minimax theorem) or even arbitrary estimating sequences (as in the second part of the
local asymptotic minimax theorem). However, we pay a price for these improvements: we
have to introduce regularity conditions on the model in order to obtain the differentiability
of the interest parameter functional.
It is common in the literature to take the whole (weak or Hellinger) tangent set as the
tangent cone, to assume that the tangent set is equal to T3 (p), and to use (implicitly)
assumptions equivalent to (2.38) (see Pfanzagl, 1990, page 17). The strength of the
approach based on tangent cones, rather than necessarily on the whole tangent set, is
that it allows us to graduate the regularity conditions: we can avoid the assumptions
mentioned above in the difficult cases, or take full advantage of them in the sufficiently
regular cases. The approach based on tangent cones also allows us to treat cases where
the tangent set is difficult (or virtually impossible) to calculate (see for instance the
semiparametric extended L2 -restricted models considered in chapter 5).
We conclude the chapter with a comment regarding reparametrization. Suppose
that we reparametrize the model by considering the interest parameter g(θ) instead of θ,
where g is a one-to-one differentiable mapping from IRq to IRq . The interest parameter
functional becomes g ◦ φ(Pθz ) = g(θ). An application of proposition 8 and the chain
rule shows that if an estimating sequence {θ̂n } attains the semiparametric Cramér-Rao
bound for estimating θ, then the transformed sequence {g(θ̂n )} attains the Cramér-Rao
bound for estimating g(θ).
Chapter 3

Estimating and Quasi Estimating Functions

3.1 Introduction
In this chapter the theory of estimating functions for semiparametric models is studied.
The basic definitions and properties of estimating functions are given in section 3.2,
where a related notion, called quasi-estimating function, is also introduced. Quasi-
estimating functions are essentially functions of the observations, the interest parameter
and (unlike estimating functions) the nuisance parameter. They provide a clearer way
to formalize the theory of estimating functions and to relate estimating functions to
regular asymptotic linear estimators. In order to construct an optimality theory for
estimating functions, we define a class of what we call regular estimating functions. Two
alternative (and equivalent) characterizations of the regular estimating functions are
provided in subsections 3.2.2 and 3.2.3. The second characterization is motivated by
differential geometric considerations concerning the statistical model (inspired by Amari
and Kawanabe, 1996).
The characterizations referred to are used to derive an optimality theory in section 3.3.
A necessary and sufficient condition for the coincidence of the bound for the concentration
of estimators based on estimating functions and the semiparametric Cramér-Rao bound
is provided in subsection 3.3.3. This condition says essentially that the nuisance tangent
space should not depend on the nuisance parameter.
The last section contains some complementary material. Subsection 3.4.1 studies a
technique for obtaining optimal estimating functions when the likelihood function can
be decomposed in a certain way; in this way an alternative justification for the so-called
principle of conditioning is provided. A generalization of the notion of estimating
function is introduced in subsection 3.4.2. The chapter closes with a result that will allow
us to characterize when the semiparametric Cramér-Rao bound is attained by estimators
derived from regular estimating functions. This result turns out to be useful in the
subsequent chapters.

3.2 Estimating functions and quasi-estimating functions: basic definitions and properties
3.2.1 Estimating and quasi-estimating functions
A function Ψ : X × Θ −→ IRq such that for each θ ∈ Θ the associated function Ψ( · ; θ) :
X −→ IRq is measurable is termed an estimating function. Estimating functions are
used to define sequences of estimators of the parameter of interest θ in the following way.
Under a repeated independent sampling scheme, given a sample x = (x1 , . . . , xn )T of size n
from the (unknown) distribution Pθz ∈ P, define θbn implicitly as the solution of the equation
\[
  \sum_{i=1}^{n} \Psi(x_i; \hat\theta_n) = 0 . \tag{3.1}
\]
Under regularity conditions each θbn is well defined and the sequence {θbn } is consistent (for
estimating θ) and asymptotically normally distributed. We exploit this fact to construct
an optimality theory.
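As a minimal computational sketch (ours; the Ψ below is an arbitrary smooth, bounded example, not one advocated in the text), the estimator θbn is obtained by solving the estimating equation (3.1) numerically:

```python
# Minimal sketch (our illustration): solving the estimating equation (3.1)
# numerically for a scalar theta, with an arbitrary bounded robust-type
# location estimating function Psi(x; theta) = tanh(x - theta).
import numpy as np
from scipy.optimize import brentq

def psi(x, theta):
    return np.tanh(x - theta)               # bounded, strictly decreasing in theta

def solve_estimating_equation(sample):
    g = lambda theta: psi(sample, theta).sum()
    # The sum is strictly decreasing in theta and changes sign on this bracket,
    # so (3.1) has a unique root, found here by Brent's method.
    return brentq(g, sample.min() - 1.0, sample.max() + 1.0)

rng = np.random.default_rng(2)
x = rng.standard_t(df=3, size=500) + 1.0    # true location 1.0, heavy-tailed noise
print(solve_estimating_equation(x))         # close to 1.0
```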
We introduce next a notion related to estimating functions. A function Ψ : X × Θ ×
Z −→ IRq , of the parameters and the observations, such that for each θ ∈ Θ and each
z ∈ Z the function Ψ( · ; θ, z) : X −→ IRq is measurable, is called a quasi-estimating
function. Each estimating function can be naturally identified with a quasi-estimating
function, namely the quasi-estimating function that is constant in the nuisance parameter.
We make no distinction between estimating functions and the corresponding quasi-
estimating functions; this abuse of language causes, in general, no risk of ambiguity.
A quasi-estimating function Ψ : X × Θ × Z −→ IRq such that the conditions (3.2)-
(3.6) below are satisfied is said to be a regular quasi-estimating function. The conditions
are, with ψi denoting the ith component of Ψ, and for all θ0 ∈ Θ, all z ∈ Z and all
i, j ∈ {1, . . . , q}:
\[
  \psi_i(\,\cdot\,;\theta_0,z) \in L_0^2(P_{\theta_0 z}) ; \tag{3.2}
\]
the partial derivative with respect to θ is well defined (almost everywhere), i.e.
\[
  \frac{\partial}{\partial\theta^j}\, \psi_i(\,\cdot\,;\theta,z)\Big|_{\theta=\theta_0} \text{ exists} ; \tag{3.3}
\]
the order of differentiation with respect to θ and integration can be exchanged, in the
sense that
\[
  \frac{\partial}{\partial\theta^j} \int_X \psi_i(x;\theta,z)\,p(x;\theta,z)\,\lambda(dx)\Big|_{\theta=\theta_0}
  = \int_X \frac{\partial}{\partial\theta^j} \big[ \psi_i(x;\theta,z)\,p(x;\theta,z) \big]_{\theta=\theta_0}\,\lambda(dx) ; \tag{3.4}
\]

the following q × q matrix is nonsingular:
\[
  E_{\theta_0 z}\{ \nabla_\theta \Psi(\,\cdot\,;\theta,z) \}
  = \left[ \int_X \frac{\partial}{\partial\theta^j}\, \psi_i(x;\theta,z)\Big|_{\theta=\theta_0}\,
  p(x;\theta_0,z)\,\lambda(dx) \right]_{i,j=1,\ldots,q} ; \tag{3.5}
\]
and
\[
  E_{\theta_0 z}\{ \Psi(\,\cdot\,;\theta_0,z)\,\Psi^T(\,\cdot\,;\theta_0,z) \}
  = \left[ \int_X \psi_i(x;\theta_0,z)\,\psi_j(x;\theta_0,z)\,
  p(x;\theta_0,z)\,\lambda(dx) \right]_{i,j=1,\ldots,q} \tag{3.6}
\]
is positive definite.
It is presupposed that the parametric partial score function is a regular quasi-estimating
function.
A regular quasi-estimating function that does not depend on the nuisance parameter
z is said to be a regular estimating function.
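To fix ideas, here is a simple example of our own (not from the text): in a semiparametric location model the centered observation itself is a regular estimating function.

```latex
% Illustrative example (ours): the semiparametric location model
%   p(x; theta, z) = z(x - theta),
% where the nuisance z ranges over the continuous, everywhere positive
% densities with mean 0 and finite, positive variance sigma^2(z).
% Take Psi(x; theta) = x - theta. Then, for every theta and z,
\[
  E_{\theta z}\{\Psi(\,\cdot\,;\theta)\} = 0 , \qquad
  E_{\theta z}\{\nabla_\theta \Psi(\,\cdot\,;\theta)\} = -1 , \qquad
  E_{\theta z}\{\Psi^2(\,\cdot\,;\theta)\} = \sigma^2(z) \in (0,\infty) ,
\]
% so (3.2), (3.5) and (3.6) hold, and (3.3)-(3.4) are immediate; since Psi
% does not involve z, it is a regular estimating function.
```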

3.2.2 First characterization of regular estimating functions


In this section we give a characterization of the class of regular estimating functions.

Proposition 9 Let Ψ : X × Θ × Z −→ IRq be a regular quasi-estimating function with
components ψ1 , . . . , ψq . For all (θ, z) ∈ Θ × Z and i ∈ {1, . . . , q},
\[
  \psi_i(\,\cdot\,;\theta,z) \in T_N^{\perp}(\theta,z) .
\]
Here, and in the rest of this chapter, TN (θ, z) stands for the L2 nuisance tangent space
T²N (θ, z), and TN⊥ (θ, z) is its orthogonal complement in L20 (Pθz ).

Proof: Take (θ, z) ∈ Θ × Z and i ∈ {1, . . . , q} fixed, and ν ∈ TN o (θ, z) arbitrary.
We prove that ν and ψi ( · ; θ, z) are orthogonal in L2 (Pθz ); this implies the proposition,
because of the continuity of the inner product.
Let {pt }t∈V be a differentiable path at (θ, z) with tangent ν and remainder term
{rt }t∈V . Using (2.1), for each t ∈ V ,
\[
\begin{aligned}
  \langle \nu(\,\cdot\,), \psi_i(\,\cdot\,;\theta,z) \rangle_{\theta z}
  &= \big\langle \big[ \{p_t(\,\cdot\,) - p(\,\cdot\,;\theta,z)\} / \{t\,p(\,\cdot\,;\theta,z)\} \big]
     - r_t(\,\cdot\,),\ \psi_i(\,\cdot\,;\theta,z) \big\rangle_{\theta z} \\
  &= \frac{1}{t}\int_X \psi_i(x;\theta,z)\,p_t(x)\,\lambda(dx)
    - \frac{1}{t}\int_X \psi_i(x;\theta,z)\,p(x;\theta,z)\,\lambda(dx)
    - \langle r_t(\,\cdot\,), \psi_i(\,\cdot\,;\theta,z) \rangle_{\theta z} \\
  &= - \langle r_t(\,\cdot\,), \psi_i(\,\cdot\,;\theta,z) \rangle_{\theta z} .
\end{aligned}
\]
Since rt −→ 0 in L2 (Pθz ), the continuity of the inner product yields
\[
  \langle \nu(\,\cdot\,), \psi_i(\,\cdot\,;\theta,z) \rangle_{\theta z} = 0 . \qquad \Box
\]

If the quasi-estimating function does not depend on the nuisance parameter (i.e. it
corresponds to a genuine estimating function), then we can obtain a sharper result.

Proposition 10 Let Ψ : X × Θ −→ IRq be a regular estimating function with components
ψ1 , . . . , ψq . For all θ ∈ Θ and i ∈ {1, . . . , q},
\[
  \psi_i(\,\cdot\,;\theta) \in \bigcap_{z\in\mathcal{Z}} T_N^{\perp}(\theta,z) .
\]
In fact, the proposition holds for the whole class of quasi-estimating functions whose
expectation is invariant with respect to the nuisance parameter.

Proof: Take θ ∈ Θ and i ∈ {1, . . . , q} fixed, and arbitrary ξ ∈ Z and ν ∈ TN o (θ, ξ). We
prove that ν and ψi ( · ; θ) are orthogonal in L2 (Pθξ ).
Let {pt }t∈V be a differentiable path at (θ, ξ) with tangent ν and remainder term
{rt }t∈V . Using (2.1) and the fact that ψi ( · ; θ) has expectation zero under each element
of the path, we obtain, for each t ∈ V ,
\[
  \langle \nu(\,\cdot\,), \psi_i(\,\cdot\,;\theta) \rangle_{\theta\xi}
  = - \langle r_t(\,\cdot\,), \psi_i(\,\cdot\,;\theta) \rangle_{\theta\xi} .
\]
Since rt −→ 0 in L2 (Pθξ ), the continuity of the inner product yields
\[
  \langle \nu(\,\cdot\,), \psi_i(\,\cdot\,;\theta) \rangle_{\theta\xi} = 0 . \qquad \Box
\]

3.2.3 Amari-Kawanabe’s geometric characterization of regular estimating functions
We present in this section a variant of the geometric theory of estimating functions for
semiparametric models given in Amari and Kawanabe (1996). The development presented
is closely connected with the theory given in that paper; however, it is not exactly the
same. We point out the most remarkable differences at the end of the section.
Take (θ, z) ∈ Θ × Z fixed. Given a ∈ L20 (Pθz ) and z∗ ∈ Z, denote p( · ; θ, z) and
p( · ; θ, z∗ ) by p( · ) and p∗ ( · ) respectively, and define the m-parallel transport of a from z
to z∗ by
\[
  \Pi^{(m)}_{z z_*}\, a(\,\cdot\,) = \frac{p(\,\cdot\,)}{p_*(\,\cdot\,)}\, a(\,\cdot\,) .
\]

If a possesses a finite expectation under Pθz∗ we define the e-parallel transport of a from z
to z∗ by
\[
  \Pi^{(e)}_{z z_*}\, a(\,\cdot\,) = a(\,\cdot\,) - \int_X a(x)\,p(x;\theta,z_*)\,\lambda(dx) .
\]

The basic properties of the m- and e-parallel transports are given next.

Proposition 11 We have, for each z, z∗ ∈ Z and each a, b ∈ L20 (p) ∩ L20 (p∗ ):
\[
  \int_X \Pi^{(m)}_{z z_*} b(x)\, p_*(x)\,\lambda(dx)
  = \int_X \Pi^{(e)}_{z z_*} b(x)\, p_*(x)\,\lambda(dx) = 0 ; \tag{3.7}
\]
\[
  \langle a, \Pi^{(m)}_{z z_*} b \rangle_{\theta z_*} = \langle a, b \rangle_{\theta z} ; \tag{3.8}
\]
\[
  \langle \Pi^{(e)}_{z z_*} a, \Pi^{(m)}_{z z_*} b \rangle_{\theta z_*} = \langle a, b \rangle_{\theta z} ; \tag{3.9}
\]
and
\[
  \Pi^{(m)}_{z_* z} \Pi^{(m)}_{z z_*}\, a(\,\cdot\,)
  = \Pi^{(e)}_{z_* z} \Pi^{(e)}_{z z_*}\, a(\,\cdot\,)
  = \Pi^{(m)}_{z z}\, a(\,\cdot\,)
  = \Pi^{(e)}_{z z}\, a(\,\cdot\,) = a(\,\cdot\,) . \tag{3.10}
\]
Proof: Straightforward from the definitions of the parallel transports. □
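A small numerical sanity check (ours) of the transport identity (3.8), with the densities represented on a grid:

```python
# Numerical check (our illustration) of property (3.8):
# <a, Pi_m b>_{theta z*} = <a, b>_{theta z}, for densities on a grid.
import numpy as np

x = np.linspace(-12.0, 12.0, 100_001)
dx = x[1] - x[0]
gauss = lambda m, s: np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

p, p_star = gauss(0.0, 1.0), gauss(0.5, 1.5)   # p(.; theta, z) and p(.; theta, z*)
a = np.sin(x)                                   # two square-integrable functions
b = x * np.exp(-np.abs(x))

Pi_m_b = (p / p_star) * b                       # m-parallel transport of b from z to z*

lhs = (a * Pi_m_b * p_star).sum() * dx          # <a, Pi_m b>_{theta z*}
rhs = (a * b * p).sum() * dx                    # <a, b>_{theta z}
print(lhs, rhs)                                 # agree up to grid error
```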

The parallel transports defined above have their origin in differential geometric consid-
erations for statistical parametric models (α-connections). We will not enter into the
details of the geometric theory for semiparametric models, but refer instead to Amari and
Kawanabe (1996) for an informal discussion. The parallel transports permit us to change
the inner product (see (3.8)), i.e. they permit us to move from one L2 space to another,
keeping to a certain extent the structure given by the inner product of the first space. For
instance, L2 orthogonality (i.e. noncorrelation) is preserved under m-parallel transport.
From the statistical viewpoint the e- and the m-parallel transports correspond to correcting
for the mean and correcting for the distribution, respectively, when we move from one L2
space to another.
The following class of functions will be of interest in the theory of estimating functions:
\[
  F_{IA}(\theta,z) = \left\{ r \in T_N^{\perp}(\theta,z) \;:\;
  \forall z_* \in \mathcal{Z} \text{ and } \forall \nu_* \in T_N(\theta,z_*),\
  \langle \Pi^{(m)}_{z_* z}\,\nu_*,\ r \rangle_{L^2(P_{\theta z})} = 0 \right\} .
\]

When the e-parallel transport is well defined one can alternatively use the relation
⟨ν∗ , Π(e)zz∗ r⟩L2 (Pθz∗ ) = 0 instead of ⟨Π(m)z∗z ν∗ , r⟩L2 (Pθz ) = 0. Informally, FIA is the
class of functions r in TN⊥ (θ, z) such that r, corrected for the mean or corrected for the
distribution, is orthogonal to each TN (θ, z∗ ) under Pθz∗ (with z∗ ranging over the whole
of Z).

Proposition 12 For each (θ, z) ∈ Θ × Z, FIA (θ, z) is a closed subspace of L20 (Pθz ).

Proof: The linearity and the continuity of ⟨Π(m)z∗z ν∗ , ( · )⟩L2 (Pθz ) imply that FIA (θ, z) is
a vector subspace and a closed subset of L2 (Pθz ), respectively. □

The following proposition gives a characterization of regular estimating functions in
terms of the classes of functions FIA .

Proposition 13 Given a regular estimating function Ψ with components ψ1 , . . . , ψq , we
have, for i = 1, . . . , q, for all θ ∈ Θ and all z ∈ Z,
\[
  \psi_i(\,\cdot\,;\theta) \in F_{IA}(\theta,z) .
\]
Proof: Take i ∈ {1, . . . , q}, θ ∈ Θ and z ∈ Z fixed. Given any z∗ ∈ Z and ν∗ ∈ TN (θ, z∗ ),
we have from proposition 10 that ψi ( · ; θ) ∈ TN⊥ (θ, z∗ ), and then, by (3.8),
\[
  \langle \Pi^{(m)}_{z_* z}\,\nu_*,\ \psi_i(\,\cdot\,;\theta) \rangle_{L^2(P_{\theta z})}
  = \langle \nu_*, \psi_i(\,\cdot\,;\theta) \rangle_{L^2(P_{\theta z_*})} = 0 .
\]
Since z∗ was chosen arbitrarily, ψi ( · ; θ) ∈ FIA (θ, z). □

The proposition above could easily be sharpened to place the components of regular
estimating functions in the intersection (over the nuisance parameter) of the FIA 's.
However, the following proposition shows that this is in fact not necessary, since FIA
does not depend on the nuisance parameter; we will therefore sometimes use the notation
FIA (θ).

Proposition 14 For all θ ∈ Θ and all z ∈ Z we have
\[
  F_{IA}(\theta,z) = \bigcap_{z_* \in \mathcal{Z}} T_N^{\perp}(\theta,z_*) .
\]
Here TN⊥ (θ, z∗ ) denotes the orthogonal complement of TN (θ, z∗ ) in L20 (Pθz∗ ).
Proof:
'⊆': Take η ∈ FIA (θ, z), z∗ ∈ Z arbitrary and ν∗ ∈ TN (θ, z∗ ). Applying (3.8) yields
\[
  \langle \nu_*, \eta \rangle_{L^2(P_{\theta z_*})}
  = \langle \Pi^{(m)}_{z_* z}\,\nu_*,\ \eta \rangle_{L^2(P_{\theta z})} = 0 .
\]
Hence η ∈ TN⊥ (θ, z∗ ). Since z∗ was chosen arbitrarily in Z, η ∈ ∩z∗ ∈Z TN⊥ (θ, z∗ ).
'⊇': Take η ∈ ∩z∗ ∈Z TN⊥ (θ, z∗ ), z∗ ∈ Z arbitrary and ν∗ ∈ TN (θ, z∗ ). Using (3.8) we
obtain
\[
  \langle \Pi^{(m)}_{z_* z}\,\nu_*,\ \eta \rangle_{L^2(P_{\theta z})}
  = \langle \nu_*, \eta \rangle_{L^2(P_{\theta z_*})} = 0 .
\]
Since z∗ is arbitrary, η ∈ FIA (θ, z). □

Proposition 14 shows that the characterization of regular estimating functions obtained
here is equivalent to the one obtained in the last section. We remark that the charac-
terization based on the intersection of the nuisance tangent spaces can be found in
Jørgensen and Labouriau (1995), and the characterization based on parallel transports
(i.e. based on FIA ) is a variant of the results of Amari and Kawanabe (1996). The
main difference between the variant presented here and the original formulation in Amari
and Kawanabe (1996) is that here we define FIA via the m-parallel transport, whereas
there FIA is constructed through the e-parallel transport. The two formulations are
equivalent from this point of view, provided the e-parallel transport is well defined.
Moreover, when FIA is defined via the m-parallel transport, the class is automatically a
closed subspace of L20 .

3.3 Optimality theory for estimating functions


3.3.1 Classic optimality theory
Given a regular quasi-estimating function Ψ (in particular, a regular estimating function)
we define the Godambe information of Ψ by JΨ : Θ × Z −→ IRq×q , where for each θ ∈ Θ
and each z ∈ Z,
\[
  J_\Psi(\theta,z)
  = E_{\theta z}\{ \nabla_\theta \Psi(\,\cdot\,;\theta,z) \}\,
    E_{\theta z}\{ \Psi(\,\cdot\,;\theta,z)\,\Psi^T(\,\cdot\,;\theta,z) \}^{-1}\,
    E_{\theta z}\{ \nabla_\theta \Psi(\,\cdot\,;\theta,z) \}^T
  = S_\Psi(\theta,z)\, V_\Psi^{-1}(\theta,z)\, S_\Psi^T(\theta,z) .
\]
Here
\[
  S_\Psi(\theta,z) := E_{\theta z}\{ \nabla_\theta \Psi(\,\cdot\,;\theta,z) \}
  \quad\text{and}\quad
  V_\Psi(\theta,z) := E_{\theta z}\{ \Psi(\,\cdot\,;\theta,z)\,\Psi^T(\,\cdot\,;\theta,z) \}
\]
are called the sensitivity and the variability of Ψ (at (θ, z)), respectively.
Using standard arguments based on a Taylor expansion of Ψ it can be shown that,
under some additional regularity conditions (each ψi twice continuously differentiable with
respect to each component of θ, for instance), a sequence {θbn } of roots of a regular esti-
mating function is asymptotically normally distributed with asymptotic covariance given
by JΨ−1 (θ, z), provided {θbn } is weakly consistent (see Jørgensen and Labouriau, 1995,
for conditions for consistency and asymptotic normality). Hence, we say that a regular
estimating function Ψ is optimal when, for all θ ∈ Θ, all z ∈ Z and each regular
estimating function Φ,
\[
  J_\Phi(\theta,z) \;\le\; J_\Psi(\theta,z) .
\]
Here “≤” is understood in the sense of the Löwner partial order of matrices, given by the
positive definiteness of the difference.
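As a small computational sketch (ours; the model and Ψ below are illustrative choices, not taken from the text), the sensitivity, variability and Godambe information of an estimating function can be approximated by Monte Carlo:

```python
# Monte Carlo sketch (our illustration): sensitivity S, variability V and
# Godambe information J = S V^{-1} S^T for the scalar estimating function
# Psi(x; theta) = x - theta under P_{theta z} = N(theta, z).
import numpy as np

rng = np.random.default_rng(3)
theta, z, m = 1.0, 2.0, 1_000_000
x = rng.normal(theta, np.sqrt(z), size=m)   # draws from P_{theta z}

psi  = x - theta                             # Psi(x; theta)
dpsi = np.full(m, -1.0)                      # d/dtheta Psi(x; theta)

S = dpsi.mean()                              # E{grad_theta Psi} = -1
V = (psi**2).mean()                          # E{Psi^2} = z
J = S * (1.0 / V) * S                        # Godambe information, ~ 1/z
print(J, 1.0 / z)
# Here Psi is proportional to the score for theta, so J coincides with the
# Fisher information 1/z, and the root of (3.1) (the sample mean) attains it.
```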
In the literature on estimating functions it is customary to claim that the use of some
estimators can be justified by finite sample arguments via estimating functions and the
Godambe information (see the articles of Godambe referred to). The argument used
there is that the Godambe information is a quantity that should be maximized when
using estimating functions. We do not share this point of view. The estimating functions
themselves are not the object of our direct interest; our concern with estimating functions
is only through the estimators (or inferential procedures) associated with them. Hence
one should judge estimating functions only through the properties of such inferential
procedures. In fact, apart from the asymptotic variance, there are no clear connections
between the Godambe information and the (asymptotic or finite sample) properties of the
estimators associated with regular estimating functions.
We say that two regular quasi-estimating functions Ψ, Φ : X × Θ × Z −→ IR^q are
equivalent if for each θ ∈ Θ and z ∈ Z there exists a q × q matrix K(θ, z) with full rank
such that

Ψ(x; θ, z) = K(θ, z) Φ(x; θ, z) ,  Pθz-a.s. .

We stress that K(θ, z) must not depend on the observation x. Clearly, two equivalent
estimating functions have the same roots almost surely and hence produce essentially the
same estimators, i.e. they are equivalent from the statistical point of view. Moreover,
it is easy to see that two equivalent quasi-estimating functions share the same Godambe
information for each value of the parameters.

3.3.2 Lower bound for the asymptotic covariance of estimators


obtained through estimating functions
We define the information score function, l^I : X × Θ × Z −→ IR^q, as the orthogonal
projection of the partial score function l onto FIA(θ). More precisely, for each θ ∈ Θ
and z ∈ Z, the ith component of the information score function (i = 1, . . . , q) at (θ, z) is
given by

l_i^I(·; θ, z) = Π{l_i(·; θ, z) | FIA(θ)} .
The space spanned by the components l_1^I, . . . , l_q^I of the information score function at
(θ, z) ∈ Θ × Z is denoted by E(θ, z), i.e.

E(θ, z) = span{l_i^I(·; θ, z) : i = 1, . . . , q} .

Note that E(θ, z) is a closed subspace of L²₀(Pθz) (being a finite-dimensional vector
space). Hence, given any regular estimating function Ψ : X × Θ × Z −→ IR^q with
components ψ_1, . . . , ψ_q we have, for all θ ∈ Θ, z ∈ Z and i ∈ {1, . . . , q}, the orthogonal
decomposition

ψ_i(·; θ, z) = ψ_i^A(·; θ, z) + ψ_i^I(·; θ, z) ,   (3.11)

where ψ_i^I(·; θ, z) ∈ E(θ, z) and ψ_i^A(·; θ, z) ∈ A(θ, z) := E^⊥(θ, z). Here A(θ, z) is the
orthogonal complement of E(θ, z) in L²₀(Pθz). The decomposition above induces the
following decomposition of each regular quasi-estimating function:

Ψ(·; θ, z) = Ψ^A(·; θ, z) + Ψ^I(·; θ, z) ,   (3.12)

where the components ψ_1^A(·; θ, z), . . . , ψ_q^A(·; θ, z) of Ψ^A at (θ, z) are in A(θ, z) and the
components ψ_1^I(·; θ, z), . . . , ψ_q^I(·; θ, z) of Ψ^I at (θ, z) are in E(θ, z).
We show next that taking the "component" Ψ^I of a regular (quasi-)estimating function
improves the Godambe information. However, at this stage a technical difficulty appears:
the function Ψ^I is not necessarily a regular quasi-inference function, and hence does not
necessarily possess a well-defined Godambe information. For this reason we next introduce
an extension of the notion of sensitivity, and consequently of the Godambe information,
which will enable us to speak of the Godambe information of some non-regular
(quasi-)estimating functions. To motivate the extended notion of sensitivity, consider a regular
estimating function Ψ : X × Θ × Z −→ IR^q. We characterize the sensitivity of Ψ in an
alternative form that will suggest the extension one should adopt. For each (θ, z) ∈ Θ × Z
and each i, j ∈ {1, . . . , q} we have
0 = ∂/∂θ_i ∫_X ψ_j(x; θ, z) p(x; θ, z) dµ(x)   (3.13)

(differentiating under the integral sign)

  = ∫_X ∂/∂θ_i {ψ_j(x; θ, z) p(x; θ, z)} dλ(x)

  = ∫_X ∂/∂θ_i {ψ_j(x; θ, z)} p(x; θ, z) dλ(x) + ∫_X ψ_j(x; θ, z) ∂/∂θ_i {p(x; θ, z)} dλ(x) .

Hence

∫_X ∂/∂θ_i {ψ_j(x; θ, z)} p(x; θ, z) dλ(x)
  = − ∫_X ψ_j(x; θ, z) ∂/∂θ_i {p(x; θ, z)} dλ(x)
  = − ∫_X ψ_j(x; θ, z) l_i(x; θ, z) p(x; θ, z) dλ(x) = −⟨ψ_j(·; θ, z), l_i(·; θ, z)⟩_θz

(decomposing l_i = l_i^A + l_i^I with l_i^I ∈ FIA and l_i^A ∈ TN)

  = −⟨ψ_j(·; θ, z), l_i^I(·; θ, z)⟩_θz − ⟨ψ_j(·; θ, z), l_i^A(·; θ, z)⟩_θz

(since ψ_j ∈ FIA and l_i^A is orthogonal to FIA)

  = −⟨ψ_j(·; θ, z), l_i^I(·; θ, z)⟩_θz

(decomposing ψ_j = ψ_j^A + ψ_j^I and using the orthogonality of l_i^I and ψ_j^A)

  = −⟨ψ_j^I(·; θ, z), l_i^I(·; θ, z)⟩_θz .

We conclude that the sensitivity of Ψ at (θ, z) is given by

S_Ψ(θ, z) = [ −⟨ψ_j^I(·; θ, z), l_i^I(·; θ, z)⟩_θz ]_{i=1,...,q}^{j=1,...,q} .   (3.14)

Here [a_ij]_{i=1,...,q}^{j=1,...,q} denotes the matrix formed by the a_ij's, with i indexing the
columns and j indexing the rows.
We define the extended sensitivity (or simply the sensitivity) of Ψ as the matrix on
the right-hand side of (3.14). The (extended) Godambe information is defined in the
same way as before, but using the extended sensitivity instead of the sensitivity. Note
that the standard and the extended versions of the sensitivity (and of the Godambe
information) coincide in the case where Ψ is regular. Moreover, the extended sensitivity
is defined for every quasi-estimating function whose components are in L²₀, not only for
regular estimating functions. According to the new definition both Ψ and Ψ^I possess the
same sensitivity.

Proposition 15 Given a regular inference function Ψ, for all θ ∈ Θ and all z ∈ Z,

JΨ (θ, z) ≤ JΨI (θ, z) .

Proof: For each θ ∈ Θ and z ∈ Z, since Ψ^A and Ψ^I are componentwise orthogonal,
V_Ψ(θ, z) = V_{Ψ^I}(θ, z) + V_{Ψ^A}(θ, z), and S_Ψ(θ, z) = S_{Ψ^I}(θ, z) by the definition of
the extended sensitivity. Hence

J_Ψ^{-1}(θ, z) = S_Ψ^{-1}(θ, z) V_Ψ(θ, z) S_Ψ^{-T}(θ, z)
             = S_{Ψ^I}^{-1}(θ, z) {V_{Ψ^I}(θ, z) + V_{Ψ^A}(θ, z)} S_{Ψ^I}^{-T}(θ, z)
             = S_{Ψ^I}^{-1}(θ, z) V_{Ψ^I}(θ, z) S_{Ψ^I}^{-T}(θ, z) + S_{Ψ^I}^{-1}(θ, z) V_{Ψ^A}(θ, z) S_{Ψ^I}^{-T}(θ, z)
             ≥ S_{Ψ^I}^{-1}(θ, z) V_{Ψ^I}(θ, z) S_{Ψ^I}^{-T}(θ, z) = J_{Ψ^I}^{-1}(θ, z) . □

The following proposition gives further properties of regular inference functions, which
will allow us to establish an upper bound for the Godambe information.

Proposition 16 Given a regular inference function Ψ, for all θ ∈ Θ and all z ∈ Z, we


have:

(i) ΨI ∼ lI ;

(ii) span{Ψ_i^I(·; θ, z) : i = 1, . . . , q} = E(θ, z);

(iii) JΨI (θ, z) = JlI (θ, z).

Proof: Take θ ∈ Θ and z ∈ Z fixed.
(i) Assume without loss of generality that the components l_1^I(·; θ, z), . . . , l_q^I(·; θ, z) of
the information score function are orthonormal in L²₀(Pθz). For each i ∈ {1, . . . , q},
expanding ψ_i(·; θ, z) in a Fourier series with respect to a basis whose first q elements are
l_1^I(·; θ, z), . . . , l_q^I(·; θ, z) one obtains

ψ_i(·; θ, z) = ⟨l_1^I(·; θ, z), ψ_i(·; θ, z)⟩_θz l_1^I(·; θ, z)
             + · · · + ⟨l_q^I(·; θ, z), ψ_i(·; θ, z)⟩_θz l_q^I(·; θ, z) + ψ_i^A(·; θ, z) .

That is,

ψ_i^I(·; θ, z) = ⟨l_1^I(·; θ, z), ψ_i(·; θ, z)⟩_θz l_1^I(·; θ, z)   (3.15)
              + · · · + ⟨l_q^I(·; θ, z), ψ_i(·; θ, z)⟩_θz l_q^I(·; θ, z) .

Moreover, for j = 1, . . . , q,

⟨l_j^I(·; θ, z), ψ_i(·; θ, z)⟩_θz = − ∫_X ∂/∂θ_j {ψ_i(x; θ, z)} p(x; θ, z) dλ(x) .   (3.16)

We conclude from (3.15) and (3.16) that Ψ^I(·; θ, z) = −S_Ψ(θ, z) l^I(·; θ, z), which means
that Ψ^I and l^I are equivalent, since the sensitivity is by assumption of full rank.
(ii) From the previous discussion, span{Ψ_i^I(·; θ, z) : i = 1, . . . , q} is the space spanned by
the components of −S_Ψ(θ, z) l^I(·; θ, z), which is span{l_i^I(·; θ, z) : i = 1, . . . , q}, again
because the sensitivity is of full rank.
(iii) Straightforward. □

A consequence of the last two propositions is that J_{l^I} is an upper bound for the
Godambe information of regular quasi-inference functions. This upper bound is attained by
any (if any exists) extended regular inference function with components in E. In particular,
if l^I is a regular (quasi-)inference function, then it is an optimal (quasi-)inference
function.

3.3.3 Attainability of the semiparametric Cramér-Rao bound


We study in this section the attainability of the semiparametric Cramér-Rao bound
through regular estimating functions. More precisely, we give a necessary and sufficient
condition for the coincidence of the semiparametric Cramér-Rao bound and the bound
given in the previous section for the asymptotic variance of estimators derived from regular
estimating functions.
Let us consider the interest parameter functional Φ : P* −→ Θ given, for each
(θ, z) ∈ Θ × Z, by

Φ(p(·; θ, z)) = θ .

As shown in chapter 2, the functional Φ is differentiable at each p(·; θ, z) := p(·) ∈
P*, with respect to the tangent cone T(p) = TN(θ, z) ∪ span{l_i(·; θ, z) : i = 1, . . . , q}.
Here we adopt the L² path differentiability, since this is the path differentiability used
to characterize the class of regular estimating functions. We stress that the theory of
functional differentiability used here is compatible with any notion of path differentiability
stronger than (or equal to) the weak path differentiability; in particular the L² path
differentiability is allowed. Moreover, in the examples we have in mind (L²-restricted
models) the notions of weak and L² path differentiability coincide. Take a fixed p(·; θ, z) =
p(·) in P*. Consider the function

φ_p^•(·) = Cov_p(l^I)^{-1} l^I(·; θ, z) .

The following lemma will allow us to connect the optimality theory for estimating func-
tions with the semiparametric Cramér-Rao lower bound.

Lemma 1 The function φ•p ( · ) is a gradient of Φ at p.

Proof: A little reflection reveals that it is enough to verify that

∀ν ∈ TN^0(θ, z), ∫_X φ_p^•(x) ν(x) p(x) λ(dx) = 0   (3.17)

and

∫_X φ_p^•(x) l^T(x; θ, z) p(x) λ(dx) = I_q ,   (3.18)

where I_q is the q × q identity matrix.
Take ν ∈ TN^0(θ, z). Since each component of l^I(·; θ, z) is in FIA(θ, z) ⊆ TN^⊥(θ, z),
condition (3.17) holds. On the other hand,

∫_X φ_p^•(x) l^T(x; θ, z) p(x) λ(dx) = Cov_p(l^I)^{-1} ∫_X l^I(x; θ, z) l^T(x; θ, z) p(x) λ(dx)
  = Cov_p(l^I)^{-1} ∫_X l^I(x; θ, z) l^I(x; θ, z)^T p(x) λ(dx)
  = I_q ,

that is, condition (3.18) holds. We conclude that φ_p^• is a gradient of Φ at p with respect
to the tangent cone T(p). □

According to the lemma above φ_p^• is a gradient of Φ at p, but not necessarily the
canonical gradient. In fact the canonical gradient of the functional Φ at p with respect to
T(p) is

Φ_p^*(·) = J^{-1}(θ, z) l^E(·; θ, z) ,

where l^E(·; θ, z) is the efficient score function at (θ, z) and J(θ, z) is the covariance
matrix of l^E(·; θ, z) under Pθz (see chapter 2). The uniqueness of the canonical gradient
implies that φ_p^• is the canonical gradient if and only if it is equal to Φ_p^*, and this occurs
if and only if TN^⊥(θ, z) = FIA(θ). The covariance of Φ_p^*(·) under Pθz, that is J^{-1}(θ, z),
gives the semiparametric Cramér-Rao lower bound. On the other hand, the lower bound
for the asymptotic covariance of estimators obtained from regular estimating functions is
the covariance (under Pθz) of φ_p^•. We conclude that the following result holds.

Theorem 8 The semiparametric Cramér-Rao lower bound coincides with the bound for
the asymptotic covariance of estimators defined through regular estimating functions at
(θ, z) ∈ Θ × Z if and only if

TN^⊥(θ, z) = FIA(θ) (= ∩_{z*∈Z} TN^⊥(θ, z*)) .

The theorem above implies that estimating functions produce efficient estimators only if
the orthogonal complement of the nuisance tangent space does not depend on the nuisance
parameter. Examples of this situation are the L²-restricted models and the partial
parametric models considered in chapters 5 and 6.

3.4 Further aspects


3.4.1 Optimal estimating functions via conditioning
We present in this section some results which allow us to compute optimal estimating
functions in many practical situations. The results are in accordance with the so-called
conditioning principle. For the sake of simplicity we study here only models with
a one-dimensional parameter of interest.
We study the situation where one has a likelihood factorization of the following form.
Suppose that there exists a statistic T = t(X) such that, for all θ ∈ Θ, all z ∈ Z and all
x ∈ X,

p(x; θ, z) = f_t(x; θ) h{t(x); θ, z} .   (3.19)

Theorem 9 Assume that there exists a statistic T such that one has the decomposition
(3.19). Moreover, suppose that the class {P^t_θz : z ∈ Z}, where P^t_θz is the distribution of
T(X) under Pθz (i.e. X ∼ Pθz), is complete. Then the regular inference function given by

Ψ(x; θ) = ∂/∂θ log f_t(x; θ) , ∀x ∈ X, ∀θ ∈ Θ   (3.20)

is optimal. Moreover, if Φ is also an optimal inference function then Φ is equivalent to
Ψ.

The theorem above gives an alternative justification for the use of conditional inference.
The following technical (and trivial) lemma will be the kernel of the proofs that follow.
But first it is convenient to introduce the following notation. Given a regular inference
function Ψ : X × Θ −→ IR, we define

Ψ̃(x; θ) = Ψ(x; θ) / E_θz{Ψ′(θ)} ,

which is called the standardized version of Ψ. Here Ψ′(θ) = ∂Ψ(θ)/∂θ. Throughout this
section we denote the class of all regular estimating functions by G.

Lemma 2 For all regular inference functions Ψ, Φ : X × Θ −→ IR and each θ ∈ Θ
and z ∈ Z, the following assertions hold:

(i) E_θz{Ψ(θ) l(θ; z)} / E_θz{Ψ′(θ)} = −1 ,

where l(θ; z) is the partial score function at (θ, z);

(ii) E_θz{Φ̃²(θ)} = E_θz[{Φ̃(θ) − Ψ̃(θ)}²] + 2 E_θz{Φ̃(θ)Ψ̃(θ)} − E_θz{Ψ̃²(θ)} .

Proof: Since Ψ is unbiased, one has

∫ Ψ(x; θ) p(x; θ, z) dµ(x) = 0 .

Differentiating the expectation above with respect to θ and interchanging the order of
differentiation and integration, we obtain

E_θz{∂Ψ(θ)/∂θ} + E_θz{Ψ(θ) l(θ; z)} = 0 ,

which is equivalent to the first part of the lemma. The second part is straightforward. □

The following lemma gives a useful tool for computing optimal inference functions.

Lemma 3 Assume the previous regularity conditions. Consider two functions A : Θ −→
IR\{0} and R : X × Θ × Z −→ IR. Suppose that, for each regular inference function Φ,
one has, for each θ ∈ Θ and z ∈ Z,

∫ R(x; θ, z) Φ(x; θ) p(x; θ, z) dµ(x) = 0 .

If a regular inference function Ψ can be written in the form, for all θ ∈ Θ,

Ψ(x; θ) = A(θ) l(x; θ, z) + R(x; θ, z) ,   (3.21)

for x Pθz-almost surely (Ψ does not depend on z even though l and R do), then Ψ is
optimal. Furthermore, provided a decomposition as in (3.21) exists, a regular inference
function Φ is optimal if and only if for all (θ, z) ∈ Θ × Z,

Φ̃(θ) = Ψ̃(θ) , for x Pθz-almost surely.

Proof: Take an arbitrary (θ, z) ∈ Θ × Z. Given Φ ∈ G one has

E_θz{Φ̃(θ)Ψ̃(θ)} = E_θz[ Φ(θ){A(θ)l(θ, z) + R(·; θ, z)} ] / [ E_θz{Φ′(θ)} E_θz{Ψ′(θ)} ]   (3.22)
  = [ A(θ) / E_θz{Ψ′(θ)} ] · [ E_θz{Φ(θ)l(θ, z)} / E_θz{Φ′(θ)} ]
  = − A(θ) / E_θz{Ψ′(θ)} ,

using the orthogonality of R and Φ in the second equality and part (i) of Lemma 2 in the
third. Hence the value of E_θz{Φ̃(θ)Ψ̃(θ)} does not depend on Φ; in particular,

E_θz{Φ̃(θ)Ψ̃(θ)} = E_θz{Ψ̃²(θ)} > 0 .

On the other hand, from (ii) of Lemma 2, one has

E_θz{Φ̃²(θ)} = E_θz[{Φ̃(θ) − Ψ̃(θ)}²] + 2 E_θz{Φ̃(θ)Ψ̃(θ)} − E_θz{Ψ̃²(θ)}   (3.23)
  = E_θz[{Φ̃(θ) − Ψ̃(θ)}²] + E_θz{Ψ̃²(θ)}
  ≥ E_θz{Ψ̃²(θ)} ,

for each Φ ∈ G. Thus, ∀θ ∈ Θ, ∀z ∈ Z, ∀Φ ∈ G,

J_Φ(θ, z) = 1 / E_θz{Φ̃²(θ)} ≤ 1 / E_θz{Ψ̃²(θ)} = J_Ψ(θ, z) .   (3.24)

We conclude that Ψ is optimal. For the second part of the lemma, note that one has
equality in (3.23), and hence in (3.24), if and only if ∀θ ∈ Θ, ∀z ∈ Z, E_θz[{Φ̃(θ) −
Ψ̃(θ)}²] = 0. That is, if a regular inference function Φ is optimal then Φ̃(·; θ) =
Ψ̃(·; θ) Pθz-a.s., ∀θ ∈ Θ, ∀z ∈ Z. □

We can prove now the main theorem of this section.

Proof: (of theorem 9) Take θ ∈ Θ and z ∈ Z fixed. From (3.19),

l(x; θ, z) = ∂/∂θ log p(x; θ, z) = ∂/∂θ log f_t(x; θ) + ∂/∂θ log h{t(x); θ, z} .   (3.25)

We apply lemma 3 to prove that Ψ is the ("unique") optimal inference function. More
precisely, defining A(θ) = 1 and R(x; θ, z) = −∂ log h{t(x); θ, z}/∂θ, and using (3.25), we
can write Ψ in the form

Ψ(x; θ) = ∂/∂θ log f_t(x; θ) = A(θ) l(x; θ, z) + R(x; θ, z) .

According to lemma 3, if R is orthogonal to every regular inference function, then Ψ
is optimal; moreover, Ψ is the unique optimal inference function, apart from equivalent
inference functions.
Take an arbitrary regular inference function φ. We show that φ and R are orthogonal.
Note that for each z ∈ Z,

0 = ∫ φ(x; θ) p(x; θ, z) dµ(x) = ∫ φ(x; θ) f_t(x; θ) h{t(x); θ, z} dµ(x) .

On the other hand E_θz(φ|T) = ∫ φ(x; θ) f_t(x; θ) dµ(x), which is independent of z. We write
E_θ(φ|T) for E_θz(φ|T), and we have E_θz{E_θ(φ|T)} = 0. Since T is complete, E_θ(φ|T) = 0,
Pθz-almost surely. We then have

E_θz{φ(θ) R(θ, z)} = E_θz{R(θ, z) E_θ(φ|T)} = 0 . □

3.4.2 Generalized estimating functions


A quasi-estimating function is said to be a generalized estimating function if it is
equivalent to an estimating function. More precisely, a quasi-estimating function Ψ :
X × Θ × Z −→ IR^q is a generalized estimating function if for each (θ, z) ∈ Θ × Z there
exist a non-singular q × q matrix A(θ, z) and a measurable function Φ(·, θ) : X −→ IR^q
such that

Ψ(x; θ, z) = A(θ, z) Φ(x; θ) ,

for x λ-almost everywhere. If Φ is a regular estimating function, then Ψ is said to be a
regular generalized estimating function.
Generalized estimating functions are used for estimating the interest parameter in the
following way. Given a sample x = (x_1, . . . , x_n)^T of size n from an unknown probability
measure of the model, define the estimator θ̂_n implicitly as the solution of the equation

0 = Σ_{i=1}^n Ψ(x_i; θ̂_n, z) = A(θ̂_n, z) Σ_{i=1}^n Φ(x_i; θ̂_n) ,

which is equivalent to

Σ_{i=1}^n Φ(x_i; θ̂_n) = 0 .

In other words, for each generalized estimating function there is an estimating function
that yields the same estimating sequence. In fact generalized estimating functions are just
a tool that will simplify some formalizations. Examples of generalized estimating functions
are the efficient score functions of most of the models studied in the subsequent
chapters. Proposition 17 below will be useful later on.
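As a small computational sketch (added here; the model and the function names are illustrative assumptions, not part of the original text): since A(θ̂_n, z) is non-singular it cancels from the estimating equation, so θ̂_n can be computed by applying a one-dimensional root finder to Σ_i Φ(x_i; θ). Below Φ is taken to be the score for the mean θ of an exponential distribution, whose root is the sample mean.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=3.0, size=500)

    # Illustrative component Phi: score for the mean theta of an exponential law.
    def Phi(xi, theta):
        return xi / theta ** 2 - 1.0 / theta

    def estimating_equation(theta):
        # any nonsingular factor A(theta, z) has already been cancelled
        return np.sum(Phi(x, theta))

    theta_hat = brentq(estimating_equation, 0.1, 100.0)
    print(theta_hat, x.mean())   # the two agree: here the root is the sample mean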

Proposition 17 If for each (θ, z) ∈ Θ × Z, FIA(θ) = TN^⊥(θ, z), and the efficient score
function is a generalized estimating function, then the estimating function equivalent to
the efficient score function attains the semiparametric Cramér-Rao bound.

Proof: Take an arbitrary (θ, z) ∈ Θ × Z. Since FIA(θ) = TN^⊥(θ, z), the efficient score
function l^E coincides with the information score function l^I at (θ, z). Hence the extended
sensitivity of the efficient score function (at (θ, z)) is the q × q identity matrix and the
Godambe information of l^E at (θ, z) is

J_{l^E}(θ, z) = Cov_θz^{-1}(l^E) ,

which is the semiparametric Cramér-Rao bound. If Φ is an estimating function equivalent
to the efficient score function, then its Godambe information is equal to the Godambe
information of the efficient score function, that is, Φ attains the semiparametric Cramér-Rao
bound at (θ, z). The proof now follows from the fact that (θ, z) was chosen arbitrarily. □
Part II

Detailed Study of Some Classes of
Semiparametric Models
Chapter 4

Semiparametric Location and Scale Models

4.1 Introduction
This chapter studies some semiparametric extensions of the location-scale model. The
classic location-scale model is constructed by taking a (fixed) distribution and applying
to it a shift (location) and a rescaling (scale) transformation. Now consider the situation
where, instead of having a fixed distribution, one deals with a given class of distributions. A
shift-rescaling transformation is applied to an unknown element of this class. Our interest
is in estimating the shift and the rescaling, but now in the presence of the indeterminacy
due to the unknown particular element of the class of distributions that generated the
data, i.e. the location (shift) and the scale (rescaling) are the parameters of interest and
the unknown distribution is the nuisance parameter.
We will consider location-scale models defined for distributions contained in exponential
families and with support equal to the whole real line. Here we do not know in
which exponential family the supposed data distribution is contained. Additionally, we
consider some models for which some standardized cumulants are fixed and known. This
will allow us to obtain a range of semiparametric models of various sizes.
The main purpose here is to study the behavior of the efficient score function (i.e. the
canonical gradient of the interest parameter functional) for estimating the location and
the scale. This will allow us to gain intuition in a simple example, before treating more
complicated cases. We will not be concerned at this stage with generality, and stringent
restrictions on the family of distributions will be used freely in order to keep the
mathematical argumentation transparent. The location and scale model will be studied as an
example in chapter 5 under less restrictive conditions.
Section 4.2 presents and discusses the semiparametric location-scale model with which
we work. The nuisance tangent spaces are calculated in section 4.3, with some details
supplied in the appendices. Section 4.4 studies the efficient score function, and in
subsections 4.4.1 and 4.4.2 we specialize to the cases where the first two and the first three
standardized cumulants are fixed, respectively. Some discussion is provided in section
4.5. There is an appendix with a brief summary of the theory of the Laplace transform,
adapted to the context we need, in which we prove a sufficient condition, in terms of the
Laplace transform, for the class of polynomials to be dense in the L² space associated
with a given probability measure.

4.2 Semiparametric Location-Scale Models


In this section we define the semiparametric location-scale model used in the rest of the
chapter.
Let λ be the Lebesgue measure and P a family of probability measures defined on
(IR, B(IR)), dominated by λ and given by

P = { P_µσa : dP_µσa/dλ (·) = (1/σ) a((· − µ)/σ) , µ ∈ IR, σ ∈ IR+, a ∈ A } .   (4.1)

Here A is the class of functions a : IR −→ IR such that (4.2)-(4.10) given below hold.

∀x ∈ IR , a(x) > 0 ;   (4.2)

∫_IR a(x) λ(dx) = 1 ;   (4.3)

∫_IR x a(x) λ(dx) = 0 ;   (4.4)

∫_IR x² a(x) λ(dx) = 1 ;   (4.5)

and

a is differentiable λ-almost everywhere.   (4.6)

We consider also the following technical conditions involving the Laplace transform and
the behavior of the function a in the tails. Assume that there exists a δ > 0 (we stress
that δ may depend on a) such that for all s ∈ (−δ, δ)

M[s , a(·)] = ∫_IR e^{sx} a(x) λ(dx) < ∞ ;   (4.7)

and for all s ∈ (−δ, δ)

M[s , {a′(·)}²/a(·)] = ∫_IR e^{sx} {a′(x)}²/a(x) λ(dx) < ∞ .   (4.8)

Assume further that

∀i ∈ N0 , lim_{x→∞} x^i a(x) = lim_{x→−∞} x^i a(x) = 0 .   (4.9)

Here N0 = {0, 1, 2, . . .}.



Let us also impose the following additional condition:

for i = 3, . . . , k , ∫_IR x^i a(x) λ(dx) = m_i ,   (4.10)

where k is an integer greater than 1 and m3, . . . , m_k are real quantities supposed given and
fixed (if k = 2 we adopt the convention that condition (4.10) is void). Note that m1 =
0, m2 = 1, m3, . . . , m_k are in fact the first k standardized cumulants of the distributions of
P, which are hence assumed to be fixed. Here the term standardized cumulants refers to
the moments (about zero) of the standardized distribution (i.e. the distribution shifted
and rescaled so as to have mean zero and variance one).
We will treat a ∈ A as the non-parametric component of a semiparametric model and
(µ, σ) as the parameters of interest. Conditions (4.2) and (4.3) imply that each a ∈ A
is a density of a probability measure with support equal to the whole real line. The
identifiability of the parametrization given is a consequence of (4.4) and (4.5). Clearly,
this is not the only possible way to obtain identifiability, but it turns out to be convenient
for our purposes.
We stress that conditions (4.2)-(4.9) are essential for the development given. On the
other hand, condition (4.10) was imposed only to enable us to control the size of the class
of models to be considered. From the statistical viewpoint condition (4.10) can be used to
study the impact of knowing higher order standardized cumulants of the distribution in
play. A potential field of application for these techniques is in the study of turbulent flow
of fluids where, due to the Kolmogorov theory, one can predict the values of the cumulants
of the distribution involved (see Barndorff-Nielsen, 1978 and Barndorff-Nielsen et al., 1990
). From the theoretical point of view we use condition (4.10) to impose constraints on A,
yielding semiparametric models of differing types. In fact, the model obtained without
condition (4.10) is, according to the classification of Wellner et al. (1994), a nonparametric
model, in the sense that the tangent spaces are the whole L20 space. On the other hand,
by imposing condition (4.10) (with some integer k greater than 2) one obtains a genuine
semiparametric model, in the sense that the tangent spaces are infinite dimensional proper
subsets of the L20 spaces. We will then study the effect of this qualitative change on the
efficient score function. This will illustrate how the estimation problem becomes harder
when one jumps from a nonparametric model to a genuine semiparametric model. The
meaning of the conditions (4.2)-(4.9) is discussed in detail in the following. We fix for the
rest of this section (µ, σ, a) ∈ IR × IR+ × A.
We show now that condition (4.8) implies that the location and scale scores are in
L²(P_µσa). The components of the score function with respect to the location and the
scale parameters are given respectively by

l_{/µ}(·) = − (1/σ) · a′((· − µ)/σ) / a((· − µ)/σ) ,   (4.11)

and

l_{/σ}(·) = − (1/σ) [ 1 + ((· − µ)/σ) · a′((· − µ)/σ) / a((· − µ)/σ) ] .   (4.12)

Note that condition (4.8) together with proposition 19 in appendix 4.6.1 ensures that the
functions {a′(·)}²/a(·) and (·)²{a′(·)}²/a(·) are Lebesgue integrable.
It can be seen that condition (4.9) implies that

∫_IR l_{/µ}(x) a(x) λ(dx) = ∫_IR l_{/σ}(x) a(x) λ(dx) = 0 ,

i.e. the location and the scale partial scores are unbiased. We will need the condition
(4.9) with polynomials of arbitrary order in the calculation of the nuisance tangent space
and the projection of the score function onto the orthogonal complement of the nuisance
tangent space.
Condition (4.7) implies that the distribution associated with a possesses finite moments
of all orders and that the polynomials are dense in L2 (a) (see the proposition 19 and
theorem 11 in the appendix 4.6.1). Those properties will be crucial in the calculation of
the nuisance tangent space and in the projection of the score function onto the orthogonal
complement of the nuisance tangent space.
The following proposition gives a useful sufficient condition for verifying whether a
given probability density satisfies the technical conditions (4.7)-(4.9).

Proposition 18 Let a : IR −→ IR be a function for which (4.2), (4.3) and (4.6) hold.
Assume moreover that there exists δ > 0 such that for all s ∈ (−δ, δ),

lim_{x→+∞} e^{sx} a(x) = lim_{x→−∞} e^{sx} a(x) = 0

and

lim_{x→+∞} e^{sx} a′(x)/√a(x) = lim_{x→−∞} e^{sx} a′(x)/√a(x) = 0 .

Then the technical conditions (4.7)-(4.9) hold.

Proof: Conditions (4.7) and (4.8) follow from part i) of proposition 21 in appendix 4.6.1,
and (4.9) from part ii). □

Using the proposition above it is easy to see that the following classic families of
distributions have probability densities satisfying the technical conditions (4.7)-(4.9): the
normal distributions, the hyperbolic distributions, the Gumbel distributions, and the double
exponential (Laplace) distributions.
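The conditions can also be checked numerically in concrete cases. The following rough sketch (an added illustration; the density and the range of s are assumptions) evaluates the Laplace transform in (4.7) for the double exponential density a(x) = e^{−|x|}/2, for which M(s; a) = 1/(1 − s²) is finite exactly when |s| < 1; since a′(x) = −sign(x) a(x) away from zero, here {a′(·)}²/a(·) = a(·), so the same computation also checks (4.8).

    import numpy as np
    from scipy.integrate import quad

    # Double exponential (Laplace) density; M(s; a) is finite for |s| < 1.
    a = lambda x: 0.5 * np.exp(-np.abs(x))

    def M(s):
        val, _ = quad(lambda x: np.exp(s * x) * a(x), -np.inf, np.inf)
        return val

    for s in (-0.5, 0.0, 0.5):
        print(s, M(s), 1.0 / (1.0 - s ** 2))   # numerical and exact values agree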

4.3 Calculation of the nuisance tangent space


In this section we characterize the nuisance tangent spaces of the semiparametric
location-scale model given in section 4.2 in terms of orthonormal polynomials. More precisely,
we will calculate the L² nuisance tangent space for the semiparametric location-scale
model presented. Here (µ, σ, a) will be treated as a fixed (but arbitrary) point of the
parameter space IR × IR+ × A. We denote by {e_i}_{i∈N0} the result of a Gram-Schmidt
orthonormalization process with respect to the inner product of L²(a) applied to the
sequence of monomials {1, (·), (·)², . . .}. Let us adopt the convention that for each
i ∈ N0 the polynomial e_i(·) is of degree i, and introduce the notation

e*_i(·) = e_i((· − µ)/σ) .
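For concrete densities the basis {e_i} can be computed numerically from the moments of a, since ⟨x^i, x^j⟩_a equals the moment of order i + j. The sketch below (an added illustration; the choice of a as the standard normal density is an assumption) performs the Gram-Schmidt step on monomial coefficient vectors; for the normal density the e_i are the normalized Hermite polynomials.

    import numpy as np
    from scipy.integrate import quad

    a = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # assumed density
    K = 4                                                    # build e_0, ..., e_K
    m = [quad(lambda x, j=j: x ** j * a(x), -np.inf, np.inf)[0]
         for j in range(2 * K + 1)]                          # moments of a

    def inner(p, q):                 # <p, q>_a computed via the moments of a
        c = np.convolve(p, q)        # coefficients of the product polynomial
        return sum(ci * m[j] for j, ci in enumerate(c))

    e = []                           # coefficient vectors (index = power of x)
    for i in range(K + 1):
        p = np.zeros(K + 1); p[i] = 1.0          # the monomial x^i
        for q in e:                              # subtract projections on e_0..e_{i-1}
            p = p - inner(p, q) * q
        e.append(p / np.sqrt(inner(p, p)))
    print(np.round(e[2], 3))   # ~ (x^2 - 1)/sqrt(2), the second Hermite polynomial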
The following theorem gives a polynomial characterization of the L² nuisance tangent
space.

Theorem 10 The L² nuisance tangent space of the location-scale model given by (4.1)
at (µ, σ, a) ∈ IR × IR+ × A is

TN²(µ, σ, a) = cl_{L²(P_µσa)} [span{e*_i(·) : i = k + 1, k + 2, . . .}] .

Proof: We give here the main steps of the proof (the details can be found in appendix
4.6.2).
First of all, it can be shown that under a semiparametric location-scale model the L²
nuisance tangent space at (µ, σ, a) ∈ IR × IR+ × A is given by

TN²(µ, σ, a) = { ν((· − µ)/σ) : ν ∈ TN²(0, 1, a) }

(see theorem 12 in appendix 4.6.3 for a detailed proof). Therefore there is no loss of
generality in restricting our attention to the case where (µ, σ, a) = (0, 1, a) and showing that

TN²(0, 1, a) = cl_{L²(P01a)} [span{e_i(·) : i = k + 1, k + 2, . . .}] .

For notational simplicity, in this proof we denote TN²(0, 1, a) by TN (and use the
analogous convention for the tangent sets). Define, for each i ∈ N,

H_i = span{e_1(·), . . . , e_i(·)} .

We prove next that TN ⊆ H_k^⊥, where H_k^⊥ is the orthogonal complement of H_k in
L²₀(P01a). Take an arbitrary ν ∈ TN^0 and h ∈ H_k. There exists a path {a_t} ⊂ A,
differentiable at a, with tangent ν. Let {r_t} ⊂ L²₀(P01a) be the sequence of remainder
terms of {a_t}, so that (a_t − a)/(ta) = ν + r_t. For each t,

|⟨ν, h⟩_a| = |⟨(a_t − a)/(ta) − r_t , h⟩_a|
          = |(1/t){∫ h(x) a_t(x) λ(dx) − ∫ h(x) a(x) λ(dx)} − ⟨r_t, h⟩_a|
          (from (4.3)-(4.5) and (4.10), the two integrals coincide)
          = |⟨r_t, h⟩_a| .

Since r_t → 0 in L²(P01a), ⟨r_t, h⟩_a −→ 0 as t ↓ 0. We conclude that ⟨ν, h⟩_a = 0. Therefore
TN^0 ⊆ H_k^⊥ and, since H_k^⊥ is a closed linear space,

TN ⊆ H_k^⊥ = cl_{L²(P01a)} {span[{e_i(·) : i = k + 1, k + 2, . . .}]} .

Next we sketch the proof that H_k^⊥ ⊆ TN. The verification of this inclusion can
be reduced to proving that for each i ∈ {k + 1, k + 2, . . .} the function

h_i(·) = e_i(·) ,                if i is even,
h_i(·) = e_{i+1}(·) − e_i(·) ,   if i is odd,

is in TN(0, 1, a). The proof is done by showing that for t small enough

a_t(·) = a(·) + t a(·) h_i(·)

belongs to A. The conditions (4.2)-(4.10) for a_t are verified in appendix 4.6.2. There,
the crucial point is the verification of (4.7) and (4.8), which is done with the Laplace
transform properties given in appendix 4.6.1. □
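The perturbation used in the proof can be checked numerically in a concrete case (an added illustration; taking a to be the standard normal density and h = e_4 is an assumption). For small t, a_t = a(1 + t e_4) remains strictly positive and leaves the moments of order up to 3 unchanged, so e_4 is indeed a nuisance tangent direction when k = 3.

    import numpy as np
    from scipy.integrate import quad

    a = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)    # assumed density
    e4 = lambda x: (x ** 4 - 6 * x ** 2 + 3) / np.sqrt(24)    # normalized Hermite, degree 4
    t = 0.05                                                  # small enough: 1 + t*e4 > 0
    a_t = lambda x: a(x) * (1 + t * e4(x))

    for j in range(4):    # moments of order 0..3 are unchanged: 1, 0, 1, 0
        print(j, round(quad(lambda x, j=j: x ** j * a_t(x), -np.inf, np.inf)[0], 6))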

We remark that in fact the weak and the L1 tangent spaces coincide with the L2
tangent space for the location and scale models. This will be proved in a more general
context in chapter 5. In the rest of the chapter we suppress the symbol “2” indicating
that we work with the L2 tangent space.

4.4 Calculation of the efficient score function


We calculate in this section the efficient score function for the location-scale model (4.1)
by projecting the location and the scale partial scores onto the orthogonal complement of
the nuisance tangent space.
Throughout this section (µ, σ, a) is an arbitrary element of the parameter space IR × IR+ ×
A. We denote the probability measure P_µσa by P0 and the inner product and the norm
of L²(P0) by ⟨·, ·⟩ and ‖·‖ respectively. Moreover, for A ⊆ L²₀(P0), A^⊥ will denote the
orthogonal complement of A in L²₀(P0).
Recall that the L² nuisance tangent space at (µ, σ, a) is given by

TN(µ, σ, a) = cl_{L²(P0)} [span{e*_{k+1}, e*_{k+2}, . . .}] .

Since {e*_i}_{i∈N0} is an orthonormal basis in L²(P0), the orthogonal complement of TN(µ, σ, a)
in L²₀(P0) is given by

TN^⊥(µ, σ, a) = span{e*_1, . . . , e*_k} .   (4.13)
The efficient score function is now calculated by expanding the location and scale
scores in Fourier series and retaining the terms of indices 1, . . . , k. More precisely, since
l_{/µ}(·) is in L²(P0) we have the following Fourier expansion in terms of the orthonormal
basis {e*_i}_{i∈N0},

l_{/µ}(·) = c1 e*_1(·) + . . . + c_k e*_k(·) + Σ_{i=k+1}^∞ c_i e*_i(·) ,   (4.14)

and hence the location component of the efficient score function at (µ, σ, a) is given by

l^E_{/µ}(·) = c1 e*_1(·) + . . . + c_k e*_k(·) ;   (4.15)

analogously, the scale component of the efficient score function is

l^E_{/σ}(·) = d1 e*_1(·) + . . . + d_k e*_k(·) .   (4.16)
Here the Fourier coefficients in (4.15) and (4.16) are given, for all i ∈ N, by

c_i = ⟨l_{/µ}(·), e*_i(·)⟩   (4.17)
    = − ∫_IR (1/σ) [a′((x−µ)/σ) / a((x−µ)/σ)] e_i((x−µ)/σ) (1/σ) a((x−µ)/σ) λ(dx)
    = − (1/σ) { [e_i(y) a(y)]_{y=−∞}^{+∞} − ∫_IR e_i′(y) a(y) λ(dy) }
    = (1/σ) ∫_IR e_i′(y) a(y) λ(dy) .

Here we used condition (4.9). A similar calculation leads to the following formula for
the coefficients of (4.16):

d_i = ⟨l_{/σ}(·), e*_i(·)⟩ = (1/σ) ∫_IR y e_i′(y) a(y) λ(dy) , for i = 1, . . . , k .   (4.18)
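Formulas (4.17) and (4.18) are easy to evaluate numerically for a concrete density (an added sketch; σ = 1 and the standard normal choice of a are assumptions, with e_1(y) = y and e_2(y) = (y² − 1)/√2 from the Gram-Schmidt step):

    import numpy as np
    from scipy.integrate import quad

    a = lambda y: np.exp(-y ** 2 / 2) / np.sqrt(2 * np.pi)       # assumed density
    polys = {1: np.poly1d([1.0, 0.0]),                           # e_1(y) = y
             2: np.poly1d([1.0, 0.0, -1.0]) * (1 / np.sqrt(2))}  # e_2(y) = (y^2 - 1)/sqrt(2)

    for i, ei in polys.items():
        dei = ei.deriv()
        c_i = quad(lambda y: dei(y) * a(y), -np.inf, np.inf)[0]        # (4.17)
        d_i = quad(lambda y: y * dei(y) * a(y), -np.inf, np.inf)[0]    # (4.18)
        print(i, round(c_i, 4), round(d_i, 4))   # c_1 = 1, d_1 = 0; c_2 = 0, d_2 = sqrt(2)

For the normal density m3 = 0 and Δ2 = √2, so these values agree with the closed forms c2 = −m3/(σΔ2) = 0 and d2 = 2/(σΔ2) = √2 obtained in subsection 4.4.1 below.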

We discuss now the dependence of the efficient score function on the nuisance param-
eter. First of all, since for each i ∈ N , ei ( · ) is a polynomial of degree i, the coefficient ci
is a linear combination of the standardized cumulants up to order k − 1 of the distribu-
tion with density a (see (4.17)). Moreover, the coefficients di are linear combinations of
the standardized cumulants of the distribution in play up to order k. We conclude that
the coefficients of l^E_{/µ}(·) and l^E_{/σ}(·) (given in (4.15) and (4.16)) depend on the nuisance
parameter a only through the standardized cumulants up to order k. However, the depen-
dence of the efficient score function on the nuisance parameter is more complex because
the polynomials e0 , e1 , ..., ek generated by the orthonormalization procedure (in L2 (a))
depend on higher order standardized cumulants. In fact the polynomial ek ( · ) depends on
the moments of order up to 2k, because in order to normalize the polynomial of degree
k in the Gram-Schmidt procedure, we have to divide the polynomial by its L2 (a) norm,
which clearly depends on the standardized cumulant of order 2k.
Summing up, the dependence of the efficient score function on the infinite dimensional
parameter for the location-scale model under study occurs here via a finite dimensional
intermediate parameter involving only the standardized cumulants of order up to 2k.

4.4.1 The case where the first two standardized cumulants are
fixed
We study now in detail the case where only the standardized cumulants up to order 2
are fixed (i.e. k = 2). It will be shown that in this case the efficient score function is
equivalent to an estimating function, which is independent of the nuisance parameter and
has the sample mean and standard deviation as roots.
The first three elements of the basis {e_i}_{i∈N0} are

e0(·) = 1 ,  e1(·) = (·) ,   (4.19)
e2(·) = (1/Δ2) {(·)² − m3(·) − 1} ,  where Δ2 = √(m4 − m3² − 1) .

The detailed calculations are given in appendix 4.6.4, where an argument showing that
Δ2 > 0 (so that (4.19) is well defined) can also be found. Using the formulas (4.17)
and (4.18) to calculate the coefficients c1, c2, d1, d2 of the efficient score function
gives

c1 = 1/σ ,  c2 = −m3/(σΔ2) ;
d1 = 0 ,    d2 = 2/(σΔ2) .

Inserting the coefficients given above into (4.15) and (4.16), and writing y = (· − µ)/σ,
we obtain that the efficient score function at (µ, σ, a) is given by

l^E_{/µ}(·) = c1 e*_1(·) + c2 e*_2(·) = (1/σ) [ y − (m3/Δ2²) {y² − m3 y − 1} ]   (4.20)

and

l^E_{/σ}(·) = d2 e*_2(·) = (2/(σΔ2²)) {y² − m3 y − 1} .   (4.21)

Under independent repeated sampling, with sample x = (x1, . . . , x_n)^T and y_i = (x_i − µ)/σ,
we obtain the following expression for the efficient score function:

l^E(x; µ, σ, a) = σ^{-1} ( Σ_{i=1}^n y_i − (m3/Δ2²) Σ_{i=1}^n {y_i² − m3 y_i − 1} ,
                           (2/Δ2²) Σ_{i=1}^n {y_i² − m3 y_i − 1} )^T .

Multiplying the efficient score function by the nonsingular matrix

M = σ ( 1    m3/2
        m3   (m3² + Δ2²)/2 )

we obtain the following estimating function, which is equivalent to the efficient score
function:

M · l^E(x; µ, σ, a) = ( Σ_{i=1}^n y_i , Σ_{i=1}^n y_i² − n )^T .

Note that the matrix M is indeed nonsingular: its determinant is σ²Δ2²/2 ≠ 0, since
Δ2 > 0 (see appendix 4.6.4). Hence the efficient score function is equivalent to an
estimating function independent of the nuisance parameter, with roots

µ̂ = (1/n) Σ_{i=1}^n x_i  and  σ̂ = √( (1/n) Σ_{i=1}^n (x_i − µ̂)² ) .

In view of the optimality theory of estimating functions given in Labouriau (1996b),
the efficient score function is optimal.
Clearly, the sample mean and the sample variance are regular asymptotic linear
estimators (of µ and σ²). They are efficient, since the full tangent space is the whole of L²₀.
Note that, in this case, the bound given by the L² path differentiability is attained (by
regular asymptotic linear estimators), and hence it coincides with the bound given by the
weak path differentiability.
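A quick numerical sanity check (an added sketch; the Gumbel sample is an arbitrary choice): the sample mean and the uncorrected sample standard deviation do annihilate the equivalent estimating function displayed above.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.gumbel(size=1000)                 # any distribution with finite moments
    mu_hat = x.mean()
    sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))

    y = (x - mu_hat) / sigma_hat
    print(np.sum(y), np.sum(y ** 2) - len(x))   # both zero up to rounding error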

4.4.2 The case where the first three standardized cumulants are
fixed
We show in this section that in the case where the standardized cumulants up to order 3
are fixed (i.e. k = 3) the roots of the efficient score function do depend on the nuisance
parameter, through the cumulants up to order 6. Moreover, we give examples of situations
where the roots of the efficient score function are not the sample mean and the sample
variance.
We computed in the last section the first coefficients of the efficient score function,
namely c1, c2, d1 and d2. We now calculate the coefficients c3 and d3, which
will allow us to compute the efficient score function by using (4.15) and (4.16). Note that
the polynomial e3 is given by (see appendix 4.6.4)

e3(·) = (1/Δ3) { (·)³ − (γ/Δ2²)(·)² − (m4 − m3γ/Δ2²)(·) − (m3 − γ/Δ2²) } ,

where

γ = m5 − m3 m4 − m3

and

Δ3 = ‖ (·)³ − (γ/Δ2²)(·)² − (m4 − m3γ/Δ2²)(·) − (m3 − γ/Δ2²) ‖_{L²(a)} .

Note also that, according to the argument given in appendix 4.6.4, Δ3 > 0 and hence
e3(·) is well defined.
Using (4.17) and (4.18) we obtain

c3 = (1/σ) ∫_IR e3′(x) a(x) λ(dx) = (1/(σΔ3)) { 3 − (m4 − m3γ/Δ2²) }

and

d3 = (1/σ) ∫_IR x e3′(x) a(x) λ(dx) = (1/(σΔ3)) { 3m3 − 2γ/Δ2² } .   (4.22)

The efficient score function is then given, writing y = (· − µ)/σ, by

l^E_{/µ}(·) = c1 e*_1(·) + c2 e*_2(·) + c3 e*_3(·)
           = (1/σ) [ y − (m3/Δ2²) {y² − m3 y − 1} ]
             + (c3/Δ3) [ y³ − (γ/Δ2²) y² − (m4 − m3γ/Δ2²) y − (m3 − γ/Δ2²) ]

and

l^E_{/σ}(·) = d2 e*_2(·) + d3 e*_3(·)
           = (2/(σΔ2²)) {y² − m3 y − 1}
             + (d3/Δ3) [ y³ − (γ/Δ2²) y² − (m4 − m3γ/Δ2²) y − (m3 − γ/Δ2²) ] .

We now consider some examples in which the efficient score function simplifies somewhat.
Example 6 Let us consider the case where m3 = m5 = 0 and m4 = 3, i.e. the standardized
cumulants up to order 3 coincide with those of the normal distribution. Then the
coefficients c3 and d3 vanish, and hence the efficient score function coincides with
the one obtained in the case where only the cumulants up to order 2 were fixed. Therefore,
in this case the roots of the efficient score function are the sample mean and standard
deviation. □

Example 7 We study now the case where m3 = m5 = 0 and m4 ≠ 3. This is the case,
for instance, for the Laplace distribution (double exponential) or for the hyperbolic distribution
with symmetry parameter β vanishing, both of which are symmetric (hence m3 = m5 = 0) and
have standardized cumulant of fourth order different from 3.
In this case the coefficients of the efficient score function are given by

c3 = (1/(σΔ3))(3 − m4) ≠ 0 ,  d3 = (1/(σΔ3)){3m3 − 2γ/Δ2²} = 0 ;
c2 = −m3/(σΔ2) = 0 ,          d2 = 2/(σΔ2) ;
c1 = 1/σ ,                    d1 = 0 .

The efficient score function is then of the form

l^E_{/µ}(·) = (1/σ) { e*_1(·) + ((3 − m4)/Δ3) e*_3(·) }
           ≡ − ((3 − m4)/Δ3²) ((· − µ)/σ)³ + ((3 − m4 + Δ3²)/Δ3²) ((· − µ)/σ)

and

l^E_{/σ}(·) = d2 e*_2(·) ≡ ((· − µ)/σ)² − 1 .

Under independent repeated sampling, with sample x = (x1, . . . , x_n)^T, we obtain the
following expressions for the efficient score function:

l^E_{/µ}(x) ≡ − ((3 − m4)/Δ3²) Σ_{i=1}^n ((x_i − µ)/σ)³ + ((3 − m4 + Δ3²)/Δ3²) Σ_{i=1}^n ((x_i − µ)/σ)   (4.23)

and

l^E_{/σ}(x) ≡ Σ_{i=1}^n ((x_i − µ)/σ)² − n .   (4.24)

Now, equating (4.24) to zero we obtain

σ̂ = √( (1/n) Σ_{i=1}^n (x_i − µ̂)² ) .   (4.25)

Equating (4.23) to zero, inserting (4.25) and rearranging, we obtain the equation

− n(2A + 1) µ̂³ + (4A + 3) (Σ_{i=1}^n x_i) µ̂²
− [ {(n + 1)A + n} Σ_{i=1}^n x_i² + 2(A + 1) (Σ_{i=1}^n x_i)² ] µ̂
+ [ A Σ_{i=1}^n x_i³ + (1/n)(A + 1) (Σ_{i=1}^n x_i²)(Σ_{i=1}^n x_i) ] = 0 ,

where A = (3 − m4)/Δ3². Thus µ̂ is a root of a polynomial of third degree with
coefficients depending on the standardized cumulants up to order 6 (note that Δ3 depends
on m6). Hence µ̂ depends on the nuisance parameter, and the efficient score function
cannot be equivalent to an estimating function independent of the nuisance parameter. □
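The dependence on the nuisance parameter can be made visible numerically (an added sketch; the standardized Laplace choice, for which m3 = m5 = 0, m4 = 6, m6 = 90 and hence γ = 0 and Δ3² = m6 − m4² = 54, is an assumption). The sketch solves l^E_{/µ} = 0 using the projection form (1/σ){e*_1 + ((3 − m4)/Δ3) e*_3} displayed above, with σ replaced by (4.25); the root differs from the sample mean.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(3)
    x = rng.laplace(scale=1 / np.sqrt(2), size=51)   # variance 1, so (mu, sigma) = (0, 1)
    m4, D3 = 6.0, np.sqrt(54.0)                      # Laplace values; gamma = 0

    def eff_score_mu(mu):
        """c_1 e_1 + c_3 e_3 summed over the sample, with sigma from (4.25)."""
        y = (x - mu) / np.sqrt(np.mean((x - mu) ** 2))
        e3 = (y ** 3 - m4 * y) / D3                  # e_3 when m3 = m5 = 0
        return np.sum(y + ((3 - m4) / D3) * e3)

    # the bracket around the sample mean is assumed wide enough to contain the root
    mu_hat = brentq(eff_score_mu, x.mean() - 2.0, x.mean() + 2.0)
    print(mu_hat, x.mean())   # the efficient-score root is not the sample mean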

4.5 Discussion
There exist in the literature many studies of extensions of the pure location model; we
refer to Stein (1956), van der Vaart (1988, 1991) and Bickel et al. (1993), among others.
In all the studies referred to, the unknown distributions are assumed to be symmetric,
which simplifies the mathematical treatment considerably. In this chapter we presented
a class of distributions that are not necessarily symmetric, and we treated at the same
time the problem of estimating the scale. It is to be noted that we did not attempt to
obtain the largest class (or even a very large class) of distributions, containing asymmetric
distributions, for which it is possible to treat the problem of estimating the location (and
the scale). Rather, we restricted the discussion to some interesting infinite dimensional
classes of distributions. It would be a considerable improvement to eliminate the technical
assumptions on the Laplace transform and on the polynomial decay (of all orders) of the
tails of the densities (i.e. conditions (4.7)-(4.9)).
In the case of the location-scale model where only the first two moments are fixed,
the efficient score function does not essentially depend on the nuisance parameter and
is in fact a regular estimating function. This implies, according to the discussion in
chapter 2 (see also Jørgensen and Labouriau, 1995), that the efficient score function is an
optimal estimating function, in the sense that it maximizes the Godambe information.
The roots of the efficient score function yield the sample mean and the sample standard
deviation as estimators of the location and scale respectively. This is in agreement
with the literature for the location model under symmetry and with common sense in
statistics. In this way, the theory of estimating functions does not suggest new estimators,
and its merit apparently is only to justify the classic statistical estimation procedure.
However, some care should be taken at this point. We showed that the only possible
estimating sequence derived from regular estimating functions in this example is the
sample mean and the sample standard deviation. Hence, that estimating sequence is
optimal in a class containing only one element. Moreover, the usual claim that it is
possible to obtain B-robust estimating sequences from estimating functions fails in this
example.
Note that in the case of the location-scale model where only the first two moments are
fixed, the (global) tangent space is the whole space L²₀ (see Labouriau, 1996a). That is, the
model is not a semiparametric model in the terminology of J. Wellner (see Groeneboom
and Wellner, 1992, page 7) but rather a nonparametric model. This implies that there
is essentially only one sequence of regular asymptotic linear estimators. Clearly, this
sequence of estimators is optimal, even though optimality in a class containing only one
element does not say very much. Hence, a naive application of the direct extension of
estimating function theory to semiparametric models and the classical theory of regular
asymptotic linear estimators both fail in this example.

When one fixes some standardized cumulants of higher order, the model becomes a
genuine semiparametric model, in the sense that its (global) tangent space becomes a
proper subspace of L²₀. In this case the estimation problem is harder, even though the
family of distributions is now smaller than in the "less restricted case" discussed before.
It will be shown in chapter 5 that, in fact, the semiparametric Cramér-Rao bound is
not attained in this example. The efficient score function and its roots now depend on
the nuisance parameter, although through a finite dimensional intermediate parameter,
namely only through a finite number of standardized cumulants (of order up to 2k). This
suggests that plugging some reasonable estimators of the standardized cumulants (of order
up to 2k) into the efficient score function can produce reasonable asymptotic results. In
fact, since the standardized cumulants can be estimated consistently from the sample
moments, plugging these estimators into the efficient score function should produce efficient
estimators. However, since one has to estimate standardized cumulants of high order
(2k), one can expect poor performance for finite samples of moderate size. The method
of locally efficient estimation and the method of sieves can perhaps offer some attractive
alternatives.

4.6 Appendices
4.6.1 The Laplace transform and polynomial approximation in
L2
Basic properties of the Laplace Transform
In this section we review the basic properties of the Laplace transform and prove some
technical lemmas required in the study of the semiparametric location-scale model defined
in section 4.2. The properties presented here are essentially well known for the case where
the distribution is concentrated on the positive real line; the result concerning the
denseness of polynomials in L² is, however, original.
Let f : IR −→ [0, ∞) be a function such that for some s ∈ IR the integral

M(s; f) = ∫_IR e^{sx} f(x) λ(dx)   (4.26)

converges. The function M(·; f) : IR −→ [0, ∞], such that for each s ∈ IR, M(s; f) is
given by (4.26), is said to be the Laplace transform of f.
We now study some properties of functions with finite Laplace transform in a
neighborhood of zero.

Proposition 19 Let f : IR −→ [0, ∞) be a continuous function such that for some δ > 0
and for all s ∈ (−δ, δ),

M(s; f) < ∞ .   (4.27)

Then f possesses finite moments of all orders, i.e. for all n ∈ N0,

∫_IR x^n f(x) λ(dx) ∈ IR .

Proof: Since M(s; f) < ∞ for all s ∈ (−δ, δ), and e^{|sx|} ≤ e^{sx} + e^{−sx}, using the series
version of the monotone convergence theorem (see Billingsley, 1986, theorem 16.6, page 214¹)
we have

∞ > ∫_IR e^{δx} f(x) λ(dx) + ∫_IR e^{−δx} f(x) λ(dx)
  ≥ ∫_IR e^{|δx|} f(x) λ(dx)
  = ∫_IR { Σ_{k=0}^∞ (|δx|^k / k!) } f(x) λ(dx)
  (from theorem 16.6 in Billingsley, 1986)
  = Σ_{k=0}^∞ { ∫_IR (|δx|^k / k!) f(x) λ(dx) } ,

and we conclude that the moments of all orders of f are in IR. □

¹ The theorem referred to states: "If f_n ≥ 0, then ∫ Σ_n f_n dµ = Σ_n ∫ f_n dµ."

The notion of Laplace transform can be extended to functions whose range is the whole
real line in the following way. Given a function f : IR −→ IR we define the positive
and the negative parts of f respectively by

f⁺(·) = f(·) χ_[0,∞){f(·)}  and  f⁻(·) = −f(·) χ_(−∞,0]{f(·)} .

Here χ_A(·) is the indicator function of the set A. We clearly have the decomposition

f(·) = f⁺(·) − f⁻(·) .

We define the Laplace transform of a function f : IR −→ IR as the function M(·; f) :
IR −→ [−∞, ∞] given by

M(·; f) = M(·; f⁺) − M(·; f⁻) ,   (4.28)

provided that at least one of the terms on the right-hand side of (4.28) is finite (otherwise
the Laplace transform of f is not defined). The following proposition will be useful for the
calculation of the L² nuisance tangent space of the location-scale model considered in
section 4.2.

Proposition 20 Let f : IR −→ IR and δ > 0 be such that M(s; f) ∈ IR for all s ∈ [−δ, δ].
Then, for all n ∈ N and all s ∈ (−δ/2, δ/2) we have

M[s; (·)^n f(·)] ∈ IR .

Proof: Assume without loss of generality that the function f is nonnegative. Take an
arbitrary s ∈ [−δ/2, δ/2] and n ∈ N. By hypothesis, f has finite Laplace transform in
a neighborhood of zero; then, from proposition 19, f has finite moments of all orders, in
particular

∫_IR x^{2n} f(x) λ(dx) ∈ IR .

Using the Cauchy-Schwarz inequality we obtain

|M[s; (·)^n f(·)]| = |⟨e^{(·)s}, (·)^n f(·)⟩_λ|
  = |⟨e^{(·)s} f^{1/2}(·), (·)^n f^{1/2}(·)⟩_λ|
  ≤ ‖e^{(·)s} f^{1/2}(·)‖ ‖(·)^n f^{1/2}(·)‖
  = ( ∫_IR e^{2sx} f(x) λ(dx) )^{1/2} ( ∫_IR x^{2n} f(x) λ(dx) )^{1/2} < ∞ . □

Polynomial approximation in L2 (a)


In this section we give a sufficient condition for the class of polynomials to be dense in
L²(a). Here a is a density with respect to the Lebesgue measure λ of a positive finite
measure on (IR, B(IR)), and L²(a) is endowed with the usual inner product and norm,
denoted by ⟨·, ·⟩_a and ‖·‖_a respectively. The conditions we give will ensure that the
measure a possesses all moments finite, i.e. for all k ∈ N,

∫_IR x^k a(x) λ(dx) ∈ IR .

In that case we can define the sequence of polynomials {e_i(·)}_{i∈N0} ⊆ L²(a) as the result
of a Gram-Schmidt orthonormalization process applied to the sequence {1, (·), (·)², . . .}.
The following theorem gives a sufficient condition for {e_i(·)} to be a complete sequence
in L²(a), which implies that the polynomials are dense in L²(a).

Theorem 11 Let a : IR −→ IR be a function such that

∀x ∈ IR, a(x) > 0 ;   (4.29)

∃δ > 0 such that ∀s ∈ [−δ, δ], M(s; a) = ∫_IR e^{sx} a(x) λ(dx) < ∞ .   (4.30)

Then the orthonormal sequence {e_i(·)}_{i∈N0} is complete in L²(a).

Proof: First of all we observe that condition (4.30) implies that the measure determined
by a possesses finite moments of all orders (see proposition 19).
Let f : IR −→ IR be a function in L²(a) such that for all k ∈ N0,

∫_IR x^k f(x) a(x) λ(dx) = 0 .   (4.31)

We prove that f(·) = 0 a-a.e., which implies the theorem (see Luenberger, 1969, Lemma 1,
page 61).
Define for each k ∈ N0, t ∈ [−δ/2, δ/2] and x ∈ IR,

f_k(x) = ((xt)^k / k!) f(x) a(x) .

We will use a series version of the dominated convergence theorem applied to {f_k}. To
this end we find a Lebesgue integrable function dominating, uniformly in n, the partial
sums of the f_k, which will enable us to use the theorem referred to. We have, for each
n ∈ N, t ∈ [−δ/2, δ/2] and x ∈ IR,

| Σ_{k=0}^n f_k(x) | ≤ Σ_{k=0}^n |f_k(x)| = Σ_{k=0}^n (|xt|^k / k!) |f(x)| a(x)   (4.32)
  ≤ Σ_{k=0}^∞ (|xt|^k / k!) |f(x)| a(x)
  = |f(x)| a(x) e^{|xt|} ≤ |f(x)| a(x) {e^{xt} + e^{−xt}}
  = { |f(x)| √a(x) } { √a(x) (e^{xt} + e^{−xt}) }
  = g(x) ,

where the function g is given, for all x ∈ IR, by

g(x) = { |f(x)| √a(x) } { √a(x) (e^{xt} + e^{−xt}) } .   (4.33)

We prove that the function g is Lebesgue integrable. First note that

‖ |f(·)| √a(·) ‖²_{L²(λ)} = ∫_IR |f(x)|² a(x) λ(dx) = ‖f(·)‖²_a < ∞ ,

so the first factor on the right-hand side of (4.33) is in L²(λ). On the other hand,

‖ √a(·) e^{(·)t} ‖²_{L²(λ)} = ∫_IR e^{2tx} a(x) λ(dx) = M(2t; a) < ∞

and

‖ √a(·) e^{−(·)t} ‖²_{L²(λ)} = ∫_IR e^{−2tx} a(x) λ(dx) = M(−2t; a) < ∞ ,
so the second factor on the right-hand side of (4.33) is in L²(λ). Using the Cauchy-Schwarz
inequality (see Luenberger, 1969, lemma 1, page 47) we obtain

∫_IR g(x) λ(dx) = ⟨ |f(·)| √a(·) , √a(·) (e^{(·)t} + e^{−(·)t}) ⟩_λ
  ≤ ‖ |f(·)| √a(·) ‖_{L²(λ)} ‖ √a(·) (e^{(·)t} + e^{−(·)t}) ‖_{L²(λ)} < ∞ .

Since (4.32) holds for each n ∈ N, x ∈ IR, t ∈ [−δ/2, δ/2], and g is Lebesgue integrable,
we can use the series version of the dominated convergence theorem (see Billingsley, 1986,
theorem 16.7, page 214²) to obtain

∫_IR e^{xt} f(x) a(x) λ(dx) = ∫_IR f(x) a(x) { Σ_{k=0}^∞ (xt)^k / k! } λ(dx)
  (from the series dominated convergence theorem)
  = Σ_{k=0}^∞ { ∫_IR ((xt)^k / k!) f(x) a(x) λ(dx) } = 0 .

We conclude that for all t ∈ [−δ/2, δ/2],

M[t; f(·) a(·)] = 0 .   (4.34)
We show that (4.34) implies that f(·) = 0 a-a.e. For,

‖f(·)‖²_a = |⟨f(·), 1⟩_a|
  = |⟨ √f(·) √f(·) , e^{(·)δ/4} e^{−(·)δ/4} ⟩_a|
  = |⟨ √f(·) e^{(·)δ/4} , √f(·) e^{−(·)δ/4} ⟩_a|
  (from the Cauchy-Schwarz inequality)
  ≤ ‖ √f(·) e^{(·)δ/4} ‖_a ‖ √f(·) e^{−(·)δ/4} ‖_a
  = ( ∫_IR f(x) e^{(δ/2)x} a(x) λ(dx) )^{1/2} ( ∫_IR f(x) e^{−(δ/2)x} a(x) λ(dx) )^{1/2}
  = {M[δ/2; f(·)a(·)]}^{1/2} {M[−δ/2; f(·)a(·)]}^{1/2} = (from (4.34)) = 0 . □

² The theorem states: "If Σ_n f_n converges almost everywhere and |Σ_{k=1}^n f_k| ≤ g almost everywhere,
where g is integrable, then Σ_n f_n and the f_n are integrable, and ∫ Σ_n f_n dµ = Σ_n ∫ f_n dµ."

Functions with exponentially decaying tails

The following proposition gives an easily verifiable sufficient condition for the Laplace
transform to be defined in a neighborhood of zero.

Proposition 21 Let f : IR −→ [0, ∞) be a continuous function such that for some δ > 0
and for all s ∈ [−δ, δ],

lim_{x→+∞} e^{sx} f(x) = lim_{x→−∞} e^{sx} f(x) = 0 .   (4.35)

Then we have:

i) For all s ∈ (−δ, δ) the Laplace transform of f, M(s; f), is finite.

ii) For all k ∈ N,

lim_{x→+∞} x^k f(x) = lim_{x→−∞} x^k f(x) = 0 .

Proof:
i) Take s ∈ (−δ, δ). Condition (4.35) implies that there exists L ∈ IR+ such that for all
x ∈ IR \ [−L, L], e^{δx} f(x) < 1 and e^{−δx} f(x) < 1. We then have

M(s; f) = ∫_IR e^{sx} f(x) λ(dx)
  = ∫_{[−L,L]} e^{sx} f(x) λ(dx) + ∫_{[L,∞)} e^{sx} f(x) λ(dx) + ∫_{(−∞,−L]} e^{sx} f(x) λ(dx)
  = ∫_{[−L,L]} e^{sx} f(x) λ(dx) + ∫_{[L,∞)} e^{(s−δ)x} e^{δx} f(x) λ(dx) + ∫_{(−∞,−L]} e^{(s+δ)x} e^{−δx} f(x) λ(dx)
  ≤ ∫_{[−L,L]} e^{sx} f(x) λ(dx) + ∫_{[L,∞)} e^{(s−δ)x} λ(dx) + ∫_{(−∞,−L]} e^{(s+δ)x} λ(dx) < ∞ .

ii) For each k ∈ N,

lim_{x→+∞} x^k f(x) = lim_{x→+∞} {e^{−δx} x^k} {e^{δx} f(x)} = 0

and

lim_{x→−∞} x^k f(x) = lim_{x→−∞} {e^{δx} x^k} {e^{−δx} f(x)} = 0 . □

4.6.2 Calculation of the L2 - nuisance tangent space at (0, 1, a)


In this appendix we complete the details of the second part of the proof of theorem 10.
Recall the notational conventions given there: {e_i}_{i∈N0} denotes the result of a Gram-Schmidt
orthonormalization process, with respect to the inner product of L²(a), applied
to the sequence of monomials {1, (·), (·)², . . .}. Moreover, TN²(0, 1, a) is denoted by TN
(with the analogous convention for the tangent sets). Define, for each i ∈ N,

H_i = span{e_1(·), . . . , e_i(·)} .

Lemma 4 Under the location-scale model (4.1) we have H_k^⊥ ⊆ TN.

Proof: For each i ∈ {k + 1, k + 2, . . .} define

h_i(·) = e_i(·) ,                if i is even,
h_i(·) = e_{i+1}(·) − e_i(·) ,   if i is odd.

We will prove that h_i(·) ∈ TN^0 ⊆ TN for all i ∈ {k + 1, k + 2, . . .}. This implies the lemma.
To see this, note that TN contains every linear combination of {h_{k+1}(·), h_{k+2}(·), . . .}. In
particular, if i is even then e_i(·) = h_i(·) ∈ TN, and if i is odd then e_i(·) = h_{i+1}(·) − h_i(·) ∈ TN.
Hence e_i(·) is in TN for all i ∈ {k + 1, k + 2, . . .}. Since TN is a closed linear subspace of
L²(a),

cl_{L²(P01a)} {span[{e_i(·) : i ∈ {k + 1, k + 2, . . .}}]} ⊆ TN .

On the other hand, condition (4.7) and theorem 11 imply that the polynomials are
dense in L²(a); since H_k = span{e_0(·), . . . , e_k(·)} and {e_i(·)}_{i∈N0} is an orthonormal
system, we have

H_k^⊥ = cl_{L²(P01a)} {span[{e_i(·) : i = k + 1, k + 2, . . .}]} .

We conclude that the lemma is proved if we show that h_i(·) ∈ TN^0 for all i ∈ {k + 1, k + 2, . . .},
and this is done below.
Let ν(·) = h_i(·) for a fixed but arbitrary i ∈ {k + 1, k + 2, . . .}. We prove that for t
small enough,

a_t(·) ≡ a(·) + t a(·) ν(·) ∈ A .   (4.36)

Note that (4.36) amounts to taking a differentiable path with vanishing remainder term.
Hence, if we show (4.36) we prove in fact that ν ∈ TN^0. We verify next that each a_t (for t
in a neighborhood of zero) satisfies the conditions (4.2)-(4.10).

Verification of (4.2):
Note that

lim_{x→+∞} ν(x) = lim_{x→−∞} ν(x) = ∞ .

Hence there exists L > 0 such that ν(x) > 0 for all x ∈ IR \ [−L, L]. Then, since a(·) is
strictly positive, for all x ∈ IR \ [−L, L],

a_t(x) = a(x) + t a(x) ν(x) > 0 .

On the other hand, since ν(·) is continuous, its restriction to the compact interval [−L, L]
is bounded, and since a(·) is continuous and strictly positive, the restriction of a(·) to
[−L, L] is bounded away from zero. It is then easy to show that, for t small enough,
a_t(x) > 0 for all x ∈ [−L, L].
Verification of (4.3)-(4.5) and (4.10):
Given i ∈ {0, 1, . . . , k} and t ∈ IR+,

∫_IR x^i a_t(x) λ(dx) = ∫_IR x^i a(x) λ(dx) + t ∫_IR x^i ν(x) a(x) λ(dx) = m_i + t · 0 = m_i .

The second equality uses the fact that {e_i}_{i∈N0} is an orthogonal system in
L²(a), together with (4.3)-(4.5) and (4.10).
Verification of (4.6):
Since the polynomials are of class C ∞ the property follows immediately for at .
Verification of (4.7):
We have for all s ∈ (−δ, δ) (δ = δ(a) ) and all t ∈ IR+

M (s; at ) = M (s; a) + tM (s; aν) .

Since ν is a polynomial and M (s; a) < ∞ by hypothesis, from proposition 20 (in the
appendix on the Laplace transform), M (s; aν) < ∞, then, for all s ∈ (−δ, δ)

M (s, at ) < ∞ .

Verification of (4.8):
A routine calculation yields
{a0t ( · )}2 = p( · ){a0 ( · )}2 + q( · ){a( · )}2 + w( · )a( · )a0 ( · ) , (4.37)
for some polynomials p, q, and w. We will show that for all s ∈ [−δ/2, δ/2] and for t ∈ IR+
small enough,
M [s; p( · ){a0 ( · )}2 /at ( · )] , M [s; q( · )a2 ( · )/at ( · )] , (4.38)
M [s; w( · ){a( · )a0t ( · )}/at ( · )] ∈ IR .
100 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

Using (4.37) together with (4.38) yields


{a0 ( · )}2 {a0 ( · )}2
" # " #
M s; t = M s; p( · ) (4.39)
at ( · ) at ( · )
a2 ( · ) a( · )a0t ( · )
" # " #
+ M s; q( · ) + M s; r( · ) ∈ IR ,
at ( · ) at ( · )
for all s ∈ [−δ/2, δ/2], which implies (4.8).
We prove (4.38). Take t small enough in such a way that for all x ∈ IR, at (x) > 0 and
let s be an arbitrary element of [−δ/2, δ/2]. The Cauchy-Schwartz inequality gives that
#
0
a0 ( · )
"
a( · )a ( · ) a( · )

s
( · ) 2s
= < e( · ) 2 r( · ) q
t
M s; r( · ) ,e >L2 (λ) (4.40)

q
at ( · )
at ( · ) at ( · )

a( · ) ( · )s/2 a0 ( · )

( · )s/2
≤ e
r( · ) q e q
at ( · ) at ( · )


L2 (λ) L2 (λ)

We verify that each of the terms in the right side of (4.40) are finite. Note that since
limx→±∞ ν(x) = ∞, there exists a L > 0 such that for all x ∈ IR \ [−L, L], at (x) > a(x).
We have then
2
a( · ) a2 (x)
Z
( · )s/2 sx 2
e r( · ) q = e r (x) λ(dx) (4.41)

at ( · )

IR at (x)
L2 (λ)
Z
a2 (x)
sx 2
Z
a2 (x)
= e r (x) λ(dx) + esx r2 (x) λ(dx)
[−L,L] at (x) IR\[−L,L] at (x)
Z
a2 (x) Z
a2 (x)
≤ esx r2 (x)) λ(dx) + esx r2 (x) λ(dx) < ∞ .
[−L,L] at (x) IR a(x)
The first integral in the last line is finite because its integrand is continuous and hence
bounded in [−L, L], and the second integral is finite because it coincides with the Laplace
transform of r2 ( · )a( · ) which according to proposition 20 is finite. Moreover,
2
( · )s/2 a0 ( · ) {a0 (x)}2
Z
e q = esx λ(dx) (4.42)

at ( · )

IR at (x)
L2 (λ)
Z
{a0 (x)}2 Z
{a0 (x)}2
≤ esx λ(dx) + esx λ(dx) < ∞ .
[−L,L] at (x) IR a(x)
The first integral in the last line is finite because its integrand is continuous and hence
bounded in [−L, L], and the second integral is finite because it coincides with the Laplace
4.6. APPENDICES 101
0 2
transform of {aat((··)}) which according to condition (4.8) is finite. Inserting (4.41) and
(4.42) in (4.40) we obtain, for all s ∈ [−δ/2, δ/2],
a( · )a0 ( · )
" #

M s; r( · ) <∞.

at ( · )


2
h i
We show now that M s; q( · ) aat (( ·· )) is finite. Using the Cauchy-Schwartz inequality we
obtain
"
a2 ( · )
#
a( · ) ( · )s/2 a( · )

< e( · )s/2 q( · ) q

M s; q( · ) = ,e >L2 (λ) (4.43)

q
at ( · )

at ( · ) at ( · )



a( · ) ( · )s/2 a( · )

( · )s/2
≤ e q( · ) q e q < ∞.
at ( · ) at ( · )


L2 (λ) L2 (λ)

To see that the right side of the expression above is finite note that for every polynomial,
say r, we have

a( · ) {a(x)}2
Z
( · )s/2 sx 2
e r( · ) q = e r (x) λ(dx) (4.44)

at ( · )

IR at (x)
L2 (λ)
Z
sx 2 {a(x)}2 Z
{a(x)}2
≤ e r (x) λ(dx) + esx r2 (x) λ(dx)
[−L,L] at (x) IR\[−L,L] a(x)
Z
{a(x)}2
≤ esx r2 (x) λ(dx) + M [s; r2 ( · )a( · )] < ∞.
[−L,L] at (x)
Note that proposition 20 himplies that M [s; r2 ( · )a( · )] < ∞.
0 2
i
Finally, we show that M s; p( · ) {aat((··)}) is finite. For,

{a0 ( · )}2 0
" Z
2
#
{a (x)}
M s; p( · ) = esx p(x) λ(dx)

at ( · )

IR at (x)
0
Z
{a (x)} 2 Z
{a0 (x)}2
≤ esx |p(x)| λ(dx) + esx |p(x)| λ(dx)
[−L,L] at (x) IR\[−L,L] a(x)
{a0 (x)}2 {a0 ( · )}2
Z " #
sx
≤ e |p(x)| λ(dx) + M s; |p( · )| < ∞.
[−L,L] at (x) a( · )
Verification of (4.9):
Given i ∈ N0 we have for each t ∈ IR+ ,
lim xi at (x) = lim {xi a(x) + xi ν(x)a(x)} = 0 ,
x→±∞ x→±∞
102 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

because xi ν(x) is a polynomial and from (4.9), for each polynomial p( · ) we have

lim p(x)a(x) = 0 .
x→±∞

u
t
4.6. APPENDICES 103

4.6.3 Calculation of the tangent space at an arbitrary point


In this section we show how to extend the calculation of the strong nuisance tangent
space of a semiparametric location-scale model at the point (0, 1, a0 ) to an arbitrary point
(µ0 , σ0 , a0 ) ∈ IR × IR+ × A. More precisely we prove the following theorem

Theorem 12 Under a semiparametric location-scale model the nuisance tangent space at


(µ0 , σ0 , a0 ) ∈ IR × IR+ × A is given by

· − µ0
   
TN (µ0 , σ0 , a0 ) = ν : ν ∈ TN (0, 1, a0 ) .
σ0

The theorem above holds for any of the notions of path differentiability given in Labouriau
(1996a), and in particular for any path differentiability used in this paper. Therefore we
do not specify the path differentiability adopted in this appendix.
The point (µ0 , σ0 , a0 ) will be considered fixed in the rest of this section. We introduce
the following notation:
Pµ0 σ0 a0 = P0 ;
1 · − µ0
 
a• ( · ) = a0 .
σ0 σ0
The inner product and the norm of L2 (P0 ) will be denoted by < · , · >0 and k · k0
respectively. We also use the symbol P0∗ to denote Pµ∗0 σ0 . Note that

1 · − µ0
   
P0∗ = a : a∈A
σ0 σ0

and
( )
ν ∈ L20 (P0 ) : ∃ > 0, {a0t }t∈[0,) ⊆ P0∗ , {rt0 }t∈[0,) ⊆ L2 (P0 )
TN0 (µ0 , σ0 , a0 ) = 0 .
such that (4.45) and (4.46) hold

The conditions required above are

a0t ( · ) = a0 ( · ) + a0 ( · ) ν0 ( · ) + a0 ( · ) rt0 ( · ) ∈ P0∗ (4.45)

and

rt0 ( · ) −→ 0 , as t ↓ 0 , (4.46)

where the convergence above is one of the forms of convergence defined in section ??.
 
· −µ0
Lemma 5 If ν ∈ TN0 (0, 1, a0 ), then ν σ0
∈ TN0 (µ0 , σ0 , a0 ).
104 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS


Proof: Since ν ∈ TN0 (0, 1, a0 ) there exists  > 0, {at }t∈[0,) ⊆ P01 = A and {rt }t∈[0,) ⊆
2
L (P01a0 ) such that for all t ∈ [0, ),

at ( · ) = a0 ( · ) + ta0 ( · ) ν( · ) + ta0 ( · ) rt ( · ) ∈ A (4.47)

and
rt0 ( · ) −→ 0 , as t ↓ 0 .
Using (4.47), for all t ∈ [0, ), we can write
1 · − µ0 1 · − µ0 1 · − µ0 · − µ0
       
at = a0 + t a0 ν (4.48)
σ0 a0 σ0 a0 σ0 a0 a0
1 · − µ0 · − µ0
   
+ t a0 rt
σ0 a0 a0
· − µ0 · − µ0
   
= a• ( · ) + ta• ( · )ν + ta• ( · )rt .
a0 a0
 
Clearly, σ10 at · −µ
a0
0
∈ P0∗ , because at ( · ) ∈ A. Suppose now that the convergence of the
remainder term is in the Lp sense (for q ∈ [1, ∞)). We have then
· − µ0 q x − µ0 q
  Z   

rt = rt a• (x)λ(dx) (4.49)

a0
Lq (P0 ) IR a0
Z
= {rt (y)}q a0 (y)λ(dy) = krt ( · )kqLq (a0 ) .
IR
 
· −µ0
Since, for all t ∈ [0, ), rt ( · ) ∈ Lq (P01a0 ), then from (4.49), rt a0
∈ L2 (P0 ), and
  q
  q 0
since krt ( · )kL2 (P0 ) −→ 0, rt · −µ
a0
0
q −→ 0. We conclude that ν · −µ0
σ0
∈TN
L (P01a0 )
(µ0 , σ0 , a0 ). An analogous argument can be used to prove the lemma for the weak nuisance
tangent spaces. The idea again is to use a suitable change of variables in the integrals
used to define the convergence of the remainder term. u
t

 
· −µ0
Lemma 6 If ν σ0
∈ TN0 (µ0 , σ0 , a0 ), then ν( · ) ∈ TN0 (0, 1, a0 ).

Proof: We give next the argument for the L2 - nuisance tangent space.  For the other no-
· −µ0
tions of path differentiability the argument is analogous. Since ν σ0 ∈ TN0 (µ0 , σ0 , a0 ),
there exist  > 0, {a0t }t∈[0,) ⊆ P0∗ and {rt0 }t∈[0,) ⊂ L2 (P0 ) such that
· − µ0
 
a0t ( · ) = a• ( · ) + ta• ( · ) ν + ta• ( · ) rt ( · ) (4.50)
σ0
4.6. APPENDICES 105

and
0
rt ( · ) −→ 0 , as t ↓ 0 .

0

Since for each t ∈ [0, ) a0t ( · ) ∈ P0∗ , there exists at ( · ) ∈ A such that
1 · − µ0
 
a0t ( · ) = at .
σ σ0
Then (4.50) is equivalent to
1 · − µ0 1 · − µ0 1 · − µ0 · − µ0
       
at = a0 + t a0 ν
σ σ0 σ σ0 σ σ0 σ0
1 · − µ0 0
 
+t a0 rt ( · ) .
σ σ0
1
Eliminating the common factor σ
and changing variables we obtain

at ( · ) = a0 ( · ) + ta0 ( · )ν( · ) + ta0 ( · )rt (σ0 ( · ) + µ0 ) .

Note that
Z
k rt [σ0 ( · ) + µ0 ] k2L2 (P01a0 ) = {rt [σ0 (x) + µ0 ]}2 a0 (x)λ(dx) (4.51)
IR
1 y − µ0
Z  
= {rt (y)}2 a0 λ(dy)
IR σ σ0
= krt ( · )k2L2 (P0 ) .

Then rt (σ0 ( · ) + µ0 ) ∈ L2 (P01a0 ) and rt [σ0 ( · ) + µ0 ]k2L2 (P01a ) −→ 0 as t ↓ 0. We conclude


0
that ν( · ) ∈ TN0 (0, 1, a0 ). u
t

We give now the proof of theorem 12.

Proof: (of theorem 12) From the lemmas 5 and 6 we have


· − µ0
   
TN0 (µ0 , σ0 , a0 ) = ν : ν∈ TN0 (0, 1, a0 ) .
σ0

Using the lemmas 7 and 8 given below we have

TN (µ0 , σ0 , a0 ) = clL2 (P0 ) [span{TN0 (µ0 , σ0 , a0 )}]


· − µ0
    
= clL2 (P0 ) span ν : ν ∈ TN0 (0, 1, a0 )
σ0
(from lemma 7 )
106 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

· − µ0
   
= clL2 (P0 ) ν : ν ∈ span{TN0 (0, 1, a0 )}
σ0
(from lemma 8 )
· − µ0
   
0
= ν : ν ∈ clL2 (a0 ) [span{TN (0, 1, a0 )}]
σ0
· − µ0
   
= ν : ν ∈ TN (0, 1, a0 )
σ0
u
t

We present now the two technical lemmas required in the proof given above.
Lemma 7 Given a class of functions A we have
· −µ · −µ
       
span ν : ν∈A = ν : ν ∈ span(A) .
σ σ

Proof:
’⊆’ n   o
Take h ∈ span ν · −µ
σ0
0
: ν ∈ A . Then there exists n ∈ N , t1 , ..., tn ∈ IR and h1 , ..., hn ∈
n   o
· −µ0
ν σ0
: ν ∈ A such that
n
X
h( · ) = ti hi ( · ) .
i=1
n   o
· −µ0
Since for each i ∈ {1, ..., n}, hi ∈ ν σ0
: ν ∈ A , there exists νi ∈ A such that
· − µ0
 
νi = hi ( · ) .
σ0
Pn
Clearly i=1 ti νi ( · ) ∈ span(A) and
n
· − µ0
X  
h( · ) = ti νi .
i=1 σ0
n   o
· −µ0
Hence h ∈ ν σ0
: ν ∈ span(A) .

’⊇’ n   o
Take z ∈ ν · −µ
σ0
0
: ν ∈ span(A) . Then there exists ν ∈ span(A) such that z( · ) =
 
ν · −µ
σ0
0
. Since ν ∈ span(A), there exists n ∈ N , t1 , ..., tn ∈ IR and ν1 , ..., νn ∈ A such
that ν( · ) = ni=1 ti νi ( · ). We have then
P

n
· − µ0
X  
z( · ) = ti νi
i=1 σ0
4.6. APPENDICES 107
n   o
· −µ
and hence z ∈ span ν σ
: ν∈A . u
t

Lemma 8 Let A be a class of functions contained in L2 (a0 ). Then


· −µ · −µ
       
clL2 (P0 ) ν : ν∈A = ν : ν ∈ clL2 (a0 ) (A) .
σ σ

Proof: Note that for all f ∈ L2 (a0 ) we have


· − µ0 2 · − µ0 1 · − µ0
  Z    
2

f = f a0 λ(dx) (4.52)

σ0
L2 (P0 ) IR σ0 σ0 σ0
Z
= f 2 (y)a(y)λ(dy) = kf ( · )kL2 (a0 ) .
IR
We prove now the lemma.
”⊆” n   o n   o
Take z ∈ clL2 (P0 ) ν · −µ
σ
: ν ∈ A . Then there exists a sequence {zn } ⊆ ν · −µ
σ
: ν ∈ A
L2 (P0 )
 
such that zn −→ z. Moreover, for all n ∈ N we have zn ( · ) = νn · −µ σ0
0
, for some νn ∈ A.
2
Since {zn } is convergent, it is a Cauchy sequence in L (P0 ). From (4.52), {νn } is a Cauchy
sequence in L2 (a0 ) and, since L2 (a0 ) is complete, {zn } is convergent, say
L2 (a0 )
νn −→ ν ,
for some η ∈ clL2 (a0 ) (A). Using (4.52) and the L2 (a0 )-continuity of the norm k · kL2 (a0 )
we obtain
· −µ · −µ
   

ν − z( · ) = lim νn − zn ( · ) = 0.

σ 2
L (P0 ) n↑

σ 2
L (P0 )
 
Hence, z( · ) = ν · −µ
σ
and the inclusion follows.
”⊇” n   o
Take z ∈ ν · −µ σ
: ν ∈ clL2 (a ) (A) . Since ν ∈ clL2 (a ) , there exists a sequence {νn } ⊆ A
0 0

L2 (a0 )
such that νn −→ ν. Note that {νn } is a Cauchy sequence 
in L2 (a0 ). Define the
sequence {zn } in L2 (P0 ) by, for all n, zn ( · ) = νn · −µ σ0
0
. From (4.52) we see that
2
{zn } isna Cauchy 
sequenceo
in L (P0 ) and then it is convergent with limit, say ξ ∈
· −µ0
clL2 (P0 ) ν σ0 : ν ∈ A . Using (4.52) and the L2 (P0 )-continuity of the norm k · kL2 (P0 )
we obtain
· − µ0
 

kξ( · ) − z( · )kL2 (P0 ) = lim zn ( · ) − νn
= 0. (4.53)
n↑ σ0
L2 (P0 )
u
t
108 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS

4.6.4 Calculation of the first four orthogonal polynomials in


L2 (a)
We calculate here the first four orthogonal polynomials in L2 (a) (i.e. e0 , e1 , e2 and e3 ) in
terms of the standardized cumulants of the distribution given by a. Here a is an arbitrary
element of the class of probability densities A defined in section 4.2.
Throughout this appendix the inner product and the norm of L2 (a) willR be denoted by
< · , · > and k · k respectively. We use also the notation, for i ∈ N0 , mi = IR xi a(x)λ(dx).
Recall that {ei }i∈N0 is the result of a Gram-Schmidt orthonormalization procedure
applied to the sequence of polynomials {1, ( · ), ( · )2 , ...}. We start the Gram-Schmidt
procedure by setting
e0 ( · ) = 1 ,
and observing that
ke0 k = 1 .
Taking e1 ( · ) = ( · ) we have
Z
< e0 , e1 >= xa(x)λ(dx) = 0
IR

and
Z
2
ke1 k = x2 a(x)λ(dx) = 1 . (4.54)
IR

Hence e1 satisfies the required orthonormal conditions.


We calculate now e2 . The Gram-Schmidt orthogonalization procedure (see Luenberg,
1969, page 55) gives the following polynomial of degree 2 orthogonal to e0 and e1 :
p2 ( · ) = ( · )2 − < e0 , ( · )2 > e0 ( · )− < e1 , ( · )2 > e1 ( · )
= ( · )2 − m3 ( · ) − 1 .
Note that the polynomials {1, ( · ), ( · )2 } are linearly independent in L2 (a) and p2 is a
linear combination of these polynomials with non-vanishing leading coefficient. Hence
p2 6= 0 and consequently kp2 k =
6 0. We denote kp2 k by ∆2 . We stress that ∆2 > 0, which
is a known result (see Kendall and Stuart, 1952) We have
Z
∆22 = (x2 − m3 x − 1)2 a(x)λ(dx) = m4 − m23 − 1 .
IR

We define
1
e2 ( · ) = {( · )2 − m3 ( · ) − 1} .
∆2
4.6. APPENDICES 109

We compute now e3 . The Gram-Schmidt orthogonalization procedure gives the fol-


lowing polynomial of degree 3 orthogonal to e0 , e1 and e2

p3 ( · ) = ( · )3 − < e0 , ( · )3 > e0 ( · )− < e1 , ( · )3 > e1 ( · )− < e2 , ( · )3 > e2 ( · )


( )
3 m5 − m3 m4 − m3 2 m3 (m5 − m3 m4 − m3 )
=( · ) − ( · ) − m4 − (·)
∆22 ∆22
( )
m5 − m3 m4 − m3
− m3 − .
∆22

Denote kp3 k by ∆3 . Note that p3 is a linear combination of linearly independent elements


of L2 (a) (in fact e0 , e1 , e2 and ( · )3 ) with a non-vanishing leading coefficient. Then p3 ( · ) 6=
0 and consequently ∆3 = kp3 k > 0 (this generalizes the result of Kendall and Stuart (1952)
concerning ∆2 ). We have then
1
e3 ( · ) = p3 ( · )
∆3
" ( )
1 3 −m3 m4 − m3 2 m3 (m5 − m3 m4 − m3 )
= (·) − ( · ) − m4 − (·)
∆3 ∆22 ∆22
( )#
m5 − m3 m4 − m3
− m3 − .
∆22
110 CHAPTER 4. SEMIPARAMETRIC LOCATION AND SCALE MODELS
Chapter 5

Semiparametric Models with L2


Restrictions

5.1 Introduction
This chapter treats a class of semiparametric models defined via restrictions imposed on
the moments of some square integrable functions. The class of models studied include
semiparametric extensions of many important models such as the multivariate location
and shape models, covariance selection models, growth curve models with modeled vari-
ance, generalized linear models, factor analysis, multivariate structural models, linear
structural relationships models, among others. In fact we present a tool to produce
semiparametric extensions of parametric models for which the moment structure plays a
structural rule.
The class of semiparametric models presented possesses a simple mathematical struc-
ture which makes it attractive to be used for testing general inference procedures. We
will be able to calculate explicitly the nuisance tangent spaces, which are the orthogonal
complements of the spaces spanned by the square integrable functions used to introduce
the restrictions defining the model. It is interesting to observe that the different notions
of nuisance tangent spaces defined in chapter 2 coincide for the models considered. More-
over, the orthogonal complement of the nuisance tangent spaces does not depend on the
nuisance parameter. This simplifies the treatment of some classic techniques for inference
in semiparametric models. For instance, any regular estimating function will be a lin-
ear combination of the restriction functions used to define the model corrected by their
means. In fact it will be shown that there is only one possible root for a regular estimating
function. That root is obtained in the form of a moment estimator. The efficient score
function will be a linear combination of the mean corrected restriction functions, however
with the coefficients in general depending on the nuisance parameter. The semiparametric

111
112 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

Cramèr-Rao bound for the asymptotic variance of regular asymptotic linear estimating
sequences will be obtained from the quadratic norm of the efficient score function. We
will give a necessary and sufficient condition for attaining this bound with estimating
sequences based on regular estimating functions. There will be presented examples where
the bound is attained and examples where it is not attained by roots of regular estimating
functions.
The chapter is organized in the following way. Section 5.2 introduces the main class
of semiparametric models we deal with and some examples are given. The estimation
via regular asymptotic linear estimating sequences and regular estimating functions is
considered in section 5.4. Most of the technical proofs are given in the section 5.5.
5.2. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS 113

5.2 Semiparametric Models with L2 Restrictions


We define next the class of semiparametric models studied in this chapter. Suppose that
there are k, m ∈ N ∪ {0}, l ∈ N ∪ {0, ∞}, the functions f1 , . . . , fk : X −→ IR and the
functions g1 , . . . , gm : X × Z −→ IR such that for each θ0 ∈ Θ the submodel Pθ∗0 is the
class of functions p : X −→ IR+ for which the conditions (5.1)-(5.6) hold. The conditions
referred are, for j = 1, . . . , k and i = 1, . . . , m:
∀x ∈ X , p(x) > 0; (5.1)

p is of class C l , (5.2)
where for r ∈ N , C r is the class of functions r times continuously differentiable C 0 is the
class of continuous functions and C ∞ is the class of functions infinitely many differentiable;
Z
p(x)λ(dx) = 1; (5.3)
X
Z
fj2 (x)p(x)λ(dx) < ∞; (5.4)
X
Z
fj (x)p(x)λ(dx) = Mj (θ0 ); (5.5)
X
Z
for each z ∈ Z, gi (x, z)p(x)λ(dx) ∈ Bi (θ0 ), (5.6)
X

where Mj (θ0 ) ∈ IR and Bi (θ0 ) is a real open set given.


The conditions (5.1) and (5.3) ensure that p is a probability density of a distribution
with support equal to the whole sample space X . Conditions (5.4)-(5.6) are used to restrict
each submodel Pθ0 (and consequently shrink the model P). These condition could be used
to express a partial a priori knowledge about the phenomena we study or to ensure some
desirable mathematical characteristics of the model, such as identifiability and regularity
of the partial score functions. For instance, the conditions (5.6) can be used to ensure
that the partial score functions are in L2 . Condition (5.2) can be assumed to hold apart
from a λ- null set.
Note that there are redundances in the model. In fact, condition (5.4) can be expressed
in terms of condition (5.6), however it is important to make it explicit that the functions
f1 , . . . , fk are in L2 (p). Furthermore, without loss of generality, we assume that the
functions f1 , . . . , fk are linearly independent (as elements of L2 ); because if f1 , . . . , fk are
linearly dependent, then condition (5.4) could be expressed with a smaller number of
functions.
We refer to the models of the form described above as L2 - restricted semiparametric
models or simply L2 - restricted models.
114 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

5.3 Examples
5.3.1 Location-scale models
In this example we consider a collection of location-scale models defined on the real line,
for which we fix the first k (for k ∈ N greater or equal 2) standardized cumulants of the
distributions. The models we define are larger than the semiparametric extensions of the
location-scale models considered in chapter 4. In particular, we avoid to use strong condi-
tions on the tails and on the Laplace transform of the distributions, but those conditions
can be incorporated in the model through condition (5.6), without essentially modifying
the results obtained. These models will serve as prototypes of L2 - restricted models.
The sample space here X is the real line, B is the Borel σ-algebra in IR and λ is
the Lebesgue measure defined on (IR, B). Consider the following family of probability
measures dominated by λ
( )
dPµσa 1 · −µ
 
P = Pµσa : (·) = a , µ ∈ IR, σ ∈ IR+ , a ∈ A , (5.7)
dλ σ σ

where A is the class of probability densities a : IR −→ IR+ such that the conditions
(5.8)-(5.16) are satisfied. The conditions referred are:

∀x ∈ X , a(x) > 0; (5.8)

a ∈ C 1; (5.9)

Z
a(x)λ(dx) = 1; (5.10)
IR

Z
xa(x)λ(dx) = 0; (5.11)
IR

Z
x2 a(x)λ(dx) = 1; (5.12)
IR

Z
x2k a(x)λ(dx) < ∞; (5.13)
IR

Z
for j = 3, . . . , k, xj a(x)λ(dx) = mj . (5.14)
IR
5.3. EXAMPLES 115

Here numbers m3 , . . . , mk are fixed values of the standardized cumulants of order up to


k. We convent that condition (5.14) vanishes if k = 2. Furthermore, we assume that the
score function of the submodels defined by fixing the values of a ∈ A are in L2 , i.e.
)2
a0 (x)
Z (
a(x)λ(dx) ∈ (0, ∞) ; (5.15)
IR a(x)

)2
xa0 (x)
Z (
a(x)λ(dx) ∈ (0, ∞) . (5.16)
IR a(x)
The parameter of interest is θ := (µ, σ) ∈ Θ := IR × IR+ . The nuisance parameter is
a ∈ A.
The characterization given above is natural for the location and scale model, however it
is not in the form of a L2 - restricted semiparametric model. In fact, we define the submodel

P(0,1) (= A) and then use the affine transformation (·) 7→ σ(·) + µ to characterize the
whole model P. On the other hand, in the definition of the L2 - restricted model we give
conditions satisfied by each submodel Pθ (for θ ∈ Θ).
We give next an alternative characterization of the location and scale model given by
(5.7), which is in the form required for the L2 - restricted models. For each (µ, σ) ∈ IR×IR+
consider the following class of probability densities on the real line,

Pµσ = {pµ,σ : IR −→ IR+ : (5.18)-(5.24) hold } . (5.17)

The conditions referred in the definition above are, for each (µ, σ) ∈ IR × IR+ ,
∀x ∈ IR, pµ,σ (x) > 0; (5.18)

pµ,σ ∈ C 1 ; (5.19)
Z
pµ,σ (x)λ(dx) = 1; (5.20)
IR

j !
j
Z
j
σ j−i µi mj−i ,
X
for j = 1, . . . , k, x pµ,σ (x)λ(dx) = (5.21)
IR i=0
i

where m3 , . . . , mk are the fixed standardized cumulants, m0 = 1, m1 = 0 and m2 = 1.


The terms in the right of (5.21) come from the binomial expansion of (σx + µ)j with the
ith power of x replaced by mi (i = 0, . . . , j). We assume further that
Z
x2k pµ,σ (x)λ(dx) < ∞; (5.22)
IR
116 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS
( 0 )2
Z
p µ,σ (x)
pµ,σ (x)λ(dx) ∈ (0, ∞) ; (5.23)
IR pµ,σ (x)

)2
xp0µ,σ (x)
Z (
pµ,σ (x)λ(dx) ∈ (0, ∞) . (5.24)
IR pµ,σ (x)

It is easy to see that the model given by (5.17) coincides with the model given by (5.7).
Moreover, the model given by (5.17) is clearly a L2 - restricted model. Here the condition
(5.21) corresponds to the condition (5.12) in the definition of L2 - restricted model with
fj ( · ) = ( · )j and Mj (µ, σ) given by the right side of (5.21).
5.3. EXAMPLES 117

5.3.2 Multivariate location-shape models


We define next a semiparametric version of a q- dimensional location and shape model
(q ∈ N ). Even though this model is a straightforward extension of the one- dimensional
location and scale model, it will serve us as a basis for the next example where we illustrate
alternative uses of the conditions (5.5) and (5.6).
Let V = Vq be the class of symmetric and nonsingular q × q matrices. Consider the
following class of probability densities on IRq , for each µ ∈ IRq and Σ ∈ V,
∗ q
Pµ Σ = {p : IR −→ IR+ such that (5.26) - (5.33) hold} . (5.25)

The conditions referred are

∀x ∈ IRq , p(x) > 0; (5.26)

p ∈ C 1; (5.27)

Z
p(x)λ(dx) = 1; (5.28)
IRq

Z
xp(x)λ(dx) = µ; (5.29)
IRq

Z
x · xT p(x)λ(dx) = Σ + µµT ; (5.30)
IRq

Z
(xT · x)2 p(x)λ(dx) < ∞. (5.31)
IRq

Assume moreover that the components of the partial score function are in L2 , i.e.
Z n o2
for i = 1, . . . , q, l/µi (x) p(x)λ(dx) ∈ (0, ∞) ; (5.32)
IRq

Z n o2
for i, j = 1, . . . , q, i ≤ j, l/σij (x) p(x)λ(dx) ∈ (0, ∞) , (5.33)
IRq

where l/µi and l/σij are the components corresponding to the partial derivative with respect
to the ith component of µ and the (i, j)th entry of Σ, respectively, in the score function
for the location and shape generated by the distribution associated with the density p.
Note that by (5.26) and (5.28) p is a probability density on IRq with support equal to the
118 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

whole IRq . Moreover, (5.27) implies that the partial score function referred in (5.32) and
(5.33) are well defined.
The multivariate location and shape model is defined by

P ∗ = ∪µ∈IRq ,Σ∈V PµΣ



.

Clearly P ∗ is a L2 - restricted semiparametric model.


A multivariate location model can be defined in a similar way, simply by dropping the
parameter Σ, eliminating the conditions (5.30) and (5.33) and replacing condition (5.31)
by
Z
xT xp(x)λ(dx) < ∞ .
IRq
5.3. EXAMPLES 119

5.3.3 Covariance selection models


The models considered in this example are constructed by imposing some additional
restrictions on the multivariate location and the multivariate location and shape semi-
parametric models defined previously. We will assume that some entries of the inverse of
the covariance matrix vanish. This corresponds to assuming that some pairs of variables
are conditionally not correlated given the others (see Whittaker, 1990).
We consider here two situations: first the case where the covariance matrix is to be
estimated (the covariance selection model on the semiparametric multivariate location
and shape model) and the case where the covariance matrix is not to be estimated and is
unknown (the covariance selection model on the semiparametric location case). The two
cases mentioned differ significantly: the first is a L2 - restricted model while the second is
not.
It is convenient to denote by σij and σ ij the (i, j)th entry of the matrices Σ and Σ−1 ,
respectively (for i, j ∈ {1, . . . , q}). Clearly each σ ij is a function of the entries of the
matrix Σ (we do not need to specify it now), say

σ ij = f ij (σ11 , σ21 , . . . , σq1 , σ22 , . . . , σq2 , . . . , σqq ) ,

where f ij is a function from IRq+(q−1)+...+1 into IR. Note that in the definition of f ij we
used only the triangle below the diagonal of the matrix Σ, in order to avoid ambiguity due
to the symmetry of Σ. In fact one should adopt a similar convention in the parametrization
of the location and shape model.

Covariance selection for the location and shape model


Let us consider a semiparametric multivariate location and shape model as defined by
(5.25), where for each (µ, Σ) ∈ IRq × V the submodel Pµ ∗
Σ is the class of functions
q
p : IR −→ IR+ such that (5.26)-(5.33) and the additional conditions (5.34) and (5.35)
given below hold.
The additional conditions referred are, for some pairs i, j ∈ {1, . . . , q}, i > j,

σ ij = 0 . (5.34)

and for other pairs k, l ∈ {1, . . . , q}, k > l,

σ kl 6= 0 . (5.35)

Specifying conditions like (5.34) and (5.35) we can define whether each pair of variables
(entries of the random vector of observations) is conditionally correlated or not given the
other variables. This corresponds to specify a covariance selection model.
120 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

The condition (5.34) can be viewed as a particular case of (5.5) by defining fi as a


constant function equal to 0 and M (µ, Σ) = f ij (σ11 , . . . , σqq ). Since the constant functions
are in L2 , the associated condition (5.4) holds also. By the same argument the condition
(5.35) can be expressed as a condition of the type of condition (5.6). We conclude that
the covariance selection model described above is a L2 - restricted model, even though in
a rather artificial form.

Covariance selection for the pure location model


We consider now the case where the correlation matrix is not part of the parameter space
and is unknown. In this case we cannot use (5.34) or (5.35) to state that a certain entry
of the inverse of the covariance matrix vanish or not, because σ ij is not a function of the
fixed parameter of interest as before. We have to introduce instead a condition in terms
of the mixed second moments as done below.
Consider for each µ ∈ IRq a class of probability densities on IRq given by

Pµ∗ = {p : IRq −→ IR+ such that (5.36)-(5.43) } .

The referred conditions are, for some given pairs (i, j) and (k, l), with i, j, k, l ∈ {1, . . . , q},
i > j and k > l,

∀x ∈ IRq , p(x) > 0; (5.36)

p ∈ C 1; (5.37)

Z
p(x)λ(dx) = 1; (5.38)
IRq

Z
xp(x)λ(dx) = µ; (5.39)
IRq

Z
|x|2 p(x)λ(dx) < ∞; (5.40)
IRq

Z n o2
for i = 1, . . . , q, l/µi (x) p(x)λ(dx) ∈ (0, ∞) ; (5.41)
IRq

Z Z 
ij
f x1 x1 p(x)λ(dx), . . . , xq xq p(x)λ(dx) = 0 ; (5.42)
IRq IRq
5.3. EXAMPLES 121
Z Z 
f kl q
x1 x1 p(x)λ(dx), . . . , q
xq xq p(x)λ(dx) 6= 0 , (5.43)
IR IR

where for x ∈ IRq , |x|2 = x21 + . . . + x2q and x1 , . . . , xq are the components of the vector
x. The model in study is given by P ∗ ∪µ∈IRq Pµ∗ .
Note that the conditions (5.42) and (5.43) cannot be expressed in terms of the con-
ditions (5.1)-(5.6) and hence the model presented is not a L2 - restricted model. We will
study latter this kind of models.
We close this example by considering the case where the dimension is q = 3 in details,
which is relatively simple. Let us consider the case where the pair formed by the first and
the second variables is conditionally uncorrelated and the other pairs are conditionally
correlated given the other variables. The conditions in terms of the unknown entries of
the inverse of the covariance matrix are

σ 12 = 0 , σ 13 6= 0 and σ 23 6= 0 .

or equivalently,

σ13 σ23 − σ12 σ33 = 0, σ12 σ23 − σ13 σ22 6= 0 and σ13 σ12 − σ11 σ23 6= 0 .

The conditions of type (5.42) and (5.43) are in this case


R R
x1 x3 p(x)λ(dx) IRR3 x2 x3 p(x)λ(dx)
IRR3
− IR3 x1 x2 p(x)λ(dx) IR3 x3 x3 p(x)λ(dx) = 0
R R
x1 x2 p(x)λ(dx) IRR3 x2 x3 p(x)λ(dx)
IRR3
− IR3 x1 x3 p(x)λ(dx) IR3 x2 x2 p(x)λ(dx) ∈ (−∞, 0) ∪ (0, ∞)
R R
x1 x3 p(x)λ(dx) IRR3 x1 x2 p(x)λ(dx)
IRR3
− IR3 x1 x1 p(x)λ(dx) IR3 x2 x3 p(x)λ(dx) ∈ (−∞, 0) ∪ (0, ∞) .
122 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

5.3.4 Linear structural relationships models (LISREL)


We consider in this section a very rich class of multivariate models called the Linear
structural relationships models (LISREL). This class contains some important models,
as for example the multivariate linear regression models, the factor analysis models and
the path analysis models.
The scenario of LISREL models is the following. Suppose that we observe a p- di-
mensional random vector Y and a q- dimensional random vector X. The vectors Y and
X are linearly related to two (non observable) latent variables η and ξ, with dimensions
m × 1 and n × 1 respectively, through the following equations,
Y = Λy η + 

X = Λx ξ + δ .
Here Λy and Λx are p × m and q × n matrices of parameters respectively;  and δ are
p × 1 and q × 1 random vectors respectively with
cov() = Θ ; cov(δ) = Θδ ; cov(ξ) = Φ ;
E(ξ) = 0 ; E(η) = 0 .
We assume that Θ , Θδ and Φ are positive definite matrices given. Here E( · ) and cov( · )
are the expectation and covariance operators respectively. The latent variables are related
through the following equation
η = Γξ + ζ ,
where Γ is a m × n matrix of parameters and ζ is a m × 1 random vector with
E(ζ) = 0; cov(ζ) = Ψ .
Here Ψ is a given matrix positive definite matrix. We assume additionally that ζ,  and
δ are mutually uncorrelated;  and η are uncorrelated and δ is uncorrelated with ξ. It
is assumed that m, n, p and q, are such that the number of parameters is less or equal
than the number of entries of the inferior triangle of the joint covariance matrix of Y and
X, i.e. (p + q)(p + q + 1)/2. Some of the entries of Θ , Θδ , Ψ or Φ can be sometimes
considered as parameters.
The class of possible joint distributions of Y and X with the properties mentioned
above is said to be a Linear structural relationships model (LISREL) (see Johnson and
Wichern, 1992). Sometimes in the literature maximum likelihood estimation is performed
based on the assumption of multivariate normal distributions for the random quantities.
However, in general, it is not specified any particular distribution for the random quanti-
ties referred and moment estimation via the covariance matrix is performed. When using
5.3. EXAMPLES 123

LISREL models one is interested in modeling the covariance structure of a group of vari-
ables (for example in factor analysis) and an alternative definition of a LISREL model
is given by taking the class of distributions with a special covariance structure. The
following covariance structure can be easily obtained from the model description given
above:
 
cov(Y ) = Λy ΓΦΓT + Ψ + Θ , (5.44)

cov(X) = Λx ΨΛTx + Θδ (5.45)

and

cov(Y , X) = Λy ΓΛTx . (5.46)

We show that the LISREL models are examples of L2 - restricted model. The conditions
(5.44)-(5.46) can be expressed in terms of conditions of the type of (5.5) with integrand
given by, for all x = (x1 , . . . , xp+q ) ∈ IRp+q and for i, j = 1, . . . , p + q, i > j,

fij (x) = xi xj ,

(here we index the functions fi ’s given in (5.5) by a double index) and the left side of the
equation (i.e. Mij (θ)) determined by the equations (5.44)-(5.46).
If the matrices Θ , Θδ , Ψ or Φ are viewed as parameters (or some entries as parameters
and some entries known) the model is still a L2 - restricted model. In this case the right
side of the equations of type (5.5) becomes more complicated. However, if some entries
of Θ , Θδ , Ψ or Φ are nuisance parameters (i.e. are not interest parameters and are
unknown), then the model in general is no longer a L2 - restricted model. An extension of
the L2 restricted models given in the next chapter will cover these cases.

Example 8 This example is extracted from Johnson and Wichern (1992, page 445). Sup-
pose that we want to model firm performance and managerial talent in a certain economic
system. Since these two quantities cannot be measured directly, one might represent them,
by two latent variables, say η and ξ and use a set of related observable variables in a
LISREL model. In this example the firm performance is characterized by the profit, Y1 ,
and the common stock price, Y2 . The managerial talent is represented by the time of
chief executive experience, X1 and memberships of boards of directors, X2 . A LISREL
model with the specifications given below is pointed by Johnson and Wichern (1992) as an
alternative for modeling this situation. Take m = n = 1, p = q = 2,
" # " # " # " #
1 λ2 θ1 0 θ3 0
Λy = , Λx = , Θ = , Θδ = ,
λ1 1 0 θ2 0 θ4
124 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

var(ξ) = Φ and var(ζ) = Ψ. Denoting the covariance of the ith and the jth entries of
the joint vector (Y T |X T )T by κij we obtain from the covariance structure of a LISREL
model the following equations:

κ11 = (γφ + ψ) + θ1 , κ12 = λ1 (γφ + ψ),


κ22 = λ21 (γ 2 φ + ψ) + θ2 , κ13 = λ2 γφ,
κ14 = γφ, κ23 = λ1 λ2 γφ,
κ24 = λ1 γφ, κ33 = φλ22 + θ3 ,
κ34 = φλ2 , κ44 = φ + θ4 .

The parameters were estimated by replacing the second order moments in the equations
above by the sample moments and solving for the parameters. We will show that this
intuitively appealing procedure can be justified by the theory of semiparametric models.
5.3. EXAMPLES 125

5.3.5 Growth curve models


Consider a growth curve data such that for a given subject, out of several independent
subjects, we observe a data vector X = (X0 , X1 , . . . , Xn )T representing observations taken
at times t0 = 0 < t1 < . . . < tn , respectively. We will assume that X0 and the increments
∆Xi ≡ Xi − Xi−1 , for i = 1, . . . , n, are distributed according to distributions all contained
in a family of distributions F parametrized in an identifiable form by the mean and the
variance, with F (µ, σ 2 ) representing the element of F with mean µ and variance σ 2 . The
model states that, and for i = 1, . . . , n,

X0 ∼ F (α, ηV1 (α/η)) (5.47)

and for i = 1, . . . , n,

∆Xi = F (β∆ti , λ∆ti V1 (β/λ)) (5.48)

where ∆ti ≡ ti − ti−1 , V1 is a suitable function fixed and α, β, η and λ are parameters.
Moreover, suppose that X0 , ∆X1 , . . . , ∆Xn are uncorrelated. It is easy to see that we
have the following moment structure,

E(X) = (α, α + βt1 , . . . , α + βtn )T (5.49)

and
" #
ηV1 (α/η) ηV1 (α/η)eT
Cov(X) = (5.50)
ηV1 (α/η)e ηV1 (α/η)E + λV1 (β/λ)T

where e = (1, . . . , 1)T , E = [Eij ] and T = [Tij ] are n × n matrix with Eij = 1 for all
(i, j) and Tij = Tji = ti for j ≥ i. Hence the model is adequate to study linear growth
with data with non constant variance over time. When the family F is an exponential
dispersion model with variance function V1 and parametrized in a suitable way (such that
X satisfies (5.49) and (5.50)) the model described above coincides with the latent growth
process studied in Jørgensen, Labouriau and Lundbye-Christensen (1996). Consider now
the class F of all families of distributions in IR parametrized by the mean and the variance
in an identifiable form, for which all the members of the family posses finite moments of
fourth order. For each F ∈ F we can generate a linear growth curve model by using
(5.47) and (5.48). The union of all these models, obtained by taking each element of F
and applying the construction described above, provides a semiparametric extension of the
original parametric model described above. Adding suitable differentiability conditions
(satisfied by the exponential dispersion models, for example) one obtain a L2 - restricted
model.
126 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

Let us consider now a more complicated structure. Suppose that X is not directly
observable. We can observe only a vector Y = (Y1 , . . . , Yn )T with, for i = 1, . . . , n,

Yi |Xi ∼ G(Xi , δXi V2 (1/δ)) , (5.51)

where G(µ, σ 2 ) is an element of an other family of distributions G parametrized by the


mean and the variance, with mean µ and variance σ 2 ; V2 is a suitable function given. We
can think on Y as a noisy observation of the latent linear growth process X. We have,
for i = 0, 1, . . . , n,

E(Yi ) = E(Xi ) = α + βti (5.52)

and

V ar(Yi ) = ηV1 (α/η) + λti V1 (β/λ) + (α + βti )δV2 (1/δ) .

Hence we have again a model suitable for linear growth data with variance varying with
time. If the families F and G are exponential dispersion models with variance functions
V1 and V2 respectively suitably parametrized, then the model defined by (5.47), (5.48)
and (5.51) coincides with a linear growth curve considered in Jørgensen, Labouriau and
Lundbye-Christensen (1996). The covariance matrix of Y is

Cov(Y ) = V1 (α/η)E + λV1 (β/λ)T + δV2 (1/δ)D . (5.53)

Here D is a diagonal matrix with ith diagonal entry α+βti , E and T are defined as before.
The component of the covariance matrix proportional to E reflects the between subjects
variation at time t0 , the component proportional to T arises because the X- process has
stationary and independent increments and the third component proportional to D can
be interpreted as a measure of the noisy. If the family G is allowed to vary freely in a
class of distribution families, as we did before with the semiparametric model for X, one
obtains a L2 - restricted model. The restrictions corresponding to (5.5) are obtained from
equations (5.52) and (5.53).
If the matrix D is the identity and the functions V1 and V2 are constant functions, then
the model described above becomes a generalization of the univariate Brownian growth
model of Lundbye-Christensen (1991).
Linear growth in biology occurs only in a few cases. Let us generalize the model
described above to a situation where the growth does not follow a linear pattern on time.
Suppose that there is a unobserved linear latent process X defined as in (5.47) and (5.48).
We observe a growth process Y = (Y1 , . . . , Yn )T with, for i = 1, . . . , n,

Yi |Xi ∼ G{b(Xi ), δXi V2 (1/δ)} , (5.54)


5.3. EXAMPLES 127

where b is a sufficiently smooth one-to-one increasing real function given (the growth
curve). We have, for i = 0, 1, . . . , n,

E(Yi ) = E(Xi ) = b(α + βti ) (5.55)

and Cov(Y ) is given by (5.53) with D diagonal with ith diagonal given by b(α + βti ).
Applying the same construction used before, we obtain a L2 - restricted model.
128 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

5.4 Estimation via regular asymptotic linear estimat-


ing sequences and regular estimating functions
In this section we study two classic techniques for obtaining estimators of the interest
parameter under the semiparametric L2 restriction models, namely: regular linear asymp-
totic estimating sequences and regular estimating functions. To pursue this task the first
step is to calculate the nuisance tangent spaces.

5.4.1 Calculation of the tangent spaces


It is convenient to introduce the following notation. For each (θ, z) ∈ Θ × Z denote the
L20 - linear space spanned by the projection of the functions f1 , . . . , fk onto L20 (Pθz ) by
Hk (θ, z). More precisely,

f1 (x) dPdλθz (x)λ(dx), . . . , 


 
f1 ( · ) −
R

 X 
Hk (θ) = span 
fk (x) dPdλθz (x)λ(dx) 

fk ( · ) −
 R
X

= span {f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ)} .

The orthogonal complement of Hk (θ, z) in L20 (Pθz ) is denoted by Hk⊥ (θ, z). Note that Hk
depends only on θ, but Hk⊥ depends on both parameters, because we took the orthogonal
complement in L20 (Pθz ).
We denote the Lp nuisance tangent space (for 1 ≤ p ≤ ∞) and the weak (or Hellinger)
p w
nuisance tangent space at (θ, z) ∈ Θ × Z by T N (θ, z) and T N (θ, z) respectively (see
Labouriau, 1996a for details).

Theorem 13 Under the L2 - restricted semiparametric model the nuisance tangent spaces
are given, for θ ∈ Θ and z ∈ Z, by
1 w 2

T N (θ, z) =T N (θ, z) =T N (θ, z) = Hk (θ, z) .

The proof of theorem 13 is given in two lemmas in appendix 5.5.1. First it is proved
1
that T N ⊆ Hk⊥ (lemma 9) by using an argument based on the existence of almost surely
convergent subsequences of each L1 convergent sequence. The second step is to prove that
2
Hk⊥ ⊆T N (lemma 10), which is done by directly verifying that the continuous compact
supported elements of Hk⊥ are in the L2 nuisance tangent set. The inclusion follows from
the fact that the class of continuous compact supported functions is dense in L2 , provided
the sample space is locally compact Hausdorff.
5.4. EFFICIENT ESTIMATION 129

A careful analysis of the argument presented shows that in fact the theorem 13 holds
for any Lp nuisance tangent space (for 1 ≤ p ≤ ∞) or even for weaker notions of nuisance
tangent space.
Note that the nuisance tangent space is determined only by the functions involved
in the condition (5.5). The restrictions associated with (5.6) produce no effects on the
nuisance tangent space. Note that condition (5.5) can be improved by replacing the
property of the integral being equal to a constant by the property of belonging to a real
set that pocesses only isolated points.
The condition (5.4) associated with (5.5) is indeed necessary, otherwise the space Hk
would not be in L2 , which would produce irregularities in the nuisance tangent spaces.
That is we should use square integrable functions to restrict the model indeed.

5.4.2 Efficient score functions


We calculate now the so called efficient score function of a well behaved L2 - restricted
model. It is pressuposed that the parametric partial score function is well defined and
regular, in the sense given below. Consider the function l/θ : X × θZ −→ IRq with
components l/θ1 , . . . , l/θq : X × θZ −→ IR given by, for each θ0 ∈ Θ, z0 ∈ Z and i ∈
{1, . . . , q}

∂ Pθo z0
 

l/θi ( · , θ0 , z0 ) = i log ( · ) .
∂θ dλ θ=θ0

It is assumed that for each (θ0 , z0 ) ∈ Θ × Z and each i ∈ {1, . . . , q}, l/θi ( · , θ0 , z0 ) is well
defined and is in L2 (Pθ0 z0 ).
The efficient score function is obtained by orthogonal projecting the parametric partial
score function on the orthogonal complement of the nuisance tangent space. Note that
different versions of the efficient score function are obtained by using alternative definitions
of the nuisance tangent space. However, from theorem 13 the different notions of tangent
space coincide for the models we work with. Hence we can just reffer generically to the
efficient score function.
Consider the function l/θ E
: X × Θ × Z −→ IRq with components l/θ E E
1 , . . . , l/θ q : X ×

θZ −→ IR given, for each θ0 ∈ Θ, z0 ∈ Z and i ∈ {1, . . . , q}, by


Y 
E
l/θ i ( · , θ0 , z0 ) = l/θi ( · , θ0 , z0 )|TN⊥ (θ0 , z0 ) ,

where ( · |A) is the orthogonal projection operator onto A ⊆ L20 (Pθo z0 ) and TN⊥ (θ0 , z0 )
Q

is the orthogonal complement of the nuisance tangent space at (θ0 , z0 ) in L20 (Pθo z0 ). The
E
function l/θ is the efficient score function.
130 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

We calculate now the efficient score function at (θ, z) ∈ Θ × Z. Denote the result of a
Gram-Schmidt orthonormalization process, in the space L2 (Pθz ), applied to the functions
1, f1 , . . . , fk by 1, ξ1θz , . . . , ξkθz . Since ξ1θz , . . . , ξkθz form an orthonormal basis on Hk =
TN⊥ (θ, z), the projection of l/θi ( · , θ, z) onto TN⊥ (θ, z) is, for i = 1, . . . , q,
k
E
hl/θi ( · , θ, z), ξjθz ( · )iL2 (Pθz ) ξjθz ( · ) ,
X
l/θ i ( · , θ, z) = (5.56)
j=1

where h · , · iL2 (Pθz ) is the inner product of L2 (Pθz ). The representation above can be
written in matricial form in the following way, for each (θ, z) ∈ Θ × Z,

lE ( · ; θ, z) = A(θ, z)ξ θz ( · ) , (5.57)

where ξ θz ( · ) = ( ξ1θz ( · ), . . . , ξkθz ( · ) )T and


 
hξ1 , l/θ1 iθz . . . hξ1 , l/θq iθz
. .
 
 
 
A(θ, z) = 
 . . 
 . (5.58)

 . . 

hξk , l/θ1 iθz . . . hξk , l/θq iθz

The covariance matrix of the efficient score function is given by


n o
Covθz lE ( · ; θ, z) = A(θ, z)Covθz {ξ θz ( · )}AT (θ, z) = A(θ, z) AT (θ, z) . (5.59)

The inverse of the covariance matrix above provides a lower bound for the asymptotic
variance of regular asymptotic linear estimating sequences. Here we use the partial order
of matrices (i.e. , A ≥ B iff A − B is positive semidefinite).

5.4.3 Characterization of the class of regular estimating func-


tions
In this section we characterize the class of regular estimating functions for the L2 restricted
models considered so far.
Let us recall the definition of regular estimating function. A function Ψ : X ×Θ −→ IRq
is said to be a regular estimating function when (i)-(v) given below hold. The properties
regular estimating functions reffered to are, for i = 1, . . . , q,

(i) For all (θ, z) ∈ Θ × Z, ψi ( · ; θ) ∈ L20 (Pθz );

(ii) For λ-almost every x, ψi (x; θ) is a differentiable function of θ;


5.4. EFFICIENT ESTIMATION 131

(iii) For j = 1, . . . , q,
( )
∂ Z dPθz Z
∂ dPθz
ψi (x; θ) (x)λ(dx) = ψi (x; θ) (x) λ(dx) ;
∂θj X dλ ∂θj dλ

(iv) For all (θ, z) ∈ Θ × Z, Eθz {∇Ψ( · ; θ)} is a nonsingular matrix;


n o
(v) For all (θ, z) ∈ Θ × Z, Eθz Ψ( · ; θ)ΨT ( · ; θ) is a positive definite matrix.

Here ψ1 , . . . , ψq are the components of the function Ψ and Eθz (X) is the expectation of a
random vector X under Pθz .

Theorem 14 Under the semiparametric L2 - restricted model we have that any regular
estimating function Ψ : X × Θ −→ IRq can be expressed in the following form, for all
θ ∈ Θ and for i = 1, . . . , q,
k
X
ψi ( · ; θ) = αij (θ) {fj ( · ) − Mj (θ)} . (5.60)
j=1

Proof: If Ψ is a regular estimating function then, as shown in chapter 3 or Jørgensen


and Labouriau (1996, chapter 4), for all θ ∈ Θ and i ∈ {1, . . . , q}
2
ψi ( · ; θ) ∈ ∩z∈Z TN⊥ (θ, z) . (5.61)

Note that for each


2
TN⊥ (θ, z) = span {f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ)} .
2
Since, for θ fixed and for i = 1, . . . , k, Mi (θ) does not depend on z, TN⊥ (θ, z) in fact does
not depend on z and (15) becomes,

ψi ( · ; θ) ∈ span {f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ)}

and the expression (5.60) follows immediately. u


t

The representation (5.60) can be written in a matricial form in the following way,

Ψ( · ; θ) = α(θ){f ( · ) − M (θ)} , (5.62)


132 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

where
     
α11 (θ) . . . α1k (θ) f1 ( · ) M1 (θ)
. . . .
     
     
     
α(θ) =  . .  , f( · ) =  .  and M (θ) =  . .
     

 . . 


 . 


 . 

αq1 (θ) . . . αqk (θ) fk ( · ) Mk (θ)

The sensibility matrix of a regular estimating function Ψ with representation (5.60)


(see chapter 3 or Jørgensen and Labouriau, 1996, chapter 4) is, for each θ ∈ Θ and z ∈ Z,

SΨ (θ, z) := Eθz {∇Ψ( · ; θ)} = α(θ){∇M (θ)} . (5.63)

This can be easily seen by differentiating Ψ and observing that Eθz {f ( · ) − M (θ)} = 0.
The variability matrix of Ψ is given by, for each θ ∈ Θ and z ∈ Z,

VΨ (θ, z) := Eθz {Ψ( · ; θ)ΨT ( · ; θ)} = α(θ)Covθz {f ( · )}αT (θ) (5.64)

and the Godambe information is given by

JΨ (θ, z) := SΨT (θ, z)VΨ−1 (θ, z)SΨ (θ, z) = {∇M (θ)}T [Covθz {f ( · )}]−1 {∇M (θ)} .

The condition (iii) in the definition of regular estimating function says that, for all
(θ, z) ∈ Θ × Z, SΨ (θ, z) is non singular. We have then rank(SΨ (θ , z)) = q. Since (see
Mardia et al., 1979 page 464 formula A.4.2.d)

rank(SΨ (θ, z)) = rank(α(θ)∇M (θ)) ≤ min{rank(α(θ), rank(∇M (θ))} ,

α(θ) and ∇M (θ) must be of full rank (i.e. have rank equal to q).
We study now the roots of a regular estimating function Ψ based on a repeated inde-
pendent sample x = (x1 , . . . , xn )T . The estimator of θ based on Ψ and x is given by θ̂
such that
n
X
Ψ(xl ; θ̂) = 0 , (5.65)
l=1

which is equivalent to
n
X n
X
0 = α(θ̂){f (xl ) − M (θ̂)} = α(θ̂) {f (xl ) − M (θ̂)} =
l=1 l=1
n
( )
1X
≡ α(θ̂) f (xl ) − M (θ̂) = α(θ̂){IP n f (x) − M (θ̂)} .
n l=1
5.4. EFFICIENT ESTIMATION 133

Here IP n is the empirical sample operator given by IP n g(x) = 1/n nl=1 g(xl ), for any
P

g : X −→ IRq . Since the rank of α(θ̂) is q, we can construct a q × q matrix γ(θ) which
has the columns equal to q linearly independent columns of α(θ̂). The system (5.65) is
equivalent to
0 = γ(θ̂){IP n f (x) − M (θ̂)} .
Since the rank of γ(θ) is q, the matrix γ(θ) is non singular and the system above has the
unique solution M (θ̂) = IP n f (x). Hence we proved the following.
Theorem 15 Under a L2 - restricted model, the only possible estimator for the interest
parameter θ based on a regular estimating function and a sample x = (x1 , . . . , xn )T is
θ̂n = θ̂n (x) such that
n
1X
M (θ̂n ) = IP n f (x) := f (xl ) . (5.66)
n l=1

In other words, under a L2 - restricted model, the moment estimator (5.66) is the unique
possible estimating sequence that can arise from the method of (regular) estimating se-
quences. This shows how poor is this class of estimators for the models in discussion.
In particular, any optimality theory for estimating sequences (such as maximizing the
Godambe information) is absolutely meaningless for L2 - restricted models.
We give next conditions for having consistency and asymptotic normality of the mo-
ment estimators arising from regular estimating functions.
Theorem 16 Consider a L2 - restricted model, a regular estimating function Ψ and the
estimator θ̂n (based on a sample of size n) given by the solution of (5.66) in theorem 15.
i) If the application θ 7→ M (θ) from Θ ⊆ IRq to IRq is invertible with continuous inverse,
then {θ̂n } is consistent;
ii) If in addition to the assumptions of the previous item, each Mj ( · ) and αij ( · ) are
twice differentiable, then
√ D
 
n(θ̂n − θ) −→ N 0, JΨ−1 (θ, z) .

Proof:
i) Let M −1 be the inverse of M , i.e. for all θ ∈ Θ, M −1 {M (θ)} = θ. The law of large
numbers gives
P
IP n f −→ Eθz (f ) = M (θ) .
134 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

Since M −1 is continuous,
P
θ̂n = M −1 {IP n f } −→ M −1 {M (θ)} = θ .

ii) Take (θ, z) ∈ Θ×Z fixed. Note that by the assumptions, the entries of Eθz {∇∇Ψ( · ; θ)}
P
are in IR. Moreover, from the previous item, θ̂n −→ θ. Using the results 4.9 and 4.10 in
Jørgensen and Labouriau (1996, pages 10 and 171) we conclude the proof. u
t

5.4.4 Robustness of estimators derived from regular estimating


functions
We study next the robustness of the estimators of the type (5.66), i.e. the estimators
derived from regular estimating functions. Since the right side of (5.66) is a mean its
Hampel inference function is given by

IFH (x; p) = f (x) − M (θ) .

Assuming M inversible and differentiable and applying the chain rule for the Hampel
influence function (see Labouriau, 1989) we obtain that the Hampel influence function for
the estimator given by (5.66) is

IFH (x; p, θ̂n ) = ∇M (θ){f (x) − M (θ)} .

Hence the Hampel influence function of an estimator derived from a regular estimating
function is bounded if and only if the function f is bounded. That is θ̂n is resistent to
gross errors,i.e. B- robust if and only if the function f is bounded. On the other hand
the Hampel influence function is continuous (on x) if and only if f is continuous, and
that is the condition for having bounded sensibility to lateral shifts, i.e. V - robustness.

5.4.5 Attainability of the semiparametric Cramèr-Rao bound


via estimating functions
We treat next the question of whether the lower bound for the asymptotic covariance of
regular asymptotic linear estimating sequences is attained by estimators obtained from
regular estimating functions. First, note that for each (θ, z) ∈ Θ × Z we have

TN⊥ (θ, z) = Hk = ∩z∗ ∈Z TN⊥∗ (θ, z∗ ) = FIA (θ) .

Hence, from theorem 8 in chapter 3, the semiparametric Cramèr-Rao bound coincides


with the bound for the concentration of estimators derived from estimating functions.
5.4. EFFICIENT ESTIMATION 135

We show that the efficient score function is a generalized estimating function. Recall
that the efficient score function has representation
lE ( · ; θ, z) = A(θ, z)ξ θz ( · ) ,
where A(θ, z) is given by (5.58). On the other hand, f1 ( · ) − M1 (θ), . . . , fk ( · ) − Mk (θ)
generate the same space that the orthonormalized basis {ξ1θz ( · ), . . . , ξkθz ( · )}. Then there
exists a nonsingular matrix, say B(θ, z) such that
ξ θz ( · ) = B(θ, z){f ( · ) − M (θ)} .
Hence,
lE ( · ; θ, z) = A(θ, z)B(θ, z){f ( · ) − M (θ)} .
We conclude that lE ( · ; θ, z) is equivalent to {f ( · ) − M (θ)} provided A(θ, z) is of full
rank. An application of proposition 17 in chapter 3 shows that in fact the semiparametric
Cramèr-Rao bound is attained by the estimating function given in theorem 15 (i.e. given
by (5.66)) ( provided A(θ, z) is of full rank). We have proved then the following result.
Proposition 22 Suppose that for each (θ, z) ∈ Θ × Z the matrix A(θ, z) given by (5.58)
is of full rank. Then the semiparametric Cramèr-Rao bound is attained by the moment
estimator given by (5.62).
The following construction gives a more precise result about the attainability of the
semiparametric Cramèr-Rao bound for L2 - restricted models. The first step will be to
obtain an alternative representation of regular estimating functions that will permit us
to compare the Godambe information with the covariance of the efficient score function.
 T
Take (θ, z) ∈ Θ × Z fixed. Define ξ θz ( · ) = ξ1θz ( · ), . . . , ξkθz ( · ) , where 1, ξ1θz , . . . , ξkθz is
the result of an orthonormalization process (with respect to the inner product of L2 (Pθz ))
applied to 1, f1 , . . . , fk . That is, the equations of type (5.5) became, for each (θ0 , z0 ) ∈
Θ × Z and for i = 1, . . . , k,
Z
dPθ0 z0
ξiθz (x) (x)λ(dx) = Miθz (θ0 ) .
X dλ
Note that by construction Miθz (θ) = 0. From theorem 14, any regular estimating function
Ψ evaluated at θ can be represented as,
n o
Ψ( · ; θ) = γ θz (θ) ξ θz ( · ) − M θz (θ) = γ θz (θ)ξ θz ( · ) ,
 T
where M θz (θ0 ) = M1θz (θ0 ), . . . , Mkθz (θ0 ) . The sensibility at (θ, z) is given by

SΨ (θ, z) = −γ θz ∇M θz (θ) ,
136 CHAPTER 5. SEMIPARAMETRIC MODELS WITH L2 RESTRICTIONS

the variability is
 
VΨ (θ, z) = γ θz (θ)Covθz ξ θz γ θz (θ)T

and the Godambe information is given by


 
−1
JΨ (θ, z) = ∇M θz (θ)T Covθz ξ θz ∇M θz (θ) = ∇M θz (θ)T ∇M θz (θ) .

Comparing the Godambe information given above with the variance of the efficient score
function given by (5.59) and using theorem 16 yields the following result.
Theorem 17 Consider a L2 - restricted model for which the assumptions of theorem 16
(item ii)) hold. Then the bound for the asymptotic variance of regular asymptotic linear
estimating sequences is attained by a regular estimating function at (θ, z) ∈ Θ × Z if and
only if
∇M θz (θ)T = A(θ, z) , (5.67)
where A(θ, z) is the matrix defined in (5.58).
We will present one example where the semiparametric bound is not attained by estimators derived from regular estimating functions; in fact, in that example the bound is not attained even by regular asymptotic linear estimating sequences.
5.4.6 Examples
In this section the examples considered before are revisited.

Location and scale


Let us consider a one dimensional location and scale model, as in section 5.3.1, with the first k standardized cumulants fixed. The nuisance tangent space at (µ, σ, a) ∈ IR × IR+ × A is given by

TN(µ, σ, a) = Hk⊥ = [ span{ (·−µ)/σ − m1 , . . . , ((·−µ)/σ)ᵏ − mk } ]⊥ .
Any regular estimating function Ψ = (ψ1, ψ2)ᵀ is of the form, for i = 1, 2,

Ψi(·; µ, σ) = Σ_{j=1}^{k} αij(µ, σ) { ((·−µ)/σ)ʲ − mj } . (5.68)
The following theorem gives a sufficient condition for characterizing a regular estimating
function.
Theorem 18 Under a semiparametric location and scale model, any function Ψ : IR × IR × IR+ −→ IR² of the form (5.68) is a regular estimating function, provided the functions αij (i = 1, 2 and j = 1, . . . , k) are differentiable and, for all (µ, σ) ∈ IR × IR+,

{ Σ_{j=1}^{k} j α1j(µ, σ) mj−1 } { Σ_{j=1}^{k} j α2j(µ, σ) mj } − { Σ_{j=1}^{k} j α2j(µ, σ) mj−1 } { Σ_{j=1}^{k} j α1j(µ, σ) mj } ≠ 0 , (5.69)

where we adopt the convention that m0 = 1.

The proof of the theorem above is given in appendix 5.5.2.

In the case k = 2 the condition (5.69) in the theorem above becomes equivalent to saying that the matrix α(µ, σ) is nonsingular. Taking the functions α11(µ, σ) = α12(µ, σ) = α21(µ, σ) = α22(µ, σ) = 1, for all µ ∈ IR and σ ∈ IR+, one obtains an estimating function which is not regular (it does not satisfy (iv)), but has all its components in ∩a∈A TN⊥(µ, σ, a). This shows that condition (5.69) is indeed needed for characterizing regular estimating functions.
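The reduction of (5.69) in the case k = 2 can be verified symbolically. The sketch below (assuming standardized moments m1 = 0 and m2 = 1, with the convention m0 = 1) shows that the left-hand side of (5.69) equals 2 det α(µ, σ), and that it vanishes for the non-regular choice αij ≡ 1 considered above.

import sympy as sp

a11, a12, a21, a22 = sp.symbols('alpha11 alpha12 alpha21 alpha22')
m = {0: 1, 1: 0, 2: 1}  # m0 = 1 by convention; m1, m2 standardized

# The four sums appearing in condition (5.69) for k = 2.
s1 = sum(j * [a11, a12][j - 1] * m[j - 1] for j in (1, 2))
t1 = sum(j * [a21, a22][j - 1] * m[j] for j in (1, 2))
s2 = sum(j * [a21, a22][j - 1] * m[j - 1] for j in (1, 2))
t2 = sum(j * [a11, a12][j - 1] * m[j] for j in (1, 2))

expr = sp.expand(s1 * t1 - s2 * t2)
print(expr)  # 2*alpha11*alpha22 - 2*alpha12*alpha21, i.e. 2 det(alpha)
print(expr.subs({a11: 1, a12: 1, a21: 1, a22: 1}))  # 0: not regular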
We verify next whether the roots of regular estimating functions attain the bound for the asymptotic covariance of regular asymptotically linear estimating sequences in the case of the location and scale model. Consider first the case where k = 2. Take a fixed value
(µ, σ, a) in the parameter space. Direct calculation yields the following representation for
the efficient score function

lE ( · ; µ, σ, a) = A(µ, σ, a)ξ µσa ( · ) , (5.70)

where ξµσa(·) = (ξ1µσa(·), ξ2µσa(·))ᵀ with

ξ1µσa(·) = (·−µ)/σ and ξ2µσa(·) = (1/∆2(a)) { ((·−µ)/σ)² − m3(a) (·−µ)/σ − 1 } ,

where ∆2(a) = m4(a) − m3(a) − 1 and

A(µ, σ, a) = [ ⟨ξ1µσa, l/µ⟩µσa  ⟨ξ1µσa, l/σ⟩µσa ; ⟨ξ2µσa, l/µ⟩µσa  ⟨ξ2µσa, l/σ⟩µσa ] = [ 1/σ  −2µ/(∆2(a)σ²) ; 0  2/(∆2(a)σ) ] .

Note that m3 (a) and ∆2 (a) depend on the fixed density a. For each (µ0 , σ0 ) ∈ IR × IR+ ,
µ0 − µ
M1µσa (µ0 , σ0 ) := Eµ0 σ0 a {ξ1µσa ( · )} =
σ
and

M2µσa(µ0, σ0) := Eµ0σ0a{ξ2µσa(·)} = (1/∆2(a)) { (σ0² − 2µµ0 − µ²)/σ² − m3(a) (µ0 − µ)/σ − 1 } .

Hence

∇Mµσa(µ0, σ0)|(µ0,σ0)=(µ,σ) = [ 1/σ  0 ; −2µ/(∆2(a)σ²)  2/(∆2(a)σ) ] = Aᵀ(µ, σ, a) .

We conclude from theorem 17 that the bound for regular asymptotically linear estimating sequences is attained by estimators based on regular estimating functions, for the semiparametric location and scale model with two standardized cumulants fixed. It is easy to see that the moment estimator associated with any regular estimating function in this example is given by the sample mean and the sample variance.
An alternative argument is that the global tangent space is the whole L20, which implies the efficiency of any regular asymptotic linear estimating sequence, and in particular of the roots of regular estimating functions.
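The preceding conclusion is easy to check numerically: whatever nonsingular α(µ, σ) is chosen in (5.68) with k = 2, the root of the averaged estimating equation is the sample mean together with the square root of the uncorrected sample variance. The data and the particular α below are illustrative assumptions.

import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(1)
x = rng.standard_t(df=6, size=500)           # any location-scale sample
alpha = np.array([[2.0, -1.0], [0.5, 3.0]])  # arbitrary nonsingular alpha
m = np.array([0.0, 1.0])                     # standardized moments m1, m2

def Psi(theta):
    mu, sigma = theta
    z = (x - mu) / sigma
    zbar = np.array([z.mean(), (z**2).mean()])
    return alpha @ (zbar - m)                # averaged version of (5.68)

mu_hat, sigma_hat = fsolve(Psi, x0=[0.0, 1.0])
print(mu_hat, x.mean())                                # equal
print(sigma_hat, np.sqrt(((x - x.mean())**2).mean()))  # equal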
We study now the case where k = 3. The efficient score function has representation (5.70) with ξµσa(·) = (ξ1µσa(·), ξ2µσa(·), ξ3µσa(·))ᵀ, where ξ1µσa and ξ2µσa are as in the case where k = 2 and

ξ3µσa(·) = (1/∆3) [ ((·−µ)/σ)³ − ((m3m4 − m3)/∆2²) ((·−µ)/σ)² − { m4 − m3(m5 − m3m4 − m3)/∆2² } ((·−µ)/σ) − { m3 − (m5 − m3m4 − m3)/∆2² } ]

=: (1/∆3) { ((·−µ)/σ)³ + k2 ((·−µ)/σ)² + k1 ((·−µ)/σ) + k0 } .
(see chapter 4 for a detailed calculation). Here ∆3 is a constant depending on the standard-
ized cumulants up to order 6 of the density a. Let us study the case where m3 = m5 = 0.
We have then

ξ3µσa(·) = (1/∆3) [ ((·−µ)/σ)³ − m4 (·−µ)/σ ] .
Moreover,

⟨ξ3µσa(·), l/µ⟩µσa = (3σ² + 1)/∆3 .
On the other hand,

M3µσa(µ0, σ) := Eµ0σa{ξ3µσa(·)} = (1/∆3) [ 3(µ0 − µ) + ((µ0 − µ)/σ)³ − m4 (µ0 − µ)/σ ] .

and

(∂/∂µ0) M3µσa(µ0, σ) |µ0=µ = (1/∆3) ( 3 − m4/σ ) ,

which is not equal to ⟨ξ3µσa(·), l/µ⟩µσa. We conclude from theorem 17 that the bound for regular asymptotically linear estimating sequences is not attained by estimators based on regular estimating functions. That is, the moment estimator associated with regular estimating functions does not generate an optimal regular asymptotic linear estimating sequence. Note that this conclusion extends easily to the case where k > 3.

Multivariate location and shape models


Let us consider initially the semiparametric multivariate location model introduced in section 5.3.1, with the functions

f1(x) = x1 , . . . , fq(x) = xq ,

and the following functions of the interest parameter on the right hand side of (5.5):

M1(µ) = µ1 , . . . , Mq(µ) = µq .

Here µ1, . . . , µq are the components of µ. The orthonormal basis of Hk is

ξ1(·) = (f1(·) − µ1)/‖f1 − µ1‖L2(p) , . . . , ξq(·) = (fq(·) − µq)/‖fq − µq‖L2(p) .

The Fourier coefficients for the expansion of the efficient score function with respect to this basis are

⟨l/µi, ξj⟩L2(p) = 0 for i, j = 1, . . . , q, i ≠ j,

and

⟨l/µi, ξi⟩L2(p) = 1/‖fi(·) − µi‖L2(p) =: ki ≠ 0 for i = 1, . . . , q .
Hence the efficient score function at p ∈ P∗ is

(k1(· − µ1), . . . , kq(· − µq))ᵀ .

The efficient score function can be written in matrix form as

lE(·; µ, p) = A(µ, p){f(·) − M(µ)} ,

where A(µ, p) is given by (5.58) and f and M are defined as before. Note that the (i, i)th entry (for i = 1, . . . , q) of the matrix A(µ, p) is given by

aii(µ, p) := ⟨ξi, l/µi⟩µp = 1/‖fi(·) − µi‖L2(p) .

On the other hand, for each µ∗ ∈ IRq,

Miµp(µ∗) = (µ∗i − µi)/‖fi(·) − µi‖L2(p)

and

(∂/∂µi) Miµp(µ∗) |µ∗=µ = 1/‖fi(·) − µi‖L2(p) .

Hence the diagonals of the matrices A and ∇M are equal. It is easy to see that both A and ∇M are diagonal matrices, and hence they are equal. We conclude from theorem 17 that the bound for regular asymptotically linear estimating sequences is attained by estimators based on regular estimating functions.
It is easy to see that for the multivariate location and shape model the orthonormal basis of Hk is composed of the functions

ξi(·) = (fi(·) − µi)/‖fi − µi‖L2(p) ,  ξij(·) = (fij(·) − σij)/‖fij − σij‖L2(p) ,

for i, j = 1, . . . , q, i ≥ j. Here fij(x) = xixj and σij is the (i, j)th entry of the covariance matrix. Reasoning as in the case of the multivariate location model given above, we conclude that the semiparametric Cramér-Rao bound is attained by estimating sequences derived from regular estimating functions.
Covariance selection models


The nuisance tangent spaces of the covariance selection model defined on the location and shape model coincide with the nuisance tangent spaces of the multivariate location and shape model. This is an immediate consequence of the fact that the additional restrictions inserted in the location and shape model for obtaining the covariance selection model are of the type of condition (5.5), but with the integrands on the left-hand side (i.e. the corresponding functions fi) equal to zero.
We conclude that the inference for the covariance selection model based on regular
estimating functions (and on regular asymptotic linear estimating sequences) can be done
in two steps: first estimating the mean vector and the covariance matrix as in the location
and shape model and then inverting the estimated covariance matrix.
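The two-step procedure takes only a few lines of code. The sketch below (with purely illustrative data) computes the moment estimators of the mean vector and of the covariance matrix and then inverts the latter; the zeros of the resulting concentration matrix are the quantities constrained by the covariance selection model.

import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(
    mean=[0., 1., 2.],
    cov=[[2., .5, 0.], [.5, 1., .3], [0., .3, 1.]],
    size=2000)

mu_hat = X.mean(axis=0)                         # step 1a: mean vector
Sigma_hat = np.cov(X, rowvar=False, bias=True)  # step 1b: moment covariance
K_hat = np.linalg.inv(Sigma_hat)                # step 2: concentration matrix
print(np.round(K_hat, 3))   # near-zero entries reflect the selection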

Linear structural relationships models


The restrictions of type (5.5) for LISREL models are obtained by (5.44)-(5.46), i.e. the
entries of the covariance matrix of (X T , Y T )T should be equal to some functions of the
interest parameters. Hence the orthogonal complements of the nuisance tangent spaces
are given by the space spanned by the functions fij : IRp+q −→ IR for i, j = 1, . . . , p + q,
i ≥ j, corrected by their means. The referred functions are defined by

fij (z) = zi zj ,

where z = (z1, . . . , zp+q)ᵀ = (Xᵀ, Yᵀ)ᵀ. Under a repeated sampling scheme the roots of any regular estimating function will be the solution of the equations (5.44)-(5.46) with the covariances on the left side replaced by the corresponding sample covariances.
There are basically two estimation procedures for LISREL models described in the literature (see Johnson and Wichern, 1992, and the references therein): maximum likelihood estimation under the assumption of multivariate normality, and direct solution of the equations (5.44)-(5.46) using the sample covariance matrix. The previous discussion shows that the second, intuitive estimation procedure can be justified by assuming the semiparametric embedding of the model as described here and using regular asymptotic linear estimating sequences or regular estimating functions.
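As an illustration of the second procedure, the sketch below equates a model covariance matrix Σ(θ) to the sample covariance matrix and solves the resulting equations by least squares. The one-factor parametrization sigma_of_theta is a hypothetical stand-in for the structural equations (5.44)-(5.46), which are not reproduced here.

import numpy as np
from scipy.optimize import least_squares

def sigma_of_theta(theta, p):
    """Hypothetical one-factor model: Sigma = lambda lambda' + diag(psi)."""
    lam, psi = theta[:p], theta[p:]
    return np.outer(lam, lam) + np.diag(psi)

def fit(S):
    p = S.shape[0]
    idx = np.tril_indices(p)   # one equation per covariance entry
    resid = lambda th: (sigma_of_theta(th, p) - S)[idx]
    theta0 = np.concatenate([np.ones(p), 0.5 * np.diag(S)])
    return least_squares(resid, theta0).x

rng = np.random.default_rng(3)
Z = rng.normal(size=(1000, 1)) @ np.array([[1., .8, .6]]) \
    + 0.5 * rng.normal(size=(1000, 3))
print(fit(np.cov(Z, rowvar=False)))  # recovers loadings (up to sign) and psi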
5.5 Appendices
5.5.1 Technical proofs related to L2 - restricted models
Lemma 9 Under the L2 - restricted semiparametric model, for each (θ, z) ∈ Θ × Z,

TN¹(θ, z) ⊆ Hk⊥(θ, z) .

Proof: Take (θ, z) ∈ Θ × Z fixed and denote by p the density of Pθz. For simplicity we drop p, θ and z from the notation throughout this proof. We prove that T̊N¹ ⊆ Hk⊥, which implies the lemma.
Suppose, for contradiction, that there is a ν ∈ T̊N¹ (ν ≠ 0) such that ν ∉ Hk⊥. Since ν ∈ L20, ν ∈ Hk. That is, ν is of the form ν(·) = αᵀf(·), for some α ∈ IRᵏ and f(·) := (f1(·) − M1(θ), . . . , fk(·) − Mk(θ))ᵀ. Since ν ∈ T̊N¹, there exist a differentiable path {pt} ⊂ Pθ∗ and {rt} such that

pt ( · ) = p( · ) + tp( · )ν( · ) + tp( · )rt ( · ) , (5.71)

and

‖rt‖L1 −→ 0 , as t ↘ 0 . (5.72)

From (5.71),

rt(·) = (pt(·) − p(·))/(tp(·)) − αᵀf(·) .
Hence

‖rt‖L1 = ‖ (pt(·) − p(·))/(tp(·)) − αᵀf(·) ‖L1 = ‖ αᵀf(·) − (pt(·) − p(·))/(tp(·)) ‖L1 (5.73)
≥ ‖αᵀf(·)‖L1 − ‖ (pt(·) − p(·))/(tp(·)) ‖L1 = ‖αᵀf(·)‖L1 > 0 .

Here we used that, given two functions g and f in L1, ‖f(·) − g(·)‖L1 ≥ ‖f(·)‖L1 − ‖g(·)‖L1 ¹, and that ‖ (pt(·) − p(·))/(tp(·)) ‖L1 = 0, which can easily be verified. Clearly, (5.73) contradicts (5.72). The lemma follows by reductio ad absurdum. □
t

¹ For, ‖f − g‖L1 = ∫ |f − g| p dλ ≥ ∫ (|f| − |g|) p dλ = ‖f‖L1 − ‖g‖L1 .
Lemma 10 Under the semiparametric model with L2 restrictions, for each (θ, z) ∈ Θ × Z,

Hk⊥(θ, z) ⊆ TN²(θ, z) . (5.74)

Proof: Take (θ, z) ∈ Θ × Z and ν ∈ Hk⊥(θ, z) ∩ Cc fixed and denote by p the density of Pθz. Here Cc is the class of infinitely differentiable functions from X into IR with compact support. For simplicity we drop p, θ and z from the notation throughout this proof. Define for each t ∈ IR+ the generalized sequence {pt}t∈IR+ by

pt(·) = p(·) + tp(·)ν(·) .

We show that for t small enough pt ∈ Pθ∗, i.e. ν is in the L2 tangent set at (θ, z), and we conclude that

Hk⊥(θ, z) ∩ Cc ⊆ T̊N²(θ, z) .

Since Cc is dense in L2(Pθz) (because X is a locally compact Hausdorff space, see Rudin, 1987), taking the closure in the expression above yields (5.74) and proves the lemma.
We verify next that for t small enough pt satisfies the conditions (5.1)-(5.6). Since ν is continuous and compactly supported, it is bounded, and so is the restriction of the continuous function p to the support of ν. Hence ν(·)p(·) is bounded. Moreover, the restriction of the strictly positive and continuous function p(·) to the compact support of ν is bounded away from zero. Hence for t small enough pt is strictly positive, i.e. (5.1) holds.
Condition (5.2) holds for all t ∈ IR+, because ν is infinitely differentiable.
The function pt (for each t ∈ IR+) integrates to 1 because ν integrates to 0 with respect to p(·)λ(d·).
The functions pt satisfy condition (5.4) because for each fj

∫X fj²(x) pt(x) λ(dx) = ∫X fj²(x) p(x) λ(dx) + t ∫X fj²(x) ν(x) p(x) λ(dx)
= ∫X fj²(x) p(x) λ(dx) + t ∫supp(ν) fj²(x) ν(x) p(x) λ(dx)
≤ ∫X fj²(x) p(x) λ(dx) + t ξ ∫X fj²(x) p(x) λ(dx) ,

where supp(ν) is the support of ν and ξ is an upper bound for the bounded function ν.
The condition (5.5) holds for pt because ν is orthogonal to each fj (j = 1, . . . , k).
The following argument shows that for t small enough pt satisfies condition (5.6). We have, for i = 1, . . . , k,

| ∫X gi(x) pt(x) λ(dx) − ∫X gi(x) p(x) λ(dx) | = t | ∫X gi(x) ν(x) p(x) λ(dx) | ≤ t ξ ∫X gi(x) p(x) λ(dx) ,

where ξ is an upper bound for the bounded function ν. Hence for t small enough the distance between ∫X gi(x) pt(x) λ(dx) and ∫X gi(x) p(x) λ(dx) is arbitrarily small. Since ∫X gi(x) p(x) λ(dx) ∈ Bi(θ) and Bi(θ) is a real open set, ∫X gi(x) pt(x) λ(dx) ∈ Bi(θ) for t small enough. □
5.5.2 Proof of a theorem related to location and scale models

The following theorem corresponds to theorem 18.

Theorem 19 Under a semiparametric location and scale model, any function Ψ : IR × IR × IR+ −→ IR² of the form (5.68) is a regular estimating function, provided the functions αij (i = 1, 2 and j = 1, . . . , k) are differentiable and, for all (µ, σ) ∈ IR × IR+,

{ Σ_{j=1}^{k} j α1j(µ, σ) mj−1 } { Σ_{j=1}^{k} j α2j(µ, σ) mj } − { Σ_{j=1}^{k} j α2j(µ, σ) mj−1 } { Σ_{j=1}^{k} j α1j(µ, σ) mj } ≠ 0 , (5.75)

where we adopt the convention that m0 = 1.

Proof:
Let Ψ : IR × IR × IR+ −→ IR² be a function of the form (5.68), with the functions αij (for i = 1, 2 and j = 1, . . . , k) differentiable and satisfying (5.75) for all (µ, σ) ∈ IR × IR+. We show that in this case Ψ is a regular estimating function.
Condition (i) holds because, for j = 1, . . . , k, the functions ((·−µ)/σ)ʲ − mj are in L20(Pµσa).
The differentiability of the functions αij and of ((·−µ)/σ)ʲ − mj implies condition (ii).
We check the first part of condition (iii) by direct calculation (i.e. the part related to differentiability with respect to µ). Take i = 1, 2 and (µ, σ) ∈ IR × IR+ fixed. Since Ψ is unbiased,

(∂/∂µ) ∫ ψi(x; µ, σ) (1/σ) a((x−µ)/σ) λ(dx) = 0 .

Hence the first part of property (iii) is equivalent to

∫ (∂/∂µ) { ψi(x; µ, σ) (1/σ) a((x−µ)/σ) } λ(dx) = 0 . (5.76)
Note that

∫ { (∂/∂µ) ψi(x; µ, σ) } (1/σ) a((x−µ)/σ) λ(dx)
= ∫ (∂/∂µ) [ Σ_{j=1}^{k} αij(µ, σ) { ((x−µ)/σ)ʲ − mj } ] (1/σ) a((x−µ)/σ) λ(dx)
= Σ_{j=1}^{k} { (∂/∂µ) αij(µ, σ) } ∫ { ((x−µ)/σ)ʲ − mj } (1/σ) a((x−µ)/σ) λ(dx)
+ Σ_{j=1}^{k} αij(µ, σ) ∫ (−j/σ) ((x−µ)/σ)^{j−1} (1/σ) a((x−µ)/σ) λ(dx)
= Σ_{j=1}^{k} (−j/σ) αij(µ, σ) mj−1 . (5.77)

On the other hand,

∫ ψi(x; µ, σ) (∂/∂µ) { (1/σ) a((x−µ)/σ) } λ(dx)
= Σ_{j=1}^{k} αij(µ, σ) ∫ { ((x−µ)/σ)ʲ − mj } (−1/σ²) a′((x−µ)/σ) λ(dx)
= Σ_{j=1}^{k} (j/σ) αij(µ, σ) mj−1 . (5.78)

Inserting (5.77) and (5.78) into the left-hand side of (5.76) leads to the conclusion that the equality (5.76) holds, and the proof of the first part of (iii) is concluded.
We verify the second part of condition (iii). Since Ψ is unbiased,

(∂/∂σ) ∫ ψi(x; µ, σ) (1/σ) a((x−µ)/σ) λ(dx) = 0 .

Hence the proof of the second part of condition (iii) reduces to the verification of

∫ (∂/∂σ) { ψi(x; µ, σ) (1/σ) a((x−µ)/σ) } λ(dx) = 0 . (5.79)
Proceeding in a similar way as in the proof of the first part of property (iii), after some routine calculations, one obtains

∫ { (∂/∂σ) ψi(x; µ, σ) } (1/σ) a((x−µ)/σ) λ(dx) = Σ_{j=1}^{k} (−j/σ) αij(µ, σ) mj . (5.80)

On the other hand,

∫ ψi(x; µ, σ) (∂/∂σ) { (1/σ) a((x−µ)/σ) } λ(dx) = Σ_{j=1}^{k} (j/σ) αij(µ, σ) mj . (5.81)
Using (5.80) and (5.81) in the left hand side of (5.79) we conclude that (5.79) holds.
We verify property (iv). From the previous calculations (see (5.77) and (5.80)) we have

Eµσa{∇Ψ(·; µ, σ)} = −(1/σ) [ Σ_{j=1}^{k} jα1j(µ, σ)mj−1  Σ_{j=1}^{k} jα1j(µ, σ)mj ; Σ_{j=1}^{k} jα2j(µ, σ)mj−1  Σ_{j=1}^{k} jα2j(µ, σ)mj ] .

Hence det Eµσa{∇Ψ(·; µ, σ)} ≠ 0 if and only if (5.75) holds.
We use lemma 7.1 in Lehmann (1983)² to prove property (v). Note that the functions

(·−µ)/σ − m1 , . . . , ((·−µ)/σ)ᵏ − mk (5.82)

are affinely independent in the sense of the referred lemma, i.e. there do not exist constants a1, . . . , ak and b such that

Σ_{j=1}^{k} aj { ((x−µ)/σ)ʲ − mj } = b

with probability 1. We show that ψ1(·; µ, σ) and ψ2(·; µ, σ) are also affinely independent. For i = 1, 2 at least one among the numbers

αi1(µ, σ), . . . , αik(µ, σ)

does not vanish, otherwise the determinant in (5.75) vanishes. Moreover, the vectors

(α11(µ, σ), . . . , α1k(µ, σ))

and

(α21(µ, σ), . . . , α2k(µ, σ))

are not equal, otherwise the determinant given in (5.75) vanishes. This, together with the affine independence of the functions given in (5.82), implies that ψ1(·; µ, σ) and ψ2(·; µ, σ) are affinely independent. Since ψ1(·; µ, σ) and ψ2(·; µ, σ) are in L2(Pµσa), we conclude that the covariance matrix of ψ1(·; µ, σ) and ψ2(·; µ, σ) is positive definite. □

² Lemma 7.1 (Lehmann, 1983, page 124) - For any random variables X1, . . . , Xn with finite second moments, the covariance matrix ... is positive semidefinite; it is positive definite if and only if the Xi are affinely independent, that is, there do not exist constants (a1, . . . , ar) and b such that Σ ai Xi = b with probability 1.
Chapter 6

Two Variants of Semiparametric Models with L2 Restrictions

6.1 Introduction
We study in this chapter two extensions of the L2 - restricted models, which we call extended L2 - restricted models and partial parametric models. The first is obtained by imposing on an L2 - restricted model an additional condition involving a function of the expected values of a group of functions. Examples of extended L2 - restricted models are the covariance selection model defined on a pure location model and regression models with link. The second class of models considered here is constructed by assuming that the density is known (or known to lie in a parametric model) in a region of the sample space and belongs to a semiparametric L2 - restricted model outside this region. Typical examples of partial parametric models are the trimmed models derived from a location model.
As it will be apparent from the text, the mathematical structure of the models con-
sidered in this chapter becomes more complex than the case of L2 - restricted models.

6.2 Extended L2- restricted models


Consider a model defined as an L2 - restricted model with the additional restrictions given by (6.1) and (6.2) below. More precisely, suppose that each submodel Pθ∗ is the class of functions p : X −→ IR+ such that (5.1)-(5.6) and the following additional conditions hold. Assume that there are r, s ∈ N ∪ {0}, functions b1, . . . , br : X −→ IR in L2(p), h1, . . . , hs : X × Z −→ IR, b : IRʳ −→ IR and a continuous h : IRˢ −→ IR such that

b ( ∫X b1(x) p(x) λ(dx), . . . , ∫X br(x) p(x) λ(dx) ) = G(θ) (6.1)


and

∀z ∈ Z, h ( ∫X h1(x, z) p(x) λ(dx), . . . , ∫X hs(x, z) p(x) λ(dx) ) ∈ H(θ) . (6.2)

Here G(θ) ∈ IR and H(θ) is an open set in IR.


The model described above is called an extended L2 - restricted model. Excluding the trivial choices for b and h, the conditions (6.1) and (6.2) cannot be expressed in terms of the conditions of the L2 - restricted models. Hence the model described above is indeed an extension of the class of L2 - restricted models.
Example 9 (Covariance selection model on the location model) Let us consider the model described in section 5.3.3 given by the multivariate location model with the additional constraints

f^{ij} ( ∫IRq x1x1 p(x)λ(dx), . . . , ∫IRq xqxq p(x)λ(dx) ) = 0 ; (6.3)

f^{kl} ( ∫IRq x1x1 p(x)λ(dx), . . . , ∫IRq xqxq p(x)λ(dx) ) ≠ 0 . (6.4)

Clearly (6.3) is a condition of the type of (6.1) and (6.4) is of the type of (6.2). We
conclude that the covariance selection model defined on the location model is an extended
L2 - restricted model. We will work more with this example at the end of this section.
The calculation of the nuisance tangent spaces for an extended L2 - restricted model
is in general a difficult task. We give in the following a sequence of lemmas that will,
under regularity conditions, yield a characterization of the intersection (over the values
of the nuisance parameter) of the nuisance tangent spaces for each value of the interest
parameter. This will allow us to give a representation of the regular estimating functions
and to study the effect of the introduction of restrictions of the type of (6.1) and (6.2) on
the problem of estimation via estimating functions. The proofs of the lemmas are given in appendix 6.4.1.
It is convenient to introduce the following notation. For each (θ, z) ∈ Θ × Z denote

Hk(θ, z) = Hk = span { f1(·) − Eθz(f1), . . . , fk(·) − Eθz(fk) }

and

(Hk + Br)(θ, z) = (Hk + Br) = span { f1(·) − Eθz(f1), . . . , fk(·) − Eθz(fk), b1(·) − Eθz(b1), . . . , br(·) − Eθz(br) } .

Moreover, Hk⊥(θ, z) = Hk⊥ and (Hk + Br)⊥(θ, z) = (Hk + Br)⊥ denote the orthogonal complements of Hk(θ, z) and (Hk + Br)(θ, z) in L20(Pθz), respectively. Note that Hk(θ, z) in fact does not depend on z, and we sometimes write simply Hk(θ).
Lemma 11 Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,

i) (Hk + Br)⊥(θ, z) ⊆ TN²(θ, z) ;

ii) TN²⊥(θ, z) ⊆ (Hk + Br)(θ, z) ;

iii) ∩z∈Z TN²⊥(θ, z) ⊆ ∩z∈Z (Hk + Br)(θ, z) = Hk(θ), provided for each bi (i = 1, . . . , r) there exist zi and wi ∈ Z such that Eθzi(bi) ≠ Eθwi(bi).

Lemma 12 Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,

i) TN¹(θ, z) ⊆ Hk⊥(θ);

ii) Hk(θ) ⊆ TN¹⊥(θ, z) ;

iii) Hk(θ) ⊆ ∩z∈Z TN¹⊥(θ, z).

The two lemmas imply the following theorem.

Theorem 20 Under an extended L2 - restricted model that satisfies the condition given in lemma 11 iii),

i) For each θ ∈ Θ,

∩z∈Z TN²⊥(θ, z) = ∩z∈Z TN¹⊥(θ, z) = Hk(θ) .

ii) If Ψ : X × Θ −→ IRq is a regular estimating function for θ, with components ψ1, . . . , ψq, then for all θ ∈ Θ and i ∈ {1, . . . , q} we have the representation

ψi(·; θ) = Σ_{j=1}^{k} αij(θ) { fj(·) − Mj(θ) } .

The theorem shows that the estimation via regular estimating functions is not affected
by the introduction of the constraints given by (6.1) and (6.2).
Example 10 (Covariance selection model on the location model) Let us consider the three dimensional (q = 3) covariance selection model defined on the location model. The additional conditions added to the location model are

∫IR³ x1x3 p(x)λ(dx) · ∫IR³ x2x3 p(x)λ(dx) − ∫IR³ x1x2 p(x)λ(dx) · ∫IR³ x3x3 p(x)λ(dx) = 0 , (6.5)

∫IR³ x1x2 p(x)λ(dx) · ∫IR³ x2x3 p(x)λ(dx) − ∫IR³ x1x3 p(x)λ(dx) · ∫IR³ x2x2 p(x)λ(dx) ∈ (−∞, 0) ∪ (0, ∞) (6.6)

and

∫IR³ x1x3 p(x)λ(dx) · ∫IR³ x1x2 p(x)λ(dx) − ∫IR³ x1x1 p(x)λ(dx) · ∫IR³ x2x3 p(x)λ(dx) ∈ (−∞, 0) ∪ (0, ∞) . (6.7)
Condition (6.5) corresponds to (6.1) with

b1(x) = x1x3 , b2(x) = x2x3 , b3(x) = x1x2 , b4(x) = x3x3

(where x = (x1, x2, x3)ᵀ) and

b(b1, b2, b3, b4) = b1b2 − b3b4 .

Defining also b5(x) = x2x2 and b6(x) = x1x1, we have that (6.5)-(6.7) are equivalent to

Ep(b1)Ep(b2) − Ep(b3)Ep(b4) = 0 ,

Ep(b2)Ep(b3) − Ep(b1)Ep(b5) ≠ 0

and

Ep(b1)Ep(b3) − Ep(b6)Ep(b2) ≠ 0 ,

respectively. Note that the condition of part iii) of lemma 11 is satisfied. We verify it for b1, . . . , b4. Take p, q ∈ Pθ∗ such that

Ep(b1) = Ep(b2) = Ep(b3) = Ep(b4) = 1

and

Eq(b1) = Eq(b2) = Eq(b3) = Eq(b4) = 1/2 .

Assume moreover that

Ep(b5) = Eq(b5) = Ep(b6) = Eq(b6) = 1/4 .
A direct verification shows that conditions (6.5)-(6.7) hold for both p and q, while Ep(bi) ≠ Eq(bi) for i = 1, . . . , 4.
We conclude that theorem 20 holds for the covariance selection model described above (it holds also for covariance selection models defined on a location model in general), and hence the estimation of the mean via estimating functions is not affected by the constraints defined via the inverse of the covariance matrix. This is in accordance with the literature on covariance selection models.

There is a situation in which we can calculate the nuisance tangent space for an extended L2 - restricted model.

Theorem 21 Under an extended L2 - restricted model, if the function b is injective, then

TN(θ, z) = [ span{ f1 − E(f1), . . . , fk − E(fk), b1 − E(b1), . . . , br − E(br) } ]⊥ .

The proof is given in the appendices.

A situation where theorem 21 applies is where r = 1 and b is a bijection, as for example in the regression models with link and in the classic proportional odds models for continuous variables used in genetics. Note that when theorem 21 applies, the theory of estimation through regular estimating functions and regular asymptotic linear estimating sequences developed in section 5.4 for L2 - restricted models applies here with some minor, obvious modifications.
6.3 Partial parametric models


We study in this section some semiparametric models defined by imposing a restriction of
a different nature than the previous restrictions considered in this thesis. The restriction
says essentially that the distribution is known (or known to be in a certain parametric
model) in a certain region of the sample space. A typical example is the trimmed model
that we study in example 11.
Consider the class of probability measures
P = {Pθz : θ ∈ Θ ⊆ IRq , z ∈ Z} = ∪θ∈Θ Pθ
on (X , A), as before. Suppose that for each θ ∈ Θ there is a function fθ : X −→ IR+ and
a measurable set Iθ such that the class of densities Pθ∗ is given by
Pθ∗ = {p : X −→ IR such that (5.1)-(5.6) and (6.8) hold} .
The additional condition in the definition above says that the density p is equal to fθ on the set Iθ. More precisely, consider the condition

p(x) = fθ(x) for λ-almost all x ∈ Iθ . (6.8)

The family P is termed a partial parametric model. Note that if (6.8) is suppressed from the definition above, the model becomes an L2 - restricted model.

Example 11 (Location models with arbitrary tails) Suppose that X = IR, λ is the
Lebesgue measure on the real line, Θ = IR and for each θ ∈ IR
Pθ∗ = {p : X −→ IR such that (6.9)-(6.15) hold} .
Suppose furthermore that

∀x ∈ X , p(x) > 0 ; (6.9)

p is of class C¹ ; (6.10)

∫X p(x) λ(dx) = 1 ; (6.11)

∫IR x² p(x) λ(dx) < ∞ ; (6.12)

∫IR x p(x) λ(dx) = θ ; (6.13)

∫IR { p′(x)/p(x) }² p(x) λ(dx) < ∞ ; (6.14)

and

p(x) = fθ(x) for λ-almost all x ∈ Iθ , (6.15)

where Iθ = [θ − ξ1, θ + ξ2] (for ξ1 > 0 and ξ2 > 0) and fθ(·) = f0(· − θ) for a given probability density f0(·) with ∫IR x f0(x) λ(dx) = 0.
The model described above is composed of distributions that coincide, on the central interval Iθ, with the location model generated by the density f0, and that possess free tails. If we define ξ1 and ξ2 in such a way that ∫_{−∞}^{−ξ1} f0(x) λ(dx) = ∫_{ξ2}^{∞} f0(x) λ(dx) = α/2, for some α ∈ (0, 1), then the model is called the α-trimmed model.
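For a concrete instance, assume (purely for illustration) that f0 is the standard normal density. The trimming points are then quantiles of f0, as the short computation below shows.

from scipy.stats import norm

alpha = 0.10
xi1 = xi2 = -norm.ppf(alpha / 2)   # about 1.645 for alpha = 0.10
# Both tails carry mass alpha/2, as required by the alpha-trimmed model.
print(xi1, norm.cdf(-xi1), 1 - norm.cdf(xi2))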

The following theorem characterizes the nuisance tangent spaces and their orthogonal complements in L20 for the partial parametric models. It is convenient to introduce the following notation, for each θ ∈ Θ and z ∈ Z:

Iθᶜ = X \ Iθ ;

Hk⊥(θ, z) = Hk⊥ = { ν ∈ L20(Pθz) : ν ∈ Hk⊥(θ, z) and supp(ν) ⊆ Iθᶜ } ;

Hk(θ, z) = Hk = { Hk⊥(θ, z) }⊥ .

Here supp(ν) is the support of the function ν ∈ L20(Pθz). We identify L2 functions that are almost surely equal, and we adopt the convention that supp(ν) ⊆ A means that ν(·)χAᶜ(·) = 0, λ-almost everywhere.

Theorem 22 Under a partial parametric semiparametric model, for each θ ∈ Θ, z ∈ Z and m ∈ [1, ∞],

i) TNᵐ(θ, z) = Hk⊥(θ, z);

ii) TNᵐ⊥(θ, z) = Hk(θ, z). Moreover,

Hk(θ, z) = span [ { fi(·)χIθᶜ(·) − Ei(θ) : i = 1, . . . , k } ∪ { f ∈ L20(Pθz) : supp(f) ⊆ Iθ } ] ,

where Ei(θ) = ∫Iθᶜ fi(x) (dPθz/dλ)(x) λ(dx).
The proof of this theorem is given in appendix 6.4.2.
Since

Ei(θ) = ∫Iθᶜ fi(x) (dPθz/dλ)(x) λ(dx) = Mi(θ) − ∫Iθ fi(x) (dPθz/dλ)(x) λ(dx) = Mi(θ) − ∫Iθ fi(x) fθ(x) λ(dx) ,

the function Ei(θ) does not depend on the nuisance parameter. Furthermore, TNᵐ⊥(θ, z) does not depend on the nuisance parameter z, and we sometimes write simply Hk(θ).

Corollary 2 Under a partial parametric semiparametric model, for each θ ∈ Θ, z ∈ Z and m ∈ [1, ∞],

i) ∩z∈Z TNᵐ⊥(θ, z) = Hk(θ);

ii) Any regular estimating function Ψ, with components ψ1, . . . , ψq, can be written in the form, for i = 1, . . . , q,

ψi(·) = ξi(·; θ) + Σ_{j=1}^{k} αij(θ) { fj(·)χIθᶜ(·) − Ej(θ) } , (6.16)

where supp(ξi) ⊆ Iθ and, for i = 1, . . . , q and j = 1, . . . , k, αij(θ) ∈ IR.

The regular estimating function Ψ in part ii) of the corollary above has the matrix representation

Ψ(·; θ) = ξ(·; θ) + α(θ) { f(·)χIθᶜ(·) − E(θ) } , (6.17)

where ξ(·; θ) = (ξ1(·; θ), . . . , ξq(·; θ))ᵀ, α(θ) = [αij(θ)]i,j and E(θ) = (E1(θ), . . . , Ek(θ))ᵀ.

Theorem 23 Consider a partial parametric model. Suppose that the function E is differentiable. Then we have, for any regular estimating function with representation (6.17) and for each θ ∈ Θ and z ∈ Z:

a) The Godambe information is given by

JΨ(θ, z) = Jξ(θ, z) + {∇E(θ)}ᵀ {Covθz(fχIθᶜ)}⁻¹ {∇E(θ)} . (6.18)
b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating functions by taking

ξ(·; θ) = {∇ log fθ(·)} χint(Iθ)(·) . (6.19)

The proof of this theorem is given in appendix 6.4.2.


We calculate next the efficient score function. Let l/θ(·; θ, z) = l/θ be the parametric partial score evaluated at (θ, z). We assume that, for each z ∈ Z, l/θ(·; ·, z) is regular, in the sense that the properties of regular estimating functions hold. Take (θ, z) fixed. Note that, for i = 1, . . . , q,

Eθz ( l/θi χIθᶜ ) = ∫Iθᶜ (∂/∂θi) p(x; θ, z) λ(dx) (6.20)

(differentiating under the integral sign)

= (∂/∂θi) ∫Iθᶜ p(x; θ, z) λ(dx) = 0 .
The differentiation under the integral sign is allowed because it is assumed that the parametric partial score is regular. Let ξ1θz, . . . , ξkθz be the result of an orthonormalization process applied to f1χIθᶜ − E1(θ), . . . , fkχIθᶜ − Ek(θ). We have, for i = 1, . . . , k and j = 1, . . . , q,

⟨ξiθz, l/θj⟩θz = ∫Iθᶜ ξiθz(x) (∂/∂θj) { p(x; θ, z) } λ(dx)

(from (6.20))

= ∫Iθᶜ { (∂/∂θj) ξiθz(x) } p(x; θ, z) λ(dx) =: ℵij(θ, z) .

The jth component (j = 1, . . . , q) of the efficient score function at (θ, z) is given by

ljE(·; θ, z) = Σ_{i=1}^{k} ℵij(θ, z) ξiθz(·) + χIθ(·) l/θj(·; θ) .

We have the following matrix representation of the efficient score,

lE(·; θ, z) = A(θ, z) ξθz(·) + χIθ(·) l/θ(·; θ) , (6.21)

where A(θ, z) is the matrix formed by the ℵij(θ, z)s and ξθz(·) = (ξ1θz(·), . . . , ξkθz(·))ᵀ. The covariance of the efficient score function is given by

Covθz{lE(·; θ, z)} = A(θ, z) Aᵀ(θ, z) + Covθz{ l/θ(·; θ) χIθ(·) } . (6.22)


We give next an alternative representation for a regular estimating function Ψ evaluated at (θ, z) that will allow us to determine whether the semiparametric Cramér-Rao bound is attained by regular estimating functions. We use the functions ξiθz(·) instead of the functions fi(·)χIθᶜ(·) − Ei(θ). Since each ξiθz(·) is a linear combination of f1(·)χIθᶜ(·) − E1(θ), . . . , fk(·)χIθᶜ(·) − Ek(θ), there exists a nonsingular matrix γθz such that

f(·)χIθᶜ(·) − E(θ) = γθz ξθz(·) .

The matrix γθz is the matrix of change of basis. Now define the functions f1θz, . . . , fkθz : X −→ IR such that fθz(·) = (f1θz(·), . . . , fkθz(·))ᵀ and f(·) = γθz fθz(·). Clearly, for i = 1, . . . , k, ξiθz(·) = fiθz(·)χIθᶜ(·). For any θ0 ∈ Θ, z ∈ Z and i = 1, . . . , k, we have

∫X fiθz(x) p(x; θ0, z) λ(dx) = Miθz(θ0) , (6.23)

that is, the integral above does not depend on the choice of the nuisance parameter z. The condition (6.23) can be used to characterize the model, as an alternative to condition (5.5). The nuisance tangent spaces can be characterized alternatively in the following way: for each (θ0, z0) ∈ Θ × Z,

TN⊥(θ0, z0) = Hk(θ0, z0) = span [ { fiθz(·)χIθ0ᶜ(·) − Eiθz(θ0) : i = 1, . . . , k } ∪ { f ∈ L20(Pθ0z0) : supp(f) ⊆ Iθ0 } ] ,

where

Eiθz(θ0) = Miθz(θ0) − ∫Iθ0 fiθz(x) (dPθ0z0/dλ)(x) λ(dx) .

We have then the following alternative representation of a regular estimating function Ψ, for each θ0 ∈ Θ,

Ψ(·; θ0) = ξ(·; θ0) + αθz(θ0) { fθz(·)χIθ0ᶜ(·) − Eθz(θ0) } ,

where Eθz(θ0) = γθz E(θ0). Using the theory already developed, the Godambe information of Ψ at (θ, z) is given by

JΨ(θ, z) = Jξ(θ, z) + ∇Eθz(θ)ᵀ ∇Eθz(θ) . (6.24)

Hence we have the following result.


Theorem 24 Consider a partial parametric model. Suppose that the function E is differentiable. Then the semiparametric Cramér-Rao bound is attained by regular estimating functions at (θ, z) ∈ Θ × Z if and only if

∇Eθz(θ)ᵀ ( = ∇E(θ)ᵀ {γθz}ᵀ ) = A(θ, z) and ξ(·; θ) = l/θ(·; θ) χIθ(·) .

Example 12 (Location models with arbitrary tails) Consider the location model with arbitrary tails introduced in example 11. We have k = 1, f1(·) = (·) and M1(θ) = θ. Suppose that ξ1 = ξ2 and that f0(·) is symmetric about the origin. We have then

E1(θ) = θ − ∫_{θ−ξ1}^{θ+ξ1} x f0(x − θ) λ(dx) = θ(1 − Q) ,

where Q = ∫_{−ξ1}^{ξ1} f0(x) λ(dx). Any regular estimating function Ψ is of the form

Ψ( · ; θ) = ξ( · ; θ) + α(θ){( · )χIR\(θ−ξ1 ,θ+ξ1 ) ( · ) − θ(1 − Q)} ,

where supp ξ ⊆ (θ − ξ1, θ + ξ1). The Godambe information is given by

JΨ(θ, pθ) = Jξ(θ, pθ) + ∇E1(θ)ᵀ ∇E1(θ) = Jξ(θ, pθ) + (1 − Q)² .

Note that in this example the change of basis matrix is

γθpθ = 1 / ‖ (·)χIR\(θ−ξ1,θ+ξ1)(·) ‖L2(pθ) .

Hence,

∇Eθz(θ)ᵀ ( = ∇E1(θ)ᵀ {γθpθ}ᵀ ) = (1 − Q) / ‖ (·)χIR\(θ−ξ1,θ+ξ1)(·) ‖L2(pθ) .

On the other hand,

A(θ, pθ) = ∫Iθᶜ (∂/∂θ) { ξ1θz(x) } p(x; θ, z) λ(dx) = (1 − Q) / ‖ (·)χIR\(θ−ξ1,θ+ξ1)(·) ‖L2(pθ) .

We conclude from theorem 24 that the semiparametric Cramér-Rao bound is attainable by regular estimating functions. An optimal estimating function is obtained by taking

ξ(·, θ) = − ( f0′(· − θ) / f0(· − θ) ) χ(θ−ξ1,θ+ξ1)(·) .
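A small numerical sketch, again assuming f0 standard normal (so that −f0′/f0 is the identity and the optimal ξ above is (· − θ) on the trimmed interval), evaluates the bound following the decomposition JΨ = Jξ + (1 − Q)² displayed above. Here Jξ is taken to be the Fisher information contributed by the central parametric part, the integral of x² f0(x) over (−ξ1, ξ1), for which a closed form is used.

from scipy.stats import norm

alpha = 0.10
xi1 = -norm.ppf(alpha / 2)
Q = norm.cdf(xi1) - norm.cdf(-xi1)  # central mass, equal to 1 - alpha
J_xi = Q - 2 * xi1 * norm.pdf(xi1)  # integral of x^2 phi(x) over (-xi1, xi1)
print(J_xi + (1 - Q)**2)            # Godambe information of the optimal Psi
# For comparison, the full-data Fisher information for the mean is 1.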
6.4 Appendices

6.4.1 Technical proofs related to extended L2 - restricted models
Lemma 13 (lemma 11) Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,

i) (Hk + Br)⊥(θ, z) ⊆ TN²(θ, z) ;

ii) TN²⊥(θ, z) ⊆ (Hk + Br)(θ, z) ;

iii) ∩z∈Z TN²⊥(θ, z) ⊆ ∩z∈Z (Hk + Br)(θ, z) = Hk(θ), provided for each bi (i = 1, . . . , r) there exist zi and wi ∈ Z such that Eθzi(bi) ≠ Eθwi(bi).

Proof:
i) Take (θ, z) ∈ Θ × Z fixed and ν ∈ (Hk + Br)⊥(θ, z). Define for each t ∈ IR+ the generalized sequence {pt}t∈IR+ by

pt(·) = p(·) + tp(·)ν(·) .

We show that for t small enough pt ∈ Pθ∗. The functions pt (for t small) satisfy the conditions (5.1)-(5.6), as verified in the proof of lemma 10. We check the conditions (6.1) and (6.2). For i = 1, . . . , r,

∫X bi(x) pt(x) λ(dx) = ∫X bi(x) p(x) λ(dx) + t ∫X bi(x) ν(x) p(x) λ(dx) = ∫X bi(x) p(x) λ(dx)

and we conclude that (6.1) holds. For t small enough and i = 1, . . . , s,

| ∫X hi(x) pt(x) λ(dx) − ∫X hi(x) p(x) λ(dx) | = t | ∫X hi(x) ν(x) p(x) λ(dx) | .

Hence for t small enough ∫X hi(x) pt(x) λ(dx) is arbitrarily close to ∫X hi(x) p(x) λ(dx), and (6.2) follows from the continuity of h.
ii) Straightforward from i).
iii) For each i ∈ {1, . . . , r} there are zi, wi ∈ Z such that Eθzi(bi) ≠ Eθwi(bi). This implies that bi is not a constant function and that

span {bi(·) − Eθzi(bi)} ∩ span {bi(·) − Eθwi(bi)} = {0} . (6.25)
Hence,

∩z∈Z (Hk + Br)(θ, z)
= ∩z∈Z span { f1(·) − Eθz(f1), . . . , fk(·) − Eθz(fk), b1(·) − Eθz(b1), . . . , br(·) − Eθz(br) }
= ∩z∈Z span { f1(·) − M1(θ), . . . , fk(·) − Mk(θ), b1(·) − Eθz(b1), . . . , br(·) − Eθz(br) }

(since each fi(·) − Mi(θ) does not depend on z)

= span { f1(·) − M1(θ), . . . , fk(·) − Mk(θ) } ∪ [ ∩z∈Z span { b1(·) − Eθz(b1), . . . , br(·) − Eθz(br) } ]
= Hk(θ) . □

Lemma 14 (lemma 12) Under the extended L2 - restricted model, for each (θ, z) ∈ Θ × Z,

i) TN¹(θ, z) ⊆ Hk⊥(θ);

ii) Hk(θ) ⊆ TN¹⊥(θ, z) ;

iii) Hk(θ) ⊆ ∩z∈Z TN¹⊥(θ, z).

Proof:
i) The same argument as in lemma 9 yields this part of the lemma.
ii) Straightforward from i).
iii) Follows immediately from taking the intersection (over z ∈ Z) on both sides of ii) and observing that Hk(θ) does not depend on z. □

Theorem 25 (theorem 20) Under an extended L2 - restricted model that satisfies the condition given in lemma 11 iii),

i) For each θ ∈ Θ,

∩z∈Z TN²⊥(θ, z) = ∩z∈Z TN¹⊥(θ, z) = Hk(θ) .

ii) If Ψ : X × Θ −→ IRq is a regular estimating function for θ, with components ψ1, . . . , ψq, then for all θ ∈ Θ and i ∈ {1, . . . , q} we have the representation

ψi(·; θ) = Σ_{j=1}^{k} αij(θ) { fj(·) − Mj(θ) } .

Proof:
i) Take (θ, z) ∈ Θ × Z fixed. Since TN²(θ, z) ⊆ TN¹(θ, z), taking orthogonal complements and intersections yields

∩z∈Z TN¹⊥(θ, z) ⊆ ∩z∈Z TN²⊥(θ, z) . (6.26)

Combining (6.26) with lemma 13 item iii) and lemma 14 item iii) implies the first part of theorem 25.
ii) Take (θ, z) ∈ Θ × Z and i ∈ {1, . . . , q} fixed. The result follows from item i) and from the fact that ψi(·; θ) ∈ ∩z∈Z TN²⊥(θ, z) (see chapter 3). □

The following theorem corresponds to theorem 21.

Theorem 26 Under an extended L2 - restricted model, if the function b is injective, then

TN(θ, z) = [ span{ f1 − E(f1), . . . , fk − E(fk), b1 − E(b1), . . . , br − E(br) } ]⊥ .

Proof:
“⊆” Take ν ∈ T̊N¹. Then, for each t, ν = (pt − p)/(tp) − rt, where pt ∈ Pθ∗ and rt −→ 0 in L1. Using the same argument given in the proof of lemma 9 we conclude that ⟨fi, ν⟩ = 0. Moreover,

b ( ∫X b1(x) pt(x) λ(dx), . . . , ∫X br(x) pt(x) λ(dx) ) = b ( ∫X b1(x) p(x) λ(dx), . . . , ∫X br(x) p(x) λ(dx) ) .

Since b is injective, for i = 1, . . . , r,

∫X bi(x) pt(x) λ(dx) = ∫X bi(x) p(x) λ(dx) .
Hence, for i = 1, . . . , r,

⟨bi, ν⟩ = (1/t) { ∫X bi(x) pt(x) λ(dx) − ∫X bi(x) p(x) λ(dx) } − ∫X bi(x) rt(x) p(x) λ(dx) = − ∫X bi(x) rt(x) p(x) λ(dx) .

Using the same argument given in the proof of lemma 9 we conclude that ⟨bi, ν⟩ = 0.
“⊇” Take ν ∈ [span{f1, . . . , fk, b1, . . . , br}]⊥ ∩ Cc. Define, for each t ∈ IR+,

pt(·) = p(·) + tp(·)ν(·) .

We show that for t small enough, pt ∈ Pθ∗. Note that the argument given in the proof of lemma 10 shows that the conditions (5.1)-(5.6) hold. To check condition (6.1), observe that, for i = 1, . . . , r,

∫X bi(x) pt(x) λ(dx) = ∫X bi(x) p(x) λ(dx) + t ∫X bi(x) ν(x) p(x) λ(dx) = ∫X bi(x) p(x) λ(dx) ,

hence (6.1) holds. Reasoning in a similar way we conclude that (6.2) holds too, which concludes the proof. □
6.4.2 Technical proofs related to partial parametric models

The following three lemmas prove theorem 22 in section 6.3.
Lemma 15 Under a partial parametric semiparametric model, for each θ ∈ Θ, z ∈ Z,

Hk(θ, z) = span [ { fi(·)χIθᶜ(·) − Ei(θ) : i = 1, . . . , k } ∪ { f ∈ L20(Pθz) : supp(f) ⊆ Iθ } ] ,

where Ei(θ) = ∫Iθᶜ fi(x) (dPθz/dλ)(x) λ(dx).

Proof: Take (θ, z) ∈ Θ × Z fixed. We suppress θ and z from the notation in the rest of this proof. Define

Fk = { fi(·)χIθᶜ(·) − Ei(θ) : i = 1, . . . , k } ∪ { f ∈ L20(Pθz) : supp(f) ⊆ Iθ } .

We prove that span(Fk) ⊆ Hk. Take an arbitrary ν ∈ Hk⊥. We have, by definition of Hk⊥, supp(ν) ⊆ Iθᶜ and ν ∈ Hk⊥. For i = 1, . . . , k,

⟨ν, fiχIθᶜ − Ei⟩ = ⟨ν, fiχIθᶜ⟩ − ⟨ν, Ei⟩ = ⟨ν, fiχIθᶜ⟩ = (since supp(ν) ⊆ Iθᶜ) = ⟨ν, fi⟩ = (since ν ∈ Hk⊥) = 0 .

Moreover, for any f with supp(f) ⊆ Iθ,

⟨ν, f⟩ = 0 ,

because supp(ν) ⊆ Iθᶜ. Hence Fk ⊆ {Hk⊥}⊥ = Hk. Since Hk is a vector space, span(Fk) ⊆ Hk.
We prove now that Hk(θ, z) ⊆ span(Fk). The proof reduces to showing that Fk⊥ ⊆ Hk⊥. To see that, note that Fk⊥ ⊆ Hk⊥ implies that span(Fk⊥) ⊆ Hk⊥, which implies that¹ {span(Fk)}⊥ ⊆ Hk⊥, which implies that Hk ⊆ span(Fk).
Take ν ∈ Fk⊥. We have²

for any f such that supp(f) ⊆ Iθ, ⟨ν, f⟩θz = 0 ; (6.27)

and

for i = 1, . . . , k, ⟨ν, fiχIθᶜ − Ei⟩ = ⟨ν, fiχIθᶜ⟩ = 0 . (6.28)
¹ Note that [span(A)]⊥ ⊆ span(A⊥). For, {a ∈ [span(A)]⊥} ⟹ {∀b ∈ span(A), a ⊥ b} ⟹ {∀b ∈ A, a ⊥ b} ⟹ {a ∈ A⊥} ⟹ {a ∈ span(A⊥)}.
² Since (A ∪ B)⊥ ⊆ A⊥ ∩ B⊥, for any A and B.
Hence, from (6.27) and (6.28),

for i = 1, . . . , k, ⟨ν, fi⟩ = ⟨ν, fiχIθᶜ⟩ + ⟨ν, fiχIθ⟩ = 0 .

We conclude that Fk⊥ ⊆ Hk⊥. We check that supp(ν) ⊆ Iθᶜ. Using (6.27), for any g ∈ L20(Pθz), ⟨ν, gχIθ⟩θz = 0. In particular ν is orthogonal to any function in an orthogonal basis of the space of functions in L20(Pθz) with support contained in Iθ. We conclude that ν(·)χIθ(·) ≡ 0, and hence supp(ν) ⊆ Iθᶜ. □

Lemma 16 Under a partial parametric semiparametric model, for each θ ∈ Θ, z ∈ Z,

TN¹(θ, z) ⊆ Hk⊥(θ, z) .

Proof: The lemma follows from an argument similar to the proof of lemma 9 in chapter 5. □

Lemma 17 Under a partial parametric model, for each θ ∈ Θ, z ∈ Z and m ∈ [1, ∞],

Hk⊥(θ, z) ⊆ TNᵐ(θ, z) .

Proof: Take ν ∈ Hk⊥(θ, z) ∩ Cc. We prove that ν ∈ T̊Nᵐ(θ, z). Define, for each t ∈ IR+,

pt(·) = p(·) + tp(·)ν(·) .

We show that for t small enough, pt ∈ Pθ∗, which proves the lemma. The conditions (5.1)-(5.6) are satisfied by pt, for t small enough, as shown in the proof of lemma 10. The additional condition (6.8) holds because supp(ν) ⊆ Iθᶜ (by definition of Hk⊥). Hence Hk⊥ ∩ Cc ⊆ T̊Nᵐ, and the lemma follows by taking the closure. □

The theorem proved below corresponds to theorem 23.

Theorem 27 Consider a partial parametric model. Suppose that the function E is differentiable. Then we have, for any regular estimating function with representation (6.17) and for each θ ∈ Θ and z ∈ Z:

a) The Godambe information is given by

JΨ(θ, z) = Jξ(θ, z) + {∇E(θ)}ᵀ {Covθz(fχIθᶜ)}⁻¹ {∇E(θ)} . (6.29)
b) The Godambe information is maximized (at (θ, z)) over the class of regular estimating functions by taking

ξ(·; θ) = {∇ log fθ(·)} χint(Iθ)(·) . (6.30)

Proof: The first part follows immediately from the definition of the Godambe information and the representation (6.17).
We prove the second part. Note that the second term on the right-hand side of expression (6.29) does not depend on the choice of Ψ, i.e. it depends only on the functions E and f, which are characteristics of the model and not of the estimating function. Hence, to maximize the Godambe information of a regular estimating function it is enough to maximize Jξ, i.e. we can concentrate on the interval Iθ where the model is parametric. The rest of the proof follows from the classic theory of estimating functions for semiparametric models, for example, theorem 4.9 in Jørgensen and Labouriau (1996, pages 150-152). The details are given below.
We prove that Jξ is maximized by

U(·; θ) = (U1(·; θ), . . . , Uq(·; θ))ᵀ = {∇ log fθ(·)} χint(Iθ)(·) .

Define

U = span{U1(·; θ), . . . , Uq(·; θ)} and A = A(θ, z) = U⊥ .

The function ξ = (ξ1, . . . , ξq)ᵀ can be orthogonally decomposed into

ξ = ξᵘ + ξᵃ = (ξ1ᵘ, . . . , ξqᵘ)ᵀ + (ξ1ᵃ, . . . , ξqᵃ)ᵀ ,

where, for i = 1, . . . , q, ξiᵘ ∈ U and ξiᵃ ∈ A.
We claim that Sξᵃ(θ, z) = 0. The claim follows from the following. For i, j = 1, . . . , q,

0 = (∂/∂θi) ∫Iθ ξjᵃ(x) p(x; θ, z) λ(dx)
(differentiating under the integral sign)
= ∫Iθ (∂/∂θi) { ξjᵃ(x) p(x; θ, z) } λ(dx)
= ∫Iθ { (∂/∂θi) ξjᵃ(x) } p(x; θ, z) λ(dx) + ∫Iθ ξjᵃ(x) Ui(x) p(x; θ, z) λ(dx)
= ∫Iθ { (∂/∂θi) ξjᵃ(x) } p(x; θ, z) λ(dx) = Sξᵃ(θ, z) ,

where the second term vanishes because ξjᵃ ∈ A = U⊥.
Note that the differentiation under the integral sign performed above is allowed because Ψ is a regular estimating function.
We have then

Jξ⁻¹(θ, z) = Sξ⁻¹(θ, z) Vξ(θ, z) Sξ⁻ᵀ(θ, z)
= Sξᵘ⁻¹(θ, z) { Vξᵃ(θ, z) + Vξᵘ(θ, z) } Sξᵘ⁻ᵀ(θ, z)
= Jξᵘ⁻¹(θ, z) + Sξᵘ⁻¹(θ, z) Vξᵃ(θ, z) Sξᵘ⁻ᵀ(θ, z)
≥ Jξᵘ⁻¹(θ, z) .

Hence

Jξ(θ, z) ≤ Jξᵘ(θ, z) ≤ JU(θ, z) .

The last inequality in the expression above is due to the fact that the components of ξᵘ are in the space spanned by the components of U. Obviously, equality in the last expression is obtained if ξ is equivalent to U. □
Bibliography

[1] Barndorff-Nielsen, O.E. (1978). Hyperbolic distributions and distributions on hyperbolae. Scand. J. Statist. 5, 151-157.

[2] Barndorff-Nielsen, O.E. ; Jensen, J.L. and Sørensen, M. (1990). Parametric modelling
of turbulence. Phil. Trans. R. Soc. Lond. A 332, 439-445.

[3] Begun, J.M,; Hall, W.J.; Huang, W.M. and Wellner, J.A. (1983). Information and
asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11, 432-
452.

[4] Bickel, P.J.; Klaassen, C.A.J.; Ritov, Y. and Wellner, J.A. (1993). Efficient and
Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press,
London.

[5] Billingsley, P. (1986). Probability and Measure. Second edition. John Wiley and Sons.
New York.

[6] Chow, Y.S. and Teicher, H. (1978). Probability Theory: Independence, Interchange-
ability, Martingales. Springer-Verlag, Heidelberg.

[7] Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.

[8] Dieudonné, J. (1960). Foundations of Modern Analysis. Academic Press, New York.

[9] Dunford, N. and Schwartz, J.T. (1958). Linear Operators, Part I . Interscience, New
York.

[10] Durbin, J. (1960). Estimation of parameters in time-series regression models. J. Roy.


Statist. Soc. Ser. B 22, 139–153.

[11] Fisher, R.A. (1934). Two new properties of mathematical likelihood. Proc. Royal Soc.
London Ser. A 144, 285–307.


[12] Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208–1212.

[13] Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating
equations. Biometrika 63, 277–284.

[14] Godambe, V.P. (1980). On sufficiency and ancillarity in the presence of a nuisance
parameter. Biometrika 67, 269–276.

[15] Godambe, V.P. (1984). On ancillarity and Fisher information in the presence of a
nuisance parameter. Biometrika 71, 626–629.

[16] Godambe, V.P. and Thompson, M.E. (1974). Estimating equations in the presence
of a nuisance parameter. Ann. Statist. 2, 568–571.

[17] Godambe, V.P. and Thompson, M.E. (1976). Some aspects of the theory of estimating
equations. J. Statist. Plann. Inference 2, 95–104.

[18] Hájek, J. (1962). Asymptotically most powerful rank-order tests. Ann. Math. Statist.
33 1124-1147.

[19] Huber, P.J. (1981). Robust Statistics. Wiley, New York.

[20] Jørgensen, B. and Labouriau, R. (1995). Exponential Families and Theoretical Infer-
ence. Lecture notes at the University of British Columbia, Vancouver.

[21] Jørgensen, B., Labouriau, R. and Lundbye-Christensen, S. (1996). Linear growth curve analysis based on exponential dispersion models. J. Roy. Statist. Soc. Ser. B 58, 573-592.

[22] Kendall, M.G. and Stuart, A. (1952). The Advanced Theory of Statistics. Vol. 1. Charles Griffin, London.

[23] Kimball, B.K. (1946). Sufficient statistical estimation functions for the parameters
of the distribution of maximum values. Ann. Math. Statist. 17, 299–309.

[24] Labouriau,R. (1989). Robustez Estatı́stica na Famı́lia de Distribuições Exponenciais.


IMPA.

[25] Labouriau, R. (1996a). Path and functional differentiability. To appear.

[26] Labouriau, R. (1996b). Estimating and quasi estimating functions. To appear.


[27] Labouriau, R. (1996c). Characterizing estimating functions and nuisance ancillary quasi estimating functions. To appear.

[28] LeCam, L. (1966). Likelihood functions for large number of independent observations.
In: Research Papers in Statistics. Festschrift for J. Neyman (F.N. David, ed.), 167-
187, Wiley, London.

[29] Luenberger, D.G. (1969). Optimization by Vector Space Methods. John Wiley and Sons, New York.

[30] Lukacs, E. (1975). Stochastic Convergence. Second Edition. Academic Press, Inc;
New York.

[31] Lundbye-Christensen, S. (1991). A multivariate growth curve model for pregnancy. Biometrics 47, 637-657.

[32] McLeish, D.L. and Small, C.G. (1987). The Theory and Applications of Statistical Inference Functions. Lecture Notes in Statistics 44, Springer-Verlag, New York.

[33] Pfanzagl, J. (1982). Contributions to a General Asymptotic Theory. Lecture Notes


in Statistics 13. Springer-Verlag.

[34] Pfanzagl, J. (1985). Asymptotic Expansions for General Statistical Models. Lecture
Notes in Statistics 31. Springer-Verlag.

[35] Pfanzagl, J. (1990). Estimation in Semiparametric Models: Some Recent Develop-


ments. Lecture Notes in Statistics 63. Springer-Verlag.

[36] Plausonio, A. (1996) De Re Ætiopia. 74th edition. Editora Rodeziana, São Paulo-
Barcelona.

[37] Rudin, W. (1966). Real and Complex Analysis. McGraw-Hill, New York.

[38] Rudin, W. (1973). Functional Analysis. McGraw-Hill, New York.

[39] Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley,


New York.

[40] Stein, C. (1956). Efficient nonparametric testing and estimation. Proc. Third Berkeley
Symp. Math. Statist. 1, 187-195, Univ. California Berkeley.

[41] Vaart, A.W. van der (1988). Estimating a real parameter in a class of semiparametric models. Ann. Statist. 16, 1450-1474.
[42] Vaart, A.W. van der (1991). Efficiency and Hadamard differentiability. Scand. J. Statist. 18, 63-75.

[43] Vaart, A.W. van der (1988). Statistical Estimation in Large Parameter Spaces. CWI
Tracts 44, Amsterdam.
