LECTURE NOTE ON Treatment Effects Analysis

YuWei Hsieh
New York University

First Draft: Sep 1, 2009
Email: yuwei.hsieh@nyu.edu
© YuWei Hsieh (all rights reserved).
Contents

1 Introduction to Average Treatment Effects  1
  1.1 Rubin's Statistical Causality Model  1
  1.2 Selection Bias  3
  1.3 Identification and Estimation under Exogeneity  4
    1.3.1 Regression  6
    1.3.2 Matching  8
    1.3.3 OLS vs. Matching  10
  1.4 Propensity Score  10
    1.4.1 Specification Testing of the Propensity Score  14
    1.4.2 Regression as a Dimension Reduction Method  15
    1.4.3 Propensity Score Weighting Estimator  16

2 Quantile Treatment Effects  17
  2.1 Quantile Regression  18
  2.2 Weighting Estimator  18

3 Instrumental Variables I: Local Treatment Effects  19
  3.1 Instrumental Variable: A Review  19
  3.2 Restrictions on the Selection Process: Local Treatment Effects  21
  3.3 Case Studies  25
    3.3.1 Vietnam Draft Lottery  25
    3.3.2 Randomized Eligibility Design  26
  3.4 Other Identified Features  26
  3.5 Nonbinary Treatments and Instruments  29
  3.6 Nonparametric Estimation of LTE with Covariates  29
  3.7 Parametric and Semiparametric Estimation of LTE with Covariates  31

4 Difference-in-Difference Designs  34
  4.1 Linear Difference-in-Difference Model  36
  4.2 Nonparametric Difference-in-Difference Models  37
  4.3 Nonlinear Difference-in-Difference  40
  4.4 The Change-in-Change Model  42
  4.5 Quantile Difference-in-Difference  45

5 Nonparametric Bounding Approaches  48
  5.1 No-assumption Bound  49
  5.2 …  50
  5.3 Restrictions on the Moment Conditions: Monotone Instrumental Variables  51
  5.4 Monotone Treatment Selection  51
  5.5 Shape Restrictions on the Response Functions: Monotone Treatment Response  52
  5.6 Restrictions on the Selection Mechanism  54
  5.7 Some Remarks  55

References  56
1 Introduction to Average Treatment Effects
Suppose a national health insurance company launched a new reimbursement scheme, called quality payment, for physicians, linking their salary to patients' health outcomes. If patients' health status improves, the insurance company gives physicians an extra bonus. By providing proper financial incentives to physicians, the scheme encourages them to treat patients more carefully, hence leading to a higher cure rate. In this case, the new payment program is called a treatment, while the cure rate is called the response. The group of subjects receiving the treatment is called the treatment group, while the group receiving the control treatment (typically, no treatment) is called the control or comparison group. We want to know whether a treatment has an impact on the response variable. If yes, does the treatment cause a positive or a negative effect? In this section we introduce the statistical causality model, proposed by Rubin (1974), to quantify the effects of a certain treatment. More discussion of this model can be found in Holland (1986).
1.1 Rubin’s Statistical Causality Model
For subject $i$, the Bernoulli random variable $Y_{1i}$ represents whether the patient is cured if he goes to a hospital that participates in the quality payment program (treatment group), and $Y_{0i}$ if he goes to a hospital that does not participate in that program. We call $Y_{1i}$ and $Y_{0i}$ potential responses. Intuitively, the individual treatment effect on subject $i$ can be defined as $Y_{1i} - Y_{0i}$. If $Y_{1i} - Y_{0i} > 0$ then the treatment has a positive effect on the subject's health status: it makes the subject fully recover from an illness. Let $D_i = 1$ if subject $i$ is in the treatment group, and $D_i = 0$ if subject $i$ is in the control group. The treatment indicator $D_i$ is called the observed treatment, indicating whether unit $i$ receives treatment or not. Define the observed response $Y_i = D_i Y_{1i} + (1 - D_i) Y_{0i}$. We observe $Y_{1i}$ if unit $i$ is in the treatment group ($D_i = 1$), and we observe $Y_{0i}$ if unit $i$ is in the control group ($D_i = 0$). However, unit $i$ cannot be assigned to the treatment and control groups at the same time; it can only be treated or untreated at a specific time. Because we can only observe either $Y_{1i}$ or $Y_{0i}$, we face a missing data problem in which half of the data is always missing. Therefore, it is impossible to identify the individual effect. The unobservable missing outcome is termed the counterfactual outcome. For example, if we observe $Y_{1i}$, then $Y_{0i}$ is the counterfactual.
Sometimes it is cumbersome for the policy maker to learn the individual effect for each subject. Instead, we are interested in summary statistics, such as the average treatment effect:
$$ATE = E[Y_1 - Y_0].$$
The ATE not only describes the average effect of the treatment on subjects, but also transforms the impossible-to-identify individual effect into a possible-to-estimate statistical problem: we can utilize information contained in the sampling process to learn the average effect without knowing the individual effect of each unit. However, the missing data problem raises an identification problem for the ATE, because we want to learn features of $(Y_1, Y_0, D)$ but only $(Y, D)$ is observed. In this lecture note several identification strategies will be discussed under different assumptions on selection mechanisms, exclusion restrictions, sources of exogenous variation, functional form restrictions, and heterogeneity. For identifying the ATE, the mechanism by which units are selected into the treatment group lies at the heart of treatment effect analysis. We now introduce an identification condition for the ATE. Suppose the treatment assignment $D$ satisfies:
Assumption 1.1 $(Y_1, Y_0) \perp D$, where $\perp$ denotes independence.
Assumption 1.1 is an exogeneity condition. For example, a randomized experiment automatically satisfies it. Another example satisfying this assumption is a new government program in which some people are required to participate: since these people cannot choose whether to join the program, $D$ is independent of $(Y_1, Y_0)$. This situation is termed a quasi-experiment or natural experiment in the literature. Under this assumption, the ATE can be identified by the group mean difference $E[Y \mid D = 1] - E[Y \mid D = 0]$, where $E[Y \mid D = 1]$ is the mean of the treatment group and $E[Y \mid D = 0]$ is the mean of the control group. Since $Y = DY_1 + (1 - D)Y_0$, we have:
$$E[Y \mid D = 1] - E[Y \mid D = 0] = E[Y_1 \mid D = 1] - E[Y_0 \mid D = 0],$$
and by Assumption 1.1,
$$E[Y_1 \mid D = 1] - E[Y_0 \mid D = 0] = E[Y_1] - E[Y_0] = E[Y_1 - Y_0] = ATE. \quad (1)$$
Here is an important implication of Assumption 1.1. Since $(Y_1, Y_0) \perp D$, we have $E[Y_0] = E[Y_0 \mid D = 0] = E[Y_0 \mid D = 1]$. $E[Y_0 \mid D = 1]$ is the average response of units in the treatment group, had they not been treated. However, one can only observe $Y_1$ in the treatment group; therefore $E[Y_0 \mid D = 1]$ is counterfactual. Under Assumption 1.1 we can use the observable $E[Y_0 \mid D = 0]$ to impute the counterfactual $E[Y_0 \mid D = 1]$. This is because Assumption 1.1 guarantees that the treatment group and the control group are similar, so that we can compare one with the other. We can use the information in the control group to impute the counterfactual $Y_0$ of the treatment group, and likewise use the information in the treatment group to impute the counterfactual $Y_1$ of the control group. In fact, a weaker condition is sufficient to identify the ATE:
Assumption 1.2 (mean independence) $E[Y_0 \mid D] = E[Y_0]$ and $E[Y_1 \mid D] = E[Y_1]$.
When the parameter of interest is a quantile treatment effect, or when we want to estimate the asymptotic variance of the ATE estimator, weaker conditions like Assumption 1.2 are not enough. Therefore, throughout this lecture we will impose stronger conditions, even though weaker conditions are sufficient for identification. One should note that $E[Y_0 \mid D = 0]$ does not necessarily equal $E[Y_0 \mid D = 1]$; that is, the control group need not be a good proxy for the treatment group. The identification condition thus governs the mechanism by which units are selected into the treatment group. It also determines whether the experiment is statistically sound. The guiding principle of treatment effect analysis is to find conditions under which the treatment group and the control group are similar, namely, to compare the comparable. To be specific, comparing the comparable means an imputation procedure for the counterfactual that also removes the selection bias. Almost all estimators discussed in this note intrinsically implement this principle.
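To see the identification result (1) in action, here is a minimal simulation sketch (not part of the original notes; the cure rates, seed, and variable names are all illustrative). Potential outcomes are drawn first, treatment is then randomized independently of them so that Assumption 1.1 holds, and the group mean difference recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Potential responses: cured (1) or not (0) under treatment and control.
p1, p0 = 0.7, 0.5                       # illustrative cure rates; ATE = 0.2
Y1 = rng.binomial(1, p1, N)
Y0 = rng.binomial(1, p0, N)

# Random assignment: D independent of (Y1, Y0), i.e. Assumption 1.1 holds.
D = rng.binomial(1, 0.5, N)

# Observed response Y = D*Y1 + (1 - D)*Y0; the other outcome is missing.
Y = D * Y1 + (1 - D) * Y0

# Group mean difference identifies the ATE, as in (1).
print(Y[D == 1].mean() - Y[D == 0].mean())   # close to 0.2
```

If the assignment instead depended on $(Y_1, Y_0)$, the equality in (1) would fail; that failure is the subject of the next section.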
1.2 Selection Bias
However, in most social science studies we can hardly obtain data generated by randomized experiments. Even in natural science studies, a random sample is sometimes unavailable. For example, suppose we want to study whether a certain virus is lethal. It is against ethics and law to conduct a randomized experiment that infects some people and then measures the mortality rate. Instead of collecting data from a randomized experiment controlled by an experimenter, in many cases we conduct a nonrandomized observational study. The challenge of an observational study is that the treatment assignment $D_i$ may depend on other factors that also affect the response variable. Moreover, $D_i$ may be endogenous as well. Both situations invalidate the independence assumption. In our example, hospitals choose whether to participate in the program. Therefore $D_i$ is not randomly assigned, and Assumption 1.1 may be violated. To see this, $D_i$ may depend on the scale of the hospital, $X$: big hospitals, such as teaching hospitals, are more likely to participate in the program. One reason is that the insurance company asked these hospitals to participate. By providing better health care services, hospitals may incur more cost, so participation may not be profitable; big hospitals may provide health care services in a more cost-effective manner. Therefore, big hospitals are more likely to join the program. In other words, $D_i$ is a function of $X$. However, the scale of the hospital may also affect patients' health outcomes, so $Y_i$ is also a function of $X$. Since $X$ is a common factor of $D_i$ and $Y_i$, the condition $(Y_1, Y_0) \perp D$ is clearly violated. $X$ is termed a confounder, covariate, pretreatment variable, or exogenous variable. If the confounder effect is not controlled for, we will find a positive ATE by using the group mean difference alone to estimate it. The idea is as follows. Big hospitals have a higher cure rate and a higher probability of participating in the program. One scenario is that the program has virtually no effect on patients' health outcomes; simply because there are more big hospitals in the treatment group, we find that patients treated in treatment-group hospitals have better health outcomes. The treatment and control groups are not comparable if the confounders are not controlled for.
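A short simulation sketch of this scenario (illustrative numbers and names, not from the original notes): the program has no effect for any unit, yet the naive group mean difference is strongly positive because hospital scale drives both participation and the cure rate. Conditioning on the confounder removes the bias, anticipating the next section.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# Confounder: hospital scale (1 = big, 0 = small).
X = rng.binomial(1, 0.5, N)

# The program has NO true effect: Y1 = Y0 for every unit,
# but big hospitals have a higher cure rate.
Y0 = rng.binomial(1, np.where(X == 1, 0.8, 0.4))
Y1 = Y0.copy()                            # true ATE = 0 by construction

# Selection: big hospitals are more likely to participate.
D = rng.binomial(1, np.where(X == 1, 0.8, 0.2))
Y = D * Y1 + (1 - D) * Y0

# Naive group mean difference: clearly positive although the ATE is 0.
print(Y[D == 1].mean() - Y[D == 0].mean())

# Conditioning on the confounder X removes the overt bias.
adj = sum((X == x).mean() *
          (Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean())
          for x in (0, 1))
print(adj)                                # close to 0
```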
The bias created by not controlling for observable confounders is classified as overt bias. Overt bias is analogous to the omitted variable problem in regression analysis. Overt bias concerns covariates that are observable to econometricians, while hidden bias concerns unobservable covariates. Hidden bias is also known as the endogeneity problem. For example, the genetic characteristics of patients or the talent of workers can be viewed as unobservable covariates. Another source of hidden bias is self-selection into treatment, known as self-selection bias. For example, suppose the government launches a job training program for workers. The decision whether to join the program may be related to the benefit $Y_1 - Y_0$. This case also renders Assumption 1.1 implausible. When these biases are present, the group mean difference no longer identifies the ATE. However, as long as we can come up with methods to control for selection biases, the ATE can still be identified. Typically, this involves understanding the data generating process of $D_i$. We discuss identification and estimation under exogeneity in the following section. The issues of hidden bias and self-selection are deferred to the sections on repeated observations, instrumental variables, and control functions.
1.3 Identification and Estimation under Exogeneity
In this section we introduce the identification and estimation of the ATE when there is only overt bias. The framework of this section mainly follows Imbens (2004). Moreover, both Wooldridge (2001) and M.J. Lee (2005) are excellent references on this topic.
We will show how to remove overt bias and then identify the ATE under mild exogeneity conditions. First we impose two key assumptions on the treatment assignment:
Assumption 1.3 (unconfoundedness) $(Y_1, Y_0) \perp D \mid X$,
and
Assumption 1.4 (overlap or support condition) $0 < P(D = 1 \mid X) < 1$.
Assumption 1.3 thus modifies Assumption 1.1, the independence assumption, into a conditional independence assumption. It means that all overt biases can be removed by conditioning on the vector of all observable confounders. Intuitively, after controlling for $X$, $D$ behaves like a random assignment. Therefore, each subgroup of the treatment group and the control group defined by the same value of the covariates is comparable. Unconfoundedness is interchangeable with ignorable treatment (Rosenbaum and Rubin, 1983), conditional independence (Lechner, 1999), and selection on observables (Barnow, Cain, and Goldberger, 1980). Assumption 1.4 guarantees that, at a minimum, we can compare the treatment and control groups. For example, suppose $X = 1$ stands for male subjects. Then $P(D = 1 \mid X = 1) = 1$ signifies that all male subjects receive treatment; there are no male subjects in the control group. If the gender difference influences the potential responses, one technique to control for the gender effect is to compare male subjects who receive treatment with male subjects who do not, and to compare female subjects who receive treatment with female subjects who do not. But when all male subjects are in the treatment group, we cannot control for the gender effect, and the two groups are not comparable. Next, we define some notation:
Definition 1.1 (conditional mean and conditional variance)
$\mu(x, d) = E[Y \mid X = x, D = d]$, $\mu_d(x) = E[Y_d \mid X = x]$, $\sigma^2(x, d) = V(Y \mid X = x, D = d)$, $\sigma_d^2(x) = V(Y_d \mid X = x)$.

Under Assumption 1.3, $\mu(x, d) = \mu_d(x)$ and $\sigma^2(x, d) = \sigma_d^2(x)$. We now introduce the OLS approach to estimating the ATE, and then move to recent advances in nonparametric and semiparametric methods.
1.3.1 Regression
First we postulate the constant effect model $Y_{1i} - Y_{0i} = \tau$ for all $i$, and assume $Y_{0i} = \alpha + X_i' \beta + \varepsilon_i$. Then we have:
$$Y_i = D_i Y_{1i} + (1 - D_i) Y_{0i} = Y_{0i} + D_i (Y_{1i} - Y_{0i}) = \alpha + \tau D_i + \beta' X_i + \varepsilon_i,$$
which is nothing but a dummy variable regression. It is easy to see that the OLS estimator $\hat\tau$ is an estimator of the ATE. Since $X_i$ is included in the regression function, we have controlled for the confounder effect. This setting also highlights the relationship between the unconfoundedness assumption and the exogeneity assumption in regression analysis: Assumption 1.3 is equivalent to $D \perp \varepsilon \mid X$, characterizing the exogeneity of $D_i$. Next, we use a more general setting to study regression-based ATE estimators. Under Assumption 1.3, we have:
$$E[Y \mid D = 1, X] - E[Y \mid D = 0, X] = E[Y_1 \mid D = 1, X] - E[Y_0 \mid D = 0, X] = E[Y_1 \mid X] - E[Y_0 \mid X] = E[Y_1 - Y_0 \mid X] \equiv \tau(X). \quad (2)$$
By using the conditional group mean difference $E[Y \mid D = 1, X] - E[Y \mid D = 0, X]$, we identify the conditional ATE, $\tau(X)$. Taking the expectation with respect to the distribution of $X$, we identify the ATE:
$$\tau \equiv E[Y_1 - Y_0] = E\big[E[Y_1 - Y_0 \mid X]\big] = E[\tau(X)]. \quad (3)$$
The corresponding sample counterpart is given by:
$$\hat\tau = \frac{1}{N} \sum_{i=1}^{N} \big[\hat\mu_1(X_i) - \hat\mu_0(X_i)\big]. \quad (4)$$
$\hat\mu_1(X_i)$ and $\hat\mu_0(X_i)$ are estimators of $E[Y \mid D = 1, X]$ and $E[Y \mid D = 0, X]$, respectively. In (4), the difference of the two estimated conditional mean functions estimates $\tau(X)$, while $E[\tau(X)]$ is estimated by averaging $\tau(X_i)$ over the empirical distribution of $X$. From this expression, the estimation problem for the ATE can be viewed as the estimation problem for the conditional mean function $E[Y \mid D, X]$. Suppose $\mu_d(x)$ is linear in the treatment assignment and covariates:
$$\mu_d(x) = \alpha + \tau d + \beta' x,$$
then the corresponding dummy variable regression is given by:
$$Y_i = \alpha + \beta' X_i + \tau D_i + \varepsilon_i.$$
The OLS estimator $\hat\tau$ then estimates the ATE:
$$\frac{1}{N} \sum_{i=1}^{N} \big[\hat\mu_1(X_i) - \hat\mu_0(X_i)\big] = \frac{1}{N} \sum_{i=1}^{N} \big[(\hat\alpha + \hat\tau + \hat\beta' X_i) - (\hat\alpha + \hat\beta' X_i)\big] = \hat\tau.$$
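As a minimal sketch of the dummy variable regression estimator (the data generating process, seed, and names are illustrative; the notes do not prescribe an implementation), the coefficient on $D$ from a least-squares regression of $Y$ on a constant, $D$, and $X$ estimates $\tau$:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50_000

X = rng.normal(size=N)                        # scalar covariate
D = rng.binomial(1, 1 / (1 + np.exp(-X)))     # selection depends on X only
tau, alpha, beta = 2.0, 1.0, 0.5              # illustrative true parameters
Y = alpha + tau * D + beta * X + rng.normal(size=N)

# Dummy variable regression: regress Y on [1, D, X] by least squares.
Z = np.column_stack([np.ones(N), D, X])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(coef[1])                                # tau_hat, close to 2.0
```

Note that the naive group mean difference would be biased here, since $D$ depends on $X$; including $X$ in the regression removes the overt bias.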
Instead of specifying a dummy variable regression, we can also estimate two separate regression functions:
$$\mu_1(x) = \alpha_1 + \beta_1' x \ \text{ if } D_i = 1, \qquad \mu_0(x) = \alpha_0 + \beta_0' x \ \text{ if } D_i = 0.$$
One property of the OLS estimator is that the average of the predicted responses equals the average of the observed responses:
$$\sum_i D_i \hat\mu_1(X_i) = \sum_i D_i Y_i, \quad \text{and} \quad \sum_i (1 - D_i) \hat\mu_0(X_i) = \sum_i (1 - D_i) Y_i.$$
Plugging this algebraic property into (4), $\hat\tau$ can be decomposed into:
$$\hat\tau = \frac{1}{N} \sum_{i=1}^{N} \Big\{ D_i \big[Y_i - \hat\mu_0(X_i)\big] + (1 - D_i) \big[\hat\mu_1(X_i) - Y_i\big] \Big\}. \quad (5)$$
Many ATE estimators have the above representation, and it has a nice interpretation. For instance, if unit $i$ receives treatment ($D_i = 1$), we in fact calculate $Y_i - \hat\mu_0(X_i)$ with $Y_i = Y_{1i}$. Since $Y_{1i}$ is observed, the remaining task is to impute the counterfactual $Y_{0i}$, and we impute it by $\hat\mu_0(X_i)$, which describes the average response of units with covariate value $X_i$, had they not been treated.
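The two-regression version of (4) can be sketched in a few lines (again with an illustrative data generating process; the function name is ours, and linear specifications are assumed for both groups):

```python
import numpy as np

def regression_imputation_ate(Y, D, X):
    # Fit mu_1 on treated units and mu_0 on control units, then average
    # the difference of the fitted values over the empirical distribution
    # of X, as in (4).
    Z = np.column_stack([np.ones(len(Y)), X])
    b1, *_ = np.linalg.lstsq(Z[D == 1], Y[D == 1], rcond=None)
    b0, *_ = np.linalg.lstsq(Z[D == 0], Y[D == 0], rcond=None)
    return np.mean(Z @ b1 - Z @ b0)

rng = np.random.default_rng(3)
N = 50_000
X = rng.normal(size=N)
D = rng.binomial(1, 1 / (1 + np.exp(-X)))
Y = 1.0 + 2.0 * D + 0.5 * X + rng.normal(size=N)
print(regression_imputation_ate(Y, D, X))     # close to 2.0
```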
Besides the OLS method, there is a vast literature on estimating conditional mean functions. Since OLS may raise model misspecification problems, recent research pays much more attention to nonparametric and semiparametric models for the ATE. Hahn (1998) derives the efficiency bound for the ATE and proposes an efficient estimator using series estimation.¹ Heckman, Ichimura, and Todd (1997) focus on the kernel regression approach. We outline the estimator proposed by Heckman et al. Suppose $X$ is one-dimensional (the approach generalizes to the multidimensional case); the kernel estimator for $\mu_d(x)$ is given by:
$$\hat\mu_d(x) = \frac{\sum_{i: D_i = d} Y_i K\big((X_i - x)/h\big)}{\sum_{i: D_i = d} K\big((X_i - x)/h\big)},$$

¹ See Pagan and Ullah (2005) for an introduction.
where $K(\cdot)$ is the kernel function and $h$ is the bandwidth; together they determine the weighting scheme. To ensure consistency, $h$ should shrink to zero as the sample size increases, but slowly enough that $Nh \to \infty$. Let $T$ denote the treatment group and $C$ the control group; the estimator can also be decomposed into the form of (5):
$$\hat\tau = \frac{1}{N} \sum_{i \in T} \left\{ Y_i - \frac{\sum_{j \in C} K\big((X_j - X_i)/h\big) Y_j}{\sum_{j \in C} K\big((X_j - X_i)/h\big)} \right\} + \frac{1}{N} \sum_{j \in C} \left\{ \frac{\sum_{i \in T} K\big((X_j - X_i)/h\big) Y_i}{\sum_{i \in T} K\big((X_j - X_i)/h\big)} - Y_j \right\}. \quad (6)$$
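A sketch of estimator (6) follows (function names, bandwidth, and data generating process are illustrative; the toy design places opposite-group observations within the bandwidth of every unit, which the code assumes). It uses the Bartlett (triangular) kernel discussed later in this section:

```python
import numpy as np

def bartlett(u):
    # Bartlett (triangular) kernel: 1 - |u| on [-1, 1], zero elsewhere.
    return np.clip(1.0 - np.abs(u), 0.0, None)

def kernel_matching_ate(Y, D, X, h):
    # Estimator (6): impute each unit's counterfactual by a kernel-weighted
    # average of outcomes in the opposite group. Assumes every unit has at
    # least one opposite-group observation within bandwidth h.
    Yt, Xt = Y[D == 1], X[D == 1]                    # treatment group T
    Yc, Xc = Y[D == 0], X[D == 0]                    # control group C

    W0 = bartlett((Xc[None, :] - Xt[:, None]) / h)   # weights, |T| x |C|
    y0_hat = (W0 @ Yc) / W0.sum(axis=1)              # imputed Y0 for T

    W1 = bartlett((Xt[None, :] - Xc[:, None]) / h)   # weights, |C| x |T|
    y1_hat = (W1 @ Yt) / W1.sum(axis=1)              # imputed Y1 for C

    return (np.sum(Yt - y0_hat) + np.sum(y1_hat - Yc)) / len(Y)

rng = np.random.default_rng(4)
N = 5_000
X = rng.uniform(-2, 2, N)
D = rng.binomial(1, 1 / (1 + np.exp(-X)))
Y = 1.0 + 2.0 * D + np.sin(X) + 0.5 * rng.normal(size=N)
print(kernel_matching_ate(Y, D, X, h=0.3))    # close to 2.0
```

The nonlinear term $\sin(X)$ illustrates the point of the nonparametric approach: no linear model for $\mu_d(x)$ is assumed.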
1.3.2 Matching
Matching estimates the ATE by comparing subjects with similar covariates. Unlike regression, which directly estimates the two conditional mean functions, matching directly implements the principle of comparing the comparable, though it also estimates $E[Y \mid D, X]$ implicitly. Before matching subjects with similar covariates, we should first define a criterion to measure similarity. We first introduce the nearest neighbor matching estimator.
Nearest Neighbor Matching
For unit $i$ in the treatment group, we pick the $M$ units in the control group whose covariates are closest to $X_i$, then average the responses of these $M$ units to impute the counterfactual of unit $i$. The same procedure applies to the units in the control group. Following Abadie and Imbens (2006), we can use the Euclidean norm $\|x\| = (x'x)^{1/2}$ to measure closeness. We can also use $\|x\|_V = (x'Vx)^{1/2}$, where $V$ is a positive definite symmetric matrix. Let $j_m(i)$ be an index satisfying
$$D_{j_m(i)} = 1 - D_i, \quad \text{and} \quad \sum_{l: D_l = 1 - D_i} 1\big\{ \|X_l - X_i\| \le \|X_{j_m(i)} - X_i\| \big\} = m.$$
It indicates the unit in the opposite treatment group that is the $m$th closest to unit $i$ with respect to the Euclidean norm. Then define $J_M(i)$ as the set of indices of the first $M$ matches for unit $i$:
$$J_M(i) = \big\{ j_1(i), \ldots, j_M(i) \big\}.$$
The imputation procedure is given by:
$$\hat Y_{0i} = \begin{cases} Y_i, & \text{if } D_i = 0, \\ \frac{1}{M} \sum_{j \in J_M(i)} Y_j, & \text{if } D_i = 1, \end{cases} \qquad \hat Y_{1i} = \begin{cases} \frac{1}{M} \sum_{j \in J_M(i)} Y_j, & \text{if } D_i = 0, \\ Y_i, & \text{if } D_i = 1. \end{cases}$$
The nearest neighbor matching estimator of the ATE is:
$$\hat\tau_M = \frac{1}{N} \sum_{i=1}^{N} \big( \hat Y_{1i} - \hat Y_{0i} \big) = \frac{1}{N} \sum_{i=1}^{N} \Big\{ D_i \Big[ Y_i - \frac{1}{M} \sum_{j \in J_M(i)} Y_j \Big] + (1 - D_i) \Big[ \frac{1}{M} \sum_{j \in J_M(i)} Y_j - Y_i \Big] \Big\}. \quad (7)$$
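A sketch of estimator (7) with the Euclidean norm (function name, data generating process, and the choice $M = 4$ are illustrative; this plain implementation omits the bias corrections developed in the matching literature):

```python
import numpy as np

def nn_matching_ate(Y, D, X, M=1):
    # Estimator (7): for each unit, impute the counterfactual by the average
    # outcome of its M nearest neighbors (Euclidean norm) in the opposite
    # treatment group.
    X = X.reshape(len(Y), -1)
    imputed = np.empty(len(Y))
    for i in range(len(Y)):
        opp = np.flatnonzero(D == 1 - D[i])            # opposite group
        dist = np.linalg.norm(X[opp] - X[i], axis=1)
        J_M = opp[np.argsort(dist)[:M]]                # indices of M matches
        imputed[i] = Y[J_M].mean()
    # imputed is Y0_hat for treated units and Y1_hat for control units.
    return np.mean(np.where(D == 1, Y - imputed, imputed - Y))

rng = np.random.default_rng(5)
N = 2_000
X = rng.normal(size=(N, 2))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 2.0 * D + X @ np.array([0.5, -0.3]) + 0.3 * rng.normal(size=N)
print(nn_matching_ate(Y, D, X, M=4))          # close to 2.0
```

A larger $M$ reduces variance (more responses are averaged) at the cost of using more distant, hence less comparable, matches.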
Kernel Matching
It is interesting that the estimator of Heckman, Ichimura, and Todd (1997) is not only a nonparametric regression-type estimator, but also a matching estimator: it uses the kernel function to measure closeness. To see this, let $K(\cdot)$ in (6) be the Bartlett kernel:²
$$K(x) = \begin{cases} 1 - |x|, & |x| \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
It imputes the counterfactual of treated units by:
$$\hat Y_{0i} = \frac{\sum_{j \in C} K\big((X_j - X_i)/h\big) Y_j}{\sum_{j \in C} K\big((X_j - X_i)/h\big)},$$
where $(X_j - X_i)$ measures the difference between the covariate of treated unit $i$ and the covariate of untreated unit $j$. If $|X_j - X_i|$ is large, namely, $j$ is a distant observation relative to $i$ in terms of the kernel metric, it receives a smaller weight. When $|X_j - X_i| \ge h$, it receives zero weight, and $X_j$ is not included in the imputation of $Y_{0i}$.
² We use the Bartlett kernel for exposition only. In Heckman et al., $K$ should satisfy $\int z^r K(z)\,dz = 0$ for all $r \le \dim(X)$; the Bartlett kernel obviously violates this condition.
1.3.3 OLS vs. Matching
In this section we discuss the fundamental differences between the OLS and matching estimators of the ATE. First, OLS uses a linear model to estimate the ATE, as well as the effects of the covariates on the response variable. It may suffer from model misspecification. Matching avoids this problem, but raises the question of the user-chosen parameter $M$. In matching, the role of the covariates is to determine which unit is a good match; hence we can only identify the ATE, without learning the effects of the covariates on the response.
Second, they use different methods to remove selection biases. Matching compares units with similar covariates, directly using the unconfoundedness condition. Let $\hat\tau$ be the OLS estimator for the linear model $Y_i = \alpha + \tau D_i + \beta' X_i + \varepsilon_i$. By the Frisch-Waugh-Lovell theorem, $\hat\tau$ is the estimate obtained after the linear influence of $X_i$ on both $Y_i$ and $D_i$ has been removed.
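A quick numerical check of the Frisch-Waugh-Lovell point (a sketch; data and names are illustrative): the coefficient on $D$ from the full regression equals the slope from regressing the $X$-residuals of $Y$ on the $X$-residuals of $D$.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 10_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # constant + covariate
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 1]))).astype(float)
Y = 1.0 + 2.0 * D + 0.5 * X[:, 1] + rng.normal(size=N)

def resid(v, X):
    # Residual from the least-squares projection of v on the columns of X.
    b, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ b

# Coefficient on D from the full regression of Y on [X, D].
tau_full = np.linalg.lstsq(np.column_stack([X, D]), Y, rcond=None)[0][-1]

# FWL: slope from regressing X-residualized Y on X-residualized D.
eY, eD = resid(Y, X), resid(D, X)
tau_fwl = (eD @ eY) / (eD @ eD)
print(tau_full, tau_fwl)     # equal up to floating-point error
```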
Finally, they differ in which observations enter the imputation. Let $X_1$ and $X_0$ denote the covariates of the treatment and the control group, respectively, and let $s(X_1)$ and $s(X_0)$ stand for the corresponding supports, where the support is defined as $s(X) = \{x : f(x) > 0\}$ with $f(x)$ the pdf. Matching only uses the sample around the overlap of the supports, $s(X_1) \cap s(X_0)$. If there is insufficient overlap between $s(X_1)$ and $s(X_0)$, we can only use a limited sample to impute the counterfactual. This situation is termed the support problem. The extreme case is no overlap at all, i.e., $s(X_1) \cap s(X_0) = \emptyset$. For example, suppose the government launched a subsidy program in which all households with annual income less than 10,000 are required to participate. In this case, we cannot match on the income variable because all treated units are low-income families. We will discuss this issue in the section on regression discontinuity designs. By contrast, OLS uses the whole sample to estimate the ATE regardless of whether there is sufficient common support. It is not that OLS does not suffer from the support problem; rather, it assumes a linear model to extrapolate across it.
Because OLS has virtually no mechanism to deal with the support problem, and it also estimates the effects of the covariates on the response, it is very sensitive to the entire distributions of $X_1$ and $X_0$. By contrast, only the common support affects the precision of the matching estimator.
1.4 Propensity Score
Without conditioning on covariates which potentially affect both $D$ and $(Y_0, Y_1)$, comparisons between the two groups will be biased. This is because the selection process induces imbalanced covariate distributions between the treatment and the control group. Two groups are comparable in the statistical sense if their covariate distributions are the same. In the previous section we demonstrated that under conditional unconfoundedness, overt biases can be removed by conditioning strategies such as regression or matching. Another identification strategy, based on the balancing score and the propensity score, solves the selection bias problem by creating balanced covariate distributions between the two groups. These concepts were first introduced by Rosenbaum and Rubin (1983).
Definition 1.2 (balancing score)
A balancing score, $b(X)$, is a function of the observed covariate $X$ such that the conditional distribution of $X$ given $b(X)$ is the same for treated and control units; that is, in Dawid's (1979) notation, $X \perp D \mid b(X)$.
Conditional on the balancing score, the covariate distributions are balanced between the treatment and the control group; hence, they become comparable. Obviously, $X$ itself is a balancing score. It is useful if there exist lower-dimensional balancing scores.
Definition 1.3 (propensity score)
The propensity score $e(x)$ is the conditional probability of receiving the treatment:
$$e(x) = P(D = 1 \mid X = x).$$
Rosenbaum and Rubin (1983) show that the propensity score is a balancing score.
Theorem 1.1 (Balancing Property)
If $D$ is binary, then $X \perp D \mid e(X)$.
proof:
$$P(X \le x, D = 1 \mid e(X)) = E[D \cdot I_{\{X \le x\}} \mid e(X)] = E\big[ E[D \cdot I_{\{X \le x\}} \mid X] \,\big|\, e(X) \big] \quad (e(X) \text{ is measurable w.r.t. } X)$$
$$= E\big[ I_{\{X \le x\}} \cdot E[D \mid X] \,\big|\, e(X) \big] = e(X) \cdot P(X \le x \mid e(X)).$$
Moreover,
$$P(D = 1 \mid e(X)) = E[D \mid e(X)] = E\big[ E[D \mid X] \,\big|\, e(X) \big] = E[e(X) \mid e(X)] = e(X).$$
Therefore,
$$P(X \le x, D = 1 \mid e(X)) = e(X) \cdot P(X \le x \mid e(X)) = P(D = 1 \mid e(X)) \cdot P(X \le x \mid e(X)).$$
11
Alternatively, we can prove this theorem by the following argument. Because $P(D = 1 \mid X, e(X)) = P(D = 1, X \mid e(X)) / P(X \mid e(X))$, if we can show $P(D = 1 \mid X, e(X)) = P(D = 1 \mid e(X))$ then we are done. Obviously, $P(D = 1 \mid X, e(X)) = E[D \mid X, e(X)] = E[D \mid X] = e(X) = P(D = 1 \mid e(X))$.
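The balancing property is easy to see in a simulation (a sketch with an illustrative logit propensity, not from the original notes): unconditionally the covariate is badly imbalanced across the two groups, but within strata of $e(X)$ the treated and control units have approximately the same covariate distribution.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200_000
X = rng.normal(size=N)
e = 1 / (1 + np.exp(-X))          # true propensity score e(X), logit form
D = rng.binomial(1, e)

# Unconditionally, X is imbalanced across the two groups:
print(X[D == 1].mean(), X[D == 0].mean())

# Within strata of e(X), the covariate is approximately balanced:
edges = np.quantile(e, np.linspace(0, 1, 6))
strata = np.digitize(e, edges[1:-1])
for s in range(5):
    m = strata == s
    print(s, X[m & (D == 1)].mean(), X[m & (D == 0)].mean())
```

The residual within-stratum differences shrink as the strata are made finer, since $D$ and $X$ are independent only given the exact value of $e(X)$.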
Note that this theorem is implied by the definition of the propensity score; no distributional assumptions or unconfoundedness conditions are needed to prove it. Rosenbaum and Rubin (1983) further show that the propensity score is the most condensed balancing score; namely, the σ-algebra induced by the propensity score is the coarsest within the class of balancing scores.
Theorem 1.2 (Most Condensed Information)
$b(X)$ is a balancing score, i.e., $X \perp D \mid b(X)$, if and only if $b(X)$ is finer than $e(X)$ in the sense that $e(X) = f(b(X))$ for some function $f$.³
proof:
$\Leftarrow$: Suppose $b(X)$ is finer than $e(X)$ (has a finer σ-algebra). Then $P(D = 1 \mid b(X)) = E[D \mid b(X)] = E[E[D \mid X] \mid b(X)] = E[e(X) \mid b(X)] = e(X)$. Also, $P(D = 1 \mid X, b(X)) = E[D \mid X, b(X)] = E[D \mid X] = e(X)$. Therefore $b(X)$ is a balancing score, following the same argument as in the proof of Theorem 1.1.
$\Rightarrow$: Suppose $b(X)$ is a balancing score but $b(X)$ is not finer than $e(X)$. Then there exist $x_1, x_2$ with $b(x_1) = b(x_2)$ but $e(x_1) \ne e(x_2)$. However, this implies $P(D = 1 \mid X = x_1) \ne P(D = 1 \mid X = x_2)$, which means that $D$ and $X$ are not conditionally independent given $b(X) = b(x_1) = b(x_2)$, a contradiction.
Conditioning on the balancing score equalizes the covariate distributions of the treated and control units. Intuitively, the selection problem is resolved because the treatment group and control group become comparable after conditioning on $b(X)$. Indeed, we have a formal statement of this intuition:

Theorem 1.3 (Conditional Unconfoundedness)
Suppose Assumptions 1.3 and 1.4 hold. Then $(Y_0, Y_1) \perp D \mid b(X)$. Namely, instead of conditioning on the entire covariate vector $X$, conditioning solely on $b(X)$ suffices to remove the selection biases.
³ $f(X)$ can only reduce the information in $X$: the same value of $x$ always yields the same value of $f(x)$, but different values of $x$ may yield the same $f(x)$. Therefore $f(X)$ can only induce a coarser σ-algebra. For example, if $f(\cdot)$ is a constant function, then $\sigma(f(X))$ is the trivial σ-algebra.
proof: By Bayes' rule we know that $P(D = 1, Y_0, Y_1 \mid b(X)) = P(D = 1 \mid Y_0, Y_1, b(X)) \cdot P(Y_0, Y_1 \mid b(X))$