LECTURE NOTE ON Treatment Effects Analysis

YU-WEI HSIEH

New York University

First Draft: Sep 1, 2009

E-mail: yuwei.hsieh@nyu.edu

© Yu-Wei Hsieh (all rights reserved).

Contents

1 Introduction to Average Treatment Effects
  1.1 Rubin's Statistical Causality Model
  1.2 Selection Bias
  1.3 Identification and Estimation under Exogeneity
      1.3.1 Regression
      1.3.2 Matching
      1.3.3 OLS vs. Matching
  1.4 Propensity Score
      1.4.1 Specification Testing of the Propensity Score
      1.4.2 Regression as a Dimension Reduction Method
      1.4.3 Propensity Score Weighting Estimator

2 Quantile Treatment Effects
  2.1 Quantile Regression
  2.2 Weighting Estimator

3 Instrumental Variables I: Local Treatment Effects
  3.1 Instrumental Variables: A Review
  3.2 Restrictions on the Selection Process: Local Treatment Effects
  3.3 Case Studies
      3.3.1 Vietnam Draft Lottery
      3.3.2 Randomized Eligibility Design
  3.4 Other Identified Features
  3.5 Non-binary Treatments and Instruments
  3.6 Nonparametric Estimation of LTE with Covariates
  3.7 Parametric and Semiparametric Estimation of LTE with Covariates

4 Difference-in-Difference Designs
  4.1 Linear Difference-in-Difference Model
  4.2 Nonparametric Difference-in-Difference Models
  4.3 Nonlinear Difference-in-Difference
  4.4 The Change-in-Change Model
  4.5 Quantile Difference-in-Difference

5 Nonparametric Bounding Approaches
  5.1 No-assumption Bound
  5.3 Restrictions on the Moment Conditions: Monotone Instrumental Variables
  5.4 Monotone Treatment Selection
  5.5 Shape Restrictions on the Response Functions: Monotone Treatment Response
  5.6 Restrictions on the Selection Mechanism
  5.7 Some Remarks

References

1 Introduction to Average Treatment Effects

Suppose a national health insurance company launches a new reimbursement scheme—quality payment—for physicians, linking their salary to patients' health outcomes: if a patient's health status improves, the insurer pays the physician an extra bonus. By providing proper financial incentives, the scheme may encourage physicians to treat patients more carefully and hence raise the cure rate. In this setting the new payment program is called the treatment, while the cure rate is called the response. The group of subjects receiving the treatment is called the treatment group, while the group receiving the control treatment (typically, no treatment) is called the control or comparison group. We want to know whether the treatment has an impact on the response variable and, if so, whether the effect is positive or negative. In this section we introduce the statistical causality model proposed by Rubin (1974) to quantify the effect of a treatment. Further discussion of this model can be found in Holland (1986).

1.1 Rubin’s Statistical Causality Model

For subject i, the Bernoulli random variable Y_{1i} indicates whether the patient is cured if he goes to a hospital that participates in the quality payment program (treatment group), and Y_{0i} if he goes to a hospital that does not participate in that program. We call Y_{1i} and Y_{0i} the potential responses. Intuitively, the individual treatment effect on subject i can be defined as Y_{1i} − Y_{0i}. If Y_{1i} − Y_{0i} > 0, the treatment has a positive effect on the subject's health status: it makes the subject recover from the illness. Let D_i = 1 if subject i is in the treatment group, and D_i = 0 if subject i is in the control group. The treatment indicator D_i is called the observed treatment, indicating whether unit i receives treatment or not. Define the observed response Y_i = D_i Y_{1i} + (1 − D_i) Y_{0i}. We observe Y_{1i} if unit i is in the treatment group (D_i = 1), and Y_{0i} if unit i is in the control group (D_i = 0). However, unit i cannot be assigned to the treatment and control groups at the same time: at any given time it is either treated or untreated. Because we can observe only one of Y_{1i} and Y_{0i}, we face a missing data problem in which half of the data are always missing, so it is impossible to identify the individual effect. The unobservable missing outcome is termed the counterfactual outcome. For example, if we observe Y_{1i}, then Y_{0i} is the counterfactual.

It is typically infeasible for a policy maker to learn the individual effect for each subject. Instead, we are interested in summary statistics, such as the average treatment effect:

ATE = E[Y_1 − Y_0].

The ATE not only describes the average effect of the treatment on subjects, but also converts the impossible-to-identify individual effect into an estimable statistical problem: we can use information contained in the sampling process to learn the average effect without knowing the individual effect for each unit. However, the missing data problem raises an identification problem for the ATE, because we want to learn features of (Y_1, Y_0, D) but only (Y, D) is observed. In this lecture note several identification strategies will be discussed under different assumptions about the selection mechanism, exclusion restrictions, sources of exogenous variation, functional form restrictions, and heterogeneity. In order to identify the ATE, the mechanism by which units are selected into the treatment group lies at the heart of treatment effect analysis. We now introduce an identification condition for the ATE. Suppose the treatment assignment D satisfies:

Assumption 1.1 (independence) (Y_1, Y_0) ⊥ D, where ⊥ denotes statistical independence.

Assumption 1.1 is an exogeneity condition. A randomized experiment, for example, automatically satisfies it. Another example is a new government program in which certain people are required to participate: since these people cannot choose whether to join the program, D is independent of (Y_1, Y_0). This situation is termed a quasi-experiment or natural experiment in the literature. Under this assumption, the ATE is identified by the group mean difference E[Y | D = 1] − E[Y | D = 0], where E[Y | D = 1] is the mean of the treatment group and E[Y | D = 0] is the mean of the control group. Since Y = D Y_1 + (1 − D) Y_0, we have:

E[Y | D = 1] − E[Y | D = 0] = E[Y_1 | D = 1] − E[Y_0 | D = 0],

and by Assumption 1.1,

E[Y_1 | D = 1] − E[Y_0 | D = 0] = E[Y_1] − E[Y_0] = E[Y_1 − Y_0] = ATE.    (1)

Here is an important implication of Assumption 1.1. Since (Y_1, Y_0) ⊥ D, we have E[Y_0] = E[Y_0 | D = 0] = E[Y_0 | D = 1]. E[Y_0 | D = 1] is the average response of units in the treatment group had they not been treated. However, only Y_1 is observed in the treatment group; E[Y_0 | D = 1] is therefore counterfactual. Under Assumption 1.1 we can use the observable E[Y_0 | D = 0] to impute the counterfactual E[Y_0 | D = 1], because the assumption guarantees that the treatment group and the control group are similar enough to be compared with each other. We can use information from the control group to impute the counterfactual Y_0 of the treatment group, and likewise use information from the treatment group to impute the counterfactual Y_1 of the control group. In fact, a weaker condition is sufficient to identify the ATE:

Assumption 1.2 (mean independence) E[Y_0 | D] = E[Y_0] and E[Y_1 | D] = E[Y_1].

When the parameter of interest is a quantile treatment effect, or when we want to estimate the asymptotic variance of the ATE estimator, weaker conditions like Assumption 1.2 are not enough. Throughout this lecture we will therefore impose stronger conditions, even though weaker ones may suffice for identification. One should note that E[Y_0 | D = 0] does not necessarily equal E[Y_0 | D = 1]; that is, the control group need not be a good proxy for the treatment group. The identification condition thus governs the mechanism by which units are selected into the treatment group, and it determines whether the study is statistically sound. The guiding principle of treatment effect analysis is to find conditions under which the treatment group and the control group are similar, namely, to compare the comparable. More precisely, comparing the comparable is an imputation procedure for the counterfactual that also removes the selection bias. Almost all estimators discussed in this note implement this principle in some form.
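To make this concrete, here is a minimal simulation sketch of a randomized experiment (the potential outcome model, effect size, and sample size below are invented for illustration); under random assignment the group mean difference recovers the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes; the (made-up) true ATE is 2.0.
y0 = rng.normal(loc=1.0, scale=1.0, size=n)
y1 = y0 + 2.0 + rng.normal(scale=0.5, size=n)

# Randomization: D is independent of (Y0, Y1), as in Assumption 1.1.
d = rng.binomial(1, 0.5, size=n)

# Observed response: Y = D*Y1 + (1 - D)*Y0.
y = d * y1 + (1 - d) * y0

# Group mean difference identifies the ATE.
ate_hat = y[d == 1].mean() - y[d == 0].mean()
print(f"estimated ATE = {ate_hat:.3f}")  # close to 2.0
```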

1.2 Selection Bias

However, in most social science studies we can hardly obtain data generated by randomized experiments. Even in the natural sciences, randomization is sometimes unavailable. For example, suppose we want to study whether a certain virus is lethal: it would be ethically and legally impermissible to run a randomized experiment that infects some people and then measures the mortality rate. Instead of collecting data from a randomized experiment controlled by an experimenter, in many cases we conduct a nonrandomized observational study. The challenge of an observational study is that the treatment assignment D_i may depend on other factors that also affect the response variable; moreover, D_i may be endogenous. Both situations invalidate the independence assumption. In our example, hospitals choose whether to participate in the program, so D_i is not randomly assigned and Assumption 1.1 may be violated. To see this, D_i may depend on the scale of the hospital, X: big hospitals, such as teaching hospitals, are more likely to participate in the program. One reason is that the insurance company asked these hospitals to participate. By providing better health care services, hospitals incur additional cost, so participation may not be profitable; big hospitals may provide such services in a more cost-effective manner, and are therefore more likely to join. In other words, D_i is a function of X. However, the scale of the hospital may also affect a patient's health outcome, so Y_i is a function of X as well. Since X is a common factor of D_i and Y_i, the condition (Y_1, Y_0) ⊥ D is clearly violated. X is termed a confounder, covariate, pretreatment variable, or exogenous variable. If the confounder effect is not controlled for, we may find a positive ATE simply by using the group mean difference alone. The idea is as follows: big hospitals have a higher cure rate and a higher probability of participating in the program. One scenario is that the program has virtually no effect on patients' health outcomes; it is simply because there are more big hospitals in the treatment group that patients treated in treatment-group hospitals show better outcomes. The treatment and control groups are not comparable if the confounders are not controlled for.
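A short simulation can illustrate this bias (all numbers are invented): a confounder X raises both the participation probability and the outcome, so the naive group mean difference is positive even though the true treatment effect is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Confounder: hospital scale (larger x = bigger hospital).
x = rng.normal(size=n)

# Bigger hospitals are more likely to participate (D depends on X)...
d = rng.binomial(1, 1 / (1 + np.exp(-2 * x)))

# ...and also achieve better outcomes (Y depends on X), while the
# true treatment effect is exactly zero.
y = x + rng.normal(size=n)

naive = y[d == 1].mean() - y[d == 0].mean()
print(f"naive group mean difference = {naive:.3f}")  # > 0 despite a zero effect
```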

The bias created by failing to control for observable confounders is classified as overt bias, which is analogous to the omitted variable problem in regression analysis. Overt bias concerns covariates that are observable to the econometrician, while hidden bias concerns unobservable covariates; hidden bias is also known as the endogeneity problem. For example, the genetic characteristics of patients or the talent of workers can be viewed as unobservable covariates. Another source of hidden bias is self-selection into treatment, known as self-selection bias. For example, suppose the government launches a job training program for workers: the decision to join the program may be related to the benefit Y_1 − Y_0. This case also renders Assumption 1.1 implausible. When these biases are present, the group mean difference no longer identifies the ATE. However, as long as we can find methods to control for the selection biases, the ATE can still be identified; typically this involves understanding the data generating process of D_i. We discuss identification and estimation under exogeneity in the following section. The issues of hidden bias and self-selection are deferred to the sections on repeated observations, instrumental variables, and control functions.

1.3 Identification and Estimation under Exogeneity

In this section we introduce the identification and estimation problems of the ATE when only overt bias is present. The framework mainly follows Imbens (2004); Wooldridge (2001) and M.-J. Lee (2005) are also excellent references on this topic.

We will show how to remove overt bias and then identify the ATE under mild exogeneity conditions. First we impose two key assumptions on the treatment assignment:

Assumption 1.3 (unconfoundedness) (Y_1, Y_0) ⊥ D | X,

and

Assumption 1.4 (overlap or support condition) 0 < P(D = 1 | X) < 1.

Assumption 1.3 thus modifies Assumption 1.1, the independence assumption, into a conditional independence assumption. It means that all overt biases can be removed by conditioning on the vector of all observable confounders. Intuitively, after controlling for X, D behaves like a random assignment; therefore each subgroup of the treatment group and the control group defined by the same covariate values is comparable. Unconfoundedness is used interchangeably with ignorable treatment assignment (Rosenbaum and Rubin, 1983), conditional independence (Lechner, 1999), and selection on observables (Barnow, Cain, and Goldberger, 1980). Assumption 1.4 guarantees that the treatment and control groups can be compared at all. For example, suppose X = 1 stands for male subjects. P(D = 1 | X = 1) = 1 would mean that all male subjects receive treatment, leaving no male subjects in the control group. If gender influences the potential responses, one technique for controlling the gender effect is to compare male subjects who receive treatment with male subjects who do not, and likewise for female subjects. But when all male subjects are in the treatment group, we cannot control for the gender effect, and the two groups are not comparable. Next, we define some notation:

Definition 1.1 (conditional mean and conditional variance)

µ(x, d) = E[Y | X = x, D = d],   µ_d(x) = E[Y_d | X = x],
σ²(x, d) = V(Y | X = x, D = d),   σ²_d(x) = V(Y_d | X = x).

Under Assumption 1.3, µ(x, d) = µ_d(x) and σ²(x, d) = σ²_d(x). We now introduce the OLS approach to estimating the ATE, and then move to recent advances in nonparametric and semiparametric methods.

1.3.1 Regression

First we postulate the constant effect model Y_{1i} − Y_{0i} = τ for all i, and assume Y_{0i} = α + X_i′β + ε_i. Then we have:

Y_i = D_i Y_{1i} + (1 − D_i) Y_{0i} = Y_{0i} + D_i (Y_{1i} − Y_{0i}) = α + τ D_i + X_i′β + ε_i,

which is nothing but a dummy variable regression. It is easy to see that the OLS estimator τ̂ is an estimator of the ATE. Since X_i enters the regression function, the confounder effect is controlled for. This setting also highlights the relationship between the unconfoundedness assumption and the exogeneity assumption in regression analysis: Assumption 1.3 is equivalent to D ⊥ ε | X, characterizing the exogeneity of D_i. Next, we use a more general setting to study regression-based ATE estimators. Under Assumption 1.3, we have:

E[Y | D = 1, X] − E[Y | D = 0, X] = E[Y_1 | D = 1, X] − E[Y_0 | D = 0, X]
                                  = E[Y_1 | X] − E[Y_0 | X] = E[Y_1 − Y_0 | X] ≡ τ(X).    (2)

By using the conditional group mean difference E[Y | D = 1, X] − E[Y | D = 0, X], we identify the conditional ATE, τ(X). Taking the expectation with respect to the distribution of X, we identify the ATE:

τ ≡ E[Y_1 − Y_0] = E[ E[Y_1 − Y_0 | X] ] = E[τ(X)].    (3)

The corresponding sample counterpart is given by:

τ̂ = (1/N) Σ_{i=1}^{N} [ µ̂_1(X_i) − µ̂_0(X_i) ].    (4)

µ̂_1(X_i) and µ̂_0(X_i) are the estimators of E[Y | D = 1, X] and E[Y | D = 0, X], respectively. In (4), the difference of the two estimated conditional mean functions estimates τ(X), while E[τ(X)] is estimated by averaging τ̂(X_i) over the empirical distribution of X. From this expression, the estimation of the ATE can be viewed as the estimation of the conditional mean function E[Y | D, X]. Suppose µ_d(x) is linear in the treatment assignment and covariates,

µ_d(x) = α + τ d + β′x,

then the corresponding dummy variable regression is given by:

Y_i = α + β′X_i + τ D_i + ε_i.

The OLS estimator τ̂ thus estimates the ATE:

(1/N) Σ_{i=1}^{N} [ µ̂_1(X_i) − µ̂_0(X_i) ] = (1/N) Σ_{i=1}^{N} [ (α̂ + τ̂ + β̂′X_i) − (α̂ + β̂′X_i) ] = τ̂.
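As a sketch of the dummy variable regression in practice (simulated data with an invented design; the true τ is set to 2):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

x = rng.normal(size=n)                             # observed confounder
d = rng.binomial(1, 1 / (1 + np.exp(-x)))          # selection on observables
y = 0.5 + 2.0 * d + 1.0 * x + rng.normal(size=n)   # true tau = 2.0

# Dummy variable regression: Y_i = alpha + tau*D_i + beta*X_i + eps_i.
Z = np.column_stack([np.ones(n), d, x])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(f"tau_hat = {coef[1]:.3f}")  # approximately 2.0
```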

Instead of specifying a single dummy variable regression, we can also estimate two separate regression functions:

µ_1(x) = α_1 + β_1′x,   if D_i = 1,
µ_0(x) = α_0 + β_0′x,   if D_i = 0.

One property of the OLS estimator is that the average of the predicted responses equals the average of the observed responses:

Σ_i D_i µ̂_1(X_i) = Σ_i D_i Y_i,   and   Σ_i (1 − D_i) µ̂_0(X_i) = Σ_i (1 − D_i) Y_i.

Plugging this algebraic property into (4), τ̂ can be decomposed as:

τ̂ = (1/N) Σ_{i=1}^{N} { D_i [Y_i − µ̂_0(X_i)] + (1 − D_i) [µ̂_1(X_i) − Y_i] }.    (5)

Many ATE estimators have the above representation, which has a nice interpretation. For instance, if unit i receives treatment (D_i = 1), we in fact calculate Y_i − µ̂_0(X_i), where Y_i = Y_{1i}. Since Y_{1i} is observed, the remaining task is to impute the counterfactual Y_{0i}, which we impute by µ̂_0(X_i): the average response of units with covariate value X_i, had they not been treated.
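The decomposition (5) can be implemented directly: fit the two group-specific regressions introduced above, impute each unit's missing potential outcome, and average. A minimal sketch on simulated data (the data generating process is invented):

```python
import numpy as np

def imputation_ate(y, d, x):
    """ATE via the decomposition (5): fit mu_1 and mu_0 by separate OLS
    regressions and impute each unit's missing potential outcome."""
    X = np.column_stack([np.ones_like(x), x])
    b1, *_ = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)  # mu_1(x)
    b0, *_ = np.linalg.lstsq(X[d == 0], y[d == 0], rcond=None)  # mu_0(x)
    mu1_hat, mu0_hat = X @ b1, X @ b0
    return np.mean(d * (y - mu0_hat) + (1 - d) * (mu1_hat - y))

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = 0.5 + 2.0 * d + x + rng.normal(size=n)   # true ATE = 2.0
print(f"ATE estimate: {imputation_ate(y, d, x):.3f}")  # approximately 2.0
```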

Besides the OLS method, there is a vast literature on estimating conditional mean functions. Since OLS may suffer from model misspecification, recent research pays much more attention to nonparametric and semiparametric models for the ATE. Hahn (1998) derives the efficiency bound for the ATE and proposes an efficient estimator based on series estimation.¹ Heckman, Ichimura, and Todd (1997) focus on the kernel regression approach; we outline their estimator here. Suppose X is one-dimensional (the construction generalizes to the multi-dimensional case). The kernel estimator of µ_d(x) is given by:

µ̂_d(x) = [ Σ_{i: D_i = d} Y_i K((X_i − x)/h) ] / [ Σ_{i: D_i = d} K((X_i − x)/h) ],

¹ See Pagan and Ullah (2005) for an introduction.

where K(·) is the kernel function and h is the bandwidth; together they determine the weighting scheme. To ensure consistency, h should shrink toward zero as the sample size increases, but slowly enough that the effective number of observations entering each local average still grows. Let T denote the treatment group and C the control group; the estimator can also be decomposed into the form of (5):

τ̂ = (1/N) Σ_{i∈T} { Y_i − [ Σ_{j∈C} K((X_j − X_i)/h) Y_j ] / [ Σ_{j∈C} K((X_j − X_i)/h) ] }
   + (1/N) Σ_{j∈C} { [ Σ_{i∈T} K((X_j − X_i)/h) Y_i ] / [ Σ_{i∈T} K((X_j − X_i)/h) ] − Y_j }.    (6)
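A minimal sketch of the imputation in (6), using a Gaussian kernel for simplicity (the kernel choice and bandwidth are illustrative, and the brute-force pairwise computation is only suitable for small samples):

```python
import numpy as np

def kernel_impute(x_target, x_source, y_source, h):
    """Nadaraya-Watson imputation: for each target point, a weighted
    average of y_source with weights K((x_j - x_i)/h)."""
    u = (x_source[None, :] - x_target[:, None]) / h
    w = np.exp(-0.5 * u**2)                       # Gaussian kernel weights
    return (w * y_source[None, :]).sum(axis=1) / w.sum(axis=1)

def kernel_ate(y, d, x, h=0.1):
    """ATE in the form of equation (6)."""
    yT, xT = y[d == 1], x[d == 1]
    yC, xC = y[d == 0], x[d == 0]
    y0_hat = kernel_impute(xT, xC, yC, h)         # counterfactual Y0 for treated
    y1_hat = kernel_impute(xC, xT, yT, h)         # counterfactual Y1 for controls
    return (np.sum(yT - y0_hat) + np.sum(y1_hat - yC)) / len(y)
```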

1.3.2 Matching

Matching estimates the ATE by comparing subjects with similar covariates. Unlike regression, which directly estimates the two conditional mean functions, matching directly implements the principle of comparing the comparable, although it also estimates E[Y | D, X] implicitly. Before matching subjects on covariates, we must first define a criterion for measuring similarity. We begin with the nearest neighbor matching estimator.

Nearest Neighbor Matching

For unit i in the treatment group, we pick the M units in the control group whose covariates are closest to X_i, and average the responses of these M units to impute the counterfactual of unit i. The same procedure applies to units in the control group. Following Abadie and Imbens (2006), we can use the Euclidean norm ||x|| = (x′x)^{1/2} to measure closeness; more generally, we can use ||x||_V = (x′V x)^{1/2}, where V is a positive definite symmetric matrix. Let j_m(i) be the index satisfying

D_{j_m(i)} = 1 − D_i   and   Σ_{l: D_l = 1 − D_i} 1{ ||X_l − X_i|| ≤ ||X_{j_m(i)} − X_i|| } = m.

It indicates the unit in the opposite treatment group that is the m-th closest to unit i with respect to the chosen norm. Then define J_M(i) as the set of indices of the first M matches for unit i:

J_M(i) = { j_1(i), …, j_M(i) }.

The imputation procedure is given by:

Ŷ_{0i} = Y_i if D_i = 0,   and   Ŷ_{0i} = (1/M) Σ_{j∈J_M(i)} Y_j if D_i = 1;

Ŷ_{1i} = (1/M) Σ_{j∈J_M(i)} Y_j if D_i = 0,   and   Ŷ_{1i} = Y_i if D_i = 1.

The nearest neighbor matching estimator of the ATE is:

τ̂_M = (1/N) Σ_{i=1}^{N} ( Ŷ_{1i} − Ŷ_{0i} )
    = (1/N) Σ_{i=1}^{N} { D_i [ Y_i − (1/M) Σ_{j∈J_M(i)} Y_j ] + (1 − D_i) [ (1/M) Σ_{j∈J_M(i)} Y_j − Y_i ] }.    (7)
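A minimal sketch of the estimator in (7) for a scalar covariate (brute-force nearest neighbor search; production implementations, and the variance corrections of Abadie and Imbens (2006), are more involved):

```python
import numpy as np

def nn_matching_ate(y, d, x, M=1):
    """Nearest neighbor matching estimator of the ATE, equation (7),
    for a scalar covariate (brute-force search)."""
    n = len(y)
    effects = np.empty(n)
    for i in range(n):
        opp = np.flatnonzero(d != d[i])            # opposite treatment group
        match = opp[np.argsort(np.abs(x[opp] - x[i]))[:M]]
        y_match = y[match].mean()                  # imputed counterfactual
        effects[i] = (y[i] - y_match) if d[i] == 1 else (y_match - y[i])
    return effects.mean()
```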

Kernel Matching

Interestingly, the estimator of Heckman, Ichimura, and Todd (1997) is not only a nonparametric regression-type estimator but also a matching estimator: it uses the kernel function to measure closeness. To see this, let K(·) in (6) be the Bartlett kernel:²

K(x) = 1 − |x| if |x| ≤ 1, and K(x) = 0 otherwise.

It imputes the counterfactual of treated unit i by:

Ŷ_{0i} = [ Σ_{j∈C} K((X_j − X_i)/h) Y_j ] / [ Σ_{j∈C} K((X_j − X_i)/h) ],

where (X_j − X_i) measures the difference between the covariate of treated unit i and the covariate of untreated unit j. If |X_j − X_i| is large, that is, unit j is distant from unit i in terms of the kernel metric, it receives a smaller weight. When |X_j − X_i| ≥ h, it receives zero weight, and X_j is not included in the imputation of Ŷ_{0i}.

² We use the Bartlett kernel for expository purposes only. In Heckman et al., K should satisfy ∫ z^r K(z) dz = 0 for r up to dim(X); the Bartlett kernel obviously violates this condition.

1.3.3 OLS vs. Matching

In this section we discuss the fundamental differences between the OLS and matching estimators of the ATE. First, OLS uses a linear model to estimate the ATE as well as the effects of the covariates on the response variable, and may therefore suffer from model misspecification. Matching avoids this problem, but raises the question of the user-chosen parameter M. In matching, the role of the covariates is to determine which unit is a good match; hence we can only identify the ATE, without learning the effects of the covariates on the response.

Second, they use different methods to remove selection bias. Matching compares units with similar covariates, directly exploiting the unconfoundedness condition. Let τ̂ be the OLS estimator in the linear model Y_i = α + τ D_i + β′X_i + ε_i; by the Frisch-Waugh-Lovell theorem, τ̂ is the estimate obtained after removing the linear influence of X_i on both Y_i and D_i.

Finally, they differ in which observations enter the imputation. Let X_1 and X_0 denote the covariates of the treatment and control groups, respectively, and let s(X_1) and s(X_0) denote the corresponding supports, where the support is defined as s(X) = {x : f(x) > 0} and f(x) is the pdf. Matching only uses the sample around the overlap of the supports, s(X_1) ∩ s(X_0). If the overlap between s(X_1) and s(X_0) is insufficient, only a limited sample is available to impute the counterfactuals; this situation is termed the support problem. The extreme case is no overlap at all, i.e., s(X_1) ∩ s(X_0) = ∅. For example, if the government launched a subsidy program in which all households with annual income less than 10,000 are required to participate, we cannot match on the income variable because all treated units are low-income families. We will discuss this issue in the section on regression discontinuity designs. By contrast, OLS uses the whole sample to estimate the ATE regardless of whether there is sufficient common support—not because OLS does not suffer from the support problem, but because it assumes a linear model to extrapolate across it.

Because OLS has virtually no mechanism for dealing with the support problem, and because it also estimates the effects of the covariates on the response, it can be very sensitive to the entire distributions of X_1 and X_0. By contrast, only the common support affects the precision of the matching estimator.
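A crude diagnostic sketch for the support problem, intersecting the sample ranges of a scalar covariate in the two groups (range overlap is a rough proxy; comparing histograms or estimated densities is more informative):

```python
import numpy as np

def common_support(x, d):
    """Crude common-support check: intersect the sample ranges of X
    in the treatment and control groups."""
    lo = max(x[d == 1].min(), x[d == 0].min())
    hi = min(x[d == 1].max(), x[d == 0].max())
    if lo > hi:
        return None                               # empty overlap: matching infeasible
    on_support = (x >= lo) & (x <= hi)
    return lo, hi, on_support.mean()              # share of sample usable for matching
```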

1.4 Propensity Score

Without conditioning on the covariates that potentially affect D and (Y_0, Y_1), comparisons between the two groups will be biased, because the selection process induces imbalanced covariate distributions between the treatment and control groups. Two groups are comparable in the statistical sense if their covariate distributions are the same. In the previous section we demonstrated that, under unconfoundedness, overt biases can be removed by conditioning strategies such as regression or matching. Another identification strategy, based on the balancing score and the propensity score, solves the selection bias problem by creating balanced covariate distributions between the two groups. These concepts were first introduced by Rosenbaum and Rubin (1983).

Definition 1.2 (balancing score)
A balancing score b(X) is a function of the observed covariates X such that the conditional distribution of X given b(X) is the same for treated and control units; that is, in Dawid's (1979) notation, X ⊥ D | b(X).

Conditional on a balancing score, the covariate distributions are balanced between the treatment and the control group; hence the groups become comparable. Obviously, X itself is a balancing score; balancing scores are most useful when lower dimensional ones exist.

Definition 1.3 (propensity score)
The propensity score e(x) is the conditional probability of receiving the treatment:

e(x) = P(D = 1 | X = x).

Rosenbaum and Rubin (1983) show that the propensity score is a balancing score.

Theorem 1.1 (Balancing Property)
If D is binary, then X ⊥ D | e(X).

proof:

P(X ≤ x, D = 1 | e(X)) = E[ D · 1{X ≤ x} | e(X) ]
  = E[ E[ D · 1{X ≤ x} | X ] | e(X) ]      (e(X) is measurable w.r.t. X)
  = E[ 1{X ≤ x} · E[D | X] | e(X) ] = e(X) · P(X ≤ x | e(X)).

Moreover,

P(D = 1 | e(X)) = E[D | e(X)] = E[ E[D | X] | e(X) ] = E[e(X) | e(X)] = e(X).

Therefore,

P(X ≤ x, D = 1 | e(X)) = e(X) · P(X ≤ x | e(X)) = P(D = 1 | e(X)) · P(X ≤ x | e(X)).

Alternatively, we can prove the theorem by the following argument. Because P(D = 1 | X, e(X)) = P(D = 1, X | e(X)) / P(X | e(X)), it suffices to show that P(D = 1 | X, e(X)) = P(D = 1 | e(X)). Obviously, P(D = 1 | X, e(X)) = E[D | X, e(X)] = E[D | X] = e(X) = P(D = 1 | e(X)).
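To illustrate the balancing property numerically, the following sketch estimates e(X) with a logit (fitted by a hand-rolled Newton iteration to stay dependency-free; the design is invented) and shows that, within strata of the estimated propensity score, the covariate is approximately balanced across treatment arms:

```python
import numpy as np

def fit_logit(X, d, iters=25):
    """Fit P(D=1|X) = 1/(1+exp(-X'b)) by Newton's method."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ b))
        grad = X.T @ (d - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        b += np.linalg.solve(hess, grad)
    return b

rng = np.random.default_rng(4)
n = 100_000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))          # selection on x

X = np.column_stack([np.ones(n), x])
e_hat = 1 / (1 + np.exp(-X @ fit_logit(X, d)))     # estimated propensity score

print(f"raw imbalance: {x[d == 1].mean() - x[d == 0].mean():+.3f}")
bins = np.digitize(e_hat, np.quantile(e_hat, [0.2, 0.4, 0.6, 0.8]))
for k in range(5):
    s = bins == k
    diff = x[s & (d == 1)].mean() - x[s & (d == 0)].mean()
    print(f"stratum {k}: within-stratum imbalance = {diff:+.3f}")  # near 0
```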

Note that this theorem follows from the definition of the propensity score alone: no distributional assumptions or unconfoundedness conditions are needed to prove it. Rosenbaum and Rubin (1983) further show that the propensity score is the most condensed balancing score; namely, the σ-algebra induced by the propensity score is the coarsest within the class of balancing scores.

Theorem 1.2 (Most Condensed Information)
b(X) is a balancing score, i.e., X ⊥ D | b(X), if and only if b(X) is finer than e(X) in the sense that e(X) = f(b(X)) for some function f.³

proof:
(⇐): Suppose b(X) is finer than e(X) (it induces a finer σ-algebra). Then P(D = 1 | b(X)) = E[D | b(X)] = E[ E[D | X] | b(X) ] = E[e(X) | b(X)] = e(X). Also, P(D = 1 | X, b(X)) = E[D | X, b(X)] = E[D | X] = e(X). Therefore b(X) is a balancing score, following the same argument as in the proof of Theorem 1.1.

(⇒): Suppose b(X) is a balancing score but is not finer than e(X). Then there exist x_1, x_2 with b(x_1) = b(x_2) but e(x_1) ≠ e(x_2). However, this implies P(D = 1 | X = x_1) ≠ P(D = 1 | X = x_2), which means that D and X are not conditionally independent given b = b(x_1) = b(x_2), a contradiction.

Conditioning on a balancing score equalizes the covariate distributions between treated and control units. Intuitively, the selection problem is resolved because the treatment and control groups become comparable after conditioning on b(X). Indeed, there is a formal statement of this intuition:

Theorem 1.3 (Conditional Unconfoundedness)
Suppose Assumptions 1.3 and 1.4 hold. Then (Y_0, Y_1) ⊥ D | b(X). Namely, instead of conditioning on the entire covariate vector X, conditioning solely on b(X) suffices to remove the selection biases.

³ f(X) can only reduce the information in X: the same value of x always yields the same value of f(x), but different values of x may yield the same f(x). Therefore f(X) can only induce a coarser σ-algebra. For example, if f(·) is a constant function, then σ(f(X)) is the trivial σ-algebra.


proof: By Bayes' rule we know that P(D = 1, Y_0, Y_1 | b(X)) = P(D = 1 | Y_0, Y_1