
Research Method

Lecture 7 (Ch14)

Pooled Cross Sections and
Simple Panel Data Methods

1
An independently pooled
cross section
This type of data is obtained by
sampling randomly from a population
at different points in time (usually in
different years)
You can pool the data from different
years and run regressions.
However, you usually include year
dummies.
2
Panel data
This is cross section data
collected at different points in time.
However, the data follow the same
individuals over time.
Panel data let you do a bit more than
a pooled cross section.
You usually include year dummies as
well.

3
Pooling cross sections across
time
As long as the data are collected independently, pooling
them over time causes little problem.
However, the distribution of the independent variables
may change over time. For example, the
distribution of education changes over time.
To account for such changes, you usually
include dummy variables for each year (year
dummies), except for one year, which serves as the base year.
Often the coefficients on the year dummies are themselves of
interest.

4
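
(A minimal Stata sketch of this idea -- the variable names y, x1, x2
and the year values are hypothetical placeholders, not from the data
used below:)

. * pool the cross sections and let Stata create the year dummies,
. * omitting the first year as the base year
. reg y x1 x2 i.year
. * equivalently, generate the dummies by hand and leave one out
. tabulate year, generate(yr)
. reg y x1 x2 yr2-yr5
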
Example 1
Suppose that you would like to see the
changes in the fertility rate over time after
controlling for various characteristics.
The next slide shows the OLS estimates of
the determinants of fertility over time.
(Data: FERTIL1.dta)
The data are collected every other year.
The base year for the year dummies
is 1972.

5
Dependent variable = # kids per woman
. reg kids educ age agesq black east northcen west farm othrural town smcity y74 y76 y80 y82 y84

Source SS df MS Number of obs = 1129


F( 16, 1112) = 10.33
Model 399.265559 16 24.9540975 Prob > F = 0.0000
Residual 2686.24374 1112 2.41568682 R-squared = 0.1294
Adj R-squared = 0.1169
Total 3085.5093 1128 2.73538059 Root MSE = 1.5542

kids Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ -.1287556 .0183209 -7.03 0.000 -.164703 -.0928081


age .535383 .1380659 3.88 0.000 .264484 .8062821
agesq -.0058384 .001561 -3.74 0.000 -.0089013 -.0027756
black 1.077747 .1733806 6.22 0.000 .7375571 1.417937
east .2180929 .1327211 1.64 0.101 -.042319 .4785049
northcen .3616071 .1207846 2.99 0.003 .1246157 .5985984
west .1989796 .1668093 1.19 0.233 -.1283168 .5262761
farm -.0553556 .146947 -0.38 0.706 -.3436803 .2329692
othrural -.1662171 .1751486 -0.95 0.343 -.5098761 .177442
town .0825938 .124396 0.66 0.507 -.1614836 .3266712
smcity .2092197 .1600797 1.31 0.191 -.1048727 .5233121
y74 .301226 .1488953 2.02 0.043 .0090786 .5933735
y76 -.0639849 .1556646 -0.41 0.681 -.3694143 .2414445
y80 -.037886 .1598956 -0.24 0.813 -.3516171 .2758452
y82 -.4892665 .1482989 -3.30 0.001 -.7802437 -.1982893
y84 -.5112715 .1496524 -3.42 0.001 -.8049044 -.2176385
_cons -7.844731 3.038574 -2.58 0.010 -13.80672 -1.882745
6
The number of children a woman
has in 1982 is 0.49 lower than in the
base year (1972). A similar result is found
for 1984.

The year dummies show significant
drops in the fertility rate over time.

7
Example 2
CPS78_85.dta has wage data collected in
1978 and 1985.
We estimate an earnings equation which
includes education, experience,
experience squared, a union dummy, a female
dummy and the year dummy for 1985.
Suppose you want to see whether the gender
gap has changed over time; then you include an
interaction between female and the 1985 dummy,
that is, you estimate the following.

8
log(wage) = β0 + β1(educ)
+ β2(exper) + β3(expersq) + β4(union)
+ β5(female)
+ β6(year85)
+ β7(year85)(female)
You can check whether the gender wage gap in 1985 is different
from the base year (1978) by testing whether β7 is equal to
zero or not.
The gender gap in each period is given by:
- gender gap in the base year (1978) = β5
- gender gap in 1985 = β5 + β7

9
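
(A minimal Stata sketch of how this model can be estimated -- it
assumes CPS78_85.dta already contains the female and y85 dummies,
and the interaction variable is generated by hand:)

. use CPS78_85.dta, clear
. * interaction between the female dummy and the 1985 year dummy
. gen y85fem = y85*female
. reg lwage educ exper expersq union female y85 y85fem
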
. reg lwage educ exper expersq union female y85 y85fem

Source SS df MS Number of obs = 1084


F( 7, 1076) = 113.20
Model 135.328704 7 19.332672 Prob > F = 0.0000
Residual 183.762464 1076 .170782959 R-squared = 0.4241
Adj R-squared = 0.4204
Total 319.091167 1083 .29463635 Root MSE = .41326

lwage Coef. Std. Err. t P>|t| [95% Conf. Interval]

educ .0833217 .0050646 16.45 0.000 .0733841 .0932594


exper .0294761 .0035717 8.25 0.000 .0224679 .0364844
expersq -.0003975 .0000776 -5.12 0.000 -.0005498 -.0002451
union .205237 .0302943 6.77 0.000 .1457945 .2646795
female -.3195333 .0366427 -8.72 0.000 -.3914324 -.2476341
y85 .3530916 .0333324 10.59 0.000 .2876877 .4184954
y85fem .0884046 .0513498 1.72 0.085 -.0123524 .1891616
_cons .3522088 .0763137 4.62 0.000 .2024683 .5019493

The coefficient on the interaction term (y85)(female) is
positive and significant at the 10% level. So the
gender gap appears to have narrowed over time.
Gender gap in 1978 = -0.319
Gender gap in 1985 = -0.319 + 0.088 = -0.231
10
Policy analysis with
pooled cross sections:
The difference in
difference estimator
I will explain a typical policy analysis
with pooled cross section data,
called difference-in-difference
estimation, using an example.

11
Example: Effects of
garbage incinerator on
housing prices
This example is based on studies of
housing prices in North Andover,
Massachusetts.
The rumor that a garbage incinerator
would be built in North Andover began
after 1978. The construction of the
incinerator began in 1981.
You want to examine whether the incinerator
affected housing prices.
12
Our hypothesis is the following.

Hypothesis: The price of houses located near the incinerator
would fall relative to the price of more distant
houses.

For illustration, define a house to be near the
incinerator if it is within 3 miles.
So create the following dummy variable:
nearinc = 1 if the house is near the incinerator
        = 0 otherwise

13
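
(A sketch of how the dummy could be generated in Stata, assuming the
distance to the incinerator site is stored in a variable dist
measured in feet; 3 miles = 15,840 feet:)

. gen nearinc = (dist <= 3*5280)
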
The most naive analysis would be to run the
following regression using only the 1981 data:
price = β0 + β1(nearinc) + u
where price is the real price (i.e., deflated using the CPI to
express it in constant 1978 dollars).
Using KIELMC.dta, the result is the following.
. reg rprice nearinc if year==1981

Source SS df MS Number of obs = 142


F( 1, 140) = 27.73
Model 2.7059e+10 1 2.7059e+10 Prob > F = 0.0000
Residual 1.3661e+11 140 975815048 R-squared = 0.1653
Adj R-squared = 0.1594
Total 1.6367e+11 141 1.1608e+09 Root MSE = 31238

rprice Coef. Std. Err. t P>|t| [95% Conf. Interval]

nearinc -30688.27 5827.709 -5.27 0.000 -42209.97 -19166.58


_cons 101307.5 3093.027 32.75 0.000 95192.43 107422.6

But can we say from this estimation that the incinerator has
negatively affected housing prices?
14
To see this, estimate the same equation
using 1978 data. Note that this is before the
rumor of the incinerator began.
. reg rprice nearinc if year==1978

Source SS df MS Number of obs = 179


F( 1, 177) = 15.74
Model 1.3636e+10 1 1.3636e+10 Prob > F = 0.0001
Residual 1.5332e+11 177 866239953 R-squared = 0.0817
Adj R-squared = 0.0765
Total 1.6696e+11 178 937979126 Root MSE = 29432

rprice Coef. Std. Err. t P>|t| [95% Conf. Interval]

nearinc -18824.37 4744.594 -3.97 0.000 -28187.62 -9461.117


_cons 82517.23 2653.79 31.09 0.000 77280.09 87754.37

Note that the price of houses near the place where the
incinerator was to be built was already lower than that of
houses farther from the location.

So the negative coefficient simply means that the garbage incinerator
was built in a location where housing prices were already low.
15
Now, compare the two regressions.

Year 1978 regression
. reg rprice nearinc if year==1978
(nearinc:  coef. -18824.37,  std. err. 4744.594;  _cons: 82517.23)

Year 1981 regression
. reg rprice nearinc if year==1981
(nearinc:  coef. -30688.27,  std. err. 5827.709;  _cons: 101307.5)

Compared to 1978, the price penalty for houses near the
incinerator is greater in 1981.

Perhaps the increase in the price penalty in 1981 is caused
by the incinerator. This is the basic idea of the
difference-in-difference estimator.
16
The difference-in-difference estimator in
this example may be computed as
follows. I will show you a more general
case later on.

The difference-in-difference
estimator:
δ1
= (coefficient for nearinc in 1981)
− (coefficient for nearinc in 1978)
= −30688.27 − (−18824.37) = −11863.9
So, the incinerator has decreased house prices on
average by about $11,864.
17
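
(A minimal Stata sketch of this computation, storing the nearinc
coefficient from each year's regression and taking the difference:)

. reg rprice nearinc if year==1981
. scalar b81 = _b[nearinc]
. reg rprice nearinc if year==1978
. scalar b78 = _b[nearinc]
. display b81 - b78
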
Note that, in this example, the coefficient for (nearinc) in 1978
is equal to

(Average price of houses near the incinerator)
− (Average price of houses not near the incinerator)

This is because the regression includes only one dummy
variable (just recall Ex. 1 of homework 2).

Therefore the difference-in-difference estimator δ1 in this
example can be written as

δ1 = [(avg price)1981,near − (avg price)1981,far]
   − [(avg price)1978,near − (avg price)1978,far]

This is the reason why the estimator is called the difference-
in-difference estimator.
18
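
(The same number can be recovered from the four group means, for
example with a sketch like the following:)

. * table of mean real prices by year and proximity to the site
. tabulate year nearinc, summarize(rprice) means
. * diff-in-diff = (mean 1981,near - mean 1981,far)
. *             - (mean 1978,near - mean 1978,far)
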
Difference-in-difference
estimator: more general
case
The difference-in-difference estimator can be obtained by running the
following single regression on the pooled sample:

price = β0 + β1(nearinc)
      + β2(year81) + δ1(year81)(nearinc)

δ1 is the difference-in-
difference estimator.
19
. reg rprice nearinc y81 y81nrinc

Source SS df MS Number of obs = 321


F( 3, 317) = 22.25
Model 6.1055e+10 3 2.0352e+10 Prob > F = 0.0000
Residual 2.8994e+11 317 914632739 R-squared = 0.1739
Adj R-squared = 0.1661
Total 3.5099e+11 320 1.0969e+09 Root MSE = 30243

rprice Coef. Std. Err. t P>|t| [95% Conf. Interval]

nearinc -18824.37 4875.322 -3.86 0.000 -28416.45 -9232.293


y81 18790.29 4050.065 4.64 0.000 10821.88 26758.69
y81nrinc -11863.9 7456.646 -1.59 0.113 -26534.67 2806.867
_cons 82517.23 2726.91 30.26 0.000 77152.1 87882.36

The coefficient on y81nrinc is the difference-in-difference estimator.

This form is more general because, in addition to the policy dummy
(nearinc), you can include more variables that affect housing
prices, such as the number of bedrooms. When you include more
variables, δ1 can no longer be expressed as a simple difference-in-
difference of means. However, the interpretation does not change, and
therefore it is still called the difference-in-difference estimator.
20
Natural experiment (or
quasi-experiment)
The difference-in-difference estimator is frequently
used to evaluate the effect of government policy.
Often a government policy affects one group of
people while it does not affect another group of
people. This type of policy change is called a
natural experiment.
For example, the change in the spousal tax deduction
system in Japan, which took place in 1995,
affected married couples but did not affect single
people.

21
The group of people who are affected by
the policy is called the treatment group.
Those who are not affected by the policy are
called the control group.
Suppose that you want to know how the
change in the spousal tax deduction has
affected the hours worked by women.
Suppose you have pooled data on
workers in 1994 and 1995.
The next slide shows the typical procedure
you follow to conduct the difference-in-
difference analysis.

22
Step 1: Create the treatment dummy
such that

Dtreat = 1 if the person is affected by the policy change
       = 0 otherwise.

Step 2: Run the following regression.

(Hours worked) = β0 + β1Dtreat + δ0(year95) + δ1(year95)(Dtreat) + u

δ1 is the difference-in-difference estimator. It shows
the effect of the policy change on women's
hours worked.
23
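
(A hedged Stata sketch of these two steps -- the variable names hours,
married, and year95 are hypothetical placeholders for whatever the
actual data contain:)

. * Step 1: treatment dummy (married women are the affected group here)
. gen Dtreat = (married==1)
. * Step 2: interaction and the difference-in-difference regression
. gen y95treat = year95*Dtreat
. reg hours Dtreat year95 y95treat
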
Two-period panel data
analysis
Motivation:
Remember the effects of the employee training grant
on the scrap rate. You estimated the following
model using the 1988 data:
log(scrap) = β0 + β1(grant) + β2 log(sales) + β3 log(employment) + v
. reg lscrap grant lsales lemploy if year==1988

Source SS df MS Number of obs = 50


F( 3, 46) = 1.18
Model 6.8054029 3 2.26846763 Prob > F = 0.3270
Residual 88.2852083 46 1.91924366 R-squared = 0.0716
Adj R-squared = 0.0110
Total 95.0906112 49 1.94062472 Root MSE = 1.3854

lscrap Coef. Std. Err. t P>|t| [95% Conf. Interval]

grant -.0517781 .4312869 -0.12 0.905 -.9199137 .8163574


lsales -.4548425 .3733152 -1.22 0.229 -1.206287 .2966021
lemploy .6394289 .3651366 1.75 0.087 -.095553 1.374411
_cons 4.986779 4.655588 1.07 0.290 -4.384433 14.35799

You did not find evidence that receiving the grant
reduces the scrap rate.
24
The reason we did not find a significant effect
is probably an endogeneity problem.
Companies with low-ability workers tend to apply
for the grant, which creates a positive bias in the
estimation. If you observed the average ability of the
workers, you could eliminate the bias by including the
ability variable. But since you cannot observe
ability, you have the following situation:

log(scrap) = β0 + β1(grant) + β2 log(sales) + β3 log(employment) + v

where ability is in the error term v = (β4·ability + u), which
is called the composite error term.
25
log(scrap) = β0 + β1(grant) + β2 log(sales) + β3 log(employment) + (β4·ability + u)

where the term in parentheses is the composite error v.

Because ability and grant are
correlated (negatively), this causes a
bias in the coefficient for (grant).
We predicted the direction of the bias in
the following way:

E(β̃1) = β1 + β4·δ̃1

- β1: the true effect of grant (−)
- β4: the effect of ability on the scrap rate (−)
- δ̃1: the slope from regressing ability on grant; its sign is
  determined by the correlation between ability and grant (−)
- β4·δ̃1: the bias term (+)

The true negative effect of grant is cancelled out by
the positive bias term. Thus, the bias makes it difficult to
find the effect.
26
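
(This bias can be illustrated with a small simulated example -- the
numbers below are purely hypothetical and are not the JTRAIN data.
Firms with low-ability workers are made more likely to receive the
grant, so leaving ability out biases the grant coefficient upward:)

. clear
. set seed 1
. set obs 500
. gen ability = rnormal()
. * low-ability firms are more likely to apply for and receive the grant
. gen grant = (ability + rnormal() < 0)
. * true grant effect on log scrap is -0.3; higher ability lowers scrap
. gen lscrap = 1 - 0.3*grant - 0.5*ability + rnormal()
. * ability omitted: the grant coefficient is biased upward
. reg lscrap grant
. * ability included: the grant coefficient is close to the true -0.3
. reg lscrap grant ability
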
Now you know that there is a bias. Is
there anything we can do to correct
for the bias?
When you have panel data, you can
eliminate the bias.
I will explain the method using this
example, and generalize it later.

27
Eliminating bias using two
period panel data
Now, go back to the equation:

log(scrap) = β0 + β1(grant) + β2 log(sales) + β3 log(employment) + v,
where v = β4·ability + u.

The grant is administered in 1988.
Suppose that you have panel data
on firms for two periods, 1987 and
1988.
Further assume that the average
ability of workers does not change
over time, so (ability) is interpreted
as the innate ability of the workers.
28
When you have the two-period panel
data, the equation can be written as:

log(scrap)it = β0 + β1(grant)it + β2 log(sales)it + β3 log(employment)it
             + β5(year88)it + (β4·abilityi + uit)

where vit = β4·abilityi + uit.

i is the index for the ith firm; t is the index for
the period.
Since ability is constant over time, ability
has only an i index.
Now, I will use a shorthand notation for
β4(ability)i. Since (ability) is assumed constant over
time, write β4(ability)i = ai. Then the above equation can
be written as:
29
log(scrap)it = β0 + β1(grant)it + β2 log(sales)it + β3 log(employment)it
             + β5(year88)it + (ai + uit)

where vit = ai + uit.

ai is called the fixed effect, or the unobserved
effect. If you want to emphasize that it is an
unobserved firm characteristic, you can call it the
firm fixed effect as well.
uit is called the idiosyncratic error.
Now, the bias in OLS occurs because the fixed effect is
correlated with (grant).
So if we can get rid of the fixed effect, we can
eliminate the bias. This is the basic idea.
In the next slide, I will show the procedure of what is
called the first-differenced estimation.
30
First, for each firm, take the first
difference. That is, compute the following:

Δlog(scrap)it = log(scrap)it − log(scrap)it−1

It follows that

Δlog(scrap)it = β0 + β1(grant)it + β2 log(sales)it + β3 log(employment)it
              + β5(year88)it + (ai + uit)
              − [β0 + β1(grant)it−1 + β2 log(sales)it−1
              + β3 log(employment)it−1 + β5(year88)it−1 + (ai + uit−1)]

              = β1Δ(grant)it + β2Δlog(sales)it + β3Δlog(employment)it
              + β5Δ(year88)it + Δuit

This is the first-differenced equation.
31
So, by taking the first difference, you
can eliminate the fixed effect:

Δlog(scrap)it = β1Δ(grant)it + β2Δlog(sales)it + β3Δlog(employment)it + β5Δ(year88)it + Δuit

If Δuit is not correlated with Δ(grant)it,
estimating the first-differenced model by
OLS will produce unbiased estimates. If we
have controlled for enough time-varying
variables, it is reasonable to assume that
they are uncorrelated.

Note that this model does not have a
constant.
32
. **************************
. * Declare panel          *
. **************************
. tsset fcode year
       panel variable:  fcode (strongly balanced)
        time variable:  year, 1987 to 1989
                delta:  1 unit

. ******************************
. * Generate first differenced *
. * variables                  *
. ******************************
. gen difflscrap=lscrap-L.lscrap
(363 missing values generated)

. gen diffgrant=grant-L.grant
(157 missing values generated)

. gen difflsales=lsales-L.lsales
(226 missing values generated)

. gen difflemploy=lemploy-L.lemploy
(181 missing values generated)

. gen diffd88=d88-L.d88
(157 missing values generated)

. **********************
. * Run the regression *
. **********************
. reg difflscrap diffgrant difflsales difflemploy diffd88 if year<=1988, nocons

(When you use the nocons option, Stata omits the constant term.)

Source SS df MS Number of obs = 47


F( 4, 43) = 1.82
Model 2.71885438 4 .679713595 Prob > F = 0.1428
Residual 16.0749657 43 .373836411 R-squared = 0.1447
Adj R-squared = 0.0651
Total 18.79382 47 .399868511 Root MSE = .61142

difflscrap Coef. Std. Err. t P>|t| [95% Conf. Interval]

diffgrant -.3223172 .1879101 -1.72 0.093 -.701274 .0566396


difflsales -.1733036 .365626 -0.47 0.638 -.9106586 .5640514
difflemploy .0233784 .5064015 0.05 0.963 -.9978775 1.044634
diffd88 -.0272418 .120639 -0.23 0.822 -.2705336 .2160501

Now, the coefficient on grant is negative and significant at the
10% level.
33
Note that, when you use this method in your
research, it is a good idea to tell your audience
what the potential fixed effect would be and
whether it is correlated with the
explanatory variables. In this example,
unobserved ability is potentially an important
source of the fixed effect.
Of course, one can never tell exactly what the
fixed effect is, since it is the aggregate of
all the unobserved effects. However, if you explain
what is contained in the fixed effect, your
audience can understand the potential direction
of the bias and why you need to use the first-
differenced method.

34
General case
The first-differenced model in a more general
situation can be written as follows:

yit = β0 + β1xit1 + β2xit2 + … + βkxitk + ai + uit

where ai is the fixed effect.
If ai is correlated with any of the explanatory variables,
the estimated coefficients will be biased. So take the
first difference to eliminate ai, then estimate the
following model by OLS:

Δyit = β1Δxit1 + β2Δxit2 + … + βkΔxitk + Δuit
35
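
(A hedged sketch of how this model can be estimated in Stata without
generating the differenced variables by hand, using the built-in
difference operator D. -- the id, year, y, x1, and x2 names are
hypothetical:)

. tsset id year
. * D. takes the first difference within each panel unit
. reg D.y D.x1 D.x2, nocons
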
Note that when you take the first
difference, the constant term is
also eliminated, so you should
use the nocons option in Stata when
you estimate the model.
When some variables are time-
invariant, these variables are also
eliminated. If the treatment variable
does not change over time, you
cannot use this method.

36
First differencing for more
than two periods.
You can use first differencing for more
than two periods.
You just have to difference two adjacent
periods successively.
For example, suppose that you have 3
periods. Then for the dependent variable,
you compute Δyi2 = yi2 − yi1 and Δyi3 = yi3 − yi2.
Do the same for the x-variables. Then run the
regression.
37
Exercise
The data ezunem.dta contain city-level
unemployment claims statistics for the state of
Indiana. The data also contain information
about whether each city has an enterprise zone
or not.
An enterprise zone is an area that
encourages businesses and investment
through reduced taxes and restrictions.
Enterprise zones are usually created in
economically depressed areas with the purpose
of increasing economic activity and
reducing unemployment.

38
Using the data ezunem.dta, you are asked to estimate the
effect of enterprise zones on city-level unemployment
claims. Use the log of unemployment claims as the
dependent variable.

Ex1. First estimate the following model using OLS:

log(unemployment claims)it = β0 + β1(enterprise zone)it
                           + (year dummies) + vit

Discuss whether the coefficient for enterprise zone is biased
or not. If you think it is biased, what is the direction of the
bias?

Ex2. Estimate the model using the first-difference method.
Did it change the result? Was your prediction of the bias
correct?

39
OLS results
. reg luclms ez d81 d82 d83 d84 d85 d86 d87 d88

Source SS df MS Number of obs = 198


F( 9, 188) = 11.44
Model 35.5700512 9 3.95222791 Prob > F = 0.0000
Residual 64.9262278 188 .345352276 R-squared = 0.3539
Adj R-squared = 0.3230
Total 100.496279 197 .510133396 Root MSE = .58767

luclms Coef. Std. Err. t P>|t| [95% Conf. Interval]

ez -.0387084 .1148501 -0.34 0.736 -.2652689 .187852


d81 -.3216319 .1771882 -1.82 0.071 -.6711645 .0279007
d82 .1354957 .1771882 0.76 0.445 -.2140369 .4850283
d83 -.2192554 .1771882 -1.24 0.217 -.568788 .1302772
d84 -.5970717 .1799355 -3.32 0.001 -.9520237 -.2421197
d85 -.6216534 .1847186 -3.37 0.001 -.986041 -.2572658
d86 -.6511313 .1847186 -3.52 0.001 -1.015519 -.2867437
d87 -.9188151 .1847186 -4.97 0.000 -1.283203 -.5544275
d88 -1.2575 .1847186 -6.81 0.000 -1.621887 -.893112
_cons 11.69439 .125291 93.34 0.000 11.44724 11.94155

40
First differencing

. reg lagluclms lagez lagd81 lagd82 lagd83 lagd84 lagd85 lagd86 lagd87 lagd88, nocons

Source SS df MS Number of obs = 176


F( 9, 167) = 41.31
Model 17.3537634 9 1.92819594 Prob > F = 0.0000
Residual 7.79583815 167 .046681666 R-squared = 0.6900
Adj R-squared = 0.6733
Total 25.1496016 176 .142895463 Root MSE = .21606

lagluclms Coef. Std. Err. t P>|t| [95% Conf. Interval]

lagez -.1818775 .0781862 -2.33 0.021 -.3362382 -.0275169


lagd81 -.3216319 .046064 -6.98 0.000 -.4125748 -.2306891
lagd82 .1354957 .0651444 2.08 0.039 .0068831 .2641083
lagd83 -.2192554 .0797852 -2.75 0.007 -.3767731 -.0617378
lagd84 -.5580256 .0945636 -5.90 0.000 -.7447196 -.3713315
lagd85 -.5565765 .108961 -5.11 0.000 -.7716951 -.3414579
lagd86 -.5860544 .1182979 -4.95 0.000 -.8196066 -.3525023
lagd87 -.8537383 .1269499 -6.72 0.000 -1.104372 -.6031047
lagd88 -1.192423 .1350488 -8.83 0.000 -1.459046 -.9257998

41
The do file used to generate the results.

tsset city year

reg luclms ez d81 d82 d83 d84 d85 d86 d87 d88

gen lagluclms =luclms -L.luclms


gen lagez =ez -L.ez
gen lagd81 =d81 -L.d81
gen lagd82 =d82 -L.d82
gen lagd83 =d83 -L.d83
gen lagd84 =d84 -L.d84
gen lagd85 =d85 -L.d85
gen lagd86 =d86 -L.d86
gen lagd87 =d87 -L.d87
gen lagd88 =d88 -L.d88

reg lagluclms lagez lagd81 lagd82 lagd83 lagd84 lagd85 lagd86 lagd87
lagd88, nocons

42
The assumptions for the
first difference method.
Assumption FD1: Linearity

For each i, the model is written as

yit = β0 + β1xit1 + … + βkxitk + ai + uit

43
Assumption FD2:

We have a random sample from the
cross section.

Assumption FD3:
There is no perfect collinearity. In
addition, each explanatory variable
changes over time at least for some i
in the sample.

44
Assumption FD4: Strict exogeneity

E(uit|Xi, ai) = 0 for each t,

where Xi is shorthand notation for all
the explanatory variables for the ith
individual in all time periods.

This means that uit is uncorrelated with
the current year's explanatory variables
as well as with other years' explanatory
variables.
45
The unbiasedness of the first-
difference method
Under FD1 through FD4, the
estimated parameters of the first-
difference method are unbiased.

46
Assumption FD5: Homoskedasticity
Var(uit|Xi) = σ2

Assumption FD6: No serial correlation
within the ith individual:
Cov(uit, uis) = 0 for t ≠ s

Note that FD2 assumes random sampling across
different individuals, but does not assume
randomness within each individual. So you
need an additional assumption to rule out
serial correlation.

47
