Session 2
By Ziyodullo Parpiev, PhD
Overview
Panel data
How to get to know the data
First, tell Stata that you have panel data using xtset. Stata then reports:
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
      2001 |      2002 Employment status
Employment |
    status |        1        2        3 |    Total
-----------+----------------------------+---------
         1 |      991       15       46 |    1,052
         2 |       20       12        9 |       41
         3 |       56       20      495 |      571
-----------+----------------------------+---------
     Total |    1,067       47      550 |    1,664
$p_{ij} = \Pr\{X_t = j \mid X_{t-1} = i\}$
So to calculate by hand,
$p_{ij} = N_{ij} \Big/ \sum_{j=1}^{n} N_{ij}$
that is, the cell count divided by the row total.
      2001 |      2002 Employment status
Employment |
    status |        1        2        3 |    Total
-----------+----------------------------+---------
         1 |     0.94     0.01     0.04 |     1.00
         2 |     0.49     0.29     0.22 |     1.00
         3 |     0.10     0.04     0.87 |     1.00
-----------+----------------------------+---------
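The by-hand calculation can be sketched in a few lines of Python (not Stata; the function name `transition_probabilities` is only illustrative). It takes the 2001 to 2002 cell counts from the table above and divides each cell by its row total.

```python
# Transition counts: rows are employment status in 2001,
# columns are employment status in 2002 (categories 1, 2, 3).
counts = [
    [991, 15, 46],
    [20, 12, 9],
    [56, 20, 495],
]


def transition_probabilities(table):
    """Return p_ij = N_ij / sum_j N_ij for every cell of a count table."""
    probs = []
    for row in table:
        row_total = sum(row)
        probs.append([cell / row_total for cell in row])
    return probs


probs = transition_probabilities(counts)
for origin, row in enumerate(probs, start=1):
    # Rounded to two decimals for comparison with the probability table.
    print(origin, [round(p, 2) for p in row])
```

Each row of `probs` sums to one, and the rounded values should agree with the probability table.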
[Figure: transition diagrams between the states empl, unemp and olf, with the transition probabilities on the arrows]
Absolute mobility
50%     1511.067                      Mean           1773.253
                        Largest       Std. Dev.      1299.089
75%     2365.493       9230.818
90%     3329.769       9230.818       Variance        1687633
95%     4062.217       9230.818       Skewness       1.836874
99%     6748.689       9230.818       Kurtosis       8.622895

wave = 2
          household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%     207.9433              0
 5%     338.7431              0
10%       460.68              0       Obs                2639
25%       861.67              5       Sum of Wgt.        2639

50%         1508                      Mean           1795.179
                        Largest       Std. Dev.      1229.827
75%     2449.813       8405.636
90%     3414.511       8405.636       Variance        1512476
95%     4103.649       10491.08       Skewness       1.352148
99%     5824.449       10491.08       Kurtosis       6.370836
Type      Year   Boundary 1  (n)     Boundary 2   (n)     Boundary 3    (n)     Boundary 4     (n)
Size      1991   0 - 800     (580)   800 - 1500   (650)   1500 - 2200   (504)   2200 - 9231    (715)
          1992   0 - 800     (580)   800 - 1500   (645)   1500 - 2200   (473)   2200 - 10491   (751)
Quartile  1991   0 - 827     (609)   827 - 1511   (615)   1511 - 2365   (611)   2365 - 9231    (614)
          1992   0 - 862     (610)   862 - 1508   (612)   1508 - 2450   (612)   2450 - 10491   (615)
Mean      1991   0 - 887     (654)   887 - 1773   (814)   1773 - 2660   (506)   2660 - 9231    (475)
          1992   0 - 898     (652)   898 - 1795   (766)   1795 - 2693   (501)   2693 - 10491   (530)
Median    1991   0 - 750     (539)   750 - 1500   (685)   1500 - 2250   (540)   2250 - 9231    (685)
          1992   0 - 746     (536)   746 - 1491   (686)   1491 - 2237   (505)   2262 - 10491   (722)
Warning!
Measurement error
Overview
Types of variable
Time-invariant variables
Those which vary between individuals but hardly ever over time:
Sex
Ethnicity
Parents' social class when you were 14
The type of primary school you attended (once you've become an adult)
Time-varying variables
Those which vary between individuals and also over time:
Income
Health
Psychological wellbeing
Number of children you have
Marital status
Trend variables
Vary between individuals and over time, but in highly predictable ways:
Age
Year
If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample:
The fact that individuals are systematically different from one another (between-individual variation)
The fact that individuals' behaviour varies between observations over time (within-individual variation)
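$$\sum_{i=1}^{k}\sum_{j=1}^{m}(x_{ij}-\bar{x})^2 = \sum_{i=1}^{k}\sum_{j=1}^{m}(x_{ij}-\bar{x}_i)^2 + \sum_{i=1}^{k}\sum_{j=1}^{m}(\bar{x}_i-\bar{x})^2$$
(total variation = within-individual variation + between-individual variation), where the data are k individuals with m observations each:
$$\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{k1} & x_{k2} & \dots & x_{km} \end{pmatrix}$$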
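Remember:
From the variation, you get to the variance, and from the variance you get to the standard deviation:
$SD = \sqrt{T/(N-1)}$
where T is the total variation (sum of squared deviations) and N is the number of observations.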
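As a rough illustration of the two sources of variation, the sketch below (in Python rather than Stata) splits the total sum of squares of a small three-person panel into its within and between parts and then derives the standard deviation. The variable names are illustrative, not part of any Stata command.

```python
from math import sqrt

# Small illustrative panel: k = 3 individuals, m = 6 observations each.
panel = {
    1: [2340, 2405, 2730, 3250, 3705, 4030],
    2: [1885, 2145, 2275, 2470, 2762, 3120],
    3: [780, 1170, 1365, 2405, 2405, 2470],
}

all_values = [x for xs in panel.values() for x in xs]
grand_mean = sum(all_values) / len(all_values)
person_mean = {i: sum(xs) / len(xs) for i, xs in panel.items()}

# Total variation: squared deviations from the grand mean.
total = sum((x - grand_mean) ** 2 for x in all_values)
# Within variation: squared deviations from each person's own mean.
within = sum((x - person_mean[i]) ** 2 for i, xs in panel.items() for x in xs)
# Between variation: squared deviations of person means from the grand mean,
# counted once per observation.
between = sum(len(xs) * (person_mean[i] - grand_mean) ** 2
              for i, xs in panel.items())

# Standard deviation from the total variation: SD = sqrt(T / (N - 1)).
sd = sqrt(total / (len(all_values) - 1))
print(round(total, 1), round(within, 1), round(between, 1), round(sd, 1))
```

The within and between parts add up to the total, which is the identity the decomposition relies on.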
xtsum in Stata
Variable         |      Mean   Std. Dev.         Min         Max |    Observations
-----------------+-----------------------------------------------+----------------
female   overall |  .5397574   .4984321            0           1 |     N =   16324
         between |             .4989059            0           1 |     n =    1237
         within  |                    0     .5397574    .5397574 | T-bar = 13.1964
partner  overall |  .6892954   .4627963            0           1 |     N =   16292
         between |             .4217842            0           1 |     n =    1234
         within  |              .243531     -.244038    1.622629 | T-bar = 13.2026
age      overall |  40.03349   19.74332            0          98 |     N =   19410
         between |             19.27238          6.4    90.93333 |     n =    1294
         within  |              4.31763     31.30015    54.30015 |     T =      15
ue_sick  overall |  .0672924   .2505353            0           1 |     N =   16302
         between |             .1738938            0           1 |     n =    1237
         within  |             .1852756     -.866041    1.000626 | T-bar = 13.1787
LIKERT   overall |  11.26167   5.344825            0          36 |     N =   15661
         between |             3.609665            0    29.69231 |     n =    1225
         within  |             4.030974    -6.738331    35.12834 | T-bar = 12.7845
wave     overall |         8   4.320605            1          15 |     N =   19410
         between |                    0            8           8 |     n =    1294
         within  |             4.320605            1          15 |     T =      15
female: all variation is between.
partner: most variation is between, because it's fairly rare to switch between having and not having a partner.
More on xtsum
In the Observations column:
N = observations with non-missing variable
n = number of individuals
T-bar = average number of time-points
Within min & max refer to individual deviation from own averages, with global averages added back in.
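To see why a within minimum can fall below zero for a 0/1 variable, here is a small Python sketch of an xtsum-style summary. The three-person `partner` panel is hypothetical, and the sample standard deviation from `statistics.stdev` is an assumption about the divisor, so the figures may differ slightly from Stata's conventions.

```python
from statistics import mean, stdev

# Hypothetical 0/1 partner indicator for three people over four waves.
panel = {
    "a": [1, 1, 1, 0],  # loses a partner in the last wave
    "b": [1, 1, 1, 1],  # always has a partner
    "c": [0, 0, 0, 0],  # never has a partner
}

values = [x for xs in panel.values() for x in xs]
overall_mean = mean(values)
person_means = {pid: mean(xs) for pid, xs in panel.items()}
between_values = list(person_means.values())

# Within values: deviation from own average, with the global average added back.
within_values = [x - person_means[pid] + overall_mean
                 for pid, xs in panel.items() for x in xs]

summary = {
    "overall": (stdev(values), min(values), max(values)),
    "between": (stdev(between_values), min(between_values), max(between_values)),
    "within": (stdev(within_values), min(within_values), max(within_values)),
}
for name, (sd, lo, hi) in summary.items():
    print(f"{name:8s} sd={sd:.4f} min={lo:.4f} max={hi:.4f}")
```

Person "a" is usually partnered, so the one wave without a partner lies far below that person's own average, and adding the global average back still leaves a negative within minimum.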
xttab jbstat if nwaves == 15 & jbstat >= 1 & jbstat != 5 & jbstat <= 8

                  Overall             Between            Within
   jbstat |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
 self-emp |     1388     8.66        228    18.45          42.72
 employed |     8982    56.03        974    78.80          68.27
 unemploy |      539     3.36        274    22.17          17.51
  retired |     2687    16.76        314    25.40          58.49
 family c |     1159     7.23        292    23.62          28.97
 ft studt |      718     4.48        271    21.93          42.93
 lt sick, |      558     3.48        105     8.50          39.08
----------+-----------------------------------------------------
    Total |    16031   100.00       2458   198.87          50.28
                               (n = 1236)
What is the difference in income between men and women, and before and after the birth of a child?
How does income change in the time leading up to the birth of a child? (Survival analysis: later in this course!)
[Figure: Income against number of years since leaving school for pid=1, pid=2 and pid=3; panels "OLS: cross-section" and "OLS: pooled"]
Example data:
pid  wave     y   x1
  1     1  2340    0
  1     2  2405    5
  1     3  2730   10
  1     4  3250   15
  1     5  3705   20
  1     6  4030   25
  2     1  1885    5
  2     2  2145   10
  2     3  2275   15
  2     4  2470   20
  2     5  2762   25
  2     6  3120   30
  3     1   780   10
  3     2  1170   15
  3     3  1365   20
  3     4  2405   25
  3     5  2405   30
  3     6  2470   35
[Figure: the same scatter of Income against number of years since leaving school, with the "OLS: pooled" line and the annotations w1, w3 and "u1 ?"]
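$y_i = \alpha + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \dots + x_{iK}\beta_K + u_i + \varepsilon_i$
and consider that you have repeated observations over time:
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$, where $u_i$ is individual-specific and fixed over time
.. and then reduce the complexity of the information available in some way, or add further assumptions. Your options:
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
Not interested in within variation? Use the means of all observations for all persons i:
$\bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i$
Not interested in between variation? Use each person's deviations from their own means:
$(y_{it}-\bar{y}_i) = (x_{it}-\bar{x}_i)\beta + (\varepsilon_{it}-\bar{\varepsilon}_i)$
Interested in both? Well, let's treat $\bar{x}_i$ as an imperfect measure of the person fixed effect and use between variation where within variation is poorly captured:
$(y_{it}-\theta\bar{y}_i) = (1-\theta)\alpha + (x_{it}-\theta\bar{x}_i)\beta + \{(1-\theta)u_i + (\varepsilon_{it}-\theta\bar{\varepsilon}_i)\}$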
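Between estimator
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$\bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i$
It doesn't use as much information as is available in the data (it only uses means).
It is not much used, except to calculate the parameter for random effects, but Stata does this, not you!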
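A minimal sketch of the between estimator, written in Python rather than Stata and using the three-person income example from this session: each person is collapsed to their means and OLS is run on those means. The helper `ols` is an illustrative one-regressor routine, not a library function.

```python
# Three-person example: pid -> list of (x1, y) pairs, one pair per wave.
panel = {
    1: [(0, 2340), (5, 2405), (10, 2730), (15, 3250), (20, 3705), (25, 4030)],
    2: [(5, 1885), (10, 2145), (15, 2275), (20, 2470), (25, 2762), (30, 3120)],
    3: [(10, 780), (15, 1170), (20, 1365), (25, 2405), (30, 2405), (35, 2470)],
}


def ols(xs, ys):
    """One-regressor OLS; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Collapse each person to their means, then regress mean y on mean x.
x_means = [sum(x for x, _ in obs) / len(obs) for obs in panel.values()]
y_means = [sum(y for _, y in obs) / len(obs) for obs in panel.values()]
between_intercept, between_slope = ols(x_means, y_means)
print(round(between_slope, 2))
```

With only three person means the estimate rests on three data points, which illustrates how much information the between estimator discards. In this example the between slope is negative (about -131) even though income rises over time for every person.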
Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x's (via the betas) as opposed to the correlation?
Identical to:
Least Squares Dummy Variables regression: areg y x, absorb(pid)
Include a dummy indicator for each individual; all individual-level differences, including the individual-specific error term $u_i$, will then be captured in the person-specific intercept.
Members of the same family, which you may come across in the literature:
First Differences: regress D.(y x)
For each individual, and each time period's y and x, calculate the difference between the value in this period and that in the last period. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - y_{i,t-1})$ and each $x_{it}$ is replaced by $(x_{it} - x_{i,t-1})$.
Hybrid models: run standard OLS but add the person mean $\bar{x}_i$ of each time-varying variable as an additional regressor.
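The first-differences transformation described above can be sketched in Python using the three-person example from this session. One assumption to flag: in this toy data x1 rises by 5 in every period, so the differenced regressor is constant and would be collinear with an intercept. The sketch therefore fits the differenced regression without a constant, which is not what `regress D.(y x)` does by default.

```python
# Three-person example: pid -> list of (x1, y) pairs, ordered by wave.
panel = {
    1: [(0, 2340), (5, 2405), (10, 2730), (15, 3250), (20, 3705), (25, 4030)],
    2: [(5, 1885), (10, 2145), (15, 2275), (20, 2470), (25, 2762), (30, 3120)],
    3: [(10, 780), (15, 1170), (20, 1365), (25, 2405), (30, 2405), (35, 2470)],
}

dx, dy = [], []
for obs in panel.values():
    # Difference consecutive waves within the same person only,
    # so no difference is ever taken across two different people.
    for (x_prev, y_prev), (x_now, y_now) in zip(obs, obs[1:]):
        dx.append(x_now - x_prev)
        dy.append(y_now - y_prev)

# OLS through the origin on the differenced data (assumption: no constant).
fd_slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
print(len(dx), round(fd_slope, 2))
```

Each person contributes five differences, so 18 observations become 15. With more than two periods the first-difference and fixed-effects estimates generally differ, although both remove $u_i$.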
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$(y_{it}-\bar{y}_i) = (x_{it}-\bar{x}_i)\beta + (\varepsilon_{it}-\bar{\varepsilon}_i)$
pid  wave     y   x1   ybar_i  xbar_i  (y - ybar_i)  (x - xbar_i)
  1     1  2340    0   3076.7    12.5        -736.7         -12.5
  1     2  2405    5   3076.7    12.5        -671.7          -7.5
  1     3  2730   10   3076.7    12.5        -346.7          -2.5
  1     4  3250   15   3076.7    12.5         173.3           2.5
  1     5  3705   20   3076.7    12.5         628.3           7.5
  1     6  4030   25   3076.7    12.5         953.3          12.5
  2     1  1885    5   2442.8    17.5        -557.8         -12.5
  2     2  2145   10   2442.8    17.5        -297.8          -7.5
  2     3  2275   15   2442.8    17.5        -167.8          -2.5
  2     4  2470   20   2442.8    17.5          27.2           2.5
  2     5  2762   25   2442.8    17.5         319.2           7.5
  2     6  3120   30   2442.8    17.5         677.2          12.5
  3     1   780   10   1765.8    22.5        -985.8         -12.5
  3     2  1170   15   1765.8    22.5        -595.8          -7.5
  3     3  1365   20   1765.8    22.5        -400.8          -2.5
  3     4  2405   25   1765.8    22.5         639.2           2.5
  3     5  2405   30   1765.8    22.5         639.2           7.5
  3     6  2470   35   1765.8    22.5         704.2          12.5
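The demeaning in the table can be reproduced with a short Python sketch (not Stata output). It computes the within (fixed-effects) slope on the demeaned data and, for contrast, the pooled OLS slope on the raw data. The helper `slope` is illustrative.

```python
# Three-person example: pid -> list of (x1, y) pairs, one pair per wave.
panel = {
    1: [(0, 2340), (5, 2405), (10, 2730), (15, 3250), (20, 3705), (25, 4030)],
    2: [(5, 1885), (10, 2145), (15, 2275), (20, 2470), (25, 2762), (30, 3120)],
    3: [(10, 780), (15, 1170), (20, 1365), (25, 2405), (30, 2405), (35, 2470)],
}


def slope(xs, ys):
    """OLS slope of y on x (with an intercept)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Within transformation: subtract each person's own mean from x and y.
x_dm, y_dm = [], []
for obs in panel.values():
    x_mean = sum(x for x, _ in obs) / len(obs)
    y_mean = sum(y for _, y in obs) / len(obs)
    x_dm.extend(x - x_mean for x, _ in obs)
    y_dm.extend(y - y_mean for _, y in obs)

within_slope = slope(x_dm, y_dm)

# Pooled OLS ignores who each observation belongs to.
pooled_slope = slope([x for obs in panel.values() for x, _ in obs],
                     [y for obs in panel.values() for _, y in obs])
print(round(within_slope, 1), round(pooled_slope, 1))
```

The within slope is roughly 65, while the pooled slope is less than half of that, because the pooled fit mixes the positive within-person relationship with the negative between-person one.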
[Figure: Fixed effects. Demeaned income against demeaned number of years since leaving school for pid=1, pid=2 and pid=3, with the fitted fixed-effects line y = 65*x1]
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$y_{it} = \alpha + \beta_1 x_{it} + \beta_2 \bar{x}_i + \beta_3 z_i + u_i + \varepsilon_{it}$, where $u_i$ is the residual
Hint: create $\bar{x}_i$ yourself
pid  wave     y   x   z   x_bar
  1     1  2340   1   1    1.5
  1     2  2405   2   1    1.5
  1     3  2730   2   1    1.5
  1     4  3250   2   1    1.5
  1     5  3705   1   1    1.5
  1     6  4030   1   1    1.5
  2     1  1885   0   2    0.66
  2     2  2145   1   2    0.66
  2     3  2275   1   2    0.66
  2     4  2470   1   2    0.66
  2     5  2762   1   2    0.66
  2     6  3120   0   2    0.66
  3     1   780   1   2    0.33
  3     2  1170   1   2    0.33
  3     3  1365   0   2    0.33
  3     4  2405   0   2    0.33
  3     5  2405   0   2    0.33
  3     6  2470   0   2    0.33
$(y_{it}-\theta\bar{y}_i) = (1-\theta)\alpha + (x_{it}-\theta\bar{x}_i)\beta + \{(1-\theta)u_i + (\varepsilon_{it}-\theta\bar{\varepsilon}_i)\}$
Uses both within- and between-group variation, so makes best use of the data and is efficient. Starts off with the idea that using $\bar{x}_i$ is not the best we can do to capture within variation.
The more imprecise the estimate of the person-level variation (as measured by the person mean $\bar{x}_i$), the more we should draw on the information from other units (the overall mean $\bar{x}$).
E.g., when you include a location indicator in your model, you are saying that the
effect on y of moving to a new town is the same as the effect on y of living in
different towns. When you include a female dummy, you are saying that the effect
of being female on y is the same as the effect on y of changing gender.
Fixed-effects regression:
Number of obs      =     24204
Number of groups   =      3317
R-sq (R-square-like statistic):  within = 0.0501   between = 0.1906   overall = 0.1285
Obs per group:  min = 1   avg = 7.3   max = 14
F(5,20882)         =    220.44
Prob > F           =    0.0000
corr(u_i, Xb)      =    0.1561

    LIKERT |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-----------+-----------------------------------------------------------------
    female |  (dropped)
   ue_sick |   1.951485    .1394164   14.00    0.000     1.678218    2.224752
   partner |   -.298668     .118635   -2.52    0.012    -.5312018   -.0661342
       age |   .1141748    .0214403    5.33    0.000     .0721501    .1561994
      age2 |  -.0011833    .0002209   -5.36    0.000    -.0016163   -.0007503
 badhealth |   1.230831    .0428556   28.72    0.000      1.14683    1.314831
     _cons |   6.252975    .4932977   12.68    0.000     5.286073    7.219877
-----------+-----------------------------------------------------------------
   sigma_u |  3.9934565
   sigma_e |  4.0525618
       rho |  .49265449
F test that all u_i = 0:  F = 4.56
Note: the age profile peaks at age 48.
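The "peaks at age 48" annotation follows from the age and age2 coefficients: a quadratic b1*age + b2*age^2 with b2 < 0 reaches its maximum at age = -b1 / (2*b2). A quick check in Python, using the fixed-effects coefficients reported above (the variable names are illustrative):

```python
# Fixed-effects coefficients on age and age squared, as reported above.
b_age = 0.1141748
b_age2 = -0.0011833

# Turning point of b_age * age + b_age2 * age**2:
# set the derivative b_age + 2 * b_age2 * age equal to zero.
peak_age = -b_age / (2 * b_age2)
print(round(peak_age, 1))
```

Because b_age2 is negative, the turning point is a maximum, so predicted LIKERT scores rise with age up to about 48 and fall afterwards.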
Between regression:
Not much used, but useful to compare coefficients with fixed effects
Number of obs      =     24204
Number of groups   =      3317
R-sq:  within = 0.0480   between = 0.2322   overall = 0.1482
Obs per group:  min = 1   avg = 7.3   max = 14
sd(u_i + avg(e_i.)) =  3.833357
F(6,3310)          =    166.80
Prob > F           =    0.0000

    LIKERT |      Coef.   Std. Err.       t    P>|t|   95% CI upper bound
-----------+-------------------------------------------------------------
    female |   1.476659    .1350226   10.94    0.000             1.741395
   ue_sick |   2.038192     .312191    6.53    0.000             2.650299
   partner |  -.0101941    .1777423   -0.06    0.954             .3383019
       age |   .0827335    .0219026    3.78    0.000             .1256775
      age2 |  -.0009489    .0002263   -4.19    0.000            -.0005052
 badhealth |   2.275832    .0926521   24.56    0.000             2.457493
     _cons |   3.953941    .4430909    8.92    0.000             4.822701
The coefficient on partner was negative and significant in the FE model.
In FE, the partner coefficient really measures the events of gaining or losing a partner.
Random-effects regression:
Number of obs      =     24204
Number of groups   =      3317
R-sq:  within = 0.0500   between = 0.2239   overall = 0.1471
Obs per group:  min = 1   avg = 7.3   max = 14
Wald chi2(6)       =   2013.32
Prob > chi2        =    0.0000
theta:  min = 0.1986   5% = 0.1986   median = 0.5482   95% = 0.6629   max = 0.6629

    LIKERT |      Coef.   Std. Err.       z    P>|z|   95% CI upper bound
-----------+-------------------------------------------------------------
    female |   1.493431    .1259931   11.85    0.000             1.740373
   ue_sick |   2.045302    .1271039   16.09    0.000             2.294422
   partner |  -.1947691    .0973734   -2.00    0.045            -.0039207
       age |   .1058038     .014544    7.27    0.000             .1343094
      age2 |  -.0011062    .0001498   -7.39    0.000            -.0008126
 badhealth |   1.433115    .0385506   37.17    0.000             1.508673
     _cons |   5.181864    .3137662   16.52    0.000             5.796835
-----------+-------------------------------------------------------------
   sigma_u |  3.0248563
   sigma_e |  4.0525618
       rho |   .3577895
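The theta and rho values in the random-effects output can be checked by hand. The Python sketch below assumes the usual formula for a group with T observations, theta = 1 - sqrt(sigma_e^2 / (T * sigma_u^2 + sigma_e^2)), together with rho = sigma_u^2 / (sigma_u^2 + sigma_e^2). The function name `theta` is illustrative.

```python
from math import sqrt

# Variance components as reported in the random-effects output above.
sigma_u = 3.0248563
sigma_e = 4.0525618


def theta(n_obs):
    """Quasi-demeaning weight for a person observed n_obs times."""
    return 1 - sqrt(sigma_e ** 2 / (n_obs * sigma_u ** 2 + sigma_e ** 2))


# Share of the total error variance that is due to u_i.
rho = sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)

for n_obs in (1, 7, 14):
    print(n_obs, round(theta(n_obs), 4))
print(round(rho, 4))
```

Under this formula, T = 1 and T = 14 give the reported minimum and maximum of theta (0.1986 and 0.6629), and T = 7 gives the reported median (0.5482). The more observations a person has, the closer theta is to 1 and the closer random effects comes to the within estimator.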
OLS (pooled) regression:
    Source |         SS       df           MS         Number of obs  =   24204
-----------+---------------------------------         F(6, 24197)    =  706.54
     Model | 103583.505        6   17263.9175         Prob > F       =  0.0000
  Residual | 591239.694    24197   24.4344214         R-squared      =  0.1491
-----------+---------------------------------         Adj R-squared  =  0.1489
     Total | 694823.199    24203   28.7081436         Root MSE       =  4.9431

    LIKERT |      Coef.   Std. Err.       t    P>|t|   95% CI upper bound
-----------+-------------------------------------------------------------
    female |   1.409466    .0640651   22.00    0.000             1.535038
   ue_sick |   2.031815    .1240757   16.38    0.000             2.275011
   partner |  -.0751296    .0769271   -0.98    0.329             .0756524
       age |   .0983746    .0103316    9.52    0.000             .1186252
      age2 |  -.0010613    .0001049  -10.12    0.000            -.0008557
 badhealth |   1.841796    .0357165   51.57    0.000             1.911802
     _cons |   4.450393    .2212733   20.11    0.000             4.884102
If the $u_i$ do not vary between individuals, they can be treated as part of $\alpha$ and OLS is fine.
Breusch-Pagan Lagrange multiplier test
H0: Variance of $u_i$ = 0
H1: Variance of $u_i$ not equal to zero
If H0 is not rejected, you can pool the data and use OLS.
Post-estimation test after random effects: xttest0
                  Var     sd = sqrt(Var)
   LIKERT |  28.70814           5.357998
        e |  16.42326           4.052562
        u |  9.149756           3.024856
Test: Var(u) = 0
chi2(1)     =  10816.48
Prob > chi2 =    0.0000
Comparing models
                     FE          RE          BE         OLS
female                     1.49 ***    1.48 ***    1.41 ***
ue_sick        1.95 ***    2.05 ***    2.04 ***    2.03 ***
partner       -0.30 **    -0.19 **    -0.01       -0.08
age            0.11 ***    0.11 ***    0.08 ***    0.10 ***
age2           0.00 ***    0.00 ***    0.00 ***    0.00 ***
badhealth      1.23 ***    1.43 ***    2.28 ***    1.84 ***
_cons          6.25 ***    5.18 ***    3.96 ***    4.45 ***
R-2 within     0.050       0.050       0.048
R-2 between    0.191       0.224       0.232
R-2 overall    0.129       0.147       0.148       0.149