Panel Data Session2

Panel data methods in Stata
Session 2
By Ziyodullo Parpiev, PhD
Overview
Panel data
How to get to know the data
Change over time

Tabulating
Calculating transition probabilities
Using panel data in Stata
Data on n cases, over t time periods, giving a total of n × t observations
One record per observation, i.e. long format
Stata tools for analyzing panel data begin with the prefix xt
First need to tell Stata that you have panel data using xtset
Complete and incomplete person-wave data

+------------------------------------------------------------+
|      pid  wave     sex  age  mastat    jbstat       fihhmn |
|------------------------------------------------------------|
| 10019057     1  female   59  never ma  retired         780 |
| 10019057     2  female   60  never ma  retired      759.14 |
| 10019057     3  female   61  never ma  retired       923.5 |
| 10019057     4  female   62  never ma  retired        62.5 |
| 10019057     5  female   63  never ma  retired         663 |
| 10019057     6  female   64  never ma  retired   missing o |
| 10019057     7  female   65  never ma  retired    1254.963 |
| 10019057     8  female   66  never ma  retired    1270.432 |
| 10019057     9  female   67  never ma  retired    1364.555 |
| 10019057    10  female   67  never ma  retired     1479.74 |
| 10019057    11  female   68  never ma  retired     1328.25 |
| 10019057    12  female   69  never ma  retired     1371.49 |
| 10019057    13  female   71  never ma  retired   missing o |
| 10019057    14  female   71  never ma  retired    1372.333 |
| 10019057    15  female   73  never ma  retired    1475.812 |
|------------------------------------------------------------|
| 10028005     1    male   30  never ma  employed   1501.155 |
| 10028005     2    male   31  never ma  employed   1636.259 |
| 10028005     3    male   32  never ma  employed   1943.283 |
| 10028005     6    male   35  never ma  employed    2001.54 |
| 10028005     7    male   36  never ma  employed    1634.33 |
| 10028005     9    male   38  never ma  employed   1587.945 |
+------------------------------------------------------------+
Telling Stata you have panel data

. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit

pid: unique cross-wave identifier
wave: time variable
(unbalanced): cases not observed for every time period
delta: period between observations in units of the time variable
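A minimal sketch of the declaration step, using a hypothetical five-row toy panel rather than the BHPS extract shown above. The variable name income and all values are illustrative assumptions. Person 2 is not observed in wave 2, so xtset should describe the panel as unbalanced and the time variable as having gaps.

```stata
* Hypothetical toy panel (not the BHPS data): person 2 is missing wave 2
clear
input pid wave income
1 1  780
1 2  759
1 3  923
2 1 1501
2 3 1943
end

xtset pid wave      // declare the panel identifier and the time variable
xtdescribe          // participation patterns (xtdes is the abbreviated form)
```

Long format is what makes this work: each row is one person-wave, so pid and wave together identify a record.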
Describing the patterns in panel data

. xtdes, patterns(20)

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------------
     1294    28.12    28.12 |  111111111111111
      248     5.39    33.51 |  1..............
      157     3.41    36.93 |  11.............
      115     2.50    39.43 |  ..............1
      105     2.28    41.71 |  111............
      104     2.26    43.97 |  1111...........
       73     1.59    45.56 |  11111..........
       69     1.50    47.05 |  ............111
       66     1.43    48.49 |  ..........11111
       62     1.35    49.84 |  .............11
       60     1.30    51.14 |  .1.............
       60     1.30    52.45 |  11111111111....
       58     1.26    53.71 |  11111111.......
       58     1.26    54.97 |  111111111......
       57     1.24    56.21 |  11111111111111.
       55     1.20    57.40 |  .....1.........
       54     1.17    58.57 |  ........1111111
       54     1.17    59.75 |  .11111111111111
       54     1.17    60.92 |  1111111111.....
       53     1.15    62.07 |  .........111111
     1745    37.93   100.00 | (other patterns)
 ---------------------------+-----------------
     4601   100.00          |  XXXXXXXXXXXXXXX
Examining change over two waves

      1991 |
Employment |      1992 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       961         35         76 |     1,072
         2 |        36         38         24 |        98
         3 |        40         23        524 |       587
-----------+---------------------------------+----------
     Total |     1,037         96        624 |     1,757

      2001 |
Employment |
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       991         15         46 |     1,052
         2 |        20         12          9 |        41
         3 |        56         20        495 |       571
-----------+---------------------------------+----------
     Total |     1,067         47        550 |     1,664
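Calculating transition probabilities

The transition probability is the probability of transitioning from one state to another:
$p_{ij} = \Pr\{X_t = j \mid X_{t-1} = i\}$
So to calculate by hand, divide the cell count by the row total:
$p_{ij} = N_{ij} \big/ \sum_{j=1}^{n} N_{ij}$
where $N_{ij}$ is the cell count and $\sum_{j=1}^{n} N_{ij}$ is the row total.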
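As a worked instance of the by-hand formula, take the first row of the 1991 to 1992 employment-status table, with cell counts 961, 35 and 76 and row total 1,072:

```latex
\[
p_{11} = \frac{961}{1072} \approx 0.8965, \qquad
p_{12} = \frac{35}{1072} \approx 0.0326, \qquad
p_{13} = \frac{76}{1072} \approx 0.0709
\]
```

The three probabilities sum to one, as every row of a transition probability matrix must.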
Transition probability matrix

      1991 |
Employment |
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |      0.90       0.03       0.07 |      1.00
         2 |      0.37       0.39       0.24 |      1.00
         3 |      0.07       0.04       0.89 |      1.00
-----------+---------------------------------+----------

      2001 |
Employment |
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |      0.94       0.01       0.04 |      1.00
         2 |      0.49       0.29       0.22 |      1.00
         3 |      0.10       0.04       0.87 |      1.00
-----------+---------------------------------+----------
Transition probability matrices in Stata

Leaving out the if statement gives the mean transition probabilities for all waves t to t+1

. xttrans jbstat if wave<3, freq

   current |
  economic |      current economic activity
  activity |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       961         35         76 |     1,072
           |     89.65       3.26       7.09 |    100.00
-----------+---------------------------------+----------
         2 |        36         38         24 |        98
           |     36.73      38.78      24.49 |    100.00
-----------+---------------------------------+----------
         3 |        40         23        524 |       587
           |      6.81       3.92      89.27 |    100.00
-----------+---------------------------------+----------
     Total |     1,037         96        624 |     1,757
           |     59.02       5.46      35.52 |    100.00
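A small sketch of the same idea on a hypothetical three-person, three-wave toy panel (the variable state and its values are made up for illustration). It runs xttrans with and without an if restriction and then rebuilds the transition table by hand with the lead operator F., which should agree with xttrans in a panel without gaps like this one.

```stata
* Hypothetical toy data: state takes the values 1, 2, 3
clear
input pid wave state
1 1 1
1 2 1
1 3 2
2 1 2
2 2 1
2 3 1
3 1 3
3 2 3
3 3 1
end
xtset pid wave

xttrans state, freq                // all t -> t+1 transitions pooled
xttrans state if wave < 3, freq    // only the wave 1 -> wave 2 transition

* By hand: the state in the next wave, then row percentages
generate next_state = F.state      // F. is the one-period lead, needs xtset
tabulate state next_state, row
```

The row percentages of the final table are the cell counts divided by the row totals, i.e. the by-hand calculation from the earlier slide.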
Change in a categorical variable over time

A decision tree
[Figure: decision tree of transitions between the states empl, unemp and olf, with each branch labelled by its transition probability]
Change in a continuous variable over time
Size transition matrix
Quantile transition matrix
Mean transition matrix
Median transition matrix
Size transition matrix
Absolute mobility
Boundaries set exogenously, i.e. predetermined
e.g. movement in and out of poverty
e.g. poverty defined a priori as an income below 5,000
Do not depend on the distribution under investigation
e.g. comparing mobility in the 1990s and 2000s
Incorporates both movements in the positions of individuals and economic growth
Quantile transition matrix
Mobility as a relative concept
Same number of individuals in each class
Only records movements involving re-ranking
Cannot take account of economic growth, for example when comparing matrices
Cannot draw a complete picture if comparing mobility in different cohorts/countries/welfare regimes
Mean/median transition matrices
Both absolute and relative approaches incorporated into matrices
Class boundaries defined as percentages of mean or median income of the origin and destination distributions
Example: 25%, 50%, 75% of median income
Note that this is not the same as quartiles
Example: income 1991-1992

wave = 1
         household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%       181.86              0
 5%       349.82              0
10%       458.98              0       Obs                2795
25%     826.6895              0       Sum of Wgt.        2795

50%     1511.067                      Mean           1773.253
                        Largest       Std. Dev.      1299.089
75%     2365.493       9230.818
90%     3329.769       9230.818       Variance        1687633
95%     4062.217       9230.818       Skewness       1.836874
99%     6748.689       9230.818       Kurtosis       8.622895

wave = 2
         household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%     207.9433              0
 5%     338.7431              0
10%       460.68              0       Obs                2639
25%       861.67              5       Sum of Wgt.        2639

50%         1508                      Mean           1795.179
                        Largest       Std. Dev.      1229.827
75%     2449.813       8405.636
90%     3414.511       8405.636       Variance        1512476
95%     4103.649       10491.08       Skewness       1.352148
99%     5824.449       10491.08       Kurtosis       6.370836
Category boundaries for each method

Matrix    Year  Boundary 1   (n)    Boundary 2    (n)    Boundary 3    (n)    Boundary 4     (n)
Size      1991  0 - 800     (580)   800 - 1500   (650)   1500 - 2200  (504)   2200 - 9231   (715)
          1992  0 - 800     (580)   800 - 1500   (645)   1500 - 2200  (473)   2200 - 10491  (751)
Quartile  1991  0 - 827     (609)   827 - 1511   (615)   1511 - 2365  (611)   2365 - 9231   (614)
          1992  0 - 862     (610)   862 - 1508   (612)   1508 - 2450  (612)   2450 - 10491  (615)
Mean      1991  0 - 887     (654)   887 - 1773   (814)   1773 - 2660  (506)   2660 - 9231   (475)
          1992  0 - 898     (652)   898 - 1795   (766)   1795 - 2693  (501)   2693 - 10491  (530)
Median    1991  0 - 750     (539)   750 - 1500   (685)   1500 - 2250  (540)   2250 - 9231   (685)
          1992  0 - 746     (536)   746 - 1491   (686)   1491 - 2237  (505)   2262 - 10491  (722)
Warning!
Measurement error
Causes an over-estimation of mobility
If mothers' and babies' weights are reported to the nearest half pound, the rounding can affect which band an observation falls in
A respondent may describe their marital status as separated in year 1 and single in year 2
Overview
Types of questions, types of variables: time-invariant, time-varying and trend
Between- and within-individual variation
Concept of individual heterogeneity
From OLS to models that allow causal interpretations: fixed effects and random effects models
The basics of implementing these models in Stata
Types of variable
Those which vary between individuals but hardly ever over time
  Sex
  Ethnicity
  Parents' social class when you were 14
  The type of primary school you attended (once you've become an adult)
Those which vary over time, but not between individuals
  The retail price index
  National unemployment rates
  Age, in a cohort study
Those which vary both over time and between individuals
  Income
  Health
  Psychological wellbeing
  Number of children you have
  Marital status
Trend variables
  Vary between individuals and over time, but in highly predictable ways:
  Age
  Year
Between- and within-individual variation
If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample:
The fact that individuals are systematically different from one another (between-individual variation)
The fact that individuals' behaviour varies between observations over time (within-individual variation)
The observations on k individuals over m time periods form the array
$\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{k1} & x_{k2} & \dots & x_{km} \end{pmatrix}$
Total variation is the sum, over all individuals and years, of the square of the difference between each observation of x and the mean:
$T = \sum_{i=1}^{k} \sum_{j=1}^{m} (x_{ij} - \bar{x})^2$
Within variation is the sum of the squares of the deviations of each individual's observations from his or her own mean:
$\sum_{i=1}^{k} \sum_{j=1}^{m} (x_{ij} - \bar{x}_i)^2$
Between variation is the sum of squares of differences between individual means and the whole-sample mean:
$\sum_{i=1}^{k} \sum_{j=1}^{m} (\bar{x}_i - \bar{x})^2$
Remember:
From the variation, you get to the variance, and from the variance to the standard deviation:
$SD = \sqrt{T/(N - 1)}$
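The two sources of variation add up to the total. A short derivation, written with the symbols used above ($\bar{x}_i$ for the mean of individual i, $\bar{x}$ for the whole-sample mean), makes this explicit:

```latex
\begin{align*}
\sum_{i=1}^{k}\sum_{j=1}^{m} (x_{ij}-\bar{x})^2
  &= \sum_{i=1}^{k}\sum_{j=1}^{m} \bigl[(x_{ij}-\bar{x}_i) + (\bar{x}_i-\bar{x})\bigr]^2 \\
  &= \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{m} (x_{ij}-\bar{x}_i)^2}_{\text{within}}
   + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{m} (\bar{x}_i-\bar{x})^2}_{\text{between}}
   + 2\sum_{i=1}^{k} (\bar{x}_i-\bar{x}) \sum_{j=1}^{m} (x_{ij}-\bar{x}_i)
\end{align*}
```

The last term vanishes because, for each individual, the deviations from his or her own mean sum to zero, so total variation equals within variation plus between variation.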
xtsum in Stata
Similar to the ordinary sum command

. xtset pid wave
       panel variable:  pid (unbalanced)
                delta:  1 unit

Have chosen a balanced sample (if nwaves == 15)

. xtsum female partner age ue_sick LIKERT wave if nwaves == 15

Variable         |      Mean   Std. Dev.        Min        Max |    Observations
-----------------+---------------------------------------------+----------------
female   overall |  .5397574   .4984321          0          1 |     N =   16324
         between |             .4989059          0          1 |     n =    1237
         within  |                    0   .5397574   .5397574 | T-bar = 13.1964
                 |                                             |
partner  overall |  .6892954   .4627963          0          1 |     N =   16292
         between |             .4217842          0          1 |     n =    1234
         within  |              .243531   -.244038   1.622629 | T-bar = 13.2026
                 |                                             |
age      overall |  40.03349   19.74332          0         98 |     N =   19410
         between |             19.27238        6.4   90.93333 |     n =    1294
         within  |              4.31763   31.30015   54.30015 |     T =      15
                 |                                             |
ue_sick  overall |  .0672924   .2505353          0          1 |     N =   16302
         between |             .1738938          0          1 |     n =    1237
         within  |             .1852756   -.866041   1.000626 | T-bar = 13.1787
                 |                                             |
LIKERT   overall |  11.26167   5.344825          0         36 |     N =   15661
         between |             3.609665          0   29.69231 |     n =    1225
         within  |             4.030974  -6.738331   35.12834 | T-bar = 12.7845
                 |                                             |
wave     overall |         8   4.320605          1         15 |     N =   19410
         between |                    0          8          8 |     n =    1294
         within  |             4.320605          1         15 |     T =      15

female: all variation is between
partner: most variation is between, because it's fairly rare to switch between having and not having a partner
wave: all variation is within, because this is a balanced sample
More on xtsum

Reading the output of xtsum female partner age ue_sick LIKERT wave if nwaves == 15:
N: observations with a non-missing value of the variable
n: number of individuals
T-bar: average number of time-points
between row: min & max refer to the individual means $\bar{x}_i$
within row: min & max refer to individual deviations from own averages, with the global average added back in
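The between and within rows can be rebuilt by hand, which helps to see what they measure. The sketch below uses the small three-person example data from the OLS slides later in this session (pid, wave, y, x1); the helper names ybar_i, y_within and one_per_pid are hypothetical. It is an illustration of the definitions above, not a replacement for xtsum.

```stata
* Three-person example data from the OLS slides in this session
clear
input pid wave y x1
1 1 2340  0
1 2 2405  5
1 3 2730 10
1 4 3250 15
1 5 3705 20
1 6 4030 25
2 1 1885  5
2 2 2145 10
2 3 2275 15
2 4 2470 20
2 5 2762 25
2 6 3120 30
3 1  780 10
3 2 1170 15
3 3 1365 20
3 4 2405 25
3 5 2405 30
3 6 2470 35
end
xtset pid wave
xtsum y

* Between: spread of the person means, one value per person
egen double ybar_i = mean(y), by(pid)
egen byte one_per_pid = tag(pid)
summarize ybar_i if one_per_pid

* Within: deviation from own mean, with the global mean added back in
quietly summarize y
generate double y_within = y - ybar_i + r(mean)
summarize y_within
```

The minimum and maximum of ybar_i and y_within should correspond to the between and within rows that xtsum reports for y.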
The xttab command
For simplicity, omitted jbstats of missing, maternity leave, gov training and other.

. xttab jbstat if nwaves == 15 & jbstat >= 1 & jbstat != 5 & jbstat <= 8

                  Overall             Between            Within
   jbstat |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
 self-emp |    1388      8.66       228     18.45          42.72
 employed |    8982     56.03       974     78.80          68.27
 unemploy |     539      3.36       274     22.17          17.51
  retired |    2687     16.76       314     25.40          58.49
 family c |    1159      7.23       292     23.62          28.97
 ft studt |     718      4.48       271     21.93          42.93
 lt sick, |     558      3.48       105      8.50          39.08
----------+-----------------------------------------------------
    Total |   16031    100.00      2458    198.87          50.28
                               (n = 1236)

Overall: pooled sample, broken down by person/years
Between Freq.: number of people who spent any time in this state
Within Percent: of those who spent any time in this state, the proportion of their time (on average) they spent in it
Which statistical model for panel data?

Your research question will guide which models are most suitable, but the nature of your data is also important:
Is your research question cross-sectional or longitudinal, or both?
Cross-sectional: exploit variation between individuals
Longitudinal: exploit variation within individuals over time and permit causal interpretation of effects, and can consider between variation if needed
What is the effect on income of having more children?
What is the difference in income between individuals who have a different number of children?
What is the difference in income before and after the birth of a child?
What is the difference in income between men and women, and before and after the birth of a child?
How does income change in the time leading up to the birth of a child? (Survival analysis: later in this course!)
Longitudinal analysis is concerned with modelling individual heterogeneity
A very simple concept: people are different!
In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity:
Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions
Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using.
With panel data we can do something about unobserved heterogeneity, as we can differentiate between person-level unobserved x that are identical over time and those that vary over time!
OLS with panel data

pid  wave     y   x1
  1     1  2340    0
  1     2  2405    5
  1     3  2730   10
  1     4  3250   15
  1     5  3705   20
  1     6  4030   25
  2     1  1885    5
  2     2  2145   10
  2     3  2275   15
  2     4  2470   20
  2     5  2762   25
  2     6  3120   30
  3     1   780   10
  3     2  1170   15
  3     3  1365   20
  3     4  2405   25
  3     5  2405   30
  3     6  2470   35

[Figure, panel "OLS: cross-section": Income against number of years since leaving school for pid=1, pid=2 and pid=3, with the fitted line OLSt=1: y = 2448 - 156*x1]
[Figure, panel "OLS: pooled": Income against number of years since leaving school for pid=1, pid=2 and pid=3, with the fitted line OLSpooled: y = 1925 + 29*x1]

Cross-sectional estimates may be quite misleading (omitted variable bias)!
By adding more data points from the same units at different points in time we can get better estimates. But the assumptions of OLS may be violated!
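The two fitted lines can be reproduced from the 18 observations listed on this slide. The sketch below types the data in with input and runs the wave 1 cross-section and the pooled regression; nothing beyond the slide's own variables is assumed.

```stata
* Example data from this slide
clear
input pid wave y x1
1 1 2340  0
1 2 2405  5
1 3 2730 10
1 4 3250 15
1 5 3705 20
1 6 4030 25
2 1 1885  5
2 2 2145 10
2 3 2275 15
2 4 2470 20
2 5 2762 25
2 6 3120 30
3 1  780 10
3 2 1170 15
3 3 1365 20
3 4 2405 25
3 5 2405 30
3 6 2470 35
end

regress y x1 if wave == 1   // cross-section: slide reports y = 2448 - 156*x1
regress y x1                // pooled: slide reports y = 1925 + 29*x1
```

The sign of the slope flips between the two regressions, which is the point of the slide: in the wave 1 cross-section, people with more years since leaving school happen to be the ones with lower income.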
An illustration of how unobserved heterogeneity matters
Considering this is from panel data, two problems become apparent:
Error terms for persons 1, 2 and 3 differ systematically
The association between x and y appears to be biased
[Figure: two panels, "OLS: pooled" and "OLS: unobs het", plotting Income for pid=1, pid=2 and pid=3; the plots mark the errors w1 and w3 and the person effect u1]
Panel data allows you to:
Break down the error term (w_i) into two components: the unobservable characteristics of the person (u_i), and genuine error (e_i)
Then model u_i and e_i
Expanding the OLS model to consider unobserved heterogeneity
Analytically, think of splitting the error term into its two components $u_i$ and $\varepsilon_i$
$y_i = \alpha + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \dots + x_{iK}\beta_K + u_i + \varepsilon_i$
and consider that you have repeated observations over time
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$u_i$: individual-specific, fixed over time
$\varepsilon_{it}$: varies over time, usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)
... and then reduce the complexity of the information available in some way, or add further assumptions. Your options:
Focus on between variation: lose info on within variation
Focus on within variation: lose info on between variation
Model both types of variation, making further assumptions
Within and between estimators
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$u_i$: individual-specific, fixed over time
$\varepsilon_{it}$: varies over time, usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)
Not interested in within variation? Use the means of all observations for each person i
$\bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i$
This is the between estimator
Not interested in between variation? Why not remove it in that case!
$(y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$
And this is the within estimator: fixed effects
Interested in both? Well, let's treat $\bar{x}_i$ as an imperfect measure of the person fixed effect, and use between variation where within variation is poorly captured
$(y_{it} - \theta\bar{y}_i) = (1-\theta)\alpha + (x_{it} - \theta\bar{x}_i)\beta + \{(1-\theta)u_i + (\varepsilon_{it} - \theta\bar{\varepsilon}_i)\}$
$\theta$ measures the weight given to between-group variation, and is derived from the variances of $u_i$ and $\varepsilon_{it}$
Between estimator
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$\bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i$
Interpret as how much y changes between different people
Not much used
  Except to calculate the $\theta$ parameter for random effects, but Stata does this, not you!
It's inefficient compared to random effects
  It doesn't use as much information as is available in the data (only uses means)
Assumption required: that $u_i$ is uncorrelated with $x_i$
  Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x's (via the betas) as opposed to the correlation?
Can't estimate effects of variables where the mean is invariant over individuals
  Age in a cohort study
  Macro-level variables
Focusing on within variation: the fixed effects family
Fixed effects estimator: xtreg y x, fe
Basic idea: For each individual, calculate the mean of x and the mean of y. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - \bar{y}_i)$ and each $x_{it}$ is replaced by $(x_{it} - \bar{x}_i)$
Identical to:
Least Squares Dummy Variables regression: areg y x, absorb(pid)
Include a dummy indicator for each individual; all time-constant individual-level differences, including the individual-specific error component $u_i$, will then be captured in the person-specific intercept.
Members of the same family, which you may come across in the literature:
First Differences: regress D.(y x)
For each individual, and each time period's y and x, calculate the difference between the value in this period and that in the last period. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - y_{it-1})$ and each $x_{it}$ is replaced by $(x_{it} - x_{it-1})$
Hybrid models: regress y x mean_x z
Run standard OLS but add $\bar{x}_i$ of each time-varying variable as additional regressors
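The equivalence between the fixed effects estimator, the dummy-variable regression and OLS on demeaned data can be checked on the three-person example from the OLS slides. This is a sketch: the helper names ybar_i, xbar_i, y_dm and x_dm are hypothetical, and only the slope is expected to match across the three commands (the hand-demeaned regression does not correct its standard errors for the estimated means).

```stata
* Three-person example data from the OLS slides
clear
input pid wave y x1
1 1 2340  0
1 2 2405  5
1 3 2730 10
1 4 3250 15
1 5 3705 20
1 6 4030 25
2 1 1885  5
2 2 2145 10
2 3 2275 15
2 4 2470 20
2 5 2762 25
2 6 3120 30
3 1  780 10
3 2 1170 15
3 3 1365 20
3 4 2405 25
3 5 2405 30
3 6 2470 35
end
xtset pid wave

xtreg y x1, fe              // within (fixed effects) estimator
areg y x1, absorb(pid)      // least squares dummy variables

* The same slope by hand: demean y and x1 within person, then OLS
egen double ybar_i = mean(y),  by(pid)
egen double xbar_i = mean(x1), by(pid)
generate double y_dm = y  - ybar_i
generate double x_dm = x1 - xbar_i
regress y_dm x_dm, noconstant
```

All three commands should give a slope of about 65, the value shown on the fixed effects slide. First differences are left out here because x1 rises by exactly 5 in every wave of this toy data, so D.x1 has no variation to identify a slope alongside a constant.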
Fixed effects estimator

$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$(y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$

pid  wave     y   x1  ybar_i  xbar_i  y - ybar_i  x1 - xbar_i
  1     1  2340    0  3076.7    12.5      -736.7        -12.5
  1     2  2405    5  3076.7    12.5      -671.7         -7.5
  1     3  2730   10  3076.7    12.5      -346.7         -2.5
  1     4  3250   15  3076.7    12.5       173.3          2.5
  1     5  3705   20  3076.7    12.5       628.3          7.5
  1     6  4030   25  3076.7    12.5       953.3         12.5
  2     1  1885    5  2442.8    17.5      -557.8        -12.5
  2     2  2145   10  2442.8    17.5      -297.8         -7.5
  2     3  2275   15  2442.8    17.5      -167.8         -2.5
  2     4  2470   20  2442.8    17.5        27.2          2.5
  2     5  2762   25  2442.8    17.5       319.2          7.5
  2     6  3120   30  2442.8    17.5       677.2         12.5
  3     1   780   10  1765.8    22.5      -985.8        -12.5
  3     2  1170   15  1765.8    22.5      -595.8         -7.5
  3     3  1365   20  1765.8    22.5      -400.8         -2.5
  3     4  2405   25  1765.8    22.5       639.2          2.5
  3     5  2405   30  1765.8    22.5       639.2          7.5
  3     6  2470   35  1765.8    22.5       704.2         12.5

[Figure "Fixed Effects": demeaned Income against demeaned x1 for pid=1, pid=2 and pid=3, with the fitted line Fixed effects: y = 65*x1]

Ignores between-group variation, so it's an inefficient estimator
However, few assumptions are required for FE to be consistent: $u_i$ is allowed to correlate with $x_i$
Disadvantage: can't estimate the effects of any time-invariant variables
Need to consider change in interpretation of effects
Want to look at the effect of non-time-varying x? Use $x_{it}$ and $\bar{x}_i$ in OLS
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$y_{it} = \alpha + \beta_1 x_{it} + \beta_2 \bar{x}_i + \beta_3 z_i + u_i^{residual} + \varepsilon_{it}$
Hint: create $\bar{x}_i$ yourself

pid  wave     y   x   z  x_bar
  1     1  2340   1   1   1.5
  1     2  2405   2   1   1.5
  1     3  2730   2   1   1.5
  1     4  3250   2   1   1.5
  1     5  3705   1   1   1.5
  1     6  4030   1   1   1.5
  2     1  1885   0   2   0.67
  2     2  2145   1   2   0.67
  2     3  2275   1   2   0.67
  2     4  2470   1   2   0.67
  2     5  2762   1   2   0.67
  2     6  3120   0   2   0.67
  3     1   780   1   2   0.33
  3     2  1170   1   2   0.33
  3     3  1365   0   2   0.33
  3     4  2405   0   2   0.33
  3     5  2405   0   2   0.33
  3     6  2470   0   2   0.33

$z_i$: non-time-varying individual characteristics, for which you do not need to include group means
The effect of any unobserved characteristic otherwise transported in the effect of $x_{it}$ is shifted to the effect of $\bar{x}_i$: $\beta_1$ approximates the coefficient in the FE model, $\beta_3$ gives you, approximately, the OLS estimate for non-time-varying variables $z_i$
Typically no interest in the effect of $\bar{x}_i$, so no need to worry about its interpretation. Note that $\beta_1 + \beta_2$ is approximately equal to the effect in the pooled OLS
Disadvantage: can only control for unobserved heterogeneity associated with observed time-varying variables $x_{it}$; the rest remains in $u_i^{residual}$
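A sketch of the hybrid approach on the 18 observations from this slide. The person mean is created with egen (the name x_bar follows the slide), then y is regressed on x, x_bar and z. For comparison the same data are run through xtreg, fe; in this balanced example the coefficient on x should coincide with the fixed effects estimate. The scalar name b_hybrid is hypothetical.

```stata
* Example data from this slide: x varies over time, z does not
clear
input pid wave y x z
1 1 2340 1 1
1 2 2405 2 1
1 3 2730 2 1
1 4 3250 2 1
1 5 3705 1 1
1 6 4030 1 1
2 1 1885 0 2
2 2 2145 1 2
2 3 2275 1 2
2 4 2470 1 2
2 5 2762 1 2
2 6 3120 0 2
3 1  780 1 2
3 2 1170 1 2
3 3 1365 0 2
3 4 2405 0 2
3 5 2405 0 2
3 6 2470 0 2
end
xtset pid wave

egen double x_bar = mean(x), by(pid)   // person mean of the time-varying x
regress y x x_bar z                    // hybrid model
scalar b_hybrid = _b[x]

xtreg y x, fe                          // fixed effects, for comparison
display "hybrid: " b_hybrid "   FE: " _b[x]
```

With only three individuals the coefficients on x_bar and z are not informative here; the example is only meant to show the mechanics and the match between the hybrid coefficient on x and the FE coefficient.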
Random effects estimator

$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
Random Effects Model, here RE Generalised Least Squares:
$(y_{it} - \theta\bar{y}_i) = (1-\theta)\alpha + (x_{it} - \theta\bar{x}_i)\beta + \{(1-\theta)u_i + (\varepsilon_{it} - \theta\bar{\varepsilon}_i)\}$
Uses both within- and between-group variation, so makes best use of the data and is efficient. Starts off with the idea that using $\bar{x}_i$ is not the best we can do to capture within variation:
the more imprecise the estimate of the person-level variation (as measured by the person's $\bar{x}_i$), the more we should draw on the information from other units ($\bar{x}$)
Assumption required: that $u_i$ is uncorrelated with $x_i$
  Rather heroic assumption: think of examples
  Will see a test for this later
Note that the within and between effects are constrained to be identical (much more like OLS in this respect, so no causal interpretation!)
E.g., when you include a location indicator in your model, you are saying that the effect on y of moving to a new town is the same as the effect on y of living in different towns. When you include a female dummy, you are saying that the effect of being female on y is the same as the effect on y of changing gender.
Estimating fixed effects in Stata

. xtreg LIKERT female ue_sick partner age age2 badh, fe

Fixed-effects (within) regression               Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0501                         Obs per group: min =         1
       between = 0.1906                                        avg =       7.3
       overall = 0.1285                                        max =        14

                                                F(5,20882)         =    220.44
corr(u_i, Xb)  = 0.1561                         Prob > F           =    0.0000

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  (dropped)
     ue_sick |   1.951485   .1394164    14.00   0.000     1.678218    2.224752
     partner |   -.298668    .118635    -2.52   0.012    -.5312018   -.0661342
         age |   .1141748   .0214403     5.33   0.000     .0721501    .1561994
        age2 |  -.0011833   .0002209    -5.36   0.000    -.0016163   -.0007503
   badhealth |   1.230831   .0428556    28.72   0.000      1.14683    1.314831
       _cons |   6.252975   .4932977    12.68   0.000     5.286073    7.219877
-------------+----------------------------------------------------------------
     sigma_u |  3.9934565
     sigma_e |  4.0525618
         rho |  .49265449   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(3316, 20882) =     4.56         Prob > F = 0.0000

R-sq: "R-square-like" statistic
age and age2: peaks at age 48
sigma_u and sigma_e: u and e are the two parts of the error term
Talk about xtmixed
Between regression:
Not much used, but useful to compare coefficients with fixed effects

. xtreg LIKERT female ue_sick partner age age2 badh, be

Between regression (regression on group means)  Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0480                         Obs per group: min =         1
       between = 0.2322                                        avg =       7.3
       overall = 0.1482                                        max =        14

                                                F(6,3310)          =    166.80
sd(u_i + avg(e_i.))=  3.833357                  Prob > F           =    0.0000

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.476659   .1350226    10.94   0.000     1.211923    1.741395
     ue_sick |   2.038192    .312191     6.53   0.000     1.426085    2.650299
     partner |  -.0101941   .1777423    -0.06   0.954      -.35869    .3383019
         age |   .0827335   .0219026     3.78   0.000     .0397895    .1256775
        age2 |  -.0009489   .0002263    -4.19   0.000    -.0013927   -.0005052
   badhealth |   2.275832   .0926521    24.56   0.000     2.094171    2.457493
       _cons |   3.953941   .4430909     8.92   0.000     3.085181    4.822701
------------------------------------------------------------------------------

Coefficient on partner was negative and significant in the FE model.
In FE, the partner coefficient really measures the events of gaining or losing a partner
Random effects regression

. xtreg LIKERT female ue_sick partner age age2 badh, re theta

Random-effects GLS regression                   Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0500                         Obs per group: min =         1
       between = 0.2239                                        avg =       7.3
       overall = 0.1471                                        max =        14

Random effects u_i ~ Gaussian                   Wald chi2(6)       =   2013.32
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------- theta --------------------
  min      5%       median        95%      max
0.1986   0.1986     0.5482     0.6629   0.6629

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.493431   .1259931    11.85   0.000     1.246489    1.740373
     ue_sick |   2.045302   .1271039    16.09   0.000     1.796183    2.294422
     partner |  -.1947691   .0973734    -2.00   0.045    -.3856175   -.0039207
         age |   .1058038    .014544     7.27   0.000     .0772981    .1343094
        age2 |  -.0011062   .0001498    -7.39   0.000    -.0013998   -.0008126
   badhealth |   1.433115   .0385506    37.17   0.000     1.357558    1.508673
       _cons |   5.181864   .3137662    16.52   0.000     4.566894    5.796835
-------------+----------------------------------------------------------------
     sigma_u |  3.0248563
     sigma_e |  4.0525618
         rho |   .3577895   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Option theta gives a summary of weights
Tells you how good an approximation $\bar{x}_i$ is of the person-level effect; or how much of the within variation we used to determine the effect size
theta = 0: OLS estimator; theta = 1: FE estimator
And what about OLS?
OLS simply treats within- and between-group variation as the same
Pools data across waves

. reg LIKERT female ue_sick partner age age2 badh

      Source |       SS       df       MS              Number of obs =   24204
-------------+------------------------------           F(  6, 24197) =  706.54
       Model |  103583.505     6  17263.9175           Prob > F      =  0.0000
    Residual |  591239.694 24197  24.4344214           R-squared     =  0.1491
-------------+------------------------------           Adj R-squared =  0.1489
       Total |  694823.199 24203  28.7081436           Root MSE      =  4.9431

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.409466   .0640651    22.00   0.000     1.283895    1.535038
     ue_sick |   2.031815   .1240757    16.38   0.000     1.788619    2.275011
     partner |  -.0751296   .0769271    -0.98   0.329    -.2259116    .0756524
         age |   .0983746   .0103316     9.52   0.000      .078124    .1186252
        age2 |  -.0010613   .0001049   -10.12   0.000     -.001267   -.0008557
   badhealth |   1.841796   .0357165    51.57   0.000     1.771789    1.911802
       _cons |   4.450393   .2212733    20.11   0.000     4.016684    4.884102
------------------------------------------------------------------------------
Test whether pooling data is valid

$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
If the $u_i$ do not vary between individuals, they can be treated as part of $\alpha$ and OLS is fine.
Breusch-Pagan Lagrange multiplier test
H0: Variance of $u_i$ = 0
H1: Variance of $u_i$ not equal to zero
If H0 is not rejected, you can pool the data and use OLS
Post-estimation test after random effects

. quietly xtreg LIKERT female ue_sick partner age age2 badh, re
. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        LIKERT[pid,t] = Xb + u[pid] + e[pid,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                  LIKERT |   28.70814       5.357998
                       e |   16.42326       4.052562
                       u |   9.149756       3.024856

        Test:   Var(u) = 0
                              chi2(1) = 10816.48
                          Prob > chi2 =     0.0000
Comparing models
Compare coefficients between models
Reasonably similar; differences in the partner and badhealth coefficients
R-squareds are similar
Within and between estimators maximise the within and between R-squared respectively.

                     FE         RE         BE        OLS
female                       1.49 ***   1.48 ***   1.41 ***
ue_sick           1.95 ***   2.05 ***   2.04 ***   2.03 ***
partner          -0.30 **   -0.19 **   -0.01      -0.08
age               0.11 ***   0.11 ***   0.08 ***   0.10 ***
age2             -0.00 ***  -0.00 ***  -0.00 ***  -0.00 ***
badhealth         1.23 ***   1.43 ***   2.28 ***   1.84 ***
_cons             6.25 ***   5.18 ***   3.95 ***   4.45 ***
R-squared within   0.050      0.050      0.048
R-squared between  0.191      0.224      0.232
R-squared overall  0.129      0.147      0.148      0.149
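A side-by-side table like this one can be assembled with estimates store and estimates table. The sketch below uses the three-person example data from the OLS slides rather than the BHPS variables, so the numbers will differ from the table above; the stored names FE, RE, BE and OLS are arbitrary labels.

```stata
* Three-person example data from the OLS slides
clear
input pid wave y x1
1 1 2340  0
1 2 2405  5
1 3 2730 10
1 4 3250 15
1 5 3705 20
1 6 4030 25
2 1 1885  5
2 2 2145 10
2 3 2275 15
2 4 2470 20
2 5 2762 25
2 6 3120 30
3 1  780 10
3 2 1170 15
3 3 1365 20
3 4 2405 25
3 5 2405 30
3 6 2470 35
end
xtset pid wave

quietly xtreg y x1, fe
estimates store FE
quietly xtreg y x1, re
estimates store RE
quietly xtreg y x1, be
estimates store BE
quietly regress y x1
estimates store OLS

* Coefficients with significance stars, plus N and the three panel R-squareds
estimates table FE RE BE OLS, b(%9.2f) star stats(N r2_w r2_b r2_o)
```

The R-squared rows stay blank for OLS because regress does not store the within, between and overall statistics that xtreg does.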
