Panel Data PDF

Starting R: An Example of Panel Data
Danny Kaplan
March 11, 2009
Longitudinal versus Cross-Sectional
CAUTION: Most of this example is about data re-organization.
There is some grungy programming. The intention is just to show
you some capabilities and give you some examples for your own
reference.
The statistical analysis is mostly in one slide at the end.
Cross-sectional data is a snapshot of a population at one time.
Longitudinal data repeats measurements over time for each individual.
Other related names: repeated measures, panel data.
An Example: How Runners Age
The data set ten-mile-race.csv contains times from the Cherry
Blossom Ten Miler run in 2005 in Washington, DC. The variables
are:
net    the time from the start line to the finish line, in seconds
gun    the time from the start gun to the finish line, in seconds
sex
age    the age of the runner
state  where the runner comes from
Cross-sectional Analysis
How does the net time for runners depend on their age?
> run2005 = read.csv("/Users/kaplan/kaplanfiles/stats-book/

> m1 = lm(net ~ age, data = run2005)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5297.2192    37.6056  140.86   0.0000
age            8.1899     0.9806    8.35   0.0000
Sex is an obvious covariate.
> m2 = lm(net ~ age + sex, data = run2005)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5339.1554    35.0487  152.34   0.0000
age           16.8936     0.9444   17.89   0.0000
sexM        -726.6195    20.0181  -36.30   0.0000
Cross-sectional Critique
The previous analysis didn't actually involve any individual
person's ageing. Instead, it compared different people of
different ages.
Perhaps this introduces a bias. It might be that the older
runners who continue running tend to be the faster runners.
After all, it's discouraging to find yourself being passed by
more and more runners as you age. The runners thus
discouraged might drop out.
Fortunately, there is a source of longitudinal data: the race
has been run for 10 years and the results have been published
on the Internet each year. The data include the name of the
runner and so give some possibility to identify individual
runners from year to year.
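The drop-out worry can be illustrated with a small simulation. This is only a sketch under made-up assumptions: every simulated runner slows by the same true_slope per year, and the drop-out rule (p_stay) is invented for the illustration. None of these numbers come from the race data.

```r
# Hypothetical simulation: everyone slows at the same rate, but slower
# runners become more likely to stop racing as they get older.
set.seed(1)
n <- 5000
age <- sample(20:70, n, replace = TRUE)
ability <- rnorm(n, mean = 0, sd = 10)   # runner-specific offset, in minutes
true_slope <- 0.8                        # assumed minutes slower per year of age
net <- 60 + true_slope * age + ability + rnorm(n, sd = 3)

# Invented drop-out rule: the chance of still racing falls with age
# for slow runners (positive ability) and rises for fast ones.
p_stay <- plogis(2 - (ability / 10) * (age - 20) / 10)
still_racing <- runif(n) < p_stay

# Slope using everyone versus only those who still show up on race day
slope_all <- unname(coef(lm(net ~ age))["age"])
slope_observed <- unname(coef(lm(net ~ age, subset = still_racing))["age"])
c(all = slope_all, observed = slope_observed)
```

In this made-up setting the slope estimated from the runners who remain is smaller than the slope for the whole population, which is the kind of bias a cross-sectional comparison cannot rule out.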
What the data look like.

Each year's format is slightly different, but years tend to look like
this, with separate files for men and women.
Credit Union Cherry Blossom 10 Mile Road Race
Washington, DC
Sunday, April 4, 2004
Official Men's Results
Place Div/Tot   Num   Name            Ag Hometown  Net     G
===== ========= ===== =============== == ========= ======= =
    1    1/2242 13997 Nelson Kiplagat 25 KEN         48:12
    2    2/2242    39 Samuel Ndereba  25 KEN         48:12
... and so on ...
 4145    27/428 12663 Stephen Johnson 46 Oakton VA 2:21:35 2
 4146 2241/2242 12678 George Harrell  30 Laurel MD 2:22:50 2
Re-Organizing the Data
Read it in from the separate files and put them all in a
data-frame format.
Give a unique identifier to each runner, across the years, to
support the longitudinal analysis.
Count how many times each runner participated to extract
subsets of the data.
With the re-organized data, we can construct the longitudinal
analysis.
Reading in the Data I

For each year's format, write a special-purpose operator that parses
the data and puts it in a data frame format.
This is nuts-and-bolts computer programming, not so interesting
to most people.
Example:
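read1999 = function(n=-1){
  # Read the raw result pages (f and m files)
  dF = readLines('cb99f.htm', n=n)
  dM = readLines('cb99m.htm', n=n)
  # Drop the first four lines, then parse the fixed-format rows
  men = breaklines1999(dM[-(1:4)])
  women = breaklines1999(dF[-(1:4)])   # avoid naming a variable F (it masks FALSE)
  res = data.frame( rbind(women, men),
    sex=c(rep('F', nrow(women)), rep('M', nrow(men))))
  return(res)
}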
Reading in the Data II

breaklines1999 = function(s) {
  # fixed format: each field sits in a known range of columns
  first = substr(s,1,5)
  first = as.numeric(kill.blanks(first))
  second = substr(s,6,10)
  second = as.numeric(kill.blanks(second))
  third = substr(s,12,15)
  third = as.numeric(kill.blanks(third))
  nm = substr(s,17,37)
  nm = kill.blanks(nm)
  age = as.numeric(substr(s,39,40))
  place = substr(s,42,59)
  place = kill.blanks(place)
  gun = substr(s,61,67)
  gun = to.minutes(gun)
  # no net time is parsed here, so fill the column with NA
  net = rep(NA, length(gun))
Reading in the Data III
  return(data.frame(position=first, division=second, total
    name=nm, age=age,place=place, net=net,g
    stringsAsFactors=FALSE) )
}
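The fixed-format idea can be tried on a single line. This is a minimal sketch, not the parser used for the race files: kill_blanks is a hypothetical stand-in for kill.blanks (whose definition is not shown in these slides, so trimming leading and trailing blanks is a guess), and the column layout is made up for the illustration.

```r
# Hypothetical helper: strip leading and trailing blanks.
# The original kill.blanks is not shown, so this is only a guess at what it does.
kill_blanks <- function(s) trimws(s)

# Build one line in a made-up fixed-width layout:
# columns 1-5 place, 7-21 name, 23-24 age, 26-32 time
line <- sprintf("%5d %-15s %2d %7s", 2L, "Samuel Ndereba", 25L, "48:12")

# Cut each field out by its column range, as breaklines1999 does
parsed <- data.frame(
  position = as.numeric(kill_blanks(substr(line, 1, 5))),
  name     = kill_blanks(substr(line, 7, 21)),
  age      = as.numeric(substr(line, 23, 24)),
  time     = kill_blanks(substr(line, 26, 32)),
  stringsAsFactors = FALSE
)
parsed
```

Because substr is vectorized, the same calls work unchanged when line is a whole character vector of result lines.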
Parsing the Net and Gun Time

Translating the hh:mm:ss format into minutes. There are some
functions that can help, e.g., strptime.
to.minutes = function(set){
  res = rep(0,length(set))
  # seq_along() is safe when set is empty; 1:length(set) is not
  for (k in seq_along(set)) {
    s = strsplit(set[[k]], ":")[[1]]
    s = as.numeric(s)
    if (length(s)==3 )
      res[k] = s[1]*60 + s[2] + s[3]/60   # hh:mm:ss
    else
      res[k] = s[1] + s[2]/60             # mm:ss
  }
  return(res)
}
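The same conversion can be written without an explicit loop. The sketch below is an alternative, not the function used in these slides; the name to_minutes_vec is made up, and returning NA for anything that is not mm:ss or hh:mm:ss is my own choice.

```r
# Hypothetical loop-free variant of to.minutes
to_minutes_vec <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) == 3) {
      p[1] * 60 + p[2] + p[3] / 60    # hh:mm:ss
    } else if (length(p) == 2) {
      p[1] + p[2] / 60                # mm:ss
    } else {
      NA_real_                        # anything else is treated as missing
    }
  }, numeric(1))
}

# Two times in the formats that appear in the results files
to_minutes_vec(c("48:12", "2:21:35"))
```

Using vapply with numeric(1) makes the function fail loudly if a piece ever returns something other than a single number.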
Assembling all the data I

read.all.data = function(){
  r2008 = read2008()
  r2007 = read2007()
  # ... and so on ...
  r2000 = read2000()
  r1999 = read1999()
  # stack the yearly data frames, most recent year first
  res = rbind( r2008, r2007, r2006, r2005,
    r2004, r2003, r2002, r2001, r2000, r1999);
  res$name = kill.blanks(tolower(res$name))
  # label each row with its year, in the same order as the rbind
  res$year = c(
    rep(2008, nrow(r2008)),
    rep(2007, nrow(r2007)),
    # ... and so on ...
    rep(2000, nrow(r2000)),
    rep(1999, nrow(r1999))
Assembling all the data II
  )
  res = subset(res, age > 10 & ! is.na(age)) # eliminate th
  return(res)
}
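One way to avoid keeping the rbind and the rep calls in the same order by hand is to attach the year to each piece before stacking. This is a sketch of that alternative with made-up stand-in data frames (r1999 and r2000 below are toy data, not the real readers' output).

```r
# Made-up stand-ins for the output of the per-year readers
r1999 <- data.frame(name = c("A Runner", "B Runner"), age = c(30, 41),
                    stringsAsFactors = FALSE)
r2000 <- data.frame(name = "A Runner", age = 31, stringsAsFactors = FALSE)

# Keep the pieces in a list named by year
by_year <- list("1999" = r1999, "2000" = r2000)

# Attach the year to each piece first, then stack the pieces
tagged <- Map(function(d, y) { d$year <- as.numeric(y); d },
              by_year, names(by_year))
all_years <- do.call(rbind, tagged)
rownames(all_years) <- NULL
all_years
```

With this arrangement, adding or removing a year only means changing the list.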
Using Name as a Unique Identifier I

The person's name is almost unique, but some names show up
more than 10 times, so there must be duplicates.
I show you this to demonstrate the power of being able to chain
computations: use the output of one computation as the input of
another.
> foo = read.all.data()
> nrow(foo)
[1] 82365
> names(foo)
 [1] "position" "division" "total"    "name"     "age"
 [6] "place"    "net"      "gun"      "sex"      "year"
> length(unique(foo$name))
[1] 53517
A lot of runners! But how many times does each person run?
Using Name as a Unique Identifier II

> table(foo$name)
    a. gudu memon a. renee callahan       a.j. montes
                1                 2                 1
     aaren pastor     aaron ahlburn    aaron aldridge
                2                 1                 1
     aaron alford       aaron alton      aaron ansell
                1                 4                 1
and so on for 53,517 different names.
Clearly there are people who ran multiple times, but this display is
not so useful because it is too long. BUT ... the output of table
can be the input to another computation. In this case, applying
table to the output of table does something very sensible.
Using Name as a Unique Identifier III
> table(table(foo$name))
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    22
38466  8580  3123  1518   803   441   282   162    87    41     8     2     2     1     1
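To see what the table of a table is counting, here is the same chain on a tiny made-up vector of names (the names are invented, not taken from the race data).

```r
# Made-up names: one entry per race appearance
runs <- c("ann", "bob", "ann", "cy", "bob", "ann", "dee")

per_name <- table(runs)      # how many times each name appears
counts <- table(per_name)    # how many names appear once, twice, three times, ...

per_name
counts
```

The first table is indexed by name; the second is indexed by the number of appearances, which is why it stays short no matter how many names there are.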
Creating a Unique ID I
To help separate different people with the same name, we need
additional information. Hometown is a possibility, but people may
move.
Try year of birth:
> foo$yob = foo$year - foo$age
> foo$id = paste( foo$name, foo$yob )
> head(foo$id)
[1] "lineth chepkurui 1988"  "angelina mutuku 1983"
[3] "lidia simon 1974"       "catherine ndereba 1973"
[5] "sharon cherop 1984"     "aziza aliyu 1986"
> table(table(foo$id))
    1     2     3     4     5     6     7     8     9    10
41117  8457  2954  1384   750   401   247   143    73    25
Creating a Unique ID II
This is plausible, but not proven as a unique ID.
One possible check: There shouldn't be anyone with the same
Name-YOB who ran twice in any one year:
> foo$idWithYear = paste( foo$id, foo$year )
> table(table(foo$idWithYear))
    1     2
82245    60
But there are! Does hometown help?
> foo$idHometown = paste(foo$id, foo$place)
> table(table( foo$idHometown ) )
    1     2     3     4     5     6     7     8
54083  8023  1754   698   401   214    91    32
> foo$idHometownWithYear = paste(foo$idHometown, foo$year)
Creating a Unique ID III
> table(table( foo$idHometownWithYear ) )
    1     2
82307    29
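The uniqueness check can be followed step by step on a toy data frame. The rows below are invented; the column names mirror the ones used above, and clashes is a name I made up for the keys that occur more than once in a year.

```r
# Invented rows: two different "pat lee" born the same year both ran in 2004
toy <- data.frame(
  name = c("pat lee", "pat lee", "pat lee", "sam roe"),
  age  = c(30, 30, 31, 45),
  year = c(2004, 2004, 2005, 2004),
  stringsAsFactors = FALSE
)

# Same construction as above: name plus year of birth, then plus race year
toy$id <- paste(toy$name, toy$year - toy$age)
toy$idWithYear <- paste(toy$id, toy$year)

# A candidate ID that appears twice in one year cannot be a single person
per_key <- table(toy$idWithYear)
clashes <- names(per_key)[per_key > 1]
clashes
```

Listing the clashing keys, rather than only counting them, makes it possible to look at those rows directly.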
Extracting the Multiple Runners I

Count how many times each person runs and put this in a table:
> nruns = aggregate(foo$yob, by=list(who = foo$id), length)
> head(nruns)
                      who x
...
7      a. gudu memon 1965 1
8  a. renee callahan 1966 2
9        a.j. montes 1964 1
10     aaren pastor 1991 2
11    aaron ahlburn 1976 1
... and so on.
Do a join (a relational database operation) with the original table
to add a new variable for each case: how many times that person
ran.
> goo = merge(foo, nruns, by.x="id", by.y="who")
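Here is the count-then-join step on a toy table, so the result is small enough to read. The data are made up. The last line shows ave() as a possible shortcut that adds the same count without a join; that shortcut is my suggestion, not what these slides use.

```r
# Made-up race records: one row per appearance
toy <- data.frame(
  id  = c("a 1970", "b 1980", "a 1970", "a 1970", "b 1980", "c 1990"),
  net = c(80, 95, 82, 83, 97, 70),
  stringsAsFactors = FALSE
)

# Count the rows per id; aggregate() names the count column x
nruns <- aggregate(toy$net, by = list(who = toy$id), length)

# Join the counts back onto every row of the original table
joined <- merge(toy, nruns, by.x = "id", by.y = "who")
joined

# Alternative without a join: ave() applies length within each id
toy$x <- ave(toy$net, toy$id, FUN = length)
```

Note that merge sorts the result by the key, while ave keeps the original row order.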
Extracting the Multiple Runners II

Finally, look at the subset of runners who have run five times or
more:
> five = subset(goo, x>=5)
> nrow(five)
[1] 9936
> table(table(five$id))
  5   6   7   8   9  10
750 401 247 143  73  25
Are a lot of these duplicate names?
> table(table(five$idWithYear))
   1    2
9920    8
Extracting the Multiple Runners III
I'm not going to worry about these few, but I could exclude them
by using the same approach used to count how many times each
runner participated.
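For reference, the exclusion might look something like this. It is a sketch on invented rows, using the same aggregate approach as the counting step; bad_ids and clean are names I made up, and dropping every row of an ambiguous ID (rather than only the clashing year) is one choice among several.

```r
# Invented rows: "pat lee 1974" appears twice in 2004, so that ID is ambiguous
toy <- data.frame(
  id   = c("pat lee 1974", "pat lee 1974", "pat lee 1974",
           "sam roe 1959", "sam roe 1959"),
  year = c(2004, 2004, 2005, 2004, 2005),
  stringsAsFactors = FALSE
)
toy$idWithYear <- paste(toy$id, toy$year)

# Count rows per id-and-year key, as was done per id before
ndup <- aggregate(toy$year, by = list(key = toy$idWithYear), length)

# IDs that show up more than once in some year
bad_ids <- unique(toy$id[toy$idWithYear %in% ndup$key[ndup$x > 1]])

# Keep only the runners whose ID never clashes
clean <- subset(toy, !(id %in% bad_ids))
clean
```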
The Statistical Analysis I
> m3 = lm( net ~ age, data=five )
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  75.6471     0.7001  108.05   0.0000
age           0.2436     0.0155   15.76   0.0000
> m4 = lm( net ~ age + sex, data=five )
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  78.9320     0.6516  121.13   0.0000
age           0.3472     0.0145   23.89   0.0000
sexM        -11.8040     0.3243  -36.40   0.0000
The Statistical Analysis II

Allowing each runner to serve as his or her own control.
> m5 = lm( net ~ age + id, data=five )
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)             59.50       2.69   22.12     0.00
age                      0.83       0.04   19.99     0.00
idabigail grier 1983    18.33       3.81    4.82     0.00
idabiy zewde 1967       13.59       3.40    4.00     0.00
idadam anthony 1966     -8.19       3.81   -2.15     0.03
idadam knapp 1977       22.91       4.15    5.52     0.00
... and so on.
The longitudinal analysis shows that runners slow down much
faster than indicated by the cross-sectional analysis: about 0.83
minutes per year of age here, compared with 0.24 to 0.35 minutes
per year in models m3 and m4.
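What "serving as your own control" does can be seen in a small simulation. Everything below is invented for the illustration: 30 simulated runners, each observed for 6 years, all slowing by 0.8 minutes per year, with the assumption that runners who start older happen to be faster. The sketch also shows that putting id in the model gives the same age slope as subtracting each runner's own means first.

```r
# Simulated panel: 30 runners, 6 consecutive years each (all numbers made up)
set.seed(2)
n_runners <- 30
n_years <- 6
id <- factor(rep(seq_len(n_runners), each = n_years))
start_age <- sample(25:60, n_runners, replace = TRUE)
years_in <- rep(0:(n_years - 1), times = n_runners)
age <- rep(start_age, each = n_years) + years_in

# Assumption for the illustration: older starters happen to be faster runners
base <- 110 - 0.5 * start_age + rnorm(n_runners, sd = 5)
net <- rep(base, each = n_years) + 0.8 * years_in +
  rnorm(n_runners * n_years, sd = 2)

# Pooled slope mixes differences between runners with ageing within runners
pooled <- unname(coef(lm(net ~ age))["age"])

# A separate intercept per runner leaves only the within-runner change
with_id <- unname(coef(lm(net ~ age + id))["age"])

# Same slope from centering age and net on each runner's own mean
age_c <- age - ave(age, id)
net_c <- net - ave(net, id)
within <- unname(coef(lm(net_c ~ age_c))["age_c"])

c(pooled = pooled, with_id = with_id, within = within)
```

In this simulated setting the pooled slope is pulled far away from the per-year slowing that was built in, while the two within-runner estimates agree with each other.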
A Mixed-Effects Model? I
I don't know much about mixed-effects models, but the fact that
we're not interested in the coefficients for individual runners
suggests that we should be treating them as random effects.
> library(lme4)
> m6 = lmer( net ~ 1 + age + (1|id), five)
A Mixed-Effects Model? II
> summary(m6)
Linear mixed model fit by REML
   AIC   BIC logLik deviance REMLdev
 52918 52945 -26455    52904   52910
Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 167.489  12.9417
 Residual              35.007   5.9167
Number of obs: 7480, groups: id, 1639
Fixed effects:
            Estimate Std. Error t value
(Intercept) 67.21559    1.14307    58.8
age          0.44137    0.02508    17.6
This age coefficient (0.44) is quite different from the 0.83 in m5.
Common Sense: A Separate Model for Each Runner I
> myfun = function(x){lm(net~age,data=x)$coef[2]}
> ages = group( five, five$id, myfun)
> head(ages)
               group    result
1   aaron glahe 1974 3.1261905
2 abigail grier 1983 2.4666667
3    abiy zewde 1967 1.6940476
and so on
What's the mean slope?
Common Sense: A Separate Model for Each Runner II
> t.test(ages$result)
        One Sample t-test
data:  ages$result
t = 11.1364, df = 1638, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.6078309 0.8677138
sample estimates:
mean of x
0.7377724
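The group function used above is not part of base R. For reference, the same one-model-per-runner idea can be sketched with split and sapply. The data below are made up, with slopes of 0.5, 1.0 and 1.5 minutes per year built in and no noise, and slope_of is a hypothetical name.

```r
# Made-up data: three runners, five years each, with known slopes
start <- rep(c(30, 40, 50), each = 5)
toy <- data.frame(
  id  = rep(c("r1", "r2", "r3"), each = 5),
  age = start + rep(0:4, times = 3),
  stringsAsFactors = FALSE
)
toy$net <- 70 + rep(c(0.5, 1.0, 1.5), each = 5) * (toy$age - start)

# Fit net ~ age separately for each runner and keep the age coefficient
slope_of <- function(d) unname(coef(lm(net ~ age, data = d))[2])
slopes <- sapply(split(toy, toy$id), slope_of)

# Same shape as the ages table: one row per runner
ages <- data.frame(group = names(slopes), result = unname(slopes),
                   stringsAsFactors = FALSE)
ages
mean(ages$result)
```

With real data, t.test(ages$result) can then be applied to the per-runner slopes exactly as on the slide.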
Slowing and Aging I
Going a bit further: See how runners slow as they age.
We need to pull out the YOB for each runner and model the
ageing slope versus YOB.
> yob = group(five$yob, five$id, mean, resname='yob')
> noo = merge(yob,ages)
> mm = lm( result ~ yob, data=noo)
> summary(mm)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 95.760335  12.076539   7.929 4.04e-15 ***
yob         -0.048496   0.006163  -7.868 6.47e-15 ***
Slowing and Aging II
[Figure: Minutes per Year versus Year of Birth, 1920 to 1980]
Slowing and Aging III

[Figure: Minutes per Year versus Year of Birth, 1920 to 1980, in two panels]
