
Starting R: An Example of Panel Data

Danny Kaplan

March 11, 2009

Longitudinal versus Cross-Sectional

CAUTION: Most of this example is about data re-organization.
There is some grungy programming. The intention is just to show
you some capabilities and give you some examples for your own
reference.
The statistical analysis is mostly in one slide at the end.

Cross-sectional data is a snapshot of a population at one time.

Longitudinal data repeats measurements over time for each individual.

Other related names: repeated measures, panel data.

An Example: How Runners Age

The data set ten-mile-race.csv contains times from the Cherry
Blossom Ten Miler run in 2005 in Washington, DC. The variables
are:

net: the time from the start line to the finish line, in seconds
gun: the time from the start gun to the finish line, in seconds
sex: the sex of the runner
age: the age of the runner
state: where the runner comes from

Cross-sectional Analysis
How does the net time for runners depend on their age?

> run2005 = read.csv("/Users/kaplan/kaplanfiles/stats-book/


> m1 = lm(net ~ age, data = run2005)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 5297.2192    37.6056  140.86   0.0000
age            8.1899     0.9806    8.35   0.0000
Sex is an obvious covariate.
> m2 = lm(net ~ age + sex, data = run2005)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 5339.1554    35.0487  152.34   0.0000
age           16.8936     0.9444   17.89   0.0000
sexM        -726.6195    20.0181  -36.30   0.0000

Cross-sectional Critique
The previous analysis didn't actually involve any individual
person's ageing. Instead, it compared different people of
different ages.

Perhaps this introduces a bias. It might be that the older
runners who continue running tend to be the faster runners.
After all, it's discouraging to find yourself being passed by
more and more runners as you age. The runners thus
discouraged might drop out.

Fortunately, there is a source of longitudinal data: the race
has been run for 10 years and the results have been published
on the Internet each year. The data include the name of the
runner and so give some possibility to identify individual
runners from year to year.

What the data look like.


Each year's format is slightly different, but years tend to look like
this, with separate files for men and women.

Credit Union Cherry Blossom 10 Mile Road Race
Washington, DC
Sunday, April 4, 2004
Official Men's Results
Place Div/Tot  Num   Name            Ag Hometown  Net     G
===== ======== ===== =============== == ========= ======= =
    1   1/2242 13997 Nelson Kiplagat 25 KEN         48:12
    2   2/2242    39 Samuel Ndereba  25 KEN         48:12
... and so on ...
 4145   27/428 12663 Stephen Johnson 46 Oakton VA 2:21:35 2
 4146 2241/2242 12678 George Harrell  30 Laurel MD 2:22:50 2

Re-Organizing the Data

Read the data in from the separate files and put them all in a
data-frame format.

Give a unique identifier to each runner, across the years, to
support the longitudinal analysis.

Count how many times each runner participated to extract
subsets of the data.

With the re-organized data, we can construct the longitudinal
analysis.

Reading in the Data I


For each year's format, write a special-purpose operator that parses
the data and puts it in a data frame format.
This is nuts-and-bolts computer programming, not so interesting
to most people.
Example:
read1999 = function(n=-1){
  # Read the raw women's and men's result pages
  dF = readLines('cb99f.htm', n=n)
  dM = readLines('cb99m.htm', n=n)
  # Drop the first four lines, then parse the fixed-width records
  M = breaklines1999(dM[-(1:4)])
  F = breaklines1999(dF[-(1:4)])
  # Stack women then men and record the sex of each row
  res = data.frame( rbind(F,M),
                    sex=c(rep('F', nrow(F)), rep('M', nrow(M))))
  return(res)
}

Reading in the Data II


breaklines1999 = function(s) {
  # fixed format: each field sits at known character positions
  first = substr(s,1,5)
  first = as.numeric(kill.blanks(first))
  second = substr(s,6,10)
  second = as.numeric(kill.blanks(second))
  third = substr(s,12,15)
  third = as.numeric(kill.blanks(third))
  nm = substr(s,17,37)
  nm = kill.blanks(nm)
  age = as.numeric(substr(s,39,40))
  place = substr(s,42,59)
  place = kill.blanks(place)
  gun = substr(s,61,67)
  gun = to.minutes(gun)        # gun time converted to minutes
  net = rep(NA, length(gun))   # net is filled with NA here

Reading in the Data III

  return(data.frame(position=first, division=second, total
                    name=nm, age=age,place=place, net=net,g
                    stringsAsFactors=FALSE) )
}
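The helper `kill.blanks` is not defined on these slides. The sketch below assumes it simply strips surrounding whitespace (the author's version may do more) and applies the same `substr` approach to a made-up record. The record, the column positions, and the output column names are all hypothetical and are not those of the real 1999 files.

```r
# Hypothetical stand-in for the undefined helper kill.blanks():
# assumed here to strip leading and trailing whitespace only.
kill.blanks <- function(s) trimws(s)

# Build a made-up fixed-width record so the column positions are exact:
# cols 1-5 position, 6-10 bib number, 12-32 name, 33-34 age, 36-47 hometown.
line <- sprintf("%5d%5d %-21s%2d %s", 1L, 12L, "Jane Example", 34L, "Arlington VA")

position <- as.numeric(kill.blanks(substr(line, 1, 5)))
number   <- as.numeric(kill.blanks(substr(line, 6, 10)))
runner   <- kill.blanks(substr(line, 12, 32))
age      <- as.numeric(substr(line, 33, 34))
hometown <- kill.blanks(substr(line, 36, 47))

parsed <- data.frame(position = position, number = number, name = runner,
                     age = age, place = hometown, stringsAsFactors = FALSE)
print(parsed)
```

Because `substr` is vectorized, the same lines work unchanged when `line` is a whole character vector of records, which is how `breaklines1999` uses it.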

Parsing the Net and Gun Time


Translate the hh:mm:ss format into minutes. There are some
functions that can help, e.g., strptime.
to.minutes = function(set){
  res = rep(0,length(set))
  for (k in seq_along(set)) {   # seq_along is safe for empty input
    s = strsplit(set[[k]], ":")[[1]]
    s = as.numeric(s)
    if (length(s)==3 )
      res[k] = s[1]*60 + s[2] + s[3]/60   # hh:mm:ss
    else
      res[k] = s[1] + s[2]/60             # mm:ss
  }
  return(res)
}
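As a quick check of the conversion, here is a self-contained sketch of a vectorized variant applied to two net times from the 2004 listing shown earlier. The name `to.minutes.v` is hypothetical; the arithmetic is the same as in `to.minutes`.

```r
# Vectorized variant of to.minutes(); the name to.minutes.v is hypothetical.
# Handles "mm:ss" and "h:mm:ss" strings.
to.minutes.v <- function(set) {
  parts <- strsplit(as.character(set), ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) == 3) {
      p[1] * 60 + p[2] + p[3] / 60   # h:mm:ss
    } else {
      p[1] + p[2] / 60               # mm:ss
    }
  }, numeric(1))
}

times <- c("48:12", "2:21:35")
mins  <- to.minutes.v(times)
print(round(mins, 2))
```

The two values come out as 48.2 and roughly 141.58 minutes.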

Assembling all the data I


read.all.data = function(){
  r2008 = read2008()
  r2007 = read2007()
  # ... and so on ...
  r2000 = read2000()
  r1999 = read1999()
  res = rbind( r2008, r2007, r2006, r2005,
               r2004, r2003, r2002, r2001, r2000, r1999)
  res$name = kill.blanks(tolower(res$name))
  res$year = c(
    rep(2008, nrow(r2008)),
    rep(2007, nrow(r2007)),
    # ... and so on ...
    rep(2000, nrow(r2000)),
    rep(1999, nrow(r1999))

Assembling all the data II

  )
  res = subset(res, age > 10 & ! is.na(age)) # eliminate th
  return(res)
}

Using Name as a Unique Identifier I


The person's name is almost unique, but some names show up
more than 10 times, so there must be duplicates.
I show you this to demonstrate the power of being able to chain
computations: use the output of one computation as the input of
another.
> foo = read.all.data()
> nrow(foo)
[1] 82365
> names(foo)
 [1] "position" "division" "total"    "name"     "age"
 [6] "place"    "net"      "gun"      "sex"      "year"
> length(unique(foo$name))
[1] 53517

A lot of runners! But how many times does each person run?

Using Name as a Unique Identifier II


> table(foo$name)
    a. gudu memon a. renee callahan       a.j. montes
                1                 2                 1
     aaren pastor     aaron ahlburn    aaron aldridge
                2                 1                 1
     aaron alford       aaron alton      aaron ansell
                1                 4                 1
... and so on for 53,517 different names.

Clearly there are people who ran multiple times, but this display is
not so useful because it is too long. BUT ... the output of table
can be the input to another computation. In this case, applying
table to the output of table does something very sensible: it counts
how many names appear once, how many appear twice, and so on.
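The chaining idea can be tried on a toy vector. The names below are made up and are not race data.

```r
# Toy data: each element is one race entry by the named runner.
entries <- c("ann", "bob", "ann", "cat", "ann", "bob", "dan")

# First pass: number of entries per runner.
per.runner <- table(entries)
print(per.runner)

# Second pass: number of runners having 1, 2, 3, ... entries.
counts <- table(per.runner)
print(counts)
```

A useful sanity check is that the entry counts weighted by their frequencies add back up to the number of rows, which also holds for the real output (the weighted sum is 82365).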

Using Name as a Unique Identifier III

> table(table(foo$name))
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    22
38466  8580  3123  1518   803   441   282   162    87    41     8     2     2     1     1

Creating a Unique ID I
To help separate different people with the same name, we need
additional information. Hometown is a possibility, but people may
move.
Try year of birth:
> foo$yob = foo$year - foo$age
> foo$id = paste( foo$name, foo$yob )
> head(foo$id)
[1] "lineth chepkurui 1988"  "angelina mutuku 1983"
[3] "lidia simon 1974"       "catherine ndereba 1973"
[5] "sharon cherop 1984"     "aziza aliyu 1986"
> table(table(foo$id))
    1     2     3     4     5     6     7     8     9    10
41117  8457  2954  1384   750   401   247   143    73    25

Creating a Unique ID II
This is plausible, but not proven as a unique ID.
One possible check: there shouldn't be anyone with the same
Name-YOB who ran twice in any one year:
> foo$idWithYear = paste( foo$id, foo$year )
> table(table(foo$idWithYear))
    1     2
82245    60
But there are! Does hometown help?
> foo$idHometown = paste(foo$id, foo$place)
> table(table( foo$idHometown ) )
    1     2     3     4     5     6     7     8
54083  8023  1754   698   401   214    91    32
> foo$idHometownWithYear = paste(foo$idHometown, foo$year)

Creating a Unique ID III

> table(table( foo$idHometownWithYear ) )
    1     2
82307    29
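The ID construction and the same-year check can be reproduced on a toy table. The runners and hometowns below are made up; the column names follow `foo`.

```r
# Toy results table (made-up runners) in which two different people
# share both a name and a year of birth.
foo <- data.frame(
  name  = c("pat lee", "pat lee", "pat lee", "sam roe", "sam roe"),
  age   = c(30, 31, 30, 45, 46),
  year  = c(2004, 2005, 2004, 2004, 2005),
  place = c("Oakton VA", "Oakton VA", "Laurel MD", "Laurel MD", "Laurel MD"),
  stringsAsFactors = FALSE
)

foo$yob <- foo$year - foo$age
foo$id  <- paste(foo$name, foo$yob)

# An ID that appears more than once in the same year cannot be one person.
dups <- table(paste(foo$id, foo$year))
print(dups[dups > 1])

# Adding hometown to the ID separates the two people in this toy case.
foo$idHometown <- paste(foo$id, foo$place)
withYear <- table(paste(foo$idHometown, foo$year))
print(table(withYear))
```

Note that `year - age` is only an approximate year of birth: it depends on whether the runner's birthday falls before or after race day, so it can shift by one for a runner whose birthday is close to the race date.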

Extracting the Multiple Runners I


Count how many times each person runs and put this in a table:
> nruns = aggregate(foo$yob, by=list(who = foo$id), length)
> head(nruns)
                      who x
...
7      a. gudu memon 1965 1
8  a. renee callahan 1966 2
9        a.j. montes 1964 1
10      aaren pastor 1991 2
11     aaron ahlburn 1976 1
... and so on.
Do a join (a relational database operation) with the original table
to add a new variable for each case: how many times that person
ran.
> goo = merge(foo, nruns, by.x="id", by.y="who")
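The count-then-join pattern can be seen end to end on a toy table. The data are made up; the calls are the same `aggregate` and `merge` calls used above.

```r
# Toy stand-in for foo: one row per race entry.
foo <- data.frame(
  id  = c("ann 1970", "bob 1980", "ann 1970", "ann 1970", "bob 1980", "cat 1990"),
  yob = c(1970, 1980, 1970, 1970, 1980, 1990),
  net = c(80, 95, 82, 83, 97, 70),
  stringsAsFactors = FALSE
)

# One row per runner; the column named x holds the number of entries.
nruns <- aggregate(foo$yob, by = list(who = foo$id), length)
print(nruns)

# Join the counts back onto every race entry.
goo <- merge(foo, nruns, by.x = "id", by.y = "who")
print(goo)

# Keep only runners with at least three entries.
frequent <- subset(goo, x >= 3)
print(frequent)
```

The join keeps one row per race entry, so `goo` has as many rows as `foo`; only the extra column `x` is new.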

Extracting the Multiple Runners II


Finally, look at the subset of runners who have run five times or
more:
> five = subset(goo, x>=5)
> nrow(five)
[1] 9936
> table(table(five$id))
  5   6   7   8   9  10
750 401 247 143  73  25
Are a lot of these duplicate names?
> table(table(five$idWithYear))
   1    2
9920    8

Extracting the Multiple Runners III

I'm not going to worry about these few, but I could exclude them
by using the same approach used to count how many times each
runner participated.
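One way the exclusion could go is sketched below on a toy table. This is an assumption about how the "same approach" might be applied, not the author's code; the column name `nSameYear` and the rule of dropping an ID entirely are illustrative choices.

```r
# Toy stand-in for five: "bob 1980" appears twice in 2004,
# so that ID must cover more than one person.
five <- data.frame(
  id   = c("ann 1970", "ann 1970", "ann 1970", "bob 1980", "bob 1980", "bob 1980"),
  year = c(2004, 2005, 2006, 2004, 2004, 2005),
  net  = c(80, 82, 83, 95, 99, 97),
  stringsAsFactors = FALSE
)
five$idWithYear <- paste(five$id, five$year)

# Count entries per ID-and-year combination, as was done per runner.
ndup <- aggregate(five$net, by = list(idWithYear = five$idWithYear), length)
names(ndup)[2] <- "nSameYear"   # hypothetical column name
five2 <- merge(five, ndup, by = "idWithYear")

# IDs that ever appear twice in one year are treated as ambiguous and dropped.
bad.ids <- unique(five2$id[five2$nSameYear > 1])
clean <- five2[!(five2$id %in% bad.ids), ]
print(clean)
```

Dropping the whole ID, rather than only the duplicated year, avoids mixing two people's times in one runner's history.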

The Statistical Analysis I

> m3 = lm( net ~ age, data=five )
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  75.6471     0.7001  108.05   0.0000
age           0.2436     0.0155   15.76   0.0000

> m4 = lm( net ~ age + sex, data=five )
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  78.9320     0.6516  121.13   0.0000
age           0.3472     0.0145   23.89   0.0000
sexM        -11.8040     0.3243  -36.40   0.0000

The Statistical Analysis II


Allow each runner to serve as his or her own control:
> m5 = lm( net ~ age + id, data=five )
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)             59.50       2.69   22.12     0.00
age                      0.83       0.04   19.99     0.00
idabigail grier 1983    18.33       3.81    4.82     0.00
idabiy zewde 1967       13.59       3.40    4.00     0.00
idadam anthony 1966     -8.19       3.81   -2.15     0.03
idadam knapp 1977       22.91       4.15    5.52     0.00
... and so on.

The longitudinal analysis shows that runners slow down much
faster than indicated by the cross-sectional analysis: about 0.83
minutes per year of age in m5, versus 0.24 in m3 (and 8.19 seconds,
roughly 0.14 minutes, in the 2005 cross-section m1).
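To see why a within-runner fit can give a steeper slope than a pooled fit, here is a small simulation. Everything in it is an assumption chosen for illustration (the sample sizes, the 0.8 and 0.2 slopes, the noise levels); it is not the race data.

```r
set.seed(1)
n.runners  <- 200
n.years    <- 5
true.slope <- 0.8   # within-runner slowing, minutes per year (illustrative)

start.age <- sample(25:60, n.runners, replace = TRUE)
# Assumption: between runners, baseline time rises only 0.2 minutes per
# year of starting age, mimicking older entrants being relatively fast.
baseline <- 70 + 0.2 * (start.age - 25) + rnorm(n.runners, sd = 5)

years.in <- rep(0:(n.years - 1), times = n.runners)
sim <- data.frame(
  id  = factor(rep(seq_len(n.runners), each = n.years)),
  age = rep(start.age, each = n.years) + years.in
)
sim$net <- rep(baseline, each = n.years) + true.slope * years.in +
  rnorm(nrow(sim), sd = 2)

# Pooled fit: compares different people of different ages.
pooled <- coef(lm(net ~ age, data = sim))["age"]
# Fit with runner indicators: each runner serves as their own control.
within <- coef(lm(net ~ age + id, data = sim))["age"]
print(c(pooled = unname(pooled), within = unname(within)))
```

In this setup the pooled slope mostly reflects the between-runner trend, while the fit with `id` recovers something close to the within-runner slope built into the simulation.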

A Mixed-Effects Model? I

I don't know much about mixed-effects models, but the fact that
we're not interested in the coefficients for individual runners
suggests that we should be treating them as random effects.
> library(lme4)
> m6 = lmer( net ~ 1 + age + (1|id), five)

A Mixed-Effects Model? II
> summary(m6)
Linear mixed model fit by REML
   AIC   BIC logLik deviance REMLdev
 52918 52945 -26455    52904   52910
Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 167.489  12.9417
 Residual              35.007   5.9167
Number of obs: 7480, groups: id, 1639
Fixed effects:
            Estimate Std. Error t value
(Intercept) 67.21559    1.14307    58.8
age          0.44137    0.02508    17.6
The age coefficient here, 0.44, is quite different from the 0.83
estimated by m5.

Common Sense: A Separate Model for Each Runner I

> myfun = function(x){lm(net~age,data=x)$coef[2]}
> ages = group( five, five$id, myfun)
> head(ages)
               group    result
1   aaron glahe 1974 3.1261905
2 abigail grier 1983 2.4666667
3    abiy zewde 1967 1.6940476
... and so on.
What's the mean slope?

Common Sense: A Separate Model for Each Runner II

> t.test(ages$result)
One Sample t-test
data: ages$result
t = 11.1364, df = 1638, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.6078309 0.8677138
sample estimates:
mean of x
0.7377724
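The `group` function used for the per-runner fits is not part of base R and its definition is not shown on the slides. A minimal base-R sketch of the same idea, assuming `group` applies a function to each runner's rows and collects the results, is given below on simulated data. The data, the 0.7 slope, and the helper name `slope.of` are all made up for illustration.

```r
set.seed(2)
# Simulated stand-in for five: 50 runners, 5 consecutive races each.
five <- data.frame(
  id  = rep(paste("runner", 1:50), each = 5),
  age = rep(sample(25:60, 50, replace = TRUE), each = 5) + rep(0:4, times = 50),
  stringsAsFactors = FALSE
)
five$net <- 60 + rep(rnorm(50, sd = 10), each = 5) +   # runner-specific level
  0.7 * five$age + rnorm(nrow(five), sd = 2)           # common ageing slope

# Fit net ~ age separately for each runner and keep the age slope.
slope.of <- function(d) unname(coef(lm(net ~ age, data = d))["age"])
slopes <- sapply(split(five, five$id), slope.of)

ages <- data.frame(group = names(slopes), result = unname(slopes),
                   stringsAsFactors = FALSE)
print(head(ages))

# Test whether the mean per-runner slope differs from zero.
print(t.test(ages$result))
```

Each slope here rests on only five points, so individual values are noisy; it is the mean across runners that the t-test examines.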

Slowing and Aging I

Going a bit further: see how runners slow as they age.
We need to pull out the YOB for each runner and model the
ageing slope versus YOB.
> yob = group(five$yob, five$id, mean, resname='yob')
> noo = merge(yob,ages)
> mm = lm( result ~ yob, data=noo)
> summary(mm)
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 95.760335  12.076539   7.929 4.04e-15 ***
yob         -0.048496   0.006163  -7.868 6.47e-15 ***

Slowing and Aging II

[Figure: Minutes per Year versus Year of Birth, 1920 to 1980.]

Slowing and Aging III


[Figure: Minutes per Year versus Year of Birth, 1920 to 1980.]
