Outline
Resources and Books
Survival Analysis: Kaplan-Meier Estimator, Cox Regression
The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
STATA FAQ: http://www.stata.com/support/faqs
STATA User Support: http://www.stata.com/support
Books: http://www.stata.com/bookstore/
Contact:
For statistical consultation: Sun Li, lsun@smu.edu.sg
For software installation: TAN Suh Wen, swtan@smu.edu.sg
Survival Analysis
Survival Data
Data: survival data is time-to-event data. It is quantitative data measuring the time from a well-defined time origin until the occurrence of a particular event of interest (the endpoint).
Reasons for using a survival model:
The distribution of survival times tends to be positively skewed rather than normal, and it may not be possible to find a transformation to normality.
Standard regression cannot handle time-varying covariates.
Some durations are censored.
Censored observations: the event time is not observed exactly, for example because the event has not occurred by the end of the study, the subject is lost to follow-up or withdraws from the study, another intervention is offered, or the event occurred from an unrelated cause.
Survival Model
Survival function: $S(t) = P(T \ge t) = 1 - F(t)$
Hazard function: $h(t) = \dfrac{f(t)}{S(t)}$
=> $\dfrac{d \log S(t)}{dt} = -h(t)$
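The link between the hazard and the survival function can be written out in a few lines. This is a short derivation in standard notation; the cumulative hazard $H(t)$ is not defined on this slide and is introduced here as an assumption, along with $S(0) = 1$ and a continuous $T$.

```latex
\begin{aligned}
h(t) &= \frac{f(t)}{S(t)} = \frac{-S'(t)}{S(t)} = -\frac{d}{dt}\log S(t), \\
H(t) &= \int_0^t h(u)\,du = -\log S(t), \\
S(t) &= \exp\{-H(t)\}.
\end{aligned}
```

The first line uses $f(t) = F'(t) = -S'(t)$; the last line is the form that reappears when the Cox model is written in terms of survival functions.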
Kaplan-Meier Estimator:
$\hat{S}(t) = \prod_{j \,:\, t_{(j)} \le t} \left(1 - \dfrac{d_j}{n_j}\right)$
$d_j$: the number of individuals who experience the event at time $t_{(j)}$
$n_j$: the number of individuals who have not yet experienced the event at time $t_{(j)}$ (the risk set)
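To make the product formula concrete, here is a minimal plain-Python sketch of the Kaplan-Meier calculation. The function name and the toy data are hypothetical, and the risk set is taken to be everyone whose observed time is at least $t_{(j)}$; it is an illustration of the arithmetic, not a replacement for STATA's sts commands.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_j, S_hat(t_j))] at each distinct event time.

    times:  observed durations
    events: 1 if the event occurred, 0 if the observation is censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    surv, curve = 1.0, []
    for t in sorted(deaths):
        n_j = sum(1 for u in times if u >= t)  # still at risk just before t
        d_j = deaths[t]                        # events exactly at t
        surv *= 1.0 - d_j / n_j
        curve.append((t, surv))
    return curve

# Hypothetical toy data: 7 subjects, 3 of them censored.
times = [2, 3, 3, 5, 8, 8, 9]
events = [1, 1, 0, 1, 1, 0, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored subjects never trigger a step in the curve, but they stay in the risk set until their censoring time, which is how the estimator uses the partial information they carry.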
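Cox Regression:
$h_i(t) = h_0(t) \exp(\beta^T x_i)$
=> $S_i(t) = S_0(t)^{\exp(\beta^T x_i)}$
=> $\log H_i(t) = \log H_0(t) + \beta^T x_i$
where $h_0(t)$ is the baseline hazard, and $S_0(t)$ and $H_0(t)$ are the corresponding baseline survival and cumulative hazard functions.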
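A small numerical sketch may help in reading the Cox relations. The helper names, the coefficient value and the baseline survival value below are all hypothetical; the code only evaluates the two formulas for a single binary covariate.

```python
import math

def hazard_ratio(beta, x_a, x_b):
    """h_a(t) / h_b(t) = exp(beta' (x_a - x_b)); it does not depend on t."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_a, x_b)))

def survival(s0, beta, x):
    """S_i(t) = S_0(t) ** exp(beta' x_i) for a given baseline value S_0(t)."""
    return s0 ** math.exp(sum(b * v for b, v in zip(beta, x)))

beta = [math.log(2.0)]                  # hypothetical coefficient
hr = hazard_ratio(beta, [1], [0])       # covariate = 1 versus covariate = 0
s1 = survival(0.8, beta, [1])           # baseline survival 0.8 at some time t
print(hr, s1)
```

With a coefficient of log 2 the hazard is doubled at every time point, and a baseline survival of 0.8 becomes 0.8 squared, i.e. 0.64, for the exposed group.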
Survival Analysis in STATA
Data: telco.csv
Variable name        Variable information
age                  Age in years
marital              Marital status: 0 = unmarried, 1 = married
address              Years at current address
income               Household income in thousands
ed                   Level of education: 1 = did not complete high school, 2 = high school degree, 3 = college degree, 4 = undergraduate, 5 = postgraduate
employ               Years with current employer
(name not listed)    Number of people in household
(name not listed)    Gender: 0 = male, 1 = female
tenure               Months with service
churn                Churn within last month: 0 = No, 1 = Yes
custcat              Customer category: 1 = basic service, 2 = E-service, 3 = plus service, 4 = total service
Declaring and summarizing survival-time data:
    insheet using telco.csv
    d
    stset tenure, failure(churn)
stset creates four system variables:
_st: 1 if the record is to be used, 0 if ignored
_d: 1 if failure, 0 if censored
_t: analysis time when the record ends
_t0: analysis time when the record begins
    stsum
    ltable tenure churn
    sts graph, by(custcat)
[Figure: survival curves from sts graph, by(custcat), one curve per customer category; y-axis from 0.00 to 1.00]
Fitting regression models:
    sw stcox age marital address income ed employ retire, pe(0.05)
    xi: stcox employ address marital income i.custcat
    test _Icustcat_2 _Icustcat_3 _Icustcat_4
The omit characteristic sets the category that xi drops as the reference level:
    char marital[omit] 1
    char custcat[omit] 4
    xi: stcox employ address i.marital income i.custcat, basesurv(s) basehc(h)
    stcurve, survival
    stcurve, hazard
    stcurve, survival at1(_Icustcat_1=1) at2(_Icustcat_2=1) at3(_Icustcat_3=1)
    stcurve, hazard at1(_Icustcat_1=1) at2(_Icustcat_2=1) at3(_Icustcat_3=1)
[Figures: two panels titled "Cox proportional hazards regression", plotting Survival against analysis time (0 to 80); the second panel shows separate curves for the _Icustcat dummies]
Examining the proportional hazards assumption:
    stphplot, by(custcat)
The proportional hazards assumption is not violated when the curves are parallel.
[Figure: -ln[-ln(Survival Probability)] curves from stphplot, one per customer category]
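The reason parallel curves indicate proportional hazards can be checked numerically. In the sketch below the baseline survival values and the coefficient are hypothetical; under the Cox relation $S_1(t) = S_0(t)^{\exp(\beta)}$, the transformed curves should differ by exactly $\beta$ at every time point.

```python
import math

def loglog(s):
    """The -ln(-ln S) transform used on the y-axis of the diagnostic plot."""
    return -math.log(-math.log(s))

s0 = [0.95, 0.80, 0.60, 0.40]               # hypothetical baseline survival
beta = 0.5                                  # hypothetical log hazard ratio
s1 = [s ** math.exp(beta) for s in s0]      # survival under proportional hazards

gaps = [loglog(a) - loglog(b) for a, b in zip(s0, s1)]
print([round(g, 6) for g in gaps])
```

Every gap equals beta, so the two transformed curves are vertical shifts of each other. Curves that converge, diverge or cross on this scale suggest that the hazard ratio changes over time.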
Examining time-varying covariates:
    xi: stcox employ address i.marital income i.custcat, tvc(employ)
    estimates store model1
    xi: quietly stcox employ address i.marital income i.custcat
    lrtest model1 .
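The likelihood-ratio comparison behind lrtest can be sketched by hand. The log-likelihood values below are hypothetical, not output from the telco data, and the sketch assumes the two models differ by a single parameter (the tvc(employ) term), so the statistic is referred to a chi-squared distribution with one degree of freedom.

```python
import math

def lr_test_1df(ll_full, ll_restricted):
    """Likelihood-ratio statistic and p-value for one extra parameter.

    For 1 degree of freedom the chi-squared upper tail equals
    erfc(sqrt(stat / 2)).
    """
    stat = max(2.0 * (ll_full - ll_restricted), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods for the models with and without the tvc term.
stat, p_value = lr_test_1df(-1000.0, -1002.0)
print(stat, round(p_value, 4))
```

A small p-value would suggest that the effect of employ changes with analysis time, that is, that the proportional hazards assumption is doubtful for that covariate.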
Exercise 1
Repeat the above analysis by treating customer category as stratifying variable instead of a covariate.
Techniques:
Exponential Smoothing
ARIMA Models
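Exponential smoothing with trend and multiplicative seasonal factors (each smoothing constant lies between 0 and 1):
Trend: $b_t = \gamma (S_t - S_{t-1}) + (1 - \gamma)\, b_{t-1}$
Seasonal factor: $L_t = \ldots$
Forecast: $F_{t+m} = (S_t + m\, b_t)\, L_{t-L+m}$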
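The recursions can be sketched in plain Python. This is a minimal illustration that assumes the textbook multiplicative Holt-Winters form: the level and seasonal-factor updates, the parameter names alpha, gamma and beta, the start-up values and the toy series are all assumptions, since only the trend and forecast equations are legible on the slide.

```python
def holt_winters_mult(y, period, alpha, gamma, beta, horizon):
    """Forecast `horizon` steps ahead (horizon <= period).

    S = smoothed level, b = trend, seas[k] = seasonal factor at time k.
    Level and seasonal updates follow the textbook form (assumed).
    """
    # Crude start-up values from the first two seasons (one of several conventions).
    first = sum(y[:period]) / period
    second = sum(y[period:2 * period]) / period
    S = first
    b = (second - first) / period
    seas = [v / first for v in y[:period]]
    for t in range(period, len(y)):
        S_prev = S
        S = alpha * y[t] / seas[t - period] + (1 - alpha) * (S_prev + b)
        b = gamma * (S - S_prev) + (1 - gamma) * b            # trend update
        seas.append(beta * y[t] / S + (1 - beta) * seas[t - period])
    n = len(y)
    # F_{t+m} = (S_t + m * b_t) * (seasonal factor from one season earlier)
    return [(S + m * b) * seas[n - period + m - 1] for m in range(1, horizon + 1)]

# Hypothetical series: constant level 100 with a two-period seasonal pattern.
y = [50, 150, 50, 150, 50, 150]
forecasts = holt_winters_mult(y, period=2, alpha=0.3, gamma=0.2, beta=0.4, horizon=2)
print(forecasts)
```

On this trend-free toy series the forecasts simply repeat the seasonal pattern, which is a quick sanity check on the indexing of the seasonal factors.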
Variables:
seasonal_facors_men: Seasonal Factors for Sales of Men's Clothing
year_: Year of the date
month_: Month of the date
Q-test for white noise: if the test is significant, the residuals are correlated.
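As a rough illustration of what a Q-test computes, the sketch below evaluates the Ljung-Box form of the statistic in plain Python. The function name and the toy residual series are hypothetical, and the Ljung-Box form is assumed here as one common version of the portmanteau test; under the white-noise null, Q is compared with a chi-squared distribution with max_lag degrees of freedom.

```python
def ljung_box_q(resid, max_lag):
    """Q = n (n + 2) * sum_{k=1..max_lag} rho_k^2 / (n - k)."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        ck = sum(dev[t] * dev[t - k] for t in range(k, n))
        rho_k = ck / c0                      # lag-k sample autocorrelation
        total += rho_k * rho_k / (n - k)
    return n * (n + 2) * total

# Hypothetical residuals that alternate in sign, i.e. strongly autocorrelated.
resid = [1.0, -1.0] * 10
q = ljung_box_q(resid, max_lag=1)
print(round(q, 4))
```

For this alternating series Q at lag 1 is about 20.9, far above the 5% chi-squared critical value of 3.84 for one degree of freedom, so the white-noise hypothesis would be rejected.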
[Figures: time-series plots of men against mdate (1988m1 to 2000m1), with a close-up of 1999m4 to 2000m1; y-axis from 0 to 30000]
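MA(q) model:
$X_t = \mu + A_t - \theta_1 A_{t-1} - \theta_2 A_{t-2} - \ldots - \theta_q A_{t-q}$
ARIMA(p, d, q) model:
$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) A_t$
where $L$ is the lag operator, $L^i X_t = X_{t-i}$, and $A_t$ is white noise. The MA coefficients enter with a minus sign in the MA(q) display and with a plus sign in the ARIMA display, so the $\theta_i$ of the two forms differ in sign.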
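The differencing factor $(1 - L)^d$ is easy to see on a toy series. The sketch below applies a seasonal difference at lag 12 followed by a first difference, which is presumably what the DS12.men label in the plots refers to; the helper name and the artificial series (linear trend plus a fixed monthly pattern) are hypothetical.

```python
def difference(x, lag=1):
    """Apply (1 - L^lag): return x[t] - x[t - lag] for every t >= lag."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

# Hypothetical monthly series: linear trend plus a repeating 12-month pattern.
season = [5, -3, 2, 0, 1, -4, 6, -2, 3, -1, -5, -2]
x = [10 + 2 * t + season[t % 12] for t in range(48)]

s12 = difference(x, 12)      # seasonal difference removes the monthly pattern
ds12 = difference(s12, 1)    # first difference removes the remaining trend
print(s12[:3], ds12[:3])
```

The seasonal difference leaves a constant (24, the yearly increase from the trend) and the first difference then reduces it to zero, so what remains in a real series after both steps is the part an ARMA model is meant to describe.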
[Figures: time-series plot of men with a correlogram (lags up to 25); plot of the differenced series DS12.men with a correlogram (lags up to 25); a further correlogram with lags up to 50; time axes run from 1989m1 to 1998m1]
    predict fit1, xb
    predict fit2, xb dyn(m(1990m1))
    line fit1 fit2 men mdate
[Figure: men plotted against mdate (1989m1 to 1999m1) together with the one-step xb prediction and the dynamic xb prediction, dyn(m(1990m1))]
$y_{it} = \alpha + \beta x_{it} + \mu_i + v_{it}$
$\mu_i$ is the fixed or random effect; $v_{it}$ is the overall residual.
Panel data: also called cross-sectional time-series data; multiple cases (people, nations, firms, etc.) are observed for two or more time periods.
Cross-sectional information: differences between subjects (between-subject effects).
Time-series information: changes within subjects over time (within-subject effects).
Regression with between effects is the model to use when you want to control for omitted variables that change over time but are constant between cases.
STATA command xtreg with the be option
Random effects model: if some omitted variables are constant over time but vary between cases, while others are fixed between cases but vary over time, you can account for both types by using random effects.
STATA command xtreg with the re option
Data: airfare.dta
Exercise 2: Keep the observations for id=1 and id=2 only. Run a separate regression for each flight route. Draw scatter plots to observe how the air fare changes over the years for these two routes.
The fixed effects model answers: what is the effect of x when x changes within routes over time?
The between effects model answers: what is the effect of x when x differs between routes?
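The difference between the two questions can be seen on a tiny hand-made panel. The data and helper names below are hypothetical; the sketch computes the within (fixed effects) slope from route-demeaned data and the between slope from route means, using a single regressor and simple least squares.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical panel: route id -> [(x, y), ...] observed over three years.
panel = {
    1: [(1, 10), (2, 9), (3, 8)],
    2: [(4, 20), (5, 19), (6, 18)],
}

x_within, y_within = [], []   # deviations from each route's own mean
x_means, y_means = [], []     # one pair of means per route
for obs in panel.values():
    mx = sum(x for x, _ in obs) / len(obs)
    my = sum(y for _, y in obs) / len(obs)
    x_within += [x - mx for x, _ in obs]
    y_within += [y - my for _, y in obs]
    x_means.append(mx)
    y_means.append(my)

fe_slope = ols_slope(x_within, y_within)   # effect of x changing within a route
be_slope = ols_slope(x_means, y_means)     # effect of x differing between routes
print(fe_slope, round(be_slope, 4))
```

Here y falls as x rises within each route (slope -1), while the route with the higher average x also has the higher average y (slope about 3.33), so the two estimators can legitimately disagree, even in sign.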
                   Source of variation     Coefficients
                   Between    Within       ldistsq   lpassen   bmktshr
Fixed Effects      No         Yes          /         -0.316    0.0647
Between Effects    Yes        No           0.034     -0.066    0.316
Random Effects     Yes        Yes          0.029     -0.2235   0.096
To correct for autocorrelation, we fit an FE model with AR(1) disturbances, using the STATA command xtregar.
In this data we suspect, based on the results from Exercise 1, that there is a period effect: after 1997, air fares increase on every flight route. Such a systematic shock introduces endogeneity, so the FE estimator would be biased. An intuitive way to solve this problem is to create dummy variables for the years t > 1997.
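Creating the period dummies amounts to adding one 0/1 indicator per year after the base year. The sketch below assumes, hypothetically, that the panel covers 1997 to 2000 with 1997 as the base year; the function and column names are illustrative only.

```python
def year_dummies(year, later_years=(1998, 1999, 2000)):
    """Return 0/1 indicators for each year after the 1997 base year."""
    return {f"y{y}": int(year == y) for y in later_years}

# Hypothetical rows for one route, one per year.
rows = [{"id": 1, "year": y} for y in (1997, 1998, 1999, 2000)]
for row in rows:
    row.update(year_dummies(row["year"]))

for row in rows:
    print(row)
```

The base year has all indicators equal to zero, so each dummy coefficient in the FE regression would measure the common shift in fares for that year relative to 1997.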
Thanks!