STATA Training Session 3

Advanced Topics in STATA

Sun Li
Centre for Academic Computing
lsun@smu.edu.sg

Outline
Resources And Books
Survival Analysis
  Kaplan-Meier Estimator
  Cox Regression
Time Series and Forecasting
  Exponential Smoothing
  ARIMA Models
Introduction to Panel Regression with STATA

Resources And Books

CAC Computing Resources for STATA users
Windows:
  STATA/SE version 10.0, 10-user network perpetual license
  Installation guide (http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STAT A-Software Questions.aspx)
Linux CAC Beowulf Cluster:
  STATA/SE version 10.0, unlimited users
  About CAC Beowulf Cluster: (http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)
New features in STATA 10.0 (http://www.stata.com/stata10)

Website resources:
  The STATA website: http://www.stata.com
  The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
  STATA FAQ: http://www.stata.com/support/faqs
  STATA User Support: http://www.stata.com/support
  Books: http://www.stata.com/bookstore/
CAC STATA support:
  Website: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
  Contact:
    For statistical consultation: Sun Li, lsun@smu.edu.sg
    For software installation: TAN Suh Wen, swtan@smu.edu.sg

Additional recommended readings:
  Econometric Analysis of Cross Section and Panel Data, Jeffrey M. Wooldridge
  An Introduction to Modern Econometrics Using Stata, Christopher F. Baum
  New Introduction to Multiple Time Series Analysis, Helmut Lütkepohl
  Applied Survival Analysis: Regression Modeling of Time to Event Data, 2nd Edition, David W. Hosmer, Jr., Stanley Lemeshow, and Susanne May
  An Introduction to Survival Analysis Using Stata, Revised Edition, Mario Cleves, William W. Gould, and Roberto G. Gutierrez
Download training slides, data and syntax: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx
Survival Analysis
Survival Data
Data: survival data are time-to-event data, i.e. quantitative data measuring the time from a well-defined time origin until the occurrence of a particular event of interest (the endpoint).
Reasons for using a survival model:
  The distribution of survival times tends to be positively skewed and is unlikely to be normal, and it may not be possible to find a transformation that makes it normal.
  Standard regression models cannot handle time-varying covariates.
  Some durations are censored.
Censored observations: the event had not occurred by the end of the study; the subject was lost to follow-up or withdrew from the study; another intervention was offered; the event occurred but from an unrelated cause; etc.
Survival Model
Survival function:
$S(t) = P(T > t) = 1 - F(t)$
Hazard function:
$h(t) = \frac{f(t)}{S(t)} \;\Rightarrow\; -d\log S(t) = h(t)\,dt$
$S(t) = \exp(-H(t))$, where $H(t)$ is the cumulative hazard function.
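The link between the hazard, the cumulative hazard and the survival function can be written out step by step. This is a short derivation sketch, assuming $T$ is continuous with density $f(t) = -S'(t)$ and $S(0) = 1$:

```latex
\begin{align*}
h(t) &= \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t) \\
H(t) &= \int_0^t h(u)\,du = -\log S(t) + \log S(0) = -\log S(t) \\
S(t) &= \exp\bigl(-H(t)\bigr)
\end{align*}
```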
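Kaplan-Meier Estimator:
$\hat{S}(t) = \prod_{j \,:\, t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right)$
$d_j$: the number of individuals who experience the event at time $t_{(j)}$
$n_j$: the number of individuals still at risk (who have not yet experienced the event or been censored) just before time $t_{(j)}$
$t_{(1)} < t_{(2)} < \dots < t_{(n)}$ are the ordered event times.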
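A small worked example may help to fix the product-limit idea. The data are hypothetical and chosen only for illustration: five subjects with times 2, 3+, 5, 5 and 8+, where + marks a censored time. The subject censored at 3 leaves the risk set before the second event time, and the censoring at 8 does not change the estimate.

```latex
\begin{align*}
t_{(1)} = 2,\ n_1 = 5,\ d_1 = 1: \quad \hat{S}(2) &= 1 - \tfrac{1}{5} = 0.8 \\
t_{(2)} = 5,\ n_2 = 3,\ d_2 = 2: \quad \hat{S}(5) &= 0.8 \times \left(1 - \tfrac{2}{3}\right) = \tfrac{4}{15} \approx 0.267
\end{align*}
```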
Cox Regression:
$h_i(t) = h_0(t)\exp(\beta^T x_i)$
$\Rightarrow S_i(t) = S_0(t)^{\exp(\beta^T x_i)}$
$\Rightarrow \log H_i(t) = \log H_0(t) + \beta^T x_i$
$h_0(t)$ is the baseline hazard function.
$\exp(\beta^T (x_i - x_j))$ is the hazard ratio (HR), or incidence rate ratio.
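As an illustration of how the hazard ratio is read, suppose two subjects differ by one unit in a single covariate $k$ and are identical otherwise. The coefficient value $-0.05$ below is hypothetical and is not taken from any fitted model in these slides:

```latex
\begin{align*}
\mathrm{HR} &= \frac{h_i(t)}{h_j(t)} = \exp\bigl(\beta^T (x_i - x_j)\bigr) = \exp(\beta_k) \\
\beta_k = -0.05 &\;\Rightarrow\; \mathrm{HR} = \exp(-0.05) \approx 0.951
\end{align*}
```

Under that assumed value, each additional unit of the covariate would lower the hazard by about 5% at every time point, because the baseline hazard $h_0(t)$ cancels in the ratio.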
Survival Analysis in STATA
Data: telco.csv
Variable name   Variable information
age             Age in years
marital         Marital status: 0 = unmarried, 1 = married
address         Years at current address
income          Household income in thousands
ed              Level of education: 1 = didn't complete high school, 2 = high school degree, 3 = college degree, 4 = undergraduate, 5 = postgraduate
employ          Years with current employer
reside          Number of people in household
gender          Gender: 0 = male, 1 = female
tenure          Months with service
churn           Churn within last month: 0 = no, 1 = yes
custcat         Customer category: 1 = basic service, 2 = E-service, 3 = plus service, 4 = total service
Declaring and summarizing survival-time data:
insheet using telco.csv
d
stset tenure, failure(churn)
stset creates four variables:
  _st: 1 if the record is to be used, 0 if ignored
  _d: 1 if failure, 0 if censored
  _t: analysis time when record ends
  _t0: analysis time when record begins
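One way to check the declaration is to look at the variables that stset created. The lines below are a minimal sketch, assuming stset has already been run on the telco data as shown above; the choice of the first five records is arbitrary.

```stata
* Inspect the variables created by stset for the first five records
list tenure churn _st _d _t _t0 in 1/5

* Summarize how the survival data are set up (subjects, records, failures)
stdescribe
```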
stsum
ltable tenure churn
sts graph, by(custcat)
[Figure: Kaplan-Meier survival estimates by custcat (1 to 4); x-axis: analysis time (0 to 80); y-axis: 0.00 to 1.00]
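The Kaplan-Meier curves by customer category can be complemented with a formal test of equality of the survivor functions. This is a sketch that assumes the data are still stset as above; which test is more appropriate depends on where the curves differ.

```stata
* Log-rank test of equality of survivor functions across customer categories
sts test custcat

* Wilcoxon (Breslow) test, which gives more weight to early failure times
sts test custcat, wilcoxon
```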
Fitting regression models:
sw stcox age marital address income ed employ retire, pe(0.05)
xi: stcox employ address marital income i.custcat
test _Icustcat_2 _Icustcat_3 _Icustcat_4
char marital[omit] 1
char custcat[omit] 4
xi: stcox employ address i.marital income i.custcat, basesurv(s) basehc(h)
stcurve, survival
stcurve, hazard
stcurve, survival at1(_Icustcat_1=1) at2(_Icustcat_2=1) at3(_Icustcat_3=1)
stcurve, hazard at1(_Icustcat_1=1) at2(_Icustcat_2=1) at3(_Icustcat_3=1)
[Figure: two "Cox proportional hazards regression" survival plots; x-axis: analysis time (0 to 80); y-axis: Survival (.5 to 1); the second plot shows separate curves for _Icustcat_1=1, _Icustcat_2=1 and _Icustcat_3=1]
Examining the proportional hazards assumption:
stphplot, by(custcat)
The proportional-hazards assumption is not violated when the curves are parallel.
[Figure: -ln[-ln(Survival Probability)] against ln(analysis time), by custcat (1 to 4)]
Examining time-varying covariates:
xi: stcox employ address i.marital income i.custcat, tvc(employ)
estimates store model1
xi: quietly stcox employ address i.marital income i.custcat
lrtest model1 .
Exercise 1
Repeat the above analysis by treating customer category as stratifying variable instead of a covariate.
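A possible starting point for the exercise is sketched below. It assumes the same covariates as the earlier Cox models and uses the strata() option of stcox, so that each customer category gets its own baseline hazard and custcat no longer enters as a covariate.

```stata
* Stratified Cox model: a separate baseline hazard for each customer category,
* so the i.custcat term is dropped from the covariate list
xi: stcox employ address i.marital income, strata(custcat)

* Kaplan-Meier curves by customer category, for comparison
sts graph, by(custcat)
```

Because the strata share the coefficients but not the baseline hazard, no hazard ratio is reported for custcat in this specification.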
Time Series Analysis & Forecasting

Definitions, Applications and Techniques
Time series data: each case represents a point in time, and each cell gives the value of one variable for one time period.
Stationarity: a stationary process has the property that the mean, variance and autocorrelation structure do not change over time.
Seasonality: periodic fluctuations in the series.
Time series models are used:
  to obtain an understanding of the underlying forces and structure that produced the observed data;
  to fit a model and proceed to forecasting and monitoring.
Techniques:
  Exponential Smoothing
  ARIMA Models

Exponential Smoothing
Four available model types:
  Simple. The simple model assumes that the series has no trend and no seasonal variation.
  Holt. The Holt model assumes that the series has a linear trend and no seasonal variation.
  Winters. The Winters model assumes that the series has a linear trend and multiplicative seasonal variation (the magnitude of the seasonal variation increases or decreases with the overall level of the series).
  Custom. A custom model allows you to specify the trend and seasonality components.

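General form of models:
Single Exponential Smoothing:
$S_t = \alpha \sum_{i=1}^{t-2} (1-\alpha)^{i-1} y_{t-i} + (1-\alpha)^{t-2} S_2, \quad t \ge 2$
Double Exponential Smoothing:
$S_t = \alpha y_t + (1-\alpha)(S_{t-1} + b_{t-1}), \quad 0 \le \alpha \le 1$
$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}, \quad 0 \le \gamma \le 1$
Triple Exponential Smoothing:
$S_t = \alpha \frac{y_t}{L_{t-L}} + (1-\alpha)(S_{t-1} + b_{t-1})$ (overall smoothing)
$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}$ (trend smoothing)
$L_t = \beta \frac{y_t}{S_t} + (1-\beta) L_{t-L}$ (seasonal smoothing)
$F_{t+m} = (S_t + m b_t) L_{t-L+m}$ (forecast)
Here $L_t$ is the seasonal index and the subscript offset $L$ is the number of periods in a season.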

Example
Data: seasfac.csv
Variable name          Variable information
date                   Date
men                    Sales of men's clothing
mail                   Number of catalogs mailed
page                   Number of pages in catalog
phone                  Number of phone lines open for ordering
print                  Amount spent on print advertising
seasonal_factors_men   Seasonal factors for sales of men's clothing
year_                  Year of the date
month_                 Month of the date

Step 1: Understand your data
insheet using seasfac.csv
d
gen mdate=ym(year_, month_)
scatter men mdate, c(l) sort xlabel(, grid) ylabel(, grid)
[Figure: men (0 to 40000) against mdate (1989m1 to 1998m1), connected line]

Step 2: Declare time series data & test the stationarity assumption
tsset mdate, monthly
list men d.men l.men in 1/10
dfuller men, regress trend
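With monthly data, the plain Dickey-Fuller regression may leave serial correlation in its residuals. A hedged variant is the augmented test with lagged differences; the choice of 12 lags (one year) below is an assumption made for illustration, not a recommendation from the slides.

```stata
* Augmented Dickey-Fuller test with 12 lagged differences and a trend term
dfuller men, lags(12) regress trend
```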

Step 3: Detect seasonality with autocorrelations
reg d.men l.men
predict res1, r
corrgram res1
wntestq res1
Both the autocorrelation and the partial autocorrelation charts show the annual seasonal structure of the time series.
Q-test for white noise: if the test is significant, the residuals are autocorrelated.

Step 4: Construct Holt-Winters seasonal smoothing model
Single-exponential smoothing
tssmooth exponential men1=men
Holt-Winters nonseasonal smoothing
tssmooth hwinters men2=men, from(.1 .1) iterate(100)
Holt-Winters seasonal smoothing
tssmooth shwinters men3=men, sn0_0(seasonal_factors_men) from(.1 .1 .1) iterate(100)

line men1 men2 men3 men mdate
[Figure: men1, men2, men3 and men against mdate (1988m1 to 2000m1). Legend: parms(0.1057) = men; hw parms(0.000 0.000) = men; shw parms(0.013 0.138 0.000) = men; men]

Step 5: Predictions
line men3 men mdate
line men3 men mdate if mdate>467
[Figure: two plots of shw parms(0.013 0.138 0.000) = men and men against mdate, one for the full period (1988m1 to 2000m1) and one for 1999m1 to 2000m1]
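To obtain smoothed values beyond the end of the sample, the forecast() option of tssmooth can be added to the seasonal Holt-Winters command. The sketch below is an assumption about how such a forecast could be requested: men4 is a hypothetical new variable name and the 12-month horizon is chosen only for illustration.

```stata
* Seasonal Holt-Winters smoothing with 12 out-of-sample forecast periods
* (men4 and the 12-month horizon are illustrative assumptions)
tssmooth shwinters men4=men, sn0_0(seasonal_factors_men) from(.1 .1 .1) iterate(100) forecast(12)

* Plot only the periods after 1998m12; tm(1998m12) equals 467
line men4 men mdate if mdate>tm(1998m12)
```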

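ARIMA Model
ARIMA(p, d, q)
  Autoregression (AR): p is the order of autoregression
  Integration (I): d is the order of integration (differencing)
  Moving-Average (MA): q is the order of moving-average
AR(p) model:
$X_t = \delta + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + A_t$
MA(q) model:
$X_t = \mu + A_t + \theta_1 A_{t-1} + \theta_2 A_{t-2} + \dots + \theta_q A_{t-q}$
ARIMA(p, d, q) model:
$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) A_t$
where $L$ is the lag operator and $A_t$ is white noise.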

Example
Step 1: Identification of orders of ARIMA model
SHAPE (of the autocorrelation plot) | INDICATED MODEL
Exponential, decaying to zero | Autoregressive model. Use the partial autocorrelation plot to identify the order of the autoregressive model.
Alternating positive and negative, decaying to zero | Autoregressive model. Use the partial autocorrelation plot to help identify the order.
One or more spikes, rest are essentially zero | Moving average model, order identified by where plot becomes zero.
Decay, starting after a few lags | Mixed autoregressive and moving average model.
All zero or close to zero | Data is essentially random.
High values at fixed intervals | Include seasonal autoregressive term.
No decay to zero | Series is not stationary.

ac men, lags(24)
scatter men mdate, c(l)
[Figure: autocorrelations of men by lag, with Bartlett's formula for MA(q) 95% confidence bands; men against mdate (1989m1 to 1998m1)]

scatter ds12.men mdate, c(l)
ac s12.men, lags(24)
pac s12.men, lags(24)
[Figure: DS12.men against mdate (1989m1 to 1998m1); autocorrelations of S12.men by lag, with Bartlett's formula for MA(q) 95% confidence bands; partial autocorrelations of S12.men by lag, with 95% confidence bands [se = 1/sqrt(n)]]

We have an ARIMA(0,0,0)(0,0,1,12) model.
Step 2: Estimation & Diagnostics
arima men, arima(0,0,0) sarima(0,0,1,12) noconstant

arima men mail page phone print service, arima(0,0,0) sarima(0,0,1,12) noconstant

predict res, res
corrgram res, lags(36)
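Written out, the specification fitted above has no nonseasonal terms and a single seasonal moving-average term at lag 12. The symbols $\Theta_1$ (seasonal MA coefficient), $z_t$ (the vector of regressors mail, page, phone, print, service) and $u_t$ (the disturbance) are notation introduced here for illustration:

```latex
\begin{align*}
\text{without regressors:} \quad X_t &= (1 + \Theta_1 L^{12})\,A_t = A_t + \Theta_1 A_{t-12} \\
\text{with regressors:} \quad X_t &= \beta^T z_t + u_t, \qquad u_t = A_t + \Theta_1 A_{t-12}
\end{align*}
```

If the model is adequate, the residuals should show no remaining autocorrelation at lags 12, 24 and 36 in the correlogram.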

Step 3: Prediction
predict fit, xb
line fit men mdate
[Figure: xb prediction, one-step, and men against mdate (1989m1 to 1998m1)]

arima men mail page phone print service if tin(, 1990m1), arima(0,0,0) sarima(0,0,1,12) noconstant
predict fit1, xb
predict fit2, xb dyn(m(1990m1))
line fit1 fit2 men mdate
[Figure: xb prediction, one-step; xb prediction, dyn(m(1990m1)); and men against mdate (1989m1 to 1999m1)]

Panel Regression Model (Linear)
$y_{it} = \alpha + \beta x_{it} + u_i + v_{it}$
$u_i$ is the fixed or random effect; $v_{it}$ is the overall residual.
Panel data: also called cross-sectional time series data; multiple cases (people, nations, firms, etc.) are observed for two or more time periods.
Cross-sectional information: differences between subjects (between-subject effects).
Time series information: changes within subjects over time (within-subject effects).

Data layout: one row per case-period ($it$ = 11, 12, ..., 1t, 21, 22, ..., 2t, 31, 32, ..., 3t, ..., nt) and one column per variable (x1, x2, x3, ..., xj).

Fixed, Between and Random Effect Models (Linear)
Fixed effects regression is the model to use when you want to control for omitted variables that differ between cases but are constant over time.
  STATA command: xtreg with the fe option
Regression with between effects is the model to use when you want to control for omitted variables that change over time but are constant between cases.
  STATA command: xtreg with the be option
Random effects model: if some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, you can include both types by using random effects.
  STATA command: xtreg with the re option

Example
Data: airfare.dta
Variable name   Variable information
year            1997, 1998, 1999, 2000
origin          Flight's origin
destin          Flight's destination
id              Route identifier
dist            Distance, in miles
passen          Avg. passengers per day
fare            Avg. one-way fare, $
bmktshr         Fraction of market, biggest carrier

reg lfare ldistsq lpassen bmktshr
Exercise 2: Keep the observations for id=1 and id=2 only. Run a separate regression for each flight route. Draw scatter plots to observe how the air fare changes over the years for these two routes.

Define the dataset as panel data:
tsset id year
Summarize the panel data:
xtsum lfare ldistsq lpassen bmktshr
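Before fitting panel models it can be useful to check whether the panel is balanced. The command below is a small sketch, assuming the panel has been declared with id and year as above.

```stata
* Describe the participation pattern of the panel (routes, years per route, gaps)
xtdescribe
```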

Fixed effects:
Regression with fixed effects command:
xtreg lfare ldistsq lpassen bmktshr, fe
The fixed effects model answers: what is the effect of x when x changes within routes over time?

Between effects:
Regression with between effects command:
xtreg lfare ldistsq lpassen bmktshr, be
The between effects model answers: what is the effect of x when x differs between routes?

Random effects:
Regression with random effects command:
$(y_{it} - \theta \bar{y}_i) = (1-\theta)\alpha + \beta (x_{it} - \theta \bar{x}_i) + \{(1-\theta) u_i + (v_{it} - \theta \bar{v}_i)\}$

$\theta$ is a function of the variances of $u_i$ and $v_{it}$.
The random effects model answers:
1. what is the effect of x when x changes within routes over time?
2. what is the effect of x when x differs between routes?

xtreg lfare ldistsq lpassen bmktshr, re theta
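The weight $\theta$ reported by the theta option can be written explicitly. The expression below is a sketch for a balanced panel with $T$ periods per route, where $\sigma_u^2$ and $\sigma_v^2$ denote the variances of the route effect and of the overall residual (notation introduced here):

```latex
\[
\theta = 1 - \sqrt{\frac{\sigma_v^2}{T\sigma_u^2 + \sigma_v^2}}
\]
```

When $\sigma_u^2 = 0$, $\theta = 0$ and the random effects estimator reduces to pooled OLS; as $T\sigma_u^2$ grows large relative to $\sigma_v^2$, $\theta$ approaches 1 and the estimator approaches the fixed effects (within) estimator.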
Source of variation and coefficients
                 Btw Variation   Within Variation   ldistsq   lpassen   bmktshr
Fixed Effects    No              Yes                /         -0.316    0.0647
Btw Effects      Yes             No                 0.034     -0.066    0.316
Random Effects   Yes             Yes                0.029     -0.2235   0.096
Question: what's the right model, fixed or random effects?

Hausman Test:
H0: the coefficients estimated by the RE estimator are the same as those estimated by the FE estimator.
xtreg lfare ldistsq lpassen bmktshr, fe
estimates store fixed
xtreg lfare ldistsq lpassen bmktshr, re
estimates store random
hausman fixed random
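The statistic behind the test compares the two coefficient vectors, weighted by the difference of their estimated covariance matrices. In the sketch below, $\hat{\beta}_{FE}$ and $\hat{\beta}_{RE}$ contain only the coefficients estimated by both models, and $k$ is their number (notation introduced here):

```latex
\[
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})^T
    \left[\widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE})\right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE}) \;\sim\; \chi^2_k \quad \text{under } H_0
\]
```

A significant statistic is evidence against the random effects specification and in favor of fixed effects.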

Breusch and Pagan LM test:
H0: sd(ui) = 0, where sd(ui) is the standard deviation of the ui terms
xttest0

Test for serial correlation:
findit xtserial
xtserial lfare ldistsq lpassen bmktshr, output

Exercise 3
1. To correct for autocorrelation, fit an FE model with AR(1) disturbances. Run this model in STATA using the command xtregar.
2. In these data we suspect, based on the results from Exercise 2, that there is a period effect, i.e., after 1997 the air fare increases on every flight route. Such a systematic shock introduces endogeneity, and the FE estimator would be biased. An intuitive way to solve this problem is to create dummy variables for the years t > 1997.
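The two parts of the exercise could be approached along the following lines. This is only a sketch: it assumes the panel is declared with id and year, that year takes the values 1997 to 2000, and that the xi prefix is used to create the year dummies (it omits the first year, 1997, by default).

```stata
* 1. Fixed-effects model with AR(1) disturbances
xtregar lfare ldistsq lpassen bmktshr, fe

* 2. Fixed-effects model with year dummies; xi omits 1997 and creates
*    _Iyear_1998, _Iyear_1999 and _Iyear_2000
xi: xtreg lfare ldistsq lpassen bmktshr i.year, fe

* Joint test that the year effects are zero
testparm _Iyear_*
```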
Thanks!
