
STATA Training Session 3

Advanced Topics in STATA


Sun Li Centre for Academic Computing lsun@smu.edu.sg

Outline
Resources And Books
Survival Analysis
    Kaplan-Meier Estimator
    Cox Regression
Time Series and Forecasting
    Exponential Smoothing
    ARIMA Models
Introduction to Panel Regression with STATA

Resources And Books

CAC Computing Resources for STATA users

Windows:
    STATA/SE version 10.0
    10-user network perpetual license
    Installation guide (http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STAT A-Software Questions.aspx)

Linux CAC Beowulf Cluster:
    STATA/SE version 10.0
    Unlimited users
    About CAC Beowulf Cluster: (http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)

New features in STATA 10.0 (http://www.stata.com/stata10)

Website resources:
    The STATA website: http://www.stata.com
    The STATA journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
    STATA FAQ: http://www.stata.com/support/faqs
    STATA User Support: http://www.stata.com/support
    Books: http://www.stata.com/bookstore/

CAC STATA support:
    Website: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
    Contact:
        For statistical consultation: Sun Li: lsun@smu.edu.sg
        For software installation: TAN Suh Wen: swtan@smu.edu.sg

Additional recommended readings:
    Econometric Analysis of Cross Section and Panel Data, Jeffrey M. Wooldridge
    An Introduction to Modern Econometrics Using Stata, Christopher F. Baum
    New Introduction to Multiple Time Series Analysis, Helmut Lütkepohl
    Applied Survival Analysis: Regression Modeling of Time to Event Data, 2nd Edition, David W. Hosmer, Jr., Stanley Lemeshow, and Susanne May
    An Introduction to Survival Analysis Using Stata, Revised Edition, Mario Cleves, William W. Gould, and Roberto G. Gutierrez

Download training slides, data and syntax: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx

Survival Analysis
Survival Data
Data: survival data is time-to-event data, i.e. quantitative data measuring the time from a well-defined time origin until the occurrence of a particular event of interest (the endpoint).
Reasons for using a survival model rather than ordinary regression:
    The distribution of survival times tends to be positively skewed and is unlikely to be normal, and it may not be possible to find a transformation that makes it normal.
    Ordinary regression cannot handle time-varying covariates.
    Some durations are censored.

Censored observations: the event has not occurred by the end of the study; the subject is lost to follow-up; the subject withdraws from the study; another intervention is offered; the event occurs but from an unrelated cause; etc.

Survival Model
Survival function:
$S(t) = P(T > t) = 1 - F(t)$
Hazard function:
$h(t) = \frac{f(t)}{S(t)}$
$\Rightarrow \; d\log S(t) = -h(t)\,dt$
$S(t) = \exp(-H(t))$, where $H(t) = \int_0^t h(u)\,du$ is the cumulative hazard function.

Kaplan-Meier Estimator:
$\hat{S}(t) = \prod_{j \,:\, t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right)$
$d_j$: the number of individuals who experience the event at time $t_{(j)}$
$n_j$: the number of individuals who have not yet experienced the event at time $t_{(j)}$ (those still at risk)
$t_{(1)} < t_{(2)} < \dots < t_{(n)}$ are the ordered event times.
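The product-limit idea can be checked by hand on a few records. The following is a minimal Python sketch, not the STATA implementation: at each distinct event time it multiplies the running estimate by one minus the share of at-risk individuals who fail. The function name and the toy durations are hypothetical, and records censored at an event time are assumed to be still at risk at that time.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t_j, S_hat(t_j))] at each distinct event time.

    times  -- observed durations
    events -- 1 if the event occurred at that time, 0 if censored
    """
    failures = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t_j in sorted(failures):
        n_j = sum(1 for t in times if t >= t_j)  # still at risk at t_j
        d_j = failures[t_j]                      # events at t_j
        survival *= 1.0 - d_j / n_j
        curve.append((t_j, survival))
    return curve


# Hypothetical toy data: durations in months, event = 0 marks a censored record.
toy_times = [2, 3, 3, 5, 8, 8, 9]
toy_events = [1, 1, 0, 1, 1, 0, 0]
km_curve = kaplan_meier(toy_times, toy_events)
for t_j, s_hat in km_curve:
    print(t_j, round(s_hat, 4))
```

Censored records never trigger a drop in the curve, but they do shrink the risk set for later event times, which is why the later steps are larger.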

Cox Regression:
$h_i(t) = h_0(t)\exp(\beta^T x_i)$
$\Rightarrow S_i(t) = S_0(t)^{\exp(\beta^T x_i)}$
$\Rightarrow \log H_i(t) = \log H_0(t) + \beta^T x_i$
$h_0(t)$ is the baseline hazard function.
$\exp(\beta^T (x_i - x_j))$ is the hazard ratio (HR) or incidence rate ratio.
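As a small worked example of the hazard ratio, the Python sketch below evaluates exp(beta'(x_i - x_j)) for two covariate profiles and raises a baseline survival probability to the power exp(beta'x). The coefficient values are hypothetical and are not estimates from telco.csv.

```python
import math


def hazard_ratio(beta, x_i, x_j):
    """exp(beta' (x_i - x_j)) for two covariate vectors."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_i, x_j)))


def survival_given_baseline(s0, beta, x):
    """S_i(t) = S_0(t) ** exp(beta' x), evaluated at one time point."""
    return s0 ** math.exp(sum(b * v for b, v in zip(beta, x)))


# Hypothetical coefficients for two covariates (illustration only).
beta = [-0.05, 0.40]
hr = hazard_ratio(beta, [10, 1], [0, 0])
print(round(hr, 4))
print(round(survival_given_baseline(0.8, beta, [10, 1]), 4))
```

A hazard ratio below 1 means the first profile fails at a lower rate than the second at every time point, so its survival curve lies above the baseline.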

Survival Analysis in STATA
Data: telco.csv

Variable name   Variable information
age             Age in years
marital         Marital status: 0 = unmarried, 1 = married
address         Years at current address
income          Household income in thousands
ed              Level of education: 1 = didn't complete high school, 2 = high school degree, 3 = college degree, 4 = undergraduate, 5 = postgraduate
employ          Years with current employer
reside          Number of people in household
gender          Gender: 0 = male, 1 = female
tenure          Months with service
churn           Churn within last month: 0 = No, 1 = Yes
custcat         Customer category: 1 = basic service, 2 = E-service, 3 = plus service, 4 = total service

Declaring and summarizing survival-time data:
    insheet using telco.csv
    d
    stset tenure, failure(churn)

Variables created by stset:
    _st: 1 if the record is to be used, 0 if ignored
    _d: 1 if failure, 0 if censored
    _t: analysis time when record ends
    _t0: analysis time when record begins

    stsum
    ltable tenure churn
    sts graph, by(custcat)

[Figure: Kaplan-Meier survival estimates by custcat (1 to 4); x-axis: analysis time, 0 to 80; y-axis: 0.00 to 1.00]

Fitting regression models:
    sw stcox age marital address income ed employ retire, pe(0.05)
    xi: stcox employ address marital income i.custcat
    test _Icustcat_2 _Icustcat_3 _Icustcat_4

Changing the omitted (reference) categories and saving the baseline functions:
    char marital[omit] 1
    char custcat[omit] 4
    xi: stcox employ address i.marital income i.custcat, basesurv(s) basehc(h)

    stcurve, survival
    stcurve, hazard
    stcurve, survival at1(_Icustcat_1=1) at2(_Icustcat_2=1) at3(_Icustcat_3=1)
    stcurve, hazard at1(_Icustcat_1=1) at2(_Icustcat_2=1) at3(_Icustcat_3=1)

[Figures: Cox proportional hazards regression, survival against analysis time (0 to 80); one panel for the overall curve, one panel with curves for _Icustcat_1=1, _Icustcat_2=1 and _Icustcat_3=1]

Examining the proportional hazards assumption:
    stphplot, by(custcat)
The proportional-hazards assumption is not violated when the curves are parallel.

[Figure: -ln[-ln(Survival Probability)] against ln(analysis time), by custcat (1 to 4)]
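Why parallel curves indicate proportional hazards can be seen numerically. In the Python sketch below, the Weibull-type baseline and the hazard ratio of 2 are assumptions chosen for illustration. Under proportional hazards the second group's survival is the baseline raised to the power HR, so the vertical gap between the two -ln(-ln S) curves is the same at every time point.

```python
import math


def neg_log_neg_log(s):
    """-ln(-ln S), the transformation on the vertical axis of the plot."""
    return -math.log(-math.log(s))


# Assumed baseline S_0(t) = exp(-(t/50)**1.5) and assumed hazard ratio.
HR = 2.0
gaps = []
for t in [5, 10, 20, 40, 80]:
    s0 = math.exp(-((t / 50) ** 1.5))
    s1 = s0 ** HR  # proportional hazards: S_1(t) = S_0(t) ** HR
    gaps.append(neg_log_neg_log(s0) - neg_log_neg_log(s1))

print([round(g, 6) for g in gaps])
print(round(math.log(HR), 6))
```

Every gap equals ln(HR), so the curves are vertical shifts of each other. Curves that cross or fan out suggest that the hazard ratio changes with time.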

Examining time-varying covariates:
    xi: stcox employ address i.marital income i.custcat, tvc(employ)
    estimates store model1
    xi: quietly stcox employ address i.marital income i.custcat
    lrtest model1 .

Exercise 1
Repeat the above analysis, treating customer category as a stratifying variable instead of a covariate.

Time Series Analysis & Forecasting

Definitions, Applications and Techniques
Time series data: each case represents a point in time, and each cell gives the value of one variable for one time period.
Stationarity: a stationary process has the property that its mean, variance and autocorrelation structure do not change over time.
Seasonality: periodic fluctuations in the series.

Time series models are used:
    to obtain an understanding of the underlying forces and structure that produced the observed data;
    to fit a model and proceed to forecasting and monitoring.

Techniques:
    Exponential Smoothing
    ARIMA Models

Exponential Smoothing
Four available model types:
    Simple. The simple model assumes that the series has no trend and no seasonal variation.
    Holt. The Holt model assumes that the series has a linear trend and no seasonal variation.
    Winters. The Winters model assumes that the series has a linear trend and multiplicative seasonal variation (its magnitude increases or decreases with the overall level of the series).
    Custom. A custom model allows you to specify the trend and seasonality components.

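General form of models:

Single Exponential Smoothing:
$S_t = \alpha \sum_{i=1}^{t-2} (1-\alpha)^{i-1} y_{t-i} + (1-\alpha)^{t-2} S_2, \quad t \ge 2$

Double Exponential Smoothing:
$S_t = \alpha y_t + (1-\alpha)(S_{t-1} + b_{t-1}), \quad 0 \le \alpha \le 1$
$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}, \quad 0 \le \gamma \le 1$

Triple Exponential Smoothing:
$S_t = \alpha \frac{y_t}{L_{t-L}} + (1-\alpha)(S_{t-1} + b_{t-1})$ (overall smoothing)
$b_t = \gamma (S_t - S_{t-1}) + (1-\gamma) b_{t-1}$ (trend smoothing)
$L_t = \beta \frac{y_t}{S_t} + (1-\beta) L_{t-L}$ (seasonal smoothing)
$F_{t+m} = (S_t + m b_t) L_{t-L+m}$ (forecast)
Here $L_t$ is the seasonal index and the offset $L$ in the subscripts is the number of periods in one season.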
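The single and double smoothing recursions are short enough to run by hand. The Python sketch below is an illustration rather than what tssmooth does internally. It assumes the single-smoothing recursion S_t = alpha*y_(t-1) + (1-alpha)*S_(t-1) started at S_2 = y_1, which unrolls to the summation form, and it assumes the starting values S_1 = y_1 and b_1 = y_2 - y_1 for the double (Holt) version. Function names and data are hypothetical.

```python
def single_smooth(y, alpha):
    """Return [S_2, ..., S_n] with S_2 = y_1 and
    S_t = alpha*y_{t-1} + (1-alpha)*S_{t-1} for t = 3..n."""
    s = [y[0]]
    for t in range(3, len(y) + 1):
        s.append(alpha * y[t - 2] + (1 - alpha) * s[-1])
    return s


def holt_smooth(y, alpha, gamma):
    """Return the final (level, trend) of double exponential smoothing."""
    level, trend = y[0], y[1] - y[0]  # assumed starting values
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + trend)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend
    return level, trend


def holt_forecast(level, trend, m):
    """m-step-ahead forecast: level plus m times the trend."""
    return level + m * trend


# Hypothetical series with a perfectly linear trend.
sales = [10.0, 12.0, 14.0, 16.0, 18.0]
print(single_smooth([10.0, 20.0, 30.0], 0.5))
level, trend = holt_smooth(sales, alpha=0.3, gamma=0.2)
print(round(holt_forecast(level, trend, 1), 6))
```

On a linear series the Holt recursion tracks the line for any alpha and gamma, while single smoothing always lags behind a trend. That is the reason for adding the trend term b_t.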

Example
Data: seasfac.csv

Variable name          Variable information
date                   Date
men                    Sales of men's clothing
mail                   Number of catalogs mailed
page                   Number of pages in catalogs
phone                  Number of phone lines open for ordering
print                  Amount spent on print advertising
seasonal_factors_men   Seasonal factors for sales of men's clothing
year_                  Year of the date
month_                 Month of the date

Step 1: Understand your data
    insheet using seasfac.csv
    d
    gen mdate=ym(year_, month_)
    scatter men mdate, c(l) sort xlabel(, grid) ylabel(, grid)

[Figure: men against mdate, 1989m1 to 1998m1; y-axis: 0 to 40000]

Step 2: Declare time series data & test stationarity assumption
    tsset mdate, monthly
    list men d.men l.men in 1/10
    dfuller men, regress trend

Step 3: Detect seasonality with autocorrelations
    reg d.men l.men
    predict res1, r
    corrgram res1
    wntestq res1
Both the autocorrelation and partial autocorrelation charts show the annual seasonal structure of the time series.

Q-test for white noise: if the test is significant, the residuals are autocorrelated, i.e. they are not white noise.

Step 4: Construct Holt-Winters seasonal smoothing model
Single-exponential smoothing:
    tssmooth exponential men1=men
Holt-Winters nonseasonal smoothing:
    tssmooth hwinters men2=men, from(.1 .1) iterate(100)
Holt-Winters seasonal smoothing:
    tssmooth shwinters men3=men, sn0_0(seasonal_factors_men) from(.1 .1 .1) iterate(100)

    line men1 men2 men3 men mdate

[Figure: men and the three smoothed series against mdate, 1988m1 to 2000m1; y-axis: 0 to 40000; legend: parms(0.1057) = men, hw parms(0.000 0.000) = men, shw parms(0.013 0.138 0.000) = men, men]

Step 5: Predictions
    line men3 men mdate
    line men3 men mdate if mdate>467

[Figures: men and shw parms(0.013 0.138 0.000) = men against mdate; one panel for the full range 1988m1 to 2000m1, one panel for 1999m1 to 2000m1]

ARIMA Model ARIMA(p, d, q)
    Autoregression (AR): p is the order of autoregression
    Integration (I): d is the order of integration (differencing)
    Moving-Average (MA): q is the order of moving-average

AR(p) model:
$X_t = \delta + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + A_t$

MA(q) model:
$X_t = \mu + A_t + \theta_1 A_{t-1} + \theta_2 A_{t-2} + \dots + \theta_q A_{t-q}$

ARIMA(p, d, q) model:
$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) A_t$

$A_t$ is white noise and $L$ is the lag operator, $L X_t = X_{t-1}$.

Example
Step 1: Identification of orders of ARIMA model

SHAPE (of the autocorrelation plot): INDICATED MODEL
    Exponential, decaying to zero: Autoregressive model. Use the partial autocorrelation plot to identify the order of the autoregressive model.
    Alternating positive and negative, decaying to zero: Autoregressive model. Use the partial autocorrelation plot to help identify the order.
    One or more spikes, rest are essentially zero: Moving average model, order identified by where plot becomes zero.
    Decay, starting after a few lags: Mixed autoregressive and moving average model.
    All zero or close to zero: Data is essentially random.
    High values at fixed intervals: Include seasonal autoregressive term.
    No decay to zero: Series is not stationary.
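To make the "high values at fixed intervals" pattern concrete, the Python sketch below computes sample autocorrelations for a hypothetical monthly series that contains nothing but a 12-month cycle. The function name and the series are assumptions for illustration. The estimator is the usual one, the lag-k autocovariance divided by the lag-0 autocovariance.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((v - mean) ** 2 for v in series)
    acf = []
    for k in range(max_lag + 1):
        ck = sum((series[t] - mean) * (series[t + k] - mean)
                 for t in range(n - k))
        acf.append(ck / c0)
    return acf


# Hypothetical data: 10 years of monthly values with a pure annual cycle.
monthly = [100 + 20 * math.sin(2 * math.pi * t / 12) for t in range(120)]
acf = sample_acf(monthly, 24)
for lag in (1, 6, 12, 24):
    print(lag, round(acf[lag], 3))
```

The autocorrelations peak at lags 12 and 24 and are strongly negative at lag 6, which is the pattern that points to a seasonal term with period 12.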

    ac men, lags(24)
    scatter men mdate, c(l)

[Figures: autocorrelations of men by lag, with Bartlett's formula for MA(q) 95% confidence bands; men against mdate, 1989m1 to 1998m1]

    scatter ds12.men mdate, c(l)
    ac s12.men, lags(24)
    pac s12.men, lags(24)

[Figures: DS12.men against mdate, 1989m1 to 1998m1; autocorrelations of S12.men by lag, with Bartlett's formula for MA(q) 95% confidence bands; partial autocorrelations of S12.men by lag, with 95% confidence bands [se = 1/sqrt(n)]]

We have an ARIMA(0,0,0)(0,0,1,12) model.
Step 2: Estimation & Diagnostics
    arima men, arima(0,0,0) sarima(0,0,1,12) noconstant
    arima men mail page phone print service, arima(0,0,0) sarima(0,0,1,12) noconstant
    predict res, res
    corrgram res, lags(36)

Step 3: Prediction
    predict fit, xb
    line fit men mdate

[Figure: xb prediction (one-step) and men against mdate, 1989m1 to 1998m1; y-axis: 0 to 40000]

    arima men mail page phone print service if tin(, 1990m1), arima(0,0,0) sarima(0,0,1,12) noconstant
    predict fit1, xb
    predict fit2, xb dyn(m(1990m1))
    line fit1 fit2 men mdate

[Figure: xb prediction (one-step), xb prediction dyn(m(1990m1)) and men against mdate, 1989m1 to 1999m1]
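The difference between one-step and dynamic predictions is easiest to see in a model with a lagged dependent variable. The Python sketch below deliberately swaps in a simple AR(1) with an assumed, known coefficient phi, not the seasonal MA model with regressors fitted above. A one-step prediction always uses the observed previous value, while a dynamic prediction feeds its own earlier predictions back in once the start period has passed. All names and numbers are hypothetical.

```python
def one_step(y, phi):
    """Predict y[t] for t >= 1 from the observed y[t-1]."""
    return [phi * y[t - 1] for t in range(1, len(y))]


def dynamic(y, phi, start):
    """Use observed lags up to and including period `start`;
    after that, use the previous prediction in place of y[t-1]."""
    preds = []
    for t in range(1, len(y)):
        prev = y[t - 1] if t <= start else preds[-1]
        preds.append(phi * prev)
    return preds


# Hypothetical observations and an assumed AR(1) coefficient.
observed = [10.0, 8.0, 9.0, 7.0, 8.0, 6.0]
phi = 0.5
print(one_step(observed, phi))
print(dynamic(observed, phi, start=3))
```

The two sequences agree up to the start period and then drift apart, because the dynamic path no longer sees the data. This is why dynamic forecasts are the honest check of out-of-sample performance.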

Introduction to Panel Regression with STATA

Panel Regression Model (Linear)
$y_{it} = \alpha + \beta x_{it} + u_i + v_{it}$
$u_i$ is the fixed or random effect; $v_{it}$ is the overall residual.
Panel data: also called cross-sectional time series data; multiple cases (people, nations, firms, etc.) are observed for two or more time periods.
Cross-sectional information: differences between subjects (between-subject effects).
Time series information: changes within subjects over time (within-subject effects).

Data layout (case n, time period t):

Cases (nt)   x1   x2   x3   ...   xj
11           .    .    .          .
12           .    .    .          .
...
1t           .    .    .          .
21           .    .    .          .
22           .    .    .          .
...
2t           .    .    .          .
31           .    .    .          .
32           .    .    .          .
...
3t           .    .    .          .
...
nt           .    .    .          .

Fixed, Between and Random Effect Models (Linear)
Fixed effects regression is the model to use when you want to control for omitted variables that differ between cases but are constant over time.
    STATA command xtreg with the fe option

Regression with between effects is the model to use when you want to control for omitted variables that change over time but are constant between cases.
    STATA command xtreg with the be option

Random effects model: if some omitted variables are constant over time but vary between cases, and others are fixed between cases but vary over time, you can include both types by using random effects.
    STATA command xtreg with the re option
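How fixed effects remove an omitted variable that is constant over time can be shown with a tiny made-up panel. The Python sketch below is illustrative only and is not the xtreg implementation: it subtracts each case's own mean from x and y (the within transformation) and compares the resulting slope with pooled OLS. The data are hypothetical and constructed so that y = 2*x plus a route-specific effect.

```python
def ols_slope(x, y):
    """Slope of a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def demean_by_group(values, groups):
    """Subtract each group's own mean (the within transformation)."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


# Hypothetical panel: 2 routes x 3 years, y = 2*x + route effect, no noise.
route = [1, 1, 1, 2, 2, 2]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
effect = {1: 10.0, 2: -5.0}
y = [2 * xi + effect[r] for xi, r in zip(x, route)]

pooled = ols_slope(x, y)
within = ols_slope(demean_by_group(x, route), demean_by_group(y, route))
print(round(pooled, 3), round(within, 3))
```

Pooled OLS gets even the sign wrong here because the route effect is correlated with x, while the within estimator recovers the slope of 2 exactly.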

Example
Data: airfare.dta

Variable name   Variable information
year            1997, 1998, 1999, 2000
origin          Flight's origin
destin          Flight's destination
id              Route identifier
dist            Distance, in miles
passen          Avg. passengers per day
fare            Avg. one-way fare, $
bmktshr         Fraction of market, biggest carrier

    reg lfare ldistsq lpassen bmktshr

Exercise 2: Keep the observations for id=1 and id=2 only. Run a separate regression for each flight route. Draw scatter plots to observe how the air fare changes over the years for these two routes.

Define the dataset as panel data:
    tsset id year
Summarize the panel data:
    xtsum lfare ldistsq lpassen bmktshr

Fixed effects: regression with fixed effects
    xtreg lfare ldistsq lpassen bmktshr, fe
The fixed effects model answers: what is the effect of x when x changes within routes over time?

Between effects: regression with between effects
    xtreg lfare ldistsq lpassen bmktshr, be
The between effects model answers: what is the effect of x when x differs between routes?

Random effects: regression with random effects
$(y_{it} - \theta \bar{y}_i) = (1-\theta)\alpha + \beta (x_{it} - \theta \bar{x}_i) + \{(1-\theta) u_i + (v_{it} - \theta \bar{v}_i)\}$
$\theta$ is a function of the variances of $u_i$ and $v_{it}$.
The random effects model answers:
1. what is the effect of x when x changes within routes over time?
2. what is the effect of x when x differs between routes?

    xtreg lfare ldistsq lpassen bmktshr, re theta
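The role of theta can be sketched numerically. The Python example below assumes the standard balanced-panel expression theta = 1 - sqrt(sigma_v^2 / (T*sigma_u^2 + sigma_v^2)), where T is the number of periods per case. The function names and the input values are hypothetical and are not output from the airfare data.

```python
import math


def re_theta(sigma_u, sigma_v, n_periods):
    """theta = 1 - sqrt(sigma_v^2 / (T*sigma_u^2 + sigma_v^2))."""
    return 1.0 - math.sqrt(
        sigma_v ** 2 / (n_periods * sigma_u ** 2 + sigma_v ** 2)
    )


def quasi_demean(values, group_mean, theta):
    """Subtract theta times the case mean, as in the RE transformation."""
    return [v - theta * group_mean for v in values]


# Hypothetical standard deviations of u_i and v_it, with T = 4 years.
for sigma_u in (0.0, 0.5, 5.0):
    print(sigma_u, round(re_theta(sigma_u, 1.0, 4), 4))

print(quasi_demean([3.0, 5.0, 7.0], 5.0, re_theta(1.0, 1.0, 3)))
```

With no variation in u_i, theta is 0 and the data are left untransformed (pooled OLS). As the variance of u_i grows, theta approaches 1 and the transformation approaches the full demeaning of the fixed effects estimator, so random effects sits between the two.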

                   Source of Variation         Coefficients
Model              Between     Within          ldistsq   lpassen   bmktshr
Fixed Effects      No          Yes             /         -0.316    0.0647
Between Effects    Yes         No              0.034     -0.066    0.316
Random Effects     Yes         Yes             0.029     -0.2235   0.096

Question: which is the right model, fixed or random effects?

Hausman Test:
H0: the coefficients estimated by the RE estimator are the same as those estimated by the FE estimator.
    xtreg lfare ldistsq lpassen bmktshr, fe
    estimates store fixed
    xtreg lfare ldistsq lpassen bmktshr, re
    estimates store random
    hausman fixed random

Breusch and Pagan LM test:
H0: sd(u_i) = 0, where sd(u_i) is the standard deviation of the u_i terms
    xttest0

Test for serial correlation:
    findit xtserial
    xtserial lfare ldistsq lpassen bmktshr, output

Exercise 3
1. To correct for autocorrelation, we fit an FE model with AR(1) disturbances. Run such a model in STATA using the command xtregar.
2. In this data we suspect, based on the results from Exercise 2, that there is a period effect, i.e., after 1997 the airfare increases on every flight route. Such a systematic shock introduces endogeneity, and the FE estimator would be biased. An intuitive way to solve this problem is to create dummy variables for the years t > 1997.

Thanks!
