
Time Series Analysis in Python with statsmodels

Wes McKinney (1), Josef Perktold (2), Skipper Seabold (3)

(1) Department of Statistical Science, Duke University
(2) Department of Economics, University of North Carolina at Chapel Hill
(3) Department of Economics, American University

10th Python in Science Conference, 13 July 2011

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 1 / 29
What is statsmodels?

A library for statistical modeling, implementing standard statistical
models in Python using NumPy and SciPy
Includes:
Linear (regression) models of many forms
Descriptive statistics
Statistical tests
Time series analysis
...and much more

What is Time Series Analysis?

Statistical modeling of time-ordered data observations
Inferring structure, forecasting and simulation, and testing
distributional assumptions about the data
Modeling dynamic relationships among multiple time series
Broad applications, e.g. in economics, finance, neuroscience, signal
processing...

Talk Overview

Brief update on statsmodels development


Aside: user interface and data structures
Descriptive statistics and tests
Auto-regressive moving average models (ARMA)
Vector autoregression (VAR) models
Filtering tools (Hodrick-Prescott and others)
Near future: Bayesian dynamic linear models (DLMs), ARCH /
GARCH volatility models and beyond

Statsmodels development update

We’re now on GitHub! Join us:

http://github.com/statsmodels/statsmodels

Check out the slick Sphinx docs:

http://statsmodels.sourceforge.net

Development focus has been largely computational, i.e. writing
correct, tested implementations of all the common classes of
statistical models

Statsmodels development update

Major work to be done on providing a nice integrated user interface


We must work together to close the gap between R and Python!
Some important areas:
Formula framework, for specifying model design matrices
Need integrated rich statistical data structures (pandas)
Data visualization of results should always be a few keystrokes away
Write a “Statsmodels for R users” guide

Aside: statistical data structures and user interface

While I have a captive audience...


Controversial fact: pandas is the only Python library currently
providing data structures matching (and in many places exceeding)
the richness of R’s data structures (for statistics)
Let’s have a BoF session so I can justify this statement
Feedback I hear is that end users find the fragmented, incohesive set
of Python tools for data analysis and statistics to be confusing,
frustrating, and certainly not compelling them to use Python...
(Not to mention the packaging headaches)

Aside: statistical data structures and user interface

We need to “commit” ASAP (not 12 months from now) to a high-level
data structure(s) as the “primary data structure(s) for statistical
data analysis” and communicate that clearly to end users
Or we might as well all start programming in R...

Example data: EEG trace data

[Figure: line plot of the EEG trace]

Example data: Macroeconomic data

[Figure: three stacked time series panels (cpi, m1, realgdp), 1960 to 2008]

Example data: Stock data

[Figure: stock price series for AAPL, GOOG, MSFT, and YHOO, 2001 to 2009]

Descriptive statistics
Autocorrelation, partial autocorrelation plots
Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q)
models
acf = tsa.acf(eeg, 50)
pacf = tsa.pacf(eeg, 50)

[Figure: autocorrelation (left) and partial autocorrelation (right) of the EEG series, lags 0 to 50]
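
To make the quantity behind the autocorrelation plot concrete, here is a minimal library-free sketch of the sample autocorrelation function. It assumes the common biased (1/n) normalization; the slides do not say which normalization `tsa.acf` uses, and the function name `sample_acf` and the toy series are hypothetical.

```python
def sample_acf(x, nlags):
    """Sample autocorrelations r_0..r_nlags (biased 1/n estimator, an assumption)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # n times the lag-0 autocovariance
    acf = []
    for k in range(nlags + 1):
        # n times the lag-k autocovariance; the factor n cancels in the ratio
        ck = sum(dev[t] * dev[t + k] for t in range(n - k))
        acf.append(ck / c0)
    return acf


# Hypothetical toy data: a strictly alternating series
series = [1.0, -1.0] * 10
print(sample_acf(series, 3))
```

An alternating series gives a strongly negative lag-1 value and a strongly positive lag-2 value, which is the kind of pattern the plot is meant to reveal.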

Statistical tests

Ljung-Box test for zero autocorrelation
Unit root test (Augmented Dickey-Fuller), also the basis of residual-based cointegration tests
Granger-causality
Whiteness (iid-ness) and normality
See our conference paper (when the proceedings get published!)
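
As an illustration of the first test in the list, the Ljung-Box statistic can be sketched in a few lines. This is a library-free sketch, not the statsmodels implementation; `ljung_box_q` is a hypothetical name. Under the null of no autocorrelation, Q is compared against a chi-squared distribution with `nlags` degrees of freedom (fewer when applied to residuals of a fitted ARMA model); the p-value step is omitted here.

```python
def ljung_box_q(x, nlags):
    """Ljung-Box Q = n (n + 2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, nlags + 1):
        # lag-k sample autocorrelation (biased estimator)
        rk = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        total += rk * rk / (n - k)
    return n * (n + 2) * total


# Hypothetical toy data: strong autocorrelation yields a large Q
print(ljung_box_q([1.0, -1.0] * 10, 1))
```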

Autoregressive moving average (ARMA) models
One of the most common univariate time series models:

$$y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t + b_1 \epsilon_{t-1} + \dots + b_q \epsilon_{t-q}$$

where $E(\epsilon_t \epsilon_s) = 0$ for $t \neq s$ and $\epsilon_t \sim N(0, \sigma^2)$

Exact log-likelihood can be evaluated via the Kalman filter, but the
“conditional” likelihood is easier and commonly used
statsmodels has tools for simulating ARMA processes with known
coefficients $a_i$, $b_i$ and also estimation given specified lag orders
import scikits.statsmodels.tsa.arima_process as ap
# coefficients of the AR and MA lag polynomials, including the zero-lag term
ar_coef = [1, .75, -.25]
ma_coef = [1, -.5]
nobs = 100
y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
y += 4  # add in constant
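
For readers who want to see the recursion itself, the following is a minimal sketch that simulates the ARMA equation directly with the standard library. Note the assumptions: `simulate_arma` is a hypothetical helper, its `ar` and `ma` arguments hold the coefficients a_i and b_i exactly as they appear in the equation (not lag-polynomial coefficients), and a burn-in period is discarded so the start-up values matter less.

```python
import random


def simulate_arma(mu, ar, ma, nobs, sigma=1.0, seed=0, burn=100):
    """Simulate y_t = mu + sum a_i y_{t-i} + eps_t + sum b_j eps_{t-j}."""
    rng = random.Random(seed)
    y, eps = [], []
    for t in range(nobs + burn):
        e = rng.gauss(0.0, sigma)
        val = mu + e
        for i, a in enumerate(ar):
            if t - 1 - i >= 0:          # skip lags before the start of the sample
                val += a * y[t - 1 - i]
        for j, b in enumerate(ma):
            if t - 1 - j >= 0:
                val += b * eps[t - 1 - j]
        y.append(val)
        eps.append(e)
    return y[burn:]                      # drop the burn-in observations


sample = simulate_arma(mu=4.0, ar=[0.5], ma=[0.3], nobs=100)
print(len(sample), sum(sample) / len(sample))
```

With the noise switched off (`sigma=0`) the AR(1) case settles at mu / (1 - a_1), which is a quick sanity check on the recursion.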

ARMA Estimation

Several likelihood-based estimators implemented (see docs)


model = tsa.ARMA(y)
result = model.fit(order=(2, 1), trend='c',
                   method='css-mle', disp=-1)
result.params
# array([ 3.97, -0.97, -0.05, -0.13])

Standard model diagnostics, standard errors, information criteria
(AIC, BIC, ...), etc. available in the returned ARMAResults object
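
The "conditional" likelihood idea can be sketched as a conditional sum of squares (CSS): recover the innovations recursively and sum their squares. This is only an illustration under stated assumptions, not the statsmodels estimator: the helper names are hypothetical, the recursion conditions on the first p observations, and innovations before that point are set to zero. For Gaussian errors, minimizing this sum over the mean parameters maximizes the conditional likelihood.

```python
def css_residuals(y, mu, ar, ma):
    """Innovations implied by (mu, ar, ma), conditioning on the first p values."""
    p = len(ar)
    eps = [0.0] * p                      # assumed: early innovations are zero
    for t in range(p, len(y)):
        pred = mu + sum(a * y[t - 1 - i] for i, a in enumerate(ar))
        pred += sum(b * eps[t - 1 - j]
                    for j, b in enumerate(ma) if t - 1 - j >= 0)
        eps.append(y[t] - pred)
    return eps[p:]


def css(y, mu, ar, ma):
    """Conditional sum of squared innovations."""
    return sum(e * e for e in css_residuals(y, mu, ar, ma))


# Hypothetical ARMA(1,1) data built by hand from innovations 1, -1, 2
y_demo = [2.0, 3.0, 1.9, 3.55]
print(css_residuals(y_demo, 1.0, [0.5], [0.4]))
```

In practice the objective `css` would be handed to a numerical optimizer over `mu`, `ar`, and `ma`; the demo only checks that the true parameters recover the true innovations.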

Vector Autoregression (VAR) models

Widely used model for modeling multiple (K-variate) time series,
especially in macroeconomics:

$$Y_t = A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \epsilon_t, \quad \epsilon_t \sim N(0, \Sigma)$$

Matrices $A_i$ are $K \times K$.
$Y_t$ must be a stationary process (sometimes achieved by
differencing). Related class of models (VECM) for modeling
nonstationary (including cointegrated) processes

Vector Autoregression (VAR) models

>>> model = VAR(data); model.select_order(8)

                 VAR Order Selection
=====================================================
           aic          bic          fpe         hqic
-----------------------------------------------------
0       -27.83       -27.78    8.214e-13       -27.81
1       -28.77       -28.57    3.189e-13       -28.69
2       -29.00      -28.64*    2.556e-13       -28.85
3       -29.10       -28.60    2.304e-13      -28.90*
4       -29.09       -28.43    2.330e-13       -28.82
5       -29.13       -28.33    2.228e-13       -28.81
6      -29.14*       -28.18   2.213e-13*       -28.75
7       -29.07       -27.96    2.387e-13       -28.62
=====================================================
* Minimum
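
For reference, the four columns correspond to standard lag-order selection criteria. The textbook forms are sketched below, where $\tilde\Sigma_u(p)$ is the residual covariance estimate of the VAR(p) fit and $T$ the number of observations used. This is an assumption about conventions: the exact parameter count used by statsmodels (for example, whether the intercept is included) may differ from the $pK^2$ shown here.

```latex
\begin{aligned}
\mathrm{AIC}(p)  &= \ln\lvert\tilde\Sigma_u(p)\rvert + \frac{2}{T}\, pK^2 \\
\mathrm{BIC}(p)  &= \ln\lvert\tilde\Sigma_u(p)\rvert + \frac{\ln T}{T}\, pK^2 \\
\mathrm{HQIC}(p) &= \ln\lvert\tilde\Sigma_u(p)\rvert + \frac{2 \ln\ln T}{T}\, pK^2 \\
\mathrm{FPE}(p)  &= \left(\frac{T + Kp + 1}{T - Kp - 1}\right)^{K} \lvert\tilde\Sigma_u(p)\rvert
\end{aligned}
```

The first three share the log-determinant term and differ only in how heavily they penalize extra lags, which is why BIC tends to pick a smaller order than AIC in the table.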

Vector Autoregression (VAR) models

>>> result = model.fit(2)
>>> result.summary() # print summary for each variable
<snip>
Results for equation m1
====================================================
             coefficient  std. error   t-stat   prob
----------------------------------------------------
const           0.004968    0.001850    2.685  0.008
L1.m1           0.363636    0.071307    5.100  0.000
L1.realgdp     -0.077460    0.092975   -0.833  0.406
L1.cpi         -0.052387    0.128161   -0.409  0.683
L2.m1           0.250589    0.072050    3.478  0.001
L2.realgdp     -0.085874    0.092032   -0.933  0.352
L2.cpi          0.169803    0.128376    1.323  0.188
====================================================
<snip>

Vector Autoregression (VAR) models

>>> result = model.fit(2)
>>> result.summary() # print summary for each variable
<snip>
Correlation matrix of residuals
                 m1    realgdp        cpi
m1         1.000000  -0.055690  -0.297494
realgdp   -0.055690   1.000000   0.115597
cpi       -0.297494   0.115597   1.000000

VAR: Impulse Response analysis
Analyze systematic impact of unit “shock” to a single variable

irf = result.irf(10)
irf.plot()

[Figure: impulse responses, a 3 x 3 grid of panels for each pair among m1, realgdp, and cpi (e.g. realgdp → m1), horizons 0 to 10]
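
The responses in such a plot come from the moving-average representation of the VAR. A minimal sketch of the standard recursion, Phi_0 = I and Phi_h = sum over j of Phi_{h-j} A_j, is given below in plain Python. The helper names and the coefficient matrix are hypothetical, and these are non-orthogonalized responses; the slides do not say which variant `result.irf` plots.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def impulse_responses(coefs, horizon):
    """Phi_0..Phi_horizon for a VAR(p) with lag matrices coefs = [A_1..A_p]."""
    k = len(coefs[0])
    identity = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    phis = [identity]
    for h in range(1, horizon + 1):
        total = [[0.0] * k for _ in range(k)]
        for j in range(1, min(h, len(coefs)) + 1):
            total = mat_add(total, mat_mul(phis[h - j], coefs[j - 1]))
        phis.append(total)
    return phis


# Hypothetical bivariate VAR(1) coefficient matrix
A1 = [[0.5, 0.1],
      [0.0, 0.3]]
phis = impulse_responses([A1], 2)
print(phis[2])
```

Entry `[i][j]` of `phis[h]` is the response of variable i, h periods after a unit shock to variable j.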

VAR: Forecast Error Variance Decomposition
Analyze contribution of each variable to forecasting error

fevd = result.fevd(20)
fevd.plot()

[Figure: forecast error variance decomposition (FEVD), one panel each for m1, realgdp, and cpi, showing the shares attributed to m1, realgdp, and cpi over horizons 0 to 20]

VAR: Statistical tests

In [137]: result.test_causality('m1', ['cpi', 'realgdp'])

Granger causality f-test
=========================================================
 Test statistic  Critical Value     p-value            df
---------------------------------------------------------
       1.248787        2.387325       0.289      (4, 579)
=========================================================
H_0: ['cpi', 'realgdp'] do not Granger-cause m1
Conclusion: fail to reject H_0 at 5.00% significance level

Filtering

Hodrick-Prescott (HP) filter separates a time series $y_t$ into a trend $\tau_t$
and a cyclical component $\zeta_t$, so that $y_t = \tau_t + \zeta_t$.

[Figure: inflation with its HP cyclical and trend components, 1962 to 2006]
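
A minimal sketch of how the decomposition can be computed: the HP trend minimizes the squared deviation from the data plus lambda times the squared second differences of the trend, which leads to the linear system (I + lambda K'K) tau = y with K the second-difference operator. The code below solves it with plain Gaussian elimination, so it is only suitable for short series. The function name, the (cycle, trend) return order, and the default lambda = 1600 (a common choice for quarterly data) are assumptions of this sketch.

```python
def hp_filter(y, lamb=1600.0):
    """Return (cycle, trend) by solving (I + lamb * K'K) trend = y."""
    n = len(y)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    weights = (1.0, -2.0, 1.0)           # one row of the second-difference operator K
    for r in range(n - 2):
        idx = (r, r + 1, r + 2)
        for i, wi in zip(idx, weights):
            for j, wj in zip(idx, weights):
                a[i][j] += lamb * wi * wj
    # Gaussian elimination with partial pivoting on the augmented matrix
    aug = [row + [float(v)] for row, v in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    trend = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][j] * trend[j] for j in range(i + 1, n))
        trend[i] = s / aug[i][i]
    cycle = [v - t for v, t in zip(y, trend)]
    return cycle, trend


# Hypothetical toy series: a straight line plus a small wiggle
data = [1.0, 3.2, 4.9, 7.1, 9.0, 10.8, 13.1, 15.0]
cycle, trend = hp_filter(data)
print([round(t, 3) for t in trend])
```

A purely linear series has zero second differences, so the filter returns it unchanged as the trend with a zero cycle; that property makes a convenient check.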

Filtering

In addition to the HP filter, two other filters popular in finance and
economics, Baxter-King and Christiano-Fitzgerald, are available
We refer you to our paper and the documentation for details on these:

[Figure: inflation (INFL) and unemployment (UNEMP), Baxter-King filtered (left) and Christiano-Fitzgerald filtered (right)]

Preview: Bayesian dynamic linear models (DLM)

A state space model by another name:

$$y_t = F_t' \theta_t + \nu_t, \quad \nu_t \sim N(0, V_t)$$
$$\theta_t = G \theta_{t-1} + \omega_t, \quad \omega_t \sim N(0, W_t)$$

Estimation of basic model by Kalman filter recursions. Provides
elegant way to do time-varying linear regressions for forecasting
Extensions: multivariate DLMs, stochastic volatility (SV) models,
MCMC-based posterior sampling, mixtures of DLMs
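
To show what the Kalman filter recursions look like in the simplest case, here is a sketch for a scalar local-level model, i.e. the equations above with F = 1 and G = 1. It assumes known, constant variances V and W, which is a simplification: the DLM previewed in this talk uses a discount factor and priors on an unknown variance instead. The function name and the toy data are hypothetical.

```python
def local_level_filter(y, m0, C0, V, W):
    """Kalman filter for y_t = theta_t + nu_t, theta_t = theta_{t-1} + omega_t."""
    m, C = m0, C0
    means, variances = [], []
    for obs in y:
        a, R = m, C + W              # prior for theta_t given data up to t-1
        f, Q = a, R + V              # one-step-ahead forecast of y_t
        gain = R / Q                 # Kalman gain
        m = a + gain * (obs - f)     # posterior mean of theta_t
        C = R - gain * gain * Q      # posterior variance of theta_t
        means.append(m)
        variances.append(C)
    return means, variances


# Hypothetical data: a constant level observed without noise
means, variances = local_level_filter([5.0] * 50, m0=0.0, C0=1e6, V=1.0, W=0.1)
print(means[-1], variances[-1])
```

A diffuse prior (large `C0`) lets the first observation dominate, after which the posterior variance settles to a steady value determined by V and W.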

Preview: DLM Example (Constant+Trend model)

model = Polynomial(2)
dlm = DLM(close_px['AAPL'], model.F, G=model.G,  # model
          m0=m0, C0=C0, n0=n0, s0=s0,            # priors
          state_discount=.95)                    # discount factor

[Figure: Constant + Trend DLM fitted to AAPL closing prices, Nov 2008 to Nov 2009]

Preview: Stochastic volatility models

[Figure: JPY-USD exchange rate volatility process, observations 0 to 1000]

Future: sandbox and beyond

ARCH / GARCH models for volatility


Structural VAR and error correction models (ECM) for cointegrated
processes
Models with non-normally distributed errors
Better data description, visualization, and interactive research tools
More sophisticated Bayesian time series models

Conclusions

We’ve implemented many foundational models for time series
analysis, but the field is very broad
User interface can and should be much improved
Repo: http://github.com/statsmodels/statsmodels
Docs: http://statsmodels.sourceforge.net
Contact: pystatsmodels@googlegroups.com
