Time Series Analysis in Python with Statsmodels

Time Series Analysis in Python with statsmodels
Wes McKinney1 Josef Perktold2 Skipper Seabold3
1 Departmentof Statistical Science

Duke University
2 Department of Economics
University of North Carolina at Chapel Hill

3 Departmentof Economics
American University
10th Python in Science Conference, 13 July 2011
McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 1 / 29
What is statsmodels?
A library for statistical modeling, implementing standard statistical

models in Python using NumPy and SciPy
Includes:
Linear (regression) models of many forms
Descriptive statistics
Statistical tests
Time series analysis
...and much more
What is Time Series Analysis?
Statistical modeling of time-ordered data observations

Inferring structure, forecasting and simulation, and testing
distributional assumptions about the data
Modeling dynamic relationships among multiple time series
Broad applications e.g. in economics, finance, neuroscience, signal
processing...
Talk Overview
Brief update on statsmodels development

Aside: user interface and data structures
Descriptive statistics and tests
Auto-regressive moving average models (ARMA)
Vector autoregression (VAR) models
Filtering tools (Hodrick-Prescott and others)
Near future: Bayesian dynamic linear models (DLMs), ARCH /
GARCH volatility models and beyond
Statsmodels development update
We’re now on GitHub! Join us:
http://github.com/statsmodels/statsmodels
Check out the slick Sphinx docs:
http://statsmodels.sourceforge.net
Development focus has been largely computational, i.e. writing

correct, tested implementations of all the common classes of
statistical models
Statsmodels development update
Major work to be done on providing a nice integrated user interface

We must work together to close the gap between R and Python!
Some important areas:
Formula framework, for specifying model design matrices
Need integrated rich statistical data structures (pandas)
Data visualization of results should always be a few keystrokes away
Write a “Statsmodels for R users” guide
Aside: statistical data structures and user interface
While I have a captive audience...

Controversial fact: pandas is the only Python library currently
providing data structures matching (and in many places exceeding)
the richness of R’s data structures (for statistics)
Let’s have a BoF session so I can justify this statement
Feedback I hear is that end users find the fragmented, incohesive set
of Python tools for data analysis and statistics to be confusing,
frustrating, and certainly not compelling them to use Python...
(Not to mention the packaging headaches)
Aside: statistical data structures and user interface
We need to “commit” ASAP (not 12 months from now) to a high

level data structure(s) as the “primary data structure(s) for statistical
data analysis” and communicate that clearly to end users
Or we might as well all start programming in R...
Example data: EEG trace data
300
200
100
100
200
300
400
500
600
0 500 0 0 0 0 0 0 0
100 150 200 250 300 350 400
Example data: Macroeconomic data
5.5
5.0 cpi
4.5
4.0
3.5
3.0
7.5
7.0 m1
6.5
6.0
5.5
5.0
4.5
9.5
9.0
realgdp
8.5
8.0
0 4 8 2 6 0 4 8 2 6 0 4 8
196 196 196 197 197 198 198 198 199 199 200 200 200
Example data: Stock data
800
AAPL
700 GOOG
MSFT
600 YHOO
500
400
300
200
100
0
1 2 3 4 5 6 7 8 9
200 200 200 200 200 200 200 200 200
Descriptive statistics
Autocorrelation, partial autocorrelation plots
Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q)
models
acf = tsa . acf ( eeg , 50)
pacf = tsa . pacf ( eeg , 50)
1.0 Autocorrelation 1.0 Partial Autocorrelation
0.5 0.5
0.0 0.0
0.5 0.5
1.00 10 20 30 40 50 1.00 10 20 30 40 50
Statistical tests
Ljung-Box test for zero autocorrelation

Unit root test for cointegration (Augmented Dickey-Fuller test)
Granger-causality
Whiteness (iid-ness) and normality
See our conference paper (when the proceedings get published!)
Autoregressive moving average (ARMA) models
One of most common univariate time series models:
yt = µ + a1 yt−1 + ... + ak yt−p + t + b1 t−1 + ... + bq t−q

where E (t , s ) = 0, for t 6= s and t ∼ N (0, σ 2 )
Exact log-likelihood can be evaluated via the Kalman filter, but the
“conditional” likelihood is easier and commonly used
statsmodels has tools for simulating ARMA processes with known
coefficients ai , bi and also estimation given specified lag orders
import scikits.statsmodels.tsa.arima_process as ap
ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]
nobs = 100
y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
y += 4 # add in constant
ARMA Estimation
Several likelihood-based estimators implemented (see docs)

model = tsa.ARMA(y)
result = model.fit(order=(2, 1), trend=’c’,
method=’css-mle’, disp=-1)
result.params
# array([ 3.97, -0.97, -0.05, -0.13])
Standard model diagnostics, standard errors, information criteria

(AIC, BIC, ...), etc available in the returned ARMAResults object
Vector Autoregression (VAR) models
Widely used model for modeling multiple (K -variate) time series,

especially in macroeconomics:
Yt = A1 Yt−1 + . . . + Ap Yt−p + t , t ∼ N (0, Σ)
Matrices Ai are K × K .
Yt must be a stationary process (sometimes achieved by
differencing). Related class of models (VECM) for modeling
nonstationary (including cointegrated) processes
>>> model = VAR(data); model.select_order(8)

VAR Order Selection
=====================================================
aic bic fpe hqic
-----------------------------------------------------
0 -27.83 -27.78 8.214e-13 -27.81
1 -28.77 -28.57 3.189e-13 -28.69
2 -29.00 -28.64* 2.556e-13 -28.85
3 -29.10 -28.60 2.304e-13 -28.90*
4 -29.09 -28.43 2.330e-13 -28.82
5 -29.13 -28.33 2.228e-13 -28.81
6 -29.14* -28.18 2.213e-13* -28.75
7 -29.07 -27.96 2.387e-13 -28.62
=====================================================
* Minimum
>>> result = model.fit(2)

>>> result.summary() # print summary for each variable
<snip>
Results for equation m1
====================================================
coefficient std. error t-stat prob
----------------------------------------------------
const 0.004968 0.001850 2.685 0.008
L1.m1 0.363636 0.071307 5.100 0.000
L1.realgdp -0.077460 0.092975 -0.833 0.406
L1.cpi -0.052387 0.128161 -0.409 0.683
L2.m1 0.250589 0.072050 3.478 0.001
L2.realgdp -0.085874 0.092032 -0.933 0.352
L2.cpi 0.169803 0.128376 1.323 0.188
====================================================
<snip>
>>> result = model.fit(2)

>>> result.summary() # print summary for each variable
<snip>
Correlation matrix of residuals
m1 realgdp cpi
m1 1.000000 -0.055690 -0.297494
realgdp -0.055690 1.000000 0.115597
cpi -0.297494 0.115597 1.000000
VAR: Impulse Response analysis
Analyze systematic impact of unit “shock” to a single variable
irf = result.irf(10)
irf.plot()
Impulse responses
m1 → m1 realgdp → m1 cpi → m1
1.0 0.2 0.4
0.8 0.1 0.3
0.2
0.6 0.0 0.1
0.4 0.1 0.0
0.2 0.2 0.1
0.2
0.0 0.3 0.3
0.20 0.4 10 0.40
2 m14→ realgdp
6 8 10 0 4 → realgdp
2 realgdp 6 8 2 cpi4→ realgdp
6 8 10
0.20 1.0 0.2
0.15 0.8 0.1
0.10 0.6 0.0
0.05
0.4 0.1
0.00
0.05 0.2 0.2
0.10 0.0 0.3
0.150 2 4 → cpi6 8 10 0.20 2 4 →6cpi 8 10 0.40 2 4cpi → cpi6 8 10
m1 realgdp
0.20 0.15 1.0
0.15 0.10 0.8
0.10 0.05 0.6
0.05 0.00
0.00 0.05 0.4
0.05 0.10 0.2
0.100 2 4 6 8 10 0.150 2 4 6 8 10 0.00 2 4 6 8 10
VAR: Forecast Error Variance Decomposition
Analyze contribution of each variable to forecasting error
fevd = result.fevd(20)
fevd.plot()
Forecast error variance decomposition (FEVD) m1

1.0 m1 realgdp
0.8 cpi
0.6
0.4
0.2
0.00 5 10 15 20
1.2 realgdp
1.0
0.8
0.6
0.4
0.2
0.00 5 10 15 20
1.2 cpi
1.0
0.8
0.6
0.4
0.2
0.00 5 10 15 20
VAR: Statistical tests
In [137]: result.test_causality(’m1’, [’cpi’, ’realgdp’])

Granger causality f-test
=========================================================
Test statistic Critical Value p-value df
---------------------------------------------------------
1.248787 2.387325 0.289 (4, 579)
=========================================================
H_0: [’cpi’, ’realgdp’] do not Granger-cause m1
Conclusion: fail to reject H_0 at 5.00% significance level
Filtering
Hodrick-Prescott (HP) filter separates a time series yt into a trend τt

and a cyclical component ζt , so that yt = τt + ζt .
14
Inflation
12 Cyclical component
10 Trend component
8
6
4
2
0
2
4
2 6 0 4 8 2 6 0 4 8 2 6
196 196 197 197 197 198 198 199 199 199 200 200
Filtering
In addition to the HP filter, 2 other filters popular in finance and

economics, Baxter-King and Christiano-Fitzgerald, are available
We refer you to our paper and the documentation for details on these:
Inflation and Unemployment: BK Filtered Inflation and Unemployment: CF Filtered

INFL INFL
4 4 UNEMP
UNEMP
2 2
0 0
2 2
4 4
63
73
83
93
68
78
88
98
03
71
81
91
08
66
76
86
96
01
06
19
19
19
19
19
19
19
19
19
19
19
20
19
19
19
19
20
20
20
Preview: Bayesian dynamic linear models (DLM)
A state space model by another name:
yt = Ft0 θt + νt , νt ∼ N (0, Vt )
θt = G θt−1 + ωt , ωt ∼ N (0, Wt )
Estimation of basic model by Kalman filter recursions. Provides

elegant way to do time-varying linear regressions for forecasting
Extensions: multivariate DLMs, stochastic volatility (SV) models,
MCMC-based posterior sampling, mixtures of DLMs
Preview: DLM Example (Constant+Trend model)
model = Polynomial(2)
dlm = DLM(close_px[’AAPL’], model.F, G=model.G, # model
m0=m0, C0=C0, n0=n0, s0=s0, # priors
state_discount=.95) # discount factor
Constant + Trend DLM
200
150
100
50
8 9 9 009 009 9 9
200 200 200 2 Jul 2 200 200
Nov Jan Mar May Sep Nov
Preview: Stochastic volatility models
1.6 JPY-USD Exchange Rate Volatility Process
1.4
1.2
1.0
0.8
0.6
0.4
0.20 200 400 600 800 1000
Future: sandbox and beyond
ARCH / GARCH models for volatility

Structural VAR and error correction models (ECM) for cointegrated
processes
Models with non-normally distributed errors
Better data description, visualization, and interactive research tools
More sophisticated Bayesian time series models
Conclusions
We’ve implemented many foundational models for time series

analysis, but the field is very broad
User interface can and should be much improved
Repo: http://github.com/statsmodels/statsmodels
Docs: http://statsmodels.sourceforge.net
Contact: pystatsmodels@googlegroups.com

Time Series Analysis in Python with Statsmodels

Caricato da

Informazioni sul documento

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Time Series Analysis in Python with Statsmodels

Caricato da

Copyright:

Formati disponibili

Time Series Analysis in Python with statsmodels

Wes McKinney1 Josef Perktold2 Skipper Seabold3

1 Departmentof Statistical Science

University of North Carolina at Chapel Hill

10th Python in Science Conference, 13 July 2011

A library for statistical modeling, implementing standard statistical

Statistical modeling of time-ordered data observations

Brief update on statsmodels development

We’re now on GitHub! Join us:

Check out the slick Sphinx docs:

Development focus has been largely computational, i.e. writing

Major work to be done on providing a nice integrated user interface

While I have a captive audience...

We need to “commit” ASAP (not 12 months from now) to a high

1.0 Autocorrelation 1.0 Partial Autocorrelation

Ljung-Box test for zero autocorrelation

yt = µ + a1 yt−1 + ... + ak yt−p + t + b1 t−1 + ... + bq t−q

Several likelihood-based estimators implemented (see docs)

Standard model diagnostics, standard errors, information criteria

Widely used model for modeling multiple (K -variate) time series,

Yt = A1 Yt−1 + . . . + Ap Yt−p + t , t ∼ N (0, Σ)

>>> model = VAR(data); model.select_order(8)

>>> result = model.fit(2)

>>> result = model.fit(2)

Forecast error variance decomposition (FEVD) m1

In [137]: result.test_causality(’m1’, [’cpi’, ’realgdp’])

Hodrick-Prescott (HP) filter separates a time series yt into a trend τt

In addition to the HP filter, 2 other filters popular in finance and

Inflation and Unemployment: BK Filtered Inflation and Unemployment: CF Filtered

A state space model by another name:

Estimation of basic model by Kalman filter recursions. Provides

1.6 JPY-USD Exchange Rate Volatility Process

0.20 200 400 600 800 1000

ARCH / GARCH models for volatility

We’ve implemented many foundational models for time series

Potrebbero piacerti anche

yt = µ + a1 yt−1 + ... + ak yt−p + t + b1 t−1 + ... + bq t−q

Yt = A1 Yt−1 + . . . + Ap Yt−p + t , t ∼ N (0, Σ)